U.S. patent application number 17/173308 was filed with the patent office on 2021-02-11 for computerized pipelines for transforming input data into data structures compatible with models, and was published on 2021-08-26.
This patent application is currently assigned to SAS Institute Inc. The applicant listed for this patent is SAS Institute Inc. The invention is credited to James Allen Cox and Nancy Anne Rausch.
United States Patent Application 20210263949 (Kind Code: A1)
Title: COMPUTERIZED PIPELINES FOR TRANSFORMING INPUT DATA INTO DATA STRUCTURES COMPATIBLE WITH MODELS
Inventors: Cox; James Allen; et al.
Application Number: 17/173308
Family ID: 1000005481368
Filed: February 11, 2021
Published: August 26, 2021
Abstract
Computerized pipelines can transform input data into data
structures compatible with models in some examples. In one such
example, a system can obtain a first table that includes first data
referencing a set of subjects. The system can then execute a
sequence of processing operations on the first data in a particular
order defined by a data-processing pipeline to modify an analysis
table to include features associated with the set of subjects.
Executing each respective processing operation in the sequence to
generate the modified analysis table may involve: deriving a
respective set of features from the first data by executing a
respective feature-extraction operation on the first data; and
adding the respective set of features to the analysis table. The
system may then execute a predictive model on the modified analysis
table for generating a predicted value based on the modified
analysis table.
Inventors: Cox; James Allen (Cary, NC); Rausch; Nancy Anne (Cary, NC)
Applicant: SAS Institute Inc., Cary, NC, US
Assignee: SAS Institute Inc., Cary, NC
Family ID: 1000005481368
Appl. No.: 17/173308
Filed: February 11, 2021
Related U.S. Patent Documents
Application No. 62979686, filed Feb 21, 2020
Application No. 62984385, filed Mar 3, 2020
Current U.S. Class: 1/1
Current CPC Class: G06N 3/084 (20130101); G06N 3/0445 (20130101); G06F 16/258 (20190101)
International Class: G06F 16/25 (20060101); G06N 3/08 (20060101); G06N 3/04 (20060101)
Claims
1. A system comprising: one or more processing devices; and one or
more memory devices including instructions that are executable by
the one or more processing devices for causing the one or more
processing devices to: obtain a first table that includes first
data referencing a set of subjects, wherein each subject in the set
of subjects is correlated in the first data to one or more variable
values describing a transaction associated with the subject, and
wherein the first data includes at least one one-to-many
relationship in which a subject in the set of subjects is
referenced in multiple observations; obtain second data referencing
the set of subjects, wherein each subject in the set of subjects is
correlated in the second data to one or more attributes describing
the subject; generate an analysis table based on the second data,
the analysis table being separate from the first table; execute a
sequence of processing operations on the first data in a particular
order defined by a data-processing pipeline to modify the analysis
table to include features associated with the set of subjects,
wherein executing each respective processing operation in the
sequence to generate the modified analysis table involves: deriving
a respective set of features from the first data by executing a
respective feature-extraction operation on the first data; and
adding the respective set of features to the analysis table, such
that each subject in the set of subjects is correlated in the
analysis table to corresponding values for the respective set of
features; and execute a predictive model on the modified analysis
table for generating a predicted value based on the modified
analysis table.
2. The system of claim 1, wherein executing each respective
processing operation in the sequence further involves performing a
model-accuracy test comprising: determining a current value for an
accuracy metric that indicates an accuracy of the predictive model,
the current value being determined by providing the modified
analysis table as input to the predictive model; comparing the
current value for the accuracy metric to a prior value for the
accuracy metric that was generated in relation to a prior
processing operation in the sequence, to determine whether the
current value is improved as compared to the prior value; and
generating an output indicating whether the current value is
improved as compared to the prior value.
3. The system of claim 2, wherein the one or more memory devices
further include instructions that are executable by the one or more
processing devices for causing the one or more processing devices
to: generate a graphical user interface (GUI) indicating whether
each processing operation in the sequence increased the accuracy of
the predictive model.
4. The system of claim 3, wherein the GUI indicates that a
particular processing operation in the sequence of processing
operations did not improve the accuracy of the predictive model,
and wherein the one or more memory devices further include
instructions that are executable by the one or more processing
devices for causing the one or more processing devices to: receive
a user input for removing the particular processing operation from
the data-processing pipeline; in response to the user input, update
the data-processing pipeline to remove the particular processing
operation; and execute the updated data-processing pipeline on the
first data to generate an updated version of the modified analysis
table for use with the predictive model.
5. The system of claim 1, wherein the one or more memory devices
further include instructions that are executable by the one or more
processing devices for causing the one or more processing devices
to automatically generate the data-processing pipeline by:
automatically selecting the processing operations from among a
group of processing operations based on a plurality of
characteristics of the first data; and automatically arranging the
processing operations in the particular order based on the
plurality of characteristics.
6. The system of claim 1, wherein the one or more memory devices
further include pipeline-creation software that is executable by
the one or more processing devices for causing the one or more
processing devices to generate a graphical user interface (GUI)
that includes an extensible toolbox of feature-extraction
operations that are selectable and arrangeable by a user to create
data-processing pipelines, the data-processing pipelines being
configured to apply feature-extraction operations on input data to
generate analysis tables.
7. The system of claim 1, wherein executing at least one processing
operation in the sequence involves: concatenating the first data
together into a text string; and performing the respective
feature-extraction operation on the text string.
8. The system of claim 1, wherein the predictive model is a trained
machine-learning model.
9. The system of claim 1, wherein the one or more memory devices
include pipeline-creation software that is executable by the one or
more processing devices for causing the one or more processing
devices to automatically generate program code based on the
data-processing pipeline, the program code being configured to be
executed independently of the pipeline-creation software for
performing the sequence of processing operations faster than
executing the data-processing pipeline in pipeline-creation
software.
10. The system of claim 9, wherein the one or more memory devices
further include instructions that are executable by the one or more
processing devices for causing the one or more processing devices
to automatically generate the program code based on the
data-processing pipeline by, for each processing operation in the
sequence: selecting a code template, from among a plurality of code
templates, that is associated with the processing operation;
modifying the selected code template based on a set of parameters;
and incorporating the modified code template into the program
code.
11. The system of claim 9, wherein the one or more memory devices
further include instructions that are executable by the one or more
processing devices for causing the one or more processing devices
to execute a plurality of iterations of the data-processing
pipeline on a plurality of data tables, wherein each iteration of
the plurality of iterations involves executing the sequence of
processing operations in the data-processing pipeline on a respective
set of data from a respective data table among the plurality of
data tables to expand the analysis table.
12. A method comprising: obtaining, by one or more processing
devices, a first table that includes first data referencing a set
of subjects, wherein each subject in the set of subjects is
correlated in the first data to one or more variable values
describing a transaction associated with the subject, and wherein
the first data includes at least one one-to-many relationship in
which a subject in the set of subjects is referenced in multiple
observations; obtaining, by the one or more processing devices,
second data referencing the set of subjects, wherein each subject
in the set of subjects is correlated in the second data to one or
more attributes describing the subject; generating, by the one or
more processing devices, an analysis table based on the second
data, the analysis table being separate from the first table;
executing, by the one or more processing devices, a sequence of
processing operations on the first data in a particular order
defined by a data-processing pipeline to modify the analysis table
to include features associated with the set of subjects, wherein
executing each respective processing operation in the sequence to
generate the modified analysis table involves: deriving a
respective set of features from the first data by executing a
respective feature-extraction operation on the first data; and
adding the respective set of features to the analysis table, such
that each subject in the set of subjects is correlated in the
analysis table to corresponding values for the respective set of
features; and executing, by the one or more processing devices, a
predictive model on the modified analysis table for generating a
predicted value based on the modified analysis table.
13. The method of claim 12, wherein executing each respective
processing operation in the sequence further involves performing a
model-accuracy test comprising: determining a current value for an
accuracy metric that indicates an accuracy of the predictive model,
the current value being determined by providing the modified
analysis table as input to the predictive model; comparing the
current value for the accuracy metric to a prior value for the
accuracy metric that was generated in relation to a prior
processing operation in the sequence, to determine whether the
current value is improved as compared to the prior value; and
generating an output indicating whether the current value is
improved as compared to the prior value.
14. The method of claim 13, further comprising: generating a
graphical user interface (GUI) indicating whether each processing
operation in the sequence increased the accuracy of the predictive
model.
15. The method of claim 14, wherein the GUI indicates that a
particular processing operation in the sequence of processing
operations did not improve the accuracy of the predictive model,
and further comprising: receiving a user input for removing the
particular processing operation from the data-processing pipeline;
in response to the user input, updating the data-processing
pipeline to remove the particular processing operation; and
executing the updated data-processing pipeline on the first data to
generate an updated version of the modified analysis table for use
with the predictive model.
16. The method of claim 12, further comprising automatically
generating the data-processing pipeline by: automatically selecting
the processing operations from among a group of processing
operations based on a plurality of characteristics of the first
data; and automatically arranging the processing operations in the
particular order based on the plurality of characteristics.
17. The method of claim 12, further comprising executing
pipeline-creation software to generate a graphical user interface
(GUI) that includes an extensible toolbox of feature-extraction
operations that are selectable and arrangeable by a user to create
data-processing pipelines, the data-processing pipelines being
configured to apply feature-extraction operations on input data to
generate analysis tables.
18. The method of claim 12, wherein executing at least one
processing operation in the sequence involves: concatenating the
first data together into a text string; and performing the
respective feature-extraction operation on the text string.
19. The method of claim 12, wherein the predictive model is a
trained machine-learning model.
20. The method of claim 12, further comprising executing
pipeline-creation software to automatically generate program code
based on the data-processing pipeline, the program code being
configured to be executed independently of the pipeline-creation
software for performing the sequence of processing operations
faster than executing the data-processing pipeline in
pipeline-creation software.
21. The method of claim 20, further comprising automatically
generating the program code based on the data-processing pipeline
by, for each processing operation in the sequence: selecting a code
template, from among a plurality of code templates, that is
associated with the processing operation; modifying the selected
code template based on a set of parameters; and incorporating the
modified code template into the program code.
22. The method of claim 12, further comprising executing a
plurality of iterations of the data-processing pipeline on a
plurality of data tables, wherein each iteration of the plurality
of iterations involves executing the sequence of processing
operations in the data-processing pipeline on a respective set of
data from a respective data table among the plurality of data
tables to expand the analysis table.
23. A non-transitory computer-readable medium comprising program
code that is executable by one or more processing devices for
causing the one or more processing devices to: obtain a first table
that includes first data referencing a set of subjects, wherein
each subject in the set of subjects is correlated in the first data
to one or more variable values describing a transaction associated
with the subject, and wherein the first data includes at least one
one-to-many relationship in which a subject in the set of subjects
is referenced in multiple observations; obtain second data
referencing the set of subjects, wherein each subject in the set of
subjects is correlated in the second data to one or more attributes
describing the subject; generate an analysis table based on the
second data, the analysis table being separate from the first
table; execute a sequence of processing operations on the first
data in a particular order defined by a data-processing pipeline to
modify the analysis table to include features associated with the
set of subjects, wherein executing each respective processing
operation in the sequence to generate the modified analysis table
involves: deriving a respective set of features from the first data
by executing a respective feature-extraction operation on the first
data; and adding the respective set of features to the analysis
table, such that each subject in the set of subjects is correlated
in the analysis table to corresponding values for the respective
set of features; and execute a predictive model on the modified
analysis table for generating a predicted value based on the
modified analysis table.
24. The non-transitory computer-readable medium of claim 23,
wherein executing each respective processing operation in the
sequence further involves performing a model-accuracy test
comprising: determining a current value for an accuracy metric that
indicates an accuracy of the predictive model, the current value
being determined by providing the modified analysis table as input
to the predictive model; comparing the current value for the
accuracy metric to a prior value for the accuracy metric that was
generated in relation to a prior processing operation in the
sequence, to determine whether the current value is improved as
compared to the prior value; and generating an output indicating
whether the current value is improved as compared to the prior
value.
25. The non-transitory computer-readable medium of claim 24,
further comprising program code that is executable by the one or
more processing devices for causing the one or more processing
devices to: generate a graphical user interface indicating that a
particular processing operation in the sequence of processing
operations did not improve the accuracy of the predictive model;
receive a user input for removing the particular processing
operation from the data-processing pipeline; in response to the
user input, update the data-processing pipeline to remove the
particular processing operation; and execute the updated
data-processing pipeline on the first data to generate an updated
version of the modified analysis table for use with the predictive
model.
26. The non-transitory computer-readable medium of claim 23,
further comprising program code that is executable by the one or
more processing devices for causing the one or more processing
devices to automatically generate the data-processing pipeline by:
automatically selecting the processing operations from among a
group of processing operations based on a plurality of
characteristics of the first data; and automatically arranging the
processing operations in the particular order based on the
plurality of characteristics.
27. The non-transitory computer-readable medium of claim 23,
further comprising pipeline-creation software that is executable by
the one or more processing devices for causing the one or more
processing devices to generate a graphical user interface (GUI)
that includes an extensible toolbox of feature-extraction
operations that are selectable and arrangeable by a user to create
data-processing pipelines, the data-processing pipelines being
configured to apply feature-extraction operations on input data to
generate analysis tables.
28. The non-transitory computer-readable medium of claim 23,
wherein executing at least one processing operation in the sequence
involves: concatenating the first data together into a text string;
and performing the respective feature-extraction operation on the
text string.
29. The non-transitory computer-readable medium of claim 23,
further comprising pipeline-creation software that is executable by
the one or more processing devices for causing the one or more
processing devices to automatically generate program code based on
the data-processing pipeline, the program code being configured to
be executed independently of the pipeline-creation software for
performing the sequence of processing operations faster than
executing the data-processing pipeline in pipeline-creation
software.
30. The non-transitory computer-readable medium of claim 29,
further comprising program code that is executable by the one or
more processing devices for causing the one or more processing
devices to automatically generate the program code based on the
data-processing pipeline by, for each processing operation in the
sequence: selecting a code template, from among a plurality of code
templates, that is associated with the processing operation;
modifying the selected code template based on a set of parameters;
and incorporating the modified code template into the program code.
Description
REFERENCE TO RELATED APPLICATION
[0001] This claims the benefit of priority under 35 U.S.C. §
119(e) to U.S. Provisional Patent Application No. 62/979,686, filed
Feb. 21, 2020, and to U.S. Provisional Patent Application No.
62/984,385, filed Mar. 3, 2020, the entirety of each of which is
hereby incorporated by reference herein.
TECHNICAL FIELD
[0002] The present disclosure relates generally to data-processing
pipelines. More specifically, but not by way of limitation, this
disclosure relates to computerized pipelines for transforming input
data into data structures such as analysis tables compatible with
predictive models and other models.
BACKGROUND
[0003] In database theory, it is generally desirable to normalize
data and link data tables together through a series of keys. In
particular, data tables are often linked by primary and secondary
keys, so that data redundancy and dependency are minimized. While
spreading data across a multitude of key-linked data tables yields
several advantages in the context of building relational databases,
that arrangement of data can be unsuitable in other contexts, such
as in the context of predictive modelling.
[0004] Predictive modelling generally involves using
machine-learning models (e.g., neural networks, classifiers, etc.)
or other types of models to predict a future value. Predictive
models often require their input data to be de-normalized and
formatted in certain ways, for example with all relevant
information stored in a single data table in which each column
corresponds to a unique variable and each row corresponds to an
individual observation. Data tables that are properly formatted for
use with predictive models can be referred to as Analytical Base
Tables (ABTs) or analysis tables. Generating an analysis table can
involve selecting and rearranging data from multiple key-linked
data tables to make the data suitable for use with a predictive
model. This is often a manual process that can be complex,
subjective, tedious, and error-prone. For example, a data scientist
may manually comb through thousands or millions of rows of data in
key-linked data tables to determine which information to extract.
Next, the data scientist determines how to format that information
in a way that is compatible with the particular predictive model.
The data scientist then creates the analysis table with the
formatted information for use with the predictive model. Not only
is this process subjective and complex, but it is also exceedingly
slow. It is common for such data-preparation processes to take up a
significant proportion of a data scientist's time when performing
predictive analysis. For example, some estimates indicate that only
20% of a data scientist's time is spent performing the desired
analysis while 80% of their time is spent on finding, organizing, and preparing the data.
SUMMARY
[0005] One example of the present disclosure can include a system
comprising one or more processing devices and one or more memory
devices. The one or more memory devices can include instructions
that are executable by the one or more processing devices for
causing the one or more processing devices to perform operations.
The operations can include obtaining a first table that includes
first data referencing a set of subjects, wherein each subject in
the set of subjects is correlated in the first data to one or more
variable values describing a transaction associated with the
subject, and wherein the first data includes at least one
one-to-many relationship in which a subject in the set of subjects
is referenced in multiple observations. The operations can include
obtaining second data referencing the set of subjects, wherein each
subject in the set of subjects is correlated in the second data to
one or more attributes describing the subject. The operations can
include generating an analysis table based on the second data, the
analysis table being separate from the first table. The operations
can include executing a sequence of processing operations on the
first data in a particular order defined by a data-processing
pipeline to modify the analysis table to include features
associated with the set of subjects. Executing each respective
processing operation in the sequence to generate the modified
analysis table can involve: deriving a respective set of features
from the first data by executing a respective feature-extraction
operation on the first data; and adding the respective set of
features to the analysis table, such that each subject in the set
of subjects is correlated in the analysis table to corresponding
values for the respective set of features. The operations can
include executing a predictive model on the modified analysis table
for generating a predicted value based on the modified analysis
table.
[0006] Another example of the present disclosure can include a
method involving obtaining a first table that includes first data
referencing a set of subjects, wherein each subject in the set of
subjects is correlated in the first data to one or more variable
values describing a transaction associated with the subject, and
wherein the first data includes at least one one-to-many
relationship in which a subject in the set of subjects is
referenced in multiple observations. The method can also include
obtaining second data referencing the set of subjects, wherein each
subject in the set of subjects is correlated in the second data to
one or more attributes describing the subject. The method can also include generating an analysis table based on the second data, the
analysis table being separate from the first table. The method can
also include executing a sequence of processing operations on the
first data in a particular order defined by a data-processing
pipeline to modify the analysis table to include features
associated with the set of subjects. Executing each respective
processing operation in the sequence to generate the modified
analysis table can involve: deriving a respective set of features
from the first data by executing a respective feature-extraction
operation on the first data; and adding the respective set of
features to the analysis table, such that each subject in the set
of subjects is correlated in the analysis table to corresponding
values for the respective set of features. The method can also
include executing a predictive model on the modified analysis table
for generating a predicted value based on the modified analysis
table. Some or all of the method may be implemented by one or more
processing devices.
[0007] Still another example of the present disclosure can include
a non-transitory computer-readable medium comprising program code
that is executable by one or more processing devices for causing
the one or more processing devices to perform operations. The
operations can include obtaining a first table that includes first
data referencing a set of subjects, wherein each subject in the set
of subjects is correlated in the first data to one or more variable
values describing a transaction associated with the subject, and
wherein the first data includes at least one one-to-many
relationship in which a subject in the set of subjects is
referenced in multiple observations. The operations can include
obtaining second data referencing the set of subjects, wherein each
subject in the set of subjects is correlated in the second data to
one or more attributes describing the subject. The operations can
include generating an analysis table based on the second data, the
analysis table being separate from the first table. The operations
can include executing a sequence of processing operations on the
first data in a particular order defined by a data-processing
pipeline to modify the analysis table to include features
associated with the set of subjects. Executing each respective
processing operation in the sequence to generate the modified
analysis table can involve: deriving a respective set of features
from the first data by executing a respective feature-extraction
operation on the first data; and adding the respective set of
features to the analysis table, such that each subject in the set
of subjects is correlated in the analysis table to corresponding
values for the respective set of features. The operations can
include executing a predictive model on the modified analysis table
for generating a predicted value based on the modified analysis
table.
[0008] This summary is not intended to identify key or essential
features of the claimed subject matter, nor is it intended to be
used in isolation to determine the scope of the claimed subject
matter. The subject matter should be understood by reference to
appropriate portions of the entire specification, any or all
drawings, and each claim.
[0009] The foregoing, together with other features and examples,
will become more apparent upon referring to the following
specification, claims, and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The present disclosure is described in conjunction with the
appended figures:
[0011] FIG. 1 depicts a block diagram of an example of a computing
system according to some aspects.
[0012] FIG. 2 depicts an example of devices that can communicate
with each other over an exchange system and via a network according
to some aspects.
[0013] FIG. 3 depicts a block diagram of a model of an example of a
communications protocol system according to some aspects.
[0014] FIG. 4 depicts a hierarchical diagram of an example of a
communications grid computing system including a variety of control
and worker nodes according to some aspects.
[0015] FIG. 5 depicts a flow chart of an example of a process for
adjusting a communications grid or a work project in a
communications grid after a failure of a node according to some
aspects.
[0016] FIG. 6 depicts a block diagram of a portion of a
communications grid computing system including a control node and a
worker node according to some aspects.
[0017] FIG. 7 depicts a flow chart of an example of a process for
executing a data analysis or processing project according to some
aspects.
[0018] FIG. 8 depicts a block diagram including components of an
Event Stream Processing Engine (ESPE) according to some
aspects.
[0019] FIG. 9 depicts a flow chart of an example of a process
including operations performed by an event stream processing engine
according to some aspects.
[0020] FIG. 10 depicts a block diagram of an ESP system interfacing
between a publishing device and multiple event subscribing devices
according to some aspects.
[0021] FIG. 11 depicts a flow chart of an example of a process for
generating and using a machine learning model according to some
aspects.
[0022] FIG. 12 depicts a node-link diagram of an example of a
neural network according to some aspects.
[0023] FIG. 13 depicts a data table including an example of raw
data associated with patient vaccinations according to some aspects
of the present disclosure.
[0024] FIG. 14 depicts a data table including an example of raw
data associated with vaccine codes according to some aspects of the
present disclosure.
[0025] FIG. 15 depicts an example of transforming raw data into
model ready data according to some aspects of the present
disclosure.
[0026] FIG. 16 depicts an example of a process for generating a
pipeline according to some aspects of the present disclosure.
[0027] FIG. 17 depicts an example of a process for automatically
generating a pipeline according to some aspects of the present
disclosure.
[0028] FIG. 18 depicts a flow chart of an example of a process for
implementing a model-accuracy test according to some aspects of the
present disclosure.
[0029] FIG. 19 depicts an example of a graphical user interface
according to some aspects of the present disclosure.
[0030] FIG. 20 depicts a flow chart of an example of a process for
generating a pipeline according to some aspects of the present
disclosure.
[0031] In the appended figures, similar components or features can
have the same reference label. Further, various components of the
same type can be distinguished by following the reference label
with a lowercase letter that distinguishes among the similar
components. If only the first reference label is used in the
specification, the description is applicable to any one of the
similar components having the same first reference label
irrespective of the lowercase letter.
DETAILED DESCRIPTION
[0032] Predictive models and other types of models are often
configured to receive input data in a particular format. Examples
of such input data can include one or more transactional tables,
subject tables, and/or other tables, which may be in a
database-normalized format that is different from the particular
format expected by the predictive models. Input data that is not in
that particular format expected by the predictive models may be
incompatible with such models, causing the models to generate
inaccurate results or malfunction. For example, many models are
configured to operate on input data that is formatted such that a
single row refers to a single subject, and consequently many
predictive modeling toolkits generally assume that input data has
this format. Such models are unable to properly handle input data
in other formats, for example data that includes one-to-many
relationships in which there are multiple observations (e.g., rows
of data) involving a single subject in a data table. If a model is
provided with input data that has such one-to-many relationships,
the model may fail or provide inaccurate results. Additionally, the
accuracy of a model depends on the amount, quality, and type of
input data that is supplied. Supplying an insufficient amount of
input data or the wrong types of input data can lead to poor
results. To avoid these issues, data scientists are often called in
to prepare data (e.g., raw data) for use with a model. But
preparing data can be a complex, subjective, tedious, error-prone,
and slow process. By some estimates, data scientists may even
devote up to 80% of their time to manually studying the data,
identifying relevant variables in the data, and properly formatting
the relevant variables, in order to generate an analysis table that
is compatible for use with a target model. Such analysis tables are
often thousands or millions of rows long and dozens of columns
wide, though their sizes can far exceed this amount.
[0033] As noted above, generating an analysis table for use with a
target model can be difficult. An analysis table is a type of data
structure such as a single data table that includes relevant
information from input data. The input data may or may not be in
the form of one or more data tables. The information in the
analysis table can be formatted and arranged in a particular
configuration that is compatible with a target model. Analysis
tables may also include additional variables that are not present
in the input data, but that are derived from the information in the
input data, in an effort to improve the modelling results (e.g.,
the accuracy of an output from a model). These additional variables
are typically determined and included in the analysis table by a
data scientist after extensive study of the input data. Generating
and including additional variables (e.g., from other complementary
tables) into the analysis table can be another complex, slow, and
error-prone process that places additional burdens on the data
scientist.
[0034] Some examples of the present disclosure can overcome one or
more of the abovementioned problems by generating a computerized
data-processing pipeline for processing and transforming data into
formats compatible with target models, such as predictive models
and other models. The pipeline can include a sequence of processing
operations configured to analyze input data (e.g., raw data from
one or more key-linked data tables) and generate an analysis table
that is compatible with a target model based on the input data.
Once the pipeline has been generated, a computer can execute the
pipeline to automatically generate an analysis table that is
compatible with the target model. This can significantly reduce the time, complexity, subjectivity, and errors associated with preparing big data for modelling.
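As a rough illustration of this arrangement, the sketch below represents a data-processing pipeline in Python as an ordered sequence of processing operations, each of which receives the input data and the current analysis table and returns the modified table. The class name, type alias, and use of pandas are assumptions for illustration only, not the implementation described in this disclosure.

```python
from typing import Callable, List
import pandas as pd

# A processing operation takes the raw input data and the current analysis
# table, and returns the analysis table with new feature columns added.
ProcessingOperation = Callable[[pd.DataFrame, pd.DataFrame], pd.DataFrame]

class DataProcessingPipeline:
    """Executes processing operations in a particular order to build an analysis table."""

    def __init__(self, operations: List[ProcessingOperation]):
        self.operations = operations  # the particular order is defined by this list

    def run(self, input_data: pd.DataFrame, analysis_table: pd.DataFrame) -> pd.DataFrame:
        for operation in self.operations:
            analysis_table = operation(input_data, analysis_table)
        return analysis_table
```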
[0035] At least one of the processing operations in the sequence
can be configured to execute a feature-extraction operation on the
input data for determining one or more variables of significance to
the modelling process, and to incorporate values for the one or
more determined variables into the analysis table. A
feature-extraction operation is a computer operation for performing
feature extraction. Feature extraction is a dimensionality
reduction technique that involves deriving values ("features") from
an initial dataset, such as from column values in a data table. The
features may not be expressly included in the initial dataset, but
rather may be derived from the initial dataset or from
complementary data. Such feature-extraction operations may more
rapidly and accurately identify relevant information to include in
the analysis table, as compared to manual analysis by a data
scientist.
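For example, one possible feature-extraction operation is sketched below using pandas; the column names (subject_id, amount) and the choice of aggregate features are hypothetical. It collapses one-to-many transactional data into per-subject derived features and adds them to the analysis table, so that each subject is correlated with values for those features.

```python
import pandas as pd

def extract_transaction_features(transactions: pd.DataFrame,
                                 analysis_table: pd.DataFrame) -> pd.DataFrame:
    # Collapse the one-to-many transactional data (many rows per subject)
    # into one row of derived features per subject.
    features = transactions.groupby("subject_id")["amount"].agg(
        txn_count="count",   # number of transactions per subject
        txn_total="sum",     # total transaction amount per subject
        txn_mean="mean",     # average transaction amount per subject
    ).reset_index()
    # Add the derived features to the analysis table (one row per subject).
    return analysis_table.merge(features, on="subject_id", how="left")
```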
[0036] In some examples, the pipeline may be manually generated by
a user using pipeline-creation software. The pipeline-creation
software can include a graphical user interface (GUI) with an
extensible toolbox of processing operations that can be added to
the pipeline in a desired order. For example, the toolbox may
include a set of feature-extraction operations that can be
dragged and dropped onto a canvas of the GUI and arranged in a desired
order to generate a pipeline. The processing operations may have
certain default values and variables that can be further customized
by the user, as desired. Once the pipeline is created, the user can
execute the pipeline in the pipeline-creation software on input
data, in order to transform the input data into an analysis table
that is compatible with a target model.
[0037] In some examples, the pipeline-creation software can
automatically generate the pipeline based on the characteristics of
the input data. For example, the pipeline-creation software can
analyze the characteristics of the input data to determine a set of
processing operations to include in the pipeline and to determine
an order to implement the set of processing operations. The
pipeline-creation software can then generate the pipeline by
organizing the set of processing operations in the determined
order. The pipeline-creation software can output the automatically
generated pipeline to the user in the GUI, so that the user can
view the pipeline and either accept it as-is or make any desired
customizations. For example, the user can customize the
automatically generated pipeline by manually adding processing
operations to the pipeline, removing processing operations from the
pipeline, and/or reordering the processing operations in the
pipeline. Once the user is finished customizing the pipeline, the
user can execute the pipeline on input data to generate a
corresponding analysis table. In some examples, the user can
iterate this process by repeatedly executing the pipeline and
further refining (e.g., customizing) the pipeline based on the
results thereof, to improve the pipeline over the course of
multiple iterations.
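A minimal sketch of automatic pipeline generation is shown below. The dtype-based heuristics and operation names are purely illustrative assumptions, not the selection logic actually used by the pipeline-creation software.

```python
import pandas as pd

def auto_generate_pipeline(input_data: pd.DataFrame) -> list:
    """Pick and order processing operations based on characteristics of the input data."""
    operations = []
    numeric_cols = input_data.select_dtypes(include="number").columns
    text_cols = input_data.select_dtypes(include="object").columns

    # Numeric transactional columns suggest per-subject aggregation features.
    if len(numeric_cols) > 0:
        operations.append(("aggregate_numeric", list(numeric_cols)))
    # Character or categorical columns suggest text- or category-based features.
    if len(text_cols) > 0:
        operations.append(("encode_categorical", list(text_cols)))

    # A fixed heuristic ordering: aggregation first, then encodings.
    return operations
```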
[0038] The pipeline can include any number and combination of novel
processing operations to generate an analysis table. For example,
the input data can include categorical variables, represented by
character strings, numbers, and/or symbols. It may be challenging
to analyze categorical variables when they exist in a many-to-one
relationship with a subject. So, the pipeline can include a
processing operation configured to concatenate the categorical
variables together into a single text string, converting them into
a longer character string that contains all of the categorical data
for each subject. The processing operation can then analyze the
character string using analysis techniques typically reserved for
unstructured text. This can be more efficient and effective than
alternative approaches, such as "dummying" the variables whereby
a separate indicator variable is created for each categorical value. This
may be particularly true when the categorical variables contain
hundreds or thousands of different values.
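The sketch below illustrates this idea with pandas and scikit-learn, using hypothetical column names and TF-IDF as a stand-in for whatever unstructured-text analysis is actually applied to the concatenated strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def categorical_to_text_features(transactions: pd.DataFrame,
                                 analysis_table: pd.DataFrame) -> pd.DataFrame:
    # Concatenate all categorical values for each subject into one text string.
    docs = (transactions.groupby("subject_id")["category_code"]
            .apply(lambda values: " ".join(values.astype(str)))
            .rename("category_text"))

    # Analyze the strings with a technique typically reserved for unstructured text.
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    text_features = pd.DataFrame(matrix.toarray(),
                                 columns=vectorizer.get_feature_names_out(),
                                 index=docs.index).reset_index()

    # Add the text-derived features to the analysis table, keyed by subject.
    return analysis_table.merge(text_features, on="subject_id", how="left")
```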
[0039] In some examples, the pipeline-creation software can
determine whether each processing operation in the pipeline is
helpful to, harmful to, or extraneous to a modelling result and
notify the user accordingly. For example, each processing operation
in the pipeline can modify the analysis table in a particular
manner, such as by including more information into the analysis
table, removing existing information from the analysis table, or
reformatting the information in the analysis table. After each
processing operation modifies the analysis table, the
pipeline-creation software can provide the modified analysis table
as input to a target model and determine a resulting accuracy of
the target model. The pipeline-creation software can then determine
if the accuracy of the model increased, as compared to the accuracy
of the model after the prior processing operation in the pipeline.
If the accuracy of the model increased, the pipeline-creation
software can determine that the current processing operation is a
helpful processing operation that improves the modelling result. If
the accuracy of the model decreased, the pipeline-creation software
can determine that the current processing operation is a harmful
processing operation that is detrimental to the modelling result.
If the accuracy of the model stayed substantially the same (e.g.,
within a preset tolerance range), the pipeline-creation software can
determine that the processing operation is an extraneous processing
operation that has little or no impact on the modelling result. The
pipeline-creation software can notify the user of whether each
processing operation in the pipeline helps, harms, or is extraneous
to the modelling result, so that the user can modify the pipeline
as desired.
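One way such a model-accuracy test could be structured is sketched below. Here score_model is a hypothetical callable that runs the target model on an analysis table and returns an accuracy metric, and the tolerance value is an assumed default for the preset tolerance range.

```python
def classify_operation_effect(current_accuracy: float,
                              prior_accuracy: float,
                              tolerance: float = 0.001) -> str:
    """Compare the model's accuracy before and after a processing operation."""
    change = current_accuracy - prior_accuracy
    if abs(change) <= tolerance:
        return "extraneous"   # little or no impact on the modelling result
    if change > 0:
        return "helpful"      # the operation improved the modelling result
    return "harmful"          # the operation degraded the modelling result

def evaluate_pipeline(operations, input_data, analysis_table, score_model):
    """Run each operation, score the model on the modified table, and report the effect."""
    prior = score_model(analysis_table)   # accuracy for the initial analysis table
    report = []
    for name, operation in operations:
        analysis_table = operation(input_data, analysis_table)
        current = score_model(analysis_table)
        report.append((name, classify_operation_effect(current, prior)))
        prior = current
    return analysis_table, report
```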
[0040] For example, the pipeline-creation software can indicate in
the GUI that a particular processing operation is extraneous to the
modelling result. Such extraneous processing operations may
unnecessarily consume time and computing resources (e.g.,
processing power and memory) for little or no gain. So, the user
can remove or adjust the parameter settings of the extraneous
processing operation to avoid wasting computing resources. As
another example, the GUI can indicate that a particular processing
operation is harmful to the modelling result. Such harmful
processing operations are not only detrimental to the modelling
result but also consume time and computing resources. So, the user
can remove or adjust the parameter settings of the harmful
processing operation to improve the modelling result.
Alternatively, the system may automatically remove or adjust the
parameter settings of the harmful processing operation to improve
the modelling result. In this way, the pipeline-creation software
can assist the user in creating a more optimal pipeline.
[0041] In some examples, the pipeline-creation software can
automatically generate program code to implement the pipeline. The
program code can be automatically generated using code templates
with parameters (e.g., fields and variables) that are configurable
by the pipeline-creation software. The program code can be
configured to be executed independently of the pipeline-creation
software and more rapidly than is possible by executing the
pipeline in the pipeline-creation software. For example, the
pipeline-creation software can generate program code that can be
deployed to a production environment for executing the pipeline on
other input data, which may have the same structure as the original
input data. The program code can be optimized for the production
environment, for example by being configured to run on multiple
processors in parallel in the production environment. Thus, the
user can initially create the pipeline using the pipeline-creation
software, and then quickly and easily deploy the program code for
the pipeline to a computing environment, so that the pipeline can
be subsequently executed in a faster manner independently of the
pipeline-creation software. Generating and deploying the program
code can allow the pipeline to be quickly and repeatedly executed
in relation to other input data. This may also allow other software
to more easily interface with and initiate the pipeline.
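The sketch below illustrates the template idea using Python's string.Template. The template body, field names, and parameter values are hypothetical examples rather than the code actually generated by the pipeline-creation software.

```python
from string import Template

# A hypothetical code template for one processing operation. The $-prefixed
# fields are the configurable parameters mentioned above.
AGGREGATION_TEMPLATE = Template("""\
features = transactions.groupby("$subject_key")["$value_column"].agg(
    ${prefix}_count="count",
    ${prefix}_sum="sum",
)
analysis_table = analysis_table.merge(features, on="$subject_key", how="left")
""")

def generate_operation_code(parameters: dict) -> str:
    """Fill the template's fields with operation-specific parameter values."""
    return AGGREGATION_TEMPLATE.substitute(parameters)

# Example: produce a deployable code fragment for one configured operation.
code_fragment = generate_operation_code({
    "subject_key": "subject_id",
    "value_column": "amount",
    "prefix": "txn",
})
print(code_fragment)
```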
[0042] The pipeline can be configured to convert any suitable type
of input data into an analysis table. One example of such a data
type can be transactional data describing transactions associated
with one or more subjects. In a typical database architecture,
transactional data may be stored in a first data table that is
linked to a second data table by one or more keys. An example of
the second data table can be a subject table that includes subject
data (e.g., attributes of the subjects that engaged in the
transactions). The transactional data and/or subject data may
contain large amounts of information, such as millions of rows of
information. Using conventional approaches, it can be challenging
for data scientists to convert the large numbers of rows of
transactional data in the first data table, along with the
somewhat smaller number of rows of subject data in the second data
table, into a unified analysis table that is properly formatted and
suitably compatible with a target model, such as a predictive
model. But some examples described herein can automate the process
using a pipeline, which can yield significant improvements to the
modelling results.
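Tying these pieces together, the toy example below (hypothetical column names, pandas assumed) starts the analysis table from the subject table's attributes and then merges in features derived from the transactional table, yielding a single table with one row per subject that a row-per-subject model could consume.

```python
import pandas as pd

# Hypothetical key-linked inputs: a transactional table (many rows per subject)
# and a subject table (one row per subject with descriptive attributes).
transactions = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 25.0, 5.0, 7.5, 12.0],
})
subjects = pd.DataFrame({
    "subject_id": [1, 2],
    "age": [34, 51],
    "region": ["east", "west"],
})

# Start the analysis table from the subject attributes (one row per subject),
# then add features derived from the transactional data.
analysis_table = subjects.copy()
txn_features = (transactions.groupby("subject_id")["amount"]
                .agg(txn_count="count", txn_total="sum")
                .reset_index())
analysis_table = analysis_table.merge(txn_features, on="subject_id", how="left")
print(analysis_table)  # one row per subject, unified from both source tables
```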
[0043] These illustrative examples are given to introduce the
reader to the general subject matter discussed here and are not
intended to limit the scope of the disclosed concepts. The
following sections describe various additional features and
examples with reference to the drawings in which like numerals
indicate like elements but, like the illustrative examples, should
not be used to limit the present disclosure.
[0044] FIGS. 1-12 depict examples of systems and methods usable for
generating pipelines configured to transform input data into
analysis tables according to some aspects of the present
disclosure. For example, FIG. 1 is a block diagram of an example of
the hardware components of a computing system according to some
aspects. Data transmission network 100 is a specialized computer
system that may be used for processing large amounts of data where
a large number of computer processing cycles are required.
[0045] Data transmission network 100 may also include computing
environment 114. Computing environment 114 may be a specialized
computer or other machine that processes the data received within
the data transmission network 100. The computing environment 114
may include one or more other systems. For example, computing
environment 114 may include a database system 118 or a
communications grid 120. The computing environment 114 can include
one or more processing devices (e.g., distributed over one or more
networks or otherwise in communication with one another) that may
collectively be referred to herein as a processor or a
processing device.
[0046] Data transmission network 100 also includes one or more
network devices 102. Network devices 102 may include client devices
that can communicate with computing environment 114. For example,
network devices 102 may send data to the computing environment 114
to be processed, may send communications to the computing
environment 114 to control different aspects of the computing
environment or the data it is processing, among other reasons.
Network devices 102 may interact with the computing environment 114
through a number of ways, such as, for example, over one or more
networks 108.
[0047] In some examples, network devices 102 may provide a large
amount of data, either all at once or streaming over a period of
time (e.g., using event stream processing (ESP)), to the computing
environment 114 via networks 108. For example, the network devices
102 can transmit electronic messages, all at once or streaming over
a period of time, to the computing environment 114 via networks
108.
[0048] The network devices 102 may include network computers,
sensors, databases, or other devices that may transmit or otherwise
provide data to computing environment 114. For example, network
devices 102 may include local area network devices, such as
routers, hubs, switches, or other computer networking devices.
These devices may provide a variety of stored or generated data,
such as network data or data specific to the network devices 102
themselves. Network devices 102 may also include sensors that
monitor their environment or other devices to collect data
regarding that environment or those devices, and such network
devices 102 may provide data they collect over time. Network
devices 102 may also include devices within the internet of things,
such as devices within a home automation network. Some of these
devices may be referred to as edge devices, and may involve
edge-computing circuitry. Data may be transmitted by network
devices 102 directly to computing environment 114 or to
network-attached data stores, such as network-attached data stores
110 for storage so that the data may be retrieved later by the
computing environment 114 or other portions of data transmission
network 100. For example, the network devices 102 can transmit data
to a network-attached data store 110 for storage. The computing
environment 114 may later retrieve the data from the
network-attached data store 110 and apply the data as input to a
pipeline according to some aspects described herein.
[0049] Network-attached data stores 110 can store data to be
processed by the computing environment 114 as well as any
intermediate or final data generated by the computing system in
non-volatile memory. But in certain examples, the configuration of
the computing environment 114 allows its operations to be performed
such that intermediate and final data results can be stored solely
in volatile memory (e.g., RAM), without a requirement that
intermediate or final data results be stored to non-volatile types
of memory (e.g., disk). This can be useful in certain situations,
such as when the computing environment 114 receives ad hoc queries
from a user and when responses, which are generated by processing
large amounts of data, need to be generated dynamically (e.g., on
the fly). In this situation, the computing environment 114 may be
configured to retain the processed information within memory so
that responses can be generated for the user at different levels of
detail as well as allow a user to interactively query against this
information.
[0050] Network-attached data stores 110 may store a variety of
different types of data organized in a variety of different ways
and from a variety of different sources. For example,
network-attached data stores may include storage other than primary
storage located within computing environment 114 that is directly
accessible by processors located therein. Network-attached data
stores may include secondary, tertiary or auxiliary storage, such
as large hard drives, servers, virtual memory, among other types.
Storage devices may include portable or non-portable storage
devices, optical storage devices, and various other mediums capable
of storing or containing data. A machine-readable storage medium or
computer-readable storage medium may include a non-transitory
medium in which data can be stored and that does not include
carrier waves or transitory electronic communications. Examples of
a non-transitory medium may include, for example, a magnetic disk
or tape, optical storage media such as compact disk or digital
versatile disk, flash memory, memory or memory devices. A
computer-program product may include code or machine-executable
instructions that may represent a procedure, a function, a
subprogram, a program, a routine, a subroutine, a module, a
software package, a class, or any combination of instructions, data
structures, or program statements. A code segment may be coupled to
another code segment or a hardware circuit by passing or receiving
information, data, arguments, parameters, or memory contents.
Information, arguments, parameters, data, etc. may be passed,
forwarded, or transmitted via any suitable means including memory
sharing, message passing, token passing, network transmission,
among others. Furthermore, the data stores may hold a variety of
different types of data. For example, network-attached data stores
110 may hold unstructured (e.g., raw) data.
[0051] The unstructured data may be presented to the computing
environment 114 in different forms such as a flat file or a
conglomerate of data records, and may have data values and
accompanying time stamps. The computing environment 114 may be used
to analyze the unstructured data in a variety of ways to determine
the best way to structure (e.g., hierarchically) that data, such
that the structured data is tailored to a type of further analysis
that a user wishes to perform on the data. For example, after being
processed, the unstructured time-stamped data may be aggregated by
time (e.g., into daily time period units) to generate time series
data or structured hierarchically according to one or more
dimensions (e.g., parameters, attributes, or variables). For
example, data may be stored in a hierarchical data structure, such
as a relational online analytical processing (ROLAP) or
multidimensional online analytical processing (MOLAP) database, or
may be stored in another tabular form, such as in a flat-hierarchy
form.
[0052] Data transmission network 100 may also include one or more
server farms 106. Computing environment 114 may route select
communications or data to the server farms 106 or one or more
servers within the server farms 106. Server farms 106 can be
configured to provide information in a predetermined manner. For
example, server farms 106 may access data to transmit in response
to a communication. Server farms 106 may be separately housed from
each other device within data transmission network 100, such as
computing environment 114, or may be part of a device or
system.
[0053] Server farms 106 may host a variety of different types of
data processing as part of data transmission network 100. Server
farms 106 may receive a variety of different data from network
devices, from computing environment 114, from cloud network 116, or
from other sources. The data may have been obtained or collected
from one or more websites, sensors, as inputs from a control
database, or may have been received as inputs from an external
system or device. Server farms 106 may assist in processing the
data by turning raw data into processed data based on one or more
rules implemented by the server farms. For example, sensor data may
be analyzed to determine changes in an environment over time or in
real-time.
[0054] Data transmission network 100 may also include one or more
cloud networks 116. Cloud network 116 may include a cloud
infrastructure system that provides cloud services. In certain
examples, services provided by the cloud network 116 may include a
host of services that are made available to users of the cloud
infrastructure system on demand. Cloud network 116 is shown in FIG.
1 as being connected to computing environment 114 (and therefore
having computing environment 114 as its client or user), but cloud
network 116 may be connected to or utilized by any of the devices
in FIG. 1. Services provided by the cloud network 116 can
dynamically scale to meet the needs of its users. The cloud network
116 may include one or more computers, servers, or systems. In some
examples, the computers, servers, or systems that make up the cloud
network 116 are different from the user's own on-premises
computers, servers, or systems. For example, the cloud network 116
may host an application, and a user may, via a communication
network such as the Internet, order and use the application on
demand. In some examples, the cloud network 116 may host an
application for generating pipelines that generate analysis tables based
on input data.
[0055] While each device, server, and system in FIG. 1 is shown as
a single device, multiple devices may instead be used. For example,
a set of network devices can be used to transmit various
communications from a single user, or remote server 140 may include
a server stack. As another example, data may be processed as part
of computing environment 114.
[0056] Each communication within data transmission network 100
(e.g., between client devices, between a device and connection
management system 150, between server farms 106 and computing
environment 114, or between a server and a device) may occur over
one or more networks 108. Networks 108 may include one or more of a
variety of different types of networks, including a wireless
network, a wired network, or a combination of a wired and wireless
network. Examples of suitable networks include the Internet, a
personal area network, a local area network (LAN), a wide area
network (WAN), or a wireless local area network (WLAN). A wireless
network may include a wireless interface or combination of wireless
interfaces. As an example, a network in the one or more networks
108 may include a short-range communication channel, such as a
Bluetooth or a Bluetooth Low Energy channel. A wired network may
include a wired interface. The wired or wireless networks may be
implemented using routers, access points, bridges, gateways, or the
like, to connect devices in the network 108. The networks 108 can
be incorporated entirely within or can include an intranet, an
extranet, or a combination thereof. In one example, communications
between two or more systems or devices can be achieved by a secure
communications protocol, such as secure sockets layer (SSL) or
transport layer security (TLS). In addition, data or transactional
details may be encrypted.
[0057] Some aspects may utilize the Internet of Things (IoT), where
things (e.g., machines, devices, phones, sensors) can be connected
to networks and the data from these things can be collected and
processed within the things or external to the things. For example,
the IoT can include sensors in many different devices, and high
value analytics can be applied to identify hidden relationships and
drive increased efficiencies. This can apply to both big data
analytics and real-time (e.g., ESP) analytics.
[0058] As noted, computing environment 114 may include a
communications grid 120 and a transmission network database system
118. Communications grid 120 may be a grid-based computing system
for processing large amounts of data. The transmission network
database system 118 may be for managing, storing, and retrieving
large amounts of data that are distributed to and stored in the one
or more network-attached data stores 110 or other data stores that
reside at different locations within the transmission network
database system 118. The computing nodes in the communications grid
120 and the transmission network database system 118 may share the
same processor hardware, such as processors that are located within
computing environment 114.
[0059] In some examples, the computing environment 114, a network
device 102, or both can implement one or more processes for
generating pipelines configured to transform input data into
analysis tables. For example, the computing environment 114, a
network device 102, or both can implement one or more versions of
the processes discussed with respect to any of the figures.
[0060] FIG. 2 is an example of devices that can communicate with
each other over an exchange system and via a network according to
some aspects. As noted, each communication within data transmission
network 100 may occur over one or more networks. System 200
includes a network device 204 configured to communicate with a
variety of types of client devices, for example client devices 230,
over a variety of types of communication channels.
[0061] As shown in FIG. 2, network device 204 can transmit a
communication over a network (e.g., a cellular network via a base
station 210). In some examples, the communication can include time
series data. The communication can be routed to another network
device, such as network devices 205-209, via base station 210. The
communication can also be routed to computing environment 214 via
base station 210. In some examples, the network device 204 may
collect data either from its surrounding environment or from other
network devices (such as network devices 205-209) and transmit that
data to computing environment 214.
[0062] Although network devices 204-209 are shown in FIG. 2 as a
mobile phone, laptop computer, tablet computer, temperature sensor,
motion sensor, and audio sensor respectively, the network devices
may be or include sensors that are sensitive to detecting aspects
of their environment. For example, the network devices may include
sensors such as water sensors, power sensors, electrical current
sensors, chemical sensors, optical sensors, pressure sensors,
geographic or position sensors (e.g., GPS), velocity sensors,
acceleration sensors, flow rate sensors, among others. Examples of
characteristics that may be sensed include force, torque, load,
strain, position, temperature, air pressure, fluid flow, chemical
properties, resistance, electromagnetic fields, radiation,
irradiance, proximity, acoustics, moisture, distance, speed,
vibrations, acceleration, electrical potential, and electrical
current, among others. The sensors may be mounted to various
components used as part of a variety of different types of systems.
The network devices may detect and record data related to the
environment that they monitor, and transmit that data to computing
environment 214.
[0063] The network devices 204-209 may also perform processing on
the data they collect before transmitting the data to the computing
environment 214, or before deciding whether to transmit data to the
computing environment 214. For example, network devices 204-209 may
determine whether the data collected meets certain rules, for
example by comparing the data, or values calculated from the data,
to one or more thresholds. The network devices 204-209
may use this data or comparisons to determine if the data is to be
transmitted to the computing environment 214 for further use or
processing. In some examples, the network devices 204-209 can
pre-process the data prior to transmitting the data to the
computing environment 214. For example, the network devices 204-209
can reformat the data before transmitting the data to the computing
environment 214 for further processing (e.g., via a pipeline).
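For illustration only, the following Python sketch shows the kind of rule-based check and reformatting a network device might apply before transmitting data; the threshold value and helper names are hypothetical:

# Hypothetical threshold rule: only forward readings that meet the rule.
THRESHOLD = 75.0

def should_transmit(reading: float, threshold: float = THRESHOLD) -> bool:
    """Compare a collected value to a threshold before deciding to send it."""
    return reading >= threshold

def preprocess(reading: float) -> dict:
    """Reformat raw data into a simple record prior to transmission."""
    return {"value": round(reading, 2), "units": "C"}

readings = [68.2, 80.9, 75.0]
outgoing = [preprocess(r) for r in readings if should_transmit(r)]
print(outgoing)  # only readings meeting the rule are forwarded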
[0064] Computing environment 214 may include machines 220, 240.
Although computing environment 214 is shown in FIG. 2 as having two
machines 220, 240, computing environment 214 may have only one
machine or may have more than two machines. The machines 220, 240
that make up computing environment 214 may include specialized
computers, servers, or other machines that are configured to
individually or collectively process large amounts of data. The
computing environment 214 may also include storage devices that
include one or more databases of structured data, such as data
organized in one or more hierarchies, or unstructured data. The
databases may communicate with the processing devices within
computing environment 214 to distribute data to them. Since network
devices may transmit data to computing environment 214, that data
may be received by the computing environment 214 and subsequently
stored within those storage devices. Data used by computing
environment 214 may also be stored in data stores 235, which may
also be a part of or connected to computing environment 214.
[0065] Computing environment 214 can communicate with various
devices via one or more routers 225 or other inter-network or
intra-network connection components. For example, computing
environment 214 may communicate with client devices 230 via one or
more routers 225. Computing environment 214 may collect, analyze or
store data from or pertaining to communications, client device
operations, client rules, or user-associated actions stored at one
or more data stores 235. Such data may influence communication
routing to the devices within computing environment 214 and how data
is stored or processed within computing environment 214, among
other actions.
[0066] Notably, various other devices can further be used to
influence communication routing or processing between devices
within computing environment 214 and with devices outside of
computing environment 214. For example, as shown in FIG. 2,
computing environment 214 may include a machine 240 that is a web
server. Computing environment 214 can retrieve data of interest,
such as client information (e.g., product information, client
rules, etc.), technical product details, news, blog posts, e-mails,
forum posts, electronic documents, social media posts (e.g.,
Twitter.TM. posts or Facebook.TM. posts), time series data,
transactional data, and so on.
[0067] In addition to computing environment 214 collecting data
(e.g., as received from network devices, such as sensors, and
client devices or other sources) to be processed as part of a big
data analytics project, it may also receive data in real time as
part of a streaming analytics environment. As noted, data may be
collected using a variety of sources as communicated via different
kinds of networks or locally. Such data may be received on a
real-time streaming basis. For example, network devices 204-209 may
receive data periodically and in real time from a web server or
other source. Devices within computing environment 214 may also
perform pre-analysis on the data they receive to determine if the
data received should be processed as part of an ongoing project.
For example, as part of a modelling project, the computing
environment 214 can perform a pre-analysis of the data using one or
more pipelines. The pre-analysis can include determining whether the
data is in a correct format for the model using the data and, if
not, reformatting the data into the correct format.
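For illustration only, a minimal Python sketch of such a pre-analysis step is shown below; the expected column names and type coercions are hypothetical and would depend on the model in question:

import pandas as pd

EXPECTED_COLUMNS = {"subject_id", "timestamp", "amount"}  # hypothetical model schema

def pre_analyze(df: pd.DataFrame) -> pd.DataFrame:
    """Check whether incoming data matches the expected format and reformat it."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"data is missing required columns: {missing}")
    out = df.copy()
    # Reformat: coerce types into the format the downstream model expects.
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out["amount"] = out["amount"].astype(float)
    return out

raw = pd.DataFrame({"subject_id": [1], "timestamp": ["2021-02-11"], "amount": ["9.99"]})
print(pre_analyze(raw).dtypes)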
[0068] FIG. 3 is a block diagram of a model of an example of a
communications protocol system according to some aspects. More
specifically, FIG. 3 identifies operation of a computing
environment in an Open Systems Interconnection (OSI) model that
corresponds to various connection components. The model 300 shows,
for example, how a computing environment, such as computing
environment 114 in FIG. 1 (or computing environment 214 in FIG. 2),
may communicate with other
devices in its network, and control how communications between the
computing environment and other devices are executed and under what
conditions.
[0069] The model 300 can include layers 302-314. The layers 302-314
are arranged in a stack. Each layer in the stack serves the layer
one level higher than it (except for the application layer, which
is the highest layer), and is served by the layer one level below
it (except for the physical layer 302, which is the lowest layer).
The physical layer 302 is the lowest layer because it receives and
transmits raw bits of data, and is the farthest layer from the
user in a communications system. On the other hand, the application
layer is the highest layer because it interacts directly with a
software application.
[0070] As noted, the model 300 includes a physical layer 302.
Physical layer 302 represents physical communication, and can
define parameters of that physical communication. For example, such
physical communication may come in the form of electrical, optical,
or electromagnetic communications. Physical layer 302 also defines
protocols that may control communications within a data
transmission network.
[0071] Link layer 304 defines links and mechanisms used to transmit
(e.g., move) data across a network. The link layer manages
node-to-node communications, such as within a grid-computing
environment. Link layer 304 can detect and correct errors (e.g.,
transmission errors in the physical layer 302). Link layer 304 can
also include a media access control (MAC) layer and logical link
control (LLC) layer.
[0072] Network layer 306 can define the protocol for routing within
a network. In other words, the network layer coordinates
transferring data across nodes in a same network (e.g., such as a
grid-computing environment). Network layer 306 can also define the
processes used to structure local addressing within the
network.
[0073] Transport layer 308 can manage the transmission of data and
the quality of the transmission or receipt of that data. Transport
layer 308 can provide a protocol for transferring data, such as,
for example, a Transmission Control Protocol (TCP). Transport layer
308 can assemble and disassemble data frames for transmission. The
transport layer can also detect transmission errors occurring in
the layers below it.
[0074] Session layer 310 can establish, maintain, and manage
communication connections between devices on a network. In other
words, the session layer controls the dialogues or nature of
communications between network devices on the network. The session
layer may also establish checkpointing, adjournment, termination,
and restart procedures.
[0075] Presentation layer 312 can provide translation for
communications between the application and network layers. In other
words, this layer may encrypt, decrypt or format data based on data
types known to be accepted by an application or network layer.
[0076] Application layer 314 interacts directly with software
applications and end users, and manages communications between
them. Application layer 314 can identify destinations, local
resource states or availability, and communication content or
formatting using the applications.
[0077] For example, a communication link can be established between
two devices on a network. One device can transmit an analog or
digital representation of an electronic message that includes a
data set to the other device. The other device can receive the
analog or digital representation at the physical layer 302. The
other device can transmit the data associated with the electronic
message through the remaining layers 304-314. The application layer
314 can receive data associated with the electronic message. The
application layer 314 can identify one or more applications, such
as an application for generating pipelines configured to transform
input data into analysis tables, to which to transmit data
associated with the electronic message. The application layer 314
can transmit the data to the identified application.
[0078] Intra-network connection components 322, 324 can operate in
lower levels, such as physical layer 302 and link layer 304,
respectively. For example, a hub can operate in the physical layer,
a switch can operate in the link layer, and a router can
operate in the network layer. Inter-network connection components
326, 328 are shown to operate on higher levels, such as layers
306-314. For example, routers can operate in the network layer and
network devices can operate in the transport, session,
presentation, and application layers.
[0079] A computing environment 330 can interact with or operate on,
in various examples, one, more, all or any of the various layers.
For example, computing environment 330 can interact with a hub
(e.g., via the link layer) to adjust which devices the hub
communicates with. Because the physical layer 302 serves the link
layer 304, adjustments made through the link layer 304 can affect
the data that the computing environment 330 receives. For example,
the computing environment 330 may control the devices from which it
receives data. For example, if the
computing environment 330 knows that a certain network device has
turned off, broken, or otherwise become unavailable or unreliable,
the computing environment 330 may instruct the hub to prevent any
data from being transmitted to the computing environment 330 from
that network device. Such a process may be beneficial to avoid
receiving data that is inaccurate or that has been influenced by an
uncontrolled environment. As another example, computing environment
330 can communicate with a bridge, switch, router or gateway and
influence which device within the system (e.g., system 200) the
component selects as a destination. In some examples, computing
environment 330 can interact with various layers by exchanging
communications with equipment operating on a particular layer by
routing or modifying existing communications. In another example,
such as in a grid-computing environment, a node may determine how
data within the environment should be routed (e.g., which node
should receive certain data) based on certain parameters or
information provided by other layers within the model.
[0080] The computing environment 330 may be a part of a
communications grid environment, the communications of which may be
implemented as shown in the protocol of FIG. 3. For example,
referring back to FIG. 2, one or more of machines 220 and 240 may
be part of a communications grid-computing environment. A gridded
computing environment may be employed in a distributed system with
non-interactive workloads where data resides in memory on the
machines, or compute nodes. In such an environment, analytic code,
instead of a database management system, can control the processing
performed by the nodes. Data is co-located by pre-distributing it
to the grid nodes, and the analytic code on each node loads the
local data into memory. Each node may be assigned a particular
task, such as a portion of a processing project, or to organize or
control other nodes within the grid. For example, each node may be
assigned a portion of a processing task for a pipeline.
[0081] FIG. 4 is a hierarchical diagram of an example of a
communications grid computing system 400 including a variety of
control and worker nodes according to some aspects. Communications
grid computing system 400 includes three control nodes and one or
more worker nodes. Communications grid computing system 400
includes control nodes 402, 404, and 406. The control nodes are
communicatively connected via communication paths 451, 453, and
455. The control nodes 402-406 may transmit information (e.g.,
related to the communications grid or notifications) to and receive
information from each other. Although communications grid computing
system 400 is shown in FIG. 4 as including three control nodes, the
communications grid may include more or fewer than three control
nodes.
[0082] Communications grid computing system 400 (which can be
referred to as a "communications grid") also includes one or more
worker nodes. Shown in FIG. 4 are six worker nodes 410-420.
Although FIG. 4 shows six worker nodes, a communications grid can
include more or fewer than six worker nodes. The number of worker
nodes included in a communications grid may depend upon the size of
the project or data set being processed by the communications grid,
the capacity of each worker node, and the time designated for the
communications grid to complete the project, among other factors.
Each worker node within the communications grid
computing system 400 may be connected (wired or wirelessly, and
directly or indirectly) to control nodes 402-406. Each worker node
may receive information from the control nodes (e.g., an
instruction to perform work on a project) and may transmit
information to the control nodes (e.g., a result from work
performed on a project). Furthermore, worker nodes may communicate
with each other directly or indirectly. For example, worker nodes
may transmit data between each other related to a job being
performed or an individual task within a job being performed by
that worker node. In some examples, worker nodes may not be
connected (communicatively or otherwise) to certain other worker
nodes. For example, a worker node 410 may only be able to
communicate with a particular control node 402. The worker node 410
may be unable to communicate with other worker nodes 412-420 in the
communications grid, even if the other worker nodes 412-420 are
controlled by the same control node 402.
[0083] A control node 402-406 may connect with an external device
with which the control node 402-406 may communicate (e.g., a
communications grid user, such as a server or computer, may connect
to a controller of the grid). For example, a server or computer may
connect to control nodes 402-406 and may transmit a project or job
to the node, such as a project or job related to executing a
pipeline for generating an analysis table based on input data. The
project may include the data set. The data set may be of any size
and can include a time series, in some examples. Once the control
node 402-406 receives such a project including a large data set,
the control node may distribute the data set or projects related to
the data set to be performed by worker nodes. Alternatively, for a
project including a large data set, the data set may be received or
stored by a machine other than a control node 402-406 (e.g., a
Hadoop data node).
[0084] Control nodes 402-406 can maintain knowledge of the status
of the nodes in the grid (e.g., grid status information), accept
work requests from clients, subdivide the work across worker nodes,
and coordinate the worker nodes, among other responsibilities.
Worker nodes 412-420 may accept work requests from a control node
402-406 and provide the control node with results of the work
performed by the worker node. A grid may be started from a single
node (e.g., a machine, computer, server, etc.). This first node may
be assigned or may start as the primary control node 402 that will
control any additional nodes that enter the grid.
[0085] When a project is submitted for execution (e.g., by a client
or a controller of the grid) it may be assigned to a set of nodes.
After the nodes are assigned to a project, a data structure (e.g.,
a communicator) may be created. The communicator may be used by the
project to share information between the project code running on
each node. A communication handle may be created on each
node. A handle, for example, is a reference to the communicator
that is valid within a single process on a single node, and the
handle may be used when requesting communications between
nodes.
[0086] A control node, such as control node 402, may be designated
as the primary control node. A server, computer or other external
device may connect to the primary control node. Once the control
node 402 receives a project, the primary control node may
distribute portions of the project to its worker nodes for
execution. For example, a project for generating an analysis table
based on input data can be initiated on communications grid
computing system 400. A primary control node can control the work
to be performed for the project in order to complete the project as
requested or instructed. The primary control node may distribute
pipeline work to the worker nodes 412-420 based on various factors,
such as which subsets or portions of projects may be completed most
efficiently and in the correct amount of time. For example, a
worker node 412 may execute a processing operation in the pipeline
using at least a portion of data that is already local to (e.g.,
stored on) the worker node. The primary control node also
coordinates and processes the results of the work performed by each
worker node 412-420 after each worker node 412-420 executes and
completes its job. For example, the primary control node may
receive a result from one or more worker nodes 412-420, and the
primary control node may organize (e.g., collect and assemble) the
results received and compile them to produce a complete result for
the project received from the end user.
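For illustration only, the following Python sketch uses the standard concurrent.futures module as a stand-in for worker nodes: partitions of data are processed in parallel, and a coordinating process collects and assembles the partial results into a complete result. The partitioning and the processing operation are hypothetical:

from concurrent.futures import ProcessPoolExecutor

def process_partition(rows):
    """Hypothetical stand-in for a pipeline operation run on one worker node."""
    return sum(rows)

# Data assumed to be already local to each "worker."
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # The coordinating process assembles a complete result from the partial results.
    complete_result = sum(partial_results)
    print(partial_results, complete_result)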
[0087] Any remaining control nodes, such as control nodes 404, 406,
may be assigned as backup control nodes for the project. In an
example, backup control nodes may not control any portion of the
project. Instead, backup control nodes may serve as a backup for
the primary control node and take over as primary control node if
the primary control node were to fail. If a communications grid
were to include only a single control node 402, and the control
node 402 were to fail (e.g., the control node is shut off or
breaks) then the communications grid as a whole may fail and any
project or job being run on the communications grid may fail and
may not complete. While the project may be run again, such a
failure may cause a delay (severe delay in some cases, such as
overnight delay) in completion of the project. Therefore, a grid
with multiple control nodes 402-406, including a backup control
node, may be beneficial.
[0088] In some examples, the primary control node may open a pair
of listening sockets to add another node or machine to the grid. The first
socket may be used to accept work requests from clients, and the
second socket may be used to accept connections from other grid
nodes. The primary control node may be provided with a list of
other nodes (e.g., other machines, computers, servers, etc.) that
can participate in the grid, and the role that each node can fill
in the grid. Upon startup of the primary control node (e.g., the
first node on the grid), the primary control node may use a network
protocol to start the server process on every other node in the
grid. Command line parameters, for example, may inform each node of
one or more pieces of information, such as: the role that the node
will have in the grid, the host name of the primary control node,
the port number on which the primary control node is accepting
connections from peer nodes, among others. The information may also
be provided in a configuration file, transmitted over a secure
shell tunnel, recovered from a configuration server, among others.
While the other machines in the grid may not initially know about
the configuration of the grid, that information may also be sent to
each other node by the primary control node. Updates of the grid
information may also be subsequently sent to those nodes.
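For illustration only, the following Python sketch opens a pair of listening sockets of the kind described above, one for client work requests and one for connections from other grid nodes; the port numbers are hypothetical:

import socket

def open_listener(port: int) -> socket.socket:
    """Open a TCP socket that listens for incoming connections on the given port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("0.0.0.0", port))
    s.listen()
    return s

client_socket = open_listener(5600)  # accepts work requests from clients
peer_socket = open_listener(5601)    # accepts connections from other grid nodes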
[0089] For any control node other than the primary control node
added to the grid, the control node may open three sockets. The
first socket may accept work requests from clients, the second
socket may accept connections from other grid members, and the
third socket may connect (e.g., permanently) to the primary control
node. When a control node (e.g., primary control node) receives a
connection from another control node, it first checks to see if the
peer node is in the list of configured nodes in the grid. If it is
not on the list, the control node may clear the connection. If it
is on the list, it may then attempt to authenticate the connection.
If authentication is successful, the authenticating node may
transmit information to its peer, such as the port number on which
a node is listening for connections, the host name of the node,
information about how to authenticate the node, among other
information. When a node, such as the new control node, receives
information about another active node, it can check to see if it
already has a connection to that other node. If it does not have a
connection to that node, it may then establish a connection to that
control node.
[0090] Any worker node added to the grid may establish a connection
to the primary control node and any other control nodes on the
grid. After establishing the connection, it may authenticate itself
to the grid (e.g., any control nodes, including both primary and
backup, or a server or user controlling the grid). After successful
authentication, the worker node may accept configuration
information from the control node.
[0091] When a node joins a communications grid (e.g., when the node
is powered on or connected to an existing node on the grid or
both), the node is assigned (e.g., by an operating system of the
grid) a universally unique identifier (UUID). This unique
identifier may help other nodes and external entities (devices,
users, etc.) to identify the node and distinguish it from other
nodes. When a node is connected to the grid, the node may share its
unique identifier with the other nodes in the grid. Since each node
may share its unique identifier, each node may know the unique
identifier of every other node on the grid. Unique identifiers may
also designate a hierarchy of each of the nodes (e.g., backup
control nodes) within the grid. For example, the unique identifiers
of each of the backup control nodes may be stored in a list of
backup control nodes to indicate an order in which the backup
control nodes will take over for a failed primary control node to
become a new primary control node. However, a hierarchy of nodes may
also be determined using methods other than the unique identifiers
of the nodes. For example, the hierarchy may be
predetermined, or may be assigned based on other predetermined
factors.
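For illustration only, the following Python sketch assigns each backup control node a universally unique identifier and keeps an ordered list indicating the order in which the backup control nodes would take over; the ordering rule shown (sorting by identifier) is one hypothetical choice among many:

import uuid

# Each node receives a universally unique identifier when it joins the grid.
node_ids = {name: uuid.uuid4() for name in ("backup_a", "backup_b", "backup_c")}

# Hypothetical hierarchy: an ordered list of backup control nodes indicating
# which node would take over first for a failed primary control node.
backup_order = sorted(node_ids, key=lambda name: str(node_ids[name]))
next_primary = backup_order[0]
print(backup_order, next_primary)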
[0092] The grid may add new machines at any time (e.g., initiated
from any control node). Upon adding a new node to the grid, the
control node may first add the new node to its table of grid nodes.
The control node may also then notify every other control node
about the new node. The nodes receiving the notification may
acknowledge that they have updated their configuration
information.
[0093] Primary control node 402 may, for example, transmit one or
more communications to backup control nodes 404, 406 (and, for
example, to other control or worker nodes 412-420 within the
communications grid). Such communications may be sent periodically,
at fixed time intervals, between known fixed stages of the
project's execution, among other protocols. The communications
transmitted by primary control node 402 may be of varied types and
may include a variety of types of information. For example, primary
control node 402 may transmit snapshots (e.g., status information)
of the communications grid so that backup control node 404 always
has a recent snapshot of the communications grid. The snapshot or
grid status may include, for example, the structure of the grid
(including, for example, the worker nodes 410-420 in the
communications grid, unique identifiers of the worker nodes
410-420, or their relationships with the primary control node 402)
and the status of a project (including, for example, the status of
each worker node's portion of the project). The snapshot may also
include analysis or results received from worker nodes 410-420 in
the communications grid. The backup control nodes 404, 406 may
receive and store the backup data received from the primary control
node 402. The backup control nodes 404, 406 may transmit a request
for such a snapshot (or other information) from the primary control
node 402, or the primary control node 402 may send such information
periodically to the backup control nodes 404, 406.
[0094] As noted, the backup data may allow a backup control node
404, 406 to take over as primary control node if the primary
control node 402 fails without requiring the communications grid to
start the project over from scratch. If the primary control node
402 fails, the backup control node 404, 406 that will take over as
primary control node may retrieve the most recent version of the
snapshot received from the primary control node 402 and use the
snapshot to continue the project from the stage of the project
indicated by the backup data. This may prevent failure of the
project as a whole.
[0095] A backup control node 404, 406 may use various methods to
determine that the primary control node 402 has failed. In one
example of such a method, the primary control node 402 may transmit
(e.g., periodically) a communication to the backup control node
404, 406 that indicates that the primary control node 402 is
working and has not failed, such as a heartbeat communication. The
backup control node 404, 406 may determine that the primary control
node 402 has failed if the backup control node has not received a
heartbeat communication for a certain predetermined period of time.
Alternatively, a backup control node 404, 406 may also receive a
communication indicating that the primary control node 402 has
failed, either from the primary control node 402 itself (before it
failed) or from a worker node 410-420, for example because the
primary control node 402 has failed to communicate with the worker
node 410-420.
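For illustration only, the following Python sketch shows a heartbeat-based check of the kind described above; the timeout value is hypothetical:

import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value

class BackupMonitor:
    """Tracks heartbeats from the primary control node and reports a suspected failure."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def primary_has_failed(self) -> bool:
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT

monitor = BackupMonitor()
monitor.record_heartbeat()           # called whenever a heartbeat arrives
print(monitor.primary_has_failed())  # False until the timeout elapses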
[0096] Different methods may be performed to determine which backup
control node of a set of backup control nodes (e.g., backup control
nodes 404, 406) can take over for failed primary control node 402
and become the new primary control node. For example, the new
primary control node may be chosen based on a ranking or
"hierarchy" of backup control nodes based on their unique
identifiers. In an alternative example, a backup control node may
be assigned to be the new primary control node by another device in
the communications grid or from an external device (e.g., a system
infrastructure or an end user, such as a server or computer,
controlling the communications grid). In another alternative
example, the backup control node that takes over as the new primary
control node may be designated based on bandwidth or other
statistics about the communications grid.
[0097] A worker node within the communications grid may also fail.
If a worker node fails, work being performed by the failed worker
node may be redistributed amongst the operational worker nodes. In
an alternative example, the primary control node may transmit a
communication to each of the operable worker nodes still on the
communications grid that each of the worker nodes should
purposefully fail also. After each of the worker nodes fail, they
may each retrieve their most recent saved checkpoint of their
status and re-start the project from that checkpoint to minimize
lost progress on the project being executed. In some examples, a
communications grid computing system 400 can be used to generate
pipelines configured for creating analysis tables for models based
on input data.
[0098] FIG. 5 is a flow chart of an example of a process for
adjusting a communications grid or a work project in a
communications grid after a failure of a node according to some
aspects. The process may include, for example, receiving grid
status information including a project status of a portion of a
project being executed by a node in the communications grid, as
described in operation 502. For example, a control node (e.g., a
backup control node connected to a primary control node and a
worker node on a communications grid) may receive grid status
information, where the grid status information includes a project
status of the primary control node or a project status of the
worker node. The project status of the primary control node and the
project status of the worker node may include a status of one or
more portions of a project being executed by the primary and worker
nodes in the communications grid. The process may also include
storing the grid status information, as described in operation 504.
For example, a control node (e.g., a backup control node) may store
the received grid status information locally within the control
node. Alternatively, the grid status information may be sent to
another device for storage where the control node may have access
to the information.
[0099] The process may also include receiving a failure
communication corresponding to a node in the communications grid in
operation 506. For example, a node may receive a failure
communication including an indication that the primary control node
has failed, prompting a backup control node to take over for the
primary control node. In an alternative embodiment, a node may
receive a failure communication indicating that a worker node has failed, prompting a
control node to reassign the work being performed by the worker
node. The process may also include reassigning a node or a portion
of the project being executed by the failed node, as described in
operation 508. For example, upon receiving the failure communication,
a control node may designate the backup control node as a new
primary control node. If the
failed node is a worker node, a control node may identify a project
status of the failed worker node using the snapshot of the
communications grid, where the project status of the failed worker
node includes a status of a portion of the project being executed
by the failed worker node at the failure time.
[0100] The process may also include receiving updated grid status
information based on the reassignment, as described in operation
510, and transmitting a set of instructions based on the updated
grid status information to one or more nodes in the communications
grid, as described in operation 512. The updated grid status
information may include an updated project status of the primary
control node or an updated project status of the worker node. The
updated information may be transmitted to the other nodes in the
grid to update their stale stored information.
[0101] FIG. 6 is a block diagram of a portion of a communications
grid computing system 600 including a control node and a worker
node according to some aspects. Communications grid computing system
600 includes one control node (control node 602) and one worker
node (worker node 610) for purposes of illustration, but may
include more worker and/or control nodes. The control node 602 is
communicatively connected to worker node 610 via communication path
650. Therefore, control node 602 may transmit information (e.g.,
related to the communications grid or notifications) to, and
receive information from, worker node 610 via communication path
650.
[0102] Similar to FIG. 4, communications grid computing system
(or just "communications grid") 600 includes data processing nodes
(control node 602 and worker node 610). Nodes 602 and 610 comprise
multi-core data processors. Each node 602 and 610 includes a
grid-enabled software component (GESC) 620 that executes on the
data processor associated with that node and interfaces with buffer
memory 622 also associated with that node. Each node 602 and 610
includes database management software (DBMS) 628 that executes on a
database server (not shown) at control node 602 and on a database
server (not shown) at worker node 610.
[0103] Each node also includes a data store 624. Data stores 624,
similar to network-attached data stores 110 in FIG. 1 and data
stores 235 in FIG. 2, are used to store data to be processed by the
nodes in the computing environment. Data stores 624 may also store
any intermediate or final data generated by the computing system
after being processed, for example in non-volatile memory. However,
in certain examples, the configuration of the grid computing
environment allows its operations to be performed such that
intermediate and final data results can be stored solely in
volatile memory (e.g., RAM), without a requirement that
intermediate or final data results be stored to non-volatile types
of memory. Storing such data in volatile memory may be useful in
certain situations, such as when the grid receives queries (e.g.,
ad hoc) from a client and when responses, which are generated by
processing large amounts of data, need to be generated quickly or
on-the-fly. In such a situation, the grid may be configured to
retain the data within memory so that responses can be generated at
different levels of detail and so that a client may interactively
query against this information.
[0104] Each node also includes a user-defined function (UDF) 626.
The UDF provides a mechanism for the DBMS 628 to transfer data to
or receive data from the database stored in the data stores 624
that are managed by the DBMS. For example, UDF 626 can be invoked
by the DBMS to provide data to the GESC for processing. The UDF 626
may establish a socket connection (not shown) with the GESC to
transfer the data. Alternatively, the UDF 626 can transfer data to
the GESC by writing data to shared memory accessible by both the
UDF and the GESC.
[0105] The GESC 620 at the nodes 602 and 610 may be connected via a
network, such as network 108 shown in FIG. 1. Therefore, nodes 602
and 610 can communicate with each other via the network using a
predetermined communication protocol such as, for example, the
Message Passing Interface (MPI). Each GESC 620 can engage in
point-to-point communication with the GESC at another node or in
collective communication with multiple GESCs via the network. The
GESC 620 at each node may contain identical (or nearly identical)
software instructions. Each node may be capable of operating as
either a control node or a worker node. The GESC at the control
node 602 can communicate, over a communication path 652, with a
client device 630. More specifically, control node 602 may
communicate with client application 632 hosted by the client device
630 to receive queries and to respond to those queries after
processing large amounts of data.
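For illustration only, the following Python sketch uses the mpi4py package as a stand-in for the node-to-node protocol, showing point-to-point communication between two processes; the message contents are hypothetical:

from mpi4py import MPI  # requires an MPI installation

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # A "control node" sends a query to a "worker node" point-to-point.
    comm.send({"query": "summarize"}, dest=1, tag=11)
    result = comm.recv(source=1, tag=22)
    print("result from worker:", result)
elif rank == 1:
    query = comm.recv(source=0, tag=11)
    comm.send({"rows_processed": 1000}, dest=0, tag=22)

Such a script would typically be launched across two processes, for example with a command such as mpiexec -n 2 python example.py.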
[0106] DBMS 628 may control the creation, maintenance, and use of a
database or data structure (not shown) within nodes 602 or 610. The
database may organize data stored in data stores 624. The DBMS 628
at control node 602 may accept requests for data and transfer the
appropriate data for the request. With such a process, collections
of data may be distributed across multiple physical locations. In
this example, each node 602 and 610 stores a portion of the total
data managed by the management system in its associated data store
624.
[0107] Furthermore, the DBMS may be responsible for protecting
against data loss using replication techniques. Replication
includes providing a backup copy of data stored on one node on one
or more other nodes. Therefore, if one node fails, the data from
the failed node can be recovered from a replicated copy residing at
another node. However, as described herein with respect to FIG. 4,
data or status information for each node in the communications grid
may also be shared with each node on the grid.
[0108] FIG. 7 is a flow chart of an example of a process for
executing a data analysis or a processing project according to some
aspects. As described with respect to FIG. 6, the GESC at the
control node may exchange data with a client device (e.g., client
device 630) to receive queries for executing a project and to
respond to those queries after large amounts of data have been
processed. The query may be transmitted to the control node, where
the query may include a request for executing a project, as
described in operation 702. The query can contain instructions on
the type of data analysis to be performed in the project and
whether the project should be executed using the grid-based
computing environment, as shown in operation 704.
[0109] To initiate the project, the control node may determine if
the query requests use of the grid-based computing environment to
execute the project. If the determination is no, then the control
node initiates execution of the project in a solo environment
(e.g., at the control node), as described in operation 710. If the
determination is yes, the control node may initiate execution of
the project in the grid-based computing environment, as described
in operation 706. In such a situation, the request may include a
requested configuration of the grid. For example, the request may
include a number of control nodes and a number of worker nodes to
be used in the grid when executing the project. After the project
has been completed, the control node may transmit results of the
analysis yielded by the grid, as described in operation 708.
Whether the project is executed in a solo or grid-based
environment, the control node provides the results of the
project.
[0110] As noted with respect to FIG. 2, the computing environments
described herein may collect data (e.g., as received from network
devices, such as sensors, such as network devices 204-209 in FIG.
2, and client devices or other sources) to be processed as part of
a data analytics project, and data may be received in real time as
part of a streaming analytics environment (e.g., ESP). Data may be
collected using a variety of sources as communicated via different
kinds of networks or locally, such as on a real-time streaming
basis. For example, network devices may receive data periodically
from network device sensors as the sensors continuously sense,
monitor and track changes in their environments. More specifically,
an increasing number of distributed applications develop or produce
continuously flowing data from distributed sources by applying
queries to the data before distributing the data to geographically
distributed recipients. An event stream processing engine (ESPE)
may continuously apply the queries to the data as it is received
and determine which entities should receive the data. Client or
other devices may also subscribe to the ESPE or other devices
processing ESP data so that they can receive data after processing,
based on for example the entities determined by the processing
engine. For example, client devices 230 in FIG. 2 may subscribe to
the ESPE in computing environment 214. In another example, event
subscription devices 1024a-c, described further with respect to
FIG. 10, may also subscribe to the ESPE. The ESPE may determine or
define how input data or event streams from network devices or
other publishers (e.g., network devices 204-209 in FIG. 2) are
transformed into meaningful output data to be consumed by
subscribers, such as for example client devices 230 in FIG. 2.
[0111] FIG. 8 is a block diagram including components of an Event
Stream Processing Engine (ESPE) according to some aspects. ESPE 800
may include one or more projects 802. A project may be described as
a second-level container in an engine model managed by ESPE 800
where a thread pool size for the project may be defined by a user.
Each project of the one or more projects 802 may include one or
more continuous queries 804 that contain data flows, which are data
transformations of incoming event streams. The one or more
continuous queries 804 may include one or more source windows 806
and one or more derived windows 808.
[0112] The ESPE may receive streaming data over a period of time
related to certain events, such as events or other data sensed by
one or more network devices. The ESPE may perform operations
associated with processing data created by the one or more devices.
For example, the ESPE may receive data from the one or more network
devices 204-209 shown in FIG. 2. As noted, the network devices may
include sensors that sense different aspects of their environments,
and may collect data over time based on those sensed observations.
For example, the ESPE may be implemented within one or more of
machines 220 and 240 shown in FIG. 2. The ESPE may be implemented
within such a machine by an ESP application. An ESP application may
embed an ESPE with its own dedicated thread pool or pools into its
application space where the main application thread can do
application-specific work and the ESPE processes event streams at
least by creating an instance of a model into processing
objects.
[0113] The engine container is the top-level container in a model
that manages the resources of the one or more projects 802. In an
illustrative example, there may be only one ESPE 800 for each
instance of the ESP application, and ESPE 800 may have a unique
engine name. Additionally, the one or more projects 802 may each
have unique project names, and each query may have a unique
continuous query name and begin with a uniquely named source window
of the one or more source windows 806. ESPE 800 may or may not be
persistent.
[0114] Continuous query modelling involves defining directed graphs
of windows for event stream manipulation and transformation. A
window in the context of event stream manipulation and
transformation is a processing node in an event stream processing
model. A window in a continuous query can perform aggregations,
computations, pattern-matching, and other operations on data
flowing through the window. A continuous query may be described as
a directed graph of source, relational, pattern matching, and
procedural windows. The one or more source windows 806 and the one
or more derived windows 808 represent continuously executing
queries that generate updates to a query result set as new event
blocks stream through ESPE 800. A directed graph, for example, is a
set of nodes connected by edges, where the edges have a direction
associated with them.
[0115] An event object may be described as a packet of data
accessible as a collection of fields, with at least one of the
fields defined as a key or unique identifier (ID). The event object
may be created using a variety of formats including binary,
alphanumeric, XML, etc. Each event object may include one or more
fields designated as a primary identifier (ID) for the event so
ESPE 800 can support operation codes (opcodes) for events including
insert, update, upsert, and delete. Upsert opcodes update the event
if the key field already exists; otherwise, the event is inserted.
For illustration, an event object may be a packed binary
representation of a set of field values and include both metadata
and field data associated with an event. The metadata may include
an opcode indicating if the event represents an insert, update,
delete, or upsert, a set of flags indicating if the event is a
normal, partial-update, or a retention generated event from
retention policy management, and a set of microsecond timestamps
that can be used for latency measurements.
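For illustration only, a simplified Python sketch of an event object with a key field, an opcode, and field data, together with upsert-style handling, is shown below; the field names and values are hypothetical:

from dataclasses import dataclass, field

@dataclass
class Event:
    """A simplified event object: a key, an opcode, and associated field data."""
    key: str
    opcode: str            # "insert", "update", "upsert", or "delete"
    fields: dict = field(default_factory=dict)

def apply_event(store: dict, event: Event) -> None:
    """Apply an event to a keyed store using the event's opcode."""
    if event.opcode == "delete":
        store.pop(event.key, None)
    elif event.opcode == "upsert":
        # Update the record if the key field already exists; otherwise insert it.
        store.setdefault(event.key, {}).update(event.fields)
    elif event.opcode in ("insert", "update"):
        store[event.key] = dict(event.fields)

state: dict = {}
apply_event(state, Event("sensor-7", "upsert", {"temp": 21.5}))
print(state)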
[0116] An event block object may be described as a grouping or
package of event objects. An event stream may be described as a
flow of event block objects. A continuous query of the one or more
continuous queries 804 transforms a source event stream made up of
streaming event block objects published into ESPE 800 into one or
more output event streams using the one or more source windows 806
and the one or more derived windows 808. A continuous query can
also be thought of as data flow modelling.
[0117] The one or more source windows 806 are at the top of the
directed graph and have no windows feeding into them. Event streams
are published into the one or more source windows 806, and from
there, the event streams may be directed to the next set of
connected windows as defined by the directed graph. The one or more
derived windows 808 are all instantiated windows that are not
source windows and that have other windows streaming events into
them. The one or more derived windows 808 may perform computations
or transformations on the incoming event streams. The one or more
derived windows 808 transform event streams based on the window
type (that is, operators such as join, filter, compute, aggregate,
copy, pattern match, procedural, union, etc.) and window settings.
As event streams are published into ESPE 800, they are continuously
queried, and the resulting sets of derived windows in these queries
are continuously updated.
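For illustration only, the following Python sketch mimics a tiny directed graph in which events from a source stream flow through a filter window and then an aggregate window; the event fields and threshold are hypothetical:

# Hypothetical source events flowing into the graph.
events = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "a", "temp": 31.5},
    {"sensor": "b", "temp": 28.0},
]

def filter_window(stream):
    """Derived window: pass only events above a threshold."""
    return [e for e in stream if e["temp"] > 25.0]

def aggregate_window(stream):
    """Derived window: count qualifying events per sensor."""
    counts = {}
    for e in stream:
        counts[e["sensor"]] = counts.get(e["sensor"], 0) + 1
    return counts

print(aggregate_window(filter_window(events)))  # {'a': 1, 'b': 1}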
[0118] FIG. 9 is a flow chart of an example of a process including
operations performed by an event stream processing engine according
to some aspects. As noted, the ESPE 800 (or an associated ESP
application) defines how input event streams are transformed into
meaningful output event streams. More specifically, the ESP
application may define how input event streams from publishers
(e.g., network devices providing sensed data) are transformed into
meaningful output event streams consumed by subscribers (e.g., a
data analytics project being executed by a machine or set of
machines).
[0119] Within the application, a user may interact with one or more
user interface windows presented to the user in a display under
control of the ESPE independently or through a browser application
in an order selectable by the user. For example, a user may execute
an ESP application, which causes presentation of a first user
interface window, which may include a plurality of menus and
selectors such as drop down menus, buttons, text boxes, hyperlinks,
etc. associated with the ESP application as understood by a person
of skill in the art. Various operations may be performed in
parallel, for example, using a plurality of threads.
[0120] At operation 900, an ESP application may define and start an
ESPE, thereby instantiating an ESPE at a device, such as machine
220 and/or 240. In an operation 902, the engine container is
created. For illustration, ESPE 800 may be instantiated using a
function call that specifies the engine container as a manager for
the model.
[0121] In an operation 904, the one or more continuous queries 804
are instantiated by ESPE 800 as a model. The one or more continuous
queries 804 may be instantiated with a dedicated thread pool or
pools that generate updates as new events stream through ESPE 800.
For illustration, the one or more continuous queries 804 may be
created to model business processing logic within ESPE 800, to
predict events within ESPE 800, to model a physical system within
ESPE 800, to predict the physical system state within ESPE 800,
etc. For example, as noted, ESPE 800 may be used to support sensor
data monitoring and management (e.g., sensing may include force,
torque, load, strain, position, temperature, air pressure, fluid
flow, chemical properties, resistance, electromagnetic fields,
radiation, irradiance, proximity, acoustics, moisture, distance,
speed, vibrations, acceleration, electrical potential, or
electrical current, etc.).
[0122] ESPE 800 may analyze and process events in motion or "event
streams." Instead of storing data and running queries against the
stored data, ESPE 800 may store queries and stream data through
them to allow continuous analysis of data as it is received. The
one or more source windows 806 and the one or more derived windows
808 may be created based on the relational, pattern matching, and
procedural algorithms that transform the input event streams into
the output event streams to model, simulate, score, test, predict,
etc. based on the continuous query model defined and applied to
the streamed data.
[0123] In an operation 906, a publish/subscribe (pub/sub)
capability is initialized for ESPE 800. In an illustrative
embodiment, a pub/sub capability is initialized for each project of
the one or more projects 802. To initialize and enable pub/sub
capability for ESPE 800, a port number may be provided. Pub/sub
clients can use a host name of an ESP device running the ESPE and
the port number to establish pub/sub connections to ESPE 800.
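For illustration only, the following Python sketch composes a host name and port number into a client connection; the host, port, and use of a raw TCP socket are hypothetical simplifications of a pub/sub connection:

import socket

ESP_HOST = "esp-device.example.com"  # hypothetical host name of the ESP device
ESP_PORT = 55555                     # hypothetical pub/sub port number

def connect_pubsub(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Establish a client connection to the engine's pub/sub port."""
    return socket.create_connection((host, port), timeout=timeout)

# conn = connect_pubsub(ESP_HOST, ESP_PORT)  # would connect if the host were reachable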
[0124] FIG. 10 is a block diagram of an ESP system 1000 interfacing
between publishing device 1022 and event subscription devices
1024a-c according to some aspects. ESP system 1000 may include ESP
subsystem 1001, publishing device 1022, an event subscription
device A 1024a, an event subscription device B 1024b, and an event
subscription device C 1024c. Input event streams are output to ESP
subsystem 1001 by publishing device 1022. In alternative
embodiments, the input event streams may be created by a plurality
of publishing devices. The plurality of publishing devices further
may publish event streams to other ESP devices. The one or more
continuous queries instantiated by ESPE 800 may analyze and process
the input event streams to form output event streams output to
event subscription device A 1024a, event subscription device B
1024b, and event subscription device C 1024c. ESP system 1000 may
include a greater or a fewer number of event subscription
devices.
[0125] Publish-subscribe is a message-oriented interaction paradigm
based on indirect addressing. Processed data recipients specify
their interest in receiving information from ESPE 800 by
subscribing to specific classes of events, while information
sources publish events to ESPE 800 without directly addressing the
receiving parties. ESPE 800 coordinates the interactions and
processes the data. In some cases, the data source receives
confirmation that the published information has been received by a
data recipient.
[0126] A publish/subscribe API may be described as a library that
enables an event publisher, such as publishing device 1022, to
publish event streams into ESPE 800 or an event subscriber, such as
event subscription device A 1024a, event subscription device B
1024b, and event subscription device C 1024c, to subscribe to event
streams from ESPE 800. For illustration, one or more
publish/subscribe APIs may be defined. Using the publish/subscribe
API, an event publishing application may publish event streams into
a running event stream processor project source window of ESPE 800,
and the event subscription application may subscribe to an event
stream processor project source window of ESPE 800.
[0127] The publish/subscribe API provides cross-platform
connectivity and endianness compatibility between an ESP application
and other networked applications, such as event publishing
applications instantiated at publishing device 1022, and event
subscription applications instantiated at one or more of event
subscription device A 1024a, event subscription device B 1024b, and
event subscription device C 1024c.
[0128] Referring back to FIG. 9, operation 906 initializes the
publish/subscribe capability of ESPE 800. In an operation 908, the
one or more projects 802 are started. The one or more started
projects may run in the background on an ESP device. In an
operation 910, an event block object is received from one or more
computing devices of the publishing device 1022.
[0129] ESP subsystem 1001 may include a publishing client 1002,
ESPE 800, a subscribing client A 1004, a subscribing client B 1006,
and a subscribing client C 1008. Publishing client 1002 may be
started by an event publishing application executing at publishing
device 1022 using the publish/subscribe API. Subscribing client A
1004 may be started by an event subscription application A,
executing at event subscription device A 1024a using the
publish/subscribe API. Subscribing client B 1006 may be started by
an event subscription application B executing at event subscription
device B 1024b using the publish/subscribe API. Subscribing client
C 1008 may be started by an event subscription application C
executing at event subscription device C 1024c using the
publish/subscribe API.
[0130] An event block object containing one or more event objects
is injected into a source window of the one or more source windows
806 from an instance of an event publishing application on
publishing device 1022. The event block object may be generated,
for example, by the event publishing application and may be
received by publishing client 1002. A unique ID may be maintained
as the event block object is passed between the one or more source
windows 806 and/or the one or more derived windows 808 of ESPE 800,
and to subscribing client A 1004, subscribing client B 1006, and
subscribing client C 1008 and to event subscription device A 1024a,
event subscription device B 1024b, and event subscription device C
1024c. Publishing client 1002 may further generate and include a
unique embedded transaction ID in the event block object as the
event block object is processed by a continuous query, as well as
the unique ID that publishing device 1022 assigned to the event
block object.
[0131] In an operation 912, the event block object is processed
through the one or more continuous queries 804. In an operation
914, the processed event block object is output to one or more
computing devices of the event subscription devices 1024a-c. For
example, subscribing client A 1004, subscribing client B 1006, and
subscribing client C 1008 may send the received event block object
to event subscription device A 1024a, event subscription device B
1024b, and event subscription device C 1024c, respectively.
[0132] ESPE 800 maintains the event block containership aspect of
the received event blocks as each event block is published into a
source window and works its way through the directed graph defined
by the one or more continuous queries 804, with the various event
translations, before being output to subscribers. Subscribers
can correlate a group of subscribed events back to a group of
published events by comparing the unique ID of the event block
object that a publisher, such as publishing device 1022, attached
to the event block object with the event block ID received by the
subscriber.
[0133] In an operation 916, a determination is made concerning
whether or not processing is stopped. If processing is not stopped,
processing continues in operation 910 to continue receiving the one
or more event streams containing event block objects from the, for
example, one or more network devices. If processing is stopped,
processing continues in an operation 918. In operation 918, the
started projects are stopped. In operation 920, the ESPE is
shut down.
[0134] As noted, in some examples, big data is processed for an
analytics project after the data is received and stored. In other
examples, distributed applications process continuously flowing
data in real-time from distributed sources by applying queries to
the data before distributing the data to geographically distributed
recipients. As noted, an event stream processing engine (ESPE) may
continuously apply the queries to the data as it is received and
determine which entities receive the processed data. This allows
large amounts of data received and/or collected in a variety of
environments to be processed and distributed in real
time. For example, as shown with respect to FIG. 2, data may be
collected from network devices that may include devices within the
internet of things, such as devices within a home automation
network. However, such data may be collected from a variety of
different resources in a variety of different environments. In any
such situation, embodiments of the present technology allow for
real-time processing of such data.
[0135] Aspects of the present disclosure provide technical
solutions to technical problems, such as computing problems that
arise when an ESP device fails which results in a complete service
interruption and potentially significant data loss. The data loss
can be catastrophic when the streamed data is supporting mission
critical operations, such as those in support of an ongoing
manufacturing or drilling operation. An example of an ESP system
achieves a rapid and seamless failover of ESPE running at the
plurality of ESP devices without service interruption or data loss,
thus significantly improving the reliability of an operational
system that relies on the live or real-time processing of the data
streams. The event publishing systems, the event subscribing
systems, and each ESPE not executing at a failed ESP device are not
aware of or affected by the failed ESP device. The ESP system may
include thousands of event publishing systems and event subscribing
systems. The ESP system keeps the failover logic and awareness
within the boundaries of out-messaging network connector and
out-messaging network device.
[0136] In one example embodiment, a system is provided to support a
failover when processing event stream processing (ESP) event blocks. The
system includes, but is not limited to, an out-messaging network
device and a computing device. The computing device includes, but
is not limited to, one or more processors and one or more
computer-readable mediums operably coupled to the one or more
processor. The processor is configured to execute an ESP engine
(ESPE). The computer-readable medium has instructions stored
thereon that, when executed by the processor, cause the computing
device to support the failover. An event block object is received
from the ESPE that includes a unique identifier. A first status of
the computing device as active or standby is determined. When the
first status is active, a second status of the computing device as
newly active or not newly active is determined. Newly active is
determined when the computing device is switched from a standby
status to an active status. When the second status is newly active,
a last published event block object identifier that uniquely
identifies a last published event block object is determined. A
next event block object is selected from a non-transitory
computer-readable medium accessible by the computing device. The
next event block object has an event block object identifier that
is greater than the determined last published event block object
identifier. The selected next event block object is published to an
out-messaging network device. When the second status of the
computing device is not newly active, the received event block
object is published to the out-messaging network device. When the
first status of the computing device is standby, the received event
block object is stored in the non-transitory computer-readable
medium.
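For illustration only, the following Python sketch approximates the failover decision logic just described; the names (handle_event_block, block_id, publish, stored_blocks) are assumptions made for the example and are not part of any actual ESP interface.

```python
# Illustrative sketch of the failover decision logic described above; the
# names used here are assumptions, not part of any ESP API.
def handle_event_block(block, status, newly_active, last_published_id,
                       stored_blocks, publish):
    """Route one event block based on the device's active/standby status."""
    if status == "standby":
        # A standby device stores the received block instead of publishing it.
        stored_blocks.append(block)
    elif newly_active:
        # A newly active device resumes just after the last published block.
        for pending in sorted(stored_blocks, key=lambda b: b.block_id):
            if pending.block_id > last_published_id:
                publish(pending)
    else:
        # An already-active device simply publishes the received block.
        publish(block)
```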
[0137] FIG. 11 is a flow chart of an example of a process for
generating and using a machine-learning model according to some
aspects. Machine learning is a branch of artificial intelligence
that relates to mathematical models that can learn from,
categorize, and make predictions about data. Such mathematical
models, which can be referred to as machine-learning models, can
classify input data among two or more classes; cluster input data
among two or more groups; predict a result based on input data;
identify patterns or trends in input data; identify a distribution
of input data in a space; or any combination of these. Examples of
machine-learning models can include (i) neural networks; (ii)
decision trees, such as classification trees and regression trees;
(iii) classifiers, such as naive Bayes classifiers, logistic
regression classifiers, ridge regression classifiers, random forest
classifiers, least absolute shrinkage and selection operator (LASSO)
classifiers, and support vector machines; (iv) clusterers, such as
k-means clusterers, mean-shift clusterers, and spectral clusterers;
(v) factorizers, such as factorization machines, principal
component analyzers and kernel principal component analyzers; and
(vi) ensembles or other combinations of machine-learning models. In
some examples, neural networks can include deep neural networks,
feed-forward neural networks, recurrent neural networks,
convolutional neural networks, radial basis function (RBF) neural
networks, echo state neural networks, long short-term memory neural
networks, bi-directional recurrent neural networks, gated neural
networks, hierarchical recurrent neural networks, stochastic neural
networks, modular neural networks, spiking neural networks, dynamic
neural networks, cascading neural networks, neuro-fuzzy neural
networks, or any combination of these.
[0138] Different machine-learning models may be used
interchangeably to perform a task. Examples of tasks that can be
performed at least partially using machine-learning models include
various types of scoring; bioinformatics; cheminformatics; software
engineering; fraud detection; customer segmentation; generating
online recommendations; adaptive websites; determining customer
lifetime value; search engines; placing advertisements in real time
or near real time; classifying DNA sequences; affective computing;
performing natural language processing and understanding; object
recognition and computer vision; robotic locomotion; playing games;
optimization and metaheuristics; detecting network intrusions;
medical diagnosis and monitoring; or predicting when an asset, such
as a machine, will need maintenance.
[0139] Any number and combination of tools can be used to create
machine-learning models. Examples of tools for creating and
managing machine-learning models can include SAS.RTM. Enterprise
Miner, SAS.RTM. Rapid Predictive Modeler, SAS.RTM. Model Manager,
SAS Cloud Analytic Services (CAS).RTM., and SAS Viya.RTM., all of
which are by SAS Institute Inc. of Cary, N.C.
[0140] Machine-learning models can be constructed through an at
least partially automated (e.g., with little or no human
involvement) process called training. During training, input data
can be iteratively supplied to a machine-learning model to enable
the machine-learning model to identify patterns related to the
input data or to identify relationships between the input data and
output data. With training, the machine-learning model can be
transformed from an untrained state to a trained state. Input data
can be split into one or more training sets and one or more
validation sets, and the training process may be repeated multiple
times. The splitting may follow a k-fold cross-validation rule, a
leave-one-out-rule, a leave-p-out rule, or a holdout rule. An
overview of training and using a machine-learning model is
described below with respect to the flow chart of FIG. 11.
[0141] In block 1104, training data is received. In some examples,
the training data is received from a remote database or a local
database, constructed from various subsets of data, or input by a
user. The training data can be used in its raw form for training a
machine-learning model or pre-processed into another form, which
can then be used for training the machine-learning model. For
example, the raw form of the training data can be smoothed,
truncated, aggregated, clustered, or otherwise manipulated into
another form, which can then be used for training the
machine-learning model.
[0142] In block 1106, a machine-learning model is trained using the
training data. The machine-learning model can be trained in a
supervised, unsupervised, or semi-supervised manner. In supervised
training, each input in the training data is correlated to a
desired output. This desired output may be a scalar, a vector, or a
different type of data structure such as text or an image. This may
enable the machine-learning model to learn a mapping between the
inputs and desired outputs. In unsupervised training, the training
data includes inputs, but not desired outputs, so that the
machine-learning model has to find structure in the inputs on its
own. In semi-supervised training, only some of the inputs in the
training data are correlated to desired outputs.
[0143] In block 1108, the machine-learning model is evaluated. An
evaluation dataset can be obtained, for example, via user input or
from a database. The evaluation dataset can include inputs
correlated to desired outputs. The inputs can be provided to the
machine-learning model and the outputs from the machine-learning
model can be compared to the desired outputs. If the outputs from
the machine-learning model closely correspond with the desired
outputs, the machine-learning model may have a high degree of
accuracy. For example, if 90% or more of the outputs from the
machine-learning model are the same as the desired outputs in the
evaluation dataset, the machine-learning model may have a high
degree of accuracy. Otherwise, the machine-learning model may have
a low degree of accuracy. The 90% number is an example only. A
realistic and desirable accuracy percentage is dependent on the
problem and the data.
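For illustration only, the evaluation in block 1108 might be sketched as follows, assuming a scikit-learn-style model with a predict() method; the 90% threshold is the same illustrative value used above.

```python
# Minimal sketch of the evaluation step in block 1108, assuming a
# scikit-learn-style model; the 90% threshold mirrors the example above.
from sklearn.metrics import accuracy_score

def evaluate_model(model, eval_inputs, desired_outputs, threshold=0.90):
    outputs = model.predict(eval_inputs)
    accuracy = accuracy_score(desired_outputs, outputs)
    # Return the accuracy and whether it meets the illustrative threshold.
    return accuracy, accuracy >= threshold
```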
[0144] In some examples, if the machine-learning model has an
inadequate degree of accuracy for a particular task, the process
can return to block 1106, where the machine-learning model can be
further trained using additional training data or otherwise
modified to improve accuracy. If the machine-learning model has an
adequate degree of accuracy for the particular task, the process
can continue to block 1110.
[0145] In block 1110, new data is received. In some examples, the
new data is received from a remote database or a local database,
constructed from various subsets of data, or input by a user. The
new data may be unknown to the machine-learning model. For example,
the machine-learning model may not have previously processed or
analyzed the new data.
[0146] In block 1112, the trained machine-learning model is used to
analyze the new data and provide a result. For example, the new
data can be provided as input to the trained machine-learning
model. The trained machine-learning model can analyze the new data
and provide a result that includes a classification of the new data
into a particular class, a clustering of the new data into a
particular group, a prediction based on the new data, or any
combination of these.
[0147] In block 1114, the result is post-processed. For example,
the result can be added to, multiplied with, or otherwise combined
with other data as part of a job. As another example, the result
can be transformed from a first format, such as a time series
format, into another format, such as a count series format. Any
number and combination of operations can be performed on the result
during post-processing.
[0148] A more specific example of a machine-learning model is the
neural network 1200 shown in FIG. 12. The neural network 1200 is
represented as multiple layers of interconnected neurons, such as
neuron 1208, that can exchange data between one another. The layers
include an input layer 1202 for receiving input data, a hidden
layer 1204, and an output layer 1206 for providing a result. The
hidden layer 1204 is referred to as hidden because it may not be
directly observable or have its input directly accessible during
the normal functioning of the neural network 1200. Although the
neural network 1200 is shown as having a specific number of layers
and neurons for exemplary purposes, the neural network 1200 can
have any number and combination of layers, and each layer can have
any number and combination of neurons.
[0149] The neurons and connections between the neurons can have
numeric weights, which can be tuned during training. For example,
training data can be provided to the input layer 1202 of the neural
network 1200, and the neural network 1200 can use the training data
to tune one or more numeric weights of the neural network 1200. In
some examples, the neural network 1200 can be trained using
backpropagation. Backpropagation can include determining a gradient
of a particular numeric weight based on a difference between an
actual output of the neural network 1200 and a desired output of
the neural network 1200. Based on the gradient, one or more numeric
weights of the neural network 1200 can be updated to reduce the
difference, thereby increasing the accuracy of the neural network
1200. This process can be repeated multiple times to train the
neural network 1200. For example, this process can be repeated
hundreds or thousands of times to train the neural network
1200.
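For illustration only, a single backpropagation-style weight update might be sketched as follows; the one-weight model and squared-error loss are simplifying assumptions made for the example, not the training procedure of the neural network 1200.

```python
def gradient_step(weight, x, desired_output, learning_rate=0.01):
    """One illustrative gradient-based update for a single weight."""
    actual_output = weight * x                             # simplified forward pass
    gradient = 2.0 * (actual_output - desired_output) * x  # d(error^2)/d(weight)
    return weight - learning_rate * gradient               # reduce the difference
```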
[0150] In some examples, the neural network 1200 is a feed-forward
neural network. In a feed-forward neural network, every neuron only
propagates an output value to a subsequent layer of the neural
network 1200. For example, data may only move in one direction
(forward) from one neuron to the next neuron in a feed-forward
neural network.
[0151] In other examples, the neural network 1200 is a recurrent
neural network. A recurrent neural network can include one or more
feedback loops, allowing data to propagate in both forward and
backward directions through the neural network 1200. This can allow for
information to persist within the neural network. For example, a
recurrent neural network can determine an output based at least
partially on information that the recurrent neural network has seen
before, giving the recurrent neural network the ability to use
previous input to inform the output.
[0152] In some examples, the neural network 1200 operates by
receiving a vector of numbers from one layer; transforming the
vector of numbers into a new vector of numbers using a matrix of
numeric weights, a nonlinearity, or both; and providing the new
vector of numbers to a subsequent layer of the neural network 1200.
Each subsequent layer of the neural network 1200 can repeat this
process until the neural network 1200 outputs a final result at the
output layer 1206. For example, the neural network 1200 can receive
a vector of numbers as an input at the input layer 1202. The neural
network 1200 can multiply the vector of numbers by a matrix of
numeric weights to determine a weighted vector. The matrix of
numeric weights can be tuned during the training of the neural
network 1200. The neural network 1200 can transform the weighted
vector using a nonlinearity, such as a sigmoid function or the
hyperbolic tangent. In some examples, the nonlinearity can include
a rectified linear unit, which can be expressed using the following
equation:
y=max(x, 0)
where y is the output and x is an input value from the weighted
vector. The transformed output can be supplied to a subsequent
layer, such as the hidden layer 1204, of the neural network 1200.
The subsequent layer of the neural network 1200 can receive the
transformed output, multiply the transformed output by a matrix of
numeric weights and a nonlinearity, and provide the result to yet
another layer of the neural network 1200. This process continues
until the neural network 1200 outputs a final result at the output
layer 1206.
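For illustration only, the forward pass just described, including the rectified linear unit y=max(x, 0), might be sketched in Python as follows; the function names and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def relu(x):
    # The rectified linear unit described above: y = max(x, 0).
    return np.maximum(x, 0.0)

def forward(weight_matrices, input_vector):
    """Pass a vector of numbers through each layer's weights and nonlinearity."""
    vector = np.asarray(input_vector, dtype=float)
    for weights in weight_matrices:        # one weight matrix per layer
        vector = relu(weights @ vector)    # weighted vector, then nonlinearity
    return vector                          # final result at the output layer
```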
[0153] Other examples of the present disclosure may include any
number and combination of machine-learning models having any number
and combination of characteristics. The machine-learning model(s)
can be trained in a supervised, semi-supervised, or unsupervised
manner, or any combination of these. The machine-learning model(s)
can be implemented using a single computing device or multiple
computing devices, such as the communications grid computing system
400 discussed above.
[0154] Implementing some examples of the present disclosure at
least in part by using machine-learning models can reduce the total
number of processing iterations, time, memory, electrical power, or
any combination of these consumed by a computing device when
analyzing data. For example, a neural network may more readily
identify patterns in data than other approaches. This may enable
the neural network to analyze the data using fewer processing
cycles and less memory than other approaches, while obtaining a
similar or greater level of accuracy.
[0155] Some machine-learning approaches may be more efficiently and
quickly executed and processed with machine-learning specific
processors (e.g., not a generic CPU). Such processors may also
provide an energy savings when compared to generic CPUs. For
example, some of these processors can include a graphical
processing unit (GPU), an application-specific integrated circuit
(ASIC), a field-programmable gate array (FPGA), an artificial
intelligence (AI) accelerator, a neural computing core, a neural
computing engine, a neural processing unit, a purpose-built chip
architecture for deep learning, and/or some other machine-learning
specific processor that implements a machine learning approach or
one or more neural networks using semiconductor (e.g., silicon
(Si), gallium arsenide (GaAs)) devices. Furthermore, these
processors may also be employed in heterogeneous computing
architectures with a number of and a variety of different types of
cores, engines, nodes, and/or layers to achieve various energy
efficiencies, thermal processing mitigation, processing speed
improvements, data communication speed improvements, and/or data
efficiency targets and improvements throughout various parts of the
system when compared to a homogeneous computing architecture that
employs CPUs for general purpose computing.
[0156] FIG. 13 depicts a data table 1300 including an example of
raw data associated with patient vaccinations according to some
aspects of the present disclosure. Raw data may be data that has
not yet been processed by a pipeline of the present disclosure. The
raw data may be obtained from a relational database or another
source.
[0157] In this example, the raw data shows vaccinations received by
patients. But other examples may involve more, fewer, or different
variables. Each row in the data table 1300 is a unique observation
indicating a particular vaccine received by a patient. Although the
exemplary data table 1300 shows a relatively small number of rows
and variables for simplicity, it will be appreciated that an actual
set of raw data may include millions of rows of data and hundreds
or thousands of variables.
[0158] As described above, raw data may be stored in formats that
are incompatible for use with a model, such as a predictive model.
In this example, the raw data is in a format that may be
incompatible with a model because the raw data is not organized
such that all of the vaccination information associated with a
single patient is in the same row of the data table 1300. Rather
than having such one-to-one relationships, the data table 1300
includes several one-to-many relationships. A one-to-many
relationship can occur when a single subject (e.g., a single
patient) is referenced in multiple observations in a data table.
Some models are unable to properly handle raw data that has
one-to-many relationships. It may therefore be desirable to
transform the raw data into a format that is more suitable for
modelling, such as the format of the model ready data shown in FIG.
15 and described in greater detail later on.
[0159] FIG. 14 depicts a data table 1400 including an example of
raw data associated with vaccination codes according to some
aspects of the present disclosure. In this example, the data table
1400 indicates the individual vaccines associated with each
vaccination code. For example, the vaccination code "MMR"
corresponds to three vaccines--one for mumps, one for measles, and
one for rubella. But other examples may involve more, fewer, or
different variables. Each row in the data table 1400 is a unique
observation corresponding to a unique vaccination code. Although
the exemplary data table 1400 shows a relatively small number of
rows and variables for simplicity, an actual set of model ready
data may include millions of rows of data and hundreds or thousands
of variables.
[0160] In this example, the raw data includes one-to-many
relationships between the vaccination codes and their corresponding
vaccinations, because the same vaccination can be found in multiple
different columns of the data table 1400. For instance, the "Mumps"
and "Measles" vaccinations can each be found in both columns two
and three. The "tetanus" vaccination can likewise be found in both
columns two and three. The "Pertussis" vaccination can be found in
columns two and four.
[0161] In some examples, it may be desirable to convert the data
tables 1300, 1400 of FIGS. 13-14 into a corresponding analysis
table having "model ready data," which can be data that is properly
formatted to be compatible with a target model. One example of such
an analysis table 1500 is shown in FIG. 15. The analysis table 1500
includes one row for each subject (e.g., patient) and one column
for each disease that the patient might be vaccinated against.
Thus, the analysis table 1500 includes one-to-one relationships and
lacks the one-to-many relationships, rendering the analysis table
1500 more compatible with certain models than the raw data.
[0162] In the analysis table 1500, each cell either has a value of
zero or has a value of one. A value of one in a cell indicates that
the corresponding patient was vaccinated for the corresponding
disease. A value of zero in a cell indicates that the corresponding
patient was not vaccinated for the corresponding disease. Although
a particular patient may have been vaccinated multiple times for
the same disease (e.g., patient three was vaccinated twice for
pertussis), in this example the cell values are binary values
indicating if a patient was or was not vaccinated for a particular
disease.
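For illustration only, the conversion from raw data like that of FIG. 13 into an analysis table like that of FIG. 15 might be sketched with pandas as follows; the column names are hypothetical and the actual tables may differ.

```python
import pandas as pd

# Hypothetical column names; the actual tables of FIGS. 13-15 may differ.
raw = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "disease": ["Mumps", "Measles", "Tetanus", "Pertussis", "Pertussis"],
})

# One row per patient and one binary column per disease, as in analysis table 1500.
analysis = (raw.assign(vaccinated=1)
               .pivot_table(index="patient_id", columns="disease",
                            values="vaccinated", aggfunc="max", fill_value=0)
               .reset_index())
print(analysis)
```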
[0163] Although the examples shown in FIGS. 13-15 are overly
simplistic for exemplary purposes, a real scenario may be far more
complex. Such complex scenarios may require a data scientist to
spend days or weeks manually identifying relevant variables and
performing this conversion process to generate an analysis table
(e.g., a table with model ready data) for use in a modelling
process. But in some examples, the generation of such analysis
tables can be automated, for example by using the process shown in
FIG. 16.
[0164] FIG. 16 depicts an example of a process 1600 for generating
a pipeline according to some aspects of the present disclosure.
While FIG. 16 depicts a certain sequence of operations for
illustrative purposes, other examples can include more operations,
fewer operations, different operations, or a different order of the
operations shown in FIG. 16. The process 1600 may be implemented by
one or more processing devices, which may collectively be referred
to herein as "a processing device."
[0165] The process 1600 begins with a processing device obtaining
input data 1602. In some examples, the processing device can obtain
the input data 1602 by downloading or otherwise receiving the input
data 1602 from a source, such as a database. In other examples, the
processing device can obtain the input data 1602 by generating some
or all of the input data 1602.
[0166] The input data 1602 can include one or more data tables with
rows and columns. Each row or column, depending on the
configuration, may correspond to an individual observation. The
data tables may be key-linked (e.g., linked together by one or more
keys). As one specific example, the input data 1602 can include a
transaction table with transaction data. Each row in the
transaction table can correspond to a transaction and include one
or more variable values describing attributes of the transaction.
Examples of transactional data may include product demand data,
vaccination data, phone call data, website visitation data,
software download data, etc. The input data 1602 may also include a
subject table with subject data. Each row in the subject table can
correspond to a subject associated with a transaction described in
the transaction table and include one or more variable values
describing attributes of the subject. The transaction table and the
subject table may be linked together by a key, such as by
subject.
[0167] The processing device may also receive user inputs. For
example, the processing device can receive a user selection of a
target variable to model in a modelling process. The target
variable may be one of the variables included in the data tables of
the input data 1602. Additionally or alternatively, the processing
device can receive a user selection of an objective associated with
the modelling process. In some examples, the user may provide such
user inputs to the processing device via a graphical user interface
(GUI) of pipeline-creation software.
[0168] In some examples, the input data 1602 can include
one-to-many relationships or may otherwise be sub-optimal for use
with a target model 1618. It may therefore be desirable to execute
some or all of the remaining operations of FIG. 16 to generate a
pipeline 1610 configured to transform the input data 1602 into an
analysis table 1616 that is better suited to the target model
1618.
[0169] More specifically, in operation 1604, the processing device
analyzes one or more characteristics of input data 1602 to generate
an output 1606. In some examples, the output 1606 can include
metrics associated with the input data 1602. Examples of the
metrics can include statistical information, cardinality
information, frequency information, and content classifications
associated with the input data 1602. The statistical information
may include, for example, a mean or standard deviation of a
variable value in the input data 1602. The cardinality information
may include, for example, a number of unique elements in a variable
set in the input data 1602. The frequency information may include,
for example, a count of how many times a variable value is present
in the input data 1602. Content classifications may include, for
example, groups or clusters assigned to content in the input data
1602. The processing device can determine the metrics using any
suitable approach. For example, the processing device may execute
one or more machine-learning models to determine the metrics. In
one such example, the processing device can execute a classifier
model on the input data 1602 to determine one or more classes
associated with observations or variables in the input data
1602.
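For illustration only, the extraction of statistical, cardinality, and frequency metrics for a single variable might be sketched as follows; the profile_variable function and its output keys are assumptions made for the example.

```python
import pandas as pd

def profile_variable(series: pd.Series) -> dict:
    """Illustrative metric extraction for one variable of the input data."""
    numeric = pd.api.types.is_numeric_dtype(series)
    counts = series.value_counts()
    return {
        "mean": series.mean() if numeric else None,   # statistical information
        "std": series.std() if numeric else None,
        "cardinality": series.nunique(),              # cardinality information
        "top_value_count": int(counts.iloc[0]) if len(counts) else 0,  # frequency
    }
```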
[0170] In some examples, the output 1606 can indicate data quality
problems relating to the input data 1602. For example, the
processing device can execute a decision tree analysis with respect
to the input data 1602 to identify problematic (e.g., improperly
formatted, inconsistent, or missing) variable values. Problematic
variable values may indicate that the input data 1602 is of poor
quality. The processing device may assign the input data 1602 a
data quality score based on such analysis. One example of such a
data quality score may be a numerical value that falls between 0
and 100, with higher values corresponding to higher quality and
lower values corresponding to lower quality.
[0171] In some examples, the output 1606 can include reformatted
versions of the input data 1602. For example, the processing device
can automatically cleanse the input data 1602 by executing
processing operations on the input data 1602, to de-duplicate,
standardize, recode, impute, or enrich the input data 1602. As
another example, the processing device can automatically join
information in the input data 1602 together based on the determined
metrics, for example by clustering data together based on the
classes (e.g., join keys) determined using the clustering model.
Reformatting the input data 1602 using some or all of these
techniques may make the input data 1602 more suitable for use in
subsequent steps of the process 1600.
[0172] In operation 1608, the processing device generates a
pipeline 1610. In some examples, the processing device can
automatically generate the pipeline 1610 based on the output 1606.
For example, the processing device can select processing operations
to include in the pipeline 1610 based on the metrics and/or the
reformatted input data associated with the output 1606. In other
examples, the pipeline 1610 may be generated based on input from
the user. For example, a user can manually select the processing
operations to include in the pipeline 1610. In still other
examples, the pipeline 1610 may be partially generated by the
processing device and partially generated by the user, for example
through an iterative process in which an initial version of the
pipeline 1610 is automatically generated by the processing device
and then further refined by the user.
[0173] The processing operations can each be configured to perform
any number and combination of tasks. For example, a processing
operation can be configured to transpose variables to features,
group variables, generate rules from variable combinations, perform
dimensionality reduction, identify sequences, accumulate
information based on statistical information such as mean or moment
statistics, perform time-series analysis, or any combination of
these. In some examples, the processing operations can leverage
supervised and unsupervised machine-learning models to perform
some or all of the tasks. The processing operations may pass
variable values and other information between one another to
facilitate execution of the pipeline 1610.
[0174] Any number and combination of processing operations may be
included in the pipeline 1610 based on predefined rules 1620. As
one example, the processing device can include a
cardinality-reduction operation in the pipeline 1610 if the metrics
indicate that the input data includes high-cardinality variables.
Examples of high-cardinality variables may be e-mail addresses,
identification numbers, phone numbers, or user names. The
cardinality-reduction operation can involve executing a
machine-learning model configured to reduce the cardinality of a
variable in the input data.
[0175] As another example, the processing device can include a
frequency-rollup operation in the pipeline 1610 if the metrics
indicate that the input data has one or more high-frequency
variable values. The frequency-rollup operation can involve
executing a machine-learning model to identify the high-frequency
variable values in the input data 1602. A high-frequency variable
value can be a variable value that is among the k most frequently
occurring values in the input data 1602 and that occurs at least i
times, where k and i may be selected by the user. After determining
the high-frequency variable values, the frequency-rollup operation
can calculate a frequency metric for each high-frequency variable
value. The frequency metric for a high-frequency variable value can
be calculated by weighting the frequency of the high-frequency
variable value by a count variable. The frequency-rollup operation
may then add the frequency metrics as a new variable in the
analysis table 1616.
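For illustration only, the frequency-rollup operation might be sketched as follows; the parameter names k and i follow the description above, and weighting the frequency by a count variable reflects one possible interpretation rather than a definitive implementation.

```python
import pandas as pd

def frequency_rollup(df, var, count_var, k=10, i=5):
    """Illustrative frequency-rollup for one variable of the input data."""
    counts = df[var].value_counts()
    # High-frequency values: among the k most frequent and occurring at least i times.
    high_values = counts[counts >= i].head(k).index
    # Frequency metric: frequency weighted by the count variable, per value.
    weighted = df[df[var].isin(high_values)].groupby(var)[count_var].sum()
    # The resulting metric can be added as a new variable of the analysis table.
    return df[var].map(weighted).fillna(0).rename(var + "_freq_metric")
```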
[0176] As yet another example, the processing device can include
a text-analysis operation in the pipeline 1610 if the metrics
indicate that the input data has text data. If the text data is
structured text, the text-analysis operation can be configured to
generate a pseudo-document or another data structure (e.g., a
string) that includes some or all of the structured text
concatenated together. In some cases, this may allow the structured
text to be analyzed as if it was unstructured text. Analyzing the
structured text as if it was unstructured text may provide numerous
advantages. For example, each of the items in the structured text
may be treated like "terms" in a term dictionary, thereby allowing
the processing device to automatically determine that certain
categorical items are related to one another based on the
relationships of the corresponding terms in the term dictionary.
Identifying such relationships may otherwise be difficult or
impossible by analyzing the structured text directly. Additionally
or alternatively, the processing device can include an operation to
project each subject's set of values for the categorical variable
into a multidimensional space, such that similarity of subjects is
indicated by proximity in that multi-dimensional space. The axes in
that space may be rotated, such that subjects might be aligned by
certain characteristics of their data, like topics for unstructured
data analysis. For example, if the data indicates different movies
or books associated with individuals (e.g., in their social media
profiles), then the processing device could identify people that
are interested in certain genres.
[0177] The processing operations can be selected (e.g., manually by
the user and/or automatically by the processing device) from a
toolbox of available processing operations. The toolbox may be
included in pipeline-creation software executed by the processing
device. The toolbox can be extensible, in that new processing
operations can be added to the toolbox over time. For example, the
user can download new processing operations from the Internet
and/or manually program new processing operations.
[0178] In examples in which the processing device automatically
selects the processing operations to include in the pipeline 1610,
the processing device may also automatically determine an order for
the processing operations in the pipeline 1610. In some examples,
the processing device can determine the order based on predefined
rules 1620. The predefined rules 1620 may specify that certain
processing operations are to occur before or after other processing
operations. For example, a first processing operation may supply a
required output to a second processing operation. So, the
predefined rules 1620 may specify that the second processing
operation is to follow the first processing operation. As another
example, the predefined rules 1620 may specify that certain types
of variables are to be processed by the pipeline 1610 before other
types of variables. For instance, it may be desirable to process
structured text before processing numerical values, and this may be
reflected in the predefined rules 1620. Based on such predefined
rules 1620, the processing device may position a first processing
operation configured to process structured text prior to a second
processing operation configured to process numerical values in the
pipeline 1610. It will be appreciated that any number and
combination of techniques can be used to organize the processing
operations in the pipeline 1610 into a particular order.
[0179] In some examples, the processing device can present the
pipeline 1610 to the user. For example, the processing device can
generate a visual depiction of the pipeline 1610 in a GUI of the
pipeline-creation software. The user may then be able to further
customize the pipeline 1610, for example by dragging-and-dropping
processing operations from the toolbox into the pipeline 1610,
deleting existing processing operations in the pipeline 1610,
and/or reorganizing the processing operations in the pipeline 1610
as desired. For simplicity, an automatically generated pipeline may
be referred to herein as an "initial pipeline" in its initial form
prior to user customizations, and may be referred to as a
"customized pipeline" in its subsequent form after one or more user
customizations.
[0180] Once the user determines that the pipeline 1610 is ready to
be executed, the user can initiate the pipeline 1610. The
processing device can apply the pipeline 1610 to the input data to
generate an analysis table 1616 suitable for use with the target
model 1618. The input data to which the pipeline 1610 is applied
may be the original input data 1602 or the reformatted input data
generated in operation 1604. Applying the pipeline 1610 to the
input data can involve executing the pipeline 1610 on the input
data to generate the analysis table 1616, which can include model
ready data (e.g., clean and properly formatted data).
[0181] Having automatically created the analysis table 1616, in
some examples the processing device may then execute the target
model 1618 on the analysis table 1616. For example, the processing
device can provide the analysis table 1616 as input to a predictive
model that is configured to predict a value for a target variable.
The target variable may have been selected by the user prior to
operation 1604. The predictive model can receive the analysis table
1616 and responsively generate a predicted value for the target
variable. The predicted value may then be output to the user, for
example as part of the GUI of the pipeline-creation software.
[0182] In some examples, the processing device can also implement
operation 1612. In operation 1612, the processing device generates
program code 1614 for the pipeline 1610. The program code can be
programmed in any suitable programming language, such as Java, C++,
C, Python, R, or SAS Language, which is a proprietary programming
language created by SAS Institute.RTM. of Cary, N.C.
[0183] The program code may be configured to be deployed outside
the context of the pipeline-creation software that was used to
generate the pipeline 1610. For example, the program code can be
configured to be deployed to a production environment and executed
on input data (e.g., input data 1602 or different input data),
without using the pipeline-creation software. Since the
pipeline-creation software can include various software elements
that consume computing resources (e.g., memory and processing
power) and introduce computational overhead, executing the pipeline
1610 outside of the context of the pipeline-creation software by
using the program code can reduce consumption of computing
resources and significantly expedite execution of the pipeline
1610.
[0184] In some examples, the processing device can generate the
program code using one or more code templates 1622. In particular,
the processing device may have access to a repository of predefined
code templates 1622. Each code template 1622 in the repository can
include a segment of program code that can be incorporated into the
overall program code 1614. The segment of program code may have
variables that are modifiable throughout the flow of the pipeline
1610. Such variables may also be used by other tools. The segment
of program code may be used to define one or more of the processing
operations in the pipeline 1610. Since each code template 1622 may
correspond to one or more of the processing operations in the
pipeline 1610, the processing device can select the code templates
1622 from the repository based on the processing operations in the
pipeline 1610 and include the code templates 1622 into the program
code 1614. In some examples, the processing device can organize the
code templates 1622 in the program code 1614 based on the order of
the corresponding processing operations in the pipeline 1610. For
example, the processing device can organize the code templates 1622
in the program code 1614 in the same sequence that the processing
operations are organized in the pipeline 1610.
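For illustration only, assembling program code 1614 from code templates 1622 in pipeline order might be sketched as follows; the template contents, operation names, and parameters are hypothetical and do not represent the actual code templates or the SAS-language segments they may contain.

```python
from string import Template

# Hypothetical templates keyed by processing-operation name.
TEMPLATES = {
    "cardinality_reduction": Template("reduce_levels(table='$table', var='$var');\n"),
    "frequency_rollup": Template("rollup_frequencies(table='$table', var='$var');\n"),
}

def build_program_code(pipeline_operations, parameters):
    """Concatenate template segments in the same order as the pipeline."""
    segments = [TEMPLATES[op].substitute(parameters[op]) for op in pipeline_operations]
    return "".join(segments)

code = build_program_code(
    ["cardinality_reduction", "frequency_rollup"],
    {"cardinality_reduction": {"table": "input", "var": "email"},
     "frequency_rollup": {"table": "input", "var": "product"}},
)
print(code)
```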
[0185] In some examples, the processing device can modify the code
templates 1622 before or after incorporating them into the program
code 1614. For example, the processing device can modify variable
values of a code template 1622 based on one or more parameters
provided by the user. Additionally or alternatively, the processing
device can modify the variable values based on one or more
parameters determined in operation 1604, such as one or more of the
metrics. Additionally or alternatively, the processing device can
modify the variable values based on one or more parameters
determined in operation 1608, for example based on one or more
processing operations executed in the pipeline 1610.
[0186] The processing device can generate the program code before
executing, while executing, or after executing the pipeline 1610.
For example, the processing device can dynamically build the
program code 1614 while executing the pipeline 1610. In one such
example, the processing device can execute the pipeline 1610 as a
sequence of steps. Each step can involve performing a corresponding
processing operation. Each step may also involve selecting,
modifying, and incorporating a corresponding code template into the
program code 1614. In this way, the program code 1614 can be
dynamically built over the course of the sequence of steps. By
dynamically building the program code 1614 while the pipeline 1610
is executed, or by building the program code 1614 after the
pipeline 1610 is executed, the program code 1614 can be customized
to include information (e.g., variables or variable values)
generated as a result of one or more processing operations in the
pipeline 1610. This may not be possible if the program code 1614 is
generated prior to executing the pipeline 1610.
[0187] One example of a process 1700 for automatically generating a
pipeline 1610 is shown in FIG. 17. The process 1700 may be
implemented by the pipeline-creation software, in some examples.
While FIG. 17 depicts a certain sequence of operations for
illustrative purposes, other examples can include more operations,
fewer operations, different operations, or a different order of the
operations shown in FIG. 17. The operations of FIG. 17 are
described below with reference to the components of FIG. 16
described above.
[0188] The process 1700 begins at operation 1702, which in some
examples may be a subpart of operation 1604. At operation 1702, a
processing device extracts features from input data 1602. This may
involve performing one or more feature-extraction operations on the
input data 1602. Examples of the features may be the metrics
described above, such as statistical information, cardinality
information, frequency information, and content classifications.
Such features may be derived from, but not explicitly included in,
the input data 1602.
[0189] At operation 1704, the processing device determines if the
input data 1602 has a high cardinality. The processing device may
make this determination based on the features extracted in
operation 1702. The input data 1602 can have a high cardinality if
the input data 1602 includes one or more variables that have a high
cardinality. If the processing device determines that the input
data 1602 does not have a high cardinality, the process 1700 can
proceed to operation 1710. Otherwise, the processing device can add
a cardinality-reduction operation 1706 into the pipeline 1610.
[0190] The cardinality-reduction operation 1706 can automatically
reduce the number of levels associated with a high-cardinality
variable in the input data 1602. A "level" is a value of a
variable. A high-cardinality variable is a variable with a large
number of values in the input data 1602. For example, a variable
that has 1000 different values in the input data 1602 may be
considered a high-cardinality variable that has 1000 levels. The
cardinality-reduction operation 1706 can reduce the total number of
levels associated with a high-cardinality variable that exist in
the input data 1602. For example, the cardinality-reduction
operation 1706 can collapse together rarely occurring levels for a
high-cardinality variable, can collapse together more-common levels
for a target variable selected by the user, or both of these.
Reducing the number of levels that are incorporated in the input
data 1602, and thus that are later incorporated into the analysis
table 1616, may focus the analysis table 1616 on the more-relevant
variable values. This can yield improved modelling results, since
less-relevant variable values (e.g., extraneous variable values)
are excluded from the analysis table 1616. Reducing the number of
levels that are incorporated into the analysis table 1616 may also
decrease the size of the analysis table 1616. This can enable the
analysis table 1616 to be analyzed faster by the target model
1618.
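For illustration only, collapsing rarely occurring levels of a high-cardinality variable might be sketched as follows; the min_count threshold and the catch-all label are assumptions made for the example.

```python
import pandas as pd

def collapse_rare_levels(series, min_count=30, other_label="OTHER"):
    """Illustrative cardinality reduction: merge rarely occurring levels of a
    high-cardinality variable into a single catch-all level."""
    counts = series.value_counts()
    rare_levels = counts[counts < min_count].index
    return series.where(~series.isin(rare_levels), other_label)
```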
[0191] At operation 1710, the processing device determines if
association rules are to be generated. The processing device may
make this determination based on the features extracted in
operation 1702. Association rules can be rules for correlating
levels of one or more variables together. If the processing device
determines that the association rules are not to be generated, then
the process 1700 can continue to operation 1714. Otherwise, the
processing device can add an association-rule generation operation
1712 into the pipeline 1610.
[0192] The association-rule generation operation 1712 can
automatically generate association rules for levels of one or more
variables in the input data 1602. If the variables are in a
particular taxonomy, the association-rule generation operation 1712
can generate the association rules based on that taxonomy.
[0193] In some examples, the association-rule generation operation
1712 can generate the rules by performing a market basket analysis
on the input data 1602. Market basket analysis can be a technique
that identifies the strength of associations between pairs of
variables and identifies patterns of co-occurrence. A co-occurrence
is when two or more things (e.g., variable values) take place
together. Market basket analysis can produce If-Then scenario
rules, for example if variable A has value X then it is likely that
variable B has value Y. The rules can be probabilistic in nature in
that they can be derived from the frequencies of co-occurrences in
the observations of the input data 1602. A new variable can be
added to the analysis table 1616 for each new rule that is
generated. In this way, the association-rule generation operation
1712 can add information to the analysis table 1616 that was not
originally present in the input data 1602.
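For illustration only, a minimal market basket analysis producing If-Then rules from co-occurrence frequencies might be sketched as follows; the confidence threshold and the restriction to pairs of items are simplifying assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def mine_pair_rules(observations, min_confidence=0.6):
    """Tiny market-basket sketch: derive 'if A then B' rules from
    co-occurrence frequencies across the observations."""
    item_counts, pair_counts = Counter(), Counter()
    for observation in observations:
        items = set(observation)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), together in pair_counts.items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / item_counts[lhs]  # probabilistic rule strength
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))  # "if lhs then rhs"
    return rules
```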
[0194] At operation 1714, the processing device determines if
frequencies are to be rolled up. The processing device may make
this determination based on the features extracted in operation
1702, possibly as reduced by operation 1706. If the processing
device determines that frequencies are not to be rolled up, the
process 1700 can continue to operation 1718. Otherwise, the
processing device can add a frequency-rollup operation 1716 into
the pipeline 1610.
[0195] The frequency-rollup operation 1716 can determine
high-frequency variable values in the input data 1602. After
determining the high-frequency variable values, the
frequency-rollup operation 1716 can calculate a frequency metric
for each high-frequency variable value. The frequency-rollup
operation 1716 may then add a new variable for the frequency
metrics in the analysis table 1616. In this way, the
frequency-rollup operation 1716 can add information to the analysis
table 1616 that was not originally present in the input data
1602.
[0196] At operation 1718, the processing device determines if text
in the input data 1602 is to be analyzed. The processing device may
make this determination based on the features extracted in
operation 1702. If the processing device determines that text in
the input data 1602 is not to be analyzed, the process 1700 can
continue to operation 1722. Otherwise, the processing device can
add a text-analysis operation 1720 into the pipeline 1610.
[0197] In some examples, the text-analysis operation 1720 can
generate a pseudo-document or another data structure that contains
one or more text strings by concatenating together the possible
categorical values of the input data 1602. For example, the
text-analysis operation can generate a data structure that includes
a space-separated text string in which the textual values for each
variable, across some or all observations (e.g., transactions) for
a unique subject in the input data 1602, are concatenated together.
This process may be repeated for each unique subject in the input
data 1602, such that there are N data structures if there are N
unique subjects. Concatenating the structured text together into a
data structure may allow the structured text to be treated as if it
is unstructured text, so that textual analysis techniques (e.g.,
topic analysis, predictive rule generation from terms, etc.)
typically reserved for unstructured text may be applied to the
input data 1602. This may allow for a broader range of textual
analysis techniques to be applied to the input data 1602.
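For illustration only, generating one pseudo-document per unique subject by concatenating textual values might be sketched with pandas as follows; the column-name arguments are assumptions made for the example.

```python
import pandas as pd

def build_pseudo_documents(df, subject_col, text_cols):
    """Concatenate every textual value across a subject's observations into one
    space-separated string, giving one pseudo-document per unique subject."""
    row_text = df[text_cols].astype(str).agg(" ".join, axis=1)  # join columns per row
    return (row_text.groupby(df[subject_col])
                    .agg(" ".join)                              # join rows per subject
                    .rename("pseudo_document"))
```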
[0198] Additionally or alternatively, the text-analysis operation
1720 can generate a data structure indicating the number of
observations for a unique subject in the input data 1602. This
process may be repeated for each unique subject in the input data
1602, such that there are N data structures if there are N unique
subjects.
[0199] At operation 1722, the processing device determines whether
predictive rules are to be generated. The processing device may
make this determination based on some or all of the features
extracted in prior operations, such as operation 1702 or operation
1720. If the processing device determines that the predictive rules
are not to be generated, the process 1700 can end. Otherwise, the
processing device can add a predictive-rule generation operation
1724 into the pipeline 1610.
[0200] The predictive-rule generation operation 1724 can generate
rules to predict levels of a target variable based on the presence
or absence of a variable in the input data 1602 or in the
text-analysis operation 1720. The levels can be numerical values or
textual terms (e.g., in the case of categorical values converted
into unstructured text). The predictive-rule generation operation
1724 can then add a new variable to the analysis table 1616 based
on each rule generated. In this way, the predictive-rule generation
operation 1724 can add information to the analysis table 1616 that
was not originally present in the input data 1602.
[0201] It will be appreciated that the pipeline 1610 shown in FIG.
17 is intended to be illustrative and non-limiting. Other examples
may include more, fewer, or different processing operations in the
pipeline 1610. And other examples may include more, fewer, or
different rules for incorporating processing operations into the
pipeline 1610. In general, any suitable number and combination of
processing operations can be incorporated into the pipeline 1610
based on any number and combination of rules.
[0202] After generating the pipeline 1610, the processing device
can execute the pipeline 1610 (e.g., using the pipeline-creation
software or the program code 1614) on the input data 1602. As the
processing device performs the processing operations in the
pipeline 1610, some or all of the processing operations can derive
new information from the input data 1602 and incorporate the new
information into the analysis table 1616. Since the new information
is derived from the original input data 1602, the new information
may be considered features of the input data 1602 and each such
processing operation may be considered a feature-extraction
operation.
[0203] As noted above, it may be desirable to determine how each of
the processing operations in the pipeline 1610 influences the
modelling result, so that the pipeline can be optimized. To that
end, the processing device can determine the impact of each
processing operation on the modelling result. In some examples, the
processing device can dynamically (e.g., in real time) determine
the impact of each processing operation in the pipeline 1610 on the
modelling result as the pipeline 1610 is being executed, for
example by running a model-accuracy test 1708 after some or all of
the processing operations are performed. This is represented in
FIG. 17 by the model-accuracy tests 1708b-e. Each of the
model-accuracy tests 1708b-e can indicate if the modelling result
is improved by the corresponding processing operation. That
information may then be output to the user, so that the user can
improve the pipeline 1610. Additionally or alternatively, that
information can be automatically acted upon by the processing
device to automatically improve (e.g., optimize) the pipeline 1610.
For example, the processing device may automatically remove harmful
processing operations or extraneous processing operations from the
pipeline 1610, which can improve the modelling result and/or reduce
the amount of computing resources that are consumed by executing
the pipeline 1610.
[0204] In some examples, the model-accuracy test 1708 can involve
the operations shown in FIG. 18. Other examples may involve more
operations, fewer operations, different operations, or a different
order of the operations shown in FIG. 18. The operations of FIG. 18
are described below with reference to the components of FIGS. 16-17
described above.
[0205] In operation 1802, the processing device determines a prior
value for an accuracy metric. The accuracy metric indicates the
accuracy of a target model 1618. Examples of the accuracy metric
can include accuracy, area under the curve (AUC), Mean Squared
Error (MSE), F1 statistic, correlation, etc. The prior
value may have been generated in relation to a prior processing
operation in the pipeline 1610. For example, if the current
processing operation in the pipeline 1610 is the text-analysis
operation 1720, then the prior value may have been generated during
the model-accuracy test 1708c associated with the frequency-rollup
operation 1716, since the frequency-rollup operation 1716 precedes
the text-analysis operation 1720 in the pipeline 1610 of FIG.
17.
[0206] In operation 1804, the processing device determines a current
value for the accuracy metric. The current value can be generated
in relation to the current processing operation in the pipeline
1610. For example, if the current processing operation in the
pipeline 1610 is the association-rule generation operation 1712,
then the current value may have been generated in relation to the
association-rule generation operation 1712.
[0207] In some examples, the processing device can generate the
current value by supplying the analysis table 1616 as training data
for training the target model 1618. Because the analysis table 1616
may be modified by some or all of the processing operations in the
pipeline 1610, the analysis table 1616 may be different each time
it is used to train the target model 1618 during one of the
model-accuracy tests 1708. As a result, the value of the accuracy
metric may change based on how each processing operation in the
pipeline 1610 modifies the analysis table 1616.
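By way of a non-limiting sketch (not part of the original
disclosure), the current value of the accuracy metric might be
obtained by retraining and scoring a model on the current state of
the analysis table, as below. The scikit-learn model, the AUC metric,
the "target" column name, and the assumption of numeric features and
a binary target are all illustrative choices.

    # Hypothetical sketch: retrain the target model on the current analysis
    # table and score it, so that each processing operation's effect on the
    # accuracy metric can be measured.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def current_accuracy(analysis_table, target_col="target"):
        X = analysis_table.drop(columns=[target_col]).fillna(0)  # assumes numeric features
        y = analysis_table[target_col]                           # assumes a binary target
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        # Area under the ROC curve (AUC) is one of the accuracy metrics mentioned above.
        return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])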
[0208] In operation 1806, the processing device compares the
current value for the accuracy metric to the prior value, to
determine whether the current value is improved as compared to the
prior value. For example, the processing device can compare the
current value for a selected metric to the prior value for the
selected metric to determine if there is a difference between the
two. If so, the change can be attributed to the current processing
operation. If the change increases the value of the selected
metric, the processing device or the user can determine that the
current processing operation is a helpful processing operation that
enhances the modelling result. If the change decreases the value of
the selected metric, the processing device or the user can
determine that the current processing operation is a harmful processing
operation that is detrimental to the modelling result. If there is
little or no change to the value of the selected metric, the
processing device or the user can determine that the current
processing operation is an extraneous processing operation.
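The comparison of operation 1806 can be expressed compactly. The
following non-limiting sketch (not part of the original disclosure)
classifies the current processing operation from the change in the
metric; the tolerance used to decide what counts as "little or no
change" is an assumption.

    # Hypothetical sketch of operation 1806: classify the current processing
    # operation by the change it produced in the selected accuracy metric.
    def classify_operation(prior_value, current_value, tolerance=0.001):
        delta = current_value - prior_value
        if delta > tolerance:
            return "helpful"      # the operation enhanced the modelling result
        if delta < -tolerance:
            return "harmful"      # the operation degraded the modelling result
        return "extraneous"       # little or no change is attributable to it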
[0209] In some examples, the processing device can execute optional
operation 1808. In this operation, the processing device generates
an output indicating whether the current processing operation is a
helpful processing operation, a harmful processing operation, or an
extraneous processing operation. In some examples, the processing
device may incorporate the output into a GUI. For example, the GUI
may color code a processing operation in the pipeline 1610 as
green, red, or gray to indicate that the processing operation is a
helpful processing operation, a harmful processing operation, or an
extraneous processing operation, respectively. Of course, this
color-coding scheme is intended to be exemplary and other
color-coding schemes may also be used.
[0210] In some examples, the processing device can execute optional
operation 1820. In this operation, the processing device
automatically removes the current processing operation from the
pipeline 1610. In some examples, the processing device can
automatically remove the current processing operation if the
current processing operation is a harmful processing operation or
an extraneous operation. In other examples, the processing device
can automatically remove the current processing operation from the
pipeline 1610 if the current processing operation is a helpful
processing operation. For example, the processing device can
determine that the modelling improvement afforded by the helpful
processing operation is outweighed by the amount of computing
resources consumed by the helpful processing operation. So, the
processing device can remove the helpful processing operation from
the pipeline 1610.
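A non-limiting sketch (not part of the original disclosure) of this
pruning logic follows. The per-operation cost model, the
improvement-per-second threshold, and the structure of the assessment
records are all illustrative assumptions.

    # Hypothetical sketch of optional operation 1820: drop operations that are
    # harmful or extraneous, and drop helpful operations whose improvement is
    # outweighed by the computing resources they consume.
    def prune_pipeline(pipeline, assessments, min_gain_per_second=0.0005):
        # `assessments` maps each operation to (classification, metric_gain, runtime_seconds).
        pruned = []
        for operation in pipeline:
            label, gain, runtime = assessments[operation]
            if label in ("harmful", "extraneous"):
                continue
            if runtime > 0 and gain / runtime < min_gain_per_second:
                continue  # helpful, but too costly for the improvement it affords
            pruned.append(operation)
        return pruned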
[0211] FIG. 19 depicts an example of a GUI 1900 configured to
assist in generating pipelines according to some aspects of the
present disclosure. In this example, the GUI 1900 includes a frame
1902 with a toolbox of N processing operations (e.g., Processing
Operations A-N) that may be included in a pipeline 1906. A user can
manually drag-and-drop the processing operations into a canvas
region 1904 and arrange them in a desired order to create the
pipeline 1906. Additionally, or alternatively, a processing device
can automatically select processing operations from the toolbox and
organize the selected processing operations to create at least a
portion of the pipeline 1906, which may then be further customized
by the user. For example, the user may add additional processing
operations into the automatically generated pipeline, remove
existing operations from the automatically generated pipeline, or
change the order of the processing operations in the automatically
generated pipeline. Once the user is satisfied with the pipeline
1906, the user can select a play button 1908 to execute the
pipeline 1906 on any set of input data, in order to generate an
analysis table for use in a modelling process.
[0212] In some examples, the processing device can execute a
model-accuracy test with respect to each processing operation in
the pipeline 1906. The processing device can then update the GUI
1900 to reflect the results of the model-accuracy tests. The GUI
1900 can indicate the results of the model-accuracy tests with
status indicators, such as status indicator 1910. The status
indicator for a given processing operation can specify whether the
processing operation is a helpful processing operation, a harmful
processing operation, or an extraneous processing operation. In the
example shown in FIG. 19, helpful processing operations are
indicated by a "+" symbol, harmful processing operations are
indicated by a "-" symbol, and extraneous processing operations are
indicated by a "~" symbol. But other examples may use other
schemes, such as color coding, to delineate between helpful,
harmful, and extraneous processing operations. For example, the GUI
1900 could use a color scheme in which the status indicators are
colored red to represent a harmful processing operation, green to
represent a helpful processing operation, or gray to represent an
extraneous processing operation.
[0213] A user can view the status indicators and adjust the
pipeline 1906 accordingly. For example, the user may remove
Processing Operation D from the pipeline 1906 upon determining that
Processing Operation D is a harmful processing operation.
Additionally, or alternatively, the user may remove Processing
Operation Y from the pipeline 1906 upon determining that Processing
Operation Y is an extraneous processing operation. Once the user
has made any desired changes to the pipeline 1906, the user may
select the play button 1908 again to execute the updated pipeline
on input data to generate an updated version of the analysis
table.
[0214] It will be appreciated that FIG. 19 is intended to be
illustrative and non-limiting. Other examples may include more
components, fewer components, different components, or a different
arrangement of the components shown in FIG. 19. Additionally, the
graphical objects shown in FIG. 19 may have different shapes,
sizes, colors, icons, and locations in other examples. For
instance, although the status indicator (e.g., status indicator
1910) for each processing operation is shown in a particular
position relative to the processing operation in FIG. 19, in other
examples the status indicators can be in other locations of the GUI
1900. And although the status indicators are shown as having a
circular shape in FIG. 19, in other examples the status indicators
may have other shapes and sizes.
[0215] FIG. 20 depicts a flow chart of an example of a process for
generating a pipeline according to some aspects of the present
disclosure. Other examples may involve more operations, fewer
operations, different operations, or a different order of the
operations shown in FIG. 20.
[0216] In operation 2002, a processing device obtains a first table
that includes first data (e.g., transactional data) referencing a
set of subjects. Obtaining the first table may involve receiving or
generating the first table. The first data may be considered part
of the input data 1602 of FIG. 16.
[0217] In the first data, each subject in the set of subjects can
be correlated to one or more variable values describing a
transaction associated with the subject. The first data can include
at least one one-to-many relationship, in which a single subject
in the set of subjects is referenced in multiple observations. The
one-to-many relationship(s) in the first data may be incompatible
with a target model.
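For concreteness, the first data might resemble the following
non-limiting sketch (not part of the original disclosure), in which
the subject identifiers, column names, and values are purely
illustrative.

    # Hypothetical example of "first data": transactional rows in which a
    # single subject appears in multiple observations (a one-to-many relationship).
    import pandas as pd

    first_table = pd.DataFrame({
        "subject_id": ["A", "A", "A", "B", "C", "C"],
        "item":       ["milk", "bread", "eggs", "milk", "eggs", "soap"],
        "amount":     [3.50, 2.25, 4.10, 3.50, 4.10, 1.99],
    })
    # Subject "A" is referenced in three observations, which a target model
    # expecting one row per subject cannot consume directly.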
[0218] In operation 2004, the processing device obtains second data
(e.g., subject data) referencing the set of subjects. The second
data can be separate from the first data. For example, the second
data may be part of a second table that is separate from the first
table.
[0219] Obtaining the second data may involve receiving or
generating the second data. For example, the second data may be
generated (e.g., inferred or derived) based on the first data by
extracting the set of subjects from the first table. The second
data may be considered part of the input data 1602 of FIG. 16.
[0220] In the second data, each subject in the set of subjects can
be correlated to one or more attributes describing the subject. The
second table may be key-linked to the first table by the set of
subjects.
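Continuing the illustrative tables above, one non-limiting way to
derive key-linked second data from the first table (not part of the
original disclosure) is sketched below; the derived attribute is an
assumption.

    # Hypothetical sketch of operation 2004: derive subject-level "second data"
    # from the transactional first table when no separate subject table exists.
    second_table = (first_table.groupby("subject_id")
                    .agg(first_seen_item=("item", "first"))
                    .reset_index())
    # The second table is key-linked to the first table by "subject_id",
    # with one row (and one or more attributes) per subject.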
[0221] In operation 2006, the processing device generates an
analysis table based on the second data. For example, the
processing device can generate an analysis table that includes some
or all of the subject data in the second table. In some examples,
the analysis table may be separate from the first table and the
second table. Alternatively, the processing device can use the
second table as the analysis table, rather than copying information
from the second table into a separate analysis table.
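A non-limiting sketch of operation 2006 (not part of the original
disclosure), continuing the illustrative tables above:

    # Initialize the analysis table from the subject-level second table.
    # Copying keeps the source tables untouched; per the alternative described
    # above, the second table itself could instead serve as the analysis table.
    analysis_table = second_table.copy()    # separate analysis table
    # analysis_table = second_table         # or reuse the second table directly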
[0222] In operation 2008, the processing device executes a sequence
of processing operations on the first data in a particular order
defined by a pipeline to modify the analysis table to include
features associated with the set of subjects. In some examples, the
pipeline can be at least partially defined by a user. Additionally,
or alternatively, the pipeline can be at least partially defined by
a computer automatically. Each of the processing operations in the
pipeline can be configured to modify the analysis table by adding
data to the analysis table or removing data from the analysis
table.
[0223] The pipeline can include any number and combination of
processing operations. In some examples, each processing operation
in the sequence can be configured to determine a respective set of
features based on the first data by executing a respective
feature-extraction operation on the first data. The respective set
of features can then be added to the analysis table, for example so
that each subject in the set of subjects is correlated in the
analysis table to corresponding values for the respective set of
features.
[0224] In some examples, the analysis table may lack the
one-to-many relationships in the first data, with which a target
model may be incompatible, thereby reducing potential compatibility
issues. Additionally, the analysis table may have less information
than the first data and may have information that is more relevant
to the target model than the first data. This can allow the
analysis table to be consumed faster than, and render more accurate
results than, the first data. As a result, the analysis table may
be more fit for consumption by the target model than the first
data.
[0225] In operation 2010, the processing device determines if there
is an additional input table (e.g., transaction table) for use by
the pipeline. If so, the process can return to operation 2002 and
the process can repeat, for example to further expand the analysis
table with additional information. In such circumstances, the
additional input table would serve as the "table" in operation 2002
and the data therein would serve as the "first data" in operation
2002. This process can iterate any number of times, for example
until there are no more input tables to be operated upon.
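Continuing the earlier illustrative helpers (not part of the original
disclosure), the iteration of operation 2010 might be sketched as
follows; the list of additional tables and the pipeline contents are
assumptions.

    # Hypothetical sketch of operation 2010: push each additional input table
    # through the same pipeline, further expanding the analysis table each time.
    additional_input_tables = []        # e.g., further transaction tables, if any
    operations = [frequency_rollup]     # illustrative pipeline from the earlier sketch
    for extra_table in additional_input_tables:
        analysis_table = run_pipeline(extra_table, analysis_table, operations)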
[0226] In operation 2012, the processing device executes the target
model on the modified analysis table for generating an output value
based on the modified analysis table. An example of the target
model can be a predictive model and an example of the output value
can be a predicted value. The predictive model may be, for example,
a machine-learning model configured to predict demand for a
hardware or software product, a number of visits to a website, a
potential adverse reaction to a vaccination, product demand data, a
number of secure connections to a network, etc.
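As a final non-limiting sketch (not part of the original disclosure),
operation 2012 might be approximated as below; the gradient-boosting
model, the "demand" target column, and the assumption of numeric
feature columns are illustrative choices only.

    # Hypothetical sketch of operation 2012: train a predictive model on the
    # modified analysis table and generate predicted values (e.g., forecast demand).
    from sklearn.ensemble import GradientBoostingRegressor

    def predict_with_target_model(analysis_table, target_col="demand"):
        X = analysis_table.drop(columns=[target_col]).fillna(0)  # assumes numeric features
        y = analysis_table[target_col]
        model = GradientBoostingRegressor().fit(X, y)
        return model.predict(X)  # one predicted value per subject in the analysis table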
[0227] Although the above operations are described with respect to
two data tables, a similar process can be applied to any number and
combination of tables (e.g., one or four tables). For example, a
similar process may be applied in an embodiment that involves the
first table and excludes the second table, whereby the second data
may be derived from the first data of the first table. As one such
example, the processing device may derive subject data from the
transactional data in the first table. In other examples, a similar
process can be applied to three or more tables, where each table
can be used to add additional data to the analysis table.
[0228] Additionally, although various examples are described herein
with respect to an analysis table, similar principles can be
applied to any other suitable type of data structure. Thus, the
principles described herein are not intended to be limited to
analysis tables or data tables.
[0229] In the previous description, for the purposes of
explanation, specific details are set forth in order to provide a
thorough understanding of examples of the technology. But various
examples can be practiced without these specific details. The
figures and description are not intended to be restrictive.
[0230] The previous description provides examples that are not
intended to limit the scope, applicability, or configuration of the
disclosure. Rather, the previous description of the examples
provides those skilled in the art with an enabling description for
implementing an example. Various changes may be made in the
function and arrangement of elements without departing from the
spirit and scope of the technology as set forth in the appended
claims.
[0231] Specific details are given in the previous description to
provide a thorough understanding of the examples. But the examples
may be practiced without these specific details. For example,
circuits, systems, networks, processes, and other components can be
shown as components in block diagram form to prevent obscuring the
examples in unnecessary detail. In other examples, well-known
circuits, processes, algorithms, structures, and techniques may be
shown without unnecessary detail in order to avoid obscuring the
examples.
[0232] Also, individual examples may have been described as a
process that is depicted as a flowchart, a flow diagram, a data
flow diagram, a structure diagram, or a block diagram. Although a
flowchart can describe the operations as a sequential process, many
of the operations can be performed in parallel or concurrently. In
addition, the order of the operations can be re-arranged. And a
process can have more or fewer operations than are depicted in a
figure. A process can correspond to a method, a function, a
procedure, a subroutine, a subprogram, etc. When a process
corresponds to a function, its termination can correspond to a
return of the function to the calling function or the main
function.
[0233] Systems depicted in some of the figures can be provided in
various configurations. In some examples, the systems can be
configured as a distributed system where one or more components of
the system are distributed across one or more networks in a cloud
computing system.
* * * * *