U.S. patent application number 12/023267 was filed with the patent office on 2009-01-01 for cluster processing of an aggregated dataset.
Invention is credited to Alberto Agostinelli, Andrea Basilico, Cheryl G. Bergeon, Craig Joseph Chapa, Marshall Ashby Gibbs, Bradley Michael Griglione, Gregory David Neil Hudson, Herbert Dennis Hunt, Arvid C. Johnson, Ting Liu, Trevor Mason, John Randall West, Jay Alan Yusko.
Application Number | 20090006309 12/023267 |
Document ID | / |
Family ID | 46331842 |
Filed Date | 2009-01-01 |
United States Patent
Application |
20090006309 |
Kind Code |
A1 |
Hunt; Herbert Dennis ; et
al. |
January 1, 2009 |
CLUSTER PROCESSING OF AN AGGREGATED DATASET
Abstract
Systems and methods are presented that may involve receiving a
aggregated dataset, wherein the aggregated dataset includes data
from a panel data source, a fact data source, and a dimension data
source that have been associated with a standard population
database. The process may also involve storing the aggregated data
in a partition within a partitioned database, wherein the partition
is associated with a data characteristic. The process may also
involve associating a master processing node with a plurality of
slave nodes, wherein each of the plurality of slave nodes is
associated with a partition of the partitioned database. The
process may also involve submitting an analytic query to the master
processing node. The process may also involve assigning analytic
processing to at least one of the plurality of slave nodes by the
master processing node, wherein the assignment is based at least in
part on the association of the partition with the data
characteristic. The process may also involve reading the aggregated
data from the partitioned database by the assigned slave node. The
process may also involve analyzing the aggregated data by the
assigned slave node, wherein the analysis produces a result at each
slave node. The process may also involve combining the results from
each of the plurality of slave nodes by the master processing node
into a master result and reporting the master result to a user
interface.
Inventors: |
Hunt; Herbert Dennis; (San
Francisco, CA) ; West; John Randall; (Sunnyvale,
CA) ; Gibbs; Marshall Ashby; (Clarendon Hills,
IL) ; Griglione; Bradley Michael; (Lake Zurich,
IL) ; Hudson; Gregory David Neil; (Riverside, IL)
; Basilico; Andrea; (Lomazzo, IT) ; Johnson; Arvid
C.; (Frankfort, IL) ; Bergeon; Cheryl G.;
(Arlington Heights, IL) ; Chapa; Craig Joseph;
(Lake Barrington, IL) ; Agostinelli; Alberto;
(Trezzo sull'Adda, IT) ; Yusko; Jay Alan;
(Lombard, IL) ; Mason; Trevor; (Bolingbrook,
IL) ; Liu; Ting; (Downers Grove, IL) |
Correspondence
Address: |
STRATEGIC PATENTS P.C..
C/O PORTFOLIOIP, P.O. BOX 52050
MINNEAPOLIS
MN
55402
US
|
Family ID: |
46331842 |
Appl. No.: |
12/023267 |
Filed: |
January 31, 2008 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
12021263 |
Jan 28, 2008 |
|
|
|
12023267 |
|
|
|
|
60886798 |
Jan 26, 2007 |
|
|
|
60886801 |
Jan 26, 2007 |
|
|
|
60887573 |
Jan 31, 2007 |
|
|
|
60891508 |
Feb 24, 2007 |
|
|
|
60891936 |
Feb 27, 2007 |
|
|
|
60952898 |
Jul 31, 2007 |
|
|
|
Current U.S.
Class: |
1/1 ;
707/999.001; 707/E17.009 |
Current CPC
Class: |
G06Q 30/02 20130101;
G06F 16/24556 20190101; G06F 16/2453 20190101 |
Class at
Publication: |
707/1 ;
707/E17.009 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A method comprising: receiving an aggregated dataset, wherein
the aggregated dataset includes data from a panel data source, a
fact data source, and a dimension data source that have been
associated with a standard population database; storing the
aggregated data in a partition within a partitioned database,
wherein the partition is associated with a data characteristic;
associating a master processing node with a plurality of slave
nodes, wherein each of the plurality of slave nodes is associated
with a partition of the partitioned database; submitting an
analytic query to the master processing node; assigning analytic
processing to at least one of the plurality of slave nodes by the
master processing node, wherein the assignment is based at least in
part on the association of the partition with the data
characteristic; and reading the aggregated data from the
partitioned database by the assigned slave node; analyzing the
aggregated data by the assigned slave node, wherein the analysis
produces a result at each slave node; combining the results from
each of the plurality of slave nodes by the master processing node
into a master result; and reporting the master result to a user
interface.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of the following U.S.
provisional applications: App. No. 60/887,573 filed on Jan. 31,
2007 and entitled "Analytic Platform," App. No. 60/891,508 filed on
Feb. 24, 2007 and entitled "Analytic Platform," App. No. 60/891,936
filed on Feb. 27, 2007 and entitled "Analytic Platform," App. No.
60/952,898 filed on Jul. 31, 2007 and entitled "Analytic
Platform."
[0002] This application is a continuation-in-part of U.S.
application Ser. No. 12/021,263 filed on Jan. 28, 2008 and entitled
"Associating a Granting Matrix with an Analytic Platform", which
claims the benefit of the following U.S. provisional applications:
App. No. 60/886,798 filed on Jan. 26, 2007 and entitled "A Method
of Aggregating Data," App. No. 60/886,801 filed on Jan. 26, 2007
and entitled "Utilizing Aggregated Data."
[0003] Each of the above applications is incorporated by reference
herein in its entirety.
BACKGROUND
[0004] 1. Field
[0005] This invention relates to methods and systems for analyzing
data, and more particularly to methods and systems for aggregating,
projecting, and releasing data.
[0006] 2. Description of Related Art
[0007] Currently, there exists a large variety of data sources,
such as census data or movement data received from point-of-sale
terminals, sample data received from manual surveys, panel data
obtained from the inputs of consumers who are members of panels,
fact data relating to products, sales, and many other facts
associated with the sales and marketing efforts of an enterprise,
and dimension data relating to dimensions along which an enterprise
wishes to understand data, such as in order to analyze consumer
behaviors, to predict likely outcomes of decisions relating to an
enterprise's activities, and to project from sample sets of data to
a larger universe. Conventional methods of synthesizing,
aggregating, and exploring such a universe of data comprise
techniques such as OLAP, which fix aggregation points along the
dimensions of the universe in order to reduce the size and
complexity of unified information sets such as OLAP stars.
Exploration of the unified information sets can involve run-time
queries and query-time projections, both of which are constrained
in current methods by a priori decisions that must be made to
project and aggregate the universe of data. In practice, going back
and changing the a priori decisions can lift these constraints, but
this requires an arduous and computationally complex restructuring
and reprocessing of data.
[0008] According to current business practices, unified information
sets and results drawn from such information sets can be released
to third parties according to so-called "releasability" rules.
Theses rules might apply to any and all of the data from which the
unified information sets are drawn, the dimensions (or points or
ranges along the dimensions), the third party (or members or
sub-organizations of the third party), and so on. Given this, there
can be a complex interaction between the data, the dimensions, the
third party, the releasability rules, the levels along the
dimensions at which aggregations are performed, the information
that is drawn from the unified information sets, and so on. In
practice, configuring a system to apply the releasability rules is
an error-prone process that requires extensive manual set up and
results in a brittle mechanism that cannot adapt to on-the-fly
changes in data, dimensions, third parties, rules, aggregations,
projections, user queries, and so on.
[0009] Various projection methodologies are known in the art. Still
other projection methodologies are subjects of the present
invention. In any case, different projection methodologies provide
outputs that have different statistical qualities. Analysts are
interested in specifying the statistical qualities of the outputs
at query-time. In practice, however, the universe of data and the
projection methodologies that are applied to it are what drive the
statistical qualities. Existing methods allow an analyst to choose
a projection methodology and thereby affect the statistical
qualities of the output, but this does not satisfy the analyst's
desire to directly dictate the statistical qualities.
[0010] Information systems are a significant bottle neck for market
analysis activities. The architecture of information systems is
often not designed to provide on-demand flexible access,
integration at a very granular level, or many other critical
capabilities necessary to support growth. Thus, information systems
are counter-productive to growth. Hundreds of market and consumer
databases make it very difficult to manage or integrate data. For
example, there may be a separate database for each data source,
hierarchy, and other data characteristics relevant to market
analysis. Different market views and product hierarchies
proliferate among manufacturers and retailers. Restatements of data
hierarchies waste precious time and are very expensive. Navigation
from among views of data, such as from global views to regional to
neighborhood to store views is virtually impossible, because there
are different hierarchies used to store data from global to region
to neighborhood to store-level data. Analyses and insights often
take weeks or months, or they are never produced. Insights are
often sub-optimal because of silo-driven, narrowly defined, ad hoc
analysis projects. Reflecting the ad hoc nature of these analytic
projects are the analytic tools and infrastructure developed to
support them. Currently, market analysis, business intelligence,
and the like often use rigid data cubes that may include hundreds
of databases that are impossible to integrate. These systems may
include hundreds of views, hierarchies, clusters, and so forth,
each of which is associated with its own rigid data cube. This may
make it almost impossible to navigate from global uses that are
used, for example, to develop overall company strategy, down to
specific program implementation or customer-driven uses. These ad
hoc analytic tools and infrastructure are fragmented and
disconnected.
[0011] In sum, there are many problems associated with the data
used for market analysis, and there is a need for a flexible,
extendable analytic platform, the architecture for which is
designed to support a broad array of evolving market analysis
needs. Furthermore, there is a need for better business
intelligence in order to accelerate revenue growth, make business
intelligence more customer-driven, to gain insights about markets
in a more timely fashion, and a need for data projection and
release methods and systems that provide improved dimensional
flexibility, reduced query-time computational complexity, automatic
selection and blending of projection methodologies, and flexibly
applied releasability rules.
SUMMARY
[0012] In embodiments, systems and methods may involve using a
platform as disclosed herein for applications described herein
where the systems and methods involve receiving a aggregated
dataset, wherein the aggregated dataset includes data from a panel
data source, a fact data source, and a dimension data source that
have been associated with a standard population database. The
process may also involve storing the aggregated data in a partition
within a partitioned database, wherein the partition is associated
with a data characteristic. The process may also involve
associating a master processing node with a plurality of slave
nodes, wherein each of the plurality of slave nodes is associated
with a partition of the partitioned database. The process may also
involve submitting an analytic query to the master processing node.
The process may also involve assigning analytic processing to at
least one of the plurality of slave nodes by the master processing
node, wherein the assignment is based at least in part on the
association of the partition with the data characteristic. The
process may also involve reading the aggregated data from the
partitioned database by the assigned slave node. The process may
also involve analyzing the aggregated data by the assigned slave
node, wherein the analysis produces a result at each slave node.
The process may also involve combining the results from each of the
plurality of slave nodes by the master processing node into a
master result 4120 and reporting the master result to a user
interface.
[0013] These and other systems, methods, objects, features, and
advantages of the present invention will be apparent to those
skilled in the art from the following detailed description of the
preferred embodiment and the drawings. Capitalized terms used
herein (such as relating to titles of data objects, tables, or the
like) should be understood to encompass other similar content or
features performing similar functions, except where the context
specifically limits such terms to the use herein.
BRIEF DESCRIPTION OF THE FIGURES
[0014] The invention and the following detailed description of
certain embodiments thereof may be understood by reference to the
following figures:
[0015] FIG. 1 illustrates an analytic platform for performing data
analysis.
[0016] FIG. 2 depicts cluster processing of an aggregated
dataset.
DETAILED DESCRIPTION
[0017] Referring to FIG. 1, the methods and systems disclosed
herein are related to improved methods for handling and using data
and metadata for the benefit of an enterprise. An analytic platform
100 may support and include such improved methods and systems. The
analytic platform 100 may include, in certain embodiments, a range
of hardware systems, software modules, data storage facilities,
application programming interfaces, human-readable interfaces, and
methodologies, as well as a range of applications, solutions,
products, and methods that use various outputs of the analytic
platform 100, as more particularly detailed herein, other
embodiments of which would be understood by one of ordinary skill
in the art and are encompassed herein. Among other components, the
analytic platform 100 includes methods and systems for providing
various representations of data and metadata, methodologies for
acting on data and metadata, an analytic engine, and a data
management facility that is capable of handling disaggregated data
and performing aggregation, calculations, functions, and real-time
or quasi-real-time projections. In certain embodiments, the methods
and systems enable much more rapid and flexible manipulation of
data sets, so that certain calculations and projections can be done
in a fraction of the time as compared with older generation
systems.
[0018] In embodiments, data compression and aggregations of data,
such as fact data sources 102, and dimension data sources 104, may
be performed in conjunction with a user query such that the
aggregation dataset can be specifically generated in a form most
applicable for generating calculations and projections based on the
query. In embodiments, data compression and aggregations of data
may be done prior to, in anticipation of, and/or following a query.
In embodiments, an analytic platform 100 (described in more detail
below) may calculate projections and other solutions dynamically
and create hierarchical data structures with custom dimensions that
facilitate the analysis. Such methods and systems may be used to
process point-of-sale (POS) data, retail information, geography
information, causal information, survey information, census data
and other forms of data and forms of assessments of past
performance (e.g. estimating the past sales of a certain product
within a certain geographical region over a certain period of time)
or projections of future results (e.g. estimating the future or
expected sales of a certain product within a certain geographical
region over a certain period of time). In turn, various estimates
and projections can be used for various purposes of an enterprise,
such as relating to purchasing, supply chain management, handling
of inventory, pricing decisions, the planning of promotions,
marketing plans, financial reporting, and many others.
[0019] Referring still to FIG. 1 an analytic platform 100 is
illustrated that may be used to analyze and process data in a
disaggregated or aggregated format, including, without limitation,
dimension data defining the dimensions along which various items
are measured and factual data about the facts that are measured
with respect to the dimensions. Factual data may come from a wide
variety of sources and be of a wide range of types, such as
traditional periodic point-of-sale (POS) data, causal data (such as
data about activities of an enterprise, such as in-store
promotions, that are posited to cause changes in factual data),
household panel data, frequent shopper program information, daily,
weekly, or real time POS data, store database data, store list
files, stubs, dictionary data, product lists, as well as custom and
traditional audit data. Further extensions into transaction level
data, RFID data and data from non-retail industries may also be
processed according to the methods and systems described
herein.
[0020] In embodiments, a data loading facility 108 may be used to
extract data from available data sources and load them to or within
the analytic platform 100 for further storage, manipulation,
structuring, fusion, analysis, retrieval, querying and other uses.
The data loading facility 108 may have the a plurality of
responsibilities that may include eliminating data for
non-releasable items, providing correct venue group flags for a
venue group, feeding a core information matrix with relevant
information (such as and without limitation statistical metrics),
or the like. In an embodiment, the data loading facility 108
eliminate non-related items. Available data sources may include a
plurality of fact data sources 102 and a plurality of dimension
data sources 104. Fact data sources 102 may include, for example,
facts about sales volume, dollar sales, distribution, price, POS
data, loyalty card transaction files, sales audit files, retailer
sales data, and many other fact data sources 102 containing facts
about the sales of the enterprise, as well as causal facts, such as
facts about activities of the enterprise, in-store promotion
audits, electronic pricing and/or promotion files, feature ad
coding files, or others that tend to influence or cause changes in
sales or other events, such as facts about in-store promotions,
advertising, incentive programs, and the like. Other fact data
sources may include custom shelf audit files, shipment data files,
media data files, explanatory data (e.g., data regarding weather),
attitudinal data, or usage data. Dimension data sources 104 may
include information relating to any dimensions along which an
enterprise wishes to collect data, such as dimensions relating to
products sold (e.g. attribute data relating to the types of
products that are sold, such as data about UPC codes, product
hierarchies, categories, brands, sub-brands, SKUs and the like),
venue data (e.g. store, chain, region, country, etc.), time data
(e.g. day, week, quad-week, quarter, 12-week, etc.), geographic
data (including breakdowns of stores by city, state, region,
country or other geographic groupings), consumer or customer data
(e.g. household, individual, demographics, household groupings,
etc.), and other dimension data sources 104. While embodiments
disclosed herein relate primarily to the collection of sales and
marketing-related facts and the handling of dimensions related to
the sales and marketing activities of an enterprise, it should be
understood that the methods and systems disclosed herein may be
applied to facts of other types and to the handling of dimensions
of other types, such as facts and dimensions related to
manufacturing activities, financial activities, information
technology activities, media activities, supply chain management
activities, accounting activities, political activities,
contracting activities, and many others.
[0021] In an embodiment, the analytic platform 100 comprises a
combination of data, technologies, methods, and delivery mechanisms
brought together by an analytic engine. The analytic platform 100
may provide a novel approach to managing and integrating market and
enterprise information and enabling predictive analytics. The
analytic platform 100 may leverage approaches to representing and
storing the base data so that it may be consumed and delivered in
real-time, with flexibility and open integration. This
representation of the data, when combined with the analytic methods
and techniques, and a delivery infrastructure, may minimize the
processing time and cost and maximize the performance and value for
the end user. This technique may be applied to problems where there
may be a need to access integrated views across multiple data
sources, where there may be a large multi-dimensional data
repository against which there may be a need to rapidly and
accurately handle dynamic dimensionality requests, with appropriate
aggregations and projections, where there may be highly
personalized and flexible real-time reporting 190, analysis 192 and
forecasting capabilities required, where there may be a need to tie
seamlessly and on-the-fly with other enterprise applications 184
via web services 194 such as to receive a request with specific
dimensionality, apply appropriate calculation methods, perform and
deliver an outcome (e.g. dataset, coefficient, etc.), and the
like.
[0022] The analytic platform 100 may provide innovative solutions
to application partners, including on-demand pricing insights,
emerging category insights, product launch management, loyalty
insights, daily data out-of-stock insights, assortment planning,
on-demand audit groups, neighborhood insights, shopper insights,
health and wellness insights, consumer tracking and targeting, and
the like.
[0023] A decision framework may enable new revenue and competitive
advantages to application partners by brand building, product
innovation, consumer-centric retail execution, consumer and shopper
relationship management, and the like. Predictive planning and
optimization solutions, automated analytics and insight solutions,
and on-demand business performance reporting may be drawn from a
plurality of sources, such as InfoScan, total C-scan, daily data,
panel data, retailer direct data, SAP, consumer segmentation,
consumer demographics, FSP/loyalty data, data provided directly for
customers, or the like.
[0024] The analytic platform 100 may have advantages over more
traditional federation/consolidation approaches, requiring fewer
updates in a smaller portion of the process. The analytic platform
100 may support greater insight to users, and provide users with
more innovative applications. The analytic platform 100 may provide
a unified reporting and solutions framework, providing on-demand
and scheduled reports in a user dashboard with summary views and
graphical dial indicators, as well as flexible formatting options.
Benefits and products of the analytic platform 100 may include
non-additive measures for custom product groupings, elimination of
restatements to save significant time and effort, cross-category
visibility to spot emerging trends, provide a total market picture
for faster competitor analysis, provide granular data on demand to
view detailed retail performance, provide attribute driven analysis
for market insights, and the like.
[0025] The analytic capabilities of the present invention may
provide for on-demand projection, on-demand aggregation,
multi-source master data management, and the like. On-demand
projection may be derived directly for all possible geographies,
store and demographic attributes, per geography or category, with
built-in dynamic releasability controls, and the like. On-demand
aggregation may provide both additive and non-additive measures,
provide custom groups, provide cross-category or geography
analytics, and the like. Multi-source master data management may
provide management of dimension member catalogue and hierarchy
attributes, processing of raw fact data that may reduce
harmonization work to attribute matching, product and store
attributes stored relationally, with data that may be extended
independently of fact data, and used to create additional
dimensions, and the like.
[0026] In addition, the analytic platform 100 may provide
flexibility, while maintaining a structured user approach.
Flexibility may be realized with multiple hierarchies applied to
the same database, the ability to create new custom hierarchies and
views, rapid addition of new measures and dimensions, and the like.
The user may be provided a structured approach through publishing
and subscribing reports to a broader user base, by enabling
multiple user classes with different privileges, providing security
access, and the like. The user may also be provided with increased
performance and ease of use, through leading-edge hardware and
software, and web application for integrated analysis.
[0027] In embodiments, the data available within a fact data source
102 and a dimension data source 104 may be linked, such as through
the use of a key. For example, key-based fusion of fact 102 and
dimension data 104 may occur by using a key, such as using the
Abilitec Key software product offered by Acxiom, in order to fuse
multiple sources of data. For example, such a key can be used to
relate loyalty card data (e.g., Grocery Store 1 loyalty card,
Grocery Store 2 loyalty card, and Convenience Store 1 loyalty card)
that are available for a single customer, so that the fact data
from multiple sources can be used as a fused data source for
analysis on desirable dimensions. For example, an analyst might
wish to view time-series trends in the dollar sales allotted by the
customer to each store within a given product category.
[0028] In embodiments the data loading facility may comprise any of
a wide range of data loading facilities, including or using
suitable connectors, bridges, adaptors, extraction engines,
transformation engines, loading engines, data filtering facilities,
data cleansing facilities, data integration facilities, or the
like, of the type known to those of ordinary skill in the art. In
various embodiments, there are many situations where a store will
provide POS data and causal information relating to its store. For
example, the POS data may be automatically transmitted to the facts
database after the sales information has been collected at the
stores POS terminals. The same store may also provide information
about how it promoted certain products, its store or the like. This
data may be stored in another database; however, this causal
information may provide one with insight on recent sales activities
so it may be used in later sales assessments or forecasts.
Similarly, a manufacturer may load product attribute data into yet
another database and this data may also be accessible for sales
assessment or projection analysis. For example, when making such
analysis one may be interested in knowing what categories of
products sold well or what brand sold well. In this case, the
causal store information may be aggregated with the POS data and
dimension data corresponding to the products referred to in the POS
data. With this aggregation of information one can make an analysis
on any of the related data.
[0029] Referring still to FIG. 1, data that is obtained by the data
loading facility 108 may be transferred to a plurality of
facilities within the analytic platform 100, including the data
mart 114. In embodiments the data loading facility 108 may contain
one or more interfaces 182 by which the data loaded by the data
loading facility 108 may interact with or be used by other
facilities within the platform 100 or external to the platform.
Interfaces to the data loading facility 108 may include
human-readable user interfaces, application programming interfaces
(APIs), registries or similar facilities suitable for providing
interfaces to services in a services oriented architecture,
connectors, bridges, adaptors, bindings, protocols, message
brokers, extraction facilities, transformation facilities, loading
facilities and other data integration facilities suitable for
allowing various other entities to interact with the data loading
facility 108. The interfaces 182 may support interactions with the
data loading facility 108 by applications 184, solutions 188,
reporting facilities 190, analyses facilities 192, services 194 or
other entities, external to or internal to an enterprise. In
embodiments these interfaces are associated with interfaces 182 to
the platform 100, but in other embodiments direct interfaces may
exist to the data loading facility 108, either by other components
of the platform 100, or by external entities.
[0030] Referring still to FIG. 1, in embodiments the data mart
facility 114 may be used to store data loaded from the data loading
facility 108 and to make the data loaded from the data loading
facility 108 available to various other entities in or external to
the platform 100 in a convenient format. Within the data mart 114
facilities may be present to further store, manipulate, structure,
subset, merge, join, fuse, or perform a wide range of data
structuring and manipulation activities. The data mart facility 114
may also allow storage, manipulation and retrieval of metadata, and
perform activities on metadata similar to those disclosed with
respect to data. Thus, the data mart facility 114 may allow storage
of data and metadata about facts (including sales facts, causal
facts, and the like) and dimension data, as well as other relevant
data and metadata. In embodiments, the data mart facility 114 may
compress the data and/or create summaries in order to facilitate
faster processing by other of the applications 184 within the
platform 100 (e.g. the analytic server 134). In embodiments the
data mart facility 114 may include various methods, components,
modules, systems, sub-systems, features or facilities associated
with data and metadata.
[0031] In certain embodiments the data mart facility 114 may
contain one or more interfaces 182 (not shown on FIG. 1), by which
the data loaded by the data mart facility 114 may interact with or
be used by other facilities within the platform 100 or external to
the platform. Interfaces to the data mart facility 114 may include
human-readable user interfaces, application programming interfaces
(APIs), registries or similar facilities suitable for providing
interfaces to services in a services oriented architecture,
connectors, bridges, adaptors, bindings, protocols, message
brokers, extraction facilities, transformation facilities, loading
facilities and other data integration facilities suitable for
allowing various other entities to interact with the data mart
facility 114. These interfaces may comprise interfaces 182 to the
platform 100 as a whole, or may be interfaces associated directly
with the data mart facility 114 itself, such as for access from
other components of the platform 100 or for access by external
entities directly to the data mart facility 114. The interfaces 182
may support interactions with the data mart facility 114 by
applications 184, solutions 188, reporting facilities 190, analyses
facilities 192, services 194 (each of which is describe in greater
detail herein) or other entities, external to or internal to an
enterprise.
[0032] In certain optional embodiments, the security facility 118
may be any hardware or software implementation, process, procedure,
or protocol that may be used to block, limit, filter or alter
access to the data mart facility 114, and/or any of the facilities
within the data mart facility 114, by a human operator, a group of
operators, an organization, software program, bot, virus, or some
other entity or program. The security facility 118 may include a
firewall, an anti-virus facility, a facility for managing
permission to store, manipulate and/or retrieve data or metadata, a
conditional access facility, a logging facility, a tracking
facility, a reporting facility, an asset management facility, an
intrusion-detection facility, an intrusion-prevention facility or
other suitable security facility.
[0033] Still referring to FIG. 1, the analytic platform 100 may
include an analytic engine 134. The analytic engine 134 may be used
to build and deploy analytic applications or solutions or undertake
analytic methods based upon the use of a plurality of data sources
and data types. Among other things, the analytic engine 134 may
perform a wide range of calculations and data manipulation steps
necessary to apply models, such as mathematical and economic
models, to sets of data, including fact data, dimension data, and
metadata. The analytic engine 134 may be associated with an
interface 182, such as any of the interfaces described herein.
[0034] The analytic engine 134 may interact with a model storage
facility 148, which may be any facility for generating models used
in the analysis of sets of data, such as economic models,
econometric models, forecasting models, decision support models,
estimation models, projection models, and many others. In
embodiments output from the analytic engine 134 may be used to
condition or refine models in the model storage 148; thus, there
may be a feedback loop between the two, where calculations in the
analytic engine 134 are used to refine models managed by the model
storage facility 148.
[0035] In embodiments, a security facility 138 of the analytic
engine 134 may be the same or similar to the security facility 118
associated with the data mart facility 114, as described herein.
Alternatively, the security facility 138 associated with the
analytic engine 134 may have features and rules that are
specifically designed to operate within the analytic engine
134.
[0036] As illustrated in FIG. 1, the analytic platform 100 may
contain a master data management hub 150 (MDMH). In embodiments the
MDMH 150 may serve as a central facility for handling dimension
data used within the analytic platform 100, such as data about
products, stores, venues, geographies, time periods and the like,
as well as various other dimensions relating to or associated with
the data and metadata types in the data sources 102, 104, the data
loading facility 108, the data mart facility 114, the analytic
engine 134, the model storage facility 148 or various applications,
184, solutions 188, reporting facilities 190, analytic facilities
192 or services 194 that interact with the analytic platform 100.
The MDMH 150 may in embodiments include a security facility 152, an
interface 158, a data loader 160, a data manipulation and
structuring facility 162, and one or more staging tables 164. The
data loader 160 may be used to receive data. Data may enter the
MDMH from various sources, such as from the data mart 114 after the
data mart 114 completes its intended processing of the information
and data that it received as described herein. Data may also enter
the MDMH 150 through a user interface 158, such as an API or a
human user interface, web browser or some other interface, of any
of the types disclosed herein or in the documents incorporated by
reference herein. The user interface 158 may be deployed on a
client device, such as a PDA, personal computer, laptop computer,
cellular phone, or some other client device capable of handling
data. In embodiments, the staging tables 164 may be included in the
MDMH 150.
[0037] In embodiments, a matching facility 180 may be associated
with the MDMH 150. The matching facility 180 may receive an input
data hierarchy within the MDMH 150 and analyze the characteristics
of the hierarchy and select a set of attributes that are salient to
a particular analytic interest (e.g., product selection by a type
of consumer, product sales by a type of venue, and so forth). The
matching facility 180 may select primary attributes, match
attributes, associate attributes, block attributes and prioritize
the attributes. The matching facility 180 may associate each
attribute with a weight and define a set of probabilistic weights.
The probabilistic weights may be the probability of a match or a
non-match, or thresholds of a match or non-match that is associated
with an analytic purpose (e.g., product purchase). The
probabilistic weights may then be used in an algorithm that is run
within a probabilistic matching engine (e.g., IBM QualityStage).
The output of the matching engine may provide information on, for
example, other products which are appropriate to include in a data
hierarchy, the untapped market (i.e. other venues) in which a
product is probabilistically more likely to sell well, and so
forth. In embodiments, the matching facility 180 may be used to
generate projections of what types of products, people, customers,
retailers, stores, store departments, etc. are similar in nature
and therefore they may be appropriate to combine in a projection or
an assessment.
[0038] As illustrated in FIG. 1, the analytic platform 100 may
include a projection facility 178. A projection facility 178 may be
used to produce projections, whereby a partial data set (such as
data from a subset of stores of a chain) is projected to a universe
(such as all of the stores in a chain), by applying appropriate
weights to the data in the partial data set. A wide range of
potential projection methodologies exist, including cell-based
methodologies, store matrix methodologies, iterative proportional
fitting methodologies, virtual census methodologies, and others.
The methodologies can be used to generate projection factors. As to
any given projection, there is typically a tradeoff among various
statistical quality measurements associated with that type of
projection. Some projections are more accurate than others, while
some are more consistent, have less spillage, are more closely
calibrated, or have other attributes that make them relatively more
or less desirable depending on how the output of the projection is
likely to be used. In embodiments of the platform 100, the
projection facility 178 takes dimension information from the MDMH
150 or from another source and provides a set of projection
weightings along the applicable dimensions, typically reflected in
a matrix of projection weights, which can be applied at the data
mart facility 114 to a partial data set in order to render a
projected data set. The projection facility 178 may have an
interface 182 of any of the types disclosed herein.
[0039] As shown in FIG. 1, an interface 182 may be included in the
analytic platform 100. In embodiments, data may be transferred to
the MDMH 150 of the platform 100 using a user interface 182. The
interface 182 may be a web browser operating over the Internet or
within an intranet or other network, it may be an analytic engine
134, an application plug-in, or some other user interface that is
capable of handling data. The interface 182 may be human readable
or may consist of one or more application programming interfaces,
or it may include various connectors, adaptors, bridges, services,
transformation facilities, extraction facilities, loading
facilities, bindings, couplings, or other data integration
facilities, including any such facilities described herein or in
documents incorporated by reference herein.
[0040] As illustrated in FIG. 1, the platform 100 may interact with
a variety of applications 184, solutions 188, reporting facilities
190, analytic facilities 192 and services 194, such as web
services, or with other platforms or systems of an enterprise or
external to an enterprise. Any such applications 184, solutions
188, reporting facilities 190, analytic facilities 192 and services
194 may interact with the platform 100 in a variety of ways, such
as providing input to the platform 100 (such as data, metadata,
dimension information, models, projections, or the like), taking
output from the platform 100 (such as data, metadata, projection
information, information about similarities, analytic output,
output from calculations, or the like), modifying the platform 100
(including in a feedback or iterative loop), being modified by the
platform 100 (again optionally in a feedback or iterative loop), or
the like.
[0041] In embodiments one or more applications 184 or solutions 188
may interact with the platform 100 via an interface 182.
Applications 184 and solutions 188 may include applications and
solutions (consisting of a combination of hardware, software and
methods, among other components) that relate to planning the sales
and marketing activities of an enterprise, decision support
applications, financial reporting applications, applications
relating to strategic planning, enterprise dashboard applications,
supply chain management applications, inventory management and
ordering applications, manufacturing applications, customer
relationship management applications, information technology
applications, applications relating to purchasing, applications
relating to pricing, promotion, positioning, placement and
products, and a wide range of other applications and solutions.
[0042] In embodiments, applications 184 and solutions 188 may
include analytic output that is organized around a topic area. For
example, the organizing principle of an application 184 or a
solution 188 may be a new product introduction. Manufacturers may
release thousands of new products each year. It may be useful for
an analytic platform 100 to be able to group analysis around the
topic area, such as new products, and organize a bundle of analyses
and workflows that are presented as an application 184 or solution
188. Applications 184 and solutions 188 may incorporate planning
information, forecasting information, "what if?" scenario
capability, and other analytic features. Applications 184 and
solutions 188 may be associated with web services 194 that enable
users within a client's organization to access and work with the
applications 184 and solutions 188.
[0043] In embodiments, the analytic platform 100 may facilitate
delivering information to external applications 184. This may
include providing data or analytic results to certain classes of
applications 184. For example and without limitation, an
application may include enterprise resource planning/backbone
applications 184 such as SAP, including those applications 184
focused on Marketing, Sales & Operations Planning and Supply
Chain Management. In another example, an application may include
business intelligence applications 184, including those
applications 184 that may apply data mining techniques. In another
example, an application may include customer relationship
management applications 184, including customer sales force
applications 184. In another example, an application may include
specialty applications 184 such as a price or SKU optimization
application. The analytic platform 100 may facilitate supply chain
efficiency applications 184. For example and without limitation, an
application may include supply chain models based on sales out
(POS/FSP) rather than sales in (Shipments). In another example, an
application may include RFID based supply chain management. In
another example, an application may include a retailer co-op to
enable partnership with a distributor who may manage collective
stock and distribution services. The analytic platform 100 may be
applied to industries characterized by large multi-dimensional data
structures. This may include industries such as telecommunications,
elections and polling, and the like. The analytic platform 100 may
be applied to opportunities to vend large amounts of data through a
portal with the possibility to deliver highly customized views for
individual users with effectively controlled user accessibility
rights. This may include collaborative groups such as insurance
brokers, real estate agents, and the like. The analytic platform
100 may be applied to applications 184 requiring self monitoring of
critical coefficients and parameters. Such applications 184 may
rely on constant updating of statistical models, such as financial
models, with real-time flows of data and ongoing re-calibration and
optimization. The analytic platform 100 may be applied to
applications 184 that require breaking apart and recombining
geographies and territories at will.
[0044] In embodiments, a data field may be dynamically altered to
conform to a bit size or some other desired format. A record of the
dynamic alteration may be tracked by the analytic platform 100 and
stored in a database that may be accessed by other facilities of
the analytic platform 100. In an example, a data field may relate
to sales data. In order to, in part, reduce the processing time
required to utilize the sales data as part of an analysis, the
sales data field may be dynamically altered to conform to a desired
bit size of, for example, 6 bits. Once this alteration is made, a
record may be stored indicating that each sales datum in the sales
field is a datum of 6 bits. Upon making an analytic query involving
the sales field (e.g., "compute average sales by store") the query
may communicate with the stored data indicating the dynamic
alteration of sales data to a 6 bit size format. With this
information, the analytic query may process and analyze the sales
data by reading the sales field in 6 bit units. This process may
remove the need for the sales data to be associated with a header
and/or footer indicating how the sales data is to be read and
processed. As a result, processing speed may be increased.
[0045] In embodiments, the MDMH 150 may be associated with a
partitioned database. The MDMH 150 may be further associated with a
master cluster node that is, in turn, associated with a plurality
of slave cluster nodes. Each partition of the partitioned database
may be associated with a slave cluster node or a plurality of slave
cluster nodes. Each slave cluster node may be associated with a
mirror slave cluster node. The mirror slave cluster node may be
used in the event of a node failure of the slave cluster node to
which it is assigned to mirror. In an example, data, such as sales
data, may enter the analytic platform 100 using a data loading
facility 108. The sales data may be loaded with the causal fact
extractor 110 and processed into a data mart 114 which may store
the sales data within a partitioned database. In an alternate
embodiment, the sales data mart may be processed by the MDMH 150
and the MDMH 150 used to create a portioned sales database. In this
simplified example, the partitioned sales database may have two
partitions, Partition One and Partition Two, each associated with
one of the two stores for which sales data are available. Partition
One may be associated with Slave Cluster Node One. Partition Two
may be associated with Slave Cluster Node Two. Each slave cluster
node may, in turn, be associated with a slave cluster node mirror
that is associated with the same database partition as the slave
cluster node to which it is a mirror. The MDMH 150 and the master
cluster node may store and/or have access to stored data indicating
the associations among the database partitions and the slave
cluster nodes. In an example, upon receipt of an analytic query to
summarize sales data for Store One, the master cluster node may
command the Slave Cluster Node One (which is associated with the
Store One sales data that is stored in Partition One) to process
Store One's sales data. This command from the master cluster node
may be associated with information relating to dynamic alterations
that have been performed on the stored data (e.g., the bit size of
each stored datum) to enable the slave node to accurately read the
sales data during analysis. Similarly, the analysis may take place
on a plurality of slave cluster nodes, each of which is associated
with a database partition or plurality of database partitions.
[0046] In embodiments, the partitioned database may be updated as
new data become available. The update may be made on the fly, at a
set interval, or according to some other criteria.
[0047] In embodiments, the cluster-based processing may be
associated with bitmap compression techniques, including
word-aligned hybrid (WAH) code compression. In an example, WAH
compression may be used to increase cluster processing speed by
using run-length encoding for long sequences of identical bits and
encoding/decoding bitmaps in word size groupings in order to reduce
their computational complexity.
[0048] In embodiments, failover clusters may be implemented for the
purpose of improving the availability of services which a cluster
provides. Failover clusters may operate using redundant nodes,
which may be used to provide service when system components fail.
Failover cluster implementations may manage the redundancy inherent
in a cluster to minimize the impact of single points of failure. In
embodiments, load-balancing clusters may operate by having all
workload come through one or more load-balancing front ends, which
then distribute it to a collection of back end servers. Such a
cluster of computers is sometimes referred to as a server farm. In
embodiments, high-performance clusters may be implemented to
provide increased performance by splitting a computational task
across many different nodes in the cluster. Such clusters commonly
run custom programs which have been designed to exploit the
parallelism available on high-performance clusters.
High-performance clusters are optimized for workloads which require
jobs or processes happening on the separate cluster computer nodes
to communicate actively during the computation. These include
computations where intermediate results from one node's
calculations will affect future calculations on other nodes.
[0049] Message passing interface (MPI) refers to a
language-independent computer communications descriptive
application programming interface (API) for message-passing on a
parallel computer. MPI has defined semantics and flexible
interpretations; it does not define the protocol by which these
operations are to be performed in the sense of sockets for TCP/IP
or other layer-4 and below models in the ISO/OSI Reference Model.
It is consequently a layer-5+ type set of interfaces, although
implementations can cover most layers of the reference model, with
sockets+TCP/IP as a common transport used inside the
implementation. MPI's goals are high performance, scalability, and
portability. It may express parallelism explicitly rather than
implicitly. MPI is a de facto standard for communication among the
processes modeling a parallel program on a distributed memory
system. Often these programs are mapped to clusters, actual
distributed memory supercomputers, and to other environments.
However, the principal MPI-1 model has no shared memory concept,
and MPI-2 has only a limited distributed shared memory concept used
in one portion of that set of extensions.
[0050] In embodiments, the analytic server may use ODBC to connect
to a data server.
[0051] An ODBC library may use socket communication through the
socket library to communicate with the data server. The data server
may be cluster-based in order to distribute the data server
processing. A socket communication library may reside on the data
server. In an embodiment, the data server may pass information to a
SQL parser module. In an embodiment, Gnu Flex and/or Bison may used
to generate a Lexer and parser.
[0052] In embodiments, a master node and multiple slave nodes may
be used in a cluster framework. A master node may obtain the SQL
code by ODBC sockets and forward it to a parser to interpret the
SQL sequence. Once the server has received SQL as part of a query
request, MPI may be used to distribute the server request to slave
nodes for processing. In embodiments, a bitvector implementation
may be used.
[0053] In embodiments, retrieval may be facilitated based at least
in part on representing the data as efficiently as possible. This
efficiency may enable the data to be kept in memory as an in-memory
database. In order to facilitate the process, data structures may
be used that are small enough that they may be stored in memory. In
an example, unlike a relational database, multiple record types may
be used to allow minimizing the data size so that it may be kept in
memory within a hardware implementation. Keeping the data within a
hardware implementation may have the additional advantage of
reducing the expense of the system. In embodiments, the cluster
system may fit modestly sized hardware nodes with modest amounts of
memory. This may keep the data near the CPU, so that one mustn't
use file-based I/O. Data that is in the regular system memory may
be directly accessed by the CPU.
[0054] In embodiments, a distribution hash key may be used to
divide the data among the nodes.
[0055] In embodiments, the data may be partitioned by one
dimension. In an example, an analyst may want to analyze a set of
retail store data looking at which products are selling, taking
into account the size of the store revenue in which they are sold.
Store One may have $10 M in revenue, Store Two $20 M, and Store
Three $30 M. In this example, the analytic goal is to determine how
well a brand of cola is selling relative to the size of the store
in which it is sold. To accomplish this, one may analyze the total
potential size and figure out how well a product is selling
relative to the whole. However, this may be difficult because one
may have to look across multiple time periods in which the product
may be selling multiple times but only count it once. The use of a
distinct sum or count operator may be expensive, especially in
something that is in millions of records. Instead, this data may be
partitioned by "venue" so that a venue only exists on one of the
processing nodes. If all of a venue's data is processed on a unique
node there is a reduced risk of double-counting, as the data only
reside in a single location. On the other hand, if the data are
distributed by venue and some other key, one might have data for
the same venue located in multiple places. By partitioning by venue
and associating each venue with an independent node, the venues may
be added on the master node.
[0056] In embodiments, partitioning may be done within each node by
certain dimensions in order to more efficiently access those data
according to which data dimensions clients have used in the past.
For example, data may be partitioned by venue and time, so that on
any given processing node it is relatively easy to access
particular sets of information based on venue and time dimensions.
In embodiments, partitioning may be used as an implicit indexing
method. This may simplify the process of analyzing wanted data
without having to build an actual index.
[0057] In embodiments, cluster processing may be dynamically
configurable to accommodate increases and/or reductions in the
number of nodes that are used.
[0058] In embodiments, cluster processing may have failover
processes that may re-enable a cluster by having a node take on the
function of another node that has failed
[0059] In embodiments, a threading model may be used for
inter-processing communication between the nodes and the master.
Posix threads may be used in combination with an MPI. In
embodiments, multiple threads may run with one logical process and
with separate physical processes running on different machines. A
thread model may form the backbone of communication between
processing elements. In an example, if there is a master and two
slaves, there may be one physical process on the master and one on
each slave node. An inbound SQL request may come into the master
node and be intercepted by a thread that is using a socket. The
thread may transmit to a master thread running on each slave
process that creates threads that do actual analysis and, in turn,
communicate to a listener thread on the master that passes
information to a collator thread on the master. A new series of
threads may be created for new thread arrival. The listener threads
may be designed to look for information from a specific slave
source. If a query comes into the system, a new collator thread may
be created, a new worker thread created in each slave node, and
information sent from each slave node to a listener on the master
that passes information to the collator thread created for that
query. The collator thread may then pass information back through
the socket to the ODBC client. In embodiments, this system may be
scalable. For every slave that is created, the system may create a
new listener thread for that code.
[0060] In embodiments, inter-server communication may be done
through MPI. Data server and client communication may be conducted
using regular sockets. Each server may have data (its partition of
information), so that each of the servers knows what information
for which it is responsible. The collator may collate the partial
results into a final result set.
[0061] In an example, ODBC may pass to a master node and a master
thread in the master node's process. The SQL query may be
translated into something the server can understand. Next, the
master node may pass a thread to all nodes as part of a Query One.
The first node may retrieve Store One data, and may add up a
partial result and creates a data tuple that it communicates back
to the listener for that slave node. The Second Node may do the
same thing and communicate with its listener. Nodes with only Store
Two (as opposed to Store One data) may do nothing. At the master
node, the collator may add up the results from the two relevant
listeners' results. Next, through socket communication, it may
communicate the result through ODBC communication to the client.
After that is accomplished, the collator thread and worker threads
that performed the retrieval may be omitted. In embodiments, these
transient threads may be associated with and used for a particular
query.
[0062] In embodiments, a normalization scheme may be used in order
to minimize the size of internal data structures.
[0063] An aspect of the present invention relates to cluster
processing of an aggregated dataset. As will be explained in more
detail in FIG. 2, a logical process 4100 may be used to for
processing the aggregated dataset in clusters.
[0064] The present invention illustrates the processing of the
aggregated data. In FIG. 1, a fact data source 102 and a dimension
data source 104 may be linked through a key. The fact data source
102 from multiple data source can be used as an aggregated data
source for analysis on desirable dimensions. For example, an
analyst might wish to view time-series trends in the dollar sales
allotted by the customer to each store within a given product
category.
[0065] In embodiments, referring to FIG. 2, systems and methods may
involve using a platform as disclosed herein for applications
described herein where the systems and methods involve receiving a
aggregated dataset, wherein the aggregated dataset includes data
from a panel data source, a fact data source, and a dimension data
source that have been associated with a standard population
database 4102. The process may also involve storing the aggregated
data in a partition within a partitioned database, wherein the
partition is associated with a data characteristic 4104. The
process may also involve associating a master processing node with
a plurality of slave nodes, wherein each of the plurality of slave
nodes is associated with a partition of the partitioned database
4108. The process may also involve submitting an analytic query to
the master processing node 4110. The process may also involve
assigning analytic processing to at least one of the plurality of
slave nodes by the master processing node, wherein the assignment
is based at least in part on the association of the partition with
the data characteristic 4112. The process may also involve reading
the aggregated data from the partitioned database by the assigned
slave node 4114. The process may also involve analyzing the
aggregated data by the assigned slave node, wherein the analysis
produces a result at each slave node 4118. The process may also
involve combining the results from each of the plurality of slave
nodes by the master processing node into a master result 4120 and
reporting the master result to a user interface 4122.
[0066] The elements depicted in flow charts and block diagrams
throughout the figures imply logical boundaries between the
elements. However, according to software or hardware engineering
practices, the depicted elements and the functions thereof may be
implemented as parts of a monolithic software structure, as
standalone software modules, or as modules that employ external
routines, code, services, and so forth, or any combination of
these, and all such implementations are within the scope of the
present disclosure. Thus, while the foregoing drawings and
description set forth functional aspects of the disclosed systems,
no particular arrangement of software for implementing these
functional aspects should be inferred from these descriptions
unless explicitly stated or otherwise clear from the context.
[0067] Similarly, it will be appreciated that the various steps
identified and described above may be varied, and that the order of
steps may be adapted to particular applications of the techniques
disclosed herein. All such variations and modifications are
intended to fall within the scope of this disclosure. As such, the
depiction and/or description of an order for various steps should
not be understood to require a particular order of execution for
those steps, unless required by a particular application, or
explicitly stated or otherwise clear from the context.
[0068] The methods or processes described above, and steps thereof,
may be realized in hardware, software, or any combination of these
suitable for a particular application. The hardware may include a
general-purpose computer and/or dedicated computing device. The
processes may be realized in one or more microprocessors,
microcontrollers, embedded microcontrollers, programmable digital
signal processors or other programmable device, along with internal
and/or external memory. The processes may also, or instead, be
embodied in an application specific integrated circuit, a
programmable gate array, programmable array logic, or any other
device or combination of devices that may be configured to process
electronic signals. It will further be appreciated that one or more
of the processes may be realized as computer executable code
created using a structured programming language such as C, an
object oriented programming language such as C++, or any other
high-level or low-level programming language (including assembly
languages, hardware description languages, and database programming
languages and technologies) that may be stored, compiled or
interpreted to run on one of the above devices, as well as
heterogeneous combinations of processors, processor architectures,
or combinations of different hardware and software.
[0069] Thus, in one aspect, each method described above and
combinations thereof may be embodied in computer executable code
that, when executing on one or more computing devices, performs the
steps thereof. In another aspect, the methods may be embodied in
systems that perform the steps thereof, and may be distributed
across devices in a number of ways, or all of the functionality may
be integrated into a dedicated, standalone device or other
hardware. In another aspect, means for performing the steps
associated with the processes described above may include any of
the hardware and/or software described above. All such permutations
and combinations are intended to fall within the scope of the
present disclosure.
[0070] While the invention has been disclosed in connection with
the preferred embodiments shown and described in detail, various
modifications and improvements thereon will become readily apparent
to those skilled in the art. Accordingly, the spirit and scope of
the present invention is not to be limited by the foregoing
examples, but is to be understood in the broadest sense allowable
by law.
[0071] All documents referenced herein are hereby incorporated by
reference.
* * * * *