U.S. patent application number 14/329,591 was filed with the patent office on 2014-07-11 and published on 2014-10-30 as publication number 20140324935 for "Matrix Computation Framework." The applicant listed for this patent is Microsoft Corporation. The invention is credited to Xiuwei Chen, Zhengping Qian, Yuan Yu, and Zheng Zhang.

Application Number: 14/329,591
Publication Number: 20140324935
Family ID: 47142707
Filed Date: 2014-07-11
Publication Date: 2014-10-30

United States Patent Application 20140324935
Kind Code: A1
Zhang; Zheng; et al.
October 30, 2014
MATRIX COMPUTATION FRAMEWORK
Abstract
Described herein are technologies pertaining to matrix
computation. A computer-executable algorithm that is configured to
perform a sequence of computations over a matrix tile is
received and translated into a global directed acyclic graph that
includes vertices that perform a sequence of matrix computations
and edges that represent data dependencies amongst vertices. A
vertex in the global directed acyclic graph is represented by a
local directed acyclic graph that includes vertices that perform a
sequence of matrix computations at the block level, thereby
facilitating pipelined, data-driven matrix computation.
Inventors: Zhang; Zheng (Beijing, CN); Qian; Zhengping (Beijing, CN); Chen; Xiuwei (Beijing, CN); Yu; Yuan (Cupertino, CA)

Applicant: Microsoft Corporation (Redmond, WA, US)

Family ID: 47142707

Appl. No.: 14/329,591

Filed: July 11, 2014
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
13/105,915 (parent) | May 12, 2011 | 8,788,556
14/329,591 (present application) | - | -
Current U.S. Class: 708/400

Current CPC Class: G06F 11/0793 20130101; G06F 11/0709 20130101; G06F 17/16 20130101

Class at Publication: 708/400

International Class: G06F 17/16 20060101 G06F017/16
Claims
1. A computing system, comprising: a processor; and a memory that
comprises a system that is executed by the processor, the system
configured to: express matrix computations as a sequence of
operations to be performed over a tile of a matrix, the tile being
a portion of the matrix; translate the sequence of operations into
a global directed acyclic graph (DAG); and cause the sequence of
operations to be executed based upon the global DAG.
2. The computing system of claim 1, the global DAG includes
vertices and edges, a vertex in the vertices of the global DAG is
representative of a sub-sequence of operations in the sequence of
operations, the edges representative of data dependencies between
vertices of the global DAG.
3. The computing system of claim 2, the system further configured
to schedule the vertices at respective computing devices in a
distributed computing environment.
4. The computing system of claim 3, the system further configured
to reschedule the vertex responsive to receipt of an indication
that a computing device that has been assigned the vertex has a
fault corresponding thereto.
5. The computing system of claim 3, the system further configured
to transmit a block of the matrix to a computing device in the
respective computing devices responsive to receipt of an indication
that another computing device in the respective computing devices
has a fault corresponding thereto.
6. The computing system of claim 2, the system further configured
to translate the vertex in the global DAG into a local DAG,
wherein the local DAG represents a sequence of operations to be
performed over a block of the matrix, the block of the matrix being
a portion of the tile of the matrix.
7. The computing system of claim 6, the local DAG comprises
vertices and edges, a vertex in the vertices of the local DAG is
representative of a sub-sequence of the sequence of operations to
be performed over the block of the matrix, the edges representative
of data dependencies between vertices of the local DAG.
8. The computing system of claim 7, the system further configured
to transmit the local DAG to another computing device, the another
computing device configured to perform the sequence of operations
over the block of the matrix responsive to receiving the local
DAG.
9. The computing system of claim 8, the system further configured
to transmit another block of the matrix to the another computing
device, the another computing device configured to perform the
sequence of operations over the block of the matrix based upon the
another block of the matrix.
10. The computing system of claim 1, the system further configured
to identify blocks of the matrix that are to be tracked over time
to facilitate efficient fault tolerance.
11. A method comprising: receiving a matrix computation that is to
be performed over a matrix; expressing the matrix computation as
sequences of operations to be performed over tiles of the matrix,
the tiles being portions of the matrix; translating, by a
processor, a sequence of operations to be performed over a tile of
a matrix as a global directed acyclic graph (DAG) that comprises
vertices and edges, the vertices representative of sub-sequences of
operations to be performed over blocks of the tile, the edges
representative of data dependencies between the vertices; and
scheduling the vertices at computing devices in a distributed
computing environment.
12. The method of claim 11, further comprising: translating a
vertex in the vertices into a local DAG, the local DAG
representative of a sequence of operations to be performed over a
block of the tile, the local DAG comprises vertices and edges, the
vertices of the local DAG representative of a sub-sequence of the
operations to be performed over the block, the edges representative
of data dependencies between the vertices.
13. The method of claim 12, the translating of the vertex into the
local DAG being based upon a size of a memory cache of a computing
device that is to receive the local DAG.
14. The method of claim 13, wherein the local DAG is represented as
skeleton code that causes an operator to fire responsive to receipt
of the block.
15. The method of claim 13, further comprising transmitting blocks
of the matrix to the computing devices.
16. The method of claim 11, further comprising: receiving an
indication that execution of a sub-sequence of operations
represented by a vertex in the global DAG has failed at a computing
device; and rescheduling the vertex at another computing device
responsive to receiving the indication.
17. A computer-readable medium comprising instructions that, when
executed by a processor, cause the processor to perform acts,
comprising: receiving at least one computation that is to be
executed over a matrix; responsive to receiving the at least one
computation, representing the at least one computation as a
sequence of operations that are to be undertaken on tiles of the
matrix; responsive to representing the at least one computation as
the sequence of operations, translating at least one operation into
a global directed acyclic graph (DAG) that comprises: a plurality
of vertices that represent a corresponding plurality of sequential
operations on at least one tile of the matrix; and a plurality of
edges that represent data dependencies between the plurality of
vertices; and responsive to translating the at least one operation
into the global DAG, scheduling the vertices of the global DAG at a
plurality of computing devices in a distributed computing
environment, wherein the plurality of computing devices
collectively perform the at least one operation over the
matrix.
18. The computer-readable medium of claim 17, the instructions
further comprising: responsive to translating the at least one
operation into the global DAG, representing a vertex in the global
DAG as a local DAG that comprises a plurality of vertices that are
configured to perform a corresponding plurality of sequential
operations on at least one block that corresponds to the matrix,
wherein a size of the block is smaller than a size of the at least
one tile; and causing the sequential operations that are
represented by the plurality of vertices in the local DAG to be
executed in a data-driven manner.
19. The computer-readable medium of claim 18, the acts further
comprising: setting the size of the block based upon a size of a
cache of a computing device in the plurality of computing devices
that is to execute the local DAG over the block.
20. The computer-readable medium of claim 17, further comprising:
receiving an indication that a computing device in the computing
devices has a fault corresponding thereto; and rescheduling a
vertex assigned to the computing device to another computing device
responsive to receiving the indication.
Description
RELATED APPLICATION
[0001] This application is a continuation of U.S. patent
application Ser. No. 13/105,915, filed on May 12, 2011, and
entitled "MATRIX COMPUTATION FRAMEWORK", the entirety of which is
incorporated herein by reference.
BACKGROUND
[0002] The term "high-performance computing" generally refers to
the utilization of clusters of computers to solve advanced
computation problems. The term is most commonly associated with
computing undertaken in connection with scientific research or
computational science. Exemplary applications that can be
classified as high-performance computing applications include, but
are not limited to, visual computing (including robust facial
recognition and robust 3-D modeling with crowd-sourced photos),
research undertaken with respect to web mining, machine learning,
and the like.
[0003] A conventional approach for performing parallel computation
of data in connection with high-performance computing is the single
instruction multiple data (SIMD) approach. This approach describes
the utilization of computers with multiple processing elements that
perform the same operation on multiple different data
simultaneously, thereby exploiting data level parallelism. Machines
configured to perform SIMD generally undertake staged processing
such that a bottleneck is created during synchronization of data.
Specifically, another machine or computing element may depend upon
output of a separate machine or computing element, and various
dependencies may exist. In SIMD, a computing element waits until
all data that is depended upon is received and then undertakes
processing thereon. This creates a significant scalability
bottleneck.
[0004] Large-scale data intensive computation has recently
attracted a tremendous amount of attention, both in the research
community and in industry. Moreover, many algorithms utilized in
high-performance computing applications can be expressed as matrix
computation. Conventional mechanisms for coding kernels utilized in
connection with matrix computation, as well as designing
applications that utilize matrix computations, are relatively low
level. Specifically, writing new computation kernels that
facilitate matrix computation requires a deep understanding of
interfaces that allow processes to communicate with one another by
sending and receiving messages, such as the message passing
interface (MPI). This makes it quite difficult for scientists to
program algorithms that facilitate matrix computation.
SUMMARY
[0005] The following is a brief summary of subject matter that is
described in greater detail herein. This summary is not intended to
be limiting as to the scope of the claims.
[0006] Described herein are various technologies pertaining to
pipelined matrix computation. With more particularity, matrix
computations can be expressed as a sequence of operations that are
performed on tiles of the matrix, wherein a matrix tile is a
portion of the matrix. As will be understood by one skilled in the
art of matrix computation, matrices can be relatively large such
that a tile of the matrix may be on the order of several tens of
thousands of elements. In an example, these operations that are to
be executed on tiles at execution time can be translated into
directed acyclic graphs (DAGs). A DAG that represents a sequence of
operations that are to be performed on a particular matrix tile can
be referred to herein as a global DAG. The global DAG comprises a
plurality of vertices and corresponding edges, where a vertex in
the global DAG performs a sequence of operations on the tile and
edges represent data dependencies among vertices. Pursuant to an
example, each vertex in a global DAG can be assigned to a
particular computing element, wherein a computing element may be a
processor, a computer, or a collection of processors.
[0007] As mentioned above, a vertex in the global DAG is configured
to perform a plurality of computing operations on the matrix tile.
As described herein, such vertex can be further represented by a
local DAG. The local DAG also comprises a plurality of vertices
that are configured to perform a sequence of mathematical (matrix)
computations at a matrix block level, where a matrix block is
significantly smaller than a matrix tile. For instance, a size of a
block can be on the order of a size of a cache of a computing
device that is configured to perform mathematical computations at
the block level. In contrast, a matrix tile is typically of the
order of main memory size. The local DAG additionally comprises a
plurality of edges that couple vertices in the local DAG to
represent data dependencies amongst vertices. In this approach, the
local DAG may be configured to output blocks that can be consumed
by other vertices in the global DAG. Accordingly, the system
operates in a data-driven manner such that data producers produce
output blocks as soon as requisite input blocks are received, such
that computation can be pushed through the system as far as
possible at the matrix block level.
[0008] As can be ascertained, the above describes a pipelined
approach for performing complex matrix computations, such that
blocks can be pushed through the local DAG and the global DAG as
far as possible. In large-scale computing systems, however, faults
may occur. For instance, network issues may cause a particular
computing device to go off-line. Maintenance may cause a particular
computing device to be down for some period of time, etc. One
mechanism for fault tolerance is to simply restart all computations
from the top of the global DAG. However, this is time consuming and
suboptimal. Described herein is an approach for fault tolerance in
a matrix computation system that performs matrix computations on
matrix blocks and outputs matrix blocks in a data-driven manner.
This fault tolerance is based at least in part upon monitoring
which blocks are needed by child vertices in the local DAG and/or
the global DAG to perform matrix computations.
[0009] Other aspects will be appreciated upon reading and
understanding the attached figures and description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] FIG. 1 is a functional block diagram of an exemplary system
that facilitates pipelined matrix computation.
[0011] FIG. 2 illustrates an exemplary global directed acyclic
graph and a corresponding local directed acyclic graph.
[0012] FIG. 3 illustrates a particular vertex in a directed acyclic
graph that performs a computation based upon received matrix
blocks.
[0013] FIG. 4 is a functional block diagram of an engine that
facilitates performing matrix computation.
[0014] FIG. 5 is an exemplary depiction of data that is monitored
in connection with performing fault tolerance in a matrix
computation system.
[0015] FIG. 6 is a flow diagram that illustrates an exemplary
methodology for representing a vertex in a global directed acyclic
graph as a local directed acyclic graph.
[0016] FIG. 7 is a flow diagram that illustrates an exemplary
methodology for causing sequential operations to be executed in a
local directed acyclic graph in a data-driven manner.
[0017] FIG. 8 is an exemplary computing system.
DETAILED DESCRIPTION
[0018] Various technologies pertaining to matrix computation in a
data-driven manner will now be described with reference to the
drawings, where like reference numerals represent like elements
throughout. In addition, several functional block diagrams of
exemplary systems are illustrated and described herein for purposes
of explanation; however, it is to be understood that functionality
that is described as being carried out by certain system components
may be performed by multiple components. Similarly, for instance, a
component may be configured to perform functionality that is
described as being carried out by multiple components.
Additionally, as used herein, the term "exemplary" is intended to
mean serving as an illustration or example of something, and is not
intended to indicate a preference.
[0019] With reference now to FIG. 1, an exemplary system 100 that
facilitates pipelined matrix computation in a high-performance
computing environment is illustrated. Prior to describing
components of the system, a unified programming model will be
described that is exposed to both matrix kernel developers and
application developers that rely on matrix computations. At the
application level, the
unified programming model described herein provides a scalable
matrix computation library that can support both basic matrix
operations, such as multiplication, as well as higher-level
computations such as Cholesky and LU factorization. For example,
such library can be integrated into a software framework that
supports several programming languages, which allows language
interoperability. Accordingly, this library can be accessed and
utilized by applications and/or kernels written in a variety of
different programming languages such that a scientific researcher
and/or computer programmer can readily program applications and
matrix computation kernels utilizing such framework. The library
that supports the matrix operations described above can be invoked
directly utilizing one of the languages supported by the
aforementioned software framework as method calls to perform
large-scale matrix computations. Accordingly, for instance,
programs may be generated to perform many data intensive
computations that include matrix application kernels.
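By way of illustration only, invoking such a library as method calls might look like the following C# sketch. ParallelMatrix is the type used in the tile algorithm below; LoadFromFile and Save are hypothetical helper names introduced here solely to show the method-call style of the interface, not methods the framework is known to expose.

// A minimal usage sketch. Assumptions: LoadFromFile and Save are
// hypothetical I/O helpers; ParallelMatrix is the library's matrix
// type, as in the tile algorithm shown below.
public static class MatrixLibraryExample
{
    public static void Main()
    {
        var a = ParallelMatrix.LoadFromFile("A.dat");   // hypothetical loader
        var b = ParallelMatrix.LoadFromFile("B.dat");
        var product = a * b;                 // basic operation: multiplication
        var inverse = product.Inverse();     // higher-level kernel (see below)
        inverse.Save("result.dat");          // hypothetical writer
    }
}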
[0020] The unified programming model can adopt the widely used tile
abstraction for writing matrix computation kernels. In such an
abstraction, a matrix is divided into a plurality of tiles (square
cell matrices), and matrix computations are expressed as a sequence
of operations on tiles. Shown below, an exemplary tile algorithm
that solves matrix inversion is displayed in matrix format and in
code supported by the unified programming model. As shown,
programming a tile algorithm in such programming model is fairly
straightforward and is a direct translation of the algorithm into a
sequential program. Parallelization and distributed execution of
such program will be described below. Inversion of a 2×2 matrix
over coefficient ring R can be expressed as follows:

$$M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \qquad M^{-1} = \begin{bmatrix} (A - BD^{-1}C)^{-1} & (C - DB^{-1}A)^{-1} \\ (B - AC^{-1}D)^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix}$$
To avoid inverting all of A, B, C, D, only A can be required to be
invertible, as shown here:
$$M^{-1} = \begin{bmatrix} I & -A^{-1}B \\ 0 & I \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & S_A^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -CA^{-1} & I \end{bmatrix} = \begin{bmatrix} A^{-1} + A^{-1}B S_A^{-1} CA^{-1} & -A^{-1}B S_A^{-1} \\ -S_A^{-1} CA^{-1} & S_A^{-1} \end{bmatrix}$$
where $S_A = D - CA^{-1}B$ is the Schur complement of A in M.
Alternatively, $M^{-1}$ can be expressed in terms of
$S_D = A - BD^{-1}C$, the Schur complement of D in M, as follows:

$$M^{-1} = \begin{bmatrix} S_D^{-1} & -S_D^{-1} BD^{-1} \\ -D^{-1} C S_D^{-1} & D^{-1} + D^{-1} C S_D^{-1} BD^{-1} \end{bmatrix}$$
An exemplary program that can be utilized to perform this matrix
computation is as follows:
TABLE-US-00001
public ParallelMatrix Inverse() {
    var M = Partition(this, 2, 2);
    var A = M[0, 0];
    var B = M[0, 1];
    var C = M[1, 0];
    var D = M[1, 1];
    var Ai = A.Inverse();
    var CAi = C * Ai;
    var Si = (D - CAi * B).Inverse();
    var AiBSi = Ai * B * Si;
    var result = new ParallelMatrix[2, 2];
    result[0, 0] = Ai + AiBSi * CAi;
    result[0, 1] = AiBSi.Neg();
    result[1, 0] = (Si * CAi).Neg();
    result[1, 1] = Si;
    return ParallelMatrix.Combine(result);
}
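A brief design note on the listing above: although Inverse() reads as ordinary sequential code, each statement is a tile-granularity matrix operation. As described in the next paragraph, the framework translates this sequential program into a directed acyclic graph at execution time, so the kernel author obtains parallel, distributed execution without modifying the kernel source.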
[0021] As will be described below, tile algorithms can be
automatically translated into directed acyclic graphs (DAGs) at
execution time, wherein a DAG includes a plurality of vertices and
edges. A vertex in a directed acyclic graph is configured to
perform a sequence of mathematical operations on tiles and edges in
the DAG capture data dependencies among vertices.
[0022] The system 100 comprises a data store 102, which may be a
hard drive, memory or other suitable storage media. The data store
102 comprises computer-executable code 104, which may be, for
example, a tile algorithm such as the tile algorithm presented
above. It is to be understood that numerous tile algorithms have
been generated for performing matrix computation and such tile
algorithms are contemplated and are intended to fall under the
scope of the hereto appended claims. The data store 102 further
comprises a matrix representation 106. Pursuant to an example, the
matrix representation 106 may be symbolic or may include numerical
values. Oftentimes such matrix representation 106 can be quite
large--on the order of tens of thousands or hundreds of thousands
of entries.
[0023] A scheduler component 108 can receive the
computer-executable code 104 and the matrix representation 106,
wherein the computer executable code 104 is configured to perform
one or more computations over at least portions of the matrix
represented by the matrix representation 106. The scheduler
component 108 may then cause the computer-executable code 104 to be
represented as a global DAG that includes a plurality of vertices
and a plurality of edges. Vertices in the global DAG are configured
to perform a sequence of operations on tiles of the matrix, wherein
a tile is a square subportion of such matrix. As used herein, a
matrix tile may be relatively large, such as on the order of size
of main memory in a computing device that is configured to perform
matrix computations over the matrix tile. Edges between vertices in
the global DAG represent data dependencies between vertices.
Therefore, for example, a first vertex that is coupled to a second
vertex by an edge indicates that the second vertex is dependent
upon output of the first vertex. Pursuant to an example, the
scheduler component 108 can translate the computer-executable code
104 into the global DAG at execution time of the
computer-executable code 104. Furthermore, the scheduler component
108 can cause the global DAG to be retained in the data store 102
or other suitable data store that is accessible to the scheduler
component 108. As will be described in greater detail below, the
scheduler component 108 is also configured to cause the vertices in the
global DAG to be scheduled such that these vertices can be executed
on computing devices in a distributed computing platform.
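To make the translation concrete, the following C# sketch shows one plausible in-memory shape for a global DAG; the type and member names (GlobalVertex, GlobalDag, TileOperations) are illustrative assumptions for exposition, not the framework's actual representation.

using System.Collections.Generic;

// Illustrative sketch only: one plausible representation of a global
// DAG whose vertices operate at tile granularity.
public sealed class GlobalVertex
{
    // The sub-sequence of tile-level operations this vertex performs.
    public List<string> TileOperations { get; } = new List<string>();

    // Data dependencies: tiles flow from parent vertices to children.
    public List<GlobalVertex> Parents { get; } = new List<GlobalVertex>();
    public List<GlobalVertex> Children { get; } = new List<GlobalVertex>();
}

public sealed class GlobalDag
{
    public List<GlobalVertex> Vertices { get; } = new List<GlobalVertex>();

    // An edge records that the child consumes a tile the parent produces.
    public void AddEdge(GlobalVertex parent, GlobalVertex child)
    {
        parent.Children.Add(child);
        child.Parents.Add(parent);
    }
}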
[0024] It is to be understood that translating the
computer-executable code 104 into the global DAG ensures that
synchronization of vertex computation is strictly the result of
data dependencies between vertices. As mentioned previously, the
global DAGs operate at the level of a matrix tile and can be
directly executed through utilization of a general purpose
distributed DAG execution engine. Conventional general purpose DAG
execution engines, however, do not allow computation of depending
vertices to overlap, thereby creating a performance bottleneck for
distributed matrix computation. Accordingly, the system 100
facilitates pipelined DAG execution to explore inter-vertex
parallelism. Thus, the global DAG can be executed in a data-driven
manner. This can be accomplished, for example, by further
translating individual vertices in the global DAG into more
granular DAGs that carry out computations at matrix block levels.
As used herein, a matrix block is significantly smaller than a
matrix tile, such as, for example, on the order of size of a cache
of a computing device that is configured to perform matrix
computations. Therefore, while size of a matrix tile is on the
order of size of main memory, size of a matrix block is on the
order of size of the cache.
[0025] With more particularity, for a vertex in the global DAG, the
scheduler component 108 can translate such vertex into a local DAG.
The local DAG can comprise a plurality of vertices and
corresponding edges, wherein vertices in the local DAG carry out
the actual matrix computation by calling into an existing math
library. Again, this computation is undertaken at the block level
and can be performed in a data-driven manner. Edges in the local
DAG represent data dependencies amongst vertices in the local
DAG.
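As a rough sketch of the tile-to-block granularity described above, the helper below derives a block dimension from a cache size and enumerates the block grid of one tile. The "three live operand blocks" sizing heuristic is an assumption introduced for exposition; the patent prescribes only that a block be on the order of cache size.

using System;
using System.Collections.Generic;

// Sketch: choose a block dimension so a few operand blocks of doubles
// fit in cache, then enumerate the blocks of a tile. The heuristic is
// an assumption, not taken from the patent.
public static class Blocking
{
    public static int BlockDim(long cacheBytes, int liveOperands = 3)
    {
        long elements = cacheBytes / (sizeof(double) * liveOperands);
        return (int)Math.Sqrt(elements);   // square block: dim x dim doubles
    }

    public static IEnumerable<(int BlockRow, int BlockCol)> BlocksOfTile(
        int tileDim, int blockDim)
    {
        for (int r = 0; r < tileDim; r += blockDim)
            for (int c = 0; c < tileDim; c += blockDim)
                yield return (r / blockDim, c / blockDim);
    }
}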
[0026] The system 100 may further comprise a plurality of computing
devices 110-112 that are configured to perform matrix computations.
Pursuant to an example, these computing devices 110-112 may be
standalone computing devices that are in communication with a
computing device that comprises the scheduler component 108 by way
of a network connection. Therefore, for example, the scheduler
component 108 may be comprised by a parent computing device that is
configured to schedule computation amongst the plurality of
computing devices 110-112. Additionally, as will be described
below, the scheduler component 108 can facilitate fault tolerance
with respect to failures of one or more of the computing devices
110-112.
[0027] In an example, the first computing device 110 can comprise
an executor component 114 that is configured to execute vertices in
the local DAG. More specifically, if the local DAG depends upon
data from another local DAG, such data can be provided to the
computing device 110 in the form of matrix blocks. As such blocks
are received, computation that is based upon these blocks can be
pushed through the local DAG as far as possible. Therefore, the
executor component 114 facilitates executing the local DAG in a
data-driven manner, in parallel with other vertex operations in the
local DAG as well as in parallel with computations undertaken by
other computing devices scheduled to perform matrix computations by the
scheduler component 108.
[0028] It can be ascertained that performing matrix computations in
such a highly parallel pipelined manner can cause fault tolerance
to become relatively complex due to several data dependencies.
Accordingly, the system 100 may comprise a fault detector component
116 that can detect that a fault has occurred in a computation at
one or more of the computing devices 110-112. Such faults may
exist, for instance, due to network failures, maintenance, hardware
failures, etc. at the computing devices 110-112. In an exemplary
approach, the fault detector component 116 can detect a fault at
one of the computing devices 110-112 and can inform the scheduler
component 108 of such fault. At such point in time, the scheduler
component 108 can identify the vertex in the global DAG that is
being executed at the computing device where the failure occurred,
and can cause such vertex to be restarted. For instance, the
scheduler component 108 can reschedule the local DAG that is a
translation of the vertex in the global DAG at a different
computing device that has not failed. Due to the various types of
data dependencies, however, this approach of restarting a vertex
may be inefficient.
[0029] Accordingly, as will be described in greater detail herein,
the computing devices 110-112 can be configured to provide the
scheduler component 108 with data that is indicative of blocks that
have been consumed by vertices in the local DAG, blocks that have
been output by vertices in the local DAG, and blocks that are
needed by vertices in the local DAG to perform a computation.
Thereafter, vertices in the local DAG at the computing device where
the fault occurred can be selectively determined for re-starting
(or rescheduled at another computing device). Additional detail
pertaining to selectively determining which vertices in a local DAG
to restart will be provided below.
[0030] With reference now to FIG. 2, an exemplary global DAG 200
that can be created by the scheduler component 108 is illustrated.
As mentioned above, the exemplary global DAG 200 may comprise a
plurality of vertices 202-214. In this exemplary data structure,
each vertex in the global DAG 200 is configured to perform a
sequence of matrix computations at a matrix tile level. The global
DAG 200 further comprises a plurality of edges 216-232. These edges
represent data dependencies amongst vertices. More specifically,
the edge 218 indicates that the vertex 204 is unable to perform
desired computations until the vertex 202 outputs results of a
matrix computation over a matrix tile. Similarly, the vertex 208,
as evidenced by the edge 224 and the edge 216, can perform its
sequence of operations only after the vertex 202 and the vertex 204
have completed their respective computations. It can be readily
ascertained, however, that the global DAG 200 facilitates parallel
computation as, for example, the vertex 204 and the vertex 206 can
execute their respective sequential instructions immediately
subsequent to receiving computations from the vertex 202.
[0031] As mentioned above, vertices in the global DAG 200 can be
represented as local DAGs. For instance, the vertex 206 can be
represented as a local DAG 234. The local DAG 234 comprises a
plurality of vertices 236-254 that are configured to perform matrix
computations at the block level. Accordingly, as blocks are
received from a parent vertex in the global DAG 200, the respective
computations can be pushed at the block level as far as possible
through the local DAG 234. In other words, a vertex can perform a
computation at the block level and output a resulting block for
provision to a child vertex immediately subsequent to performing
the computation. This results in increased inter-vertex parallel
computing, wherein the vertices in the global DAG 200 and the local
DAG 234 execute in a data-driven manner rather than in a staged
manner.
[0032] The local DAG 234 further comprises a plurality of edges
256-280. Again, these edges can represent data dependencies between
vertices in the local DAG 234. As mentioned previously, while the
local DAG 234 is shown for purposes of explanation, it is to be
understood that the local DAG 234 can be represented by a DAG-free
skeleton code to reduce overhead that may be caused by multiple
vertices and interdependencies.
[0033] Referring now to FIG. 3, an exemplary depiction 300 of a
vertex 302 that may be included in a local DAG is illustrated. The
vertex 302 is configured to perform a series of matrix computations
at the block level. Pursuant to an example, the vertex 302 may
receive output blocks from a plurality of other vertices in the
local DAG, wherein the computations in the vertex 302 depend upon
such blocks. Accordingly, immediately responsive to the vertex 302
receiving a first block, a second block and a third block, the
vertex 302 can be configured to generate an output block. With more
particularity, the vertex 302 can carry out actual matrix
computation by calling into an existing math library 306 that is in
a data store 308 that is accessible to the vertex 302. Numerous
math libraries that include a rich set of operators from basic
matrix operators to high-level solvers currently exist, and can be
utilized in connection with the systems and methods described
herein.
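A minimal sketch of the behavior just described: the vertex buffers operand blocks as they arrive and invokes a math-library routine once all of them are present. The kernel delegate stands in for "calling into an existing math library," and all names here are illustrative assumptions; duplicate deliveries are simply ignored in this simplified sketch.

using System;
using System.Collections.Generic;

// Sketch of a block-level vertex that fires as soon as all of its
// operand blocks have arrived. Names are illustrative assumptions.
public sealed class BlockVertex
{
    private readonly HashSet<string> _needed;
    private readonly Dictionary<string, double[]> _inputs =
        new Dictionary<string, double[]>();
    private readonly Func<IReadOnlyDictionary<string, double[]>, double[]> _kernel;

    public event Action<double[]> OutputProduced;   // consumed by child vertices

    public BlockVertex(IEnumerable<string> neededBlockIds,
                       Func<IReadOnlyDictionary<string, double[]>, double[]> kernel)
    {
        _needed = new HashSet<string>(neededBlockIds);
        _kernel = kernel;
    }

    // Called whenever a parent vertex delivers a block.
    public void OnBlockArrived(string blockId, double[] data)
    {
        if (!_needed.Contains(blockId) || _inputs.ContainsKey(blockId))
            return;                                    // irrelevant or duplicate
        _inputs[blockId] = data;
        if (_inputs.Count == _needed.Count)            // match...
            OutputProduced?.Invoke(_kernel(_inputs));  // ...and fire
    }
}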
[0034] Now referring to FIG. 4, an exemplary depiction of a local
DAG execution engine 400 is illustrated. The local DAG execution
engine 400 can be seen from the perspective of the global DAG as a
black box. The local DAG execution engine 400 can be driven by two
simple state machines. These state machines are represented by a
load component 402 and a compute component 404. The load component
402 is configured to load blocks from parent vertices in the global
DAG. The arriving blocks from the global DAG can be new to the
local DAG execution engine 400 or may be blocks previously seen.
The latter can occur if recomputation is triggered under some
failure sequences.
[0035] The compute component 404 is configured to scan available
local blocks (from the global DAG or vertices in the local DAG) and
push computation through the local DAG as far as possible in the
local DAG. The compute component 404 can call appropriate routines
from the math library 306 described above for any vertices that are
ready to perform computations on output blocks. This causes
potential production of blocks for downstream vertices in the
global DAG.
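The two state machines can be read as a single driving loop, sketched below under the assumption of an in-memory block store and reusing the BlockVertex sketch above; the queue and dictionary choices are assumptions, as the engine is not specified at this level of detail.

using System.Collections.Generic;

// Sketch of the load and compute state machines as one loop. Names
// and data structures are illustrative assumptions.
public sealed class LocalDagEngine
{
    private readonly Queue<(string Id, double[] Data)> _arrivals =
        new Queue<(string, double[])>();
    private readonly Dictionary<string, double[]> _localBlocks =
        new Dictionary<string, double[]>();
    private readonly IReadOnlyList<BlockVertex> _vertices;

    public LocalDagEngine(IReadOnlyList<BlockVertex> vertices)
    {
        _vertices = vertices;
    }

    // Load component: blocks arriving from parent vertices in the global DAG.
    public void Deliver(string id, double[] data) => _arrivals.Enqueue((id, data));

    // Compute component: push computation as far as available blocks allow.
    public void Step()
    {
        while (_arrivals.Count > 0)
        {
            var (id, data) = _arrivals.Dequeue();
            _localBlocks[id] = data;    // re-delivered blocks simply overwrite
        }
        foreach (var vertex in _vertices)
            foreach (var block in _localBlocks)
                vertex.OnBlockArrived(block.Key, block.Value);
    }
}

In a fuller version, each vertex's OutputProduced event would be wired back through Deliver so that blocks produced locally can fire downstream vertices in the same data-driven manner.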
[0036] The computing framework of the local DAG execution engine
400 resembles a data-driven machine. Therefore, the local DAG
execution engine 400 can match and fire while walking the local
DAG. This is relatively straightforward if the local DAG is small.
The local DAG, however, may be rather large. For instance, in a
Cholesky decomposition of a large matrix, where each tile is made
up of 256×256 blocks, the local DAG can have approximately
8.5 million vertices. Storing and manipulating such a large graph
can impose non-trivial memory and CPU overhead.
[0037] Accordingly, a DAG-free representation of the local DAG can
be implemented to avoid much of such overhead. The scheduler
component 108 can automatically transform original sequential code
into a skeleton code in which no operations are actually carried
out. As new blocks arrive from the global DAG or, more generally,
whenever computation is required, the skeleton code can be executed
to fire computation for operators whose operands are available. In
other words, the skeleton code can cause an operator to fire
responsive to receipt of a block from a parent vertex in the global
DAG. Partial results (e.g., outputs of vertices of the local DAG)
can be stored by the local DAG execution engine 400 and fetched at
an appropriate time.
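One way to picture the DAG-free skeleton: the sequential tile program is re-walked as a cheap interpreter that fires only operators whose operands are on hand, skipping the rest and storing partial results. The sketch below is an assumption about one plausible realization, not the patent's actual skeleton-code transformation.

using System;
using System.Collections.Generic;

// Sketch of DAG-free skeleton execution: re-walk the sequential
// program whenever new blocks arrive; an operation fires only when
// its operands are available and its result has not yet been
// produced. All names are illustrative assumptions.
public sealed class SkeletonOp
{
    public string Result;                    // block this operation produces
    public string[] Operands;                // blocks it consumes
    public Func<double[][], double[]> Fire;  // the math-library routine
}

public static class Skeleton
{
    public static void Run(IReadOnlyList<SkeletonOp> program,
                           IDictionary<string, double[]> blocks)
    {
        foreach (var op in program)
        {
            if (blocks.ContainsKey(op.Result))
                continue;                    // partial result already stored
            var args = new double[op.Operands.Length][];
            bool ready = true;
            for (int i = 0; i < op.Operands.Length && ready; i++)
                ready = blocks.TryGetValue(op.Operands[i], out args[i]);
            if (ready)
                blocks[op.Result] = op.Fire(args);   // match and fire
        }
    }
}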
[0038] With reference now to FIG. 5, a state diagram 500 is
illustrated that describes a vertex for pipelined execution and
failure handling. The pipelined DAG execution model described above
yields substantial performance gain compared to a staged DAG
execution model. Handling fault tolerance, however, becomes more
challenging. To achieve non-blocking pipelined execution, partial
results (blocks) of the vertex computation are observed and
consumed by its descendants, potentially without bounds. It can
further be noted that an arbitrary number of vertices at arbitrary
positions in the global DAG may be hosted on one machine, and
therefore a single machine failure can break multiple holes in the
global DAG that is being executed.
[0039] As mentioned above, a possible mechanism for dealing with
failure of a vertex is to restart that vertex since all computation
is deterministic. For a failed vertex, however, several of the
re-computed blocks may not be needed. Additionally, the overhead of
(unwanted) redundant computing is not restricted to the restarted
vertex. A redundant reproduced block can trigger unnecessary
computing in a depending vertex and so on and so forth, in a
cascading fashion. Therefore, without careful bookkeeping, overhead
utilized for fault handling may be nontrivial.
[0040] Determining the particular set of blocks that need to be
tracked appears, at first glance, difficult. For instance, a vertex
v may have a child vertex w. A block b that w has received and
stored in its input buffer is not needed anymore from v's
perspective; in other words, if v suffers a failure and recovers, it
need not reproduce b. Nor is b needed if w has already consumed b to
produce the blocks that depend on b, and so forth. Yet there may be
complicated sequences of failures that lead w to need b again, and
if v itself has lost b, then v needs to recompute b.
[0041] Amid such seemingly intricate patterns, a set of simple and
straightforward invariants exists. Specifically, two invariants
which govern intra- and inter-vertex data dependencies exist. If
such two invariants are enforced across all vertices, and the
related dependency states are recorded and restored upon recovery
(which is different than the data itself), then regardless of a
number of vertices that are restarted and their respective topology
positions in the DAG, the protocol described herein guarantees that
only necessary blocks are recomputed.
[0042] There are but a few states that need to be kept for the
aforementioned protocol to operate properly. Specifically, as
shown, a vertex V 502 can maintain a first list 504 v.all that
indicates a list of all blocks the vertex V can compute in its
lifetime. A second list 506 identifies blocks that have been
received from parent vertices (v.in), and a third list 508
identifies blocks that are still to be received from parent
vertices (v.need) to product output blocks. A fourth list 510
identifies blocks that have been output by the vertex V 502
(v.out), and a fifth list 512 identifies blocks that are to be
output by the vertex Vin the future (v.future).
[0043] The first of the two aforementioned invariants binds the
relationship of buffers inside the vertex V 502. v.in specifies the
blocks that are available in V's input buffer. What the vertex V
needs is anything that it needs to produce new blocks, minus what it
already has in its input buffer:

v.need ≡ v.dep(v.future) − v.in (1)
v.dep is a function that, given the identity of an output block b,
returns the indices of the set of input blocks that b depends upon.
For matrix computation, v.dep can typically be discovered
symbolically. For instance, for a 4×4 matrix multiplication,
C[0, 0] depends on A[0, 0:3] (A's first row) and B[0:3, 0] (B's
first column).
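For the 4×4 multiplication example, such a dep function can be written symbolically as below; the "C[i,j]" string encoding of block identities is an assumption made for this sketch.

using System.Collections.Generic;

// Sketch of a symbolic dep function for 4x4 matrix multiplication:
// C[i,j] depends on row i of A and column j of B.
public static class MatMulDeps
{
    public static IEnumerable<string> Dep(string outputBlock) // e.g. "C[0,0]"
    {
        var indices = outputBlock.Trim('C', '[', ']').Split(',');
        int i = int.Parse(indices[0]);
        int j = int.Parse(indices[1]);
        for (int k = 0; k < 4; k++)
        {
            yield return $"A[{i},{k}]";   // A's row i
            yield return $"B[{k},{j}]";   // B's column j
        }
    }
}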
[0044] The second invariant complements the first and binds the
relationship of outward-facing buffers across neighbors.
Specifically, the second invariant specifies what v.future really
is. In this invariant, v.out is the set of blocks that the vertex V
502 has computed and made available to the rest of the system. As
described above, v.all is all of the blocks that the vertex is to
compute in its lifetime.

v.future ≡ v.children.need ∩ v.all − v.out (2)

The invariant states that what a vertex needs to produce is the
union of everything needed to satisfy its children, intersected with
what this vertex is responsible for (as a child vertex may depend on
other vertices), minus what it has already made available to the
children.
[0045] The following invariant combines both the aforementioned
invariants, and explains the intuition of why enforcing these
invariants is sufficient to guarantee full recovery without
introducing unnecessary redundant computation:
v.need ≡ v.dep(v.children.need ∩ v.all − v.out) − v.in (3)
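Invariants (1) and (2) translate directly into set arithmetic over the lists sketched earlier. The following sketch re-establishes them for one vertex, with the children collection and the dep function assumed to be supplied by the scheduler; this is an exposition aid, not the patent's actual protocol code.

using System;
using System.Collections.Generic;

// Sketch: restoring invariants (1) and (2) for a vertex during
// recovery, using the VertexState sketch above.
public static class Invariants
{
    public static void Restore(VertexState v,
                               IEnumerable<VertexState> children,
                               Func<string, IEnumerable<string>> dep)
    {
        // (2) v.future = (union of children's need sets) ∩ v.all − v.out
        var future = new HashSet<string>();
        foreach (var child in children)
            future.UnionWith(child.Need);
        future.IntersectWith(v.All);
        future.ExceptWith(v.Out);
        v.Future = future;

        // (1) v.need = dep(v.future) − v.in
        var need = new HashSet<string>();
        foreach (var block in v.Future)
            need.UnionWith(dep(block));
        need.ExceptWith(v.In);
        v.Need = need;
    }
}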
[0046] In the absence of failure, the system functions as a
pipeline. Initially all vertices have their "future" set as equal
to their corresponding "all." In other words, the vertex needs to
produce all blocks. Then, for a vertex that has data to consume
(v.need ∩ v.parents.out is not NULL), it is scheduled to an
individual machine if one is available. A vertex relinquishes the
machine if it is either finished computing (e.g., v.future is NULL)
or it is starving for data. The load component 402 moves blocks from
parent vertices or from an input file (for first-level vertices) to
v.in, and the compute component 404 fills v.out (described above).
Both actions modify other data structures (v.need, v.future)
accordingly.
[0047] The scheduler component 108 can record the progress and
whereabouts of the outputs (the union of v.out of all vertices) so
the newly scheduled vertices can know where to fetch data. This
metadata can be stored in a reliable data repository. Accordingly,
when failure occurs, in addition to knowing the identities of the
vertices that have crashed, it can also be ascertained what data has
been lost. Conceptually then, to determine what blocks are needed
for a recovering vertex, child vertices can be queried for their
need sets (the identities of blocks in v.need), which is sufficient
to compute v.need for the parent vertex. If the children vertices
happen to have crashed also, the recovery will eventually set their
"need" sets appropriately, and that in turn will update the parent
vertex's "future" and "need" sets, ensuring that the system
converges to the invariants.
[0048] The same principle can be upheld to handle even more
complicated cases. For instance, retired vertices (e.g., those that
have computed all of their outputs) are said to be hibernating at
the scheduler component 108, and in that sense they never truly
retire. Should any of the children vertices of a retired vertex
request blocks that are missing from the system due to failure, the
retired vertex is reactivated, since its "future" set is no longer
empty.
[0049] With reference now to FIGS. 6-7, various exemplary
methodologies are illustrated and described. While the
methodologies are described as being a series of acts that are
performed in a sequence, it is to be understood that the
methodologies are not limited by the order of the sequence. For
instance, some acts may occur in a different order than what is
described herein. In addition, an act may occur concurrently with
another act. Furthermore, in some instances, not all acts may be
required to implement a methodology described herein.
[0050] Moreover, the acts described herein may be
computer-executable instructions that can be implemented by one or
more processors and/or stored on a computer-readable medium or
media. The computer-executable instructions may include a routine,
a sub-routine, programs, a thread of execution, and/or the like.
Still further, results of acts of the methodologies may be stored
in a computer-readable medium, displayed on a display device,
and/or the like. The computer-readable medium may be any suitable data
storage medium, such as memory, hard drive, CD, DVD, flash drive,
or the like. A "computer-readable medium", as the term is used
herein, is not intended to encompass a propagated signal.
[0051] Turning now to FIG. 6, an exemplary methodology 600 that
facilitates representing a vertex in a global DAG as a local DAG
and executing a mathematical operation over a matrix block in a
data-driven manner is illustrated. The methodology 600 starts at
602, and at 604, a computer-executable algorithm is received that
is configured to execute a matrix computation over a tile of a
matrix.
[0052] At 606, the computer-executable algorithm is translated into
a computer-implemented global DAG, wherein the global DAG comprises
a plurality of vertices and a corresponding plurality of edges. A
vertex in the plurality of vertices is configured to perform a
sequence of operations on the tile of the matrix and the plurality
of edges represent data dependencies between coupled vertices.
[0053] At 608, the vertex in the global DAG is represented as a
DAG-free local DAG that comprises a plurality of vertices, wherein
a vertex in the local DAG is configured to execute a mathematical
operation on a block of a matrix, wherein the block is smaller than
the tile. The vertex in the local DAG is configured to execute the
mathematical operation in a data-driven manner.
[0054] Turning now to FIG. 7, another exemplary methodology 700
that facilitates executing vertices in a local DAG in a data-driven
manner is illustrated. The methodology 700 starts at 702, and at
704 at least one computation that is to be executed over a matrix
is received. This computation, for instance, may be a tile
algorithm that is configured to perform matrix operations over at
least a tile of the matrix.
[0055] At 706, subsequent to receiving at least one computation,
the at least one computation is represented as a sequence of
operations that are to be undertaken on tiles of the matrix.
[0056] At 708, subsequent to representing at least one computation
as a sequence of operations, at least one operation is translated
into a global directed acyclic graph that comprises a plurality of
vertices that are configured to perform a corresponding plurality
of sequential operations on at least one tile of the matrix. The
global directed acyclic graph also includes a plurality of edges
that represent data dependencies between the plurality of
vertices.
[0057] At 710, subsequent to the translating of the at least one
operation into the global directed acyclic graph, at least one vertex
in the global directed acyclic graph is represented as a local
directed acyclic graph that comprises a plurality of vertices that
are configured to perform a corresponding plurality of sequential
operations on at least one block that corresponds to the matrix,
wherein a size of the block is smaller than a size of the at least
one tile.
[0058] At 712, the sequential operations that are represented by
the plurality of vertices in a local directed acyclic graph are
caused to be executed in a data-driven manner that supports
parallelism and improves performance with respect to
high-performance computing. The methodology 700 completes at
714.
[0059] Now referring to FIG. 8, a high-level illustration of an
exemplary computing device 800 that can be used in accordance with
the systems and methodologies disclosed herein is illustrated. For
instance, the computing device 800 may be used in a system that
supports high performance computing. In another example, at least a
portion of the computing device 800 may be used in a system that
supports pipelined matrix computation. The computing device 800
includes at least one processor 802 that executes instructions that
are stored in a memory 804. The memory 804 may be or include RAM,
ROM, EEPROM, Flash memory, or other suitable memory. The
instructions may be, for instance, instructions for implementing
functionality described as being carried out by one or more
components discussed above or instructions for implementing one or
more of the methods described above. The processor 802 may access
the memory 804 by way of a system bus 806. In addition to storing
executable instructions, the memory 804 may also store matrix
tiles, matrix blocks, etc.
[0060] The computing device 800 additionally includes a data store
808 that is accessible by the processor 802 by way of the system
bus 806. The data store may be or include any suitable
computer-readable storage, including a hard disk, memory, etc. The
data store 808 may include executable instructions, matrix tiles,
matrix blocks, etc. The computing device 800 also includes an input
interface 810 that allows external devices to communicate with the
computing device 800. For instance, the input interface 810 may be
used to receive instructions from an external computer device, from
a user, etc. The computing device 800 also includes an output
interface 812 that interfaces the computing device 800 with one or
more external devices. For example, the computing device 800 may
display text, images, etc. by way of the output interface 812.
[0061] Additionally, while illustrated as a single system, it is to
be understood that the computing device 800 may be a distributed
system. Thus, for instance, several devices may be in communication
by way of a network connection and may collectively perform tasks
described as being performed by the computing device 800.
[0062] As used herein, the terms "component" and "system" are
intended to encompass hardware, software, or a combination of
hardware and software. Thus, for example, a system or component may
be a process, a process executing on a processor, or a processor.
Additionally, a component or system may be localized on a single
device or distributed across several devices. Furthermore, a
component or system may refer to a portion of memory and/or a
series of transistors.
[0063] It is noted that several examples have been provided for
purposes of explanation. These examples are not to be construed as
limiting the hereto-appended claims. Additionally, it may be
recognized that the examples provided herein may be permutated
while still falling under the scope of the claims.
* * * * *