U.S. patent application number 17/227590 was filed with the patent office on 2021-04-12 and published on 2021-07-29 for a data flow processing method and related device.
The applicant listed for this patent is HUAWEI TECHNOLOGIES CO., LTD. The invention is credited to Chen CHENG, Lijuan HAI, Christopher RODRIGUES, and Peng WU.
United States Patent Application 20210232394, Kind Code A1
Application Number: 17/227590
Family ID: 1000005564225
Publication Date: July 29, 2021
Inventors: HAI, Lijuan; et al.
DATA FLOW PROCESSING METHOD AND RELATED DEVICE
Abstract
The present disclosure relates to data flow processing methods
and devices. One example method includes obtaining a dependency
relationship and an execution sequence of operating a data flow by
a plurality of processing units, generating synchronization logic
based on the dependency relationship and the execution sequence,
and inserting the synchronization logic into an operation pipeline
of each of the plurality of processing units to generate executable
code.
Inventors: HAI, Lijuan (Beijing, CN); CHENG, Chen (Hangzhou, CN); RODRIGUES, Christopher (Santa Clara, CA); WU, Peng (Shenzhen, CN)
Applicant: HUAWEI TECHNOLOGIES CO., LTD., Shenzhen, CN
Family ID: 1000005564225
Appl. No.: 17/227590
Filed: April 12, 2021
Related U.S. Patent Documents: Application No. PCT/CN2019/110741, filed Oct. 12, 2019 (parent of application No. 17227590)
Current U.S. Class: 1/1
Current CPC Class: G06F 9/3867 (20130101); G06F 9/30087 (20130101); G06F 9/522 (20130101); G06F 9/3005 (20130101)
International Class: G06F 9/30 (20060101); G06F 9/52 (20060101); G06F 9/38 (20060101)
Foreign Application Data: CN 201811236134.8, filed Oct. 23, 2018
Claims
1. A data flow processing method, wherein the method comprises:
obtaining a dependency relationship and an execution sequence of
operating a data flow by a plurality of processing units;
generating synchronization logic based on the dependency
relationship and the execution sequence; and inserting the
synchronization logic into an operation pipeline of each of the
plurality of processing units to generate executable code.
2. The method according to claim 1, wherein obtaining the
dependency relationship and the execution sequence of operating the
data flow by the plurality of processing units comprises: obtaining
descriptive code used to describe the data flow; and determining
the dependency relationship and the execution sequence based on the
descriptive code.
3. The method according to claim 2, wherein the descriptive code
comprises at least one of a keyword used to define a buffer
variable, a keyword used to describe a read operation and a write
operation for buffering the data flow, an operator used to specify
a write buffer variable, or a keyword used to specify a read buffer
variable.
4. The method according to claim 1, wherein the dependency
relationship indicates that a first operation instruction in an
operation pipeline of a first processing unit of the plurality of
processing units is executed first before a second operation
instruction in an operation pipeline of a second processing unit of
the plurality of processing units starts to be executed, and
wherein the execution sequence indicates a time sequence in which
operation instructions of the plurality of processing units are
transmitted to operation pipelines of a corresponding type to wait
for execution.
5. The method according to claim 1, wherein generating the
synchronization logic based on the dependency relationship and the
execution sequence comprises: constructing, based on the dependency
relationship and the execution sequence, a dependency decision tree
of operating the data flow by the plurality of processing units;
and generating the synchronization logic based on the dependency
decision tree.
6. The method according to claim 1, wherein the synchronization
logic comprises a barrier instruction and an event synchronization
instruction, and wherein generating the synchronization logic based
on the dependency relationship and the execution sequence
comprises: generating the event synchronization instruction based
on the dependency relationship; and generating the barrier
instruction based on the execution sequence.
7. The method according to claim 1, wherein generating the
synchronization logic based on the dependency relationship and the
execution sequence comprises: determining whether the dependency
relationship is transfer dependency; and generating the
synchronization logic when the dependency relationship is not the
transfer dependency.
8. A data flow processing apparatus, comprising a memory, a
communications bus, and at least one processor, wherein the memory
stores programming instructions for execution by the at least one
processor to perform operations comprising: obtaining a dependency
relationship and an execution sequence of operating a data flow by
a plurality of processing units; generating synchronization logic
based on the dependency relationship and the execution sequence;
and inserting the synchronization logic into an operation pipeline
of each of the plurality of processing units to generate executable
code.
9. The apparatus according to claim 8, wherein obtaining the
dependency relationship and the execution sequence of operating the
data flow by the plurality of processing units comprises: obtaining
descriptive code used to describe the data flow; and determining
the dependency relationship and the execution sequence based on the
descriptive code.
10. The apparatus according to claim 9, wherein the descriptive
code comprises at least one of a keyword used to define a buffer
variable, a keyword used to describe a read operation and a write
operation for buffering the data flow, an operator used to specify
a write buffer variable, or a keyword used to specify a read buffer
variable.
11. The apparatus according to claim 8, wherein the dependency
relationship indicates that a first operation instruction in an
operation pipeline of a first processing unit of the plurality of
processing units is executed first before a second operation
instruction in an operation pipeline of a second processing unit of
the plurality of processing units starts to be executed, and
wherein the execution sequence indicates a time sequence in which
operation instructions of the plurality of processing units are
transmitted to operation pipelines of a corresponding type to wait
for execution.
12. The apparatus according to claim 8, wherein generating the
synchronization logic based on the dependency relationship and the
execution sequence comprises: constructing, based on the dependency
relationship and the execution sequence, a dependency decision tree
of operating the data flow by the plurality of processing units;
and generating the synchronization logic based on the dependency
decision tree.
13. The apparatus according to claim 8, wherein the synchronization
logic comprises a barrier instruction and an event synchronization
instruction, and wherein generating the synchronization logic based
on the dependency relationship and the execution sequence
comprises: generating the event synchronization instruction based
on the dependency relationship; and generating the barrier
instruction based on the execution sequence.
14. The apparatus according to claim 8, wherein generating the
synchronization logic based on the dependency relationship and the
execution sequence comprises: determining whether the dependency
relationship is transfer dependency; and generating the
synchronization logic when the dependency relationship is not the
transfer dependency.
15. A computer-readable storage medium, wherein the
computer-readable storage medium stores instructions which, when
run on a computer, cause the computer to perform operations
comprising: obtaining a dependency relationship and an execution
sequence of operating a data flow by a plurality of processing
units; generating synchronization logic based on the dependency
relationship and the execution sequence; and inserting the
synchronization logic into an operation pipeline of each of the
plurality of processing units to generate executable code.
16. The computer-readable storage medium according to claim 15,
wherein obtaining the dependency relationship and the execution
sequence of operating the data flow by the plurality of processing
units comprises: obtaining descriptive code used to describe the
data flow; and determining the dependency relationship and the
execution sequence based on the descriptive code.
17. The computer-readable storage medium according to claim 16,
wherein the descriptive code comprises at least one of a keyword
used to define a buffer variable, a keyword used to describe a read
operation and a write operation for buffering the data flow, an
operator used to specify a write buffer variable, or a keyword used
to specify a read buffer variable.
18. The computer-readable storage medium according to claim 15,
wherein the dependency relationship indicates that a first
operation instruction in an operation pipeline of a first
processing unit of the plurality of processing units is executed
first before a second operation instruction in an operation
pipeline of a second processing unit of the plurality of processing
units starts to be executed, and wherein the execution sequence
indicates a time sequence in which operation instructions of the
plurality of processing units are transmitted to operation
pipelines of a corresponding type to wait for execution.
19. The computer-readable storage medium according to claim 15,
wherein generating the synchronization logic based on the
dependency relationship and the execution sequence comprises:
constructing, based on the dependency relationship and the
execution sequence, a dependency decision tree of operating the
data flow by the plurality of processing units; and generating the
synchronization logic based on the dependency decision tree.
20. The computer-readable storage medium according to claim 15,
wherein the synchronization logic comprises a barrier instruction
and an event synchronization instruction, and wherein generating
the synchronization logic based on the dependency relationship and
the execution sequence comprises: generating the event
synchronization instruction based on the dependency relationship;
and generating the barrier instruction based on the execution
sequence.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of International
Application No. PCT/CN2019/110741, filed on Oct. 12, 2019, which
claims priority to Chinese Patent Application No. 201811236134.8,
filed on Oct. 23, 2018. The disclosures of the aforementioned
applications are hereby incorporated by reference in their
entireties.
TECHNICAL FIELD
[0002] This application relates to the data processing field, and
in particular, to a data flow processing method and a related
device.
BACKGROUND
[0003] With rapid development of machine learning and deep learning
technologies, computing capabilities of computers in a traditional
architecture cannot meet a current service requirement. Therefore,
dedicated hardware accelerators, for example, a tensor processing
unit (TPU) developed by Google and the world's first commercial
deep learning processor launched by Cambricon, that are customized
in depth for services in the artificial intelligence (AI) field are
successively launched. For machine learning and deep learning
models, the acceleration these provide exceeds that of a
traditional central processing unit (CPU) or a traditional
graphics processing unit (GPU) by more than one order of
magnitude.
[0004] To improve a parallel throughput capability, an AI hardware
accelerator usually uses a design principle of decoupling data
access from computing. A plurality of parallel operation pipelines
are provided internally to process data in an asynchronous and
parallel manner. For example, some operation pipelines specially
perform a direct memory access (DMA) operation to access data, some
operation pipelines specially perform a matrix multiplication
operation, and some operation pipelines specially perform a vector
operation. After a data access instruction is sent, it returns
immediately and asynchronously, and a subsequent operation (for
example, a matrix multiplication operation or a vector operation)
can be performed without waiting for the accessed data to be
ready. For a plurality of operations at a same address, such as a
read by A and a write by B, a write by A and a write by B, or a
write by A and a read by B, if there is no time sequence
dependency between the operations, execution concurrency can be
improved in the asynchronous and parallel manner. If there is a
time sequence dependency, an operation may be performed in the
asynchronous and parallel manner before its data access is ready.
As a result, an incorrect calculation result is generated.
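A minimal simulation of this hazard (illustrative Python, not accelerator code): two pipelines share one address, pipeline A writes and pipeline B reads. If B's read is reordered before A's write completes, B sees stale data, which is exactly the incorrect result synchronization must prevent.

```python
def run(ops):
    """Execute (pipeline, action, value) ops in the given order on a shared cell."""
    cell = {"data": 0}
    result = None
    for _, action, value in ops:
        if action == "write":
            cell["data"] = value
        else:  # read
            result = cell["data"]
    return result

# Correct order: A's write retires before B's read.
ordered = [("A", "write", 42), ("B", "read", None)]
# Hazardous asynchronous order: B's read executes before A's write is ready.
hazard = [("B", "read", None), ("A", "write", 42)]

print(run(ordered))  # 42
print(run(hazard))   # 0  (stale value -- incorrect calculation result)
```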
[0005] To resolve the foregoing problem, a TPU provides a pipeline
synchronization instruction to manage an asynchronous and parallel
operation pipeline. A tensor virtual machine (TVM) provides a more
convenient manner of automatically inserting a synchronization
instruction to implement time sequence consistency. However, there
are still problems of low compilation performance and low data
processing efficiency.
SUMMARY
[0006] Embodiments of this application provide a data flow
processing method and a related device, to improve compilation
performance and data processing efficiency.
[0007] According to a first aspect, an embodiment of this
application provides a data flow processing method, including:
first obtaining a dependency relationship and an execution sequence
of operating a data flow by a plurality of processing units, and
then generating synchronization logic based on the dependency
relationship and the execution sequence; and finally, inserting the
synchronization logic into an operation pipeline of each of the
plurality of processing units, to generate executable code.
[0008] The dependency relationship and the execution sequence
between operations are determined through serialization analysis,
and a compiler automatically inserts the synchronization logic.
This simplifies programming code, thereby improving compilation
performance and data processing efficiency.
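As a rough sketch of such serialization analysis (the data model and names here are assumptions for illustration, not the claimed implementation): each operation declares the buffers it reads and writes, and two operations on different pipelines depend on each other when their buffer accesses overlap and at least one of them writes.

```python
def find_dependencies(ops):
    """ops: list of dicts with 'name', 'pipe', 'reads', 'writes', in issue order."""
    deps = []
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            if a["pipe"] == b["pipe"]:
                continue  # one pipeline executes its own instructions in order
            conflict = (a["writes"] & (b["reads"] | b["writes"])) or \
                       (a["reads"] & b["writes"])
            if conflict:
                deps.append((a["name"], b["name"]))
    return deps

ops = [
    {"name": "load_r1", "pipe": "DMA",    "reads": set(),        "writes": {"r1"}},
    {"name": "load_r2", "pipe": "DMA2",   "reads": set(),        "writes": {"r2"}},
    {"name": "vec_add", "pipe": "VECTOR", "reads": {"r1", "r2"}, "writes": {"out"}},
]
print(find_dependencies(ops))  # [('load_r1', 'vec_add'), ('load_r2', 'vec_add')]
```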
[0009] In a possible design, descriptive code used to describe the
data flow is obtained, and the dependency relationship and the
execution sequence are determined based on the descriptive code. A
user defines a buffer and an operation pipeline and specifies a
read buffer and a write buffer of the operation pipeline to
describe the data flow. A synchronization manner based on a data
flow description hides hardware synchronization details, simplifies
programming code, and decouples a hardware architecture and
software development to facilitate separate upgrades of software
and hardware.
[0010] In another possible design, the descriptive code includes at
least one of a keyword used to define a buffer variable, a keyword
used to describe a read operation and a write operation for
buffering the data flow, an operator used to specify a write buffer
variable, and a keyword used to specify a read buffer variable. The
descriptive code is a language for describing synchronization
between a plurality of pipelines based on a data flow.
[0011] In another possible design, the dependency relationship
indicates that because operation instructions in the plurality of
operation pipelines access, that is, read and write, a same storage
address, an operation instruction in one operation pipeline needs
to be executed first before an operation instruction in another
operation pipeline can start to be executed. The execution sequence
indicates a time sequence in which operation instructions of the
plurality of processing units that are transmitted to a
corresponding type of operation pipeline wait for execution.
[0012] In another possible design, a dependency decision tree of
operating the data flow by the plurality of processing units may be
constructed based on the dependency relationship and the execution
sequence, and the synchronization logic is generated based on the
dependency decision tree. The dependency decision tree is
constructed to simplify representation of the dependency
relationship between the operations.
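One simple representation of such a decision tree (an assumed structure; the patent does not specify the node layout) maps each producer operation to the consumers that must wait for it, so the synchronization generator can walk the tree once instead of re-deriving pairwise dependencies.

```python
from collections import defaultdict

def build_decision_tree(deps):
    """deps: list of (producer, consumer) pairs -> {producer: [consumers]}."""
    tree = defaultdict(list)
    for producer, consumer in deps:
        tree[producer].append(consumer)
    return dict(tree)

deps = [("load_r1", "vec_add"), ("load_r2", "vec_add"), ("vec_add", "store")]
print(build_decision_tree(deps))
# {'load_r1': ['vec_add'], 'load_r2': ['vec_add'], 'vec_add': ['store']}
```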
[0013] In another possible design, the synchronization logic
includes a barrier instruction and an event synchronization
instruction, where the event synchronization instruction is
generated based on the dependency relationship, and the barrier
instruction is generated based on the execution sequence. The
barrier instruction and the event synchronization instruction are
generated, so that the barrier instruction and the event
synchronization instruction are inserted into the operation
pipeline, to ensure data processing correctness.
[0014] In another possible design, the barrier instruction is used
to ensure that all operation instructions before the barrier
instruction are executed first before a subsequent operation
instruction can start to be executed. When a single operation
pipeline is blocked, all operation instructions in the operation
pipeline before the barrier instruction are executed first before a
subsequent operation instruction can start to be executed. When all
operation pipelines are blocked, operation instructions in all the
operation pipelines before the barrier instruction are executed
first before a subsequent operation instruction can start to be
executed. The event synchronization instruction is used to ensure
synchronization between operation instructions in different
operation pipelines.
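A hedged sketch of this design (the instruction names are illustrative, not a real ISA): an event-synchronization pair is generated for every cross-pipeline dependency, while a barrier is generated per pipeline to preserve the execution sequence within it.

```python
def generate_sync(deps, pipelines):
    """deps: (producer, consumer) pairs; pipelines: pipeline names."""
    prog = []
    for producer, consumer in deps:
        prog.append(("set_event", producer))   # producer signals completion
        prog.append(("wait_event", consumer))  # consumer waits for the signal
    for pipe in pipelines:
        prog.append(("barrier", pipe))         # order instructions within pipe
    return prog

print(generate_sync([("load", "compute")], ["DMA", "VECTOR"]))
# [('set_event', 'load'), ('wait_event', 'compute'), ('barrier', 'DMA'), ('barrier', 'VECTOR')]
```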
[0015] In another possible design, it may be determined whether the
dependency relationship is transfer dependency. When the dependency
relationship is not transfer dependency, the synchronization logic
is generated, to eliminate transfer dependency between operations,
ensure insertion of an optimal synchronization instruction,
maximize synchronization resource utilization, and reduce
synchronization overheads.
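On my reading, "transfer dependency" here means a dependency already implied transitively: if A must precede B and B must precede C, a direct A-to-C edge is redundant and needs no synchronization instruction. A minimal check (illustrative, single-hop chains only):

```python
def is_transitive(dep, deps):
    """True if `dep` (a, c) is implied by some chain a -> b -> c in deps."""
    a, c = dep
    nodes = {x for edge in deps for x in edge}
    return any((a, b) in deps and (b, c) in deps for b in nodes)

deps = {("A", "B"), ("B", "C"), ("A", "C")}
print(is_transitive(("A", "C"), deps))  # True  -> skip sync generation
print(is_transitive(("A", "B"), deps))  # False -> generate sync
```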
[0016] In another possible design, a buffer includes a first area
and a second area, and a data flow may be written into the first
area. After all data flows are written into the first area, the
first area and the second area are switched to each other, a new
data flow is written into the second area, and the originally
written data flow is read from the first area. In this way, data
processing performance is improved by using a double buffering
technology.
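The double buffering described above can be sketched in a few lines (an illustrative model, not the real buffer API): writes fill the active area; after a full block is written, the two areas swap roles, so the consumer reads the completed area while the producer fills the other one.

```python
class DoubleBuffer:
    def __init__(self):
        self.areas = [[], []]
        self.write_idx = 0  # area currently being written

    def write(self, data):
        self.areas[self.write_idx].append(data)

    def switch(self):
        self.write_idx ^= 1                  # swap write/read roles
        self.areas[self.write_idx] = []      # new write area starts empty

    def read(self):
        return self.areas[self.write_idx ^ 1]  # read the completed area

buf = DoubleBuffer()
buf.write("tile0")
buf.switch()        # tile0 becomes readable; writes now go to the second area
buf.write("tile1")
print(buf.read())   # ['tile0']
```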
[0017] In another possible design, a prefetch request is sent
before a buffer miss may occur, so that a data flow is already
written into the buffer in advance when the data flow is read,
thereby avoiding a processor stall caused by a buffer miss.
Efficient executable code is generated through prefetch
optimization.
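A sketch of this prefetch idea (hypothetical names, single-tile lookahead assumed): issue the fetch for the next tile before the current one is consumed, so each read finds its data already resident instead of stalling.

```python
def process(tiles, fetch):
    """Consume tiles in order, prefetching each next tile one step ahead."""
    resident, out = {}, []
    if tiles:
        resident[tiles[0]] = fetch(tiles[0])              # warm the first tile
    for i, tile in enumerate(tiles):
        if i + 1 < len(tiles):
            resident[tiles[i + 1]] = fetch(tiles[i + 1])  # prefetch ahead
        out.append(resident.pop(tile))                    # read hits: no stall
    return out

print(process(["a", "b"], fetch=str.upper))  # ['A', 'B']
```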
[0018] According to a second aspect, an embodiment of this
application provides a data flow processing apparatus. The data
flow processing apparatus is configured to implement the method and
the functions performed by the compiler in the first aspect, and is
implemented by using hardware/software. The hardware/software
thereof includes units corresponding to the foregoing
functions.
[0019] According to a third aspect, an embodiment of this
application provides a data flow processing device, including: a
processor, a memory, and a communications bus, where the
communications bus is configured to implement connection and
communication between the processor and the memory, and the
processor executes a program stored in the memory, to perform the
steps in the data flow processing method according to the first
aspect.
[0020] In a possible design, the data flow processing device
provided in this embodiment of this application may include a
corresponding module configured to perform an action of the data
flow processing apparatus in the foregoing method design. The
module may be software and/or hardware.
[0021] According to a fourth aspect, an embodiment of this
application provides a computer-readable storage medium, where the
computer-readable storage medium stores an instruction, and when
the instruction runs on a computer, the computer is enabled to
perform the methods according to the foregoing aspects.
[0022] According to a fifth aspect, an embodiment of this
application provides a computer program product including an
instruction, where when the computer program product runs on a
computer, the computer is enabled to perform the methods according
to the foregoing aspects.
BRIEF DESCRIPTION OF DRAWINGS
[0023] To describe the technical solutions in the embodiments of
this application or in the background more clearly, the following
briefly describes the accompanying drawings for describing the
embodiments of this application or the background.
[0024] FIG. 1 is a schematic structural diagram of a TPU according
to an embodiment of this application;
[0025] FIG. 2 is an architectural diagram of a processor according
to an embodiment of this application;
[0026] FIG. 3 is a schematic diagram of converting a virtual thread
parallel program into an explicit synchronous program according to
an embodiment of this application;
[0027] FIG. 4 is a schematic diagram of an effect of interleaving
scheduling optimization by a compiler according to an embodiment of
this application;
[0028] FIG. 5 is a schematic architectural diagram of an
application system according to an embodiment of this
application;
[0029] FIG. 6 is a schematic diagram of synchronization of a
plurality of operation pipelines according to an embodiment of this
application;
[0030] FIG. 7 is a schematic flowchart of a data flow processing
method according to an embodiment of this application;
[0031] FIG. 8 shows descriptive code based on a data flow
description according to an embodiment of this application;
[0032] FIG. 9 is a schematic diagram of a dependency relationship
according to an embodiment of this application;
[0033] FIG. 10 is a schematic diagram of an execution sequence
according to an embodiment of this application;
[0034] FIG. 11(A) is a schematic diagram of transfer dependency
according to an embodiment of this application;
[0035] FIG. 11(B) is a schematic diagram of another transfer
dependency according to an embodiment of this application;
[0036] FIG. 11(C) is a schematic diagram of still another transfer
dependency according to an embodiment of this application;
[0037] FIG. 12 is a schematic structural diagram of a chip
according to an embodiment of this application;
[0038] FIG. 13 shows programming code for explicitly invoking a
synchronization instruction according to an embodiment of this
application;
[0039] FIG. 14 shows programming code based on a data flow
description according to an embodiment of this application;
[0040] FIG. 15 is a schematic structural diagram of a data flow
processing apparatus according to an embodiment of this
application; and
[0041] FIG. 16 is a schematic structural diagram of a data flow
processing device according to this application.
DESCRIPTION OF EMBODIMENTS
[0042] The following describes the embodiments of this application
with reference to the accompanying drawings in the embodiments of
this application.
[0043] FIG. 1 is a schematic structural diagram of a TPU according
to an embodiment of this application. TPU hardware provides a
synchronization instruction between operation pipelines, and the
synchronization instruction may be explicitly invoked by software,
to ensure execution of an instruction time sequence. The TPU
internally has four different types of processing units, and each
processing unit corresponds to an operation pipeline. The TPU not
only includes a core acceleration unit such as a matrix multiply
unit, but also includes a plurality of data buffers. A data flow in
a unified buffer (UB) and a weight queue (weight FIFO) is input
into the matrix multiply unit, and then is output to an activation
by the matrix multiply unit to execute an activation function.
Explicit control synchronization occurs between matrix
multiplication and data access, and a corresponding synchronization
instruction is provided. The synchronization instruction may be
invoked through programming control.
[0044] FIG. 2 is an architectural diagram of a processor according
to an embodiment of this application. The processor internally
includes six different types of processing units, where each
processing unit corresponds to one operation pipeline, and all
pipelines include three DMA pipelines and three neural functional
units (NFU). The three NFUs are separately configured to be
responsible for operations such as multiplication, accumulation,
and activation. The processor provides a synchronization
instruction between operation pipelines, and the synchronization
instruction may be invoked through programming control.
[0045] In conclusion, the foregoing two manners simplify hardware
design, but have highly difficult programming. In addition, a
synchronization instruction is directly exposed to an upper-layer
developer, causing severe coupling between a program and hardware
and hindering a hardware upgrade or code migration. To resolve the
foregoing problem, a TVM may be used to perform synchronous
analysis and parallel optimization. A virtual thread binding
mechanism is introduced in the TVM to describe a relationship
between a service operation and an underlying execution unit,
thereby ensuring highly-concurrent synchronous control. A user
needs to explicitly specify a virtual thread ID corresponding to a
task. Each tensor operation in the task is mapped to each operation
pipeline ID according to a certain rule. In terms of semantics,
serial execution is performed within a virtual thread, and parallel
execution is performed between virtual threads. The TVM analyzes a
time sequence relationship between operations, inserts a
synchronization instruction into a virtual thread to ensure serial
execution, and interleaves scheduling optimization between virtual
threads.
[0046] FIG. 3 is a schematic diagram of converting a virtual thread
parallel program into an explicit synchronous programming model
according to an embodiment of this application. FIG. 3 includes:
Step 0: A program with a relatively high abstraction level
describes a virtual thread by using an annotation. Step 1: Add a
synchronization instruction, where push_dep_to is a production
interface of a synchronization message, and pop_dep_from is a
consumption interface of a synchronization message. Step 2: Map a
plurality of virtual threads to one (physical entity) thread, to
interleave scheduling optimization. FIG. 4 is a schematic diagram
of an effect of interleaving scheduling optimization by a compiler
according to an embodiment of this application. Operations of the
two virtual threads (a virtual thread 0 and a virtual thread 1) on
which a compiler has interleaved scheduling optimization may be
performed in parallel. However, serial execution within the virtual
thread suppresses parallel execution of a plurality of operation
pipelines of a hardware accelerator, and affects compilation
performance and data processing efficiency of the compiler.
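A rough Python model of this lowering (simplified; `push_dep_to` and `pop_dep_from` are the interface names given in the text, while the round-robin interleaving policy is an assumption): within one virtual thread, a token is pushed after each operation and popped before the next, which forces serial execution inside the thread even when the threads are interleaved onto one physical stream.

```python
def lower(vthreads):
    """vthreads: {tid: [op, ...]} -> interleaved stream with sync tokens."""
    stream = []
    depth = max(len(ops) for ops in vthreads.values())
    for step in range(depth):                 # round-robin interleave the threads
        for tid, ops in vthreads.items():
            if step < len(ops):
                if step > 0:
                    stream.append((tid, "pop_dep_from"))  # wait for own prior op
                stream.append((tid, ops[step]))
                if step + 1 < len(ops):
                    stream.append((tid, "push_dep_to"))   # signal own next op
    return stream

s = lower({0: ["load", "compute"], 1: ["load", "compute"]})
print(s)
```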
[0047] FIG. 5 is a schematic architectural diagram of an
application system according to an embodiment of this application.
The application system includes a plurality of memory units (MEM)
and a plurality of operation pipelines (P0, P1, P2, P3, and P4). A
hardware accelerator is generally designed based on a principle of
decoupling data access from computing, and internally provides a
plurality of parallel operation pipelines to execute specific types
of operations. After an operation instruction is sent, immediate
returning is performed without waiting for actual completion of an
operation, thereby improving execution concurrency of the plurality
of operation pipelines. However, time sequence consistency between
the plurality of operation pipelines needs to be ensured. When
instructions of operation pipelines are concurrently executed, if
there is data dependency between the operation pipelines, a
synchronization instruction needs to be invoked to synchronize an
execution sequence between the operation pipelines. FIG. 6 is a
schematic diagram of synchronization of a plurality of operation
pipelines according to an embodiment of this application. An
operation pipeline is used as an example. A synchronization
operation process includes: first, waiting for completion of a
write operation performed by a predecessor execution unit; second,
waiting for completion of a read operation performed by a
successor execution unit; third, executing an instruction; fourth,
instructing the successor execution unit to read data; and fifth,
instructing the predecessor execution unit to write data. In
conclusion, a synchronization instruction needs to be
inserted before and after execution of an operation of each
operation pipeline, to ensure a data dependency sequence between
the operation of each operation pipeline and the predecessor
execution unit and between the operation of each operation pipeline
and the successor execution unit. Based on the foregoing design
principle, an embodiment of this application provides the following
technical solution.
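The five steps described for FIG. 6 can be sketched as a wrapper that every pipeline operation passes through (a plain Python step list for illustration; the real hardware issues synchronization instructions instead):

```python
def pipeline_steps(op):
    """Synchronization steps wrapped around one pipeline operation (FIG. 6)."""
    return [
        "wait: predecessor write complete",  # 1. predecessor finished writing
        "wait: successor read complete",     # 2. successor finished reading
        f"execute: {op}",                    # 3. run the operation itself
        "signal: successor may read",        # 4. tell the successor to read data
        "signal: predecessor may write",     # 5. tell the predecessor to write data
    ]

print(pipeline_steps("matmul"))
```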
[0048] FIG. 7 is a schematic flowchart of a data flow processing
method according to an embodiment of this application. This
embodiment of this application includes the following steps.
[0049] S701: Obtain a dependency relationship and an execution
sequence of operating a data flow by a plurality of processing
units.
[0050] During specific implementation, a compiler may obtain
descriptive code used to describe the data flow, and determine the
dependency relationship and the execution sequence based on the
descriptive code. The descriptive code includes at least one of a
keyword used to define a buffer variable, a keyword used to
describe a read operation and a write operation for buffering the
data flow, an operator used to specify a write buffer variable, and
a keyword used to specify a read buffer variable. A user defines a
buffer and an operation pipeline and specifies a read buffer and a
write buffer of the operation pipeline to describe the data flow.
A synchronization manner based on a data flow description hides
hardware synchronization details, simplifies programming, and
decouples a hardware architecture and software development to
facilitate separate upgrades of software and hardware.
[0051] Certainly, in this embodiment of this application, the
dependency relationship and the execution sequence of operating the
data flow by the plurality of processing units may alternatively be
obtained in another manner.
[0052] For example, a language for describing synchronization
between a plurality of pipelines based on a data flow is designed,
and seven keywords (make_buffer, Buffer, rawPtr, Pipeline,
Stage, depend_on, and clear) and the operators "<-" and "<-+" are
extended.
extended. make_buffer and Buffer are used to define a buffer
variable. rawPtr is used to obtain an address of a buffer variable.
Stage is used to describe a read operation and a write operation
for buffering a data flow. depend_on( ) is used to indicate that a
buffer variable in brackets is a read buffer variable of a current
operation. Pipeline is used to describe a data flow to be
synchronized. clear is used to switch to a next area of double
buffers. "<-" and "<-+" are used to specify that a buffer
variable before the operator is a write buffer variable of a
current operation, where after "<-" is executed, the double
buffers are automatically switched.
[0053] FIG. 8 shows descriptive code based on a data flow
description according to an embodiment of this application. A first
row of code is used to define an address of a buffer variable r1. A
second row of code is used to define an address of a buffer
variable r2. A third row of code is used to describe a function
range of a data flow. A fifth row of code is used to describe
writing, at a stage two of an operation pipeline, data into the
address indicated by r1. A sixth row of code is used to obtain the
specific address indicated by r1. An eighth row of code is used to
describe writing, at a stage three of the operation pipeline, data
into an address indicated by r2. A ninth row of code is used to
obtain a specific address indicated by r2. An eleventh row of code
is used to describe reading, at a stage four, data from the
addresses indicated by r1 and r2. A twelfth row of code is used to
obtain the specific addresses indicated by r1 and r2. Buffer
variables r1 and r2 before "<-" and buffer variables r1 and r2
of depend_on form a production and consumption relationship between
operation pipelines. An operation at the stage four depends on an
operation at the stage two and an operation at the stage three.
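The producer/consumer relationship that a compiler can recover from such a description may be illustrated with a minimal Python sketch. The class and method names here are illustrative only, not the actual API of the DSL described above:

```python
# Toy model of the data-flow description: stages record which buffers
# they write (the "<-" operator) and read (depend_on), so a
# compiler-like pass can recover producer/consumer dependencies.

class Buffer:
    def __init__(self, name):
        self.name = name
        self.writers = []   # stages that write this buffer

class Stage:
    def __init__(self, name, writes=None, reads=None):
        self.name = name
        self.writes = writes or []
        self.reads = reads or []
        for buf in self.writes:
            buf.writers.append(self)

    def dependencies(self):
        # A stage depends on every stage that writes a buffer it reads.
        return [w.name for buf in self.reads for w in buf.writers]

# Mirrors the FIG. 8 walkthrough: two producers, one consumer.
r1 = Buffer("r1")
r2 = Buffer("r2")
produce1 = Stage("produce1", writes=[r1])
produce2 = Stage("produce2", writes=[r2])
consumer = Stage("consumer", reads=[r1, r2])

print(consumer.dependencies())  # ['produce1', 'produce2']
```

As in the text, the stage reading r1 and r2 is found to depend on both stages that write them.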
[0054] The dependency relationship indicates that because operation
instructions in the plurality of operation pipelines access, that
is, read and write, a same storage address, an operation
instruction in one operation pipeline is executed first before an
operation instruction in another operation pipeline can start to be
executed. The execution sequence (which may also be referred to as
an instruction transmission sequence) indicates a time sequence in
which operation instructions of the plurality of processing units
that are transmitted to a corresponding type of operation pipeline
wait for execution. The algorithm mainly considers the time
sequence, that is, the software-side sequence in which operation
instructions are transmitted to a corresponding operation pipeline.
However, an actual time sequence of hardware execution may be
different from the execution sequence.
[0055] For example, as shown in FIG. 8, operations in three
operation pipelines are respectively produce1, produce2, and
consumer. The produce1 performs write access to storage space r1,
the produce2 performs write access to storage space r2, and the
consumer performs read access to the storage space r1 and the
storage space r2. In this way, there is a corresponding dependency
relationship of first writing and then reading. Therefore, produce1
and produce2 operations are completed first before a consumer
operation can start to be performed. In other words, there is a
dependency relationship between the consumer operation and the
produce1 and between the consumer operation and the produce2. In
FIG. 8, the execution sequence is a time sequence in which
operation instructions transmitted to a corresponding operation
pipeline wait for execution. Because the code contains a
loop, the execution sequence should be: produce1 (first
iteration)--->produce2 (first iteration)--->consumer (first
iteration)--->produce1 (second iteration)--->produce2 (second
iteration)--->consumer (second iteration).
[0056] Further, as shown in FIG. 9, for an access operation of
Buffer a0, four operation pipelines Stage 1, Stage 2, Stage 3, and
Stage 4 are provided. If Stage 1, Stage 2, Stage 3, and Stage 4 are
executed sequentially, where a0 is a double buffer, Stage 1
and Stage 3 are write operations, and Stage 2 and Stage 4 are read
operations, then Stage 1 and Stage 2 respectively write and read a
ping address of a0, and Stage 3 and Stage 4 respectively write and
read a pong address of a0.
Therefore, there is a dependency relationship between Stage 1 and
Stage 2, and there is a dependency relationship between Stage 3 and
Stage 4.
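The ping/pong assignment above can be sketched in Python. This is a toy model, assuming (as described) that consecutive write/read pairs alternate between the two halves of a0, so only stages touching the same half depend on each other:

```python
# Ping/pong address assignment from the Stage 1..4 example: with a
# double buffer a0, Stages 1 and 2 hit the ping half and Stages 3 and
# 4 hit the pong half, so only same-half write/read pairs need sync.

def buffer_half(stage_index):
    # Stages 1,2 -> half 0 (ping); stages 3,4 -> half 1 (pong);
    # the pattern repeats every four stages.
    return ((stage_index - 1) // 2) % 2

# Dependencies exist only between a write and a read of the same half.
pairs = [(w, r) for w in (1, 3) for r in (2, 4)
         if buffer_half(w) == buffer_half(r)]
print(pairs)  # [(1, 2), (3, 4)]
```

This reproduces the conclusion in the text: Stage 1 pairs with Stage 2, and Stage 3 with Stage 4, while the cross pairs need no synchronization.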
[0057] Further, as shown in FIG. 10, statements of an operation A
and an operation B are in a for loop, and the loop iterates 10
times. A compiler may determine an execution sequence of any two
operations based on a position of each operation in the for loop
and a quantity of iterations of the loop in which each operation is
located. A quantity of code rows indicates a position of a single
operation in the for loop, and a loop variable indicates a quantity
of iteration executions. When two operations are in different loop
iterations, an instance with a smaller loop iteration variable
occurs earlier. For example, (3, {i=0}) indicates the operation B
in a first loop iteration, and (2, {i=1}) indicates the operation A
in a second loop iteration. Because the loop iteration variable of
the operation B is smaller than that of the operation A, the
operation B is performed before the operation A. When two operation
instances are in a same loop iteration, an operation instance with
a front code location occurs earlier. For example, (2, {i=1})
indicates the operation A in the second loop iteration, and (3,
{i=1}) indicates the operation B in the second loop iteration.
Because the code location of the operation A is before that of the
operation B, the operation A is performed before the operation B.
When quantities of loop iterations of two operation instances are
indeterminate, an earlier operation may be determined based on
values of loop iteration variables x and y.
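The comparison rule above amounts to a lexicographic comparison of (code line, loop iteration) pairs, with the iteration variable compared first. A minimal Python sketch, with illustrative names:

```python
# An operation instance is (code_line, i), where i is the loop
# iteration variable, as in the (3, {i=0}) notation in FIG. 10.
# Earlier iteration wins; within one iteration, the earlier code
# line wins.

def occurs_before(a, b):
    line_a, iter_a = a
    line_b, iter_b = b
    if iter_a != iter_b:
        return iter_a < iter_b   # smaller iteration variable first
    return line_a < line_b       # same iteration: earlier line first

# Operation B in the first iteration vs. operation A in the second:
print(occurs_before((3, 0), (2, 1)))  # True: B (i=0) before A (i=1)
# Operations A and B in the same (second) iteration:
print(occurs_before((2, 1), (3, 1)))  # True: A (line 2) before B (line 3)
```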
[0058] S702: Generate synchronization logic based on the dependency
relationship and the execution sequence. The synchronization logic
may also be referred to as a synchronization instruction.
[0059] During specific implementation, the dependency relationship
indicates that a first operation instruction in an operation
pipeline of a first processing unit of the plurality of processing
units is executed first before a second operation instruction in an
operation pipeline of a second processing unit of the plurality of
processing units starts to be executed. The execution sequence
indicates a time sequence in which operation instructions of the
plurality of processing units that are transmitted to a
corresponding type of operation pipeline wait for execution. The
synchronization logic includes a barrier instruction and an event
synchronization instruction, and the barrier instruction may be
generated based on the execution sequence. The barrier instruction
is used to ensure that all operation instructions before the
barrier instruction are executed first before a subsequent
operation instruction can start to be executed. When a single
operation pipeline is blocked, all operation instructions in the
operation pipeline before the barrier instruction are executed
first before a subsequent operation instruction can start to be
executed. When all operation pipelines are blocked, operation
instructions in all the operation pipelines before the barrier
instruction are executed first before a subsequent operation
instruction can start to be executed. The event synchronization
instruction may be generated based on the dependency relationship.
The event synchronization instruction is used to ensure
synchronization between operation instructions in different
operation pipelines. For example, all operation instructions before
an operation instruction in an operation pipeline M are executed
first before an operation instruction after an operation
instruction in an operation pipeline V can start to be
executed.
[0060] Optionally, not all dependency relationships between
operations require generation of a synchronization instruction. It
may be determined whether the dependency relationship is a transfer
dependency, where a transfer dependency is a dependency relationship
that arises transitively across a plurality of operations. When the
dependency relationship is the
transfer dependency, the synchronization logic is not generated,
and when the dependency relationship is not the transfer
dependency, the synchronization logic is generated, to eliminate
transfer dependency between operations, ensure insertion of an
optimal synchronization instruction, maximize synchronization
resource utilization, and reduce synchronization overheads.
[0061] For example, as shown in FIG. 11(A), there are three
operations H, I, and J. There is a dependency relationship between
J and H and between J and I, and there is a dependency relationship
between I and H. In this case, there is transfer dependency between
H and J. Because J and I are synchronized, I and H are
synchronized, and J and H are already synchronized while J and I
are synchronized, no synchronization instruction needs to be
generated for J and H. Further, as shown in FIG. 11(B), for three
operations H, I, and J, there is a dependency relationship between H
and I, and I and J are operations of a same pipeline. In this case,
there is transfer dependency between H and J. Because operations of
a same pipeline start to be executed sequentially, and H and J are
implicitly synchronized while H and I are synchronized, no
synchronization instruction needs to be generated for J and H.
Further, as shown in FIG. 11(C), for three operations H, I, and J,
there is a dependency relationship between I and J, and H and I are
operations of a same pipeline. In this case, there is transfer
dependency between H and J. Because operations of a same pipeline
start to be executed sequentially, and J and H are implicitly
synchronized while J and I are synchronized, no synchronization
instruction needs to be generated for J and H.
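The pruning of transfer dependencies in case (A) can be sketched as a transitive-reduction pass over a dependency graph. The following Python code is a hypothetical illustration of the idea, not the compiler's actual implementation:

```python
# Transfer-dependency pruning, as in FIG. 11(A): if J depends on I
# and I depends on H, the direct J->H edge needs no synchronization
# instruction of its own. A reachability check drops such edges.

def prune_transitive(deps):
    """deps: {op: set of ops it directly depends on}. Returns a pruned copy."""
    def reachable(src, dst, skip_edge):
        # Depth-first search that ignores the edge being tested.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            for nxt in deps.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    pruned = {op: set(d) for op, d in deps.items()}
    for op, direct in deps.items():
        for dep in direct:
            if reachable(op, dep, skip_edge=(op, dep)):
                pruned[op].discard(dep)  # covered transitively, no sync needed
    return pruned

# FIG. 11(A): J depends on H and I; I depends on H.
deps = {"J": {"H", "I"}, "I": {"H"}}
print(prune_transitive(deps))  # {'J': {'I'}, 'I': {'H'}}: J->H dropped
```

Cases (B) and (C) would additionally use the rule that operations in a same pipeline are implicitly ordered, which can be modeled by adding implicit same-pipeline edges before pruning.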
[0062] Further, a dependency decision tree of operating the data
flow by the plurality of processing units may be constructed based
on the dependency relationship and the execution sequence; and the
synchronization logic is generated based on the dependency decision
tree, thereby simplifying representation of the dependency
relationship between the operations by constructing the dependency
decision tree. The dependency decision tree is a tree-like
structure, where each node in the tree-like structure represents an
operation, an inter-layer relationship in the tree-like structure
represents the execution sequence, and a connection relationship in
the tree-like structure may indicate that there is a dependency
relationship between two operations.
[0063] S703: Insert the synchronization logic into an operation
pipeline of each of the plurality of processing units, to generate
executable code.
[0064] Optionally, a buffer may include a first area and a second
area, and a data flow may be written into the first area. After all
data flows are written into the first area, the first area and the
second area are switched to each other, a new data flow is written
into the second area, and the originally written data flow is read
from the first area. In this way, data processing performance is
improved by using a double buffering technology.
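The double buffering scheme described above can be sketched in Python. This is a toy model; the class and method names are illustrative:

```python
# Double buffering (ping/pong): writes fill one area while reads
# drain the other; after a full write, the two areas switch roles.

class DoubleBuffer:
    def __init__(self, size):
        self.areas = [[None] * size, [None] * size]
        self.write_idx = 0          # area currently being written

    def write(self, data):
        self.areas[self.write_idx][:len(data)] = data

    def switch(self):
        # After all data is written, swap roles: new writes go to the
        # other area and the just-filled area becomes readable.
        self.write_idx ^= 1

    def read(self):
        # Read from the area filled before the last switch.
        return self.areas[self.write_idx ^ 1]

buf = DoubleBuffer(2)
buf.write([1, 2])   # fill the first area
buf.switch()        # first area becomes readable, second writable
buf.write([3, 4])   # fill the second area while the first is read
print(buf.read())   # [1, 2]
```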
[0065] Optionally, a prefetch request may be sent before the buffer
becomes invalid, so that a data flow is already written into the
buffer in advance when the data flow is read, thereby avoiding a
processor stall caused by a buffer miss. Efficient executable code
is generated through prefetch optimization.
[0066] For example, FIG. 12 is a schematic structural diagram of a
chip according to an embodiment of this application. The chip
includes six parallel operation pipelines: a scalar pipeline
(PIPE_S), a vector pipeline (PIPE_V), a matrix pipeline (PIPE_M),
and three DMA pipelines (PIPE_MTE1, PIPE_MTE2, and PIPE_MTE3).
All instructions first uniformly enter the scalar pipeline, and
then the scalar pipeline distributes the instructions to other
operation pipelines. As can be learned from FIG. 12, the chip
internally includes a plurality of levels of memory space such as a
buffer L1, a buffer A, a buffer B, a buffer C, and a uniform
buffer. When there is a data dependency between operations of
various operation pipelines in these levels of memory space, the
synchronization logic needs to be used to ensure an execution
sequence of the instructions.
[0067] It should be understood that the synchronization logic of
the operation pipelines is provided inside the chip, and the
synchronization logic includes a barrier instruction pipe_barrier
(pipe) and event synchronization instructions set_flag(pipe,
tripperp, eventId) and wait_flag(pipe, tripperp, eventId). The
barrier instruction is used to ensure that all instructions before
the barrier instruction are executed first before a subsequent
instruction can start to be executed. The parameter pipe is used to
specify an operation pipeline. When a single operation pipeline is
blocked, all instructions in the operation pipeline before the
barrier instruction are executed first before a subsequent
instruction can start to be executed. When all operation pipelines
are blocked, instructions in all the operation pipelines before the
barrier instruction are executed first before a subsequent
instruction can start to be executed. set_flag and wait_flag
respectively indicate setting and waiting of a synchronization
event, pipe indicates an operation pipeline of a setting event,
tripperp indicates an operation pipeline of a waiting event, eventId
indicates an event ID, and set_flag and wait_flag need to be used
in pairs.
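The paired set_flag/wait_flag semantics can be mimicked with ordinary thread events. The following Python sketch is only an analogy for the hardware instructions; the pipeline names and event ID are illustrative:

```python
import threading

# Analogy for set_flag/wait_flag: the real instructions take
# (pipe, tripperp, eventId); here each such triple maps to one
# threading.Event, and the two calls must be used in pairs.

events = {}

def _event(key):
    return events.setdefault(key, threading.Event())

def set_flag(pipe, tripperp, event_id):
    _event((pipe, tripperp, event_id)).set()

def wait_flag(pipe, tripperp, event_id):
    _event((pipe, tripperp, event_id)).wait()

result = []

def producer():
    result.append("write r1")   # e.g. a DMA pipeline fills a buffer
    set_flag("MTE2", "V", 0)    # signal: the buffer is ready

def consumer():
    wait_flag("MTE2", "V", 0)   # block until the producer signals
    result.append("read r1")    # the vector pipeline may now read

t_c = threading.Thread(target=consumer)
t_p = threading.Thread(target=producer)
t_c.start(); t_p.start()
t_c.join(); t_p.join()
print(result)  # ['write r1', 'read r1']: the write always comes first
```

Because the consumer blocks on the event until the producer sets it, the read can never be observed before the write, which is exactly the ordering guarantee the event synchronization instruction provides between pipelines.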
[0068] FIG. 13 shows programming code for explicitly invoking a
synchronization instruction according to an embodiment of this
application. The programming code is used to process an activation
function (rectified linear unit, ReLu) operator. Implementation of
the ReLu operator in a chip includes three operations: a first
operation of loading data from a global memory to a first UB
memory; a second operation of reading data from a UB to perform a
vector operation and writing a vector operation result to a second
UB memory; and a third operation of returning data in the first UB
memory to the global memory. Because there is a dependency
relationship between the three operations in the UB memory,
set_flag and wait_flag need to be explicitly inserted to ensure an
execution sequence of instructions. The synchronization logic has
been marked in FIG. 13. The second operation is used as an example.
The data in the UB needs to be read to perform the vector
operation, and the vector operation result is written into another
UB memory. Therefore, waiting is required before the second
operation is performed.
[0069] Corresponding to the explicit invoking manner shown in FIG.
13, FIG. 14 shows programming code based on a data flow description
according to an embodiment of this application. A user needs to
define only a buffer and an operation pipeline and specify a read
buffer and a write buffer of the operation pipeline to describe a
dependency relationship and an execution sequence of a data flow.
For example, the buffer is defined by make_buffer ((half*)
flowTable->ubInputAddr[0]), a first operation pipeline is
defined by Stage outToUb, a second operation pipeline is defined by
Stage vector_rule, and a third operation pipeline is defined by
Stage ubToOut. After the foregoing coding is completed, a compiler
may perform analysis based on a data flow description specified by
the user, determine the dependency relationship and the execution
sequence to generate synchronization logic, and insert the
synchronization logic into target code to generate executable code,
to achieve a same effect as that of the programming code shown in
FIG. 13. However, compared with a manner of explicitly invoking a
synchronization instruction, in a synchronization manner based on a
data flow description, the synchronization logic does not need to
be inserted into programming code, and instead the compiler
automatically inserts the synchronization logic after performing
dependency analysis.
[0070] In this embodiment of this application, a user defines a
buffer and an operation pipeline and specifies a read buffer and a
write buffer of the operation pipeline to describe the data flow. A
synchronization manner based on a data flow description hides
hardware synchronization details, simplifies programming, and
decouples a hardware architecture and software development to
facilitate software and hardware upgrades. In addition, the compiler
may determine the dependency relationship and the execution
sequence between operations through serialization analysis, and
automatically insert the synchronization logic. Further, the
transfer dependency is eliminated, and insertion of an optimal
synchronization instruction is ensured, thereby improving
performance of the compiler and data processing efficiency.
[0071] The foregoing describes the method in the embodiments of
this application in detail. The following provides an apparatus in
the embodiments of this application.
[0072] FIG. 15 is a schematic structural diagram of a data flow
processing apparatus according to an embodiment of this
application. The data flow processing apparatus may include: an
obtaining module 1501 and a processing module 1502. Detailed
descriptions of the modules are as follows:
[0073] The obtaining module 1501 is configured to obtain a
dependency relationship and an execution sequence of operating a
data flow by a plurality of processing units.
[0074] The processing module 1502 is configured to generate
synchronization logic based on the dependency relationship and the
execution sequence.
[0075] The processing module 1502 is further configured to insert
the synchronization logic into an operation pipeline of each of the
plurality of processing units, to generate executable code.
[0076] The processing module 1502 is further configured to: obtain
descriptive code used to describe the data flow; and determine the
dependency relationship and the execution sequence based on the
descriptive code.
[0077] The descriptive code includes at least one of a keyword used
to define a buffer variable, a keyword used to describe a read
operation and a write operation for buffering the data flow, an
operator used to specify a write buffer variable, and a keyword
used to specify a read buffer variable.
[0078] The dependency relationship indicates that a first operation
instruction in an operation pipeline of a first processing unit of
the plurality of processing units is executed first before a second
operation instruction in an operation pipeline of a second
processing unit of the plurality of processing units starts to be
executed. The execution sequence indicates a time sequence in which
operation instructions of the plurality of processing units that
are transmitted to a corresponding type of operation pipeline wait
for execution.
[0079] The processing module 1502 is further configured to:
construct, based on the dependency relationship and the execution
sequence, a dependency decision tree of operating the data flow by
the plurality of processing units, and generate the synchronization
logic based on the dependency decision tree.
[0080] The processing module 1502 is further configured to:
generate an event synchronization instruction based on the
dependency relationship; and generate a barrier instruction based
on the execution sequence.
[0081] The processing module 1502 is further configured to:
determine whether the dependency relationship is transfer
dependency; and when the dependency relationship is not transfer
dependency, generate the synchronization logic.
[0082] It should be noted that, for implementation of each module,
reference may be made to the corresponding descriptions in the
method embodiment shown in FIG. 7, and the modules perform the
method and functions performed by the compiler in the foregoing
embodiments.
[0083] FIG. 16 is a schematic structural diagram of a data flow
processing device according to this application. As shown in FIG.
16, the data flow processing device may include: at least one
processor 1601, at least one communications interface 1602, at
least one memory 1603, and at least one communications bus
1604.
[0084] The processor 1601 may be a central processing unit, a
general-purpose processor, a digital signal processor, an
application-specific integrated circuit, a field programmable gate
array or another programmable logical device, a transistor logical
device, a hardware component, or any combination thereof. The
processor may implement or execute various example logical blocks,
modules, and circuits described with reference to content disclosed
in this application. Alternatively, the processor may be a
combination of processors implementing a computing function, for
example, a combination of one or more microprocessors, or a
combination of the digital signal processor and a microprocessor.
The communications bus 1604 may be a peripheral component
interconnect PCI bus, an extended industry standard architecture
EISA bus, or the like. The bus may be classified into an address
bus, a data bus, a control bus, and the like. For ease of
representation, only one thick line is used to represent the bus in
FIG. 16, but this does not mean that there is only one bus or only
one type of bus. The communications bus 1604 is configured to
implement connection and communication between these components.
The communications interface 1602 of the device in this embodiment
of this application is configured to perform signaling or data
communication with another node device. The memory 1603 may include
a volatile memory, for example, a non-volatile dynamic random
access memory (NVRAM), a phase change random access memory (PRAM),
or a magnetoresistive random access memory (MRAM), and may further
include a non-volatile memory, for example, at least one magnetic
disk storage device, an electrically erasable programmable
read-only memory (EEPROM), a flash memory device such as a NOR
flash memory or a NAND flash memory, or a semiconductor device such
as a solid state disk (SSD). Optionally, the memory 1603 may
alternatively be at least one storage apparatus far away from the
processor 1601. Optionally, the memory 1603 may further store a set
of program code. Optionally, the processor 1601 may further execute
a program stored in the memory 1603, to perform the following
operations:
[0085] obtaining a dependency relationship and an execution
sequence of operating a data flow by a plurality of processing
units;
[0086] generating synchronization logic based on the dependency
relationship and the execution sequence; and
[0087] inserting the synchronization logic into an operation
pipeline of each of the plurality of processing units, to generate
executable code.
[0088] Optionally, the processor 1601 is further configured to
perform the following operations:
[0089] obtaining descriptive code used to describe the data flow;
and
[0090] determining the dependency relationship and the execution
sequence based on the descriptive code.
[0091] The descriptive code includes at least one of a keyword used
to define a buffer variable, a keyword used to describe a read
operation and a write operation for buffering the data flow, an
operator used to specify a write buffer variable, and a keyword
used to specify a read buffer variable.
[0092] The dependency relationship indicates that a first operation
instruction in an operation pipeline of a first processing unit of
the plurality of processing units is executed first before a second
operation instruction in an operation pipeline of a second
processing unit of the plurality of processing units starts to be
executed. The execution sequence indicates a time sequence in which
operation instructions of the plurality of processing units that
are transmitted to a corresponding type of operation pipeline wait
for execution.
[0093] Optionally, the processor 1601 is further configured to
perform the following operations:
[0094] constructing, based on the dependency relationship and the
execution sequence, a dependency decision tree for operating the
data flow by the plurality of processing units; and
[0095] generating the synchronization logic based on the dependency
decision tree.
[0096] Optionally, the processor 1601 is further configured to
perform the following operations:
[0097] generating an event synchronization instruction based on the
dependency relationship; and
[0098] generating a barrier instruction based on the execution
sequence.
[0099] Optionally, the processor 1601 is further configured to
perform the following operations:
[0100] determining whether the dependency relationship is transfer
dependency; and
[0101] generating the synchronization logic when the dependency
relationship is not transfer dependency.
[0102] Further, the processor may further cooperate with the memory
and the communications interface to perform operations of the data
flow processing apparatus in the foregoing embodiments of this
application.
[0103] All or some of the foregoing embodiments may be implemented
by using software, hardware, firmware, or any combination thereof.
When software is used to implement the embodiments, the embodiments
may be implemented completely or partially in a form of a computer
program product. The computer program product includes one or more
computer instructions. When the computer program instructions are
loaded and executed on the computer, the procedure or functions
according to the embodiments of this application are all or
partially generated. The computer may be a general-purpose
computer, a dedicated computer, a computer network, or other
programmable apparatuses. The computer instructions may be stored
in a computer-readable storage medium or may be transmitted from a
computer-readable storage medium to another computer-readable
storage medium. For example, the computer instructions may be
transmitted from a website, computer, server, or data center to
another website, computer, server, or data center in a wired (for
example, a coaxial cable, an optical fiber, or a digital subscriber
line (DSL)) or wireless (for example, infrared, radio, or
microwave) manner. The computer-readable storage medium may be any
usable medium accessible by a computer, or a data storage device,
such as a server or a data center, integrating one or more usable
media. The usable medium may be a magnetic medium (for example, a
floppy disk, a hard disk, or a magnetic tape), an optical medium
(for example, a DVD), a semiconductor medium (for example, a
solid-state drive (SSD)), or the like.
[0104] The objectives, technical solutions, and beneficial effects
of this application are further described in detail in the
foregoing specific implementations. Any modification, equivalent
replacement, or improvement made without departing from the spirit
and principle of this application shall fall within the protection
scope of this application.
* * * * *