U.S. patent application number 17/119234 was filed with the patent office on December 11, 2020 and published on 2021-05-27 for an information processing method and terminal device.
The applicant listed for this patent is Shanghai Cambricon Information Technology Co., Ltd. Invention is credited to Tianshi CHEN, Shuai HU, Shaoli LIU, Zai WANG.
Application Number | 20210158484 17/119234 |
Document ID | / |
Family ID | 1000005415628 |
Filed Date | 2020-12-11 |
United States Patent
Application |
20210158484 |
Kind Code |
A1 |
CHEN; Tianshi; et al. |
May 27, 2021 |
INFORMATION PROCESSING METHOD AND TERMINAL DEVICE
Abstract
Disclosed are an information processing method and a terminal
device. The method comprises: acquiring first information, wherein
the first information is information to be processed by a terminal
device; calling an operation instruction in a calculation apparatus
to calculate the first information so as to obtain second
information; and outputting the second information. By means of the
examples in the present disclosure, a calculation apparatus of a
terminal device can be used to call an operation instruction to
process first information, so as to output second information of a
target desired by a user, thereby improving the information
processing efficiency. The present technical solution has
advantages of a fast computation speed and high efficiency.
Inventors: |
CHEN; Tianshi; (Pudong New Area, CN)
LIU; Shaoli; (Pudong New Area, CN)
WANG; Zai; (Pudong New Area, CN)
HU; Shuai; (Pudong New Area, CN) |
Applicant: |
Name | City | State | Country | Type |
Shanghai Cambricon Information Technology Co., Ltd. | Pudong New Area | | CN | |
Family ID: |
1000005415628 |
Appl. No.: |
17/119234 |
Filed: |
December 11, 2020 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
16760235 | Jul 23, 2020 | |
PCT/CN2018/105463 | Sep 13, 2018 | |
17119234 | | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G06K 9/46 20130101; G06T
3/4053 20130101; G06T 1/60 20130101; G06T 1/20 20130101; G06T
3/4046 20130101 |
International
Class: |
G06T 3/40 20060101
G06T003/40; G06T 1/20 20060101 G06T001/20; G06T 1/60 20060101
G06T001/60; G06K 9/46 20060101 G06K009/46 |
Claims
1. An information processing method applied to a computation
circuit, wherein the computation circuit includes a communication
circuit and operation circuit, and the method comprises:
controlling, by the computation circuit, the communication circuit
to obtain a first image to be processed, wherein the first image
has a resolution of a first-level size; controlling, by the
computation circuit, the operation circuit to obtain and execute an
operation instruction to perform resolution optimization on the
first image to obtain a second image, wherein the second image has
a resolution of a second-level size, the first-level size is
smaller than the second-level size, and the operation instruction
is a preset instruction for optimizing an image resolution.
2. The method of claim 1, wherein the controlling, by the
computation circuit, the communication circuit to obtain a first
image to be processed includes: controlling, by the computation
circuit, the communication circuit to obtain an original image to
be processed input by a user, wherein the original image has a
resolution of the first-level size, and controlling, by the
computation circuit, the operation circuit to pre-process the
original image to obtain the first image to be processed, wherein
the pre-processing is an operation preset by a user side or a
terminal side.
3. The method of claim 1, wherein the computation circuit further
includes a register circuit and a controller circuit, and the
controlling, by the computation circuit, the operation circuit to
obtain and call an operation instruction to perform resolution
optimization on the first image, so as to obtain the second image
includes: controlling, by the computation circuit, the controller
circuit to fetch an operation instruction from the register
circuit, and sending, by the computation circuit, the operation
instruction to the operation circuit, controlling, by the
computation circuit, the operation circuit to call the operation
instruction to perform feature extraction on the first image to
obtain a feature image, and controlling, by the computation
circuit, the operation circuit to pre-process the feature image to
obtain the second image, wherein the pre-processing is an operation
preset by a user side or a terminal side.
4. The method of claim 3, wherein the pre-processing includes one
or more of the following processing: translation, scaling
transformation, non-linear transformation, normalization, format
conversion, data deduplication, processing of data exception, and
data missing filling.
5. The method of claim 3, wherein the calling the operation
instruction to perform feature extraction on the first image to
obtain a feature image includes: controlling, by the computation
circuit, the operation circuit to perform feature extraction on the
first image based on an operation instruction set of at least one
thread to obtain a feature image, wherein the operation instruction
set includes at least one operation instruction, and an order of
calling the operation instruction in the operation instruction set
is customized by a user side or a terminal side.
6. The method of claim 1, wherein the computation circuit further
includes a data access circuit and a storage medium, and the
computation circuit controls the operation circuit to send the
second image to the data access circuit and store the second image
in the storage medium.
7. The method of claim 1, wherein the operation circuit includes a
primary operation module and a plurality of secondary operation
modules, wherein the primary operation module is interconnected
with the plurality of secondary operation modules by an
interconnection module, and when the operation instruction is a
convolution operation instruction, the calling the operation
instruction to perform resolution optimization on the first image
includes: controlling, by the computation circuit, the secondary
operation modules to implement a convolution operation of input
data and a convolution kernel in a convolutional neural network
algorithm, wherein the input data is the first image and the
convolutional neural network algorithm corresponds to the
convolution operation instruction, controlling, by the computation
circuit, the interconnection module to implement data transfer
between the primary operation module and the secondary operation
modules, before a forward operation of a neural network fully
connected layer starts, transferring, by the primary operation
module, the input data to each secondary operation module through
the interconnection module, and after the computation of the
secondary operation modules is completed, splicing, by the
interconnection module, output scalars of the respective secondary
operation modules stage by stage to obtain an intermediate vector,
and sending the intermediate vector back to the primary operation
module, and controlling, by the computation circuit, the primary
operation module to splice intermediate vectors corresponding to
all input data into an intermediate result for subsequent
operations.
8. The method of claim 7, wherein the performing subsequent
operations on the intermediate result includes: controlling, by the
computation circuit, the primary operation module to add bias data
to the intermediate result, and then performing an activation
operation.
9. The method of claim 8, wherein the primary operation module
includes a first operation circuit, wherein the first operation
circuit includes a vector addition circuit and an activation
circuit, the controlling, by the computation circuit, the primary
operation module to add bias data to the intermediate result, and
then performing an activation operation includes: controlling, by
the computation circuit, the vector addition circuit to implement a
bias addition operation of a convolutional neural network operation
and perform element-wise addition on bias data and the intermediate
result to obtain a bias result, and controlling, by the computation
circuit, the activation circuit to perform an activation function
operation on the bias result.
10. (canceled)
11. The method of claim 7, wherein each secondary operation module
includes a second operation circuit, wherein the second operation
circuit includes a vector multiplication circuit and an
accumulation circuit, and the controlling, by the computation circuit,
the secondary operation modules to perform a convolution operation
of input data and a convolution kernel in a convolutional neural
network algorithm includes: controlling, by the computation
circuit, the vector multiplication circuit to perform a vector
multiplication operation of the convolution operation, and
controlling, by the computation circuit, the accumulation circuit
to perform an accumulation operation of the convolution
operation.
12. (canceled)
13. (canceled)
14. The method of claim 7, wherein the computation circuit controls
the plurality of secondary operation modules to compute respective
output scalars in parallel by using the same input data and
respective convolution kernels.
15. A computation circuit, comprising a communication circuit and
operation circuit, wherein the communication circuit is configured
to obtain a first image to be processed, wherein the first image
has a resolution of a first-level size; the operation circuit is
configured to obtain and call an operation instruction to perform
resolution optimization on the first image to obtain a second
image, wherein the second image has a resolution of a second-level
size, the first-level size is smaller than the second-level size,
and the operation instruction is a preset instruction for
optimizing an image resolution.
16. The computation circuit of claim 15, wherein the communication
circuit is configured to obtain an original image to be processed
input by a user, wherein the original image has a resolution of the
first-level size, and the operation circuit is configured to
pre-process the original image to obtain the first image to be
processed, wherein the pre-processing is an operation preset by a
user side or a terminal side.
17. The computation circuit of claim 15, further comprising a
register circuit and a controller circuit, wherein the controller
circuit is configured to fetch an operation instruction from the
register circuit, and send the operation instruction to the
operation circuit, the operation circuit is configured to call the
operation instruction to perform feature extraction on the first
image to obtain a feature image, and the operation circuit is
configured to pre-process the feature image to obtain the second
image, wherein the pre-processing includes one or more of the
following processing: translation, scaling transformation,
non-linear transformation, normalization, format conversion, data
deduplication, processing of data exception, and data missing
filling.
18. The computation circuit of claim 17, wherein the operation
circuit is configured to perform feature extraction on the first
image based on an operation instruction set of at least one thread
to obtain a feature image, wherein the operation instruction set
includes at least one operation instruction, and an order of
calling the operation instruction in the operation instruction set
is customized by a user side or a terminal side.
19. The computation circuit of claim 15, wherein the operation
circuit includes a primary operation module and a plurality of
secondary operation modules, wherein the primary operation module
is interconnected with the plurality of secondary operation modules
by an interconnection module, and when the operation instruction is
a convolution operation instruction, the secondary operation
modules are configured to implement a convolution operation of
input data and a convolution kernel in a convolutional neural
network algorithm, wherein the input data is the first image and
the convolutional neural network algorithm corresponds to the
convolution operation instruction, the interconnection module is
configured to implement data transfer between the primary operation
module and the secondary operation modules, before a forward
operation of a neural network fully connected layer starts, the
primary operation module sends the input data to each secondary
operation module through the interconnection module, and after the
computation of the secondary operation modules is completed, the
interconnection module splices output scalars of the respective
secondary operation modules stage by stage into an intermediate
vector and sends the intermediate vector back to the primary
operation module, and the primary operation module is configured to
splice intermediate vectors corresponding to all input data into an
intermediate result, and perform subsequent operations on the
intermediate result, wherein the primary operation module is
configured to add bias data to the intermediate result, and then
perform an activation operation.
20. (canceled)
21. (canceled)
22. (canceled)
23. The computation circuit of claim 19, wherein the plurality of
secondary operation modules use the same input data and respective
convolution kernels to compute respective output scalars in
parallel.
24. (canceled)
25. (canceled)
26. (canceled)
Description
TECHNICAL FIELD
[0001] The present disclosure relates to the technical field of
information technology, and particularly to an information
processing method and related products.
BACKGROUND
[0002] With the growing information technology and people's
ever-increasing demand, the need for timeliness of information
becomes stronger. At present, a terminal obtains and processes
information based on a general-purpose processor, such as
implementing a super-resolution image in a general-purpose
processor, which means improving the resolution of an image.
[0003] However, in practical applications, this method of obtaining
information based on a general-purpose processor may be limited by
the operation speed of the general-purpose processor. In
particular, when the load of a general-purpose processor is large,
the method may lead to low efficiency and high latency of
information processing.
SUMMARY
[0004] Examples of the present disclosure provide an information
processing method and related products, which can increase the
processing speed and efficiency of a computation device.
[0005] In a first aspect, an example of the present disclosure
provides an information processing method which is applied to a
computation device, where the computation device includes a
communication unit and an operation unit. The method includes:
[0006] controlling, by the computation device, the communication
unit to obtain a first image to be processed, where the first image
has a resolution of a first-level size;
[0007] controlling, by the computation device, the operation unit
to obtain and call an operation instruction to perform resolution
optimization on the first image to obtain a second image, where
[0008] the second image has a resolution of a second-level size,
the first-level size is smaller than the second-level size, and the
operation instruction is a preset instruction for optimizing an
image resolution.
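The input/output relationship of the first aspect can be sketched as follows. This is only an illustration under stated assumptions: the function name upscale_nearest and the nearest-neighbor replication are placeholders for the preset operation instruction and are not the resolution optimization described in the disclosure. The sketch shows only that the second image has a larger size than the first image.

```python
def upscale_nearest(image, factor):
    """Return an image whose width and height are `factor` times larger.

    `image` is a list of rows, each row a list of pixel values.
    Nearest-neighbor replication is a placeholder (an assumption) for the
    preset operation instruction; it is not the method of the disclosure.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for row in image:
        # Repeat every pixel `factor` times horizontally.
        wide = [px for px in row for _ in range(factor)]
        # Repeat the widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


first_image = [[1, 2], [3, 4]]                   # first-level size: 2 x 2
second_image = upscale_nearest(first_image, 2)   # second-level size: 4 x 4
```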
[0009] In some possible examples, the controlling, by the
computation device, the communication unit to obtain a first image
to be processed includes:
[0010] controlling, by the computation device, the communication
unit to obtain an original image to be processed input by a user,
where the original image has a resolution of the first-level size,
and
[0011] controlling, by the computation device, the operation unit
to pre-process the original image to obtain the first image to be
processed, where the pre-processing is an operation preset by a
user side or a terminal side.
[0012] In some possible examples, the computation device further
includes a register unit and a controller unit, and the
controlling, by the computation device, the operation unit to
obtain and call an operation instruction to perform resolution
optimization on the first image, so as to obtain the second image
includes:
[0013] controlling, by the computation device, the controller unit
to fetch an operation instruction from the register unit, and
sending, by the computation device, the operation instruction to
the operation unit;
[0014] controlling, by the computation device, the operation unit
to call the operation instruction to perform feature extraction on
the first image to obtain a feature image, and
[0015] controlling, by the computation device, the operation unit
to pre-process the feature image to obtain the second image, where
the pre-processing is an operation preset by a user side or a
terminal side.
[0016] In some possible examples, the pre-processing includes one
or more of the following processing manners: translation, scaling
transformation, non-linear transformation, normalization, format
conversion, data deduplication, processing of data exception, and
data missing filling.
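Two of the listed pre-processing manners, data missing filling and normalization, can be sketched as follows. The function names, the use of None to mark missing data, and the choice of min-max normalization are assumptions made for illustration; the disclosure does not specify how each manner is carried out.

```python
def fill_missing(values, fill):
    """Replace None entries (treated here as missing data) with `fill`."""
    return [fill if v is None else v for v in values]


def normalize(values):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


pixels = [0, None, 128, 255]
cleaned = fill_missing(pixels, fill=0)   # data missing filling
scaled = normalize(cleaned)              # normalization
```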
[0017] In some possible examples, the calling the operation
instruction to perform feature extraction on the first image to
obtain a feature image includes:
[0018] controlling, by the computation device, the operation unit
to perform feature extraction on the first image based on an
operation instruction set of at least one thread to obtain a
feature image, where the operation instruction set includes at
least one operation instruction, and an order of calling the
operation instruction in the operation instruction set is
customized by a user side or a terminal side.
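The effect of a customized calling order can be sketched for a single thread as follows. The names run_instruction_set, double, and add_one are hypothetical stand-ins for operation instructions; the sketch only shows that the same instruction set yields different results when the order of calling is changed.

```python
def run_instruction_set(data, instruction_set):
    """Apply each operation in `instruction_set` to `data`, in list order."""
    result = data
    for op in instruction_set:
        result = op(result)
    return result


def double(values):
    """Hypothetical operation instruction: scale every element by 2."""
    return [2 * v for v in values]


def add_one(values):
    """Hypothetical operation instruction: add 1 to every element."""
    return [v + 1 for v in values]


order_a = run_instruction_set([1, 2, 3], [double, add_one])
order_b = run_instruction_set([1, 2, 3], [add_one, double])
```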
[0019] In some possible examples, the computation device further
includes a data access unit and a storage medium,
[0020] the computation device controls the operation unit to send
the second image to the data access unit and store the second image
in the storage medium.
[0021] In some possible examples, the operation unit includes a
primary operation module and a plurality of secondary operation
modules, where the primary operation module is interconnected with
the plurality of secondary operation modules by an interconnection
module, and when the operation instruction is a convolution
operation instruction,
[0022] the calling the operation instruction to perform resolution
optimization on the first image includes:
[0023] controlling, by the computation device, the secondary
operation modules to implement a convolution operation of input
data and a convolution kernel in a convolutional neural network
algorithm, where the input data is the first image and the
convolutional neural network algorithm corresponds to the
convolution operation instruction,
[0024] controlling, by the computation device, the interconnection
module to implement data transfer between the primary operation
module and the secondary operation modules; before a forward
operation of a neural network fully connected layer starts,
transferring, by the primary operation module, the input data to
each secondary operation module through the interconnection module;
and after the computation of the secondary operation modules is
completed, splicing, by the interconnection module, output scalars
of the respective secondary operation modules stage by stage to
obtain an intermediate vector, and sending the intermediate vector
back to the primary operation module; and
[0025] controlling, by the computation device, the primary
operation module to splice intermediate vectors corresponding to
all input data into an intermediate result, and performing
subsequent operations on the intermediate result.
[0026] In some possible examples, the performing subsequent
operations on the intermediate result includes:
[0027] controlling, by the computation device, the primary
operation module to add bias data to the intermediate result, and
then performing an activation operation.
[0028] In some possible examples, the primary operation module
includes a first operation unit, where the first operation unit
includes a vector addition unit and an activation unit,
[0029] the step of controlling, by the computation device, the
primary operation module to add bias data to the intermediate
result, and then performing an activation operation includes:
[0030] controlling, by the computation device, the vector addition
unit to implement a bias addition operation of a convolutional
neural network operation and perform element-wise addition on bias
data and the intermediate result to obtain a bias result; and
[0031] controlling, by the computation device, the activation unit
to perform an activation function operation on the bias result.
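The division of work among the secondary operation modules, the interconnection module, and the primary operation module described above can be sketched in software as follows. This is a minimal sketch under assumptions: each piece of input data is a flattened window, every function name is hypothetical, relu is chosen arbitrarily as the activation function, and the parallel hardware execution is modeled by an ordinary loop.

```python
def secondary_module(window, kernel):
    """One secondary module: multiply a window by its own kernel and accumulate."""
    return sum(x * w for x, w in zip(window, kernel))


def interconnect_splice(scalars):
    """Interconnection module: splice the output scalars into one vector."""
    return list(scalars)


def relu(x):
    return max(0.0, x)


def primary_module(windows, kernels, bias, activation):
    """Primary module: broadcast each window, collect the intermediate
    vectors, then add bias element-wise and apply the activation."""
    intermediate_result = []
    for window in windows:
        # Every secondary module sees the same window but its own kernel.
        scalars = [secondary_module(window, k) for k in kernels]
        intermediate_result.append(interconnect_splice(scalars))
    return [
        [activation(v + b) for v, b in zip(vector, bias)]
        for vector in intermediate_result
    ]


windows = [[1.0, 2.0], [3.0, 4.0]]     # two pieces of input data
kernels = [[1.0, 0.0], [0.0, -1.0]]    # one kernel per secondary module
bias = [0.5, 0.5]
output = primary_module(windows, kernels, bias, relu)
```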
[0032] In some possible examples, the primary operation module
includes a first operation unit, a first data dependency
determination unit, and a first storage unit; and the above method
includes:
[0033] controlling, by the computation device, the first storage
unit to cache input data and output data used by the primary
operation module during a computation process, where the output
data includes the second image;
[0034] controlling, by the computation device, the first operation
unit to perform various operational functions of the primary
operation module;
[0035] controlling, by the computation device, the first data
dependency determination unit to ensure that there is no consistency
conflict in reading data from and writing data to the first storage
unit, to read an input neuron vector from the first storage unit and
send the vector to the secondary operation modules through the
interconnection module, and to send an intermediate result vector
from the interconnection module to the first operation unit.
[0036] In some possible examples, each secondary operation module
includes a second operation unit, where the second operation unit
includes a vector multiplication unit and an accumulation unit,
[0037] the controlling, by the computation device, the secondary
operation modules to perform a convolution operation of input data
and a convolution kernel in a convolutional neural network
algorithm includes:
[0038] controlling, by the computation device, the vector
multiplication unit to perform a vector multiplication operation of
the convolution operation, and
[0039] controlling, by the computation device, the accumulation
unit to perform an accumulation operation of the convolution
operation.
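The split of one convolution output into a vector multiplication followed by an accumulation can be sketched as follows. The function names are hypothetical, and the sketch assumes the input window and the convolution kernel have already been flattened to equal-length lists.

```python
def vector_multiply(a, b):
    """Vector multiplication unit: element-wise products of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def accumulate(products):
    """Accumulation unit: running sum of the products."""
    total = 0.0
    for p in products:
        total += p
    return total


window = [1.0, 2.0, 3.0]
kernel = [0.5, 0.0, -1.0]
output_scalar = accumulate(vector_multiply(window, kernel))
```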
[0040] In some possible examples, each secondary operation module
includes a second operation unit, a second data dependency
determination unit, a second storage unit, and a third storage
unit; and the method includes:
[0041] controlling, by the computation device, the second operation
unit to perform various arithmetic and logical operations of the
secondary operation module,
[0042] controlling, by the computation device, the second data
dependency determination unit to perform a reading/writing
operation on the second storage unit and the third storage unit
during a computation process and ensure that there is no
consistency conflict between the reading and writing operations on
the second storage unit and the third storage unit,
[0043] controlling, by the computation device, the second storage
unit to cache input data and an output scalar obtained from the
computation performed by the secondary operation module, and
[0044] controlling, by the computation device, the third storage
unit to cache a convolution kernel required by the secondary
operation module during a computation process.
[0045] In some possible examples, the first data dependency
determination unit or the second data dependency determination unit
ensures that there is no consistency conflict in reading and writing
in the following manners: storage addresses corresponding to
data/instructions stored in the corresponding storage unit do not
overlap; or the unit determines whether there is a dependency
between a control signal that has not been executed and data of a
control signal that is being executed. If there is no dependency,
the control signal is allowed to be issued immediately; otherwise,
the control signal is not allowed to be issued until all control
signals on which it depends have been executed, where
[0046] the computation device controls the controller unit to
obtain an operation instruction from the register unit and decode
the operation instruction into the control signal for controlling
behavior of other modules, where the other modules include the
primary operation module and the plurality of secondary operation
modules.
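The issue rule above can be sketched as an address-range check. This is an illustration only: the ControlSignal fields and the functions overlaps and can_issue are hypothetical, and the sketch assumes a dependency exists exactly when the storage address ranges of two control signals overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignal:
    name: str
    start: int    # first storage address touched (hypothetical field)
    length: int   # number of consecutive addresses touched


def overlaps(a, b):
    """True when the two half-open address ranges share an address."""
    return a.start < b.start + b.length and b.start < a.start + a.length


def can_issue(pending, in_flight):
    """A pending signal may issue only if it depends on no executing signal."""
    return not any(overlaps(pending, s) for s in in_flight)


executing = [ControlSignal("write_out", start=0, length=16)]
independent = can_issue(ControlSignal("read_in", start=16, length=8), executing)
dependent = can_issue(ControlSignal("read_out", start=8, length=4), executing)
```

Here the first pending signal touches addresses 16 to 23 and may be issued immediately, while the second touches addresses 8 to 11 and has to wait until the executing signal completes.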
[0047] In some possible examples, the computation device controls
the plurality of secondary operation modules to compute respective
output scalars in parallel by using the same input data and
respective convolution kernels.
[0048] In some possible examples, an activation function active
used by the primary operation module may be any of the following
non-linear functions: sigmoid, tanh, relu, softmax, or may be a
linear function.
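The four non-linear functions named above have standard definitions, which can be written as follows. The softmax version subtracts the maximum before exponentiation for numerical stability; that detail is an implementation choice of this sketch and is not taken from the disclosure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def softmax(values):
    """Map a vector to positive values that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```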
[0049] In some possible examples, the interconnection module forms
a data channel for continuous or discrete data between the primary
operation module and the plurality of secondary operation modules.
The interconnection module has any of the following structures: a
tree structure, a ring structure, a grid structure, a hierarchical
interconnection, and a bus structure.
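For the tree structure, the stage-by-stage splicing of output scalars into an intermediate vector can be sketched as pairwise merging. The function tree_splice and the choice of a binary tree are assumptions made for illustration; the sketch also returns the number of stages so the logarithmic depth is visible.

```python
def tree_splice(scalars):
    """Merge per-module outputs pairwise, stage by stage, into one vector.

    Returns the spliced vector and the number of merge stages used.
    """
    level = [[s] for s in scalars]   # one leaf per secondary module
    stages = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]    # one or two neighboring partial vectors
            merged.append([v for part in pair for v in part])
        level = merged
        stages += 1
    return (level[0] if level else []), stages


vector, stages = tree_splice([10, 20, 30, 40])
```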
[0050] In a second aspect, an example of the present disclosure
provides a computation device which includes a function unit
configured to perform the method of the first aspect.
[0051] In a third aspect, an example of the present disclosure
provides a computer readable storage medium on which a computer
program used for electronic data exchange is stored, where the
computer program enables a computer to perform the method of the
first aspect.
[0052] In a fourth aspect, an example of the present disclosure
further provides a computer program product which includes a
non-transitory computer readable storage medium on which a computer
program is stored. The computer program may cause a computer to
perform the method of the first aspect.
[0053] In a fifth aspect, an example of the present disclosure
provides a chip which includes the computation device of the second
aspect.
[0054] In a sixth aspect, an example of the present disclosure
provides a chip package structure which includes the chip of the
fifth aspect.
[0055] In a seventh aspect, an example of the present disclosure
provides a board card which includes the chip package structure of
the sixth aspect.
[0056] In an eighth aspect, an example of the present disclosure
provides an electronic device which includes the board card of the
seventh aspect.
[0057] In some examples, the electronic device includes a data
processing device, a robot, a computer, a printer, a scanner, a
tablet, a smart terminal, a mobile phone, a traffic recorder, a
navigator, a sensor, a webcam, a server, a cloud-based server, a
camera, a video camera, a projector, a watch, a headphone, a mobile
storage, a wearable device, a vehicle, a household appliance,
and/or medical equipment.
[0058] In some examples, the vehicle includes an airplane, a ship,
and/or a car. The household appliance includes a
television, an air conditioner, a microwave oven, a refrigerator, a
rice cooker, a humidifier, a washing machine, an electric lamp, a
gas cooker, and a range hood. The medical equipment includes a
nuclear magnetic resonance spectrometer, a B-ultrasonic scanner,
and/or an electrocardiograph.
[0059] Technical effects of implementing the examples of the
present disclosure are as follows:
[0060] It can be seen that through the examples of the present
disclosure, the computation device may control a communication unit
to obtain a first image to be processed, where the first image has
a resolution of a first-level size; and then the computation device
may control an operation unit to call an operation instruction to
perform resolution optimization to obtain a second image, where the
second image has a resolution of a second-level size, the
first-level size is smaller than the second-level size, and the
operation instruction is a preset instruction for optimizing the
resolution of an image. In this way, the resolution of an image can
be improved. Compared with the prior art, which uses a
general-purpose processor to improve resolution, the present
disclosure has the technical effects of lower power consumption and
faster speed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0061] In order to illustrate the technical solutions in the
examples of the present disclosure more clearly, the drawings to be
used in the description of the examples are briefly explained
below. Obviously, the drawings in the description below are some
examples of the present disclosure. Other drawings can be obtained
according to the disclosed drawings without any creative effort by
those skilled in the art.
[0062] FIG. 1A is a structural diagram of a computation device
according to an example of the present disclosure.
[0063] FIG. 1B is a schematic flowchart of a convolutional neural
network algorithm.
[0064] FIG. 1C is a schematic diagram of an instruction of a device
supporting a convolutional neural network forward operation
according to an example of the present disclosure.
[0065] FIG. 1D is a block diagram of an overall structure of a
device for performing a convolutional neural network forward
operation according to an example of the present disclosure.
[0066] FIG. 1E is a structural diagram of an H-tree module (an
implementation of an interconnection module) of a device for
performing a convolutional neural network forward operation
according to an example of the present disclosure.
[0067] FIG. 1F is a block diagram of a structure of a primary
operation module of a device for performing a convolutional neural
network forward operation according to an example of the present
disclosure.
[0068] FIG. 1G is a block diagram of a structure of a secondary
operation module of a device for performing a convolutional neural
network forward operation according to an example of the present
disclosure.
[0069] FIG. 1H is a block diagram of a process of a single-layer
convolutional neural network forward operation according to an
example of the present disclosure.
[0070] FIG. 2 is a flowchart of an information processing method
according to an example of the present disclosure.
[0071] FIG. 3 is a schematic diagram of calling an operation
instruction based on single-thread according to an example of the
present disclosure.
[0072] FIG. 4 is a schematic diagram of calling an operation
instruction based on multiple threads according to an example of
the present disclosure.
[0073] FIG. 5 is a structural diagram of another computation device
according to an example of the present disclosure.
DETAILED DESCRIPTION OF THE EXAMPLES
[0074] Technical solutions in examples of the present disclosure
will be described clearly and completely hereinafter with reference
to the accompanying drawings in the examples of the present
disclosure. Obviously, the examples to be described are merely some
rather than all examples of the present disclosure. All other
examples obtained by those of ordinary skill in the art based on
the examples of the present disclosure without creative efforts
shall fall within the protection scope of the present
disclosure.
[0075] Terms such as "first", "second", "third", and "fourth" in
the specification, the claims, and the drawings are used for
distinguishing different objects rather than describing a specific
order. In addition, terms such as "include", "have", and any
variant thereof are used for indicating non-exclusive inclusion.
For instance, a process, a method, a system, a product, or a piece
of equipment including a series of steps or units is not limited to
the listed steps or units, but optionally includes steps or units
that are not listed, or optionally includes other steps or units
inherent to the process, the method, the product, or the
equipment.
[0076] Reference to "example" means that a particular feature, a
structure, or a characteristic described in conjunction with the
example may be included in at least one example of the present
disclosure. The term used in various places in the specification
does not necessarily refer to the same example, nor does it refer
to an example that is mutually exclusive, independent, or
alternative to other examples. It can be explicitly and implicitly
understood by those skilled in the art that the examples described
herein may be combined with other examples.
[0077] First, a computation device used in the present disclosure
is introduced. FIG. 1A provides a computation device, where the
device includes a storage medium 611 (optional), a register unit
612, an interconnection module 613, an operation unit 614, a
controller unit 615, and a data access unit 616, where
[0078] the operation unit 614 includes at least two of the
following: an addition arithmetic unit, a multiplication arithmetic
unit, a comparator, and an activation arithmetic unit.
[0079] The interconnection module 613 is configured to control a
connection relationship of the arithmetic units in the operation
unit 614 so that the at least two arithmetic units form different
computation topologies.
[0080] The instruction storage unit (which may be a register unit,
an instruction cache, or a scratchpad memory) 612 is configured to
store the operation instruction, an address of a data block in the
storage medium, and a computation topology corresponding to the
operation instruction.
[0081] The operation instruction may include an operation field and
an opcode. Taking a convolution operation instruction as an
example, as shown in Table 1, register 0, register 1, register 2,
register 3, and register 4 may be operation fields. Each of the
register 0, register 1, register 2, register 3, and register 4 may
be one or a plurality of registers.
TABLE-US-00001
Opcode | Register 0 | Register 1 | Register 2 | Register 3 | Register 4
COMPUTE | Input data starting address | Input data length | Convolution kernel starting address | Convolution kernel length | Address of an activation function interpolation table
IO | Address of an external memory of data | Data length | Address of an internal memory of data | |
NOP | | | | |
JUMP | Target address | | | |
MOVE | Input address | Data size | Output address | |
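The opcode and operation-field layout of Table 1 can be sketched as a small data structure. This is only an illustration: the field names such as `input_addr` and `act_table_addr`, the `Instruction` class, and the `decode` helper are hypothetical labels chosen here, not identifiers from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical field names per opcode, in the register order of Table 1.
FIELD_NAMES: Dict[str, Tuple[str, ...]] = {
    "COMPUTE": ("input_addr", "input_len", "kernel_addr",
                "kernel_len", "act_table_addr"),
    "IO": ("external_addr", "data_len", "internal_addr"),
    "NOP": (),
    "JUMP": ("target_addr",),
    "MOVE": ("input_addr", "data_size", "output_addr"),
}


@dataclass(frozen=True)
class Instruction:
    opcode: str
    fields: Tuple[int, ...] = ()  # register 0, register 1, ... in order


def decode(instr: Instruction) -> Dict[str, int]:
    """Pair each operation field with its assumed meaning for the opcode."""
    names = FIELD_NAMES[instr.opcode]
    if len(instr.fields) != len(names):
        raise ValueError(
            f"{instr.opcode} expects {len(names)} operation fields")
    return dict(zip(names, instr.fields))


move = decode(Instruction("MOVE", (16, 4, 64)))
```

Here `move` maps the three MOVE fields to their Table 1 meanings, and a NOP decodes to an empty mapping because it carries no operation fields.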
[0082] The storage medium 611 may be an off-chip memory, and in
certain applications, may also be an on-chip memory for storing a
data block. The data block may be n-dimensional data, where n is an
integer greater than or equal to 1. For instance, when n=1, the
data is one-dimensional data, which is a vector; when n=2, the data
is two-dimensional data, which is a matrix; and when n is equal to
or greater than 3, the data is multi-dimensional data.
[0083] The controller unit 615 is configured to fetch an operation
instruction, an operation field corresponding to the operation
instruction, and a first computation topology corresponding to the
operation instruction from the register unit 612, and decode the
operation instruction into an execution instruction. The execution
instruction is configured to control the operation unit to perform
an operation, transfer the operation field to the data access unit
616, and transfer the computation topology to the interconnection
module 613.
[0084] The data access unit 616 is configured to fetch a data block
corresponding to the operation field from the storage medium 611
and transfer the data block to the interconnection module 613.
[0085] The interconnection module 613 is configured to receive the
first computation topology and the data block. In an example, the
interconnection module 613 is further configured to rearrange the
data block according to the first computation topology.
[0086] The operation unit 614 is configured to call an arithmetic
unit of the operation unit 614 according to the execution
instruction to perform an operation on the data block to obtain an
operation result, transfer the operation result to the data access
unit, and store the result in the storage medium. In an example,
the operation unit 614 is configured to call an arithmetic unit
according to the first computation topology and the execution
instruction to perform an operation on the rearranged data block to
obtain an operation result, transfer the operation result to the
data access unit, and store the result in the storage medium.
[0087] In another example, the interconnection module 613 is
configured to form the first computation topology according to the
connection relationships of the arithmetic units in the operation
unit 614.
[0088] An interconnection module is provided in the computation device
of the present disclosure. The interconnection module can
connect the arithmetic units in the operation unit to obtain a
computation topology corresponding to the computation instruction
according to the needs of the computation instruction, so that
there is no need to store or fetch intermediate data of the
computation in subsequent operations of the operation unit. Through
this structure, a single instruction can implement a single input
and perform operations of a plurality of arithmetic units to obtain
a computation result, which improves the computation
efficiency.
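The idea of connecting arithmetic units into a computation topology, so that each unit feeds the next without intermediate results being stored, can be sketched in software as function composition. The unit names, the `connect` helper, and the choice of sigmoid as the activation are assumptions made for illustration, not details of the disclosed hardware.

```python
import math


def multiply(operands):
    """Multiplication unit: element-wise products of weights and inputs."""
    weights, inputs = operands
    return [w * x for w, x in zip(weights, inputs)]


def add(products):
    """Addition unit: sums the products."""
    return sum(products)


def activate(value):
    """Activation unit: sigmoid, chosen here only as an example."""
    return 1.0 / (1.0 + math.exp(-value))


ARITHMETIC_UNITS = {"multiply": multiply, "add": add, "activate": activate}


def connect(topology):
    """Chain the named units so each output feeds the next unit directly."""
    stages = [ARITHMETIC_UNITS[name] for name in topology]

    def run(data):
        for stage in stages:
            data = stage(data)  # no intermediate result is written to memory
        return data

    return run


pipeline = connect(["multiply", "add", "activate"])
result = pipeline(([0.5, -1.0], [2.0, 1.0]))  # products 1.0 and -1.0, sum 0.0
```

A different instruction would pass a different list of unit names to `connect`, which mirrors the interconnection module selecting a topology per instruction.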
[0089] A computation method of the computation device shown in FIG.
1A is explained below based on different operation instructions. As
an instance, the operation instruction may be a convolution
operation instruction. The convolution operation instruction can be
applied to a neural network, so the convolution operation
instruction may also be called a convolutional neural network
operation instruction. A formula to be performed by the convolution
operation instruction may be: $s = s\left(\sum w x_i + b\right)$, which is to
multiply a convolution kernel w by input data $x_i$, find the
sum, add a bias b, and then perform an activation operation s(h) to
obtain a final output result s. According to the formula, the
computation topology may be obtained, which is: the multiplication
arithmetic unit--the addition arithmetic unit--the (optional)
activation arithmetic unit.
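As a worked illustration of the formula, assume a two-element kernel $w = (1, 2)$, input data $x = (3, 4)$, a bias $b = -1$, and ReLU as the activation. These values and the choice of ReLU are hypothetical, and the kernel elements are indexed as $w_i$ here for clarity:

```latex
\begin{aligned}
\sum_i w_i x_i &= 1 \cdot 3 + 2 \cdot 4 = 11, \\
h &= 11 + b = 11 + (-1) = 10, \\
s(h) &= \max(0, 10) = 10.
\end{aligned}
```

The three lines correspond to the multiplication unit followed by the addition unit, the bias addition, and the activation unit.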
[0090] A method of performing a convolution operation instruction
by the computation device shown in FIG. 1A may include:
[0091] fetching, by the controller unit 615, a convolution operation
instruction, an operation field corresponding to the convolution
operation instruction, and the first computation topology (the
multiplication arithmetic unit--the addition arithmetic unit--the
addition arithmetic unit--the activation arithmetic unit)
corresponding to the convolution operation instruction from the
register unit 612; transferring, by the controller unit, the operation
field to the data access unit, and transferring the first computation
topology to the interconnection module;
[0092] fetching, by the data access unit, a convolution kernel w
and a bias b (if b is 0, there is no need to fetch the bias b)
corresponding to the operation field from the storage medium, and
transferring the convolution kernel w and the bias b to the
operation unit; and
[0093] multiplying, by the multiplication arithmetic unit of the
operation unit, the convolution kernel w and input data $x_i$ to
obtain a first result, inputting the first result to the addition
arithmetic unit to perform addition to obtain a second result,
adding the second result and the bias b to obtain a third result,
inputting the third result to the activation arithmetic unit to
perform an activation operation to obtain an output result S,
transferring the output result S to the data access unit, and
storing, by the data access unit, the output result in the storage
medium. After each step, the result may be transferred to the data
access unit and stored in the storage medium without performing the
following step. The step of adding the second result and the bias b
to obtain the third result is optional, which means this step is not
required when b is 0.
[0094] In addition, the order of addition and multiplication can be
reversed.
[0095] The technical solution provided by the present disclosure
can realize convolution operations according to one instruction
which is a convolution operation instruction. There is no need to
store or obtain intermediate data of convolution operations (such
as a first result, a second result, and a third result). The
technical solution may reduce the storing and fetching
of intermediate data, and may thereby reduce the
corresponding operation steps and improve the performance of
convolution operations.
[0096] It should be understood that the instruction set used in the
present disclosure may include one or a plurality of operation
instructions. The operation instruction includes, but is not
limited to a COMPUTE instruction (an operation instruction), a
CONFIG instruction, an IO instruction, an NOP instruction, a JUMP
instruction, a MOVE instruction, etc. The COMPUTE instruction
includes, but is not limited to, a convolution (CONV) instruction,
a pooling operation instruction, etc. Specifically, an executable
computation instruction in the present disclosure includes:
[0097] a convolution operation instruction. In an example, the
convolution COMPUTE instruction (the CONV instruction)
includes:
[0098] a convolutional neural network sigmoid instruction:
according to the instruction, a device fetches input data and a
convolution kernel of a specified size from a specified address in
a memory (optionally a scratchpad memory or a scalar register
file), performs a convolution operation in a convolution operation
component, and optionally, performs sigmoid activation on an output
result;
[0099] a convolutional neural network TanH instruction: according
to the instruction, the device fetches input data and a convolution
kernel of a specified size from a specified address in a memory
(optionally a scratchpad memory) respectively, performs a
convolution operation in the convolution operation component, and
then performs TanH activation on an output result;
[0100] a convolutional neural network ReLU instruction: according
to the instruction, the device fetches input data and a convolution
kernel of a specified size from a specified address in the memory
(optionally a scratchpad memory) respectively, performs a
convolution operation in a convolution operation component, and
then performs ReLU activation on an output result; and
[0101] a convolutional neural network group instruction: according
to the instruction, the device fetches input data and a convolution
kernel of a specified size from a specified address in the memory
(optionally a scratchpad memory) respectively, partitions the input
data and the convolution kernel into groups, performs a convolution
operation in a convolution operation component, and then performs
activation on an output result.
[0102] A convolution operation instruction (pure convolution
operation instruction): according to the instruction, the device
fetches input data and a convolution kernel of a specified size
from a specified address in the memory (optionally a scratchpad
memory) respectively, and performs a convolution operation in a
convolution operation component. The above-mentioned specified size
may be set by the user or manufacturer. For instance, in a
computation device of a first manufacturer, the specified size may
be set to A bits of data, and in a computation device of a second
manufacturer, the specified size may be set to B bits of data, where
A and B are different.
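The convolution instruction variants above differ mainly in the activation applied after the convolution. A minimal one-dimensional sketch follows, assuming a flat list stands in for the memory and that the mnemonics `CONV`, `CONV_SIGMOID`, `CONV_TANH`, and `CONV_RELU` are hypothetical names; the group variant is omitted.

```python
import math

# Hypothetical mnemonics mapped to the activation each variant applies.
ACTIVATIONS = {
    "CONV": lambda v: v,  # pure convolution, no activation
    "CONV_SIGMOID": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "CONV_TANH": math.tanh,
    "CONV_RELU": lambda v: max(0.0, v),
}


def execute_conv(mnemonic, memory, input_addr, input_len,
                 kernel_addr, kernel_len):
    """Fetch input data and kernel by address and length, then slide the
    kernel over the input and apply the activation to each window sum."""
    data = memory[input_addr:input_addr + input_len]
    kernel = memory[kernel_addr:kernel_addr + kernel_len]
    activation = ACTIVATIONS[mnemonic]
    outputs = []
    for start in range(input_len - kernel_len + 1):
        window = data[start:start + kernel_len]
        outputs.append(activation(sum(k * x for k, x in zip(kernel, window))))
    return outputs


# Input data at addresses 0..3, kernel at addresses 4..5.
memory = [1.0, 2.0, 3.0, 4.0, 1.0, -1.0]
plain = execute_conv("CONV", memory, 0, 4, 4, 2)
rectified = execute_conv("CONV_RELU", memory, 0, 4, 4, 2)
```

Every window in this example sums to -1.0, so the pure variant returns negative values while the ReLU variant clamps them to zero.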
[0103] The pooling instruction. In an example, the pooling COMPUTE
instruction (the pooling operation instruction, which is also
referred to as the pooling instruction in the present disclosure)
specifically includes:
[0104] a Maxpooling forward operation instruction: according to the
instruction, the device fetches input data of a specified size from
a specified address in a memory (optionally a scratchpad memory or
a scalar register file), performs a Maxpooling forward operation in
a pooling operation component, and writes a result back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file);
[0105] a Maxpooling backward training instruction: according to the
instruction, the device fetches input data of a specified size from
a specified address in a memory (optionally a scratchpad memory or
a scalar register file), performs Maxpooling backward training in a
pooling operation component, and writes a result back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file);
[0106] an Avgpooling forward operation instruction: according to
the instruction, the device fetches input data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs an Avgpooling forward
operation in a pooling operation component, and writes a result
back to a specified address in the memory (optionally a scratchpad
memory or a scalar register file);
[0107] an Avgpooling backward training instruction: according to
the instruction, the device fetches input data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs Avgpooling backward
training in a pooling operation component, and writes a result back
to a specified address in the memory (optionally a scratchpad
memory or a scalar register file);
[0108] a Minpooling forward operation instruction: according to the
instruction, the device fetches input data of a specified size from
a specified address in a memory (optionally a scratchpad memory or
a scalar register file), performs a Minpooling forward operation in
a pooling operation component, and writes a result back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file); and
[0109] a Minpooling backward training instruction: according to the
instruction, the device fetches input data of a specified size from
a specified address in a memory (optionally a scratchpad memory or
a scalar register file), performs Minpooling backward training in a
pooling operation component, and writes a result back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file).
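The three forward pooling operations can be sketched with one helper that reduces each window of the input. The sketch assumes one-dimensional data and non-overlapping windows (stride equal to the window size); neither assumption comes from the disclosure, and the backward training instructions are not covered.

```python
def pool_forward(values, window, mode):
    """Reduce each non-overlapping window with max, min, or average."""
    reducers = {
        "max": max,
        "min": min,
        "avg": lambda w: sum(w) / len(w),
    }
    reduce_window = reducers[mode]
    last_start = len(values) - window
    return [reduce_window(values[i:i + window])
            for i in range(0, last_start + 1, window)]


data = [1.0, 3.0, 2.0, 6.0]
max_pooled = pool_forward(data, 2, "max")
min_pooled = pool_forward(data, 2, "min")
avg_pooled = pool_forward(data, 2, "avg")
```

In a device, the fetch and write-back steps around this reduction would use the addresses carried in the instruction's operation fields.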
[0110] A batch normalization instruction can be used for a batch
normalization computation.
[0111] A fully connected instruction may include a fully connected
layer forward operation instruction.
[0112] A fully connected layer forward operation instruction:
according to the instruction, a device fetches weight data and bias
data from a specified address in a memory, performs a full
connection operation in a computation unit, and writes a result
back to a specified address in a scratchpad memory.
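A fully connected forward operation computes one weighted sum plus bias per output neuron. The sketch below assumes the weights are stored as one row per output neuron and that the input vector is supplied alongside the weights and bias; the function name is hypothetical.

```python
def fully_connected_forward(weights, inputs, bias):
    """Return, for each weight row, the dot product with inputs plus bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


outputs = fully_connected_forward(
    weights=[[1.0, 2.0], [3.0, 4.0]],
    inputs=[1.0, 1.0],
    bias=[0.5, -0.5],
)
```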
[0113] The CONFIG instruction configures various constants required
by a computation of a current artificial neural network layer
before the computation starts. For instance, 1/kernel_area can be
obtained by configuration using the CONFIG instruction. In the
batch normalization computation, the CONFIG instruction configures
various constants required for a current layer before a batch
normalization computation begins.
[0114] The IO instruction is for reading-in input data required for
a computation from an external storage space, and storing data to
the external space after the computation finishes.
[0115] The NOP instruction is for emptying control signals in all
control signal cache queues in the current device, and ensuring
that all instructions before the NOP instruction are finished. The
NOP instruction itself does not include any operations.
[0116] The JUMP instruction is for controlling jumping of a next
instruction address to be read from an instruction storage unit, so
that the jumping of a control flow can be realized.
[0117] The MOVE instruction is for moving data of an address in an
internal address space of the device to another address in the
internal address space of the device. This process is independent
of an operation unit and does not occupy resources of the operation
unit during execution.
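The control-flow role of the NOP, JUMP, and MOVE instructions can be sketched as a tiny interpreter. This is a simplification under stated assumptions: instructions are tuples, the MOVE fields follow the Table 1 order (input address, data size, output address), NOP is modelled as a plain step because the sketch has no control-signal queues to empty, and `max_steps` is a hypothetical guard against endless jumps.

```python
def run(program, memory, max_steps=100):
    """Execute NOP, JUMP, and MOVE instructions over a list-based memory."""
    pc = 0  # address of the next instruction to read
    steps = 0
    while pc < len(program) and steps < max_steps:
        opcode, *fields = program[pc]
        steps += 1
        if opcode == "NOP":
            pc += 1
        elif opcode == "JUMP":
            (target,) = fields
            pc = target  # redirect the next instruction address
        elif opcode == "MOVE":
            src, size, dst = fields
            # Copy within the internal address space; no arithmetic unit used.
            memory[dst:dst + size] = memory[src:src + size]
            pc += 1
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return memory


program = [
    ("MOVE", 0, 2, 4),  # copy two values from address 0 to address 4
    ("JUMP", 3),        # skip the next MOVE
    ("MOVE", 2, 2, 4),  # never executed
    ("NOP",),
]
final_memory = run(program, [1, 2, 3, 4, 0, 0])
```

Because the JUMP skips the second MOVE, the values copied by the first MOVE remain at addresses 4 and 5.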
[0118] Optionally, operation instructions that can be executed by
the computation device may further include:
[0119] a Matrix Mult Vector (MMV) instruction: according to the
instruction, the device fetches matrix data and vector data of a
set length from a specified address in a scratchpad memory,
performs a matrix-multiply-vector operation in the operation unit,
and writes a computation result back to a specified address in the
scratchpad memory; it is worth noting that a vector can be stored
in the scratchpad memory as a matrix of a special form (a matrix
with only one row of elements);
[0120] a Vector Mult Matrix (VMM) instruction: according to the
instruction, the device fetches vector data and matrix data of a
set length from a specified address in a scratchpad memory,
performs a vector-multiply-matrix operation in the operation unit,
and writes a computation result back to a specified address in the
scratchpad memory; it is worth noting that a vector can be stored
in the scratchpad memory as a matrix of a special form (a matrix
with only one row of elements);
[0121] a Matrix Mult Scalar (VMS) instruction: according to the
instruction, the device fetches matrix data of a set length from a
specified address in a scratchpad memory, fetches scalar data of a
specified size from a specified address of a scalar register file,
performs a scalar-multiply-matrix operation in the operation
unit, and writes a computation result back to a specified address
in the scratchpad memory; it is worth noting that the scalar
register file stores not only an address of the matrix but also
scalar data;
[0122] a Tensor Operation (TENS) instruction: according to the
instruction, the device fetches two pieces of matrix data of a set
length from two specified addresses in a scratchpad memory,
performs a tensor operation on the two pieces of matrix data in the
operation unit, and writes a result back to a specified address of
the scratchpad memory;
[0123] a Matrix Add Matrix (MA) instruction: according to the
instruction, the device fetches two pieces of matrix data of a set
length from two specified addresses in a scratchpad memory, adds
the two pieces of matrix data in the operation unit, and writes a
computation result back to a specified address in the scratchpad
memory;
[0124] a Matrix Sub Matrix (MS) instruction: according to the
instruction, the device fetches two pieces of matrix data of a set
length from two specified addresses in a scratchpad memory,
performs a subtraction operation on the two pieces of matrix data
in the operation unit, and writes a computation result back to a
specified address in the scratchpad memory;
[0125] a Matrix Retrieval (MR) instruction: according to the
instruction, the device fetches vector data of a set length from a
specified address in a scratchpad memory, fetches matrix data of a
specified size from a specified address in the scratchpad memory;
in the operation unit, the vector is an index vector, and an
i.sup.th element of an output vector is a number obtained from an
i.sup.th column of the matrix by using an i.sup.th element of the
index vector as an index; and the output vector is written back to
a specified address in the scratchpad memory;
[0126] a Matrix Load (ML) instruction: according to the
instruction, the device loads data of a set length from a specified
external source address to a specified address in a scratchpad
memory;
[0127] a Matrix Store (MS) instruction: according to the
instruction, the device stores matrix data of a set length from a
specified address in a scratchpad memory to an external target
address;
[0128] a Matrix Move (MMOVE) instruction: according to the
instruction, the device moves matrix data of a set length from a
specified address in a scratchpad memory to another specified
address in the scratchpad memory;
[0129] a Vector-Inner-Product instruction (VP): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs an inner product (a
scalar) on two vectors in a vector computation unit, and writes the
result back; optionally, the result is written back to a specified
address in the memory (optionally a scratchpad memory or a scalar
register file);
[0130] a vector cross product instruction (TENS): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs a cross product
on two vectors in a vector computation unit, and writes the
result back; optionally, the result is written back to a specified
address in the memory (optionally a scratchpad memory or a scalar
register file);
[0131] a vector elementary arithmetic operation including a
Vector-Add-Scalar instruction (VAS): according to the instruction,
the device fetches vector data of a specified size from a specified
address in a memory (optionally a scratchpad memory or a scalar
register file), fetches scalar data from a specified address of a
scalar register file of the memory, adds the scalar to each element
of the vector in a scalar computation unit, and writes the result
back; optionally, the result is written back to a specified address
in the memory (optionally a scratchpad memory or a scalar register
file);
[0132] a Scalar-Sub-Vector instruction (SSV): according to the
instruction, the device fetches scalar data from a specified
address in the scalar register in a memory (optionally a scratchpad
memory or a scalar register file), fetches vector data from a
specified address in the memory (optionally the scratchpad memory
or the scalar register file), subtracts corresponding elements of
the vector from the scalar in a vector computation unit, and writes
the result back; optionally, the result is written back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file);
[0133] a Vector-Dev-Vector instruction (VD): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs an element-wise
division of two vectors in a vector computation unit, and writes
the result back; optionally, the result is written back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file);
[0134] a Scalar-Dev-Vector instruction (SDV): according to the
instruction, the device fetches scalar data from a specified
address in the scalar register file of a memory (optionally a
scratchpad memory or a scalar register file), fetches vector data
of a specified size from a specified address in the memory
(optionally the scratchpad memory), divides the scalar by
corresponding elements in the vector in a vector computation unit,
and writes the result back; optionally, the result is written back
to a specified position in the memory (optionally a scratchpad
memory or a scalar register file).
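The arithmetic performed by several of the matrix and vector instructions above can be sketched on plain Python lists. The function names reuse the instruction abbreviations for readability but are otherwise hypothetical, matrices are assumed to be lists of rows, and the address-based fetch and write-back steps are left out.

```python
def mmv(matrix, vector):
    """Matrix Mult Vector: one dot product per matrix row."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def vmm(vector, matrix):
    """Vector Mult Matrix: one dot product per matrix column."""
    columns = range(len(matrix[0]))
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in columns]


def matrix_retrieval(index_vector, matrix):
    """MR: the i-th output is taken from column i at row index_vector[i]."""
    return [matrix[row_index][i] for i, row_index in enumerate(index_vector)]


def vas(vector, scalar):
    """Vector-Add-Scalar: add the scalar to each element."""
    return [x + scalar for x in vector]


def ssv(scalar, vector):
    """Scalar-Sub-Vector: subtract each element from the scalar."""
    return [scalar - x for x in vector]


def sdv(scalar, vector):
    """Scalar-Dev-Vector: divide the scalar by each element."""
    return [scalar / x for x in vector]


m = [[1, 2], [3, 4]]
mmv_result = mmv(m, [1, 1])
vmm_result = vmm([1, 1], m)
retrieved = matrix_retrieval([1, 0], m)
```

Note the operand order in `ssv` and `sdv`: the scalar is the left operand, which is what distinguishes them from a vector-minus-scalar or vector-divided-by-scalar operation.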
[0135] The computation device can also execute a vector logic
instruction, including:
[0136] a Vector-AND-Vector instruction (VAV): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file) respectively, performs an
element-wise AND on two vectors in a vector computation unit, and
writes the result back; optionally, the result is written back to a
specified address in the memory (optionally a scratchpad memory or
a scalar register file);
[0137] a Vector-AND instruction (VAND): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs an AND operation on
each element of the vector in a vector computation unit, and writes
the result back; optionally, the result is written back to a
specified address in the scalar register file of the memory
(optionally a scratchpad memory or a scalar register file);
[0138] a Vector-OR-Vector instruction (VOV): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory) respectively, performs an element-wise OR operation on two
vectors in a vector computation unit, and writes the result back;
optionally, the result is written back to a specified address in
the memory (optionally a scratchpad memory or a scalar register
file);
[0139] a Vector-OR instruction (VOR): according to the instruction,
the device fetches vector data of a specified size from a specified
address in a memory (optionally a scratchpad memory or a scalar
register file), performs an OR operation on each element of the
vector in a vector computation unit, and writes the result back;
optionally, the result is written back to a specified address in
the scalar register file of the memory (optionally a scratchpad
memory or a scalar register file);
[0140] a transcendental function instruction: according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), performs a transcendental
function operation on the vector data in an operation unit, and
writes the result back; optionally, the result is written back to a
specified address in a storage unit of the memory (optionally a
scratchpad memory or a scalar register file).
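The vector logic instructions can be sketched as follows. Two interpretations are assumed here and are not confirmed by the text: the operations are treated as logical (0 or 1) rather than bitwise, and VAND and VOR are read as reductions of a single vector to one scalar, in line with their results going to the scalar register file.

```python
def vav(first, second):
    """Vector-AND-Vector: element-wise logical AND of two vectors."""
    return [int(bool(a) and bool(b)) for a, b in zip(first, second)]


def vov(first, second):
    """Vector-OR-Vector: element-wise logical OR of two vectors."""
    return [int(bool(a) or bool(b)) for a, b in zip(first, second)]


def vand(vector):
    """Vector-AND: 1 only if every element is nonzero."""
    return int(all(vector))


def vor(vector):
    """Vector-OR: 1 if at least one element is nonzero."""
    return int(any(vector))


and_result = vav([1, 0, 1], [1, 1, 0])
or_result = vov([1, 0, 1], [1, 1, 0])
```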
[0141] The computation device can also execute a vector comparison
operation instruction, including:
[0142] a Greater-Equal operation instruction (GE): according to the
instruction, the device may obtain parameters of the instruction,
including a length of a vector, a starting address of two vectors,
and a storage address of an output vector, directly from the
instruction or by accessing the serial number of the register of a
memory (optionally a scratchpad memory or a scalar register file)
provided by the instruction, then read data of the two vectors, and
compare the elements at all positions in the vectors in a vector
comparison operation unit; at the position of a row, if the value
of a previous vector is greater than or equal to the value of a
subsequent vector, the value of the comparison result vector at
that position is set to 1, otherwise it is set to 0; finally, the
comparison result is written back to a specified storage address in
the memory (optionally the scratchpad memory or the scalar register
file);
[0143] a Less-Equal operation instruction (LE): according to the
instruction, the device may obtain the parameters of the
instruction, including the length of a vector, the starting address
of the two vectors, and the storage address of the output vector,
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then read the
data of the two vectors, and compare the elements at all positions
in the vectors in a vector comparison operation unit; at the
position of a row, if the value of a previous vector is less than
or equal to the value of a subsequent vector, the value of the
comparison result vector at that position is set to 1, otherwise it
is set to 0; finally, the comparison result is written back to a
specified storage address in the memory (optionally the scratchpad
memory or the scalar register file);
[0144] a Greater-Than operation instruction (GT): according to the
instruction, the device may obtain the parameters of the
instruction, including the length of a vector, the starting address
of the two vectors, and the storage address of the output vector,
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then read the
data of the two vectors, and compare the elements at all positions
in the vectors in a vector comparison operation unit; at the
position of a row, if the value of a previous vector is greater
than the value of a subsequent vector, the value of the comparison
result vector at that position is set to 1, otherwise it is set to
0; finally, the comparison result is written back to a specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file);
[0145] a Less-Than operation instruction (LT): according to the
instruction, the device may obtain the parameters of the
instruction, including the length of a vector, the starting address
of the two vectors, and the storage address of the output vector,
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then read the
data of the two vectors, and compare the elements at all positions
in the vectors in a vector comparison operation unit; at the
position of a row, if the value of a previous vector is less than
the value of a subsequent vector, the value of the comparison
result vector at that position is set to 1, otherwise it is set to
0; finally, the comparison result is written back to a specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file);
[0146] an Equal operation instruction (EQ): according to the
instruction, the device may obtain the parameters of the
instruction, including the length of a vector, the starting address
of the two vectors, and the storage address of the output vector,
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then read the
data of the two vectors, and compare the elements at all positions
in the vectors in a vector comparison operation unit; at the
position of a row, if the value of a previous vector is equal to
the value of a subsequent vector, the value of the comparison
result vector at that position is set to 1, otherwise it is set to
0; finally, the comparison result is written back to a specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file);
[0147] an Unequal operation instruction (UEQ): according to the
instruction, the device may obtain the parameters of the
instruction, including the length of a vector, the starting address
of the two vectors, and the storage address of the output vector,
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then read the
data of the two vectors, and compare the elements at all positions
in the vectors in a vector comparison operation unit; at the
position of a row, if the value of a previous vector is unequal to
the value of a subsequent vector, the value of the comparison
result vector at that position is set to 1, otherwise it is set to
0; finally, the comparison result is written back to a specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file);
[0148] a Vector Max instruction (VMAX): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory
(optionally a scratchpad memory or a scalar register file), selects
a largest element from the vector data as a result, and writes the
result back; optionally, the result is written back to a specified
address in the scalar register file of the memory (optionally a
scratchpad memory or a scalar register file);
[0149] a Vector Min instruction (VMIN): according to the
instruction, the device fetches vector data of a specified size
from a specified address in a memory (optionally a scratchpad
memory or a scalar register file), selects a minimum element from
the vector data as a result, and writes the result back;
optionally, the result is written back to a specified address in
the scalar register file of the memory (optionally a scratchpad
memory or a scalar register file);
[0150] a Cyclic Shift operation instruction: according to the
instruction, the device may obtain parameters of the instruction
directly from the instruction or by accessing the serial number of
the register of a memory (optionally a scratchpad memory or a
scalar register file) provided by the instruction, then cyclically
shift vectors in a vector shift unit (which may be a separate
vector shift unit or a computation unit), and then write the result
of the shift back to a specified storage address in the memory
(optionally the scratchpad memory or the scalar register file),
where the format of the cyclic shift operation instruction may
include four operation fields: a starting address and length of a
vector, a shift stride, and a storage address of an output vector;
and
[0151] a Random-Vector generation instruction: according to the
instruction, the device reads one or more random distribution
parameters, and the size and storage address of a random vector to
be generated from the instruction or from the register of a memory
(optionally a scratchpad memory or a scalar register file),
generates the random vector that is in line with the random
distribution in a random vector generation unit, and then writes
the result of the random vector back to the specified storage
address in the memory (optionally the scratchpad memory or the
scalar register file).
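The comparison, maximum/minimum, and cyclic shift instructions described above can be sketched in software as follows. This is only an illustrative model, not the hardware of the present disclosure: the memory is assumed to be a flat Python list, the scalar register file is assumed to be a dictionary, all function names are hypothetical, and the shift direction (a positive stride rotating elements toward higher indices) is an assumption because the text does not specify it.

```python
def read_vec(mem, addr, length):
    """Read `length` elements starting at address `addr`."""
    return mem[addr:addr + length]


def write_vec(mem, addr, values):
    """Write `values` back to memory starting at address `addr`."""
    mem[addr:addr + len(values)] = values


def exec_eq(mem, length, addr_a, addr_b, addr_out):
    """EQ: result is 1 where the two vectors are equal, otherwise 0."""
    a, b = read_vec(mem, addr_a, length), read_vec(mem, addr_b, length)
    write_vec(mem, addr_out, [1 if x == y else 0 for x, y in zip(a, b)])


def exec_ueq(mem, length, addr_a, addr_b, addr_out):
    """UEQ: result is 1 where the two vectors differ, otherwise 0."""
    a, b = read_vec(mem, addr_a, length), read_vec(mem, addr_b, length)
    write_vec(mem, addr_out, [1 if x != y else 0 for x, y in zip(a, b)])


def exec_vmax(mem, length, addr, regs, reg_out):
    """VMAX: write the largest element to a scalar register (assumed dict)."""
    regs[reg_out] = max(read_vec(mem, addr, length))


def exec_vmin(mem, length, addr, regs, reg_out):
    """VMIN: write the smallest element to a scalar register (assumed dict)."""
    regs[reg_out] = min(read_vec(mem, addr, length))


def exec_cyclic_shift(mem, length, addr, stride, addr_out):
    """Cyclic shift; assumes a positive stride rotates toward higher indices."""
    v = read_vec(mem, addr, length)
    k = stride % length if length else 0
    write_vec(mem, addr_out, (v[-k:] + v[:-k]) if k else v)
```

The four arguments of `exec_cyclic_shift` after `mem` mirror the four operation fields named in the text (starting address and length of the vector, shift stride, and storage address of the output vector).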
[0152] The Random-Vector generation instruction may be:
[0153] a Uniform distribution instruction (UNIF): according to the
instruction, the device reads the upper and lower bound parameters
of a uniform distribution, and the size and storage address of the random
vector to be generated from the instruction or from the register
file of a memory (optionally a scratchpad memory or a scalar
register file), generates the random vector that is in line with
the uniform distribution in a random vector generation unit, and
then writes the result of the random vector back to the specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file); and
[0154] a Gaussian distribution instruction (GAUS): according to the
instruction, the device reads the mean and variance parameters of a
Gaussian distribution, and the size and storage address of the random
vector to be generated from the instruction or from the register of
a memory (optionally a scratchpad memory or a scalar register
file), generates the random vector that is in line with the
Gaussian distribution in a random vector generation unit, and then
writes the result of the random vector back to the specified
storage address in the memory (optionally the scratchpad memory or
the scalar register file).
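As a rough software analogue of the UNIF and GAUS instructions, the sketch below generates a random vector and writes it back to a specified address. The flat-list memory, the function names, and the use of Python's `random.Random` in place of the random vector generation unit are all assumptions made for illustration. Because the text gives the Gaussian parameters as mean and variance, the sketch converts the variance to a standard deviation before sampling.

```python
import random


def exec_unif(mem, lower, upper, size, addr_out, rng):
    """UNIF: fill `size` slots with samples from a uniform distribution."""
    values = [rng.uniform(lower, upper) for _ in range(size)]
    mem[addr_out:addr_out + size] = values


def exec_gaus(mem, mean, variance, size, addr_out, rng):
    """GAUS: fill `size` slots with samples from a Gaussian distribution."""
    sigma = variance ** 0.5  # random.gauss expects a standard deviation
    values = [rng.gauss(mean, sigma) for _ in range(size)]
    mem[addr_out:addr_out + size] = values


# Usage: a seeded generator makes the sketch reproducible.
memory = [0.0] * 8
generator = random.Random(0)
exec_unif(memory, -1.0, 1.0, 4, 0, generator)
exec_gaus(memory, 0.0, 1.0, 4, 4, generator)
```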
[0155] When the computation device shown in FIG. 1A is used to
execute a convolutional neural network algorithm (a convolution
operation instruction), please refer to the flowchart of the
convolutional neural network algorithm shown in FIG. 1B. As shown
in FIG. 1B, a convolutional neural network includes output data, an
activation function, an input data layer, and a convolution
kernel.
[0156] Each computation process includes: selecting corresponding
input data $x_i$ in the input data layer according to a
convolution window, and then performing an addition operation on
the input data and the convolution kernel. A computation process of
the output data is $s = s(\sum w x_i + b)$, which is to multiply a
convolution kernel $w$ by input data $x_i$, find the sum, add a
bias $b$, and then perform an activation operation $s(h)$ to obtain
the final output data $s$. The multiplication of the convolution kernel
and the input data is vector multiplication.
[0157] According to the size $k_x$ of the convolution kernel on
the X axis and the size $k_y$ of the convolution kernel on the Y
axis, the convolution window firstly selects input data of which
the size is the same as that of the convolution kernel from the
input data of which the size of the X axis is W and the size of the
Y axis is H, performs horizontal translation and then vertical
translation according to translation position vectors $S_x$ and
$S_y$ of the convolution window, and traverses all the input
data.
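The window traversal and the per-window computation described above can be illustrated with a minimal sketch. It assumes a single-channel input stored as a list of rows, no padding, and a scalar bias; the function and parameter names are hypothetical. The activation is passed in as `act` to avoid reusing the symbol s for both the function and the output.

```python
def conv_forward(x, w, b, s_x, s_y, act):
    """Slide a k_y-by-k_x window over an H-by-W input and apply act(sum(w*x) + b).

    The inner loop moves the window horizontally by s_x; the outer loop then
    moves it vertically by s_y, so all input data is traversed.
    """
    height, width = len(x), len(x[0])
    k_y, k_x = len(w), len(w[0])
    out = []
    for top in range(0, height - k_y + 1, s_y):
        row = []
        for left in range(0, width - k_x + 1, s_x):
            h = sum(
                w[j][i] * x[top + j][left + i]
                for j in range(k_y)
                for i in range(k_x)
            ) + b
            row.append(act(h))
        out.append(row)
    return out


def relu(h):
    """One possible activation; the text does not fix a particular one."""
    return h if h > 0 else 0
```

For instance, with a 3-by-3 input, a 2-by-2 kernel, and strides of 1 in both directions, the window visits four positions and the output is 2-by-2.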
[0158] FIG. 1C shows a format of an instruction set according to an
example of the present disclosure. As shown in the figure, a
convolutional neural network operation instruction includes at
least one opcode and at least one operation field. The opcode is
for indicating a function of the convolutional neural network
operation instruction. A convolutional neural network operation
unit can perform a convolutional neural network operation by
identifying the opcode. The operation field is for indicating data
information of the convolutional neural network operation
instruction. The data information may be an immediate operand or a
register number (which, optionally, may be a register file), which
includes a starting address and a length of input data, a starting
address and a length of the convolution kernel, and a type of an
activation function.
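One way to picture this instruction format is as a record with an opcode and several operation fields, each of which is either an immediate operand or a register number. The sketch below is an assumption-laden illustration: the class names, the field names, and the dictionary used as a register file are hypothetical and do not represent the actual encoding of FIG. 1C.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Immediate:
    """Operation field that carries the value directly."""
    value: int


@dataclass(frozen=True)
class RegisterRef:
    """Operation field that names a register holding the value."""
    number: int


Operand = Union[Immediate, RegisterRef]


@dataclass(frozen=True)
class ConvInstruction:
    """Hypothetical layout: one opcode plus the fields listed in the text."""
    opcode: str
    input_addr: Operand
    input_len: Operand
    kernel_addr: Operand
    kernel_len: Operand
    activation: str


def resolve(operand: Operand, regs: Dict[int, int]) -> int:
    """Return the value of a field, reading the register file if needed."""
    if isinstance(operand, Immediate):
        return operand.value
    return regs[operand.number]
```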
[0159] The instruction set includes: convolutional neural network
COMPUTE instructions with different functions, a CONFIG instruction,
an IO instruction, an NOP instruction, a JUMP instruction, and a
MOVE instruction. The above operation instructions will not be
further described herein. For details, please refer to related
descriptions in the above examples.
[0160] Optionally, the instruction set may further include a
convolution activation CONV_ACTIVATE instruction.
[0161] The convolution activation CONV_ACTIVATE instruction:
according to the instruction, the device fetches input data and a
convolution kernel of a specified size from a specified address in
the scratchpad memory (optionally), performs a convolution
operation in a convolution operation component, and then performs
an activation function operation on an output result; the
above-mentioned specified size may be set by the manufacturer or
user.
[0162] In one example, the CONV_ACTIVATE instruction includes: a
convolution operation instruction and an activation instruction.
The activation instruction is configured to perform an activation
function operation, and the convolution operation instruction is
configured to perform a convolution operation. For details, please
refer to related descriptions in the above examples.
[0163] FIG. 1D is a schematic structural diagram of a device for
performing a convolutional neural network forward operation
according to an example of the present disclosure. As shown in FIG.
1D, the device includes an instruction storage unit 1, a controller
unit 2, a data access unit 3, an interconnection module 4, a
primary operation module 5, and a plurality of secondary operation
modules 6. The instruction storage unit 1, the controller unit 2,
the data access unit 3, the interconnection module 4, the primary
operation module 5, and the plurality of secondary operation
modules 6 may all be realized in a form of a hardware circuit (for
instance, including but not limited to FPGA, CGRA, ASIC, analog
circuit, memristor, etc.).
[0164] The instruction storage unit 1 is configured to read an
instruction through the data access unit 3 and store the
instruction.
[0165] The controller unit 2 is configured to read an instruction
from the instruction storage unit 1, decode the instruction into a
control signal for controlling the behavior of other modules, and
send the control signal to other modules such as the data access
unit 3, the primary operation module 5, and the plurality of
secondary operation modules 6.
[0166] The data access unit 3 can access an external address space,
directly read and write data to each storage unit inside the device
to complete the loading and storage of the data.
[0167] The interconnection module 4 is configured to connect the
primary operation module and the secondary operation modules, and
can be implemented into different interconnection topologies (such
as tree structure, ring structure, grid structure, hierarchical
interconnection, bus structure, etc.).
[0168] FIG. 1E schematically shows an implementation of the
interconnection module 4: an H-tree module. The interconnection
module 4 forms a data path between the primary operation module 5
and the plurality of secondary operation modules 6, where the data
path is a binary tree path composed of a plurality of nodes. Each
node can transfer data received from an upstream node to two
downstream nodes, and merge data returned by the two downstream
nodes and return to an upstream node. For instance, at the
beginning of a computational phase of a convolutional neural
network, neuron data in the primary operation module 5 is sent to
each secondary operation module 6 through the interconnection
module 4; when the secondary operation modules 6 finish computing,
neuron values output by the respective secondary operation modules
are spliced stage-by-stage into a complete vector composed of
neurons in the interconnection module. For instance, if there are N
secondary operation modules in the device, input data $x_i$ is
transferred to the N secondary operation modules and each of the
secondary operation modules performs a convolution operation on the
input data $x_i$ and the convolution kernel corresponding to the
secondary operation module to obtain a piece of scalar data. The
scalar data of the respective secondary operation modules are merged
into an intermediate vector including N elements by the
interconnection module 4. If the convolution window obtains a total
of A*B pieces of input data $x_i$ (A pieces in the X direction, B
pieces in the Y direction, where X and Y are coordinate axes of the
three-dimensional orthogonal coordinate system) by traversal, a
convolution operation is performed on the above A*B pieces of $x_i$,
and all the vectors obtained are merged in the primary operation
module to obtain a three-dimensional intermediate result of
A*B*N.
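The stage-by-stage merging performed by the H-tree can be modelled as repeated pairwise concatenation. The sketch below is a simplified software analogy under stated assumptions: each secondary operation module is represented by a dot product of a flattened window with a flattened kernel, and the names `htree_merge` and `forward_windows` are hypothetical.

```python
def htree_merge(leaf_values):
    """Merge the outputs of the leaves pairwise, one tree level at a time."""
    level = [[v] for v in leaf_values]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else []
            merged.append(level[i] + right)  # a node joins its two children
        level = merged
    return level[0] if level else []


def dot(a, b):
    """Scalar produced by one secondary operation module."""
    return sum(p * q for p, q in zip(a, b))


def forward_windows(windows, kernels):
    """For each of the A*B windows, build an N-element intermediate vector."""
    return [htree_merge([dot(win, k) for k in kernels]) for win in windows]
```

With A*B windows and N kernels, `forward_windows` returns A*B vectors of N elements each, which corresponds to the A*B*N intermediate result described above (here flattened over the A and B dimensions).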
[0169] FIG. 1F is a block diagram of a structure of the primary
operation module 5 of a device for performing a convolutional
neural network forward operation according to an example of the
present disclosure. As shown in FIG. 1F, the primary operation
module 5 includes a first operation unit 51, a first data
dependency determination unit 52, and a first storage unit 53.
[0170] The first operation unit 51 includes a vector addition unit
511 and an activation unit 512. The first operation unit 51 is
configured to receive a control signal from the controller unit and
complete various operational functions of the primary operation
module 5. The vector addition unit 511 is configured to perform an
operation of adding a bias in the forward computation of the
convolutional neural network, and perform element-wise addition on
bias data and the intermediate results to obtain a bias result. The
activation unit 512 performs an activation function
operation on the bias result. The bias data may be read in from an
external address space, or may be stored locally.
[0171] The first data dependency determination unit 52 is a port for the
first operation unit 51 to read/write the first storage unit 53, so
as to ensure consistency in reading data from and writing data to
the first storage unit 53. At the same time, the first data
dependency determination unit 52 is also configured to send data
read from the first storage unit 53 to the secondary operation
modules through the interconnection module 4. Output data of the
secondary operation modules 6 is directly sent to the first
operation unit 51 through the interconnection module 4. An
instruction output by the controller unit 2 is sent to the
first operation unit 51 and the first data dependency determination unit
52 to control their behavior.
[0172] The first storage unit 53 is configured to cache input data and
output data used by the primary operation module 5 during a
computation process.
[0173] FIG. 1G is a block diagram of a structure of the secondary
operation modules 6 of a device for performing a convolutional
neural network forward operation according to an example of the
present disclosure. As shown in FIG. 1G, each secondary operation
module 6 includes a second operation unit 61, a second data
dependency determination unit 62, a second storage unit 63, and a
third storage unit 64.
[0174] The second operation unit 61 is configured to receive a
control signal from the controller unit 2 and perform a convolution
operation. The second operation unit includes a vector
multiplication unit 611 and an accumulation unit 612, which are
respectively responsible for a vector multiplication operation and
an accumulation operation in a convolution operation.
[0175] The second data dependency determination unit 62 is
responsible for reading and writing the second storage unit 63
during a computation process. Before performing read/write
operations, the second data dependency determination unit 62 first
ensures that there is no consistency conflict between the reading
and writing of data used by instructions. For instance, all control
signals sent to the second data dependency determination unit 62 are
stored in an instruction queue inside the unit. In this queue, if a
range of data to be read by a reading instruction conflicts with a
range of data to be written by a writing instruction that is located
ahead of it in the queue, the reading instruction can be executed
only after the writing instruction on which it depends has been
executed.
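The conflict check described above amounts to testing whether the address range of a read overlaps the address range of an earlier, not yet executed write. The following sketch illustrates that check only; the `MemOp` record, the half-open address ranges, and the function names are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemOp:
    """Hypothetical queue entry: kind is "read" or "write"."""
    kind: str
    start: int
    length: int


def overlaps(a: MemOp, b: MemOp) -> bool:
    """True if the half-open ranges [start, start + length) intersect."""
    return a.start < b.start + b.length and b.start < a.start + a.length


def can_issue(read_op: MemOp, earlier_ops: List[MemOp]) -> bool:
    """A read may issue only if no earlier pending write overlaps its range."""
    return not any(
        op.kind == "write" and overlaps(read_op, op) for op in earlier_ops
    )
```

A read of addresses 4 to 7 queued behind a write of addresses 0 to 7 would therefore wait, whereas a read of addresses 8 to 11 could proceed.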
[0176] The second storage unit 63 is configured to cache input data
and output scalar data of the secondary operation modules 6.
[0177] The third storage unit 64 is configured to cache convolution
kernel data required by the secondary operation modules 6 in a
computation process.
[0178] FIG. 1H is a flowchart of executing a convolutional neural
network by a convolutional neural network operation device
according to an example of the present disclosure. As shown in FIG.
1H, a process of executing the convolutional neural network
instruction includes:
[0179] a step S1, pre-storing an IO instruction in a starting
address of the instruction storage unit 1;
[0180] a step S2, when the operation starts, reading, by the controller
unit 2, the IO instruction from the starting address of the
instruction storage unit 1, and according to a control signal
decoded from the instruction, reading, by the data access unit 3,
all corresponding convolutional neural network operation
instructions from an external address space, and caching the
instructions in the instruction storage unit 1;
[0181] a step S3, reading, by the controller unit 2, a next IO
instruction from the instruction storage unit, and according to a
control signal obtained by decoding, reading, by the data access
unit 3, all data (such as input data, an interpolation table for a
quick activation function operation, a constant table for
configuring parameters of the operation device, bias data, etc.)
required by the primary operation module 5 from the external
address space to the first storage unit 53 of the primary operation
module 5;
a step S4, reading, by the controller unit 2, a next IO instruction
from the instruction storage unit, and according to a control signal
decoded from the instruction, reading, by the data access unit 3,
convolution kernel data required by the secondary operation modules
6 from the external address space;
[0182] a step S5, reading, by the controller unit 2, a next CONFIG
instruction from the instruction storage unit, and according to a
control signal obtained by decoding, configuring, by the device,
various constants required by the computation of the neural network
layer; for instance, the first operation unit 51 and the second
operation unit 61 may configure a value of an internal register of
the parameter configuration unit in the control signal, where the
parameter includes, for instance, data required by an activation
function;
[0183] a step S6, reading, by the controller unit 2, a next COMPUTE
instruction from the instruction storage unit, and according to a
control signal decoded from the instruction, sending, by the
primary operation module 5, input data in a convolution window to
each secondary operation module 6 through an interconnection module
4 and saving the input data to the second storage unit 63 of the
secondary operation module 6; and then moving the convolution
window according to the instruction;
[0184] a step S7, according to the control signal decoded from the
COMPUTE instruction, reading, by the second operation unit 61 of the
secondary operation module 6, the convolution kernel from the third
storage unit 64; reading the input data from the second storage
unit 63 to complete the convolution operation of the input data and
the convolution kernel; and returning an obtained intermediate
result through the interconnection module 4;
[0185] a step S8, in the interconnection module 4, splicing
intermediate results returned from respective secondary operation
modules 6 stage by stage to obtain a complete intermediate
vector;
[0186] a step S9, obtaining, by the primary operation module 5, the
intermediate vector returned by the interconnection module 4;
traversing, by the convolution window, all input data; splicing, by
the primary operation module, all returned vectors into an
intermediate result; according to the control signal decoded from
the COMPUTE instruction, reading bias data from the first storage
unit 53, adding the intermediate result and the bias data in a
vector addition unit 511 to obtain a bias result; activating the
bias result by the activation unit 512, and writing final output
data back to the first storage unit; and
[0187] a step S10, reading, by the controller unit 2, a next IO
instruction from the instruction storage unit, and according to a
control signal decoded from the instruction, storing, by the data
access unit 3, the output data in the first storage unit 53 to a
specified address in the external address space, then the operation
finishes.
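The sequence of steps S3 to S10 can be summarized by a toy interpreter that processes IO, CONFIG, and COMPUTE instructions in order. This is a heavily simplified sketch: the tuple encoding of instructions, the operand strings such as "load_primary", the dictionaries standing in for the external address space and the storage units, and the choice of activations are all hypothetical.

```python
def relu(h):
    return h if h > 0 else 0


ACTIVATIONS = {"identity": lambda h: h, "relu": relu}


def run(program, external):
    """Execute a hypothetical instruction list against an external space."""
    primary = {}   # stands in for the first storage unit 53
    kernels = []   # stands in for the third storage units 64
    config = {"activation": "identity"}
    for opcode, operand in program:
        if opcode == "IO" and operand == "load_primary":      # step S3
            primary["windows"] = external["windows"]
            primary["bias"] = external["bias"]
        elif opcode == "IO" and operand == "load_kernels":    # step S4
            kernels = external["kernels"]
        elif opcode == "CONFIG":                              # step S5
            config.update(operand)
        elif opcode == "COMPUTE":                             # steps S6 to S9
            act = ACTIVATIONS[config["activation"]]
            output = []
            for win in primary["windows"]:
                vec = [sum(w * x for w, x in zip(k, win)) for k in kernels]
                output.append([act(v + b) for v, b in zip(vec, primary["bias"])])
            primary["output"] = output
        elif opcode == "IO" and operand == "store_output":    # step S10
            external["output"] = primary["output"]
        else:
            raise ValueError(f"unsupported instruction: {opcode!r}")
    return external


PROGRAM = [
    ("IO", "load_primary"),
    ("IO", "load_kernels"),
    ("CONFIG", {"activation": "relu"}),
    ("COMPUTE", None),
    ("IO", "store_output"),
]
```

Each window here is an already flattened selection of input data, so the sketch omits the window movement of step S6 and the tree-based splicing of step S8.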
[0188] The implementation of a multi-layer convolutional neural
network is similar to that of a single-layer convolutional neural
network. After an upper layer of the convolutional neural network
is executed, an operation instruction of a next layer uses an
output data address of the upper layer stored in the primary
operation unit as an input data address of this layer. Similarly,
the address of a convolution kernel and the address of bias data in
the instruction may also be changed to an address corresponding to
this layer.
[0189] The present disclosure uses a device and an instruction set
for performing the convolutional neural network forward operation,
which addresses the problems of insufficient CPU and GPU computation
performance and of high front-end decoding overhead.
The present disclosure effectively improves support for the forward
operation of a multi-layer convolutional neural network.
[0190] By using a dedicated on-chip cache for the forward operation
of a multi-layer convolutional neural network, input neurons and
convolution kernel data may be fully reused, which may avoid
repeated reading of these data from the memory, reduce the memory
access bandwidth, and prevent the memory bandwidth from becoming a
performance bottleneck of the forward operation of a multi-layer
artificial neural network.
[0191] Based on the above examples, FIG. 2 shows an information
processing method according to an example of the present
disclosure. The method shown in FIG. 2 may include:
[0192] a step S102, obtaining a first image to be processed, where
the first image has a resolution of a first-level size.
[0193] The first image may be a picture or a video frame image, and
a count of the first image is not limited herein. In other words,
the first image may be one or more pictures, or be a frame image of
one or more segments of videos.
[0194] The method may further include a step S104, using, by the
computation device, the first image as input of the operation unit
to call the operation instruction to perform resolution
optimization on the first image, so as to obtain a second image,
where
[0195] the second image has a resolution of a second-level size,
the first-level size is smaller than the second-level size, and the
operation instruction is a preset instruction for optimizing an
image resolution.
[0196] The operation instruction includes, but is not limited to, a
convolution operation instruction, a pooling instruction, a
normalization instruction, a non-linear activation instruction, and
the like. For details, please refer to related descriptions in the
above examples of FIG. 1. Optionally, the process of calling
related operation instructions in the computation device (such as
an operation unit) to perform resolution optimization on the target
image will not be further described herein. For details, please
refer to the specific descriptions of calling related instructions
in the above examples of FIG. 1.
[0197] The first-level size and the second-level size are both used
to describe the resolution of an image, and the first-level size is
less than or equal to the second-level size. For example, the
resolution size of the first image (the first-level size) is
800×600, and the resolution size of the second image (the
second-level size) may be 1024×768.
[0198] Some examples involved in the present disclosure are
described below.
[0199] In the step S102, an input format of the first image may be
an image format such as bmp, gif, jpeg, etc., or may be
multi-dimensional matrix data converted from pixels of the
image.
[0200] In an optional example, the step S102 specifically includes:
obtaining an original image to be processed input by a user and
pre-processing the original image to obtain the first image to be
processed, where the pre-processing is an operation customized by
the user side or the terminal side (the computation device side)
and includes one or more of the following processing: translation,
scaling transformation, non-linear transformation, normalization,
format conversion, data deduplication, processing of data
exception, data missing filling, color conversion, and image
restoration.
[0201] In a specific implementation, the computation device obtains
an original image to be processed input by a user. The original
image will not be further described herein. For
details, please refer to the related description of the first
image. Further, the computation device may call a related operation
instruction to perform pre-processing, such as normalization,
format conversion, color conversion, etc., on the original image to
obtain the first image to be processed. The pre-processing
includes, but is not limited to, format conversion (such as
normalization processing and the like), color conversion (such as
converting into a gray-scale image), image restoration, image
modification, and other processing. Correspondingly, the operation
instruction may be an instruction related to the pre-processing.
For instance, when the pre-processing is the normalization
processing, the corresponding operation instruction is a
normalization instruction.
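Two of the pre-processing operations named above, color conversion to gray scale and normalization, can be sketched as follows. The sketch makes several assumptions that the text does not state: the image is a list of rows of (R, G, B) tuples, the gray-scale conversion uses the common ITU-R BT.601 luma weights, and the normalization is min-max scaling to the range 0 to 1. All function names are hypothetical.

```python
def to_gray(image):
    """Convert rows of (R, G, B) pixels to gray values (BT.601 weights)."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in image
    ]


def normalize(image):
    """Min-max normalization to [0, 1]; a constant image maps to all zeros."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


def preprocess(image, steps):
    """Apply the selected pre-processing steps in the given order."""
    for step in steps:
        image = step(image)
    return image
```

A call such as `preprocess(original, [to_gray, normalize])` would then yield a candidate first image, with the list of steps chosen by the user side or the terminal side.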
[0202] The pre-processing includes, but is not limited to, any one
or more of the following: data format conversion (such as
normalization, integer data conversion, etc.), data deduplication,
data exception processing, filling missing data, scaling,
translation, and the like. For instance, the data format conversion
may specifically be: conversion between continuous data and
discrete data; power conversion, which is to convert non-power
weight data in input data (a multi-dimensional matrix of the target
image) of a neural network to power weight data; statistics of
floating-point data, which is to count bits of exponent offset and
exponent bits required for storing different types of data during a
forward operation of the artificial neural network; and
floating-point data conversion, which is to convert between a
short-bit floating point data type and a long-bit floating point
data type, and the like, which is not limited in the present
disclosure.
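The conversion between a short-bit and a long-bit floating point data type can be illustrated with standard IEEE 754 formats. This is only an assumed stand-in: the text does not say which bit widths or encodings the device uses, so the sketch takes 16-bit half precision as the short type and 32-bit single precision as the long type, using Python's `struct` module. The function names are hypothetical.

```python
import struct


def to_short_float(value):
    """Round a value to IEEE 754 half precision (assumed short-bit type)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_long_float(value):
    """Round a value to IEEE 754 single precision (assumed long-bit type)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Usage: the short type keeps fewer significant bits than the long type.
short_error = abs(to_short_float(0.1) - 0.1)
long_error = abs(to_long_float(0.1) - 0.1)
```

Values such as 1.0 or 0.5 are exactly representable in both types, whereas 0.1 is rounded more coarsely in the short type, which is the precision trade-off that such a conversion accepts.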
[0203] It should be understood that, when the preprocessing is a
processing such as translation, scaling transformation, and
non-linear operation conversion, the computation device converts
the first image into image pixel data that can be recognized by the
device. Conversely, when a subsequent computation device performs
similar pre-processing on the image pixel data, it can be
understood that the computation device can map/convert the image
pixel data into a corresponding image and output the image to the
user for viewing.
[0204] The step S104 has the following two implementations.
[0205] As a specific implementation of S104, the computation device
calls an operation instruction to perform feature extraction on the
first image, so as to directly obtain and output the second
image.
[0206] As another implementation of S104, the computation device
calls an operation instruction to perform feature extraction on the
first image to obtain a feature image, then pre-processes the
feature image to obtain the second image, where the pre-processing
is an operation preset by a user side or a terminal side.
[0207] Firstly, some examples involved in the feature extraction
are described below. The purpose of the feature extraction in the
present disclosure is to optimize the resolution of the first image
so as to change it into a super-resolution second image. In other
words, the feature extraction in the present disclosure can be
regarded as image resolution optimization.
[0208] Specifically, the computation device may call related
instructions in the operation unit to perform feature extraction on
the first image to obtain a feature image. It should be understood
that when an expression form of the first image is a
multi-dimensional matrix, the feature extraction performed on the
first image is a process of data dimensionality reduction and
resolution optimization, which may reduce the complexity of data
processing to a certain extent, reduce the computation load of a
computation device, and improve computation efficiency.
[0209] In an optional example, the operation instruction may be an
instruction for feature extraction. For details, please refer to
related descriptions in the above examples.
[0210] In an optional example, the operation instruction may
include any one or more of the following instructions: a
convolution operation instruction, a normalization instruction, a
non-linear activation instruction, and a pooling instruction. It
should be noted that when there are a plurality of the first
operation instructions (which can also be called an operation
instruction set), an order, count, and calling thread of the
respective first operation instructions called in the operation
instruction set may be customized by the user side or the
computation device side (such as a terminal), which is not limited
herein.
[0211] FIG. 3 shows a schematic diagram of calling an operation
instruction based on a single thread to perform feature extraction.
Specifically, the controller unit may extract a convolution
operation instruction from the register unit and send the
convolution operation instruction to the operation unit to process
the first image, thereby obtaining a first intermediate image. Then
the controller unit may fetch a normalization instruction from the
register unit and send the normalization instruction to the
operation unit to process the first intermediate image, thereby
obtaining a second intermediate image. The controller unit may
fetch a non-linear activation instruction from the register unit
and send the non-linear activation instruction to the operation
unit to process the second intermediate image, thereby obtaining a
third intermediate image. Then the controller unit may fetch a
pooling instruction from the register unit and send the pooling
instruction to the operation unit to process the third intermediate
image, thereby obtaining a feature image after feature
extraction.
[0212] Optionally, when each operation instruction shown in FIG. 3
is called for execution, the execution order may be changed; for
instance, the normalization instruction may be called before the
convolution operation instruction, which is not limited herein.
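The single-thread flow of FIG. 3 can be pictured as a list of stages applied one after another, where reordering the list changes the execution order. The sketch below is an illustration under assumptions: the image is a list of rows of numbers, the convolution has no padding and a stride of 1, the normalization is min-max scaling, the activation is ReLU, and the pooling is non-overlapping max pooling. None of these particular choices is fixed by the text, and all names are hypothetical.

```python
def convolve(img, kernel):
    """Valid convolution (no padding, stride 1) of a 2-D list with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(kernel[j][i] * img[r + j][c + i] for j in range(kh) for i in range(kw))
            for c in range(len(img[0]) - kw + 1)
        ]
        for r in range(len(img) - kh + 1)
    ]


def normalize(img):
    """Min-max normalization over the whole image."""
    flat = [v for row in img for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1
    return [[(v - lo) / span for v in row] for row in img]


def relu(img):
    """Non-linear activation applied element-wise."""
    return [[max(0, v) for v in row] for row in img]


def max_pool(img, size=2):
    """Non-overlapping max pooling with a size-by-size window."""
    return [
        [
            max(img[r + j][c + i] for j in range(size) for i in range(size))
            for c in range(0, len(img[0]) - size + 1, size)
        ]
        for r in range(0, len(img) - size + 1, size)
    ]


def run_pipeline(img, stages):
    """Apply each stage in turn; each output is the next intermediate image."""
    for stage in stages:
        img = stage(img)
    return img
```

The order of FIG. 3 corresponds to `[lambda m: convolve(m, kernel), normalize, relu, max_pool]`; placing `normalize` first gives the alternative order mentioned above.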
[0213] In an optional example, the present disclosure supports
multi-thread (multiple pipelines) feature extraction processing. In
other words, the feature extraction in the present disclosure may
be implemented by thread splitting or merging. Implementations of
thread splitting include, but are not limited to, data copying,
data grouping, and the like, while implementations of thread
merging include, but are not limited to, data addition and
subtraction, data multiplication, data combination and arrangement,
and the like.
[0214] FIG. 4 shows a schematic diagram of calling an operation
instruction based on multiple threads to perform feature
extraction. Specifically, the computation device may perform data
operations of two threads at the same time. The operation
instructions to be used in each thread may be the same or
different, and an order and a count of calling the operation
instructions are not limited herein. As shown in FIG. 4, one of the
threads sequentially executes the operation instructions in FIG. 3
twice while, at the same time, the other thread sequentially
executes the operation instructions in FIG. 3 once.
[0215] It should be noted that when multi-thread feature extraction
is involved in the present disclosure, the feature image after
feature extraction may be obtained by aggregating result data
processed by each thread. In other words, the feature image data
after the feature extraction may include, but is not limited to, a
plurality of pieces of matrix data with the same dimension or
different dimensions, which is not limited herein.
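Thread splitting by data copying and thread merging by data addition can be sketched with a thread pool. The example is a software analogy only: the use of `concurrent.futures.ThreadPoolExecutor`, the representation of a branch as a pair of stages and a repeat count, and the function names are assumptions. Merging by element-wise addition requires the branch results to have the same length, which is an additional assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branch(data, stages, repeats):
    """Run the stage sequence `repeats` times on one thread's copy of the data."""
    for _ in range(repeats):
        for stage in stages:
            data = stage(data)
    return data


def merge_by_addition(results):
    """Thread merging by element-wise addition of equally sized results."""
    return [sum(values) for values in zip(*results)]


def multi_thread_extract(data, branches, merge):
    """Split by copying the data to each branch, run in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [
            pool.submit(run_branch, list(data), stages, repeats)
            for stages, repeats in branches
        ]
        results = [future.result() for future in futures]
    return merge(results)
```

To mirror FIG. 4, `branches` would hold the same stage list twice, once with a repeat count of 2 and once with a repeat count of 1.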
[0216] Secondly, the examples involved in pre-processing the
feature image are introduced below. For details, please refer to
the related description of pre-processing the original image in the
step S102. The pre-processing includes, but is not limited to,
translation, scaling, non-linear operation, and the like.
[0217] The examples of the present disclosure are briefly
introduced below combined with the examples of FIGS. 1A-1H.
[0218] In the step S102, the computation device obtains a first
image to be processed input by the user. In an optional example,
the communication unit may be the storage medium (the off-chip
memory) shown in FIG. 1A or be an input/output (IO) unit, which is not
limited herein.
[0219] In an optional example, the computation device may be the
computation device shown in FIG. 1A or FIG. 1D. Specifically, the
computation device can store various operation instructions in the
register unit or the instruction storage unit through the data
access unit; further, the computation device can read/write and
store various operation instructions through the data access unit.
The controller unit is configured to control the reading of various
operation instructions from the register unit (or the instruction
storage unit, etc.) and decode the operation instruction into an
executable operation instruction. Optionally, the controller unit
may also send the operation instruction to the operation unit for
execution. Specifically, related arithmetic units can be called in
turn for data processing according to the computation topology
corresponding to the operation instruction. The convolution
operation instruction is described in detail below as an instance.
The interconnection module is configured to receive input data (the
first image) and a computation topology, where the computation
topology is a topology corresponding to the operation instruction.
For instance, when the operation instruction is a convolution
operation instruction, the corresponding computation topology may
be: the multiplication arithmetic unit--the addition arithmetic
unit--(optional) the activation arithmetic unit. Each type of
arithmetic unit is configured to perform a corresponding
computational function operation, for instance, the multiplication
arithmetic unit is configured to perform a multiplication
operation, etc., which will not be further described in the present
disclosure.
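As a non-limiting illustration, the calling of arithmetic units in turn according to a computation topology may be sketched as follows. The opcode names CONV and CONV_ACT, the state dictionary, and the use of ReLU as the activation are assumptions made only for this sketch and are not defined by the disclosure.

```python
# Hypothetical arithmetic units; each performs one computational function.
def multiplication_unit(state):
    state["products"] = [w * x for w, x in zip(state["kernel"], state["data"])]
    return state


def addition_unit(state):
    state["value"] = sum(state["products"]) + state["bias"]
    return state


def activation_unit(state):
    # ReLU is an assumed choice of activation for this sketch.
    state["value"] = max(0.0, state["value"])
    return state


# Hypothetical opcode names mapped to their computation topologies.
TOPOLOGIES = {
    "CONV": [multiplication_unit, addition_unit],
    "CONV_ACT": [multiplication_unit, addition_unit, activation_unit],
}


def execute(opcode, kernel, data, bias=0.0):
    """Call the arithmetic units in the order given by the topology."""
    state = {"kernel": kernel, "data": data, "bias": bias}
    for unit in TOPOLOGIES[opcode]:
        state = unit(state)
    return state["value"]
```

Representing the topology as an ordered list keeps the optional activation stage a matter of configuration rather than of separate control logic.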
[0220] Other descriptions of the step S102 are similar to those of
the above examples, which will not be further described herein.
[0221] Correspondingly, specific implementations of the step S104
are described below.
[0222] In a specific implementation, the computation device fetches
a corresponding operation instruction from the register unit (or
the instruction storage unit) through the controller unit and the
data access unit, where the operation instruction is configured to
process the first image (which may specifically be resolution
optimization). For the operation instruction, please refer to the
related introduction in the above examples; for instance, the
instruction may be an operation instruction for resolution
optimization. The count of the operation instructions is not
limited herein.
[0223] Further, after the controller unit fetches the operation
instruction, the controller unit sends the operation instruction to
the operation unit to perform resolution optimization on the first
image in the operation unit according to the computation topology
corresponding to the operation instruction, so as to obtain the
second image.
[0224] A specific implementation process of the step S104 is
described in detail below with the operation instruction being a
convolution operation instruction as an instance.
[0225] In a specific implementation, referring to the computation
device shown in FIG. 1A, the computation device obtains a first
image to be processed input by a user through the communication
unit (or a storage medium, or an off-chip memory). Optionally, the
computation device may call a related computation instruction to
convert the first image into a preset format, thereby
obtaining image data which can be identified and processed by the
computation device, such as a matrix or vector composed of i pieces
of pixel data $x_i$. The preset format is customized by the user
side or the computation device side. Further, the computation
device fetches a convolution operation instruction from the
register unit through the data access unit and the controller unit,
and sends the convolution operation instruction to the operation
unit for execution. In other words, the formula to be executed is
$s = s(\sum w x_i + b)$, where $w$ is the convolution kernel, $x_i$
is input data, and $b$ is a bias. Correspondingly, the
computation device controls the operation unit to execute the
convolution operation instruction on the input data $x_i$ (the
first image). Specifically, the computation device calls the
multiplication arithmetic unit in the operation unit to multiply the
convolution kernel $w$ by the input data $x_i$, calls the addition
arithmetic unit to find the sum, adds the bias $b$ to obtain
$h = \sum w x_i + b$, and then calls the
activation arithmetic unit to perform an activation operation s(h),
so as to obtain a final output result $s$. The output result is the
second image or intermediate data. When the output result is
intermediate data, according to a similar computation principle of
the above convolution operation instruction, the computation device
may further call other operation instructions to process the
intermediate data. The process is repeated until the second image
is obtained.
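As a non-limiting illustration, the multiply, sum, bias, and activation sequence described above may be sketched as follows for one-dimensional data. The function names, the sliding-window layout, and the choice of a sigmoid as the default activation are assumptions made only for this sketch.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def conv_output(kernel, window, bias, activation=sigmoid):
    """Compute s = activation(sum(w * x_i) + b) for one window of input data."""
    h = sum(w * x for w, x in zip(kernel, window)) + bias
    return activation(h)


def conv1d(kernel, data, bias, activation=sigmoid):
    """Slide the kernel over the data and collect one output per position."""
    k = len(kernel)
    return [
        conv_output(kernel, data[i:i + k], bias, activation)
        for i in range(len(data) - k + 1)
    ]
```

The list returned by conv1d may serve as intermediate data to which a further operation is applied, mirroring the repeated processing described above.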
[0226] In another specific implementation, referring to the
computation device shown in FIG. 1D, the process is similar to that
of the above implementation of the step S104, except that the
computation device shown in FIG. 1D is used.
The operation unit may specifically include a primary operation
module, secondary operation modules, and an interconnection module
connecting the primary operation module and the secondary operation
modules. The interconnection module may be configured to transfer
data between the primary operation module and the secondary
operation modules, receive a computation topology corresponding to
an operation instruction, etc. The computation device may control
an implementation of a bias $b$ operation and an activation s(h)
operation in the convolution operation in the primary operation
module, and control an implementation of a vector multiplication
operation $w x_i$ and an accumulation operation $\sum$ in the
respective secondary operation modules. Specifically, the
computation device may transfer input data $x_i$ (the first image) to
each secondary operation module through the controller unit, so as
to first call a multiplication arithmetic unit in each secondary
operation module to multiply a convolution kernel $w$ by the input
data $x_i$, and then call an addition arithmetic unit to sum and
obtain an output scalar. Then the interconnection module is
configured to accumulate and splice output scalars of the
respective secondary operation modules stage by stage into an
intermediate vector and send the intermediate vector to the primary
operation module. Further, the computation device calls the
addition arithmetic unit in the primary operation module to splice
intermediate vectors corresponding to all input data into an
intermediate result, adds a bias $b$ to the intermediate result, and
then calls an activation arithmetic unit to perform an activation
operation s(h) to obtain a final output result $s$.
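As a non-limiting illustration, the division of work between the primary operation module and the secondary operation modules may be sketched as follows. All function names are hypothetical, ReLU is an assumed activation, and the secondary modules are evaluated one after another here although the disclosure describes hardware modules that may work in parallel.

```python
def relu(h):
    return max(0.0, h)


def secondary_module(kernel, data):
    """Vector multiplication followed by accumulation gives one output scalar."""
    return sum(w * x for w, x in zip(kernel, data))


def interconnection_splice(scalars):
    """Splice the output scalars into one intermediate vector."""
    return list(scalars)


def primary_module(intermediate, bias, activation=relu):
    """Add the bias element-wise, then apply the activation."""
    return [activation(v + b) for v, b in zip(intermediate, bias)]


def forward(kernels, data, bias, activation=relu):
    # Every secondary module receives the same input data and its own kernel.
    scalars = [secondary_module(k, data) for k in kernels]
    intermediate = interconnection_splice(scalars)
    return primary_module(intermediate, bias, activation)
```

For instance, forward([[1, 0], [0, 1], [1, 1]], [2, -3], [0, 0, 0.5]) yields one output per convolution kernel.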
[0227] For the implementation of calling related operation
instructions in the computation device to process the first image,
please refer to related descriptions of the above FIGS. 1A to 1H.
In other words, the examples of the FIGS. 1A to 1H may also be
correspondingly applied to the examples of the information
processing method described in FIG. 2, and will not be further
described herein. It should be understood that the convolution
operation instruction in the above description is only used as an
instance to illustrate the convolution operation instruction
calling and data processing, which is not a limitation;
accordingly, when the operation instruction is another instruction
instead of the convolution operation instruction, a related
processing method similar to that of the convolution operation
instruction may also be used to implement steps of the method
examples of the present disclosure.
[0228] Based on the examples of the present disclosure, the
resolution of an image may be improved. Compared with the prior art
that uses a general-purpose processor and software for resolution
optimization, the present disclosure may have technical effects of
lower power consumption and faster speed.
[0229] FIG. 5 is a structural diagram of a computation device
(which may be a terminal) according to an example of the present
disclosure. The computation device shown in FIG. 5 includes a
communication unit 617 and an operation unit 614, where
[0230] the communication unit 617 is configured to obtain a first
image to be processed, where the first image has a resolution of a
first-level size;
[0231] the operation unit 614 is configured to obtain and call an
operation instruction to perform resolution optimization on the
first image to obtain a second image, where
[0232] the second image has a resolution of a second-level size,
the first-level size is smaller than the second-level size, and the
operation instruction is a preset instruction for optimizing an
image resolution.
[0233] Optionally, the computation device further includes a
storage medium 611 (optional), a register unit 612, an
interconnection module 613, a controller unit 615, and a data
access unit 616. For the above function units, please refer to
related descriptions of the examples in FIG. 1. Optionally, the
communication unit and the storage medium may be the same or
different. For instance, the communication unit may be a storage
medium or be an IO unit of the computation device, which is not
limited herein.
[0234] In an optional example,
[0235] the communication unit is configured to obtain an original
image to be processed input by a user, where the original image has
a resolution of the first-level size, and
[0236] the operation unit is configured to pre-process the original
image to obtain the first image to be processed, where the
pre-processing is an operation preset by a user side or a terminal
side.
[0237] In an optional example, the computation device further
includes a register unit 612 and a controller unit 615, where
[0238] the controller unit is configured to fetch an operation
instruction from the register unit, and send the operation
instruction to the operation unit;
[0239] the operation unit is configured to call the operation
instruction to perform feature extraction on the first image to
obtain a feature image; and
[0240] the operation unit is configured to pre-process the feature
image to obtain the second image, where the pre-processing is an
operation preset by a user side or a terminal side.
[0241] In an optional example, the pre-processing includes one or
more of the following processing: translation, scaling
transformation, non-linear transformation, normalization, format
conversion, data deduplication, processing of data exception, and
data missing filling.
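As a non-limiting illustration, three of the listed pre-processing operations may be sketched as follows. Filling missing values with the mean of the present values, min-max normalization, and order-preserving removal of duplicate rows are assumed choices for this sketch; the disclosure does not prescribe particular algorithms.

```python
def fill_missing(values, fill=None):
    """Replace None entries; by default use the mean of the present values."""
    present = [v for v in values if v is not None]
    if fill is None:
        fill = sum(present) / len(present) if present else 0.0
    return [fill if v is None else v for v in values]


def deduplicate(rows):
    """Drop repeated rows while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def normalize(values):
    """Min-max normalization into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The operations can be chained in any order preset by the user side or the terminal side, for instance normalize(fill_missing(row)) for each row of pixel data.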
[0242] In an optional example,
[0243] the operation unit is configured to perform feature
extraction on the first image based on an operation instruction set
of at least one thread to obtain a feature image, wherein the
operation instruction set includes at least one of the operation
instructions, and an order of calling each operation instruction in
the operation instruction set is customized by a user side or a
terminal side.
[0244] In an optional example, the computation device further
includes a data access unit and a storage medium,
[0245] the operation unit is configured to send the second image to
the data access unit and store the second image in the storage
medium.
[0246] In an optional example, the operation unit includes a
primary operation module and a plurality of secondary operation
modules, where the primary operation module is interconnected with
the plurality of secondary operation modules by an interconnection
module, and when the operation instruction is a convolution
operation instruction,
[0247] the secondary operation modules are configured to implement
a convolution operation of input data and convolution kernels in a
convolutional neural network algorithm, where the input data is the
first image and the convolutional neural network algorithm
corresponds to the convolution operation instruction,
[0248] the interconnection module is configured to implement data
transfer between the primary operation module and the secondary
operation modules; before a forward operation of a neural network
fully connected layer starts, the primary operation module sends
the input data to each secondary operation module through the
interconnection module; and after the computation of the secondary
operation modules is completed, the interconnection module splices
output scalars of the respective secondary operation modules stage
by stage into an intermediate vector and sends the intermediate
vector back to the primary operation module, and
[0249] the primary operation module is configured to splice
intermediate vectors corresponding to all input data into an
intermediate result, and perform subsequent operations on the
intermediate result.
[0250] In an optional example,
[0251] the primary operation module is configured to add bias data
to the intermediate result, and then perform an activation
operation.
[0252] In an optional example, the primary operation module
includes a first operation unit, where the first operation unit
includes a vector addition unit and an activation unit,
[0253] the vector addition unit is configured to implement a bias
operation of a convolutional neural network operation and perform
element-wise addition on bias data and the intermediate result to
obtain a bias result; and
[0254] the activation unit is configured to perform an activation
function operation on the bias result.
[0255] In an optional example, the primary operation module
includes a first storage unit, a first operation unit, and a first
data dependency determination unit, where
[0256] the first storage unit is configured to cache input data and
output data used by the primary operation module during a
computation process, wherein the output data includes the second
image,
[0257] the first operation unit is configured to perform various
operational functions of the primary operation module;
[0258] the first data dependency determination unit is configured to
ensure that there is no consistency conflict in reading data from
and writing data to the first storage unit, read an input neuron
vector from the first storage unit, and send the vector to the
secondary operation modules through the interconnection module;
and
[0259] an intermediate result vector from the interconnection
module is sent to the first operation unit.
[0260] In an optional example, the secondary operation modules
include a second operation unit, where the second operation unit
includes a vector multiplication unit and an accumulation unit,
[0261] the secondary operation modules are configured to perform a
convolution operation of input data and a convolution kernel in a
convolutional neural network algorithm, which includes:
[0262] the vector multiplication unit is configured to perform a
vector multiplication operation of a convolution operation, and
[0263] the accumulation unit is configured to perform an
accumulation operation of the convolution operation.
[0264] In an optional example, each secondary operation module
includes a second operation unit, a second data dependency
determination unit, a second storage unit, and a third storage
unit;
[0265] the second operation unit is configured to perform various
arithmetic and logical operations of the secondary operation
modules,
[0266] the second data dependency determination unit is configured
to perform a reading/writing operation on the second storage unit
and the third storage unit during a computation process to ensure
that there is no consistency conflict between the reading and
writing operations on the second storage unit and the third storage
unit,
[0267] the second storage unit is configured to cache input data
and an output scalar obtained from the computation performed by the
secondary operation module, and
[0268] the third storage unit is configured to cache a convolution
kernel required by the secondary operation module in the
computation process.
[0269] In an optional example, the first data dependency
determination unit or the second data dependency determination unit
ensures that there is no consistency conflict in reading and
writing in the following manners: storage
addresses corresponding to data/instructions stored in the
corresponding storage unit do not overlap; or determining whether
there is dependency between a control signal that has not been
executed and data of a control signal that is being executed; if
there is no dependency, the control signal is allowed to be issued
immediately; otherwise, the control signal is not allowed to be
issued until all control signals on which the control signal is
dependent have been executed; where
[0270] the computation device controls the controller unit to
obtain an operation instruction from the register unit and decode
the operation instruction into the control signal for controlling
behavior of other modules, where the other modules include the
primary operation module and the plurality of secondary
operation modules.
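As a non-limiting illustration, the decision of whether a control signal may be issued immediately may be sketched as follows. Modelling a dependency as an overlap between the address ranges read or written by two control signals is an assumption made for this sketch, as are the class name ControlSignal and the half-open address ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignal:
    name: str
    reads: tuple   # (start, end) half-open address range that is read
    writes: tuple  # (start, end) half-open address range that is written


def overlaps(a, b):
    """True when two half-open address ranges share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def depends_on(pending, in_flight):
    """Assumed criterion: any read/write or write/write range overlap."""
    return (
        overlaps(pending.reads, in_flight.writes)
        or overlaps(pending.writes, in_flight.writes)
        or overlaps(pending.writes, in_flight.reads)
    )


def can_issue(pending, executing):
    """Issue immediately only when no executing signal is depended upon."""
    return not any(depends_on(pending, s) for s in executing)


load = ControlSignal("load", reads=(0, 0), writes=(0, 16))
mac = ControlSignal("mac", reads=(0, 16), writes=(32, 48))
other = ControlSignal("other", reads=(64, 80), writes=(80, 96))
```

In this sketch mac must wait for load because it reads the addresses that load writes, whereas other touches disjoint addresses and can be issued at once.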
[0271] In an optional example, the plurality of secondary operation
modules are configured to compute respective output scalars in
parallel by configuration using the same input data and respective
convolution kernels.
[0272] In an optional example, an activation function active used
by the primary operation module may be any of the following
non-linear functions: sigmoid, tanh, relu, softmax, or may be a
linear function.
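As a non-limiting illustration, the named activation functions may be sketched as follows using their standard mathematical definitions. The function names and the max-subtraction step in softmax, which is added for numerical stability, are choices made for this sketch; note that softmax operates on a whole vector while the other functions operate on a single value.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def tanh(h):
    return math.tanh(h)


def relu(h):
    return max(0.0, h)


def softmax(vector):
    """Exponentiate relative to the maximum entry, then normalize to sum 1."""
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]


def linear(h, slope=1.0, offset=0.0):
    """A linear function, included as the linear alternative."""
    return slope * h + offset
```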
[0273] In an optional example, the interconnection module forms a
data channel for continuous or discrete data between the primary
operation module and the plurality of secondary operation modules.
The interconnection module has any of the following structures: a
tree structure, a ring structure, a grid structure, a hierarchical
interconnection, and a bus structure.
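As a non-limiting illustration, stage-by-stage splicing over a tree structure may be sketched as follows. A binary tree in which each node concatenates the partial vectors of its two children is an assumption made for this sketch; the name tree_splice is hypothetical.

```python
def tree_splice(scalars):
    """Merge output scalars pairwise, one tree level per stage.

    Returns the spliced intermediate vector and the number of stages used.
    """
    level = [[s] for s in scalars]
    stages = 0
    while len(level) > 1:
        # Each parent node concatenates the vectors of up to two children.
        level = [sum(level[i:i + 2], []) for i in range(0, len(level), 2)]
        stages += 1
    return (level[0] if level else []), stages
```

With five secondary operation modules, for instance, the scalars reach the root after three stages and retain their original order.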
[0274] For those which are not shown or described in the present
disclosure, please refer to related descriptions of the above
examples.
[0275] An example of the present disclosure further provides a
computer storage medium on which a computer program is stored for
electronic data exchange. The computer program may cause a computer
to perform part or all of the steps of any information processing
method described in the foregoing method examples.
[0276] An example of the present disclosure further provides a
computer program product, where the computer program product
includes a non-transitory computer-readable storage medium on which
a computer program is stored. The computer program may cause a
computer to perform part or all of the steps of any information
processing method described in the foregoing method examples.
[0277] An example of the present disclosure also provides an
acceleration device which includes: a memory which stores
executable instructions, and a processor configured to execute the
executable instructions in the memory according to the
information processing method.
[0278] The processor may be a single processing unit, or may
include two or more processing units. In addition, the processor
may also include a general-purpose processor (CPU) or a graphics
processing unit (GPU), a field programmable gate array (FPGA), or
an application-specific integrated circuit (ASIC) to set up and
operate a neural network. The processor may also include an on-chip
memory for caching (including a memory in the processing
device).
[0279] In some examples, the present disclosure provides a chip
which includes the above neural network processor configured to
execute the information processing method.
[0280] In some examples, the present disclosure provides a chip
package structure which includes the above chip.
[0281] In some examples, the present disclosure provides a board
card which includes the above chip package structure.
[0282] In some examples, the present disclosure provides an
electronic device which includes the above board card.
[0283] The electronic device may include a data processing device,
a robot, a computer, a printer, a scanner, a tablet, a smart
terminal, a mobile phone, a traffic recorder, a navigator, a
sensor, a webcam, a server, a cloud-based server, a camera, a video
camera, a projector, a watch, a headphone, a mobile storage, a
wearable device, a vehicle, a household appliance, and/or medical
equipment.
[0284] The vehicle may include an airplane, a ship, and/or a car.
The household appliance may include a television, an air
conditioner, a microwave oven, a refrigerator, an electric rice
cooker, a humidifier, a washing machine, an electric lamp, a gas
cooker, and a range hood. The medical equipment may include a
nuclear magnetic resonance spectrometer, a B-ultrasonic scanner,
and/or an electrocardiograph.
[0285] It should be noted that, for the sake of conciseness, the
foregoing method examples are all described as a series of
action combinations, but those skilled in the art should know that
the present disclosure is not limited by the described order of
actions, since according to the present disclosure the steps may be
performed in a different order or simultaneously. Secondly, those
skilled in the art should also understand that the examples
described in the specification are all optional, and the actions
and modules involved are not necessarily required for this
disclosure.
[0286] In the examples above, the description of each example has
its own emphasis. For a part that is not described in detail in one
example, reference may be made to related descriptions in other
examples.
[0287] It should be understood that in the examples provided by the
present disclosure, the disclosed device may be implemented in
another manner. For instance, the examples above are merely
illustrative. For instance, the division of the units is only a
logical function division. In a certain implementation, there may
be another manner for division. For instance, a plurality of units
or components may be combined or may be integrated in another
system, or some features can be ignored or not performed. In
addition, the displayed or discussed mutual coupling or direct
coupling or communication connection may be implemented through
indirect coupling or communication connection of some interfaces,
devices or units, and may be electrical or other forms.
[0288] The units described as separate components may or may not be
physically separated. The components shown as units may or may not
be physical units. In other words, the components may be located in
one place, or may be distributed to a plurality of network units.
According to certain needs, some or all of the units can be
selected for realizing the purposes of the examples of the present
disclosure.
[0289] In addition, the functional units in each example of the
present disclosure may be integrated into one processing unit, or
each of the units may exist separately and physically, or two or
more units may be integrated into one unit. The integrated units
above may be implemented in the form of hardware or in the form of
software program modules.
[0290] When the integrated units are implemented in the form of a
software program module and sold or used as an independent product,
they may be stored in a computer-readable memory. Based on such
understanding, the essence of the technical solutions of the
present disclosure, or a part of the present disclosure that
contributes to the prior art, or all or part of the technical
solutions, can be wholly or partly embodied in the form of a
software product that is stored in a memory. The software product includes
several instructions to enable a computer device (which may be a
personal computer, a server, or a network device, etc.) to perform
all or part of the steps of the methods described in the examples
of the present disclosure. The foregoing memory includes: a USB
flash drive, a read-only memory (ROM), a random access memory
(RAM), a mobile hard disk, a magnetic disk, or an optical disc, and
other media that can store program codes.
[0291] A person of ordinary skill in the art may understand that
all or part of the steps of the foregoing examples of method may be
completed by a program instructing related hardware. The program
may be stored in a computer-readable memory, and the memory may
include a flash disk, a read-only memory (ROM), a random access
memory (RAM), a magnetic disk, an optical disk, or the like.
[0292] The examples of the present disclosure have been described
in detail above. Specific examples have been used in the
specification to explain the principles and implementation manners
of the present disclosure. The descriptions of the above examples
are only used to facilitate understanding of the methods and core
ideas of the present disclosure. Persons of ordinary skill in the
art may change the implementation and application scope according
to the ideas of the present disclosure. In summary, the content of
this specification should not be construed as a limitation on the
present disclosure.
* * * * *