U.S. patent application number 17/367512 was filed with the patent office on 2021-07-05 and published on 2022-01-13 as publication number 20220012575, for methods and apparatus for localized processing within multicore neural networks. This patent application is currently assigned to Femtosense, Inc. The applicant listed for this patent is Femtosense, Inc. The invention is credited to Sam Brian Fok, Alexander Smith Neckar, and Scott Henry Reid.
United States Patent Application 20220012575
Kind Code: A1
Fok; Sam Brian; et al.
Published: January 13, 2022

METHODS AND APPARATUS FOR LOCALIZED PROCESSING WITHIN MULTICORE NEURAL NETWORKS
Abstract
Methods and apparatus for localized processing within multicore
neural networks. Unlike existing solutions that rely on commodity
software and hardware to perform "brute force" large scale neural
network processing, the various techniques described herein map and
partition a neural network based on the hardware limitations of a
target platform. Specifically, the various implementations
described herein synergistically leverage localization, sparsity,
and distributed scheduling, to enable neural network processing
within embedded hardware applications. As described herein,
hardware-aware mapping/partitioning enhances neural network
performance by e.g., avoiding pin-limited memory accesses,
processing data in compressed formats/skipping unnecessary
operations, and decoupling scheduling between cores.
Inventors: Fok; Sam Brian (San Leandro, CA); Neckar; Alexander Smith (Redwood City, CA); Reid; Scott Henry (Palo Alto, CA)
Applicant: Femtosense, Inc., Palo Alto, CA, US
Assignee: Femtosense, Inc., Palo Alto, CA
Family ID: 1000005707539
Appl. No.: 17/367512
Filed: July 5, 2021
Related U.S. Patent Documents
Application Number: 63/050,090
Filing Date: Jul 9, 2020
Current U.S. Class: 1/1
Current CPC Class: G06N 3/063 20130101; G06N 3/0454 20130101; G06N 3/0481 20130101
International Class: G06N 3/063 20060101 G06N003/063; G06N 3/04 20060101 G06N003/04
Government Interests
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0003] This invention was made with Government support under
Agreement No. N00014-19-9-0003, awarded by ONR. The Government has
certain rights in the invention.
Claims
1. A neural network processing apparatus, comprising: a plurality
of cores; and one or more memories configured to store a first set
of global parameters and a second set of local parameters; wherein
each core comprises logic configured to: obtain a first input
vector; perform global neural network processing based on the first
input vector and the first set of global parameters; perform local
neural network processing based on the second set of local
parameters and a previous activation state vector of the core; and
generate a result vector for distribution to the plurality of
cores.
2. The neural network processing apparatus of claim 1, wherein the
first set of global parameters comprises a portion of a global
matrix associated with each core of the plurality of cores.
3. The neural network processing apparatus of claim 1, further
comprising logic configured to sparsify a dense vector to produce a
resulting sparse vector.
4. The neural network processing apparatus of claim 1, wherein:
each core further comprises logic configured to: obtain one or more
portions of a global activation vector from other cores of the
plurality of cores; create the global activation vector based on
the one or more portions of the global activation vector from the
other cores of the plurality of cores; and perform the global
neural network processing based on the global activation
vector.
5. The neural network processing apparatus of claim 1, wherein each
core further comprises logic configured to broadcast the result
vector to each other core of the plurality of cores.
6. The neural network processing apparatus of claim 1, wherein each
core further comprises logic configured to update the previous
activation state vector of the core based on the result vector.
7. The neural network processing apparatus of claim 1, wherein: the
one or more memories comprises a first memory and a second memory;
the first memory of the one or more memories is exclusively
associated with a first core of the plurality of cores configured
to store a first portion of the first set of global parameters and
a second portion of the second set of local parameters associated
with the first core; and the second memory of the one or more
memories is exclusively associated with a second core of the
plurality of cores different from the first core of the plurality
of cores and is configured to store a third portion of the first
set of global parameters different from the first portion and a
fourth portion of the second set of local parameters different from
the second portion associated with the second core.
8. A method of operating a core of a multicore neural network
architecture comprising: receiving a sparse activation vector;
retrieving a dense activation vector, a portion of a global weight
matrix, and a local matrix; calculating an updated dense activation
vector based on the sparse activation vector, the dense activation
vector, the portion of the global weight matrix, and the local
matrix; calculating a portion of an updated sparse activation
vector based on the updated dense activation vector; and
broadcasting the updated sparse activation vector to other cores of
the multicore neural network architecture.
9. The method of claim 8, wherein calculating the portion of the
updated sparse activation vector includes applying a rectified
linear activation function to the updated dense activation
vector.
10. The method of claim 8, wherein the dense activation vector, the
portion of the global weight matrix, and the local matrix are
retrieved from a memory associated with the core.
11. The method of claim 10, wherein the memory is not associated
with the other cores of the multicore neural network
architecture.
12. The method of claim 8, wherein calculating the portion of the
updated sparse activation vector comprises: concatenating a sparse
input vector and a sparse global activation vector to form a
concatenated sparse vector; multiplying the portion of the global
weight matrix with the concatenated sparse vector creating a first
intermediate data structure; multiplying the local matrix with the
dense activation vector creating a second intermediate data
structure; performing a first sigmoid function on a first sum of a
first section of the first intermediate data structure added to a
second section of the second intermediate data structure creating a
third intermediate data structure; performing a second sigmoid
function on a second sum of a third section of the first
intermediate data structure added to a fourth section of the second
intermediate data structure creating a fourth intermediate data
structure; performing a hyperbolic tangent function on a third sum
of a fifth section of the first intermediate data structure and a
first product of the third intermediate data structure and a sixth
section of the second intermediate data structure creating a fifth
intermediate data structure; and calculating the updated dense
activation vector based on a fourth sum of a second product of the
fourth intermediate data structure and the sparse global activation
vector and a third product of the fifth intermediate data structure
and a difference of one and the fourth intermediate data
structure.
13. The method of claim 8, further comprising receiving a sparse
input vector and the sparse activation vector from a broadcast
communication to a plurality of cores of the multicore neural
network architecture.
14. The method of claim 8, wherein broadcasting the updated sparse
activation vector to the other cores of the multicore neural
network architecture occurs asynchronous to broadcasts from the
other cores.
15. A non-transitory computer readable apparatus comprising a
storage medium having one or more computer programs stored thereon,
the one or more computer programs, when executed by a processing
apparatus, being configured to: obtain a first sparse vector;
perform global neural network processing based on the first sparse
vector and a first set of global parameters; perform local neural
network processing based on a second set of local parameters and a
dense vector that is specific to each core; and sparsify the dense
vector to generate a second sparse vector for broadcast to a
plurality of cores.
16. The non-transitory computer readable apparatus of claim 15,
wherein the first sparse vector comprises an input vector and the
first set of global parameters comprises a portion of a global
matrix associated with a first core.
17. The non-transitory computer readable apparatus of claim 15,
wherein sparsifying the dense vector comprises skipping elements or
adding null elements to the dense vector.
18. The non-transitory computer readable apparatus of claim 15,
wherein sparsifying the dense vector comprises applying a rectified
linear activation function to the dense vector.
19. The non-transitory computer readable apparatus of claim 15,
wherein each core further comprises logic configured to broadcast
the second sparse vector to each other core of the plurality of
cores.
20. The non-transitory computer readable apparatus of claim 15,
wherein the one or more computer programs are further configured to
broadcast the second sparse vector to the plurality of cores.
Description
PRIORITY
[0001] This application claims the benefit of priority to U.S.
Provisional Patent Application Ser. No. 63/050,090 filed Jul. 9,
2020 and entitled "METHODS AND APPARATUS FOR LOCALIZED PROCESSING
WITHIN MULTICORE NEURAL NETWORKS", which is incorporated herein by
reference in its entirety.
RELATED APPLICATIONS
[0002] This application is related to U.S. patent application Ser.
No. ______, filed and entitled "METHODS AND APPARATUS FOR MATRIX
AND VECTOR STORAGE AND OPERATIONS", and U.S. patent application
Ser. No. ______, filed and entitled "METHODS AND APPARATUS FOR
THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS", each of
which is incorporated herein by reference in its entirety.
COPYRIGHT
[0004] A portion of the disclosure of this patent document contains
material that is subject to copyright protection. The copyright
owner has no objection to the facsimile reproduction by anyone of
the patent document or the patent disclosure, as it appears in the
Patent and Trademark Office patent files or records, but otherwise
reserves all copyright rights whatsoever.
TECHNICAL FIELD
[0005] This disclosure relates generally to the field of neural
networking. More particularly, the present disclosure is directed
to hardware, software, and/or firmware implementations of neural
network processing.
DESCRIPTION OF RELATED TECHNOLOGY
[0006] Incipient research is directed to so-called "neural network"
computing. Unlike traditional computer architectures, neural
network processing emulates a network of connected nodes (aka
neurons) that loosely model the neuro-biological functionality
found in the human brain. While neural network computing is still
in its infancy, such technologies already have great promise for
e.g., compute rich, low power, and/or continuous processing
applications.
[0007] Existing neural networks are most commonly emulated within
general-purpose programming environments because commodity hardware
and software compilers are well understood and readily available.
Unfortunately, such implementations suffer from many inefficiencies
due to e.g., hardware limitations (e.g., physical connectivity),
compiler design, and/or instruction scheduling. Neural networks
would be a great fit for parallel processing and distributed
computing models; however, corresponding changes to hardware and
compilers are needed.
SUMMARY
[0008] The present disclosure addresses the foregoing needs by
disclosing, inter alia, methods, devices, systems, and computer
programs for neural network processing within multicore network
processors.
[0009] In one aspect, methods and apparatus for operating a
multicore neural network architecture are disclosed. One exemplary
apparatus embodiment includes: a plurality of cores; one or more
memories, where the one or more memories are configured to store a
first set of global parameters and a second set of local
parameters. In one exemplary embodiment, each core comprises logic
configured to: obtain a first sparse vector; perform global neural
network processing based on the first sparse vector and the first
set of global parameters; perform local neural network processing
based on the second set of local parameters and a dense vector that
is specific to each core; and sparsify the dense vector to generate
a second sparse vector for broadcast to the plurality of cores.
[0010] In one such embodiment, a non-transitory computer readable
apparatus comprising a storage medium having one or more computer
programs stored thereon is disclosed. In one exemplary embodiment,
the non-transitory computer readable apparatus includes one or more
computer programs that when executed by a processing apparatus, is
configured to: obtain a first sparse vector; perform global neural
network processing based on the first sparse vector and a first set
of global parameters; perform local neural network processing based
on a second set of local parameters and a dense vector that is
specific to each core; and sparsify the dense vector to generate a
second sparse vector for broadcast to a plurality of cores.
[0011] In one aspect, methods and apparatus for operating a core of
a multicore neural network architecture are disclosed. One
exemplary method embodiment includes: receiving an input vector, a
global activation vector, a local activation vector, a portion of a
global weight matrix, and a local matrix, by the core of the
multicore architecture; calculating an updated local activation
vector based on the input vector, the global activation vector, the
local activation vector, the portion of the global weight matrix,
and the local matrix; calculating an updated localized portion of
an updated global activation vector based on the updated local
activation vector; and broadcasting the updated localized portion
of an updated global activation vector to other cores of the
multicore architecture.
[0012] Other features and advantages of the present disclosure will
immediately be recognized by persons of ordinary skill in the art
with reference to the attached drawings and detailed description of
exemplary embodiments as given below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] FIG. 1 is a graphical representation of a multicore
processor architecture, commonly used within the processing
arts.
[0014] FIG. 2A is a graphical representation of one exemplary
multicore architecture, in accordance with the various principles
described herein.
[0015] FIG. 2B is a graphical representation of the extensible
nature of the multicore architecture, in accordance with the
various principles described herein.
[0016] FIG. 3 is a logical block diagram illustrating the data
traffic flow throughout the multicore architecture, in accordance
with the principles described herein.
[0017] FIG. 4 is a graphical representation of an existing gated
recurrent unit (GRU) neural network process that is commonly used
within the related arts.
[0018] FIG. 5 is a graphical representation of one exemplary
modified GRU neural network process that maps locality-based
processing based on hardware considerations, in accordance with the
various principles described herein.
[0019] FIG. 6 is a graphical representation of an exemplary neural
network's parameters to be partitioned into a multicore
architecture, useful to illustrate aspects of the present
disclosure.
[0020] FIG. 7 is a graphical representation of re-grouped and
stacked neural network parameters, useful to illustrate various
aspects of the present disclosure.
[0021] FIG. 8 is a graphical representation of partitioning neural
network parameters based on identified data dependencies, useful to
illustrate aspects of the present disclosure.
[0022] FIG. 9 is a graphical representation of neural network
parameter distribution within the target multicore architecture,
useful to illustrate aspects of the present disclosure.
[0023] FIG. 10 is a graphical representation of a partitioned
memory footprint that fits within the target multicore
architecture, useful to illustrate aspects of the present
disclosure.
[0024] FIG. 11 is a logical flow diagram of an exemplary method for
operating a multicore neural network architecture, in accordance
with the various principles described herein.
[0025] FIG. 12 is a logical flow diagram of an exemplary method for
computing a modified gated recurrent unit (GRU) within a core of a
multicore neural network architecture, in accordance with the
various principles described herein.
[0026] FIGS. 13 and 14 are segments of pseudocode for operating a
single core of a multicore neural network, in accordance with the
various principles described herein.
[0027] FIG. 15 is a logical flow diagram of an exemplary method for
optimizing machine models from standard machine learning
frameworks, in accordance with the various principles described
herein.
[0028] FIG. 16 is a graphical representation of an exemplary
hierarchy of layers of a machine learning model that have been
tagged with heterogeneous precision, useful to illustrate aspects of
the present disclosure.
[0029] FIG. 17 is a logical flow diagram of a method for
partitioning and placing code, useful in conjunction with various
embodiments described herein.
[0030] FIG. 18 is a logical flow diagram of a method for optimizing
machine models to operate within a multicore architecture, useful
in conjunction with various embodiments described herein.
[0031] FIG. 19 is a logical flow diagram of a method for generating
assembly code, useful in conjunction with various embodiments
described herein.
DETAILED DESCRIPTION
[0032] In the following detailed description, reference is made to
the accompanying drawings which form a part hereof wherein like
numerals designate like parts throughout, and in which is shown, by
way of illustration, embodiments that may be practiced. It is to be
understood that other embodiments may be utilized, and structural
or logical changes may be made without departing from the scope of
the present disclosure. Therefore, the following detailed
description is not to be taken in a limiting sense, and the scope
of embodiments is defined by the appended claims and their
equivalents.
[0033] Aspects of the disclosure are disclosed in the accompanying
description. Alternate embodiments of the present disclosure and
their equivalents may be devised without departing from the spirit
or scope of the present disclosure. It should be noted that any
discussion herein regarding "one embodiment", "an embodiment", "an
exemplary embodiment", and the like indicate that the embodiment
described may include a particular feature, structure, or
characteristic, and that such particular feature, structure, or
characteristic may not necessarily be included in every embodiment.
In addition, references to the foregoing do not necessarily
comprise a reference to the same embodiment. Finally, irrespective
of whether it is explicitly described, one of ordinary skill in the
art would readily appreciate that each of the particular features,
structures, or characteristics of the given embodiments may be
utilized in connection or combination with those of any other
embodiment discussed herein.
[0034] Various operations may be described as multiple discrete
actions or operations in turn, in a manner that is most helpful in
understanding the claimed subject matter. However, the order of
description should not be construed as to imply that these
operations are necessarily order dependent. In particular, these
operations may not be performed in the order of presentation.
Operations described may be performed in a different order than the
described embodiment. Various additional operations may be
performed and/or described operations may be omitted in additional
embodiments.
The Complexity of Software-Based Neural Networks
[0035] FIG. 1 is a graphical representation of a multicore
processor architecture 100, commonly used within the processing
arts. The multicore processor 102 may include one or more cores
112A, 112B . . . 112N. Each core may include logic (e.g.,
arithmetic logic units (ALUs), registers, etc.) arranged to perform
various control and data path operations. Examples of control and
data path operations may include without limitation: instruction
fetch/instruction decode (IF/ID), operation execution and
addressing, memory accesses, and/or data write back. A small amount
of frequently used instructions and data may be locally cached
"on-chip" for fast access; otherwise, "off-chip" storage provides
cost-effective storage of bulk data (104A, 104B . . . 104N).
[0036] During operation, the processor cores 112A, 112B . . . 112N
read and write computer instructions and/or data from the external
memories 104A, 104B . . . 104N via a shared bus interface 106. Each
computer instruction (also referred to as an "opcode") identifies
the operation to be sequentially performed based on one or more
operands (data, register locations, and/or memory addresses). By
linking together sequences of computer instructions, it is possible
to compute any computable sequence.
[0037] In "general-purpose" computing, the processor cores and
memories may be tasked with any arbitrary task. A shared bus
architecture and monolithic memory map flexibly allows every core
112A, 112B . . . 112N to access any memory location within the
external memories 104A, 104B . . . 104N. As a practical matter,
however, the shared bus interface 106 is physically pin-limited;
there is a fixed width data bus that services all processor-memory
connections one-at-a-time. Limited connectivity can significantly
affect performance where multiple cores try to access the memories
at the same time. Additionally, local cache sizes are limited;
reading and writing to large data structures may require multiple
"off-chip" transactions across the pin-limited bus. Finally,
"global" data structures cannot be accessed by more than one core
at a time (simultaneous access could result in data hazards and
race conditions).
[0038] Unlike general-purpose computing, so-called "neural network"
computing uses biologically-inspired algorithms that take their
inspiration from the human brain. Neural networks are characterized
by a multi-layered composition of high-dimensional linear and
non-linear functions. The intermediate function outputs between
layers are known as activations. Neural networks typically contain
a large number of parameters that are used for e.g., vector-matrix
operations. The parameters are tuned in a gradient descent training
process based on known input/output data pairings. After training,
the parameters are held constant during deployment as the neural
network processes novel input data to execute its trained task. For
example, FIG. 1 graphically depicts one exemplary neural network
computation that is performed as a vector-matrix multiplication
150. As shown therein, neural activations are modeled as a vector
of digital values (a) that are multiplied by a matrix of parameter
weights (B) for the neural network; the output (c) corresponds to
the output neural activations.
[0039] Unfortunately, naively allocating neural network processing
to the multicore processor architecture 100 is extremely
inefficient. Firstly, each of the cores 112A, 112B, . . . 112N must
access the complete set of neural network data structures. The
vector and matrix dimensions are a function of the number of nodes
(neurons) within the neural network, thus neural networks of any
significant size exceed data sizes that can be efficiently cached
on-chip. As a result, all of the cores 112A, 112B, . . . 112N
constantly move data across the pin-limited bus interface 106.
Additionally, each of the cores 112A, 112B, . . . 112N read and
write to the same data structures (a, B, c) and often block one
another.
[0040] As a related issue, "Big O" notation is used in the computer
arts to classify algorithms according to computational complexity
(run time and space requirements, O, as a function of input size, N).
Big O notation is widely used to describe the limiting behavior of a
function as its input size increases, e.g., processing complexity,
memory storage, bandwidth utilization, etc. For example, vector-matrix
multiplication has a computational complexity of O(N^2) for vector
size (N) because each element of the vector must be multiplied by a
corresponding element of each row and column of the matrix. Doubling
the vector size (N) quadruples the computational complexity (O(N^2)).
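As a concrete illustration (this sketch is not part of the patent text, and the array sizes are arbitrary), a naive dense vector-matrix multiply makes the quadratic scaling explicit: every output element touches every input element, so doubling N quadruples the number of multiply-accumulate operations.

```python
# Illustrative sketch only: naive dense vector-matrix multiply (c = a * B),
# showing why the work grows as O(N^2) with the vector size N.
import numpy as np

def dense_vecmat(a: np.ndarray, B: np.ndarray) -> np.ndarray:
    N = a.shape[0]
    c = np.zeros(B.shape[1])
    for j in range(B.shape[1]):      # every output element...
        for i in range(N):           # ...touches every input element
            c[j] += a[i] * B[i, j]   # N*N multiply-accumulates in total
    return c

a = np.random.rand(64)               # activation vector
B = np.random.rand(64, 64)           # parameter weight matrix
assert np.allclose(dense_vecmat(a, B), a @ B)
```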
[0041] Referring back to FIG. 1, existing neural networking
solutions rely on general-purpose vector-matrix operations. Such
solutions often rely on hardware accelerators to perform
"brute-force" element-by-element calculation. However, the data
structures that are used in neural network processing can be made
to be quite sparse (a high ratio of null values). Brute force
vector-matrix operations can be particularly inefficient for sparse
data structures because the vast majority of memory reads,
vector-matrix multiplications, and memory write-backs are
unnecessary (null valued). Furthermore, as neural networks continue
to grow in size and complexity, inefficient brute force solutions
will quadratically increase in complexity.
[0042] Substantial factors in neural network energy consumption may
include moving large amounts of data, and storing a large number of
parameters in leaky SRAM (static random access memory). Charging
and discharging wires to transfer data takes energy. Wire energy
costs scale with wire length (e.g., chip area) and are a significant
concern for chip design. As a related issue, neural networks are
parameter-rich, but on-chip SRAM memory is costly to implement.
On-chip SRAM is optimized for performance, not power consumption,
so SRAM cells may consume significant amounts of energy even when
idle, due to leakage. The combination of these factors can limit
neural network adoption; in one specific example, remote
applications are often power constrained.
Exemplary Multicore Architecture
[0043] The aforementioned complexities of neural network processing
have presented significant issues for embedded device
implementations. Notably, existing neural network implementations
are handled within software, without regard to the underlying
hardware platform limitations; unfortunately, physical connectivity
(e.g., pin limitations), computational complexity, and/or
scheduling overhead present significant obstacles for embedded
devices. More directly, improved solutions for handling neural
networks in embedded environments are needed; ideally, such
solutions should enable compute rich, low power, and/or continuous
processing applications.
[0044] To these ends, various principles described herein
synergistically leverage locality, sparsity, and distributed
scheduling, to enable neural network processing within embedded
hardware applications. Unlike existing solutions that rely on
commodity software and hardware to perform "brute force" large
scale neural network processing, the various techniques described
herein map and partition a neural network based on the hardware
limitations of a target platform. The exemplary hardware-aware
mapping/partitioning described herein enhances neural network
performance by e.g., avoiding pin-limited memory accesses,
processing data in compressed formats/skipping unnecessary
operations, and distributing task scheduling while decoupling
timing requirements between cores.
[0045] In a first aspect, hardware-aware mapping and partitioning
may be used to minimize data transfers and parameter storage
requirements. In one embodiment, neural networking parameters and
activations may be sparsified to fit the neural network within chip
constraints without performance degradation; in other embodiments,
the neural network may be sparsified and trained for acceptable
levels of performance degradation. As described in greater detail
herein, fewer parameters and activations may result in fewer memory
lookups and data transfers. By avoiding costly off-chip data
transfers and minimizing chip area, the chip can minimize both
static power and dynamic energy costs.
[0046] In a second aspect, hardware-aware mapping and partitioning
may be used to localize parameters to where they are used. Various
embodiments of the present disclosure enable a "compute-near-memory"
approach where computations are distributed across multiple cores to
strategically co-locate parameters with their associated logic.
Co-locating data and processing reduces data transfers across the
chip and requires only a small set of parameters for each core to
locally process and store.
[0047] Furthermore, the exemplary hardware-aware mapping and
partitioning techniques may exploit a high degree of parallelism to
complete tasks quickly and maximize the time spent in low-power
sleep states to mitigate leakage. In one embodiment, the multicore
architecture comprises a number of variable-length
Single-Instruction-Multiple-Data (SIMD) cores that can perform the
same operations on multiple data elements via parallel data paths
(e.g., a matrix-vector multiply or a pointwise non-linearity on a
vector). During operation, the data paths may each operate in
parallel so multiple instructions can execute simultaneously in a
core. Likewise, each core of the multicore processor may operate in
parallel, communicating with other cores only when necessary.
[0048] Other optimizations described herein may manage thread
scheduling during compile-time and program-time, rather than
run-time. In other words, rather than using a centralized scheduler
that is evaluated at "run-time", the neural network is compiled at
"compile-time" into threads; threads and their thread dependency
count are distributed to each core at program-time. The core can
run the thread at run-time without any scheduling conflict. Certain
implementations may also leverage instruction-level support for
sparse vector-matrix operations.
[0049] FIG. 2A is a graphical representation of one exemplary
multicore architecture 200, in accordance with the various
principles described herein. As shown, the architecture 200 does
not use an external memory to store the neural network data
structures nor any intermediate results. Instead, each core
includes its own processing hardware (212A, 212B, 212C, 212D),
local weights (214A, 214B, 214C, 214D), global weights (216A, 216B,
216C, 216D), working memory (218A, 218B, 218C, 218D), and
accumulator (220A, 220B, 220C, 220D). While the following
discussion is presented in the context of a core with its own
dedicated memories, the techniques described herein may be used in
shared memory systems and/or hybrids thereof. More generally,
dedicated core resources may enable improved core performance
whereas shared resources across cores may provide flexibility
and/or cross-core communication opportunities.
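One way to picture the per-core resources of FIG. 2A is the following sketch; the class name, fields, and shapes are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch of a core's dedicated resources (names/shapes assumed):
# each core holds its own local (neighborhood) weights, its slice of the
# global weights, a working memory for intermediate results, and an accumulator.
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Core:
    local_weights: np.ndarray                  # dense neighborhood block (D)
    global_weights: np.ndarray                 # this core's slice of global W
    working_memory: dict = field(default_factory=dict)
    accumulator: Optional[np.ndarray] = None

cores = [Core(local_weights=np.zeros((16, 16)),
              global_weights=np.zeros((48, 72))) for _ in range(4)]
```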
[0050] Unlike existing neural network processors which naively
distribute processing load (discussed above), the exemplary
multicore architecture decouples processing among the cores. In one
aspect of the present disclosure, neural network processing is
mathematically transformed (mapped) and spatially partitioned into
dense "neighborhood" processing and sparse "global" communications
processing (see e.g., Techniques for Targeting a Neural Network to
the Multicore Architecture). As described in greater detail
hereinafter, the mapping/partitioning may be based on the physical
processing hardware-memory connectivity; in other words, processing
hardware and memories transactions may be mapped/partitioned so
that they are not pin-limited. The mapping/partitioning preserves
the properties of the original global neural network at a fraction
of the memory accesses.
[0051] As shown in FIG. 2A, the local neighborhood weights are
stored in the local weight memories (214A, 214B, 214C, 214D) and
each core's subset (or "slice") of the global network weights are
stored in the global weight memories (216A, 216B, 216C, 216D).
During operation, applicable weights are retrieved from the
corresponding memory for computation; intermediate results may be
stored within a working memory (218A, 218B, 218C, 218D) and/or
accumulator (220A, 220B, 220C, 220D).
[0052] While the illustrated embodiment is shown in the context of
four (4) cores emulating a global neural network of nodes, the
multicore architecture described herein may be broadly extended to
any number of cores and/or any number of nodes (see e.g., FIG. 2B).
Additionally, the foregoing discussion presented a symmetric
distribution, however asymmetric distributions may be substituted
with equal success. Partitioning may be scaled to individual core's
capabilities and/or application requirements. For example,
asymmetric systems may enable high performance cores (more logic,
memory, and/or faster clock rates) and low power cores (less logic,
less memory, and/or power efficient clocking). In such
implementations, matrix operations may be sized to complete within
operational constraints, given a core's capabilities. Furthermore,
any consolidation, division, distribution, agglomeration, and/or
combination of processing hardware and/or memory may be substituted
by artisans of ordinary skill in the related arts, given the
contents of the present disclosure.
[0053] FIG. 3 is a logical block diagram illustrating the data
traffic flow 300 throughout the multicore architecture, in
accordance with the various principles described herein. Each
neighborhood (302A, 302B, 302C, 302D) is characterized by a locally
dense neural network. Neighborhoods are connected via a global
interconnect matrix (304A, 304B, 304C, 304D) to the other
neighborhoods; the output of the neighborhoods can be further
sparsified prior to global distribution via interconnect logic
(306A, 306B, 306C, 306D).
[0054] Various aspects described herein synergistically leverage
globally sparse, locally dense connectivity to attain a variety of
benefits heretofore unrealized. For instance, existing neural
network techniques naively store, and brute force process, every
matrix element in memory (whether connected or not). Naive
(hardware agnostic) storage requires O(N^2) memory for an N×N
matrix, which is considerably more than necessary for a
sparse matrix with very few non-null elements. Similarly, brute
force calculation quadratically increases in complexity as a
function of network size (regardless of the matrix's sparsity). In
contrast, one exemplary embodiment compresses sparse neural network
data structures based on actual, non-null, connectivity (rather
than all possible connections). This greatly reduces storage
requirements as well as computational complexity. In one such
variant, the compression and reduction in complexity is sized to
fit within the memory footprint and processing capabilities of a
core.
[0055] As a further optimization, there are overhead costs
associated with compression, and different techniques have
different costs and benefits. Since vectors and matrices are used
differently in neural network processing, these data structures may
be represented differently to further enhance performance. For
example, as discussed in U.S. patent application Ser. No. ______,
filed ______ and entitled "METHODS AND APPARATUS FOR MATRIX AND
VECTOR STORAGE AND OPERATIONS", previously incorporated herein by
reference in its entirety, exemplary embodiments compress sparse
neural network data structures based on actual, non-null,
connectivity (rather than all possible connections). This greatly
reduces storage requirements as well as computational complexity.
In some variants, the compression and reduction in complexity is
sized to fit within the memory footprint and processing
capabilities of a core. The exemplary compression schemes represent
sparse matrices with links to compressed column data structures,
where each compressed column data structure only stores non-null
entries to optimize column-based lookups of non-null entries.
Similarly, sparse vector addressing skips nulled entries to
optimize for vector-specific non-null multiply-accumulate
operations.
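The exact storage formats are detailed in the incorporated application; the sketch below is only a generic approximation of the idea, storing each column's non-null entries and skipping nulls during the multiply-accumulate.

```python
# Generic sketch (an approximation, not the incorporated application's format):
# compressed columns keep only non-null entries; the sparse vector keeps only
# non-null (index, value) pairs, so null work is skipped entirely.
import numpy as np

def compress_columns(M):
    cols = []
    for j in range(M.shape[1]):
        rows = np.nonzero(M[:, j])[0]
        cols.append((rows, M[rows, j]))        # non-null entries of column j
    return cols

def sparse_vecmat(x, cols, n_cols):
    """y[j] = sum_i x[i] * M[i, j], touching only non-null entries."""
    x_lookup = {int(i): float(x[i]) for i in np.nonzero(x)[0]}
    y = np.zeros(n_cols)
    for j, (rows, vals) in enumerate(cols):
        for r, v in zip(rows, vals):
            if int(r) in x_lookup:             # skip nulled vector entries
                y[j] += x_lookup[int(r)] * v
    return y

M = np.random.rand(8, 8) * (np.random.rand(8, 8) < 0.1)   # ~10% non-null
x = np.random.rand(8) * (np.random.rand(8) < 0.3)
assert np.allclose(sparse_vecmat(x, compress_columns(M), 8), x @ M)
```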
[0056] Additionally, existing neural network processing relies on a
centralized task scheduler that consumes significant processing and
transactional overhead to coordinate between cores. In contrast,
the sparse global communications between cores of the exemplary
multicore architecture decouples neighborhood processing and
enables the multicore architecture to asynchronously operate the
cores in parallel. Consequently, optimized variants may distribute
task coordination between cores and implement asynchronous
handshaking protocols between cores. For example, as discussed in
U.S. patent application Ser. No. ______, filed ______ and entitled
"METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE
NEURAL NETWORKS", previously incorporated herein by reference in
its entirety, thread-level parallelism and asynchronous handshaking
are leveraged to decouple core-to-core dependencies. The principles
described therein enable threads to run independently of one
another, without any centralized scheduling and/or resource locking
(e.g., semaphore signaling, critical path execution, etc.)
Decoupling thread dependencies allows cores to execute threads
asynchronously. In one such implementation, the multicore
architecture includes a set of distributed cores that run in
parallel. The cores communicate with each other via an
interconnecting network of router nodes. Each core processes its
threads asynchronously with respect to the other cores. Most
threads correspond to the dense neighborhood, and the core can
process these threads independently of the other cores. Global
communication is sparse (infrequent) and is handled via an
asynchronous handshake protocol.
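The scheduling scheme itself is described in the incorporated application; purely as a hedged sketch of the idea stated above (dependency counts distributed at program-time, decremented by asynchronous broadcasts, no centralized scheduler), one might picture something like the following.

```python
# Hypothetical sketch only: per-core thread dependency counts. A thread becomes
# runnable once the broadcasts it depends on have all arrived; no centralized
# scheduler or lock is consulted.
class CoreScheduler:
    def __init__(self, thread_deps):
        # thread_deps: {thread_id: number of peer broadcasts required first}
        self.remaining = dict(thread_deps)
        self.ready = [t for t, n in thread_deps.items() if n == 0]

    def on_broadcast(self, thread_id):
        """Asynchronous handshake: a peer's broadcast satisfies one dependency."""
        self.remaining[thread_id] -= 1
        if self.remaining[thread_id] == 0:
            self.ready.append(thread_id)

    def run_ready(self, run_fn):
        while self.ready:
            run_fn(self.ready.pop())

sched = CoreScheduler({"local_neighborhood": 0, "global_mix": 3})
sched.run_ready(print)                    # local work runs immediately
for _ in range(3):
    sched.on_broadcast("global_mix")      # sparse global traffic arrives
sched.run_ready(print)                    # now the global step runs
```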
Techniques for Targeting a Neural Network to the Multicore
Architecture
[0057] In a first aspect of the present disclosure, a global neural
network is mapped into a set of sparsely interconnected, dense
neighborhood neural networks that are partitioned based on hardware
platform constraints. In one exemplary embodiment, the
transformation may be performed on a modified gated recurrent unit
(GRU). Alternative implementations may perform the transformation
on modified Long Short-Term Memory (LSTM) or any other
"remember-forget" recurrent neural network (RNN) logic. More
generally, any logic or component that retains/removes information
between nodes of the neural network may be modified to transform a
first domain (first vector space) to a second domain (second vector
space).
[0058] As used herein, the terms "transform", "transformation",
"map", "mapping", and/or other linguistic derivations thereof refer
to mathematical or algorithmic functions that relate a function
from a first domain (vector space) to a second domain (vector
space). Transforms may be linear or non-linear; linearity is
present where the mathematical functions of addition and scaling
are preserved (i.e., the result of multiple operators considered
together is the same as the sum of the operations considered
individually). Similarly, transformations may be lossless
(reversible) or lossy (irreversible); for example, lossy
transformations may e.g., reduce unnecessary precision, decimate
values to add sparsity, etc. More generally, while illustrative
examples of linear matrix transformations are described below, any
algorithmic transformation may be substituted with equal success by
artisans of ordinary skill in the related arts.
[0059] As used herein, the terms "partition", "partitioning",
"place", "placing", and/or other linguistic derivations thereof
refer to the allocation and assignment of hardware to perform
algorithms or logic. For example, the dataflow may be partitioned
into a multicore architecture by assigning specific functions
(neighborhoods) to a specific core. Partitioning may be implemented
within software (e.g., non-transitory computer readable
instructions executed by processing logic), within hardware (e.g.,
logic gates and/or sequential logic), or some combination thereof
(e.g., firmware, etc.).
[0060] As a brief aside, so-called "backpropagation" refers to
neural network processing techniques that use error information in
supervised learning. Recurrent neural networks (RNNs) are an
example of one type of neural network processing that benefits from
backpropagation techniques. During operation, data propagates
"forward" through the nodes of the network (the RNN is a temporal
directed graph); error information (gradient information) is used
to improve the network's weighting and is propagated "backward".
[0061] The temporal nature of recurrent neural networks (RNNs)
allows the RNN to exhibit dynamic behavior over time. As a
practical matter, temporally recent gradient information has a
greater influence on behavior; however, over time, the gradient
diminishes in importance (also referred to as a "vanishing
gradient.") Various techniques are used in RNNs to optimize the
amount of information that is "remembered" or "forgotten" in the
network.
[0062] Gated recurrent units (GRUs) are commonly used in recurrent
neural networks (RNNs) to retain/remove gradient information. FIG.
4 is a logical representation of an existing GRU process 400 that
is commonly used within the related arts. During operation, the
input vector (x) is modified based on previous activation vectors
(h). The exemplary GRU process 400 uses a hyperbolic tangent (tanh)
function to positively or negatively reinforce network state
information (reinforcement may range from +1 to -1), and a sigmoid
to "remember" (+1) or "forget" (0) network state information.
[0063] As shown therein, the input vector at time t (x_t) is
multiplied by a first set of input weights (W_ir) and the previous
activation vector (h_{t-1}) is multiplied by a first set of network
weights (W_hr). The first result is summed and scaled according to a
first sigmoid non-linearity (σ). This step is described by the
equation:

r_t = σ(W_ir x_t + W_hr h_{t-1})    (EQN 1)
[0064] Additionally, the input vector and previous activation
vector are also multiplied by a second set of input weights (W_iz)
and a second set of network weights (W_hz). The second result is
summed and scaled according to a second sigmoid non-linearity (σ).
This step is described by the equation:

z_t = σ(W_iz x_t + W_hz h_{t-1})    (EQN 2)
[0065] The input vector is multiplied by a third set of input
weights (W_in) and the result of EQN 1 (r_t) is multiplied by the
previous activation vector state and a third set of network weights
(W_hn). The result is summed and scaled according to a hyperbolic
tangent non-linearity (tanh). This step is described by equation 3
(or 3'):

n_t = tanh(W_in x_t + r_t * (W_hn h_{t-1}))    (EQN 3)

n_t = tanh(W_in x_t + W_hn (r_t * h_{t-1}))    (EQN 3')
[0066] The results of the foregoing processes are further mixed via
element-wise mixers. Each element-wise mixer takes a series of
inputs (x_0, x_1) and mixes the outputs according to the select (s)
to generate an output (y), as described by the following equation:

y(i) = s(i) * x_0(i) + (1 - s(i)) * x_1(i)    (EQN 4)
[0067] The resulting activation vector (h_t) of the GRU process 400
is given by the following equation:

h_t = (1 - z_t) * n_t + z_t * h_{t-1}    (EQN 5)
[0068] Notably, all of the GRU's parameters are in global neural
network matrices and the parameters are accessed at every timestep.
In other words, the GRU process quadratically scales as a function
of the neural network's size (O(N^2)). Furthermore, the
aforementioned GRU process does not account for hardware platform
limitations. Thus, existing GRU implementations are poorly suited
for embedded devices that are limited to small memory footprints,
reduced processing capabilities, and/or limited power.
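For reference, the standard GRU step of EQNS 1-5 can be written compactly as follows. This is a NumPy sketch; the dimensions and initialization are assumptions for illustration, and every weight matrix is a full, global N×N matrix, which is what drives the O(N^2) cost noted above.

```python
# Sketch of the standard GRU step (EQNS 1-5); shapes are illustrative.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_ir, W_hr, W_iz, W_hz, W_in, W_hn):
    r_t = sigmoid(W_ir @ x_t + W_hr @ h_prev)            # EQN 1
    z_t = sigmoid(W_iz @ x_t + W_hz @ h_prev)            # EQN 2
    n_t = np.tanh(W_in @ x_t + r_t * (W_hn @ h_prev))    # EQN 3
    return (1.0 - z_t) * n_t + z_t * h_prev              # EQNS 4-5 (mixer)

N = 64                                                    # assumed network size
rng = np.random.default_rng(0)
weights = [rng.standard_normal((N, N)) * 0.1 for _ in range(6)]
h_t = gru_step(rng.standard_normal(N), np.zeros(N), *weights)
```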
[0069] Referring now to FIG. 5, one exemplary dataflow 500 that
enables locality-based processing within a specific hardware
platform is shown. The exemplary dataflow 500 may be used to map an
"original" global neural network (hardware platform agnostic) into
a functionally identical set of sparsely interconnected, dense
neighborhood neural networks. The exemplary dataflow 500 includes:
a modified gated recurrent unit (GRU), block diagonal matrices D
502 that correspond to each densely connected neighborhood, and
global matrices W 504 that correspond to sparse global
connectivity. The dense, local matrices D 502 take a neighborhood
activation vector (h_{t-1}) as input, whereas the sparse, global
matrices W 504 take both a sparsified input vector (x'_t) and a
sparsified global activation vector (h'_{t-1}). In the
illustrated embodiment, a rectified linear unit (ReLU) 506
sparsifies the neighborhood activation vector (h_{t-1}) to
produce the next sparse global activation vectors (h'_t).
[0070] Conceptually, the aforementioned transformation (map)
divides the neural network processing into portions that are
amenable for distribution among multiple cores. However, the
modified GRU still "touches" different portions of the neural
network. Thus, the exemplary embodiment further partitions the
mapped neural network to ensure that each core has local access to its
slice of neural network parameter weights.
[0071] As a preliminary step, FIG. 6 depicts a graphical
representation of an exemplary neural network's unpartitioned
(naively mapped) weight matrices. For illustrative purposes, the
global matrices for tanh (positive/negative reinforcement) and
sigmoid (remember/forget) functions (W_hr, W_hz, W_hn, W_ir, W_iz,
W_in) emulate a neural network of sixty-four (64) nodes (an assumed
sparsity of ~10% for activation vectors (x, h) and parameters (W) is
shown). Also included are block diagonal matrices (D_hr, D_hz, D_hn)
that correspond to the naive mapping to the target multicore
architecture of eight (8) cores. In the naive mapping, each core
handles 1/8th of the processing burden.
[0072] The matrix operations of FIG. 6 can be re-grouped and
described as a series of stacked global operations. FIG. 7
illustrates the re-grouped and stacked global operations,
mathematically described as follows:
W = [ W_ir  W_hr
      W_iz  W_hz
      W_in  W_hn ]    (EQN 6)

D = [ D_hr
      D_hz
      D_hn ]    (EQN 7)
[0073] Once re-grouped and stacked, the naive mapping can be
mathematically simplified to the following global equations:
i' = [x'_t h'_{t-1}]    (EQN 8)

[a b c] = W i'    (EQN 9)

[d e f] = D h_{t-1}    (EQN 10)

r_t = σ(a + d)    (EQN 11)

z_t = σ(b + e)    (EQN 12)

n_t = tanh(c + r_t * f)    (EQN 13)

h_t = (1 - z_t) * n_t + z_t * h_{t-1}    (EQN 14)

h'_t = ReLU(h_t)    (EQN 15)
[0074] Mathematically, EQNS. 9-15 may also be restated as
follows:
r_t = σ(W_ir x'_t + W_hr h'_{t-1} + D_hr h_{t-1})    (EQN 16)

z_t = σ(W_iz x'_t + W_hz h'_{t-1} + D_hz h_{t-1})    (EQN 17)

n_t = tanh(W_in x'_t + W_hn h'_{t-1} + r_t * (D_hn h_{t-1}))    (EQN 18)

h_t = (1 - z_t) * n_t + z_t * h_{t-1}    (EQN 19)

h'_t = ReLU(h_t)    (EQN 20)
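Putting EQNS 8-20 together, a single core's update can be sketched as below. This is an illustrative NumPy sketch under assumed shapes (nodes per core, input width, and the row ordering of the stacked W slice), not the patent's pseudocode (which appears in FIGS. 13-14).

```python
# Illustrative per-core step for the modified GRU (EQNS 8-15 / 16-20).
# Assumptions: the core's W slice stacks its r, z, n rows (3*NPC rows total)
# and its D block is (3*NPC x NPC); sizes below are arbitrary examples.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def core_step(x_sparse, h_global_sparse, h_local, W_slice, D_block):
    i_vec = np.concatenate([x_sparse, h_global_sparse])   # EQN 8: i' = [x'_t h'_t-1]
    a, b, c = np.split(W_slice @ i_vec, 3)                 # EQN 9: [a b c] = W i'
    d, e, f = np.split(D_block @ h_local, 3)               # EQN 10: [d e f] = D h_t-1
    r = sigmoid(a + d)                                     # EQN 11
    z = sigmoid(b + e)                                     # EQN 12
    n = np.tanh(c + r * f)                                 # EQN 13
    h_new = (1.0 - z) * n + z * h_local                    # EQN 14
    return h_new, np.maximum(h_new, 0.0)                   # EQN 15: ReLU -> h'_t slice

NPC, NIN, NGLOB = 16, 8, 64                                # assumed sizes
rng = np.random.default_rng(1)
h_new, h_broadcast = core_step(rng.random(NIN), rng.random(NGLOB), np.zeros(NPC),
                               rng.standard_normal((3 * NPC, NIN + NGLOB)) * 0.1,
                               rng.standard_normal((3 * NPC, NPC)) * 0.1)
```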
[0075] FIG. 7 shows that any neural network that is naively mapped
to a target multicore architecture may be fully expanded to
identify the data dependencies of a neural network, e.g., W_hr
702 is always multiplied by the global activation vector
h'_{t-1}. Additionally, each block (or submatrix) of the block
diagonal D matrix only touches a small portion of the global
activations; for example, one processor core's blocks of the block
diagonal matrices (D_hr, D_hz, D_hn) only touch its corresponding
portions of the global activations resulting from the global weight
operations (W_hr, W_hz, W_hn, W_ir, W_iz, W_in), identified in bands
704A, 704B, 704C.
[0076] Referring now to FIG. 8, the neural network parameters may
be partitioned based on the identified data dependencies. The
neural network parameters have been partitioned and re-grouped such
that data dependencies for each neighborhood are lumped together.
Specifically, each core's blocks of the block diagonal matrices
(D_hr, D_hz, D_hn) are grouped together, and the core's
corresponding portions of the global neural network parameters
(W_hr, W_hz, W_hn, W_ir, W_iz, W_in) are grouped together. FIG. 9
provides a graphical illustration of how the neural network
parameters of FIG. 8 may be distributed within the target multicore
architecture. Each cluster core locally stores its blocks of the
block diagonal matrix; however, the global neural network parameters
share common activation vectors (h'_t) and input vectors (x'_t) that
can be stored in aggregate.
[0077] In contrast to FIG. 9, FIG. 10 illustrates a partitioned
memory footprint that fits within the local memory for each
processor core. Notably, the sparse activation vectors (h'_t)
and input vectors (x'_t) are the only remaining core-to-core
data dependency for the target multicore architecture. As a result,
all the core-to-core data dependencies are satisfied by the
communication of sparse data alone: by broadcasting the input
vectors (x'_t) to the cores, and each core broadcasting its
portion of the sparse activation vector (h'_t). As shown
therein, each local memory contains both the core's blocks of the
block diagonal matrices (D_hr, D_hz, D_hn) and global
neural network parameters (W_ir, W_iz, W_in and
W_hr, W_hz, W_hn) which take their respective inputs
(h'_t, x'_t). In the illustrated embodiment, the multicore
processor does not need an external memory and wholly avoids the
aforementioned pin-limitations of fixed width data busses for
external memories. The foregoing techniques illustrate one
exemplary mapping/partitioning based on on-chip connectivity of the
exemplary multicore architecture, however virtually any
mapping/partitioning technique may be substituted with equal
success by artisans of ordinary skill in the related arts.
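A hedged sketch of the partitioning itself follows (the row grouping and sizes are assumptions chosen only to illustrate FIGS. 8-10): each core receives its contiguous row-slice of the stacked global matrix W and its own block of the block-diagonal D, so only the sparse x'_t and h'_t ever cross core boundaries.

```python
# Illustrative partitioning sketch (assumed layout): W's rows are arranged
# core-major, so each core's slice is contiguous; D is supplied per core.
import numpy as np

def partition(W, D_blocks, num_cores):
    rows_per_core = W.shape[0] // num_cores
    return [{"W_slice": W[c * rows_per_core:(c + 1) * rows_per_core, :],
             "D_block": D_blocks[c]} for c in range(num_cores)]

NPC, CORES, NIN = 16, 4, 8                   # assumed sizes
rng = np.random.default_rng(2)
W = rng.standard_normal((3 * NPC * CORES, NIN + NPC * CORES)) * 0.1
D_blocks = [rng.standard_normal((3 * NPC, NPC)) * 0.1 for _ in range(CORES)]
per_core = partition(W, D_blocks, CORES)
assert per_core[0]["W_slice"].shape == (3 * NPC, NIN + NPC * CORES)
```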
[0078] Notably, the memory footprint and processing complexity for
each neighborhood is a fraction of the equivalent global neural
network. Consequently, the mapping/partitioning principles
described above may be further extended to segment the global
neural network to accommodate the capabilities of any hardware
platform. More generally, while the illustrated embodiment is shown
in the context of four (4) cores emulating a global neural network
of 128 nodes, the multicore architecture described herein may be
broadly extended to any number of cores and/or any number of nodes.
Additionally, the foregoing discussion presented a symmetric
distribution, however asymmetric distributions may be substituted
with equal success. Partitioning may be scaled to individual core's
capabilities and/or application requirements. For example,
asymmetric systems may enable high performance cores (more logic,
memory, and/or faster clock rates) and low power cores (less logic,
less memory, and/or power efficient clocking). In such
implementations, matrix operations may be sized to complete within
operational constraints, given a core's capabilities. Furthermore,
any consolidation, division, distribution, agglomeration, and/or
combination of processing hardware and/or memory may be substituted
by artisans of ordinary skill in the related arts, given the
contents of the present disclosure.
Methods
[0079] Referring to method 1100 of FIG. 11, a logical flow diagram
of an exemplary method 1100 for operating a multicore neural
network architecture is shown. In one embodiment, the multicore
neural network architecture includes locally dense neural networks
that are connected via sparse global interconnects.
[0080] In one embodiment, a set of global parameters and a set of
local parameters are stored in memories associated with each core.
In one exemplary embodiment, the global parameters define the
global interconnection and the local parameters define local neural
network processing (e.g., global/local weights W_ir, W_in
and W_hr, W_hz, W_hn and block-diagonal matrices
D_hr, D_hz, D_hn). In some variants, the global
parameters correspond to interconnections between nodes of the
target multicore architecture based on spatial organization within
a neighborhood. In some variants, the local parameters are mapped
to neighborhoods of the target multicore architecture based on
similarity of data dependency. While the exemplary embodiments are
presented in the context of hardware-aware global/local mapping and
partitioning, the operations described herein may be used with
naive mappings (hardware agnostic) to the target multicore
architecture.
[0081] Different sets of global parameters and local parameters are
stored in each core. Together the set of global parameters in all
cores may be logically equivalent to a single set of global
parameters but are instead arranged for local processing which
avoids pin-limited, energetically expensive memory accesses from a
large, shared memory. In one embodiment, the parameters are weight
values for matrix-vector multiplications which may be used in
conjunction with activation functions.
[0082] In one exemplary embodiment, the activation functions are
hyperbolic tangent tanh (positive/negative reinforcement) and
sigmoid (remember/forget) functions. Other examples of activation
functions include, without limitation: identity, binary step,
logistic (sigmoid or soft step variants), hyperbolic tangent and
its variants, rectified linear unit (ReLU), Gaussian Error Linear
Unit (GELU), Noisy ReLU, Leaky ReLU, Parametric ReLU, Exponential
Linear Unit (ELU), Softmax, and/or any other activation function
used in the neural processing arts.
[0083] As used herein, memory associated with a core may include
memory resident on the core itself (e.g., registers, accumulators,
etc. as shown in FIG. 2) and/or memory that is connected directly
to the core (not via a shared bus). In an exemplary embodiment,
this memory may include dedicated memory that is configured to
store local neighborhood weights and each core's subset of the
global network weights.
[0084] The terms "sparse," "sparsity," "sparsifying," and "adding
sparsity" refers to a dimensional distribution that skips elements
of and/or adds null elements to a set. Skipping or adding null
elements to a data structure may be achieved with any suitable
activation function (e.g., a rectifier linear unit (ReLU). More
generally, any activation function that inserts (or can be used to
insert) null elements may be substituted to the same end. While the
present disclosure is primarily directed to sparsity in spatial
dimensions, artisans of ordinary skill in the related arts will
readily appreciate that other schemes for adding sparsity (e.g.,
spatial, temporal, frequency, and other hybrids/variants thereof)
may be substituted with equivalent success. A variety of other data
structures may be used for representing sparse data structures, the
aforementioned vectors and matrices being purely illustrative.
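As a minimal sketch of the sparsification just described (the (index, value) representation is an assumption; any format that skips nulls would do), a ReLU followed by dropping the resulting null elements looks like:

```python
# Minimal sketch: sparsify a dense vector by applying ReLU (which inserts
# null elements) and then keeping only the non-null (index, value) pairs.
import numpy as np

def sparsify(dense):
    rectified = np.maximum(dense, 0.0)     # ReLU nulls the negative elements
    idx = np.nonzero(rectified)[0]         # skip the null elements
    return idx, rectified[idx]

h_t = np.array([0.7, -0.2, 0.0, 1.3, -0.9])
indices, values = sparsify(h_t)            # indices [0, 3], values [0.7, 1.3]
```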
[0085] Generally, a combination of matrices can be used to emulate
a neural network of nodes within a number of cores (C); where the
dense local matrices are of dimension NPC×NPC (nodes per core
(NPC)) and the sparse global interconnects are of dimension
NPC×(C×NPC). Notably, the foregoing discussion is
presented in the context of a two (2) tiered hierarchy
(neighborhood, global); however, the techniques described herein may
be extrapolated to any higher order degree with equal success. For
example, a device (or multitude of devices) may support a
four-tiered topology comprising: a "neighborhood" that is part of
a "city" (with neighborhoods per city (NePCi)), which itself is part
of a "state" (cities per state (CiPSt)), within a "global"
configuration. Such a configuration would include neighborhoods of
NPC×NPC, city interconnect matrices of
NPC×(NePCi×NPC), state interconnects of
(NePCi×NPC)×(CiPSt×NePCi×NPC), etc. In
other words, local unit outputs are sparse and broadcast to all
other local units at each level of the hierarchy. Functionally, the
hierarchy of memory-plus-processing enables each level to provide
more connections with increasing sparsity to keep communication
costs from ballooning. More directly, each additional layer of
hierarchy enables a broader set of connectivity and opportunities
for partitioning according to the available hardware platform
considerations.
[0086] While illustrative embodiments of the present disclosure are
described in the context of symmetric operation (e.g., each core of
the multicore architecture is assigned the same number of nodes),
other embodiments may asymmetrically assign nodes. For example,
some devices may have a performance core which can support a
greater number of logical nodes, and a power saving core that can
support a fewer number of nodes at greatly reduced power
consumption. Asymmetric node operation may result in different
parameterizations; for example, four cores respectively supporting
N₁, N₂, N₃, N₄ node networks would have
N₁×N₁, N₂×N₂, N₃×N₃, and N₄×N₄ dense local
matrices; global interconnect matrices would be sized accordingly,
e.g., N₁×(N₁+N₂+N₃+N₄), N₂×(N₁+N₂+N₃+N₄), etc.
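By way of illustration only, the following Python sketch computes the matrix shapes implied by such an asymmetric node assignment; the helper name and the example node counts are assumptions for illustration, not part of the disclosed apparatus.

def asymmetric_matrix_shapes(nodes_per_core):
    # For cores supporting N_1..N_k nodes, core i holds an N_i x N_i dense
    # local matrix and an N_i x (N_1 + ... + N_k) global interconnect matrix.
    total_nodes = sum(nodes_per_core)
    return [{"local_dense": (n_i, n_i),
             "global_interconnect": (n_i, total_nodes)}
            for n_i in nodes_per_core]

# Example: four cores respectively supporting 64, 64, 32, and 16 nodes.
for core_id, shapes in enumerate(asymmetric_matrix_shapes([64, 64, 32, 16])):
    print(core_id, shapes)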
[0087] Furthermore, the principles described herein are not limited
to embedded devices or even self-contained devices. The various
concepts described herein may be extended to any neural networking
application that benefits from localization of processing. As but
one such example, several devices may be networked (via wired or
wireless connectivity) to enable neural network sizes that greatly
exceed the capabilities of any of the individual devices. In such a
multi-device configuration, each device may have their own
localized "neighborhood" and communicate with the global network of
devices via sparse network communications. For example, a network
of neighborhood devices may be in coordination with a city device,
the city device may be part of a larger state network, etc.
[0088] While the following steps are described as occurring on a
core of a multicore architecture, a plurality of cores may perform
the steps of method 1100 in parallel on their different sets of
global and local parameters.
[0089] At step 1102 of the method 1100, a core of a multicore
architecture may obtain an input vector. The input vector may be a
sparse input vector (e.g., input vector x'.sub.t). A core may
receive the sparse vector from a broadcast to all cores. Exemplary
embodiments are configured such that input vector x'.sub.t (along
with shared activation vectors (h'.sub.t)) may be the only
core-to-core dependencies in the multicore architecture. More
generally however, artisans of ordinary skill in the related arts,
given the contents of the present disclosure will readily
appreciate that less optimal transformations may allow (or require)
cores to communicate and/or share other vectors. Such
implementations may be preferable where hardware agnostic fitting
is infeasible and/or unnecessary. While overall device performance
may suffer, such performance reductions may be preferred in view of
other holistic system constraints (e.g., convenience, breadth of
deployment, versatility, code/network re-use, etc.)
[0090] While the foregoing discussion is presented in the context
of a sparse vector, the concepts described herein may be broadly
extended to any data structure, whether fixed or variable, sparse
or dense. As a brief aside, the sparsity/density of a data
structure may be calculated by dividing the number of
non-null/non-empty elements by the total number of elements. For
example, a sparse vector or matrix may have a sparsity of 0.1 (only
10% of the values are non-null) whereas a dense vector or matrix
may have a density of 0.9 (90% of the values are non-null).
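For illustration only, a minimal Python sketch of this calculation (treating null elements as zeros, which is an assumption of the example):

import numpy as np

def nonnull_ratio(x):
    # Ratio of non-null (here, non-zero) elements to the total number of
    # elements; a low ratio (e.g., 0.1) characterizes a sparse structure and
    # a high ratio (e.g., 0.9) characterizes a dense structure.
    return np.count_nonzero(x) / x.size

sparse_vec = np.array([0, 0, 3, 0, 0, 0, 0, 0, 0, 7], dtype=float)
print(nonnull_ratio(sparse_vec))  # 0.2 -> only 20% of the values are non-null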
[0091] Conceptually, sparsity is a representation of an actual
connectivity versus the potential connectivity of a neural network.
A sparse neural network could represent many potential connections
(of which only a few are actually connected). In contrast, a dense
neural network can only represent a few potential connections (but
most are actually connected). While the various examples are
presented in the context of illustrative sparsity and density
values (e.g., 0.1/0.9), the techniques described herein broadly
apply to any such sparsity/density combination (e.g., 0.2/0.8,
0.3/0.7, 0.5/0.5, etc.) Further discussions of the benefits and
tradeoffs associated with sparsity are described in greater detail
hereinafter (see e.g., Operational Efficiency Tradeoffs, Sparsity
and Density).
[0092] Additionally, while the foregoing scheme is presented in the
context of broadcast signaling, other implementations may use e.g.,
one-to-one (unicast), one-to-many (multicast), many-to-one, and/or
many-to-many, and/or any other communication variant. Such
implementations may be used to e.g., prioritize connectivity and/or
subdivide core operation. For example, a four (4) core device may
operate with all four cores (a large network) or emulate two
smaller networks; this may be useful for devices that dynamically
switch between multiple applications.
[0093] At step 1104 of the method 1100, the core may perform global
neural network processing. In some embodiments, a set of global
parameters (a matrix) may be multiplied by an input vector. In some
cases, the input vector may be sparsified; in alternative variants,
the input vector may be processed without sparsification.
[0094] Within the context of the present disclosure, the terms
"global", "globalization", "globalized" and other linguistic
variants thereof, refer to processing, signaling, and/or other
associated resources (signals, logic, and memory) that may be
propagated to, received from, shared with, or otherwise affect all
other cores of a plurality of cores. As but one specific example,
global signaling may be broadcast to all cores of the multicore
architecture.
[0095] In an exemplary embodiment, the input vector may be combined
(concatenated) with or include a global activation vector. In this
exemplary embodiment, the set of global parameters is multiplied
by the combination of the input vector and global activation vector. As
previously described, the global activation vector may be assembled
by the core from broadcast portions of the global activation vector
from each of the plurality of cores from the previous operations of
each core of the multicore architecture. The resulting matrix of
values may be used as inputs to the one or more activation
functions in the core.
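A minimal numpy sketch of this global processing step is provided below for illustration; the dimensions (an input of length NPC, a global activation vector of length N), the random placeholder parameters, and the three-way split of the result are assumptions consistent with the surrounding description rather than a definitive implementation.

import numpy as np

NPC, N = 32, 1024    # nodes per core, total nodes in the global network
rng = np.random.default_rng(0)
W_core = rng.standard_normal((3 * NPC, NPC + N)) * 0.01  # this core's portion of the global parameters

def global_step(x_sparse, h_global_prev):
    # Concatenate the (sparse) input vector with the previously assembled
    # global activation vector, multiply by the core's global parameters,
    # and split the result into inputs for the activation functions.
    xh = np.concatenate([x_sparse, h_global_prev])
    a, b, c = np.split(W_core @ xh, 3)
    return a, b, c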
[0096] At step 1106 of the method 1100, the core may perform local
neural network processing. In some embodiments, a set of local
parameters may be multiplied by a local vector that is specific to
each core. In one exemplary implementation, the local vector may be
dense. The dense vector that is specific to each core may include
the previous activation vector. During operation, the local neural
network processing may include multiplying the set of local
parameters with the dense vector.
[0097] Within the context of the present disclosure, the terms
"local", "localization", "localized" and other linguistic variants
thereof, refer to processing, signaling, and/or other associated
resources (signals, logic, and memory) that are not propagated to,
received from, shared with, or otherwise affect other cores of a
plurality of cores. For example, local parameters may be exclusive
to a single core and stored spatially near the corresponding core.
While the illustrated examples are presented in the context of
localization to a single core, other variants may e.g., localize
processing to multiple cores (e.g., in a four-core architecture,
processing may be localized to a pair of cores, etc.)
[0098] In an exemplary embodiment, each physical core of the
multicore architecture is assigned to a logical neighborhood or
cluster of neurons (e.g., 2, 4, 8, 16, 32, etc. neurons). Each
neighborhood or cluster of neurons shares a common memory which
includes the set of global parameters and the set of local
parameters. In a variant, multiple neighborhoods or clusters of
neurons share a single core. This allows for low cost (e.g.,
energy, bus bandwidth) dense communications between neurons within
a neighborhood while maintaining the benefits of global parameter
sparsity.
[0099] The resulting matrix of values from the multiplication of
the set of local parameters and the dense vector may be used as
e.g., inputs to the one or more activation functions in the core.
For example, the resulting matrix may be split into three different
components to provide inputs into two sigmoid activation functions
and a hyperbolic tangent activation function. Still other combinations of
activation functions may be substituted by artisans of ordinary
skill in the related arts, given the contents of the present
disclosure.
[0100] At step 1108 of the method 1100, the core may generate a
result vector for distribution to the plurality of cores. In one
embodiment, the result vector is based on a dense vector generated
by a modified gated recurrent unit (GRU) that combines a dense
activation vector from previous neighborhood network activity with
sparsified input and a sparse activation vector from previous
global network activity. The various combinations of dense
neighborhood/sparse global are remembered-forgotten (sigmoid) and
positively or negatively reinforced (tanh); for example, a first
component (r.sub.t) may be generated by remembering/forgetting a
weighted sum of the sparsified input (x'.sub.t) and a dense
activation vector from previous neighborhood network activity
(h.sub.t-1) (as described above in EQNS. 11 and 16).
[0101] In one exemplary embodiment, the result of the modified GRU may
be locally re-circulated in its dense form, and sparsified for
global distribution to other cores of the multicore architecture.
Sparsification may include e.g., skipping elements or adding null
elements to the dense vector. In an exemplary embodiment, a
rectified linear unit (ReLU) activation function is applied to the
resulting dense vector to create a portion of the globally sparse
activation vector.
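For illustration, a short sketch of this sparsification step, assuming a plain ReLU is used to null out negative entries of the dense result before global distribution:

import numpy as np

def sparsify(h_dense):
    # ReLU activation: negative entries become null (zero), adding sparsity
    # to the core's portion of the globally shared activation vector.
    return np.maximum(h_dense, 0.0)

h = np.array([0.7, -0.3, 0.0, 1.2, -2.1])
print(sparsify(h))  # [0.7 0.  0.  1.2 0. ]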
[0102] In one embodiment, the globally sparse activation vector is
an asynchronous combination of sparsified results from the
network's constituent cores. In other words, each core updates its
portion of the globally sparse activation vector without regard to
the timing of the other cores (updates to the globally sparse
activation vector are not synchronized). The core may broadcast its
core specific portion of the globally sparse activation vector to
each core of the multicore architecture. The broadcast core
specific portion of the activation may go into a messaging queue or
buffer that is specific to each core for use. Each core's specific
portion of the activation vector may then be retrieved from the
queue or buffer and used to assemble the next activation vector. In
one such implementation, each core is assigned a core identifier
and transmits the core identifier with the core specific portion of
the activation vector. The next activation vector may be assembled
in core identifier order. In a similar embodiment, the core
specific portion of the activation vector may be sent by the core
to a shared messaging queue or buffer associated with the multicore
architecture where it is combined with the core specific portion of
the activation from other cores in the multicore architecture. A
separate core, a scheduler/controller, or other processing logic
assembles the activation vector and broadcasts the assembled
activation vector to each core. Still other signaling schemes may
be substituted with equal success, by artisans of ordinary skill in
the related arts, given the contents of the present disclosure. For
example, some variants may use synchronous updates, directed
signaling, and/or local or global queuing mechanisms.
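As a purely illustrative sketch of the core-identifier-ordered assembly described above (the queue layout and function names are assumptions, not part of the disclosure):

import numpy as np

def assemble_global_activation(pieces_by_core, cores, npc):
    # pieces_by_core maps a core identifier to that core's broadcast portion
    # of the globally sparse activation vector; portions are placed in
    # core-identifier order to (re-)create the next global activation vector.
    h_global = np.zeros(cores * npc)
    for cidx in range(cores):
        h_global[cidx * npc:(cidx + 1) * npc] = pieces_by_core[cidx]
    return h_global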
[0103] Once completed, the core may begin work on the next
available task which may include obtaining a sparse vector (e.g.,
returning to step 1102 of the method 1100).
[0104] More generally, the various techniques described herein may
be broadly applied to any recurrent neural network (RNN) that can
be mathematically transformed and apportioned into localized
processing (to various degrees) and globalized processing.
Localized processing may be compressed into densely connected
networks to optimally utilize local core resources. Globalized
processing may be sparsified and logically distributed to the other
cores of the network. Other RNNs that may benefit from the various
concepts described herein include, without limitation: traditional
RNNs, LSTM-based RNNs (and variants thereof), and GRU-based RNNs (and
variants thereof).
[0105] Referring now to method 1200 of FIG. 12, a logical flow
diagram of an exemplary method 1200 for performing a modified gated
recurrent unit (GRU) within a core of a multicore neural network
architecture is shown. GRUs are characterized by "update"
information (a combination of remembered/forgotten node state and
input), "candidate" information (positive/negative reinforcement
via e.g., a tanh), and the "cell" or "node" state. While the
following discussion is presented in the context of software (i.e.,
instructions stored within non-transitory computer-readable media),
hardware and/or firmware
implementations of the modified-GRU process may be substituted with
equal success, by artisans of ordinary skill (given the contents of
the present disclosure).
[0106] At step 1202 of the method 1200, a core of a multicore
architecture receives an input vector (e.g., x' of FIG. 5) and a
global activation vector (e.g., h'.sub.t-1). In an exemplary
embodiment, the core receives the input vector (e.g., x') from
external stimulus to the multicore architecture. External stimulus
may include audio, video, and/or other sensed metrics. For example,
a hearing aid and/or earbuds may include a microphone that captures
acoustic data as input. Similarly, a mobile device may include a
camera that provides visual data as input. Other sensors may
include e.g., acoustic, sound, vibration, electromagnetic,
chemical, temporal, spatial, positioning, acceleration, etc. In
another example, external stimulus may include streaming text data
for NLP processing (e.g., language translation, language
understanding) or video data and may be performed on a mobile
phone, an augmented reality/virtual reality (AR/VR) goggle/headset,
a wearable (e.g., smart watch), a laptop, etc.
[0107] In one exemplary embodiment, the global activation vector
(e.g., h'.sub.t-1) that represents the previous state of the
network nodes, is received as asynchronous broadcast communication
from a plurality of cores of the multicore architecture. As
previously noted, sparsity in the global activation vector (as well
as the input vector) allows the data to be compressed as it is
transferred between cores as well as allowing a reduction in
parameters and access counts.
[0108] In one exemplary embodiment, core-specific parameters are
retrieved from local memories. For example, a local activation
vector (e.g., h.sub.t-1 ), a portion of a global weight matrix
(e.g., W.sub.ir, W.sub.iz, W.sub.in and W.sub.hr, W.sub.hz,
W.sub.hn), and their corresponding local matrices (e.g., D.sub.hr,
D.sub.hz, D.sub.hn) can be retrieved from a memory associated with
the core. The memory may be associated with the core alone (not any
other core of the multicore architecture) and spatially localized
thereto; dedicated access and on-chip proximity greatly improve
memory bandwidth.
[0109] At step 1204 of the method 1200, the core of the multicore
architecture calculates an update to its neighborhood state, e.g., in
accordance with the modified-GRU process described above (see e.g.,
EQNS. 8-15, also restated as EQNS. 16-20). For example, to
calculate update information (described in EQN. 9), the core of the
multicore architecture concatenates the input vector (x') and the
global activation vector (h'.sub.t-1) forming a concatenated sparse
vector (xh). The core then may multiply various portions of the
global weight matrix (W.sub.ir, W.sub.iz, W.sub.in) with the
concatenated sparse vector (xh) creating a first set of global
update values (a, b, c). Similarly, candidate information may be
calculated according to EQN. 10 and the resulting node state may be
calculated according to EQN. 14, etc.
[0110] At step 1206 of the method 1200, the core of the multicore
architecture provides a sparsified neighborhood state to other
cores. In an exemplary embodiment, calculating the updated
localized portion of the updated global activation vector includes
applying a rectified linear (ReLU) activation function to its
neighborhood state. The localized portion may then be combined with
localized portions of other cores to generate an updated global
activation vector for use in the next iteration.
[0111] To further illustrate how the above implementations for
operating a neural network architecture can be performed,
illustrative pseudocode is provided below for a single core
operation and multicore operation. The pseudocode is provided for
illustrative purposes, and other code can be used to implement the
algorithms described above as would be understood by one of
ordinary skill, given the contents of the present disclosure.
[0112] FIG. 13 shows a segment of pseudocode 1300 for operating a
neural network on a single core. Pseudocode segment 1302
initializes the dense local (neighborhood) activation vector (h)
and the sparse global activation vector (h'), each of length N. The
pseudocode segment 1302 also initializes the matrices W and D which
represent the sparse global weight matrix and the dense local
matrix respectively.
[0113] A sparse input vector is received at pseudocode segment 1306
and the sparse input vector is concatenated with the sparse global
activation vector at pseudocode segment 1308. Pseudocode segment
1310 performs a sparse matrix-vector multiplication of the global
weight matrix (W) and the concatenated vector (xh), splitting the
result into three portions a, b, and c. Pseudocode segment 1312
performs a matrix-vector multiplication of the dense local matrix (D)
and the dense local activation vector (h), splitting the result into
three portions e, f, and g. The resulting operations use cluster size
(C) times N (vector length) memory lookups and MACs.
[0114] Pseudocode segment 1314 uses portions of the results of the
multiplications of pseudocode segments 1310 and 1312 to perform
remember-forget and reinforcement operations which are used to
update the local activation vector (h) at pseudocode segment 1316.
At pseudocode segment 1318, the global activation vector is updated
as a sparsified snapshot of the local activation vector (h).
[0115] At pseudocode segment 1320, operations loop back to
pseudocode segment 1304 to receive the next input.
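Because FIG. 13 itself is not reproduced here, the following Python sketch approximates the described single-core loop (initialization, concatenation, the two matrix-vector products, the remember-forget/reinforcement update, and re-sparsification). The gating arithmetic is a generic GRU-style update; the ReLU sparsifier, the length-N input vector, the random placeholder matrices, and the variable names are assumptions, not a copy of the figure or of EQNS. 16-20.

import numpy as np

N = 1024                                          # network nodes
rng = np.random.default_rng(0)

h  = np.zeros(N)                                  # dense local (neighborhood) activation vector
hp = np.zeros(N)                                  # sparse global activation vector (h')
W  = rng.standard_normal((3 * N, 2 * N)) * 0.01   # global weight matrix (dense here for brevity)
D  = rng.standard_normal((3 * N, N)) * 0.01       # dense local matrix

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def step(x_sparse):
    # One iteration of the loop: concatenate, multiply, gate, and re-sparsify.
    global h, hp
    xh = np.concatenate([x_sparse, hp])           # segment 1308: concatenate input with h'
    a, b, c = np.split(W @ xh, 3)                 # segment 1310: global matrix-vector product
    e, f, g = np.split(D @ h, 3)                  # segment 1312: dense local matrix-vector product
    r = sigmoid(a + e)                            # segment 1314: remember-forget and
    z = sigmoid(b + f)                            #               reinforcement operations
    n = np.tanh(c + r * g)
    h = (1.0 - z) * n + z * h                     # segment 1316: update local activation vector
    hp = np.maximum(h, 0.0)                       # segment 1318: sparsified snapshot of h
    return hp                                     # segment 1320: caller loops for the next input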
[0116] FIG. 14 shows a segment of pseudocode 1400 for operating a
single core of a multicore neural network. Multiple nodes may be
emulated on each core of the multicore neural network. Each core
may then act as a neighborhood with dense data and communication that
does not need to be distributed throughout the broader global
multicore neural network, thus creating a physical and logical
hierarchy for nodes in the neural network. Embodiments of the
disclosed system can exploit this hierarchy and create more
efficient ways to store data (closer to where it is needed),
transmit data across the network (sending sparse data over dense
data), consume less power (as data is physically closer to where it
is used), and offer greater performance.
[0117] Pseudocode segment 1402 initializes the dense local
(neighborhood) activation vector (h), of length NPC (nodes per
core), and the sparse global activation vector (h'), of length N.
The pseudocode segment 1402 also initializes the matrices W and D
which represent the sparse global weight matrix and the dense local
matrix respectively. Constants CORES (the number of cores in the
multicore architecture), CIDX (the identifier of the present core),
and NPC (the number of nodes per core) are defined in pseudocode
segment 1402.
[0118] A sparse input vector is received at pseudocode segment
1406, and each core broadcasts its piece of the global activation
vector (h') to each other core in the multicore architecture.
Similarly, the core receives pieces of the global activation vector
from other cores. The broadcasted pieces are used to (re-)create
the full global activation vector based on core identifiers
(segment 1408).
[0119] In the illustrated pseudocode segment 1400, each core is
assumed to have the same number of nodes per core. Other
embodiments may vary the number of nodes per core; asymmetric
variants may be useful where data dependencies differ across cores.
In such embodiments, additional information may need to be used to
(re-)create the full global activation vector (e.g., the number of
nodes for the core, etc.)
[0120] In pseudocode segment 1410, the sparse input vector is
concatenated with the sparse global activation vector for use in
matrix-vector multiplication. Thereafter, pseudocode segment 1412
performs a sparse matrix-vector multiplication of the global weight
matrix (W) and the concatenated vector (xh), splitting the result
into three portions (a, b, and c). Pseudocode segment 1414 performs a
matrix-vector multiplication of the dense local matrix (D) and the
dense local activation vector (h), splitting the result into three
portions (e, f, and g).
[0121] Pseudocode segment 1416 uses portions of the results of the
multiplications of pseudocode segments 1412 and 1414 to perform
remember-forget and reinforcement operations which are used to
update the local activation vector (h) at pseudocode segment 1418.
At pseudocode segment 1420, the core specific portion of the global
activation vector is updated as a sparsified snapshot of the local
activation vector (h').
[0122] At pseudocode segment 1422, the core specific portion of the
global activation vector is broadcasted to the other cores in the
multicore architecture. At pseudocode segment 1424, operations loop
back to pseudocode segment 1404 to receive the next input.
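To complement the single-core sketch above, the following is a hedged approximation of the per-core loop described for FIG. 14; the inter-core transport is abstracted as a dictionary of received pieces, an input vector of length NPC is assumed, and the gating arithmetic and names remain assumptions rather than a reproduction of the figure.

import numpy as np

CORES, CIDX, NPC = 4, 0, 32         # number of cores, this core's identifier, nodes per core
N = CORES * NPC                     # total nodes in the global network
rng = np.random.default_rng(CIDX)

h  = np.zeros(NPC)                  # dense local activation vector (length NPC)
hp = np.zeros(N)                    # sparse global activation vector (length N)
W  = rng.standard_normal((3 * NPC, NPC + N)) * 0.01  # this core's slice of the global weights
D  = rng.standard_normal((3 * NPC, NPC)) * 0.01      # this core's dense local matrix

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def step(x_sparse, received_pieces):
    # received_pieces: mapping of core identifier -> that core's NPC-length piece of h'.
    global h, hp
    for cidx, piece in received_pieces.items():   # segment 1408: (re-)create h' in core-id order
        hp[cidx * NPC:(cidx + 1) * NPC] = piece
    xh = np.concatenate([x_sparse, hp])           # segment 1410
    a, b, c = np.split(W @ xh, 3)                 # segment 1412
    e, f, g = np.split(D @ h, 3)                  # segment 1414
    r = sigmoid(a + e)                            # segment 1416
    z = sigmoid(b + f)
    n = np.tanh(c + r * g)
    h = (1.0 - z) * n + z * h                     # segment 1418
    my_piece = np.maximum(h, 0.0)                 # segment 1420: sparsified core-specific portion
    hp[CIDX * NPC:(CIDX + 1) * NPC] = my_piece
    return my_piece                               # segment 1422: broadcast to the other cores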
Operational Efficiency Tradeoffs, Sparsity and Density
[0123] The exemplary embodiments described herein provide a
plethora of advantages that improve the functioning of neural
networks in computer processes. Notably, the exemplary dense local,
sparse global processing described herein provide unconventional
technical solutions for hardware-aware neural network processing.
The sparsity (or density) of a data structure may be calculated by
dividing the number of null/empty (or non-null/non-empty) elements
by the total number of elements. Sparsity and density may be terms
of absolute or relative degree. For example, a data structure may
be considered sparse if most of its values (greater than 50%) are
null values; similarly, a first data structure may be more sparse
than a second data structure even where both data structures are
dense (i.e., mostly non-null). While any data structure may be
considered relatively sparse or dense, there may be
propagation/storage efficiencies as the data becomes sparser and/or
computational efficiencies in packing data more densely. The
following discussion characterizes various operational tradeoffs
that may be made by changing sparsity/density.
[0124] Sparse global connection matrices have approximately
αβN² parameter lookups, where α is the activation density and β is
the parameter density (the fraction of non-null entries). Dense local
parameter matrix lookups are characterized by CN. Thus, the number of
parameter lookups for 6 global weight matrices (e.g., W.sub.ir,
W.sub.iz, W.sub.in and W.sub.hr, W.sub.hz, W.sub.hn) and their
corresponding 3 local matrices (e.g., D.sub.hr, D.sub.hz, D.sub.hn)
is given by 6αβN² + 3CN. Consider a neural network of 1024
nodes (N=1024) that is grouped into dense neighborhood clusters of
32 nodes apiece (C=32), having sparse global interconnections
characterized by α, β = 0.1. Such a network would require
approximately 160,000 parameter lookups, which compares much more
favorably (approximately 40×) to systems that rely on 6 brute-force
O(N²) parameter lookups (approximately 6,000,000 parameter lookups
for an equivalent system).
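The arithmetic in this example can be verified with a few lines of Python (illustrative only):

N, C = 1024, 32             # network nodes, neighborhood cluster size
alpha = beta = 0.1          # activation and parameter density factors from the example

sparse_global_lookups = 6 * alpha * beta * N ** 2   # six sparse global weight matrices
dense_local_lookups = 3 * C * N                     # three dense local matrices
total = sparse_global_lookups + dense_local_lookups
brute_force = 6 * N ** 2                            # six brute-force O(N^2) lookups

print(int(total))            # ~161,000 parameter lookups (~160,000 as stated)
print(brute_force / total)   # ~39x, i.e., roughly the 40x improvement noted above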
[0125] Notably, parameter counts also differ based on usage. In
the foregoing example, there are 6 sparse global matrices having a
density of 0.1, or 6 × 1024² × 0.1 ≈ 600×10³ parameters. In
contrast, dense neighborhood parameters are localized to 3 memories
with an assumed density of 1.0. Thus, dense parameters consume
3 × 32 × 1024 ≈ 100×10³ entries. In other words, even though dense
lookups dominate the lookup count, there are only 1/6th as many dense
parameters. As a practical matter, the
difference in utilization between global and neighborhood
parameters may be leveraged in a variety of different ways. For
example, different implementations may seek to further increase the
disparity of global/local parameterization to further improve
performance. Alternatively, global/local parameterization may be
load balanced to reduce device wear, etc.
[0126] More generally, the various principles described herein
address specific memory access, processing bandwidth, and/or
utilization issues that are specific to hardware-aware neural
networks; these are unique and distinct from well-understood,
routine, and/or conventional solutions implemented within
traditional neural network computing.
Exemplary Hardware-Aware Mapping and Partitioning
[0127] FIG. 15 is a logical flow diagram of an exemplary method
1500 for optimizing standard machine learning frameworks for
multicore architectures, in accordance with the various principles
described herein. While the following discussion is presented in
the context of the exemplary multicore architecture 200 described
above, the hardware-aware mapping and partitioning techniques
described herein may improve the performance of any multicore
neural network implementation. For example, even naive
implementations of multicore networks via the general compute
system 100 of FIG. 1 would benefit from mapping and/or partitioning
portions of the neural network processing to specific cores 112A,
112B . . . 112N so as to reduce in-core parameter storage and
activations, and off-core data transfers.
[0128] At step 1502 of method 1500, a logical model of a neural
network is synthesized to device-specific primitives. As a brief
aside, existing neural networks may be designed in a variety of
design languages. For example, the most common machine learning
frameworks (e.g., PyTorch, Tensorflow, etc.) use graphical
representations of "machine models" to describe how nodes of the
network are connected to one another, etc. Machine learning
frameworks may include software libraries/application programming
interfaces (APIs) for machine learning to perform training and/or
statistical inference of (deep) neural networks. These frameworks
may offer building blocks for designing, training, and validating
deep neural networks through a high-level programming interface for
a user.
[0129] As used herein, the term "logical model" (also referred to
as a learning model or a prediction model) refers to any schema for
representing the structure of a neural network. Structural
descriptions of a neural network may specify individual node
functionality, node connectivity, and/or groups of nodes (e.g.,
layers), to take desired inputs and generate desired outputs. For
example, a logical model may use training data to derive the
desired outputs from weighted combinations of input variables.
Ideally, the logical model generates a neural network mapping that
works for the training data, as well as similar real-world
data.
[0130] In one exemplary embodiment, the logical model is a
graphical representation of computation that includes a
flow/directed graph where each node of the graph represents one or
more atomic operations, and where each node may be annotated with a
node name, a node type, operations/actions, data, points where data
flows into or out of the chip, and input/output or communication
nodes.
[0131] As used herein, the term "primitive" refers to any
indivisible unit of operation/functionality that is specific to a
device. An indivisible unit of operation cannot be further
subdivided, e.g., a software primitive may be an opcode, a hardware
primitive may be a combinatorial or sequential logic, etc. For
example, a specific field programmable gate array (FPGA) would
support a specific instruction set, and look-up-table logic,
etc.
[0132] In one exemplary embodiment, design synthesis is further
staged into three sub-steps: atomization (sub-step 1504),
quantization (sub-step 1506), and tracing (sub-step 1508). At
sub-step 1504, the model may be atomized into a set of fundamental
operations, referred to as "atomics" or "atomic operators". Atomics
are related to, but abstracted from, primitives. Atomics include
the set of linear algebra and pointwise operations that are common
to all machine learning operations and may be agnostic to software
and/or hardware details During this stage, simple layers of the
model (e.g., dense feedforward layers) may directly map to a single
atomic operator. Higher-level layers may be decomposed into more
basic operations before mapping to atomic operations. In an
embodiment, a dictionary or look-up table (LUT) may be used to map
operators from the source machine learning framework and thereby
perform the conversion of operations to the set of atomics.
[0133] In some embodiments, the model may additionally be converted
to an intermediate representation (e.g., a Patch Intermediate
Representation (PatchIR)) for quantization aware training. The
intermediate representation may enable users to continue training
models after atomization and quantization, while still remaining
compatible with the source machine learning framework. Each dialect
of the intermediate representation may include a computation graph
parser to interpret the input model from the source framework, a
library of quantization-aware-training friendly implementations of
each of the atomic operators written in the source framework, and a
dictionary mapping machine-learning layers from the source
framework to atomic operators or compositions of atomic
operators.
[0134] Notably, design synthesis may have many ways to map logical
functionality to device primitives, e.g., an adder could be
implemented within many different software primitives and/or
hardware primitives. This can be further complicated where exact
functionality is not required (e.g., where device operation can
acceptably deviate from the idealized neural network model). To
reduce synthesis complexity and/or improve synthesis results,
additional machine learning layers can be added to, or substituted
for, existing layers of the logical model. The additional layers
may reduce constraints on synthesis/fitting; for example,
sparsifier layers introduce activation sparsity that can be tagged
as prune-able activations; prune-able activations can improve
regularization (described in greater detail below). Similarly,
sparsifiable recurrent neural network layers (such as the modified
GRU described in FIG. 5, above) can be substituted for generic
recurrent neural network layers (e.g., the GRU of FIG. 4, above).
Additionally, certain functionality (such as spectral
transformation layers, encryption/decryption engines, etc.) may be
more efficiently performed in specialized logic; as but one such
example, dedicated logic for Short-Time Fourier Transform (STFT)
may be used to convert raw waveform audio into the time-frequency
domain as inputs to the neural network.
[0135] Typically, logical machine-learning models use
floating-point representations for parameters and activations;
since most embedded devices are implemented with fixed-point data
structures, differences in behavior due to floating/fixed-point
conversion should be resolved and/or compensated for via training.
Conceptually, logical models can be significantly compressed by
quantizing variables to use fewer bits without suffering
significant losses in accuracy. Empirical results suggest that
quantization may provide similar functionality at a fraction of the
logical model's memory footprint (a factor of 8× reduction),
simplify the processing logic, and reduce the latency and energy
costs of operation.
[0136] At sub-step 1506, high-precision floating-point operations
are quantized and approximated with lower-precision integers. In
one embodiment, quantization may convert floating-point
representations (32-bit or 64-bit) to integer representations
(INT16, INT8, INT4, etc.) In one specific variant, the quantization
may be parameterized into bit-depth and shift-amount for each
atomic operator.
[0137] In one exemplary embodiment, the quantization may be
iteratively optimized by processing a set of representative inputs
with the model, and adjusting the quantization based on collected
statistics from the traced output (discussed below, at sub-step
1508). The mapping between floating-point numbers (x) and its
integer representation (for a given quanta (q) and bit width (b))
may be given by:
Q(x, b, q) = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{2^{q}}\right),\, -2^{b-1},\, 2^{b-1}-1\right)   (EQN. 21)
[0138] For known floating-point ranges, q and b may be chosen to
minimize the Quantization Mean Square Error (QMSE) represented in
EQN 22.
\mathrm{QMSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{i} - 2^{q}\,Q(x_{i}, b, q)\right)^{2}   (EQN. 22)
[0139] For unknown floating-point ranges, a statistical
approximation may be used to generate a range of q values to
estimate an optimum mapping. As but one such example, the GQMSE
(Gaussian QMSE) may be used where x is normally distributed
(x ~ N(μ, σ)); GQMSE may be calculated by EQN. 23.
\mathrm{GQMSE} = \frac{1}{\sqrt{2\pi\sigma^{2}}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\left(x - 2^{q}\,Q(x, b, q)\right)^{2}\,dx   (EQN. 23)
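For illustration, a compact Python rendering of EQNS. 21-22 follows; the brute-force search over q shown at the end is an assumption about how the optimization might be carried out, not a prescribed method.

import numpy as np

def quantize(x, b, q):
    # EQN. 21: Q(x, b, q) = clamp(round(x / 2^q), -2^(b-1), 2^(b-1) - 1)
    return np.clip(np.round(x / 2.0 ** q), -2 ** (b - 1), 2 ** (b - 1) - 1)

def qmse(x, b, q):
    # EQN. 22: mean squared error between x and its de-quantized representation.
    return np.mean((x - 2.0 ** q * quantize(x, b, q)) ** 2)

x = np.random.default_rng(0).normal(0.0, 1.0, 4096)   # representative inputs
b = 8                                                 # bit width
best_q = min(range(-12, 4), key=lambda q: qmse(x, b, q))
print(best_q, qmse(x, b, best_q))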
[0140] In some embodiments, quantization sub-step 1506 may include
granular quantization control that enables different precisions for
different layers of a model ("heterogeneous precision"). In such
cases, a user may tag different layers in their model with
different precisions. For example, spectro-temporal input data
could flow through a model along two paths to implement selective
noise reduction. One path's output is a time-frequency mask that
indicates which time-frequency bins are noise and which are signal.
This mask is applied to the other path containing the original
input time-frequency data. While the mask and other data can be
computed at 8 or 4 bits ("standard"), the input data may be best
preserved at 16 bits ("double"). Various implementations may
strictly (or loosely) obey user tagged precisions during
quantization. Untagged layers may be assigned a default precision
by the mapping algorithm.
[0141] Similarly, certain embodiments may include different options
for vector and matrix precision at different layers. For example,
"standard" precision may be 8-bit integers for vectors and 4-bit
integers for matrices. "Eights" precision may be 8-bit integer
precision for both vector and matrix values. "Double" precision may
be 16-bit integers for vector and 8-bit integers for matrices.
[0142] While the foregoing discussion is presented in the context
of specific data structures, artisans of ordinary skill in the
related arts will readily appreciate that virtually any data
structure of any dimensionality may be substituted with equal
success. Examples of such data structures may include
signed/unsigned integers, floating-point of any precision, and/or
any other data representation.
[0143] As shown in FIG. 16, circles within the hierarchy of layers
1600 depict different layers in the machine learning model. In
hierarchy of layers 1600, a heterogeneous precision is applied to a
model by tagging layers with different precisions. Layer 1602 is
tagged with double precision, while layer 1604 is tagged with
standard precision. Arrows depict parent-child relationships in the
layer hierarchy. During automatic quantization, precisions are
chosen for untagged layers based on the layer hierarchy (i.e.,
untagged layers inherit their precision from their parent in the
hierarchy). Layers that do not inherit a precision from a parent
layer may be set to the default precision (standard).
[0144] Referring back to sub-step 1506 of FIG. 15, certain aspects
of neural network operation may be analyzed and/or trained on. In
one embodiment, commonly used functions (sigmoid, tanh, sqrt, log,
reciprocal, etc.) may be stored in look-up-tables (LUTs) because it
may be more efficient to read a value out of a table than to
evaluate the function on-the-fly (e.g. via Taylor Series).
[0145] Unfortunately, in some situations, the limited input address
space of the LUT operation can bottleneck performance at higher
precision operations. For example, an 8-bit addressable LUTs may
only store 256 outputs for 256 linearly spaced inputs, however the
activation precision level may be configured for either 8-bit or
16-bit output values. As a result, during "double" precision, INT16
activations may be compressed to INT8 before they can address the
LUT. This compression may introduce quantization error (e.g.,
Quantization Mean Squared Error) during operation.
[0146] Some embodiments may use linear interpolation (piecewise
linear approximation) to mitigate quantization errors. Linear
interpolation calculates a local slope f'(x̃) between
neighboring entries in the LUT to approximate f(x). The
interpolated approximation to f(x) is given by EQN. 24, where x is
at the output precision (e.g., 16-bit) and x̃ is at the compressed
input precision (e.g., 8-bit).
f(x) \approx f(\tilde{x}) + (x - \tilde{x})\,f'(\tilde{x})   (EQN. 24)
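To illustrate EQN. 24, the sketch below builds an 8-bit addressable LUT for a sigmoid over a fixed range and interpolates a higher-precision input; the input range, the choice of sigmoid, and the slope estimate from neighboring entries are assumptions for the example.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# 8-bit addressable LUT: 256 outputs for 256 linearly spaced inputs over [-8, 8).
lut_x = np.linspace(-8.0, 8.0, 256, endpoint=False)
lut_y = sigmoid(lut_x)
step = lut_x[1] - lut_x[0]

def lut_interp(x):
    # Compress the high-precision input to an LUT address (x_tilde), then apply
    # EQN. 24: f(x) ~= f(x_tilde) + (x - x_tilde) * f'(x_tilde), where f' is
    # approximated by the slope between neighboring LUT entries.
    idx = int(np.clip((x - lut_x[0]) // step, 0, 254))
    x_tilde = lut_x[idx]
    slope = (lut_y[idx + 1] - lut_y[idx]) / step
    return lut_y[idx] + (x - x_tilde) * slope

print(lut_interp(0.3), sigmoid(0.3))   # interpolated vs. exact value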
[0147] Other embodiments may use telescoping functions that change
sensitivity over different ranges. A telescoping approximation
compresses the input into several levels. For instance, a two-level
telescoping approximation combines two evaluations of a
function--one for the coarsely compressed input and one for the
finely compressed input. The coarsely compressed input preserves
the original dynamic range of the input but has a large step size.
The finely compressed input has a small step size which preserves
the original input's granularity but may only be valid for inputs
with small magnitudes. Telescoping functions are often suitable for
functions that satisfy the additive identity of EQN. 25 (e.g.,
logarithms) or the multiplicative identity of EQN. 26 (e.g.,
monomials of the form f(x) = \alpha x^{p}):
f(k \cdot x) = f(k) + f(x)   (EQN. 25)
f(k \cdot x) = f(k) \cdot f(x)   (EQN. 26)
[0148] Referring back to sub-step 1508 of FIG. 15, the effects of
quantization are traced to aid subsequent iterative and/or
training. In one exemplary embodiment, each atomic operator (from
sub-step 1504) is annotated with the quantization parameters (from
sub-step 1506) to determine the resulting data flow. The data flow
between atomic operators can be represented as a set of directed
edges between nodes in a node graph. Unconstrained portions of the
node graph may receive default configurations and/or derive their
configurations from constrained portions. For example, a node may
assume that its input is the same bit-width as its data source
(e.g., an upstream node). The data flow may be further annotated
with attributes that aid the compiler, such as vector shapes and
sizes, activation sparsity, parameter sparsity, and/or data
types.
[0149] As previously noted, logical models of neural networks are
trained to generate output data, based on training data sets.
However, various embodiments of the present disclosure may
incorporate device-awareness into the training process (step 1510).
While the following discussion is presented in the context of
hardware-aware training, artisans of ordinary skill in the related
arts will readily appreciate that the concepts may be broadly
extended to any awareness-based training. Within the context of
neural network training, the terms "aware", "awareness", and its
linguistic derivatives refer to training techniques that
compensate, leverage, or otherwise adjust for, parameterized
capabilities, limitations, and/or functionalities. For instance,
hardware-aware training may be based on limitations (or
capabilities) of the processor, memory, and/or logic gates.
Examples of such limitations (or capabilities) may include
processing speed, memory size, gate numerosity, etc. Software-aware
training may be based on limitations (or capabilities) of the
software execution; examples of such parameters may include data
structure sizes, permissions, memory allocations, locking access,
etc. The concepts described herein may be broadly applied to any
resource that may affect real-world operation; for example,
training may be modified based on e.g., power consumption, network
bandwidth, and/or any other device, application, system
consideration.
[0150] In one exemplary embodiment, device-specific training may
include Quantization Aware Training (QAT). As a brief aside,
logical model training relies on the iterative fine-tuning of a
model's parameters with floating-point precision between nodes.
Each training iteration could modify parameters by a large range of
possible gradient updates. Unfortunately, embedded implementations
may only have fixed-point precision; this means training must occur
over the range of gradient updates that are representable by
fixed-point precision. Consequently, in one specific
implementation, "fake-quantization" is used during QAT to simulate
the integer arithmetic in the forward pass while allowing for small
floating-point gradient updates in the backward pass. Integer
operations are simulated using floating-point values by rounding
and truncating. It is important to note that the simulation of the
underlying integer operations exactly matches the device-specific
precision. In the backwards pass, small floating-point gradient
updates are percolated back to the parameters. A high-precision
floating-point copy of each of the parameters is updated with the
gradient updates. In this way, the high precision parameters can
accumulate multiple gradient updates before crossing integer
boundaries and exhibiting these changes in the forward pass.
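A minimal PyTorch-style sketch of such fake-quantization is shown below, using a straight-through estimator so that rounding and clamping appear in the forward pass while floating-point gradients flow in the backward pass; the bit width, scale handling, and function names are assumptions, not the disclosed training procedure.

import torch

def fake_quantize(x, bits=8, q=-5):
    # Forward pass: simulate integer arithmetic by rounding and clamping at the
    # device-specific precision. Backward pass: the straight-through estimator
    # passes the gradient unchanged so small floating-point updates accumulate.
    scale = 2.0 ** q
    q_int = torch.clamp(torch.round(x / scale),
                        -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    x_q = q_int * scale
    return x + (x_q - x).detach()

w = torch.randn(16, requires_grad=True)   # high-precision master copy of a parameter
loss = fake_quantize(w).sum()
loss.backward()
print(w.grad)                             # gradients land on the floating-point copy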
[0151] More broadly, artisans of ordinary skill in the related arts
will readily appreciate that hardware-aware training may be used at
multiple points in the design process. In some embodiments,
training may occur prior to design synthesis (step 1502). In some
embodiments, training may be performed prior to software
compilation/hardware partitioning (step 1514, described below). In
still other embodiments, training may be an iterative process;
training may be performed on preliminary synthesis passes, and
results may be fed back to improve subsequent synthesis passes.
[0152] In one such implementation, the system may prune parameters
and activations during device-specific training of the model
(sub-step 1512). Mathematically, untrained neural networks have
infinitely many potential ways of generating desired outputs from
the inputs; training the neural network selects one solution. While
logical neural networks could arbitrarily pick any solution, the
exemplary device-aware training adds parameter-sparsity and
activation-sparsity to provide maximum flexibility for optimal
connectivity (and penalizes sub-optimal connections, discussed
below).
[0153] As discussed in U.S. patent application Ser. No. ______,
filed ______ and entitled "METHODS AND APPARATUS FOR MATRIX AND
VECTOR STORAGE AND OPERATIONS", previously incorporated herein by
reference in its entirety, activation sparsity can be used to
greatly reduce storage requirements as well as computational
complexity. For instance, compression schemes may be used to
represent sparse matrices with links to compressed column data
structures, where each compressed column data structure only stores
non-null entries to optimize column-based lookups of non-null
entries. Similarly, sparse vector addressing skips nulled entries
to optimize for vector-specific non-null multiply-accumulate
operations. In one exemplary embodiment, activation-sparsity can be
introduced by adding sparsifier layers to the logical model; the
sparsifier layers are tagged as pruneable activations which can be
preferentially sparsified and/or pruned during training to avoid
undesirable penalties.
[0154] Similarly, parameter sparsity may allow users to fit large
models into hardware with a limited memory capacity. As illustrated
above, parameter sparsity distributes neural network parameters to
each of the cores of a multicore architecture; subsequent training
can prune the parameters to incentivize local processing and
penalize global communications. In this manner, each core is
optimized for only a small slice of the overall neural network
parameters. In other words, parameter and activation pruning tools
may prompt models to achieve higher levels of parameter and
activation sparsity when used during device-specific training.
[0155] In one specific implementation, the device-specific training
algorithm is a multivariate optimization of latency (L), energy
(E), and memory (M) based on the activation density (α) and
parameter density (β), according to the following
equations:
L = \alpha\beta\eta\,L_{o}   (EQN. 27)
E = \alpha\beta\eta\,E_{o}   (EQN. 28)
M = \beta\,M_{o}   (EQN. 29)
[0156] As previously noted, density is the ratio of non-null
elements to the total number of elements; density and sparsity are
each positive and sum to one. L_o, E_o, and M_o are the
baseline latency, energy, and memory for the logical model,
measured when activations and parameters both have densities of
one. These baselines define the theoretical maximum resources
needed for a model. Additionally, η reflects the
activation-parameter density affinity, i.e., a characterization of
the firing rates of neurons and their connectivity. When the
correlation between parameter and activation densities is zero, η = 1.
A positive correlation between parameter and activation densities
results in η > 1 (e.g., non-null activations occur in densely
connected neurons); a negative correlation corresponds to
η < 1 (e.g., non-null activations occur in loosely connected
neurons).
[0157] In one exemplary device-specific training process, the
neural network is trained to minimize α, β, and η.
In one specific implementation, the device-specific training
process uses a set of heuristics that penalize activations at a
first (general) weight and penalize affinity-dense activations at a
second (heavier) weight. Additionally, the device-specific training
process may iteratively adjust parameter density; to minimize
training complexity, parameter density may be gradually, but
irreversibly pruned. Other techniques may allow for more training
complexity, and support reversible parameter pruning.
[0158] In one implementation, the activation penalty may use a
differentiable regularizer that gradually shrinks the activation
density (α). In machine learning contexts, regularization is
the process of adding information to solve an ill-posed problem or
to prevent overfitting; differentiability ensures that the
regularization occurs smoothly (continuously); e.g., gradient
descent-based training is a first-order differentiable
function.
[0159] In the exemplary implementation, the activation penalty is
added to the overall objective function, which is minimized via
gradient descent during training. The activation
penalty Θ is an L1 norm of the model's N prune-able
activations (a), given by the equation:
\Theta = \frac{1}{N}\,\lVert a \rVert_{1} = \frac{1}{N}\sum_{i=1}^{N} a_{i}   (EQN. 30)
[0160] This rule penalizes neuron activations because each
activation is associated with memory access and computational cost.
The rule treats all neurons with an equal weighting, regardless of
how many neurons they connect to.
[0161] Notably, certain neurons should be penalized more heavily
than others because they can cause chains of downstream neuron
activations; thus, an affinity-aware penalty strives to shrink
αη instead of α alone. In one specific
implementation, the affinity-aware activation penalty
Θ_η is an L1 norm of the prune-able activations,
weighted by each prune-able neuron's fanout (f_i for the ith
prune-able neuron), given by the equation:
\Theta_{\eta} = \frac{\sum_{i=1}^{N} f_{i}\,a_{i}}{\sum_{i=1}^{N} f_{i}}   (EQN. 31)
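For illustration, a numpy sketch of the two penalties of EQNS. 30-31; the activation and fanout values are placeholders, not data from the disclosure.

import numpy as np

def activation_penalty(a):
    # EQN. 30: mean L1 norm of the N prune-able activations.
    return np.sum(np.abs(a)) / a.size

def affinity_aware_penalty(a, fanout):
    # EQN. 31: L1 norm weighted by each prune-able neuron's fanout, so neurons
    # that can trigger many downstream activations are penalized more heavily.
    return np.sum(fanout * np.abs(a)) / np.sum(fanout)

a = np.array([0.0, 0.8, 0.0, 0.1, 0.5])        # prune-able activations
fanout = np.array([4, 64, 2, 8, 16])           # fanout of each prune-able neuron
print(activation_penalty(a), affinity_aware_penalty(a, fanout))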
[0162] In one embodiment, the device-specific training includes a
structured, magnitude-based parameter pruning algorithm to
gradually reduce the parameter density .beta. during training.
Unlike activation pruning, which can be indirectly weighted/trained
with a regularization penalty, parameter pruning is performed
directly and irreversibly to reflect the realities of hardware
implementation (i.e., a core either has or does not have a
parameter). At each pruning step, the pruning algorithm selects a
number of parameter elements to prune. From this point on, these
pruned parameter elements are set to null. Once an element is
pruned, it is insensitive to gradient updates and will remain fixed
at null for the duration of training.
[0163] As discussed in U.S. patent application Ser. No. ______,
filed ______ and entitled "METHODS AND APPARATUS FOR MATRIX AND
VECTOR STORAGE AND OPERATIONS", previously incorporated herein by
reference in its entirety, the exemplary device may group certain
elements together to accelerate certain types of computation. In
order to benefit from such hardware-acceleration, the
device-specific training may incorporate element grouping in the
training process. Specifically, the training process may
selectively prune elements of a parameter matrix based on a
structured magnitude-based criterion. In particular, pruning
decisions are not made at a per-element basis, as this may lead to
an unstructured sparsity pattern. Instead, the matrix may be broken
down into subcomponents called pencils, and pruning decisions are
made per-pencil instead of per-element. In an exemplary embodiment,
a pencil is a column vector of 8 elements. For example, a matrix of
shape (256, 256) would have 32 pencils per column, for a total of
8,192 pencils. The pencils with the lowest average magnitudes may
be selected for pruning, until enough pencils have been pruned to
reach the target sparsity level. The pencil structure is used to
align with hardware memory interfaces--a read from memory extracts
multiple consecutive elements.
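An illustrative numpy sketch of this structured, magnitude-based pencil pruning follows; the 8-element pencil size and the (256, 256) matrix mirror the example above, while the tie-breaking and target-sparsity handling are assumptions.

import numpy as np

def prune_pencils(matrix, target_sparsity, pencil=8):
    # Break each column into 8-element "pencils", score each pencil by its mean
    # absolute magnitude, and null the lowest-scoring pencils until the target
    # sparsity level is reached. Pruned pencils remain fixed at null.
    rows, cols = matrix.shape
    assert rows % pencil == 0
    pencils = matrix.reshape(rows // pencil, pencil, cols).copy()
    scores = np.mean(np.abs(pencils), axis=1)            # one score per pencil
    n_prune = int(target_sparsity * scores.size)
    cutoff = np.sort(scores, axis=None)[n_prune - 1] if n_prune else -np.inf
    keep = scores > cutoff
    pencils *= keep[:, None, :]
    return pencils.reshape(rows, cols)

m = np.random.default_rng(0).normal(size=(256, 256))     # 32 pencils per column, 8,192 total
pruned = prune_pencils(m, target_sparsity=0.9)
print(1.0 - np.count_nonzero(pruned) / pruned.size)      # ~0.9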
[0164] At step 1514, the synthesized device-specific primitives are
mapped to the device architecture. More directly, the atomic
operations that were logically synthesized in step 1502 are
agnostic to the physical aspects of device implementation (e.g.,
timing and task scheduling, placement, race conditions, etc.) The
mapping step assigns the device-specific primitives to physical
resources of the device and generates the executable software. In
one exemplary embodiment, mapping is performed in four stages:
partitioning and placement (at sub-step 1516), optimization (at
sub-step 1518), code generation (at sub-step 1520), and
instruction-level optimization (at sub-step 1522). While the
exemplary embodiment performs each stage iteratively, other
embodiments may group stages for iteration (e.g., sub-steps 1516
and 1518 may be grouped and iterated over). Artisans of ordinary
skill in the related arts, given the contents of the present
disclosure will readily appreciate that other implementations may
further subdivide, merge, remove, add-to, and/or otherwise modify
the mapping sub-stages described herein.
[0165] As an optional preliminary sub-step, hardware-agnostic
atomic operations may be annotated by a human (either via a textual
or graphical interface) for compilation and placement. Each node
may be annotated to specify e.g., data type and/or data flow,
operators, or markers for other hardware-specific functions such as
core-to-core communication. Edges between nodes of the node graph
represent data and/or control dependencies. Compiling the annotated
representation to assembly code for placement on the hardware may
attempt to optimize energy efficiency, packing efficiency, and
computation latency, among other areas of optimization (e.g.,
performance).
[0166] Various embodiments of the present disclosure implement a
variety of metrics to assess mapping quality. These metrics may be
used to iteratively optimize between different mappings. For
instance, an energy efficient mapping should execute the
computation using as little energy as possible per
inference/timestep. The primary contributor to energy consumption
is the placement: in certain embodiments, core-to-core
communication may be minimized. Similarly, an efficiently packed
mapping maximizes core utilization for a given network. In one such
implementation, packing efficiency refers to the amount of unused
memory in each core; other efficient packings may minimize
inter-core communication, etc. The packing efficiency of the
mapping limits the network size that can practically fit on a given
chip--there is an overhead on the "effective" amount of memory in
the system. This indirectly affects energy efficiency since leakage
current is a function of the core utilization.
[0167] Additionally, certain embodiments may assess the suitability
of the mapping for a particular application. For instance, certain
applications require computations to be performed as quickly as
possible (or within other time constraints). Other applications may
have performance and/or power limitations, etc. Still other
applications may balance multiple considerations. For example, a
solution with sub-optimal latency may increase the amount of time
that the system stays in its highest-power state instead of
sleeping in a low-power state.
[0168] In addition to spatial placement considerations, temporal
utilization may also introduce a variety of considerations. As but
one such example, a multicore architecture may use parallelism to
accelerate processing. Each core may be a multithreaded processor
associated with and working from private memory associated only
with that processor and capable of running several SIMD
instructions simultaneously. Unfortunately, thread-level
parallelism may be limited by resource conflicts (e.g.,
instructions cannot run in parallel if their operands are from the
same memory bank). One potential solution is to distribute data and
operations across multiple resources to minimize resource
conflicts; alternatively, or in addition, resource conflicts can be
scheduled around.
[0169] Other examples of temporal restrictions include long latency
core-to-core (inter-core) communications. This type of parallelism
may arise from partitioning large nodes into smaller pieces.
Inter-core communications may be minimized at the algorithm-level
by communicating sparse vectors, and by keeping communication of
dense vectors as local as possible.
[0170] Referring back to sub-step 1516 of FIG. 15, partitioning and
placement generally refers to the process of determining the number
of cores needed to support the neural network, and splitting the
neural network program into core-specific sub-programs. In one
specific implementation, the data and computation loads are
balanced across cores, and the program is split with the objective
of minimizing the total core-to-core communication. Other
implementations may optimize for asymmetric placements; for
example, heterogeneous multicore architectures may preferentially
place certain types of functionality in certain cores. For example,
a high-performance core may be coupled with a power efficient core,
a highly-connected core, etc. Similarly, some neural networks may
incorporate specialized logic (e.g., encryption, codecs,
communication protocols, etc.) Various other implementations may be
substituted with equal success, by artisans of ordinary skill in
the related arts given the contents of the present disclosure.
[0171] FIG. 17 is a logical flow diagram of one exemplary
implementation of the partitioning and placement sub-step 1516 of
FIG. 15.
[0172] At step 1702, an initialization pass is performed on the
device-specific primitives based on their atomic operators. In one
exemplary embodiment, the synthesized node graph of device-specific
primitives is input to the mapping algorithm. The edges between
nodes of the node graph determine the data or control dependencies
between nodes.
[0173] In one specific implementation, the mapping algorithm
classifies atomic operator functionality into "OpNodes",
"DataNodes", "TableNodes", and "CommNodes." OpNodes describe a
mathematical operation, e.g., a matrix-vector multiplication. The
mapping algorithm may either unfold OpNodes into device-specific
primitives (instructions or logic) during code generation (see
sub-step 1520 of FIG. 15) or prune the node out during compilation
(see sub-step 1522 of FIG. 15). OpNodes may be annotated with e.g.,
location information (the core and thread that the OpNode has been
assigned to), precision (e.g., standard or double-precision),
and/or operation-specific constants (e.g., an immediate value for
an immediate addition operation). DataNodes may store data
structure information such as: data type, shape, and precision;
location (core and bank); a constant value for fixed parameters;
and/or sparsity information. TableNodes are used for objects that
may be stored in table memory (e.g., look-up-table entries and
column addresses for sparse matrix by sparse vector products).
CommNodes are specialized logic that allow inter-core communication
(see associated discussion at step 1714).
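By way of illustration and not limitation, the following Python
sketch shows one way the four node classes described above might be
represented within a mapping tool. The class names, field names, and
defaults are assumptions made purely for illustration and do not
correspond to any particular implementation.

    from dataclasses import dataclass, field
    from typing import Optional, Tuple

    @dataclass
    class OpNode:
        # A mathematical operation, e.g., a matrix-vector multiplication.
        op: str                                  # e.g., "matvec", "add_imm"
        core: Optional[int] = None               # location: assigned core
        thread: Optional[int] = None             # location: assigned thread
        precision: str = "standard"              # "standard" or "double"
        constants: dict = field(default_factory=dict)  # e.g., {"imm": 3}

    @dataclass
    class DataNode:
        # A data structure: type, shape, precision, location, sparsity.
        dtype: str
        shape: Tuple[int, ...]
        precision: str = "standard"
        core: Optional[int] = None
        bank: Optional[int] = None
        constant: Optional[object] = None        # fixed parameter, if any
        is_sparse: bool = False

    @dataclass
    class TableNode:
        # Objects held in table memory, e.g., look-up-table entries or
        # column addresses for sparse matrix by sparse vector products.
        entries: list = field(default_factory=list)
        core: Optional[int] = None

    @dataclass
    class CommNode:
        # Placeholder for inter-core communication; expanded later into
        # communication instructions during code generation.
        direction: str                           # "send" or "recv"
        src_core: Optional[int] = None
        dst_core: Optional[int] = None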
[0174] At step 1704, the synthesized node graph is sequenced
according to execution order. In some cases, this may require the
addition of new edges to the synthesized node graph. For instance,
control edges may be added to indicate control dependencies that do
not have their own data dependency. Control edges may be used in
some specific operations to ensure faithful execution order;
accordingly, this pass may be repeated whenever the node graph
changes.
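A minimal sketch of such a sequencing pass is provided below,
assuming the node graph is kept as a list of nodes and a list of
(source, destination) edges in which data and control edges are
treated identically for ordering purposes; the representation is an
illustrative assumption rather than any particular implementation.

    from collections import deque

    def sequence(nodes, edges):
        # Return the nodes in an execution order consistent with every
        # data and control edge; edges is a list of (src, dst) pairs.
        indegree = {n: 0 for n in nodes}
        successors = {n: [] for n in nodes}
        for src, dst in edges:
            successors[src].append(dst)
            indegree[dst] += 1
        ready = deque(n for n in nodes if indegree[n] == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for succ in successors[node]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        if len(order) != len(nodes):
            raise ValueError("cycle detected; graph cannot be sequenced")
        return order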
[0175] At step 1706, certain matrix and vector operations may be
optimized, minimized, or eliminated altogether. For example, matrix
transpositions may be eliminated by propagating the operation back
to a DataNode. These optimizations may be used to avoid
computationally expensive operations (which may not be supported on
all hardware types).
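For instance, a transpose-elimination pass might fold an explicit
transposition into a layout flag on the underlying DataNode, as in
the hypothetical sketch below (the dict-based node representation
and field names are assumptions for illustration only).

    def eliminate_transposes(nodes):
        # Fold each explicit "transpose" OpNode into its input DataNode
        # by toggling a layout flag; the OpNode is then marked for
        # pruning during compilation.
        for node in nodes:
            if node.get("op") == "transpose":
                data = node["input"]
                data["transposed"] = not data.get("transposed", False)
                node["pruned"] = True

    # Example: the transposition is absorbed into the weight DataNode.
    weights = {"kind": "DataNode", "shape": (64, 128)}
    transpose_op = {"op": "transpose", "input": weights}
    eliminate_transposes([transpose_op])
    print(weights["transposed"], transpose_op["pruned"])   # True True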
[0176] At step 1708, shift amounts may be standardized. For
example, out-of-range shift amounts may be clipped or corrected.
Standardized shift logic can reduce specialized logic for corner
cases.
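As one hypothetical illustration, such a pass may simply clamp each
shift amount into the range supported by the target; the range of
-31 to 31 below is an assumed value, not a property of any
particular hardware.

    def standardize_shift(amount, min_shift=-31, max_shift=31):
        # Clip an out-of-range shift amount into the supported range so
        # that no corner-case logic is needed downstream.
        return max(min_shift, min(max_shift, amount))

    print(standardize_shift(40))    # 31
    print(standardize_shift(-100))  # -31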
[0177] At step 1710, large OpNodes and their associated DataNodes
may be split into smaller shards that can be placed on different
cores (at step 1712), allowing for better memory balance and
execution time for large operations.
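By way of illustration, a large matrix-vector product may be split
into row-wise shards whose partial results are concatenated, as in
the following sketch (the shapes and shard count are illustrative
assumptions).

    import numpy as np

    def shard_matvec(matrix, num_shards):
        # Split a matrix-vector product into row-wise shards; each shard
        # computes an independent slice of the output vector and can be
        # placed on a different core.
        return np.array_split(matrix, num_shards, axis=0)

    # Example: a 1000x256 weight matrix split across 4 cores.
    W = np.random.randn(1000, 256)
    x = np.random.randn(256)
    shards = shard_matvec(W, 4)
    y = np.concatenate([shard @ x for shard in shards])
    assert np.allclose(y, W @ x)    # same result as the unsplit product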
[0178] At step 1712, each OpNode is assigned to a core. The amount
of memory and computational work assigned to each core may be
balanced. The partitions may be tiled in a way that minimizes the
number of hops for each communication. As previously alluded to,
different OpNode placements result in different performance;
different mappings may be assessed according to the metrics
described above (e.g., energy efficiency, packing efficiency,
computational latency, etc.).
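One hypothetical heuristic for balancing memory and computational
work across cores is a greedy largest-first assignment, sketched
below; it illustrates only the load-balancing objective, ignores
hop-count minimization, and assumes a simple scalar cost per node.

    import heapq

    def assign_to_cores(op_costs, num_cores):
        # Greedy load balancing: give the next-largest OpNode to the
        # currently least-loaded core. op_costs maps node id -> cost
        # (e.g., a weighted sum of memory words and compute cycles).
        heap = [(0.0, core) for core in range(num_cores)]
        heapq.heapify(heap)
        placement = {}
        for node, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
            load, core = heapq.heappop(heap)
            placement[node] = core
            heapq.heappush(heap, (load + cost, core))
        return placement

    print(assign_to_cores({"mv0": 8.0, "mv1": 5.0, "add0": 1.0}, 2))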
[0179] At step 1714, communication nodes are inserted at
core-to-core and core-to-chip input/output (IO) boundaries.
Communication nodes may be placeholders which are later expanded
into communication instructions. DataNodes falling on a chip
boundary may be replicated on both sides (where each core gets its
own copy of the data).
[0180] At step 1716, DataNodes are assigned to the cores that host
their neighboring OpNodes. In an alternative
embodiment, this assignment may occur in a combined pass with step
1712.
[0181] While the foregoing discussion is presented in the context
of a sequential order, it is appreciated that multiple iterations
of partitioning and placement may be used in a trial-and-error
manner to identify a suitable partitioning/placement. In each
pass, the code/representation may be assessed multiple times,
potentially with different assessment heuristics (e.g., energy
efficiency, packing efficiency, computational latency, etc.).
[0182] Returning back to sub-step 1518 of FIG. 15, each core's
sub-graph is optimized to prepare for code generation. As large
operations were previously (spatially) split across cores, large
operations within one core may be further (temporally) split across
multiple threads to enable faster operation. In one exemplary
embodiment, the sub-graph nodes can be user annotated with
scheduling and/or timing hints to smooth code generation. In one
such embodiment, a designer may assign data (or DataNodes) to
specific memory banks during code generation.
[0183] FIG. 18 is a logical flow diagram of one exemplary
implementation of the placement optimization sub-step 1518 of FIG.
15.
[0184] At step 1802, neural network operations are split into
multiple parallel operations, each meant to be executed by a single
thread. In one exemplary embodiment, each thread may be allocated
its own data path to reduce execution time, as discussed in U.S.
patent application Ser. No. ______, filed ______ and entitled
"METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE
NEURAL NETWORKS", previously incorporated herein by reference in
its entirety. As described therein, threads may run independently
of one another, without any centralized scheduling and/or resource
locking (e.g., semaphore signaling, critical path execution, etc.).
Decoupling thread dependencies allows cores to execute threads
asynchronously.
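As a simple illustration, the rows of a large per-core operation
might be partitioned into contiguous slices, one per thread, as in
the sketch below; the row-wise split is an assumption made purely
for illustration.

    def split_rows_across_threads(num_rows, num_threads):
        # Partition the rows of an operation into contiguous slices,
        # one per thread, so each thread can run without waiting on
        # the others.
        base, extra = divmod(num_rows, num_threads)
        slices, start = [], 0
        for t in range(num_threads):
            length = base + (1 if t < extra else 0)
            slices.append(range(start, start + length))
            start += length
        return slices

    # Example: 10 output rows over 4 threads -> 3, 3, 2, 2 rows each.
    print([list(s) for s in split_rows_across_threads(10, 4)])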
[0185] In some embodiments, DataNodes for parameter matrices may be
split and/or copied so they can be accessed without contention.
Similarly, parallelized OpNodes (and their corresponding DataNodes)
cannot concurrently access the same resources. Ideally, conflicts
can be avoided, but where a conflict must occur the parallelism
will be limited (the instructions of either thread must directly,
or indirectly, be serialized due to the resource conflict).
[0186] At step 1804, a sparsifying pass is performed. Dense data
operations (e.g., dense matrix by dense vector products) may be
converted to sparse data operations (e.g., sparse matrix by sparse
vector products), if the sparsity and average sparsity values of
the original DataNodes make it advantageous. In one embodiment,
sparsification occurs even when only the matrix or only the vector
(but not both) is sparse; the operand that was originally dense may
be stored inefficiently in the sparse format, yet the sparse matrix
by sparse vector product may still offer an overall optimization.
[0187] For example, as described within U.S. patent application
Ser. No. ______, filed ______ and entitled "METHODS AND APPARATUS
FOR MATRIX AND VECTOR STORAGE AND OPERATIONS", previously
incorporated herein by reference in its entirety, matrices and
vectors may be tagged as sparse or dense depending on their
contents. The matrix-vector multiply math can either be performed
with an instruction designed for dense data or an instruction
designed for sparse data. In one specific implementation, the
tagging directs which instructions are used to compute the math on
the chip at run-time.
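The following sketch illustrates the underlying idea as a run-time
dispatch: operands are tagged by measured density, and the product
is routed to either a dense or a sparse kernel. The density
threshold and the use of SciPy's CSR format are illustrative
assumptions; actual break-even points depend on the hardware's
sparse storage format.

    import numpy as np
    from scipy.sparse import csr_matrix

    def density(x):
        return np.count_nonzero(x) / x.size

    def matvec(matrix, vector, sparse_threshold=0.25):
        # Tag operands by density and dispatch to the matching kernel.
        # The 0.25 threshold is an illustrative assumption only.
        if min(density(matrix), density(vector)) < sparse_threshold:
            return csr_matrix(matrix) @ vector   # "sparse" instruction path
        return matrix @ vector                   # "dense" instruction path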
[0188] Referring now to sub-step 1520 of FIG. 15, software code for
each core's sub-graph is generated. In one specific implementation,
each core's sub-graph is allocated to memory banks (using hints
provided during optimization) and converted into machine-readable
instructions. In one such implementation, communication code
generation and thread control passes may also be added to the
program(s).
[0189] FIG. 19 is a logical flow diagram of one exemplary
implementation of the code generation sub-step 1520 of FIG. 15.
[0190] At step 1902, OpNodes and CommNodes are assigned to threads,
based on grouping rules. For example, chains of operations in the
node graph may be grouped together and sequentially executed.
Parallel chains of operations may be executed concurrently. In some
cases, a single chain may be split, or multiple chains may be
sequenced e.g., to improve performance, reduce core utilization,
etc. For instance, certain operations may be re-ordered to save on
loads and stores by keeping values in the accumulator instead of
writing out to memory.
[0191] In one embodiment, CommNodes mark core-to-core communication
boundaries. Unidirectional inter-core communication may further
specify whether a CommNode sends or receives data; e.g., each
CommNode may include an attribute that encodes whether it is a
`send` or `recv` node. CommNodes may be inserted into the node
graph at inter-core boundaries after partitioning (see step 1516).
During code generation, these nodes are used to construct the
communication threads that contain SEND, RECV, and RDY
instructions.
[0192] In some embodiments, CommNodes may additionally support
other communication protocols and/or inter-chip communications. For
example, an IONode may be used to communicate across a chip
boundary (to another chip). While unidirectional communication
(send/receive) is disclosed, bi-directional, multi-cast, and/or
broadcast communication may be substituted with equal success by
artisans of ordinary skill, given the contents of the present
disclosure.
[0193] At step 1904, data is padded to a multiple of the pencil
size. This pass may be used in some embodiments, particularly where
there is no sub-word indexing in the instruction set architecture.
For example, a length-5 vector may be padded into a length-8
vector, with the 3 final elements not being used in the
computation. Unused elements consume memory storage, but reduce
addressing complexity; thus, different pencil dimensions may be
assigned in accordance with overall design considerations (e.g.,
energy efficiency, packing efficiency, computational latency,
etc.).
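A minimal sketch of such a padding pass, using the length-5 to
length-8 example above (i.e., an assumed pencil size of four
elements), is provided below.

    import numpy as np

    def pad_to_pencil(vector, pencil=4):
        # Zero-pad a vector up to the next multiple of the pencil size
        # so that no sub-word indexing is required; the padded elements
        # are simply not used in the computation.
        remainder = len(vector) % pencil
        if remainder == 0:
            return vector
        return np.pad(vector, (0, pencil - remainder))

    # The length-5 vector from the text becomes a length-8 vector.
    print(pad_to_pencil(np.arange(5.0)))   # [0. 1. 2. 3. 4. 0. 0. 0.]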
[0194] At step 1906, the sizes of DataNodes and TableNodes are
computed in units of memory words. This may be used later when
generating a bank assignment for all variables at step 1908.
[0195] At step 1908, DataNodes and TableNodes are assigned to banks
of their respective memory types. In some embodiments, the
assignment uses the variable sizes computed at step 1906. This pass
tries to respect the
thread concurrencies that were assigned to DataNodes previously
(see step 1518 of FIG. 15), while also attempting to balance the
memory assigned to each bank (maximizing packing, minimizing
fragmentation).
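One hypothetical bank-assignment heuristic places the largest
variables first, each into the currently least-full bank, as
sketched below; it illustrates only the balancing objective, and the
variable names, sizes, and bank count are assumptions for
illustration.

    def assign_banks(variable_sizes, num_banks):
        # Greedy bank assignment: place the largest variables first,
        # each into the currently least-full bank, to balance occupancy
        # and limit fragmentation. variable_sizes maps name -> words.
        occupancy = [0] * num_banks
        assignment = {}
        for name, words in sorted(variable_sizes.items(),
                                  key=lambda kv: -kv[1]):
            bank = occupancy.index(min(occupancy))
            assignment[name] = bank
            occupancy[bank] += words
        return assignment

    print(assign_banks({"weights": 512, "state": 128, "bias": 16}, 2))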
[0196] At step 1910, assembly code is generated for the OpNodes in
each thread. In code generation, an object-oriented representation
of assembly language is generated for each of the threads to which
the operation nodes were assigned during optimization (see step 1518).
In addition to the "arithmetic" part of the code (declaration of
data variables and emission of instructions that perform the
various operations), "thread control" instructions may also be
created to ensure the correct concurrent control flow.
[0197] After initializing the assembly code program object, the
system begins by declaring the data variables associated with each
core's DataNodes (using the computed bank assignments). Next, the
compiler generates a code snippet for each OpNode based on the
node's constants as well as the DataNodes it is attached to. In
some embodiments, the compiler may be agnostic to accumulator
state; in such cases, the code snippets are emitted with the
maximum set of loads and stores (e.g., making no assumptions about
whether any of the variables needed by the OpNode are already in
the accumulator). In another embodiment, the compiler may optimize
the instruction order and keep track of accumulator residency
operation-by-operation to optimize and reduce unnecessary loads and
stores.
[0198] At step 1912, assembly code is generated for CommNodes. Each
node's code is emitted into its own dedicated thread. Receive nodes
may also add "special thread" sections to provide for subsequent
inter-core communication flexibility.
[0199] At step 1914, inter-thread control passes are performed. In
one specific implementation, the compiler inserts scoreboard (SB),
sleep (SLEEP), and jump (JUMP) commands to implement
thread-to-thread sequencing. Inter-thread communication is
described in greater detail within U.S. patent application Ser. No.
______, filed ______ and entitled "METHODS AND APPARATUS FOR
THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS", previously
incorporated herein by reference in its entirety. In one such
implementation, a "thread graph" is constructed with one node per
thread. The thread graph contains an edge from one thread node to
another whenever an OpNode in the first thread has a data or control
edge that terminates in the second thread. The number of inbound
edges is each thread's initial score; before it sleeps, each thread
decrements the scoreboard of each of its successors in the thread
graph by one and restores its own score to the initial value. Other
schemes for inter-thread control may be substituted with equal
success.
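A minimal sketch of deriving the thread graph and each thread's
initial scoreboard score from cross-thread dependencies is provided
below; the data structures are assumptions, and whether parallel
edges between the same pair of threads are collapsed is likewise an
assumption of this sketch.

    def build_thread_graph(op_thread, edges):
        # op_thread maps each OpNode to its thread; edges is a list of
        # (src_op, dst_op) data/control dependencies. An edge is added
        # between two thread-graph nodes whenever a dependency crosses
        # threads; a thread's initial score is its count of inbound
        # thread-graph edges.
        thread_edges = set()
        for src, dst in edges:
            t_src, t_dst = op_thread[src], op_thread[dst]
            if t_src != t_dst:
                thread_edges.add((t_src, t_dst))
        initial_score = {t: 0 for t in set(op_thread.values())}
        for _, t_dst in thread_edges:
            initial_score[t_dst] += 1
        return thread_edges, initial_score

    # Example: ops "a" and "b" in threads 0 and 1 both feed op "c" in
    # thread 2, so thread 2 starts with a score of 2.
    print(build_thread_graph({"a": 0, "b": 1, "c": 2},
                             [("a", "c"), ("b", "c")]))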
[0200] Returning to FIG. 15, at step 1522 the resulting assembly
code is optimized at the instruction-level. For example, the
assembly code is modified to remove inefficiencies (e.g.,
unnecessary load and store instructions). The assembly code may be
checked to ensure correct operation of the pass.
[0201] At step 1524, a behavioral simulator may be run which may
allow for verification of the operation programs and estimation of
the physical costs (energy, area, time) of the program. The
behavioral simulator may include a module to model operation of a
program and track operation counts and approximate hardware
concurrency. The behavioral simulator loads the generated assembly
code and a hardware configuration description, runs test inputs
through the simulation, and extracts metrics for the input pattern
and the given hardware configuration. Output may include estimated
area,
energy, and latency metrics.
[0202] At step 1526, machine code is generated by an assembler. In
some embodiments, the generated code is a binary executable that
can be run on the optimized hardware (such as architecture 200). In
some embodiments, one or more listing files are created which may
contain information about data memory, table memory, instruction
memory, and the symbol table. This information may be helpful for
debugging and analysis.
[0203] At step 1528, the generated machine code (and associated
data) may be placed and run on the hardware e.g., a System on a
Chip (SoC), an FPGA, or printed circuit board (PCB).
[0204] It will be appreciated that the various ones of the
foregoing aspects of the present disclosure, or any parts or
functions thereof, may be implemented using hardware, software,
firmware, tangible and non-transitory computer-readable or
computer-usable storage media having instructions stored thereon,
or a combination thereof, and may be implemented in one or more
computer systems.
[0205] It will be apparent to those skilled in the art that various
modifications and variations can be made in the disclosed
embodiments of the disclosed device and associated methods without
departing from the spirit or scope of the disclosure. Thus, it is
intended that the present disclosure covers the modifications and
variations of the embodiments disclosed above provided that the
modifications and variations come within the scope of any claims
and their equivalents.
* * * * *