U.S. patent application number 16/192924 was filed with the patent office on 2018-11-16 and published on 2019-05-23 for systems and methods for exchange of data in distributed training of machine learning algorithms.
The applicant listed for this patent is Neuralmagic Inc. Invention is credited to Alexander MATVEEV and Nir Shavit.
Application Number: 16/192924
Publication Number: 20190156214
Family ID: 66533132
Published: 2019-05-23

United States Patent Application 20190156214
Kind Code: A1
MATVEEV, Alexander; et al.
May 23, 2019
SYSTEMS AND METHODS FOR EXCHANGE OF DATA IN DISTRIBUTED TRAINING OF
MACHINE LEARNING ALGORITHMS
Abstract
Systems and methods may make exchanging data in a neural network
(NN) during training more efficient. Exchanging weights among a
number of processors training a NN across iterations may include
sorting generated weights, compressing the sorted weights, and
transmitting the compressed sorted weights. On each Kth iteration a
sort order of the sorted weights may be created and transmitted.
Exchanging weights among processors training a NN may include
executing a forward pass to produce a set of loss values for
processors, transmitting loss values to other processors, and at
each of the processors, performing backpropagation on at least one
layer of the NN using loss values received from other
processors.
Inventors: MATVEEV, Alexander (Cambridge, MA); Shavit, Nir (Cambridge, MA)

Applicant: Neuralmagic Inc., Somerville, MA, US

Family ID: 66533132

Appl. No.: 16/192924

Filed: November 16, 2018
Related U.S. Patent Documents

Application Number 62588324, filed Nov 18, 2017
Application Number 62588970, filed Nov 21, 2017
Current U.S. Class: 1/1

Current CPC Class: G06K 9/00624 20130101; G06N 3/0445 20130101; G06N 3/0481 20130101;
G06N 3/084 20130101; G06N 3/04 20130101; G06N 3/0454 20130101; G06K 9/6262 20130101;
G06K 9/66 20130101; G06N 20/10 20190101; G06N 5/046 20130101; G06K 9/6271 20130101

International Class: G06N 3/08 20060101 G06N003/08; G06N 5/04 20060101 G06N005/04;
G06N 20/10 20060101 G06N020/10
Claims
1. A method of conducting training on a neural network (NN), the NN
comprising neurons arranged into layers, the method comprising: at
each of a plurality of processors, executing a forward pass on the
NN to produce a set of loss values for the processor; at each of
the plurality of the processors, transmitting the set of loss
values to a set of processors of the plurality of processors; and
at each of the plurality of the processors, performing
backpropagation on at least one layer of the NN using loss values
received from a set of processors of the plurality of processors,
wherein the ratio of compute burden of the at least one layer on
which backpropagation is performed to the compute burden for the
other layers in the NN is less than the ratio of the number of
weights for the at least one layer on which backpropagation is
performed to the number of weights for the other layers in the
NN.
2. The method of claim 1, wherein the at least one layer on which
backpropagation is performed has associated with the layer a larger
amount of weight values than another layer in the NN.
3. The method of claim 1, wherein the at least one layer on which
backpropagation is performed is a fully connected layer.
4. The method of claim 1, wherein the NN comprises at least one
fully connected layer, at least one input layer, and at least one
convolution layer.
5. The method of claim 1, comprising, at each of the plurality of
processors, prior to performing backpropagation using loss values
received from a set of processors, performing backpropagation using
the loss values produced by the processor.
6. The method of claim 1, wherein performing backpropagation on at
least one layer of the NN using loss values received from a set of
processors of the plurality of processors comprises, for each
processor for which loss values are received, performing a separate
backpropagation operation.
7. The method of claim 1, wherein performing backpropagation on at
least one layer of the NN using loss values received from a set of
processors of the plurality of processors comprises performing
backpropagation down to, including, but not beyond a fully
connected layer.
8. The method of claim 1, comprising at each of a plurality of the
processors receiving parameters from a master node and updating a
NN model stored by the processor.
9. A method of conducting training on a model of a neural network
(NN), the NN arranged into layers, the method comprising, at each
of a plurality of nodes: receiving training data to produce a set
of local losses; sending the losses to other nodes of the plurality
of nodes; for at least a first layer, performing backpropagation
using local losses and losses received from other nodes of the
plurality of nodes; and for at least a second layer, performing
backpropagation using local losses and not using losses received
from other nodes of the plurality of nodes, wherein the ratio of
compute burden of the at least one layer on which backpropagation
is performed to the compute burden for the other layers in the NN
is less than the ratio of the number of weights for the at least
one layer on which backpropagation is performed to the number of
weights for the other layers in the NN.
10. The method of claim 9, wherein the first layer is a fully
connected layer.
11. The method of claim 9, wherein the NN comprises at least one
fully connected layer, at least one input layer, and at least one
convolution layer.
12. The method of claim 9, wherein performing backpropagation using
losses received from other nodes comprises, for each node for which
losses are received, performing a separate backpropagation
operation.
13. The method of claim 9, comprising at each of a plurality of
nodes receiving parameters from a master node and updating the NN
model.
14. A system for conducting training on a neural network (NN), the NN
comprising neurons arranged into layers, the system comprising: a
plurality of nodes, each node comprising a memory and a processor
configured to: execute a forward pass on the NN to produce a set of
loss values for the node; transmit the set of loss values to a set
of nodes of the plurality of nodes; and perform backpropagation on
at least one layer of the NN using loss values received from a set
of nodes of the plurality of nodes.
15. The system of claim 14, wherein the at least one layer on which
backpropagation is performed has associated with the layer a larger
amount of weight values and a smaller amount of weight compute
burden than all other layers in the NN cumulatively.
16. The system of claim 14, wherein the at least one layer on which
backpropagation is performed is a fully connected layer.
17. The system of claim 14, wherein the NN comprises at least one
fully connected layer, at least one input layer, and at least one
convolution layer.
18. The system of claim 14, wherein at each node the processor is
configured to, prior to performing backpropagation using loss
values received from a set of nodes, perform backpropagation using
the loss values produced by the processor.
19. The system of claim 14, wherein performing backpropagation on
at least one layer of the NN using loss values received from a set
of nodes comprises, for each node for which loss values are
received, performing a separate backpropagation operation.
20. A method of conducting training of a neural network (NN), the
NN arranged into layers and represented as a NN model, the method
comprising: at each of a plurality of non-master nodes: receiving
training data to produce a set of losses, sending the losses to a
master node; at the master node: for a subset of the NN layers,
performing backpropagation using the losses; updating the NN model;
and transmitting the NN model to the non-master nodes.
Description
RELATED APPLICATION DATA
[0001] This application claims benefit from U.S. provisional patent
application 62/588,970, filed on Nov. 21, 2017 and entitled "A
Lossless Compression-Based Method for Reducing Network Traffic in
Distributed Training of Machine Learning Algorithms", and this
application claims benefit from U.S. provisional patent application
62/588,324, filed on Nov. 18, 2017 and entitled "A Method for
Reducing Network Traffic for Distributed Training of Neural
Networks with Fully Connected Layers" each incorporated herein by
reference in its entirety.
FIELD OF THE INVENTION
[0002] The invention relates generally to machine learning;
specifically to training neural networks using distributed
systems.
BACKGROUND
[0003] Neural networks (NN) or connectionist systems are computing
systems inspired by biological computing systems, but operating
using manufactured digital computing technology. NNs are made up of
computing units typically called neurons (which are artificial
neurons, as opposed to biological neurons) communicating with each
other via connections, links or edges. In common NN
implementations, the signal at the link between artificial neurons
may be for example a real number, and the output of each neuron may
be computed by a function of the (typically weighted) sum of its
inputs, such as the ReLU rectifier function. NN links or edges
typically have a weight that adjusts as learning proceeds. The
weight increases or decreases the strength of the signal at a
connection. Typically, NN neurons are divided or arranged into
layers, where different layers may perform different kinds of
transformations on their inputs and may have different patterns of
connections with other layers. Typically, a higher or upper layer,
or a layer "above" another layer, is a layer more towards the
output layer, and a lower layer, preceding layer, or a layer
"below" another layer, is a layer towards the input layer.
[0004] Such systems may learn to perform tasks by considering
example input data, generally without being programmed with any
task-specific rules, being presented with the correct output for
the data, and self-correcting. During learning the NN may execute a
forward-backward pass where in the forward pass the NN is presented
with an input and produces an output, and in the backward pass
(backpropagation) the NN is presented with the correct output,
generates an error (e.g., a "loss"), and generates update gradients
which are used to alter the weights at the links or edges.
[0005] Various types of NNs exist. For example, a convolutional
neural network (CNN) is a deep, feed-forward network, which
includes one or more convolutional layers, fully connected layers,
and pooling layers. CNNs are particularly useful for visual and
speech applications. Other NNs include for example long short-term
memory (LSTM) networks.
[0006] In practice, a NN, or NN learning, is simulated by one or
more computing nodes or cores, such as generic central processing
units (CPUs, e.g. as embodied in personal computers) or graphics
processing units (GPUs such as provided by Nvidia Corporation),
which may be connected by a data network. A collection of such
connected computers may be termed a pod, and computers used with
NNs may be single socket (e.g. one main processor) or multi-socket
(e.g. multiple processors in one machine, sharing some memory). One
or more computing nodes may model a NN using known data structures.
During inference, the trained NN may for example recognize or
categorize images, perform speech processing, or other tasks.
[0007] A NN may be modeled as an abstract mathematical object, such
as a function. A NN may be translated physically to CPU or GPU as
for example a sequence of matrix operations where entries in the
matrix represent neurons (e.g. artificial neurons connected by
edges or links) and matrix functions represent functions of the
NN.
[0008] During learning, the NN, or the computing nodes modeling the
NN, may be presented with training data. For example, in an image
recognition application, a NN may learn to identify images that
contain cats by analyzing example images that have been manually
labeled as "cat" or "not a cat" and using the results to identify
cats in other images. The NN may do this without any prior
knowledge about cats, e.g., that they have fur, tails, whiskers and
cat-like faces. Instead, during learning the NN automatically
generates identifying characteristics from the learning material
that it processes.
[0009] One method of training in a NN is data parallel learning,
where (typically via a master node or core), the data or training
sets are divided, and each core or node operates on the same NN,
using forward and backward passes, on only a portion of the data
independently, and after each forward/backward pass the nodes or
cores exchange parameters (e.g. weights or gradients) with each
other, or send them to the master, to come up with the right
parameters for the iteration. For example, on each iteration, a
master node may send one different image, or a set of images, and
the same model of the NN, to each of four CPUs. Each CPU may
execute a forward and backward pass over all layers of the model on
its specific image, and send the resulting parameters to the
master, which then creates an updated model from the parameters
sent by all four CPUs. Each node or processor may at times store a
different version (with different parameters) of the same NN.
[0010] When a node communicates its resulting weights over the
network to other nodes after an iteration of training, a large
amount of data may need to be sent. For example, in the
data-parallel convolutional NN training approach, each node
executes the full machine learning model on a subset of examples,
so the number of parameters a node needs to communicate may be the
same as the model size. For example, in case of AlexNet CNN, there
may be 220 MB of parameters, and if 10 nodes operate on the data,
220 MB*10=2.2 GB of parameters are transferred in both directions
over the network for each iteration. Network bottlenecks may slow
the learning process. High bandwidth interconnections may be used
to speed data transfer, but such systems are expensive compared to
lower bandwidth networks, such as an Ethernet network.
[0011] In some NNs, a loss, inconsistency or error value may be
calculated at the output or at an output layer, with possibly
multiple loss values being created, e.g. one for each node in an
output layer. The output layer or set of layers typically is or
includes a fully connected (FC) layer, where each neuron in the
layer accepts an input, edge or link from each neuron/output of a
lower or preceding layer (e.g., a layer closer to the input). This
fully connected layer is an example of a layer where the number of
weights is high (because there may be a link between every input
neuron and every output neuron) and yet the layer has a relatively
low amount of compute, because the computation as a whole may be
equivalent to a matrix multiply rather than a convolution. A loss
for a network may represent the difference or inconsistency between
the value or values output from the network, and the correct
value/values that should be output given the data input to the NN.
A loss value may be, for example, a negative log-likelihood or
residual sum of squares, but may be computed in another manner. In
NN learning, it is desired to minimize loss, and after receiving a
loss the NN model may be updated by modifying weight values in the
network using backpropagation.
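For illustration only (not part of the application's disclosure), the following is a minimal sketch of one such loss computation, a negative log-likelihood evaluated on hypothetical softmax outputs of a classification NN:

    import numpy as np

    def negative_log_likelihood(probabilities, true_class):
        # probabilities: the NN's softmax output for one example
        # true_class: index of the correct class for that example
        return -np.log(probabilities[true_class])

    probs = np.array([0.7, 0.2, 0.1])           # hypothetical 3-class output
    print(negative_log_likelihood(probs, 0))    # small loss: correct class is likely
    print(negative_log_likelihood(probs, 2))    # larger loss: correct class is unlikely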
SUMMARY
[0012] Systems and methods of the present invention may make
exchanging data in a neural network (NN) during training more
efficient. Exchanging weights among a number of processors training
a NN across iterations may in some embodiments include sorting
generated weights, compressing the sorted weights, and transmitting
the compressed sorted weights. On each Kth iteration a sort order
of the sorted weights may be created and transmitted. Embodiments
may exchange weights among processors training a NN by executing a
forward pass to produce a set of loss values for processors,
transmitting loss values to other processors, and at each of the
processors, performing backpropagation on at least one layer of the
NN using loss values received from other processors.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Non-limiting examples of embodiments of the disclosure are
described below with reference to figures attached hereto that are
listed following this paragraph. Dimensions of features shown in
the figures are chosen for convenience and clarity of presentation
and are not necessarily shown to scale.
[0014] The subject matter regarded as the invention is particularly
pointed out and distinctly claimed in the concluding portion of the
specification. The invention, however, both as to organization and
method of operation, together with objects, features and advantages
thereof, can be understood by reference to the following detailed
description when read with the accompanied drawings. Embodiments of
the invention are illustrated by way of example and not limitation
in the figures of the accompanying drawings, in which like
reference numerals indicate corresponding, analogous or similar
elements, and in which:
[0015] FIG. 1A is a block diagram of a neural network according to
an embodiment of the present invention.
[0016] FIG. 1B is a block diagram of a neural network according to
an embodiment of the present invention.
[0017] FIG. 1C is a block diagram of a system for training a neural
network according to an embodiment of the present invention.
[0018] FIG. 2 is a high-level block diagram of an exemplary
computing device which may be used with embodiments of the present
invention.
[0019] FIG. 3 is a flowchart of a method according to embodiments
of the present invention.
[0020] FIG. 4 depicts a prior art process for training using a
multi-node system.
[0021] FIG. 5 is a flowchart of a method according to embodiments
of the present invention.
[0022] FIG. 6 is a diagram depicting a method of exchanging weights
among processors according to embodiments of the present
invention.
[0023] It will be appreciated that for simplicity and clarity of
illustration, elements shown in the figures have not necessarily
been drawn accurately or to scale. For example, the dimensions of
some of the elements can be exaggerated relative to other elements
for clarity, or several physical components can be included in one
functional block or element.
DETAILED DESCRIPTION
[0024] In the following detailed description, numerous specific
details are set forth in order to provide a thorough understanding
of the invention. However, it will be understood by those skilled
in the art that the present invention can be practiced without
these specific details. In other instances, well-known methods,
procedures, and components, modules, units and/or circuits have not
been described in detail so as not to obscure the invention.
[0025] Embodiments of the invention include systems and methods
that may reduce the amount of data communicated during the training
process of NNs (e.g. convolutional neural networks, or other
networks) using a system including multiple nodes such as CPUs
connected via a relatively slow connection such as an Ethernet or
similar inexpensive network. CPUs, if used, may contain multiple
cores, so that certain tasks may be done in parallel or
concurrently: for example transmitting or receiving data, sorting,
compression, portions of a backward or forward pass, etc. However,
embodiments of the invention are applicable to other, non-NN tasks,
for transferring large amounts of data. While a CNN is discussed as
an example NN used, embodiments of the invention may be used with
other NNs, such as LSTMs. Further, while CPU based machines are
discussed, GPUs or other types of processors may be used.
Embodiments of the present invention may be used with pods, and
single socket or multi-socket systems, or other types of
systems.
[0026] Embodiments of the invention may take advantage of the
computational properties of a NN such as a CNN to distribute the
computation and thus reduce the overall communication. Loss values
may be transmitted by nodes to a master node or other nodes, which
may use the loss values to calculate gradients and/or weights to
modify the model. The computation of these parameters may be
relatively computationally easy, e.g., have a low computational
burden relative to other layers, as in the case of an FC layer,
where the computation per output node is a simple dot product of its weights.
[0027] In contrast, in this same FC layer, the number of weights (and similarly the
gradient values per weight) is high relative to convolutional
layers since in an FC layer each node may receive a link or edge
from each node in its input layer. This number is even larger when
compared to the network's overall number of loss values which is
usually the number of outputs the NN has.
[0028] Some prior techniques use compression to reduce data size of
data transmitted among nodes; however, such techniques achieve only
a lossy reduction, e.g. reducing the granularity or accuracy of
data on decompression. Such lossy compression might increase
convergence time (e.g., where the NN converges to a state where the
error of the calculations is small) or even preclude convergence at
all. In some embodiments of the present invention, the
computational properties of the weight distributions during NN
training contribute to improving compression and distribution of
the weights, and thus reduce the overall communication overheads,
with, in some cases, no loss of accuracy (e.g. using lossless
compression). The distribution or transmission of other parameters,
such as loss values or gradients may also be made more
efficient.
[0029] FIG. 1A is a simplified block diagram of a NN according to
an embodiment of the present invention; in typical use thousands of
neurons and links are used. NN 1000 may input data as for example
an input vector 1010 of values (representing, e.g. a photograph,
voice recording, or any sort of data), and may produce an output of
signals or values, for example output vector 1020. NN 1000 may have
neurons arranged into layers 1030, each including neurons 1040
connected to other neurons by links or edges 1050. FIG. 1B is a
block diagram of a neural network according to an embodiment of the
present invention. NN 1100 may input data, for example image 1110
(e.g. an input vector, matrix or other data) and may produce an
output of signals or values, for example output vector 1120, which
may for example indicate the content of or a description of the
image. Other input data may be analyzed. NN 1100 may in one example
have layers 1130 (convolution), 1132 (pooling), 1134
(convolution), 1136 (pooling), and one or more output layers 1138,
which may include for example an FC layer 1138A and a softmax layer
1138B. Each layer may include neurons connected to other neurons by
links or edges. The NNs in FIGS. 1A and 1B are typically simulated,
and represented as data, for example in a system such as shown in
FIG. 1C, below.
[0030] A convolutional layer may apply a convolution operation to
its input, passing its result to the next layer. The convolution
operation may for example emulate the response of an individual
neuron to visual stimuli, and may for example include neurons
processing data only for its receptive field. A convolutional
layer's parameters may include a set of learnable filters (or
kernels), which have a small receptive field, but extend through
the full depth of the input volume. During the forward pass, each
filter may be convolved across the width and height of the input
volume, computing the dot product between the entries of the filter
and the input and producing a 2-dimensional activation map of that
filter. As a result, the NN may learn filters that activate when
they detect some specific type of feature at some spatial position
in the input. Stacking the activation maps for all filters along
the depth dimension may form the full output volume of the
convolution layer. Every entry in the output volume for a
convolutional layer can thus also be interpreted as an output of a
neuron that looks at a small region in the input and shares
parameters with neurons in the same activation map.
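As a rough illustration of the per-filter dot products described above (a simplified sketch with a single input channel, stride 1 and no padding; the example values are hypothetical and not taken from the application):

    import numpy as np

    def conv2d_single_filter(image, kernel):
        # Slide one small filter over a 2-D input and compute, at each position,
        # the dot product between the filter and the receptive field it covers,
        # producing a 2-D activation map.
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        activation_map = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                activation_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return activation_map

    image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 input
    kernel = np.ones((3, 3)) / 9.0                     # a simple averaging filter
    print(conv2d_single_filter(image, kernel).shape)   # (3, 3) activation map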
[0031] NNs used for classification tasks, e.g. classifying
photographs into descriptions of the content, may produce, for each
class i, an output z_i, sometimes called a logit, which may encode
or represent the likelihood that a given example input should be
classified to class i. Logits z_i, for each class i, (e.g., for
image recognition dog, cat, llama, etc.) may be transformed into
probabilities q_i by comparing each z_i to the other logits, in for
example a softmax layer.
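A minimal sketch of this logit-to-probability transformation (hypothetical values; subtracting the maximum is only for numerical stability and does not change the result):

    import numpy as np

    def softmax(logits):
        # Convert logits z_i into probabilities q_i that sum to 1.
        shifted = logits - np.max(logits)   # stability shift; result unchanged
        exp_z = np.exp(shifted)
        return exp_z / np.sum(exp_z)

    logits = np.array([2.0, 1.0, 0.1])      # hypothetical z_i for three classes
    print(softmax(logits))                   # approximately [0.66, 0.24, 0.10]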
[0032] FIG. 1C is a block diagram of a system for training a neural
network according to an embodiment of the present invention.
Referring to FIG. 1C, a system of computing devices 1 may include
computing nodes 10, 20 and 30, connected by one or more
communications network(s) 40. Communications network 40 may be for
example an Ethernet network, but may be one or more other
communications networks. Node 10 may be a master node, distributing
training data, collecting parameters and creating updated models,
and nodes 20 and 30 may be training nodes, executing forward and
backward passes on training data, sending parameters (e.g. weights
for edges or links) to master node 10, and updating the nodes'
internal representations of the NN after receiving data from the
master node. In alternative embodiments, a training node (e.g. node
20 or 30) may function as a master node. In further embodiments, a
fixed "master" node need not be used, and one or more training
nodes may execute the functionality of a master node. Further,
while only two training nodes are shown, other numbers of nodes may
be used, for example 10 training nodes, 1,028 training nodes, or
other numbers. Other numbers of master nodes may be used, for
example an embodiment may include two master nodes and 16 training
nodes, or 16 nodes total.
[0033] Master node 10 may include data 12, e.g., training sets
(such as collections of images, audio files, etc) and model data 14
representing a NN (e.g. data representing artificial neurons,
links, weights, etc.) and including for example parameters such as
weights, and possibly for example the arrangement of nodes, layers
and edges. Each of nodes 10, 20 and 30 may model the same complete
NN, including neurons, links, weights, etc. as the other nodes, but
each of nodes 20 and 30 may train on a different set of data. Each
node 20 and 30 may model the same NN as master node 10, and may
include for example NN data 22 and 32. The NN may be for example a
CNN, but may be another type of NN. For example, the NN modeled by
NN data 22 and 32 may include an input layer 50, convolution layers
52 and 56, pool layers 54 and 58, a fully connected layer 60, and a
softmax layer 62. Other numbers and types of layers may be used.
The NN made of layers 50-62 may function and be simulated as is
known in the art. A system such as shown in FIG. 1C may execute a
trained NN at inference time, although at inference time such NNs
may be executed by one processing node, e.g. a workstation, PC,
server, etc.
[0034] Nodes may be for example CPU based systems (e.g.
workstations, PCs), GPU based systems, or other systems. In one
example embodiment, master node 10 is a CPU based system and
training nodes may be other systems such as GPU based systems.
Nodes 10, 20 and 30 may be or include structures such as those
shown in FIG. 2. While in some embodiments a generic CPU (e.g. a
workstation, a PC (personal computer), a multi-core system) is
discussed as a node, embodiments of the invention may be used with
other types of nodes, such as GPUs. Further, while example
embodiments of the invention discuss a relatively simple, slow
communications connection between nodes, such as an Ethernet, other
networks or communications systems, such as relatively fast,
expensive, and specially made systems, may be used.
[0035] FIG. 2 shows a high-level block diagram of an exemplary
computing device which may be used with embodiments of the present
invention. Computing device 100 may include a controller or
processor 105 that may be or include, for example, one or more
central processing unit processor(s) (CPU), one or more Graphics
Processing Unit(s) (GPU or GPGPU), a chip or any suitable computing
or computational device, an operating system 115, a memory 120, a
storage 130, input devices 135 and output devices 140. Each of
modules and equipment such as nodes 10, 20 and 30, and other
equipment mentioned herein may be or include a computing device
such as included in FIG. 2, although various units among these
entities may be combined into one computing device.
[0036] Operating system 115 may be or may include any code segment
designed and/or configured to perform tasks involving coordination,
scheduling, arbitration, supervising, controlling or otherwise
managing operation of computing device 100, for example, scheduling
execution of programs. Memory 120 may be or may include, for
example, a Random Access Memory (RAM), a read only memory (ROM), a
Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate
(DDR) memory chip, a Flash memory, a volatile memory, a
non-volatile memory, a cache memory, a buffer, a short term memory
unit, a long term memory unit, or other suitable memory units or
storage units. Memory 120 may be or may include a plurality of,
possibly different memory units. Memory 120 may store for example,
instructions to carry out a method (e.g. code 125), and/or data
such as user responses, interruptions, etc.
[0037] Executable code 125 may be any executable code, e.g., an
application, a program, a process, task or script. Executable code
125 may be executed by controller 105 possibly under control of
operating system 115. For example, executable code 125 may when
executed cause NN training, coordination of NN training tasks, NN
execution or inference, etc. according to embodiments of the
present invention. In some embodiments, more than one computing
device 100 or components of device 100 may be used for multiple
functions described herein. For the various modules and functions
described herein, one or more computing devices 100 or components
of computing device 100 may be used. Devices that include
components similar or different to those included in computing
device 100 may be used, and may be connected to a network and used
as a system. One or more processor(s) 105 may be configured to
carry out embodiments of the present invention by for example
executing software or code. Storage 130 may be or may include, for
example, a hard disk drive, a floppy disk drive, a Compact Disk
(CD) drive, a CD-Recordable (CD-R) drive, a universal serial bus
(USB) device or other suitable removable and/or fixed storage unit.
Data such as instructions, code, NN model data, parameters, etc.
may be stored in a storage 130 and may be loaded from storage 130
into a memory 120 where it may be processed by controller 105. In
some embodiments, some of the components shown in FIG. 2 may be
omitted.
[0038] Input devices 135 may be or may include for example a mouse,
a keyboard, a touch screen or pad or any suitable input device. It
will be recognized that any suitable number of input devices may be
operatively connected to computing device 100 as shown by block
135. Output devices 140 may include one or more displays, speakers
and/or any other suitable output devices. It will be recognized
that any suitable number of output devices may be operatively
connected to computing device 100 as shown by block 140. Any
applicable input/output (I/O) devices may be connected to computing
device 100, for example, a wired or wireless network interface card
(NIC), a modem, printer or facsimile machine, a universal serial
bus (USB) device or external hard drive may be included in input
devices 135 and/or output devices 140.
[0039] Embodiments of the invention may include one or more
article(s) (e.g. memory 120 or storage 130) such as a computer or
processor non-transitory readable medium, or a computer or
processor non-transitory storage medium, such as for example a
memory, a disk drive, or a USB flash memory, encoding, including or
storing instructions, e.g., computer-executable instructions,
which, when executed by a processor or controller, carry out
methods disclosed herein.
[0040] In some NNs, during backpropagation, each neuron computes
its own gradient for a link for the neuron, the gradient to be
applied to adjust the weight of the link. (When discussed herein, a
neuron taking action such as transmitting data, computing data,
etc., may mean that a processor simulating the neuron performs a
computation to simulate such action; e.g. a computing node
simulating a number of neurons may perform the actual action that
is ascribed to the neuron.). A node that is simulating neurons may
collect weights or other parameters and transmit them to a master
node. The master node may receive and collect parameters and
construct a model based on these parameters: e.g. a master node may
collect all weights from nodes, and for each link, average the
weights to produce an updated weight for that link, the weight
being a part of the updated model. Techniques other than averaging
may be used. In data parallel learning, a number of nodes
simulate forward/backward passes on the same NN at the same time
using different data sets: the resulting changes in parameters,
e.g. weights, are sent by each node to a master node which creates
an updated model from the parameters and sends the model back to
the nodes. In some embodiments of the present invention one node
acts as both a node simulating neurons and also the master node for
all nodes. Typically parameters such as weights are represented as
floating point (e.g. 32 bit) numbers, but may be represented in
other ways, such as integers or numbers represented by different
numbers of bits.
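A minimal sketch of the averaging step described above, assuming each node sends its weights as a flat vector in the same agreed-upon edge order (averaging is only one of the combination techniques the text mentions, and the values below are hypothetical):

    import numpy as np

    def average_weights(weight_sets):
        # weight_sets: one weight vector per training node, all in the same
        # agreed-upon edge order; the master averages them per link.
        return np.mean(np.stack(weight_sets), axis=0)

    # Hypothetical weights from three nodes for the same four links.
    node_weights = [
        np.array([0.10, 0.52, -0.31, 0.07]),
        np.array([0.11, 0.50, -0.30, 0.09]),
        np.array([0.09, 0.51, -0.33, 0.08]),
    ]
    updated_model_weights = average_weights(node_weights)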
[0041] In embodiments of the present invention nodes may
communicate parameters such as weights or other parameters with a
master node or other nodes by first sorting or arranging them, for
example according to the values of the parameters, and then
applying a ZIP or a similar lossless compression algorithm (e.g.,
7zip, or another suitable compression algorithm) to the sorted
sequence. Sorting or arranging, e.g. in order of value, to place
similar or same values adjacent to each other in an ordered list or
vector, may allow for improved parameter compression, due to the
nature of compression algorithms like ZIP. Sorted data is typically
easier to compress than unsorted data because sequential values are
in order when data is sorted so their non-negative differences can
be encoded in place of the original values, and repeated values are
all contiguous and can be encoded by including a count along with
the first instance. After decompressing or unpacking the "zipped"
or compressed data the sort order may be used to re-order the data
to its proper order; thus in some embodiments a sort order is
transmitted typically on the first iteration of training, and
periodically on some but importantly not all successive iterations.
In some embodiments, one iteration may include a forward and
backward pass on a batch of multiple inputs, e.g. multiple images,
at one time, after which the model may be updated. The distribution
or transmission of other parameters, such as loss values or
gradients may also be made more efficient by combining sorting with
compression as taught herein; furthermore transmission among nodes
other than a master node may take advantage of such methods. For
example, non-master nodes may exchange loss data by arranging the
data in order of value of individual loss parameters, compressing
the data, transmitting, uncompressing, and rearranging according to
a sort order.
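A minimal sketch of the sort-compress-transmit idea, assuming weights arrive as a flat float32 vector and using zlib as a stand-in for the "ZIP or similar" lossless compressor named in the text:

    import numpy as np
    import zlib

    def sort_and_compress(weights):
        # Record the sort order (indices that put the weights in value order),
        # then losslessly compress the value-ordered sequence.
        sort_order = np.argsort(weights)
        payload = zlib.compress(weights[sort_order].astype(np.float32).tobytes())
        return payload, sort_order

    def decompress_and_restore(payload, sort_order):
        # Undo the compression, then undo the sort using the transmitted order.
        sorted_vals = np.frombuffer(zlib.decompress(payload), dtype=np.float32)
        restored = np.empty_like(sorted_vals)
        restored[sort_order] = sorted_vals
        return restored

    weights = np.random.randn(10000).astype(np.float32)
    payload, order = sort_and_compress(weights)
    assert np.array_equal(decompress_and_restore(payload, order), weights)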
[0042] The sort order itself, as a collection of data, does not
compress well or at all, and thus transmitting the sort order with
each compressed set of parameters would not typically result in
bandwidth savings. Thus, in some embodiments a sorted order or
order of the arranging is sent, which over time (as parameters
change) becomes an approximate sorted order that still suffices for
effective compression, allowing for lossless compression of parameters. In
some embodiments, this approximate sorted order does not need to be
sent with every set of parameters, and thus the cost of sending the
sort order or order of the arranging may be amortized over many
iterations of compressed data being sent. Typically, in each
iteration in-between when the sorted order is created and sent, the
parameters are arranged or sorted to be ordered as the order of the
last sort order created, before compression or "Zipping", and not
according to the actual values of the parameters at that time. Thus
the parameters in the intermediate iterations (between when the
sorted order was created) may be arranged in an order which does
not match that of an actual order sorted by value. The quality of
the compression may be determined by the extent to which the order
of parameters reflects the true sort order of the data; in some
embodiments the effectiveness is greatly helped if the order over
consecutive sequences or iterations of values transmitted does not
change by much. A typical property of the sequences of weights in
consecutive training batches or iterations of a neural network
trained using backward propagation and stochastic gradient descent
is that the differences between the consecutive sequences are small,
since they are determined by the gradients, which are typically
small. Thus, consecutive sequences of
weights from iterations of backward propagation and stochastic
gradient descent have small differences in their sort orders and
small differences between their values, lending themselves to
good compression even based on the sort order of preceding
iterations.
[0043] In some embodiments the sort/compress/transmit sequence
where sorting by value and creating a sort order occurs only
periodically, can be in both directions--e.g. when the master sends
parameters back to slave or "non-master" nodes--and also between
slave nodes. The sort order may be an order that the sender (any
node, master or slave node) creates and sends to the receiver,
typically periodically. If the sort order is shared between two
nodes--e.g. a master and slave node--and one node (e.g. the slave
node) created it originally, the other node (e.g. the master node)
need not create a sort order. However, any sender, master or slave
node, may create a sort order if needed.
[0044] The typical pattern for distributed training of machine
learning algorithms includes for example iterating or
repeating:
[0045] 1) Each node simulating a set of NN neurons executes a
forward-backward pass that calculates or generates new updated
weights of links or edges.
[0046] 2) The system executes a parameter synchronization
algorithm: for example nodes transmit parameters such as their
newly calculated weights to a master node, or to other nodes. A
master node may receive the parameters and update the model, e.g.
by for each link averaging the weights received from nodes.
[0047] 3) Each node may receive a model and may update its
parameters, e.g. update the weights for its links or edges.
[0048] 4) Repeat: the process may repeat starting with operation 1.
The iteration may stop when for example a certain accuracy is
achieved, after a certain number of runs or iterations, when
training data is exhausted, or on other conditions.
[0049] In the second step when a node needs to communicate its
parameters over the network to other nodes, problems may arise. For
example, in the case of a data-parallel convolutional neural
network training approach, each node executes the full machine
learning model on a subset of examples, and thus the number of
parameters a node needs to communicate is the same as the model
size. For example, as discussed, in case of the AlexNet CNN, there
may be for example 220 MB in parameters, and thus in the example
case of 10 nodes, 220 MB*10=2.2 GB of parameters that are
transferred over the network in both directions for each iteration.
If in one case the time it takes to complete an iteration is
approximately 30 ms, 2.2 GB must be transferred in 30 ms over the
network to avoid any network bottlenecks. This requires a 2.2
GB/0.030=73 GB/sec network link, which is greater than the bandwidth of most
reasonably priced interconnections (much more expensive links may
be available, but this requires expense and a specialized data
link). This may prevent the system from scaling.
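A back-of-the-envelope check of the figures above (the model size, node count and iteration time are the example values from the text):

    model_size_mb = 220        # example AlexNet-scale parameter payload per node
    num_nodes = 10
    iteration_time_s = 0.030   # approximately 30 ms per iteration

    total_gb_per_iteration = model_size_mb * num_nodes / 1000.0   # ~2.2 GB
    required_link_gb_per_s = total_gb_per_iteration / iteration_time_s
    print(required_link_gb_per_s)   # ~73 GB/sec to avoid a network bottleneck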
[0050] In one embodiment, nodes may compress parameters such as the
weights generated at each node during the distributed training,
before they are sent over the network. In one embodiment:
[0051] Nodes may sort or order the weights of the links or edges.
Sorting may be in any suitable order, e.g. low to high, or high to
low. Sorting may be based on for example the value of the weights
themselves.
[0052] Nodes may compress their sorted weights by using ZIP or a
lossless compression algorithm.
[0053] Sorting and compressing parameters may work well since there
may be many similar numbers in the sorted sequence among the
parameters for each node which reduces the overall entropy and
allows ZIP type compression algorithms to compress well. However,
the nodes that receive the sorted-and-compressed data should know
the sort-order in order to be able to access the data
appropriately. Typically, without the addition of sorting, the
sending and receiving nodes have a common understanding of the
order of the weights being sent. For example, each edge, link or
weight in the NN may be assigned an address, index or a place in an
ordered list. For example, both the sending and receiving nodes
understand the first weight is for edge X of the network. After
sorting, a new understanding--a sort order--may be sent. Sort-order
or arrangement order information may be for example a table, one
column being a weight, edge or link number or index, and the other
column being the order in the compressed data, for each weight,
edge or link. Sort order or arrangement order information may be
for example a list or vector (e.g. an ordered list of numbers),
where for each ordered entry X the number indicates the place,
position or order, in the compressed list, of the parameter
numbered or indexed X. Other forms for a sort order may be
used.
[0054] Typically, sort-order information does not compress well,
and sorting itself may be computationally expensive. Transmitting
the sort order may be expensive in terms of network bandwidth, and
thus transmitting sort information with each compressed list may be
expensive and may eliminate the advantages of compression. Thus, in
one embodiment of the invention, a sort-order is not generated and
sent for each iteration, but rather only periodically or once every
K'th iteration (K being an integer greater than one), so that the
same sort--order is used across multiple iterations, and the cost
of communicating and/or calculating the sort-order is amortized
over them. K may be fixed as part of the system design, or may
change periodically, over time or from iteration to iteration based
on for example certain conditions.
[0055] In some embodiments, the NN learns and changes its
weights, yet many if not most of the weights do not change by a
large percentage from iteration to iteration of training. Thus the
actual order, from high to low or low to high, of the weights,
changes from iteration to iteration but not by much. Typically,
gradients which are applied to edge or link weights to change the
weights are small. For example, a gradient may be +/-0.0002.
[0056] Thus in one embodiment, a process may include:
[0057] For iteration X:

[0058] a. Each node computing weights or another relevant parameter sorts the weights or other parameters, recording or saving the sort order or index order of the sorted weights, for example in a vector. The node may save or store the sort order locally for future use, as discussed further.

[0059] b. Each node compresses the sorted weights, for example using ZIP or another suitable compression technology, typically lossless, to produce compressed sorted weights.

[0060] c. Each node transmits or sends the compressed or Zipped weights to, for example, a master node.

[0061] d. Each node transmits or sends its sort order to, for example, a master node. The master node decompresses the parameters, and reorders or resorts the parameters to their original order, according to the last sort order or indexing received.

[0062] For iteration X+1 through X+K-1 (not a "create sort" iteration):

[0063] a. Each node computing weights or another relevant parameter places, sorts or orders the weights according to the sort order or indexing order of iteration X, or the sort order last created for that node (each node computing parameters typically has a different sort order). No new sort order is created; thus the sorting is a rearrangement according to a prior sort order, e.g. the last or most recent sort order, as opposed to a sorting based on the value of the weights themselves. Typically, at this point, the parameter list is not fully sorted by value, but it is "almost-sorted", according to the previously computed sort order, so that ZIP or another suitable compression algorithm can benefit from it.

[0064] b. Each node compresses the parameters ordered by its prior sort order (the "sorted parameters").

[0065] c. Each node transmits the compressed parameters. The receiving node decompresses the parameters, and reorders the parameters to their original order, according to the last or most recent sort order or indexing received for the node that sent the parameters.
[0066] Iteration X+K is the same as iteration X.
[0067] For iteration X+1 through X+K-1 the process may be the same:
arrange according to the same sort order as the previous iteration,
compress, transmit. For iteration X+K, where K is a pre-set
interval, such as 20 (causing a new sort to be created once every
20 iterations) the process may be the same as iteration X: sort,
compress, transmit. Thus, the sort order may be created and
transmitted only every K (e.g. 10, 20 or another suitable integer)
iterations, so the cost of sending it will be amortized across K
iterations. K can be variable rather than fixed. This works best as
long as the sort order does not change much across iterations,
which is typically the case for distributed machine learning where
the parameters change slowly.
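A minimal sender-side sketch of the K-periodic scheme described above (zlib again stands in for the lossless compressor, and K=20 is only the example interval from the text):

    import numpy as np
    import zlib

    K = 20                   # create and send a new sort order every K iterations
    last_sort_order = None   # this node's most recently created sort order

    def prepare_transmission(weights, iteration):
        # Returns (compressed payload, sort order to send or None).
        global last_sort_order
        if last_sort_order is None or iteration % K == 0:
            # "create sort" iteration: sort by value and remember the order.
            last_sort_order = np.argsort(weights)
            order_to_send = last_sort_order
        else:
            # In-between iteration: reuse the prior order; the data is only
            # "almost sorted" but still compresses well when weights drift slowly.
            order_to_send = None
        arranged = weights[last_sort_order].astype(np.float32)
        return zlib.compress(arranged.tobytes()), order_to_send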
[0068] In some embodiments, nodes may transmit parameters after
computing parameters for each layer, and thus parameters may be
sent multiple times for each backward pass. Further, a master node
may update model parameters after receiving parameters for a
certain layer, and transmit the parameters back to the nodes after
this: for example a node may compute backpropagation of parameters
of the next lower layer while it is updating the parameters of
layers above this layer whose backpropagation has already ended.
Thus in some embodiments a sequence of backpropagation may include
nodes updating parameters for layer X; nodes transmitting
parameters for layer X; nodes updating parameters for layer X+1,
(higher and more towards output than layer X) while master computes
model for layer X (concurrently, or simultaneously); master sending
model to nodes for layer X; nodes sending parameters for layer X+1;
etc. Other orders may be used. Further, in some embodiments nodes
may complete a full backpropagation of all layers before sending
parameters.
[0069] FIG. 3 is a flowchart of a method for exchanging or
transmitting parameters such as weights according to embodiments of
the present invention, while conducting training on a NN. While in
one embodiment the operations of FIG. 3 are carried out using
systems as shown in FIGS. 1 and 2, in other embodiments other
systems and equipment can be used. Further, embodiments of the
example operations of FIG. 3 may be used with or combined with the
embodiment shown in FIG. 5.
[0070] In operation 300 a number of nodes, e.g. computing nodes, or
processors executing or simulating a neural network, may receive
training sets or data from one or more master nodes. For example a
master node may send one image each to a number of nodes. The nodes
may be for example processors representing a NN using data, the NN
including for example artificial neurons connected by edges or
links. Thus the NN may be "virtual" and no actual physical neurons,
links, etc. may exist, existing rather as data used by the
nodes.
[0071] In operation 310, each node may execute a forward pass on
the training data received, to produce an output.
[0072] In operation 320, each node may execute a backward or
backpropagation pass, comparing its output for a NN to the expected
output for the training data used, and calculating parameters such
as weights for links or edges, or other data. In some embodiments,
during each iteration, all layers in the NN, or at least two
layers, may have parameters generated. In some embodiments, after
each computation of a layer's parameters, the sorting/reordering,
compressing and transmitting operations may occur for that layer.
In some embodiments, during each iteration the nodes or processors
during the backward or backpropagation pass calculate or generate
gradients for links and calculate or generate weights for the links
based on or using the gradients. For example, gradients may be
factors that adjust the values of the weights.
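As a purely hypothetical numeric illustration of such an adjustment (a plain gradient-descent step; the learning rate and values are made up):

    learning_rate = 0.01
    weight = 0.5200
    gradient = 0.0002                           # gradients are typically small
    weight = weight - learning_rate * gradient  # updated weight: 0.519998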
[0073] In operation 330, if the iteration is a "create sort"
iteration or a periodically occurring "ordering" iteration, e.g.
every K'th iteration (typically including the first iteration),
where K is an integer, each node may sort or arrange parameters
created in operation 320, for example according to the values of
the parameters, to create sorted parameters, e.g. sorted weights. A
sort order, order of arranging, ordering, or index may be created
and possibly saved or stored, based on the sorting process. Each
node may have a different locally-created sort order. For example,
while sorting the parameters, the new position of each parameter
(when compared to the position of the parameter before sorting) may
be saved as a sort order. Typically, parameters exchanged in a NN
system have some inherent order understood by all entities in the
system, and the sort process changes that order. Sorting or
arranging may be for example low to high, high to low, etc.
according to the numerical value of the parameter. The period
between when sorting is performed according to the values of the
parameters, and a sort order is created, may vary from iteration or
cycle to iteration or cycle, and thus K may vary.
[0074] In operation 340, if the iteration is not a periodically
occurring "create sort" iteration or "ordering" iteration but
rather an "in-between" iteration, no new sort order is created, and
sorting or arranging is performed on the parameters created in
operation 320 based on the last sort order or order of arranging
created by this node or processor (each node may store a different
"last" sort order). Thus the sorting performed in operation 340 may
be a rearrangement or re-ordering of parameters according to a
prior sort order (e.g. the last Kth iteration, or the last time
operation 330 was performed), and the "sorted parameters" are not
sorted according to some ranking of their own values, but rather are
arranged according to a prior sort order.
[0075] In operation 350, the parameters sorted or rearranged in
operations 330 and 340 may be compressed by a node, to produce
compressed sorted parameters, e.g. compressed sorted weights,
typically using lossless compression, although other compression
methods may be used. For example, the parameters may be Zipped. As
noted, the "compressed sorted parameters" may be not sorted
according to their own order; rather they may be sorted or arranged
according to a prior sort order. Typically, with an embodiment that
sorts parameters before compressing, data size savings are greatest when
parameters are weights, which typically have a similar order across
iterations, as opposed to gradients, which often do not have a
similar order across iterations. However, sorting and compressing
may be performed with parameters other than weights, such as
gradients, losses, etc.
[0076] In operation 360, each node may transmit or send its
compressed parameters to a master node, or one or more other
processors or nodes. If the iteration is a "create sort" iteration,
e.g. every K'th iteration, the sort order, ordering, or index
created in operation 330 may also be transmitted, for example with
the compressed parameters.
[0077] In operation 370, a master node or processor may receive the
parameters and create an updated model of the NN. In order to do
so, the master may decompress the parameters, and place the
parameters in the order according to the last sort order received.
The parameters are typically re-ordered or re-sorted according to
the last sort order received for the node that sent the parameters:
thus the master node may maintain or store a different "last" sort
order for each node sending it parameters. The master node
reordering decompressed parameters to their original, proper order
may be performed for data received from each node using a
separately received sort order, as typically the sort order or
indexing from each node is different.
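A minimal receiver-side sketch of this bookkeeping, assuming the master keeps the most recent sort order per sending node (zlib as the stand-in compressor; all names are illustrative):

    import numpy as np
    import zlib

    last_sort_orders = {}   # node id -> most recent sort order received from it

    def receive_parameters(node_id, payload, new_sort_order=None):
        # Remember a newly transmitted sort order for this node, if one arrived.
        if new_sort_order is not None:
            last_sort_orders[node_id] = new_sort_order
        arranged = np.frombuffer(zlib.decompress(payload), dtype=np.float32)
        # Restore the agreed-upon original parameter order using that node's
        # last known sort order.
        restored = np.empty_like(arranged)
        restored[last_sort_orders[node_id]] = arranged
        return restored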
[0078] In operation 380, the master node may send an updated model
to the nodes performing the simulation, and the nodes may update
their internal representation of the NN. The updated model may be
sent using parameters sorted and compressed according to the
relevant sort order.
[0079] In operation 390, the process may repeat or iterate, moving
back to operation 300. The iteration may stop when for example a
certain accuracy is achieved, after a certain number of runs or
iteration, or on other conditions. Other or different operations
may be used.
[0080] In some embodiments, a node receiving data--e.g. a master
node--may use operations similar to operations 300-390 to transmit
data to nodes, or non-master (e.g. "slave") nodes may use such
operations to transmit data. For example, a master node may use the
sort order received from node A to transmit model parameters back
to node A, by sorting the parameters according to the last sort
order received from node A, then compressing the data. Node A then
decompresses the received model data and uses the last sort order
it created to sort the data back to its original order.
Alternately, a master node may create its own sort order
periodically. In some embodiments, parameters may be transmitted
using a sort and compress method (e.g. to or from a master) after
each layer has data computed, but such data may also be data
transmitted after a complete backward pass. In some embodiments,
data transmitted using a sort and compress method may be from a
node executing a forward/backward pass to another node executing a
forward/backward pass. In some embodiments, data transmitted using
a sort and compress method may include parameters other than
weights: for example data may include gradient or loss data.
[0081] In some embodiments nodes, typically when performing
calculations relevant to an output layer (typically an FC layer),
instead of using only the loss or error produced at each node to
calculate weights or gradients for that layer, may in addition use
losses from other nodes, and may transmit or communicate their losses or
loss values to other nodes. One or more nodes receiving the losses
may receive all losses from all nodes simulating a forward pass,
and then compute, in series for losses from each different node
sending losses, a gradient and/or weight for each link or edge to
the output layer. This may be in place of a master node receiving
and averaging parameters for that particular layer. In one
embodiment once the gradients are computed the gradients, or the
final node weights after applying the gradients, may be averaged.
The nodes receiving loss data may be a master node, or may be all
nodes conducting a forward pass, in which case all such nodes
perform the same calculations using losses. Since in certain NNs
the number of links to neurons in an FC output layer is orders of
magnitude greater than the number of loss values for the output
layer, this may reduce the amount of data to be communicated (which
may allow for a less expensive communications link), in exchange,
in some embodiments, for the modest cost of multiple nodes using
the global loss values to compute weights or gradients for the
model. Further, typically computation for an FC layer, possibly
involving a matrix multiplication, is less burdensome than for
other layers such as a convolution layer, which may asymptotically
involve as many as the square of the number of operations of the
matrix multiply. Thus, in some embodiments, while a master node may
compute new weights for the model for most layers by accepting
weight values computed by nodes and, for example, averaging them,
for an FC layer multiple nodes (or a master node) may compute the
new weights (the weights after applying the gradients for the
model) from the losses by performing the backpropagation
computation step for the layer. This may lower the amount of data
that is transmitted. This may be especially useful for a system
using a small number of nodes, e.g. a pod of 16 or 32 nodes (other
numbers of nodes may be used).
[0082] In some embodiments, the layer or subset of layers on which
backpropagation is performed using non-local losses has associated
with the layer a large fraction of the total weights in the NN but
a much smaller fraction of the weight compute burden in the NN,
even when computing using non-local losses. Since compression may
be considered a translation of data movement burden (e.g. network
burden) to data compute burden, this may be considered analogous to
compression, in that there is a reduction in data movement burden
(fewer weights are moved) and an increase in computation burden
(each node redundantly performs substantially similar
loss-to-weight calculations). However, given the architecture of
some systems, this may result in faster processing. A measure of
the amount of parameter transmission or network burden may be the
number of bytes sent, or the number of parameters sent. A measure
of the amount of compute or processing burden may be the number of
computer operations (e.g., machine operations) needed to compute
gradients and weights during backpropagation. A layer may have a
different amount or burden of computation than other layers, and a
layer's transmission of parameters such as gradients or weights may
have a different amount or burden for this transmission than other
layers. In some embodiments, the "compute" ratio of the compute
burden of the layer or layers on which backpropagation is performed
using non-local losses to the compute burden of the other layers in
the NN on which backpropagation is performed using local losses may
be smaller than the "transmission" ratio of the data transmission
burden of the layer or layers on which backpropagation is performed
using non-local losses to the transmission burden of the other
layers in the NN on which backpropagation is performed using local
losses. Since the number or amount of weights for a layer is
analogous to or a measure of its transmission burden, in some
embodiments the ratio of compute burden of layer(s) on which
backpropagation is performed with non-local losses to the compute
burden for the other layers in the NN may be less than the ratio of
the number of weights for the layer(s) on which backpropagation is
performed to the number of weights for the other layers in the NN.
In some embodiments the layer(s) on which backpropagation is
performed using non-local losses have more weights than another
layer, or than all the other layers in the NN (e.g. cumulatively).
In some embodiments the layer(s) on which backpropagation is
performed using non-local losses has associated with the layer(s) a
larger amount of weight values and/or a smaller amount of weight
compute burden than all other layers in the NN cumulatively--e.g.
than all the values and burdens for the other layers combined.
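As a minimal numeric sketch of this criterion (the percentages are
simply the AlexNet-style figures used elsewhere in this
description, serving as placeholders):

    # "Compute" ratio vs. "transmission" (weight-count) ratio for a layer.
    def prefers_loss_exchange(layer_ops, other_ops, layer_weights, other_weights):
        compute_ratio = layer_ops / other_ops           # compute burden ratio
        transmit_ratio = layer_weights / other_weights  # weight-count ratio
        return compute_ratio < transmit_ratio

    # An FC-style layer with ~4% of the compute but ~93% of the weights qualifies.
    print(prefers_loss_exchange(layer_ops=4, other_ops=96,
                                layer_weights=93, other_weights=7))   # True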
[0083] FIG. 4 depicts a prior art process for training using a
multi-node system, using two nodes 402 and 404 and one master node
400; other numbers of nodes may be used and a master may be part of
a node performing NN simulation. Nodes 400, 402 and 404 may be
connected by for example network 406 and may simulate a NN 410
including layers 411, 412, 413, 414, 415, 416 and 417. Master node
400 may store datasets 420, e.g., training data, and model data and
parameters 422. Embodiments of the present invention may improve on
the system of FIG. 4. Referring to FIG. 4, in some processes for
distributed (multi-node) training of machine learning algorithms
operations such as the following may be used:
[0084] 1) A master may send (operation 430) parameters or a model
and input data to the nodes. Each node may execute (operation 432)
a forward-backward pass that generates update gradients and
weights.
[0085] 2) Nodes execute a weight synchronization algorithm, which
may involve a parameter update. This may involve nodes sending
parameters to one or more master nodes (operation 440). In some
embodiments, a loss 460 may be generated, and convolution layers
may generate parameters e.g. parameters 462, and an FC layer may
generate parameters 464. One or more master nodes may accept
parameters to update the model (operation 442), e.g. by averaging
weights, and send the model back to the nodes; or this may involve
each node receiving all other nodes' parameters, so that each node
can update its parameters based on averaging weights from all other
nodes' executions just as the master would have done. As a result,
in the standard data-parallel approach, the full model may be
transmitted by the nodes to the master over the network.
[0086] 3) Each node updates its parameters.
[0087] 4) Iteration repeats at operation (1).
[0088] In operation 2, the weight synchronization, nodes may need
to communicate parameters such as weights or gradients over the
network to other nodes. For example, in the data-parallel learning
approach, each node executes the full machine learning model on a
subset of examples, and thus the number of parameters a node needs
to communicate is the same as the model size, which is a large
amount of data to communicate. In the case, for example, of the
AlexNet CNN, there may be for example 220 MBytes of parameters, so
for 10 nodes, 220 MBytes*10=2.2 GBytes of parameters must be
transferred over the network in both directions for each iteration.
The time it takes to complete an iteration can be for example
approximately 30 ms, so 2.2 GBytes must be transferred in 30 ms
over the network to avoid any network bottlenecks. This requires a
2.2 GB/0.030=73 GB/sec network link, faster than the capabilities
of most reasonably priced network links. This may prevent the
system from scaling.
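The bandwidth requirement above can be reproduced with a short
back-of-the-envelope calculation (illustrative only, using the
approximate figures from the example):

    # Back-of-the-envelope reproduction of the bandwidth figure above.
    model_bytes = 220e6          # ~220 MBytes of parameters per node
    num_nodes = 10
    iteration_sec = 0.030        # ~30 ms per iteration
    bytes_per_iteration = model_bytes * num_nodes        # 2.2 GBytes
    required_bandwidth = bytes_per_iteration / iteration_sec
    print(required_bandwidth / 1e9)                      # ~73 GB/sec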
[0089] In prior art systems, a synchronization procedure (operation
2 above) in distributed data-parallel training of neural networks
included transmitting all of the parameters such as weights or
gradients of the backward pass to a master node, or to the other
nodes. For example, in the case of a six layer CNN with the layers
INPUT, CONV_1, POOL_1, CONV_2, POOL_2, FC and SOFTMAX, after a node
finishes the forward-backward pass for its set of input examples,
there are new parameters generated for CONV_1, CONV_2 and the FC
layers. At this point, a node may send all of these parameters to
one or more other nodes (or a master node).
[0090] In one embodiment, for the FC final or output layers, of a
NN such as CNN, the need to transmit the FC parameters to other
nodes is avoided. Typically, compute and gradient memory
requirements of the different neural network layers are not
balanced or the same. For example, (1) the amount of compute
needed to execute (e.g. compute weights for, during training) the
FC layer is low compared to other layers such as a convolution
layer, and (2) the parameter memory requirement of the FC layer is
relatively high (e.g. parameters for each FC node having links from
each prior layer node must be stored), while for convolution layers
memory requirements may be low (since typically convolution layer
neurons are less connected to the layer inputting to the
convolution layer compared to an FC layer). For example, in one
example of the AlexNet CNN, the FC layer compute burden may be only
4% of the total CNN compute burden, while the parameter memory
burden is 93% of the parameters for the NN. While embodiments are
described as applied to an FC layer of a CNN, other types of layers
can be used, and other types of NNs can be used. For example,
embodiments may be applied to the training of any CNN that has a
final layer in which the ratio of compute to data size is very
small, that is, there is little computation but a lot of parameter
data to be transferred.
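To make the imbalance concrete, the following sketch counts weights
and approximate multiply operations for a convolution layer and an
FC layer with loosely AlexNet-like dimensions; the specific sizes
are assumptions chosen only for illustration:

    # Convolution layer: weights are shared across all spatial positions.
    in_ch, out_ch, k, out_h, out_w = 256, 256, 3, 13, 13
    conv_weights = in_ch * out_ch * k * k          # ~0.59 M weights
    conv_ops = conv_weights * out_h * out_w        # ~100 M multiplies

    # FC layer: every input/output pair has its own weight, used exactly once.
    fc_in, fc_out = 9216, 4096
    fc_weights = fc_in * fc_out                    # ~37.7 M weights
    fc_ops = fc_weights                            # ~37.7 M multiplies

    print(fc_weights / conv_weights)   # roughly 64x more weights to transmit
    print(conv_ops / fc_ops)           # yet the convolution needs more compute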
[0091] In one embodiment, operations such as the following may be
performed, typically for each node simulating forward/backward
passes. While in the following, an FC layer is given as an example
of a layer where losses may be transmitted instead of other
parameters, other embodiments may be used with layers other than FC
layers, such as FC "style" layers that have large number of weights
and low compute costs.
[0092] FIG. 5 is a flowchart of a method of exchanging weights
among nodes or processors according to embodiments of the present
invention. While in one embodiment the operations of FIG. 5 are
carried out using systems as shown in FIGS. 1 and 2, in other
embodiments other systems and equipment can be used. Further,
embodiments of the example operations of FIG. 5 may be used with or
combined with the embodiment shown in FIG. 3. For example some or
all of data such as parameters, weights, gradients, and/or loss
data may, in an embodiment of FIG. 5, be transmitted using an
embodiment of FIG. 3. Typically, embodiments of FIG. 5 achieve the
most savings in data transmission when nodes are CPU-based. CPU
systems may for example have advantages over GPU systems in memory
size, which may be important as some embodiments of FIG. 5 require
the storage of multiple sets of losses and gradients. However,
embodiments of FIG. 5 may be used with systems where nodes are
GPU-based. In one embodiment, for each processor or node i which is
not a master:
[0093] In operation 500, the node or processor may receive training
data and execute a forward-pass on the NN, which may generate a set
of loss values, e.g. loss(es)_i. These may be termed, for each
node, local losses: losses local to that node.
[0094] In operation 510, the processor or node may send or transmit
the loss(es)_i to other nodes executing a forward pass (e.g. non-master
nodes). In other embodiments, such losses may be sent to a master
node, which may perform the calculations discussed herein.
[0095] In operation 520, backpropagation or a backward pass may
occur at the node or processor. The node may execute a full
backward pass for all layers using its own loss only ("local"
losses), not including the other losses received, resulting in the
new weights and gradients for all layers including the FC.
Typically, during the full backpropagation pass using local losses,
gradients for the layers which will have losses for other nodes
applied (e.g. in operation 550) are not applied to modify layer
weights. Rather, these gradients are stored or saved, to be used
later in operation 550 to modify the weights: this is because
modification of a model using losses should typically be performed
on the model which generated the losses, as opposed to a modified
model. In some embodiments, prior to performing backpropagation
using loss values received from a set of other processors or nodes
(e.g. operation 550), the node may perform
backpropagation using the loss values produced by the processor or
node.
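A minimal sketch of this bookkeeping follows (the
dictionary-of-arrays model, the stand-in backward pass, and all
function names are assumptions made purely for illustration):

    import numpy as np

    def backward_all_layers(model, loss):
        # Stand-in for a real backward pass: one gradient array per layer.
        return {name: loss.mean() * np.ones_like(w) for name, w in model.items()}

    def local_backward(model, local_loss, deferred_layers):
        grads = backward_all_layers(model, local_loss)
        saved_for_later, to_transmit = {}, {}
        for name, g in grads.items():
            if name in deferred_layers:   # e.g. {"fc"}: store, do not apply yet
                saved_for_later[name] = [g]
            else:                         # other layers: sent to master/peers (op. 540)
                to_transmit[name] = g
        return saved_for_later, to_transmit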
[0096] In operation 530, the node may receive the loss(es) of each
other node. At this point, the node has multiple sets of losses
(one for each node in the system, including its own loss(es)). As
with other operations, this operation may be performed in an order
different than implied; for example nodes sending and receiving
losses may be done somewhat concurrently, and this may be performed
while nodes are performing other processing, such as
backpropagation.
[0097] In operation 540 the node may transmit or send parameters
such as gradients or modified weights generated in operation 520,
apart from or excluding those for the FC layer (or a layer to be
used with an embodiment of the present invention), to other nodes
or to a master node, substantially as in a standard data-parallel
approach. While in some embodiments, the backward pass results for
layers such as a convolution layer still are transmitted, the
number of parameter values for such layers may be small relative to
those for an FC layer, and thus large savings in network traffic
may result from not sending the FC layer parameters and sending
only the losses.
[0098] Operations such as sending and receiving data may be
performed at any suitable time, and do not have to be performed in
the order implied in FIG. 5. The order of operations in flowcharts
(FIG. 5 and other flowcharts) in this application may be altered in
certain embodiments if suitable. For example, transmitting losses
may be performed after, or concurrently with, transmitting
parameters; other suitable modifications may be implemented.
[0099] In operation 550 the node may perform backpropagation
training on a limited subset of layers, e.g. at least one layer of
the NN such as an FC layer possibly including layers from the FC to
the output, using loss values received from a set of other nodes or
processors, e.g. non-local losses. Application of gradients to
weights on such layers may also be performed using saved gradients
from operation 520. In some embodiments the layer(s) on which
backpropagation is performed using losses from other nodes has
associated with the layer a larger amount of weight values and a
smaller amount of weight compute burden than another layer in the
NN, e.g. when compared to a convolution layer. Note that losses
have already been computed for this layer (and all layers) using
the local losses in operation 520.
[0100] For example, for the losses of each other node
("non-local"), apart from the local losses of the node, the node
may execute a backward pass for higher layers down to or until and
including the FC layer, but typically not beyond (e.g., below,
towards the input) the FC layer, one after the other, not
continuing with the backpropagation for layers below (towards the
input) and beyond the FC. This backpropagation may occur
individually and separately for each set of non-local losses
received from another node, as typically the losses cannot be
combined or averaged across different nodes. For example, for each
other processor for which loss values are received, the receiving
processor may perform a separate backpropagation operation on the
layers down to and including the FC layer. Thus in one embodiment,
in operation 550, one backward pass is done down to and including
the FC layer but not beyond for the loss of each node other than
the local node, the gradients--but not weight changes--resulting
from the backward pass accumulating or being stored. Typically, a
model should be modified using losses generated for that model, and
thus gradients should be applied to the un-modified model, rather
than having gradients generated from a model modified from the
model that generated the losses. Thus the gradients generated for
the relevant layer in operation 520 using local losses, and the
gradients generated in operation 550 based on non-local losses, are
accumulated, and then applied to the NN model stored by the node by
for example being applied to the relevant weights, or averaged then
applied to the weights. Weights for all other layers may be updated
based on weights received from the master or from other nodes. In
some embodiments, the parallelization among threads of the
backpropagation of losses for all other parts of the model--e.g.
backpropagation in one pass across nodes, then combining
weights--except for the typically inexpensive FC layer and its
typically inexpensive preceding layers, may allow for loss
calculation time to be reduced. At this point this node has the
full FC backward pass result (or in the case that the FC layer is
not the final top-most layer, the full result for every layer from
the output to and including the FC layer): each node has the same
weights for the FC
layer, as if a master node had averaged the FC layer weights and
sent the weights to the nodes. Such a technique may improve NN
learning communication in that in some example NNs, the actual
weights of the FC layer which may be 90% of the NN weights, are
never transmitted. Rather, only the loss and the 10% of the weights
(in one example) for the other layers are transmitted.
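One possible sketch of operation 550 is given below (illustrative
only; the partial backward helper is a stand-in, and averaging the
accumulated gradients is one of the options mentioned above):

    import numpy as np

    def backward_down_to_fc(model, loss, fc_layers=("fc",)):
        # Stand-in for a partial backward pass from the output layer down to
        # and including the FC layer(s), always against the unmodified model.
        return {name: loss.mean() * np.ones_like(model[name]) for name in fc_layers}

    def apply_all_losses_to_fc(model, saved_fc_grads, nonlocal_losses, lr):
        # saved_fc_grads: FC gradients from the local pass (operation 520),
        # stored as one list of gradient arrays per FC-style layer.
        accumulated = {name: list(gs) for name, gs in saved_fc_grads.items()}
        for loss in nonlocal_losses:               # one loss set per other node
            for name, g in backward_down_to_fc(model, loss).items():
                accumulated[name].append(g)
        # Apply the averaged accumulated gradients to the stored model.
        for name, gs in accumulated.items():
            model[name] = model[name] - lr * sum(gs) / len(gs)
        return model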
[0101] In operation 560 the node may receive parameters from a
master node or other nodes, as a model update (e.g. their
calculated average) or from other nodes (e.g. as individual weights
to be averaged by the node) and may apply them to the NN being
processed by the node, to update the model stored by the node. For
example, the node may receive individual parameters such as weights
for all other layers, apart from FC, or apart from the layers from
the FC to the output inclusive. At this point the node may have new
or updated weights: for all layers lower than the FC (towards the
input), obtained from the master, and for layers above the FC
(towards the output) and including FC from a locally performed loss
based calculation. Thus improvement may be achieved in some
embodiments in that, for weight updates for layers from the output
through and including the FC layer, the node may execute a backward
pass for all loss values separately, while for layers between the FC
layer and the input layer, weight updates are calculated by a master
averaging locally computed weight values. This may decrease
communications burden while only slightly increasing processing
burden.
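A trivial sketch of this merge (the layer names and the dictionary
representation of the model are assumptions):

    def merge_updates(model, weights_from_master, locally_updated_fc_weights):
        # Layers below the FC layer: take the master's averaged weights.
        # FC layer (and layers above it): keep the weights computed locally
        # from the shared losses in operation 550.
        model.update(weights_from_master)
        model.update(locally_updated_fc_weights)
        return model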
[0102] The process may iterate again with operation 500. Typically,
a NN used with embodiments of FIG. 5 includes at least one fully
connected layer, at least one input layer, and at least one
convolution layer, and possibly other layers; see, e.g. the example
NN structures of FIGS. 1 and 6. However, other structures of NNs
may be used with a process such as in FIG. 5.
[0103] In some embodiments, non-master or slave nodes may send
losses to one central node, such as a master node, which may
execute backpropagation for a selected subset of layers (e.g.
layers from FC to output inclusive) for each loss set, integrate
the results into the model or update the model (e.g. by applying
each resulting gradient set to the model), and send the updated
model back to other nodes. This may be performed in conjunction
with the master node receiving parameters regarding other layers
such as weights or gradients and updating the model based on those
other parameters: the NN model updated by the master using both
loss data and parameters such as weights or gradients may be sent
to the non-master nodes conducting training. Whether the master or
a number of slaves perform sequential backpropagation for certain
layers using loss data from multiple nodes, the backpropagation for
those layers is typically independent for each loss set. E.g. a
loss set from node A may be applied to the model used to generate
the losses to generate gradients, a loss set from node B may be
applied to the model used to generate the losses, etc., and the
multiple sets of gradients may be then applied to the weights from
the model used to generate the losses, for the relevant layer.
[0104] In some embodiments, there is no accuracy loss in using
embodiments of FIG. 5, since the algorithm is semantically the same
as the prior art data-parallel forward-backward pass algorithm. In
some embodiments, operations such as: a node sending a loss or set
of loss values; and the same node executing a backward pass (e.g.
operation 520, a backward pass based on "local" losses for the
processing node only) or a portion of a backward pass, may be done
in parallel, or concurrently. Different cores within the same
processor may be dedicated to different tasks. Improvements may
result from tasks being done in parallel such as for example
transmitting or receiving data, sorting, compression, portions of a
backward or forward pass, etc.
[0105] Communications improvements may result when nodes
communicate their losses to other nodes, so that each node can have
all of the losses and compute the total or aggregated FC gradients
locally. In some embodiments, nodes computing a forward pass
(typically slave nodes as opposed to master nodes) may send their
loss values to each other and each may compute FC gradients (e.g.
gradients to be used to change weights inputting to neurons in an
FC layer) and apply them to alter the FC weights, individually,
which may allow for FC layer weights or gradients to not be
transmitted; rather only weights of gradients for other layers are
transmitted. In prior art systems, weights or gradients for the FC
layer are transmitted, which takes up a lot of network bandwidth.
This may result in significant improvement to NN learning
technology, as in some example systems, 90% of the weights of the
NN may be for the FC layer. In another example, 93% of the weights
are in the FC layer in one example of the AlexNet CNN. A dramatic
reduction in overall communications may result. In some
embodiments, each node computes the total/aggregated FC gradient
result which adds computation time to the node, but this is more
than made up for with communications savings. For example, if there
are N nodes in the distributed system, then the compute time added
is FC_layer_compute time*N: savings are maximized when the
FC_layer_compute time is small (relative to other layers) and the
number of nodes in the system is small. However, savings may result
from systems without such characteristics.
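The trade-off can be written down directly (the timing value is a
made-up placeholder):

    # Illustrative cost of redoing the FC backward pass for every node's
    # losses, in exchange for not transmitting the FC weights.
    fc_layer_compute_sec = 0.001     # hypothetical per-loss-set FC backward time
    n_nodes = 16
    added_compute_sec = fc_layer_compute_sec * n_nodes   # FC_layer_compute time * N
    print(added_compute_sec)         # 0.016 s added per iteration, per node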
[0106] Such a system of transmitting losses instead of other
parameters such as weights or gradients may be combined with the
embodiments for improved compression using sorting, as discussed
herein, which itself may result in a 3.times. reduction in
communications. The two techniques in combination may in some
examples result in a 30.times. reduction of network traffic, in a
lossless way.
[0107] FIG. 6 is a diagram depicting a method of exchanging weights
among processors according to embodiments of the present invention.
While in one embodiment the operations of FIG. 6 are carried out
using systems as shown in FIGS. 1 and 2, in other embodiments other
systems and equipment can be used. Further, embodiments of the
example operations of FIG. 6 may be used with or combined with the
embodiments shown in FIGS. 3 and/or 5.
[0108] FIG. 6 shows an embodiment with two nodes 610 and 620
executing simulations, including models of a NN 612 and 622, one
master node 600 including model information such as parameters 602
and data or training data 604, and a network 650 connecting nodes 600,
610 and 620. Other numbers of nodes may be used. An iteration may
include Phase 1, the execution, and Phase 2, the parameter update.
In Phase 1, a master sends parameters and input data to the nodes,
the nodes perform forward pass, and then each node 610 and 620
transmits its loss value(s) (e.g. the forward pass result) to the
other node of 610 and 620. In Phase 2, each node has one loss data
set from each node (itself and other nodes), in this example two
losses. Each node may use these losses to compute the final result
for FC gradients locally by itself. Then, each node may continue to
execute the rest of the backward pass in a way similar to the
standard data-parallel approach: for example each node may send
convolution weight gradients to master 600, master 600 may sum
convolution weight gradients of nodes 610 and 620 performing
forward and backward passes, and may send the final result (e.g. a
model) to nodes 610 and 620. In some embodiments, an improvement
may result from FC gradients not being transmitted over network 650
at any point in time, which has the potential to provide an order
of magnitude reduction in network traffic in many CNNs (without any
loss in accuracy). NNs other than CNNs may be used, and while
embodiments discuss treating an FC layer differently, embodiments
of the present invention may perform local calculations for layers
other than an FC layer.
[0109] In some prior art systems, most of the computation necessary
to train or perform inference in neural networks is performed by
specialized, massively parallel hardware devices, such as GPUs.
Such devices may have thousands of relatively weak processing
cores, specialized to perform "regular," predictable computation,
which follows exactly the same control flow pattern, such as
massive matrix multiplications.
[0110] Embodiments of the present invention may improve prior NN
training and inference by for example allowing for less expensive,
more common or commodity equipment to be used. For example, an
Ethernet or less expensive data link may be used, and CPU based
machines may be used instead of GPU based machines. While GPUs may
be used with some embodiments of the present invention, typically,
GPUs are not as powerful as CPUs at performing certain algorithms
such as compression, which involves some sequential tasks:
typically, GPUs are better at massively parallel tasks than CPUs,
and CPUs may outperform GPUs at sequential tasks. Thus GPUs may
not be as powerful at performing compression as discussed herein,
which may enable the use of less expensive network connections.
Further, CPUs may be better than GPUs at interleaving, pipelining
and complex parallel tasks which may be performed according to some
embodiments of the present invention. GPUs may lack the large
memory size CPU machines have, which may lower the ability of GPU
machines to buffer a large amount of data. In some embodiments, a
node may receive, and buffer or store, a large amount of input
training data to process, and may process such data in sequence. In
some embodiments, a node may multitask or interleave tasks, for
example, at the same time, performing a forward pass for one layer
of input data (e.g., an input image), sorting and/or compressing
the parameter data for another layer.
[0111] While embodiments have been described in the context of NN
learning, data processing in other contexts may make use of an
embodiment of a sort-and-compress method as described herein.
Embodiments are applicable to any system in which the relative
order of the elements to be compressed does not change much from
one iteration to the next. Thus embodiments may be applied to
systems other than machine learning. For example, an embodiment may
be used to transmit pixel data for images. A sort-and-compress or
sort-and-ZIP algorithm may be applicable to any set of numbers that
are generated during iterations.
[0112] Embodiments of the present invention may be applicable to
any set of numbers generated during iterations of distributed or
other training, such as floating point parameters or gradients, or
integer parameters or gradients that may be a result of
quantization, 8 bit representations, etc.
[0113] Embodiments of the invention may be applicable to NNs
computed with any sort of nodes, e.g. CPUs, GPUs, or other types of
processors. However, embodiments may be particularly useful with CPU
based nodes, as sorting and compressing (e.g. sequential
compression) may be easier to implement efficiently, or may execute
faster, on a CPU.
[0114] In some embodiments, it is possible to use quantization, a
known compression technique for gradients. For example, a process
may first quantize floating point parameters to integers, and then
perform a sort-and-compress process as described herein.
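A rough sketch of that combination follows (the symmetric 8-bit
quantization scheme shown is a common, simple choice and is only an
assumption):

    import zlib
    import numpy as np

    def quantize_sort_compress(params, order):
        # Quantize float parameters to 8-bit integers, then sort and compress.
        scale = max(float(np.abs(params).max()), 1e-12) / 127.0
        q = np.round(params / scale).astype(np.int8)
        payload = zlib.compress(q[order].tobytes())
        return payload, scale

    def decompress_restore(payload, order, scale):
        q = np.frombuffer(zlib.decompress(payload), dtype=np.int8)
        restored = np.empty(q.shape, dtype=np.float32)
        restored[order] = q.astype(np.float32) * scale  # undo permutation, dequantize
        return restored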
[0115] One skilled in the art will realize the invention may be
embodied in other specific forms without departing from the spirit
or essential characteristics thereof. The foregoing embodiments are
therefore to be considered in all respects illustrative rather than
limiting of the invention described herein. Scope of the invention
is thus indicated by the appended claims, rather than by the
foregoing description, and all changes that come within the meaning
and range of equivalency of the claims are therefore intended to be
embraced therein.
[0116] In the foregoing detailed description, numerous specific
details are set forth in order to provide an understanding of the
invention. However, it will be understood by those skilled in the
art that the invention can be practiced without these specific
details. In other instances, well-known methods, procedures, and
components, modules, units and/or circuits have not been described
in detail so as not to obscure the invention. Some features or
elements described with respect to one embodiment can be combined
with features or elements described with respect to other
embodiments.
[0117] Although embodiments of the invention are not limited in
this regard, discussions utilizing terms such as, for example,
"processing," "computing," "calculating," "determining,"
"establishing", "analyzing", "checking", or the like, can refer to
operation(s) and/or process(es) of a computer, a computing
platform, a computing system, or other electronic computing device,
that manipulates and/or transforms data represented as physical
(e.g., electronic) quantities within the computer's registers
and/or memories into other data similarly represented as physical
quantities within the computer's registers and/or memories or other
information non-transitory storage medium that can store
instructions to perform operations and/or processes.
[0118] Although embodiments of the invention are not limited in
this regard, the terms "plurality" and "a plurality" as used herein
can include, for example, "multiple" or "two or more". The terms
"plurality" or "a plurality" can be used throughout the
specification to describe two or more components, devices,
elements, units, parameters, or the like. The term set when used
herein can include one or more items. Unless explicitly stated, the
method embodiments described herein are not constrained to a
particular order or sequence. Additionally, some of the described
method embodiments or elements thereof can occur or be performed
simultaneously, at the same point in time, or concurrently.
* * * * *