U.S. patent application number 17/251508 was filed with the patent office on 2021-05-13 for neural networks having reduced number of parameters.
This patent application is currently assigned to TELECOM ITALIA S.p.A. The applicant listed for this patent is TELECOM ITALIA S.p.A. Invention is credited to Attilio FIANDROTTI, Gianluca FRANCINI, Skjalg LEPSOY, Enzo TARTAGLIONE.
Application Number | 17/251508
Publication Number | 20210142175
Family ID | 1000005401148
Filed Date | 2021-05-13
United States Patent Application | 20210142175
Kind Code | A1
FIANDROTTI; Attilio; et al.
May 13, 2021
NEURAL NETWORKS HAVING REDUCED NUMBER OF PARAMETERS
Abstract
A method includes providing a neural network having a set of
weights. The neural network receives an input data structure for
generating a corresponding output array according to values of the
set of weights. The neural network is trained to obtain a trained
neural network. The training includes setting values of the set of
weights with a gradient descent algorithm which exploits a cost
function including a loss term and a regularization term. The
trained neural network is deployed on a device through a
communication network, and used by the device. The regularization
term is based on a rate of change of elements of the output array
caused by variations of the set of weights values.
Inventors: FIANDROTTI; Attilio (Torino, IT); FRANCINI; Gianluca (Torino, IT); LEPSOY; Skjalg (Torino, IT); TARTAGLIONE; Enzo (Torino, IT)
Applicant: TELECOM ITALIA S.p.A., Milano, IT
Assignee: TELECOM ITALIA S.p.A., Milano, IT
Family ID: 1000005401148
Appl. No.: 17/251508
Filed: July 18, 2019
PCT Filed: July 18, 2019
PCT No.: PCT/EP2019/069326
371 Date: December 11, 2020
Current U.S. Class: 1/1
Current CPC Class: G05D 1/0088 (20130101); G06N 3/04 (20130101); G06N 3/08 (20130101)
International Class: G06N 3/08 (20060101); G06N 3/04 (20060101); G05D 1/00 (20060101)

Foreign Application Data

Date | Code | Application Number
Jul 20, 2018 | IT | 102018000007377
Claims
1-11. (canceled)
12. A method, comprising: providing a neural network having a set
of weights and being configured to receive an input data structure
for generating a corresponding output array according to values of
said set of weights; training the neural network to obtain a
trained neural network, said training comprising setting values of
the set of weights by means of a gradient descent algorithm which
exploits a cost function comprising a loss term and a
regularization term; deploying the trained neural network on a
device through a communication network; and using the deployed
trained neural network on the device, wherein the regularization
term is based on a rate of change of elements of the output array
caused by variations of the set of weights values.
13. The method of claim 12, wherein said regularization term is
based on a sum of penalties each one penalizing a corresponding
weight of the set of weights, each penalty being based on the
product of a first factor and a second factor, wherein: said first
factor is based on a power of said corresponding weight,
particularly a square of the corresponding weight, and said second
factor is based on a function of how sensitive the output array is
to a change in the corresponding weight.
14. The method of claim 13, wherein said function corresponds to
the average of absolute values of derivatives of output elements of
the output array with respect to the weight.
15. The method of claim 13, wherein: said training comprises, for
each one among a plurality of training input data structures,
comparing the output array generated by the neural network
according to said training input data structure with a
corresponding target output array having only one nonzero element,
and said function corresponds to the absolute value of the
derivative, with respect to the weight, of an element of the output
array corresponding to said one nonzero element of the target
output array.
16. The method of claim 13, wherein said training comprises
calculating a corresponding updated weight from each weight of the
set of weights by subtracting from said weight: a first term based
on the derivative of the loss term with respect to the weights, and
a second term based on the product of said weight and a further
function, said further function being equal to one minus said
function if said function is not higher than one, and being equal
to zero if said function is higher than one.
17. The method of claim 12, wherein said training further comprises
setting to zero weights of the set of weights having a value lower
than a corresponding threshold.
18. The method of claim 17, further comprising setting said
threshold to a selected one between: a threshold value based on the
mean of the non-zero weights; a threshold value such that a ratio
of a first set size to a second set size is equal to a constant,
wherein said first set size is the number of nonzero weights whose
absolute values are smaller than or equal to said threshold value
and said second set size is equal to the number of all nonzero
weights.
19. The method of claim 12, wherein said deploying the trained
neural network on a device comprises sending non-zero weights of
the trained neural network to the device through said communication
network.
20. The method of claim 12, wherein said using the deployed trained
neural network on the device comprises using the deployed trained
neural network with an application for a visual object
classification running on the device.
21. The method of claim 12, wherein said device is a mobile
device.
22. The method of claim 12, wherein said device is a processing
device of a control system of a self-driving vehicle.
Description
BACKGROUND OF THE INVENTION
Field of the Invention
[0001] The present invention relates to the field of neural
networks.
Description of the Related Art
[0002] Artificial Neural Networks (hereinafter, briefly "ANN") have
seen an explosion of interest over the last few years and are being
successfully applied across a wide range of application fields,
comprising control systems, robotics, pattern recognition systems,
forecasting, medicine, power systems, manufacturing, optimization,
signal processing, and social/psychological sciences.
[0003] Among the application fields of ANN, object classification
of images (hereinafter, simply "object classification") has become
increasingly important as the use of digital image capture
devices--such as smartphones or other portable devices including a
digital camera--grows. Object classification is a procedure that
provides for assigning labels to object(s) depicted in a (digital)
image according to predefined image classes. Object classification
provides for selecting an appropriate class for the object depicted
in the image among the available predefined image classes by
analyzing visual patterns of the image. In particular, object
classification may be based on the known machine learning approach
usually referred to as "deep learning" applied to ANN.
[0004] As is well known to those skilled in the art, the basic
element of an ANN is the neuron, also referred to as node. Each
neuron has a single output but it might have many inputs. The
output of a neuron is the result of applying a non-linear function
to a linear combination of its inputs added to a constant value
usually known as bias. The coefficients of this linear combination
are usually called weights w and the non-linear function is usually
called activation function. ANN are arranged according to a
sequence of so-called "layers". Each layer contains a corresponding
set of neurons. The output of a neuron belonging to a layer may
serve as an input to a neuron belonging to the next layer of the
sequence.
[0005] As disclosed for example in Gradient-based learning applied to document recognition by LeCun, Yann, et al., Proceedings of the IEEE 86.11 (1998), the Convolutional Neural Network (hereinafter, briefly "CNN") is a kind of ANN that is particularly well suited to the object classification field. For this reason, most of the approaches currently employed in the object classification field are based on CNN. A CNN is an ANN comprising at least one convolutional layer, i.e., a layer comprising neurons that share the same set of weights w (in the image analysis field, said set of weights is usually referred to as "kernel" or "filter") and whose outputs are given by the convolution among the inputs thereof.
[0006] Making reference to an exemplary gray-scale digital image comprising h² pixels arranged in h rows and h columns (input image), wherein each pixel has associated thereto a corresponding pixel value indicative of a corresponding luminance value (for example, the higher the pixel value, the higher the luminance; e.g., using 1 unsigned byte, a value of 0 corresponds to the lowest luminance and a value of 255 to the highest one), and a kernel comprising k² weights w arranged in a matrix having k×k elements, the convolution between the input image and the kernel provides for processing the image for generating a so-called featuremap comprising a plurality of features, each feature being associated to a corresponding area of k×k pixels (source pixels) of the input image, by carrying out the following operations (see the sketch after this list):
[0007] the k×k kernel is "overlapped" over a corresponding k×k portion of the input image so that each source pixel of said portion of the input image is associated with a corresponding element of the kernel matrix, with the center element of the kernel matrix associated with a central source pixel of said portion;
[0008] the pixel value of each pixel included in said portion of the input image is weighted by multiplying it with the weight w corresponding to the associated element of the kernel matrix;
[0009] the weighted pixel values are summed to each other, and a corresponding bias is added;
[0010] an activation function is applied, obtaining this way a feature associated to the examined portion of the input image; such feature is saved in a position of the featuremap corresponding to the central source pixel;
[0011] the filter is shifted, horizontally and vertically, by a stride equal to an integer value (e.g., 1 pixel);
[0012] the above steps are repeated to cover all the pixels of the input image, in order to obtain a complete featuremap.
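By way of illustration only, the per-position computation listed above can be sketched in Python (the numpy dependency, the tanh activation and the random data are illustrative assumptions, not part of the described method):

```python
import numpy as np

def conv_featuremap(image, kernel, bias=0.0, activation=np.tanh):
    """Slide a k x k kernel over an h x h image with stride 1, weight and
    sum the source pixels, add the bias and apply the activation, as in
    steps [0007]-[0012]. Returns an (h - k + 1) x (h - k + 1) featuremap."""
    h, k = image.shape[0], kernel.shape[0]
    fm = np.empty((h - k + 1, h - k + 1))
    for i in range(h - k + 1):
        for j in range(h - k + 1):
            patch = image[i:i + k, j:j + k]          # k x k source pixels
            fm[i, j] = activation(np.sum(patch * kernel) + bias)
    return fm

image = np.random.rand(112, 112)   # h = 112 gray-scale input image
kernel = np.random.randn(3, 3)     # k = 3 filter
print(conv_featuremap(image, kernel).shape)  # (110, 110) = (h - k + 1, h - k + 1)
```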
[0013] Generalizing, the convolutional layer of a CNN having as input a square h×h digital signal (also generally referred to as "input data structure" or simply "input structure") comprising h² values obtained by sampling such signal (such as the abovementioned h² pixel values of the exemplary h×h gray-scale digital input image), and having a kernel comprising k² weights w arranged in a square matrix having k×k elements, outputs a (h-k+1)×(h-k+1) digital signal forming an output data structure (featuremap) comprising (h-k+1)² values (features).
[0014] Similar considerations apply if the input data structure
and/or the kernel matrix have a different shape, such as a
rectangular shape.
[0015] With reference to the considered example, the weights w of the k×k kernel may be set to represent a particular visual pattern to be searched in the input structure representing the input image. In this case, the output of the convolutional layer is a data structure corresponding to a featuremap having (h-k+1)×(h-k+1) features, wherein each feature of said featuremap may have a value that quantifies how much such particular visual pattern is present in a corresponding portion of the input image (e.g., the higher the value of the feature in the featuremap, the more such particular visual pattern is present in the corresponding portion of the input image). This operation is well known from communication engineering, where it is known as "signal detection using matched filters"; see for example Modern Electrical Communications by H. Stark and F. B. Tuteur, Chapter 11.6, Prentice-Hall, 1979 (pages 484-488).
[0016] In a typical convolutional layer of a CNN, the input structure may be formed by CH equally sized channels. Making for example reference to the object classification field, a h×h digital color image may be represented through an RGB model by a set of CH=3 different channels: the first channel (R channel) is a digital image having h×h pixels and corresponding to the red component of the colored digital image, the second channel (G channel) is a digital image having h×h pixels and corresponding to the green component of the colored digital image, and the third channel (B channel) is a digital image having h×h pixels and corresponding to the blue component of the colored digital image.
[0017] Typically, a set of NF kernels is used for a single
convolutional layer. In this case, the output structure will
comprise in turn NF featuremaps.
[0018] Therefore, considering a generic scenario in which the input structure of a convolutional layer is convolved with NF kernels, the number NP_CL of learnable parameters in the layer (weights w plus biases) is equal to:

$$NP_{CL} = NC \cdot NF \cdot (k \cdot k + 1) \qquad (1)$$

wherein NC is the number of input channels of the convolutional layer.
[0019] As can be read for example in OverFeat: Integrated
Recognition, Localization and Detection using Convolutional
Networks, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael
Mathieu, Rob Fergus, and Yann LeCun, arXiv preprint
arXiv:1312.6229, pages 1-15, 2013, an efficient object
classification algorithm--i.e., an algorithm capable of classifying
objects in correct classes with a low classification error--based
on CNN usually comprises several convolutional layers, typically
interleaved with subsampling layers (e.g., the so-called
max-pooling layers), followed by a sequence of final, fully-connected (i.e., non-convolutional) layers acting as a final classifier, which outputs as output structure a classification array providing an indication of the selected class among the available classification classes.
[0020] The number NP_FC of learnable parameters (weights plus biases) of a generic fully connected layer having NU outputs and processing a number N of inputs received from the previous layer is equal to:

$$NP_{FC} = NU \cdot (N + 1) \qquad (2)$$
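As a quick numerical check of equations (1) and (2), the following minimal sketch may help (the layer sizes are illustrative, loosely borrowed from the VGG example discussed below):

```python
def conv_params(nc, nf, k):
    """Learnable parameters of a convolutional layer, equation (1)."""
    return nc * nf * (k * k + 1)

def fc_params(nu, n):
    """Learnable parameters of a fully connected layer, equation (2)."""
    return nu * (n + 1)

print(conv_params(nc=3, nf=64, k=3))      # 1,920 (biases included)
print(fc_params(nu=4096, n=7 * 7 * 512))  # 102,764,544 (biases included)
```

Note that equations (1) and (2) count biases; the per-layer figures in the VGG table below count weights only, which is why the numbers differ slightly.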
[0021] A very important aspect of a CNN regards the way the weights
of the kernels of the various convolutional layers are set. The
efficiency of an object classification algorithm exploiting a CNN
is strictly dependent on the weight w values. If the weight w
values are not correctly set, objects are classified in wrong
classes. In order to set the weights w of a CNN, the CNN is
subjected to a training procedure, such as the so-called
backpropagation training procedure disclosed for example at page
153 of Neural Networks and Learning Machines, 3/E by Simon Haykin,
Prentice Hall (Nov. 18, 2008).
[0022] The backpropagation training procedure provides for two main
phases: the forward phase and the backward phase.
[0023] The forward phase provides for inputting, as input data structure, a training image belonging to a known class to the CNN to be trained, and then comparing the corresponding output--i.e., the output classification array corresponding to the actual weight w values--with the correct known output--i.e., a target classification array corresponding to the correct known class. Since the CNN is not yet trained, the output classification array will generally differ from the target classification array. From this comparison a corresponding loss term is computed, which provides a quantification of the classification error produced by the CNN (i.e., a quantification of how much the output classification array differs from the target classification array).
[0024] Having C different classes, and providing an input training image x belonging to one of said C classes to the CNN, the CNN generates a corresponding output classification array y(x;W), wherein W represents the whole set of weights w included in the CNN (i.e., all the weights w corresponding to all the neurons in all the layers of the CNN). The output classification array y(x;W) generated by the CNN is an array of C elements y_c(x;W), c=1 to C, wherein each element y_c(x;W) corresponds to a respective class IC(c) and has a value providing the probability that such input training image belongs to such specific class IC(c). The target classification array t(x) corresponding to a training image x belonging to class IC(c*) is an array of C elements t_c(x), c=1 to C, wherein t_c(x)=1 for c=c* and t_c(x)=0 for c≠c*.
[0025] In order to calculate the abovementioned loss term, a loss function can be used. A loss function L(y(x;W),t(x)) is a function which depends on the input training image x and on the corresponding output classification array y(x;W). Broadly speaking, a loss function L(y(x;W),t(x)) is a function that allows the classification error to be estimated. The loss function L(y(x;W),t(x)) is such that, given a specific input training image x and given a specific output classification array y(x;W) generated by the CNN in response to this input training image x, the larger the classification error corresponding to said output classification array y(x;W), the higher the value of said loss function L(y(x;W),t(x)).
[0026] For example, considering the simple problem of training a two-output network to predict the horizontal and vertical coordinates of a point on a plane, the loss function for training this particular network could reasonably be defined as the Euclidean distance between the predicted and target coordinates of the point on the plane. In general, the loss function must always be positive or equal to zero, and its actual choice depends on the specific problem the ANN or the CNN solves. Considering the specific problem of object classification, among the most used loss functions are the so-called Mean Square Error (MSE) loss function, for which

$$L(y(x;W), t(x)) = \frac{1}{C}\sum_{c=1}^{C}\left(y_c(x;W) - t_c(x)\right)^2$$

(Lehmann, Erich L., and George Casella. Theory of Point Estimation. Springer Science & Business Media, 2006, page 51), or the so-called cross-entropy loss function, for which

$$L(y(x;W), t(x)) = -\sum_{c=1}^{C} t_c(x)\,\log\left(y_c(x;W)\right)$$

(de Boer, Pieter-Tjerk; Kroese, Dirk P.; Mannor, Shie; Rubinstein, Reuven Y. (February 2005). "A Tutorial on the Cross-Entropy Method". Annals of Operations Research. 134 (1), pages 19-67).
[0027] After the definition of a proper loss function L(y(x;W),t(x)), a cost function J(W;x,t(x)) can be defined in the following way:

$$J(W;x,t(x)) = \eta\, L(y(x;W),t(x)) + \lambda\, R(W) \qquad (3)$$

wherein η is a positive real number called learning rate, λ is a positive real number called decay rate, and R(W) is a regularization function. As is well known to those skilled in the art, the regularization function R(W) is a function that influences the cost function J(W;x,t(x)) by providing a rule--e.g., through the issuing of proper constraints--on how the weights W of the CNN should be distributed and set. The rule given by the regularization function R(W) is added to influence the classification error estimation given by the loss function L(y(x;W),t(x)).
[0028] For example, such rule may provide the constraint of having the weights w lie on a hypersphere (regularization function of the L2 type), or the constraint of keeping the weights w as small as possible (regularization function of the L1 type), as in the sketch below.
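A minimal sketch of these two classical regularizers and of the cost function of equation (3) (the hyperparameter values are illustrative only):

```python
import numpy as np

def l2_regularizer(W):
    """L2-type regularization term: sum of squared weights."""
    return 0.5 * np.sum(W ** 2)

def l1_regularizer(W):
    """L1-type regularization term: sum of absolute weights."""
    return np.sum(np.abs(W))

def cost(loss_value, W, eta=0.01, lam=1e-4, regularizer=l2_regularizer):
    """Cost function of equation (3): J = eta * L + lambda * R(W)."""
    return eta * loss_value + lam * regularizer(W)

W = np.random.randn(100)
print(cost(loss_value=0.5, W=W))
```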
[0029] It is known to those skilled in the art that an effect of using a regularization function R(W) is to reduce the overfitting of the CNN on the training images. A CNN is said to overfit the training data if, during the training procedure, the CNN learns, together with the real information, also undesired noise. Regularization, by providing a constraint on the way the weights of the CNN are set and distributed, helps counteract overfitting by making the boundaries between classes sharper (see Scholkopf, Bernhard, and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002, pages 87-120).
[0030] Different regularization functions R(W) are known in the art, based on the n-th order norm of the weights w, such as for example the L0 regularization function (see Guo, Kaiwen, et al. "Robust non-rigid motion tracking and surface reconstruction using L0 regularization." Proceedings of the IEEE International Conference on Computer Vision, 2015, pages 3083-3091), [0031] the L1 regularization function (see Park, Mee Young, and Trevor Hastie. "L1-regularization path algorithm for generalized linear models." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69.4 (2007): 659-677), or the L2 regularization function (see Bilgic, Berkin, et al. "Fast image reconstruction with L2-regularization." Journal of Magnetic Resonance Imaging 40.1 (2014), pages 181-191).
[0032] The backward phase of the training procedure provides for
calculating the gradient of the cost function J(W;x,t(x)) with
respect to the weights w for all the layers from the last layer to
the first layer.
[0033] The forward phase and the backward phase are repeated a very large number of times, using a very large number of training images, for example taken from a training database.
[0034] Immediately after each backpropagation phase, the weights w may be updated using a gradient descent operation in order to minimize the cost function J(W;x,t(x)). In this way, the updating of the weights w is done each time a new training image of the training database is input to the CNN to be trained. Alternatively, the updating of the weights w may be done by firstly applying the forward and backward phases to all the training images of the training database to produce an average of the gradients before updating the weights w. An intermediate alternative may also be contemplated, in which the updating of the weights w is carried out each time the forward and backward phases are carried out on a subset (referred to as a "mini-batch") comprising a fixed number of training images of the training database.
[0035] In this context, with the term "epoch" it is intended the
number of times the entire training database has been inputted to
the CNN during the training procedure. Usually, hundreds of epochs
are required to make a CNN to converge to its maximum
classification accuracy.
[0036] The weights w are typically represented by real values
hosted in the main memory of the processing system (e.g., computer)
used to train the CNN through the abovementioned training
procedure. Such values can be permanently saved, for example, into
storage memory of the processing system, either at intermediate
steps of the training procedure (e.g., after each mini-batch is
processed or after each epoch of the training procedure) or at the
end of the training procedure.
[0037] The knowledge of the CNN topology and of the relative learned weights w allows a trained CNN to be completely represented. Once a CNN has been trained, the trained CNN can be used on the same computer to process different data samples, or it can be transferred to a different computer.
[0038] Training a CNN according to the procedure described above is a resource-intensive process that is practically possible only thanks to the availability of ad-hoc hardware resources. Such resources include ad-hoc designed ICs, Field Programmable Gate Arrays (FPGAs) or Graphical Processing Units (GPUs) equipped with large numbers of computational cores and, most importantly, billions of memory cells dedicated to hosting the learned weights w of the CNN. For example, modern GPUs designed specifically for training deep CNN include tens of gigabytes of fast, dedicated memory to permanently store the large number of weights w of a CNN (and the relative error gradients at backpropagation time).
[0039] The following table provides a parameter (W, weights plus biases) count and the relative memory requirements for each of the 16 layers of an exemplary VGG architecture disclosed in the paper Very deep convolutional networks for large-scale image recognition by Simonyan, Karen, and Andrew Zisserman, arXiv preprint arXiv:1409.1556 (2014).
TABLE-US-00001
Layer [output size] | Memory (values) | Trainable parameters
INPUT [224x224x3] | 224*224*3 = 150K | 0
CONV3-64 [224x224x64] | 224*224*64 = 3.2M | (3*3*3)*64 = 1,728
CONV3-64 [224x224x64] | 224*224*64 = 3.2M | (3*3*64)*64 = 36,864
POOL2 [112x112x64] | 112*112*64 = 800K | 0
CONV3-128 [112x112x128] | 112*112*128 = 1.6M | (3*3*64)*128 = 73,728
CONV3-128 [112x112x128] | 112*112*128 = 1.6M | (3*3*128)*128 = 147,456
POOL2 [56x56x128] | 56*56*128 = 400K | 0
CONV3-256 [56x56x256] | 56*56*256 = 800K | (3*3*128)*256 = 294,912
CONV3-256 [56x56x256] | 56*56*256 = 800K | (3*3*256)*256 = 589,824
CONV3-256 [56x56x256] | 56*56*256 = 800K | (3*3*256)*256 = 589,824
POOL2 [28x28x256] | 28*28*256 = 200K | 0
CONV3-512 [28x28x512] | 28*28*512 = 400K | (3*3*256)*512 = 1,179,648
CONV3-512 [28x28x512] | 28*28*512 = 400K | (3*3*512)*512 = 2,359,296
CONV3-512 [28x28x512] | 28*28*512 = 400K | (3*3*512)*512 = 2,359,296
POOL2 [14x14x512] | 14*14*512 = 100K | 0
CONV3-512 [14x14x512] | 14*14*512 = 100K | (3*3*512)*512 = 2,359,296
CONV3-512 [14x14x512] | 14*14*512 = 100K | (3*3*512)*512 = 2,359,296
CONV3-512 [14x14x512] | 14*14*512 = 100K | (3*3*512)*512 = 2,359,296
POOL2 [7x7x512] | 7*7*512 = 25K | 0
FC [1x1x4096] | 4096 | 7*7*512*4096 = 102,760,448
FC [1x1x4096] | 4096 | 4096*4096 = 16,777,216
FC [1x1x1000] | 1000 | 4096*1000 = 4,096,000
[0040] The CNN includes over 100 million trainable parameters, for a total footprint of about 132*k Megabytes (wherein k is the size in bytes of a parameter) of memory for a trained CNN. The table shows that most of the parameters are included in the first fully connected layer. This finding is not surprising, as the number of parameters of the first fully connected layer is proportional to the number of features outputted by the last convolutional layer.
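The per-layer counts in the table can be verified with a few lines of Python (note that, unlike equation (1), the table counts weights only, without biases):

```python
def vgg_conv_weights(k, c_in, n_filters):
    """Weights of one VGG convolutional layer as counted in the table."""
    return k * k * c_in * n_filters

assert vgg_conv_weights(3, 3, 64) == 1_728          # first CONV3-64 layer
assert vgg_conv_weights(3, 512, 512) == 2_359_296   # a CONV3-512 layer
assert 7 * 7 * 512 * 4096 == 102_760_448            # first fully connected layer
```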
[0041] In order to improve the image classification accuracy
offered by a CNN, the number of layers should be increased, for
example up to 152. An architecture of this kind involves the
training of a very large amount of weights w, resulting in even
larger memory requirements, which may easily exceed the amount of
memory available on the processing system used for the training
procedure. For this reason, it is known to use multiple GPUs in
parallel to train a CNN, where every GPU trains a subset of the CNN
(e.g., each GPU trains a subset of the layers). For example, paper
Outrageously large neural networks: The sparsely-gated
mixture-of-experts layer by Shazeer, Noam, Azalia Mirhoseini,
Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
Dean, arXiv preprint arXiv:1701.06538 (2017) describes a CNN
trained over an array of multiple GPUs arranged in computing
clusters. The total number of parameters of such architecture is
very high, being equal to 137 billion.
[0042] While CNN are usually trained over server-grade GPUs for the
above mentioned reasons, there is great interest in using
previously trained CNN on mobile devices, such as tablets or
smartphones.
[0043] In the following, the process of using a previously trained
CNN over a processing device for classifying images will be
referred to as "deploying" the (previously trained) CNN over such
processing device. The processing device over which the CNN is
deployed may be the same processing device used to train the CNN
(e.g., a powerful server computer) or may be a different one (e.g.,
a laptop PC or a smartphone). It has to be appreciated that the training images used to train the CNN would generally be different from the images the CNN should classify when deployed over a processing device for its normal use.
[0044] For example, in a typical object classification scenario, a battery-operated mobile device with wireless communication capabilities could locally process the data acquired by local sensors by means of a CNN and take autonomous decisions without communicating with a central server, thus saving battery energy. In another scenario, the user of a mobile phone could take a snapshot of an object. Then, a pre-trained CNN could be used to determine the object type and brand by running the trained CNN over the mobile device GPU. Finally, the user would be redirected via the mobile web browser to an e-shopping site where the user would be able to purchase the object, possibly comparing among different possible options.
[0045] From the examples above, it is clear that there is an
interest in being able to deploy trained CNN on mobile and other
embedded devices.
[0046] The present invention addresses one of the main problems
that arise when a trained network is deployed over smartphones and
other embedded devices.
[0047] In fact, while modern mobile devices typically include ad-hoc ICs or GPUs, which provide them with sufficient computational capabilities to execute even complex CNN, such devices have a limited storage memory capability due to constraints related to power consumption, price and physical dimensions. Similarly, the amount of working memory of GPUs commonly found in mobile devices is typically one order of magnitude lower than the amount of working memory of their server counterparts. Moreover, in mobile devices such memory is often not dedicated for exclusive use by the GPU; rather, it is shared with the central CPU, which may permanently use a significant fraction thereof for other device core functionalities.
[0048] In view of the above, it can be understood that the limited memory capabilities of mobile devices put a strong constraint on the maximum number of parameters of a CNN, or of a generic neural network, when the CNN (or generic neural network) has to be deployed on mobile devices.
[0049] In order to reduce the memory requirements for deploying a trained CNN, so as to allow the CNN to be deployed also on devices having low memory capabilities, several simple solutions have already been proposed.
[0050] For example, according to a first known solution (see Gupta,
Suyog, et al. "Deep learning with limited numerical precision"
International Conference on Machine Learning. 2015), trained
parameters (which are real numbers) may be stored with a reduced
precision. For example, trained parameters may be stored according
to the recently standardized IEEE half-precision format rather than
with the typical single precision format.
[0051] Another known approach (see Han, Song, Jeff Pool, John Tran, and William Dally. "Learning both weights and connections for efficient neural network." In Advances in Neural Information Processing Systems, pages 1135-1143, 2015) consists of simplifying the network topology (e.g., by removing layers or units from the layers) when the training procedure is carried out, until the trained CNN meets the capabilities of the target mobile device where the CNN is to be deployed.
[0052] A possible option for reducing the memory requirements of a CNN without affecting its topology is carrying out a training procedure directed to increase the sparsity of the CNN. As is well known to those skilled in the art, the sparsity of a CNN is defined as the ratio between the number of parameters of the CNN equal to zero and the total number of parameters of the CNN (see Tewarson, Reginald P. (May 1973). Sparse Matrices (Part of the Mathematics in Science & Engineering series). Academic Press Inc., pages 1-13).
[0053] A CNN where a large number of parameters are equal to zero, i.e., with a high sparsity, is said to be sparse. The sparser a CNN, the lower its memory space requirement.
[0054] It has been observed that if a training procedure is carried out using the abovementioned L1 regularization function R(W), several weights w of the CNN take very low values, close to zero.
[0055] The paper by Han, Song, Jeff Pool, John Tran, and William Dally proposes a three-stage approach to train sparse network topologies. Firstly, a network is trained via backpropagation, but instead of setting the actual weights by minimizing some target loss function, a coarse measurement of each connection's importance is carried out. Then, all connections with an importance index below some threshold are set to zero (pruned). Finally, the resulting network is trained with standard backpropagation so as to set the actual weights.
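A minimal sketch of the pruning step of this three-stage approach, using the absolute weight magnitude as the importance index (an illustrative assumption; the paper's importance measure is coarser):

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out connections whose importance index falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(1000)
w_pruned, mask = magnitude_prune(w, threshold=0.5)
print(f"sparsity: {1.0 - mask.mean():.2f}")  # fraction of zeroed parameters
# the final retraining stage would then update only the surviving weights
```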
[0056] In "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" by Nitish Srivastava, Geoffrey Hinton, Alex
Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, Journal of
Machine Learning Research 15 (2014)) it is shown how dropout usage
enhances learning and causes sparsification.
[0057] In "Sparse Convolutional Neural Networks" by Baoyuan Liu,
Min Wang, Hassan Foroosh, 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 7-12 Jun. 2015, a decomposition
method for convolutional layers of a CNN is proposed, introducing
the concept of "Sparse Convolutional Neural Network". In
particular, 4-dimensional convolutions are transformed into
2-dimensional matrix multiplications. The 2D matrices yielded by
such proposed decomposition are highly sparse (90% of sparsity
reported), implying that high parameter redundancy exists for
high-dimension kernels. According to this high sparsity, a custom
sparse multiplication algorithm is also proposed to maximize cache
hits.
SUMMARY OF THE INVENTION
[0058] The Applicant has found that the abovementioned known
solutions for reducing the memory occupation of a CNN, or a generic
neural network, in order to allow such CNN to be deployed on a
processing device equipped with low memory capabilities are not
efficient.
[0059] For example, the known solution by Gupta, Suyog, et al., which provides for storing the trained parameters with a reduced precision, has been observed to provide only very modest memory savings.
[0060] The known solution by Han, Song, Jeff Pool, John Tran, and William Dally, which provides for simplifying the network topology when the training procedure is carried out, has the disadvantage of requiring the training of a separate CNN having a specific (simplified) network topology for mobile devices. Moreover, such solution also impairs the CNN performance or severely limits the size of the input data structure (e.g., the image to be classified) that the CNN is able to process.
[0061] Moreover, Applicant has found that even if very low weights
w, obtained at the end of a training procedure which exploited the
abovementioned L1 regularization function, are set to zero (for
example, by means of a threshold based pruning operation), the
performance of the resulting pruned CNN is impaired.
[0062] The paper by Han, Song, Jeff Pool, John Tran, and William Dally discloses an approach which provides reduced complexity for obtaining better performance by avoiding network over-parametrization. However, this approach forces the execution of a refinement process after the learning step. This disadvantageously brings the new, less-parametrized network into an energy minimum which strongly depends on the learning dynamics of the "over-parametrized" network. Furthermore, the parameter pruning is carried out without taking into account the real activity which depends on the parameters.
[0063] The paper by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov discloses an approach which produces a network that is still heavily over-parametrized.
[0064] The paper by Baoyuan Liu, Min Wang, Hassan Foroosh simply provides results concerning the sparsity of convolutional layers.
[0065] In view of the above, Applicant has tackled the problem of
providing a neural network training procedure (and a corresponding
training system) which generates a trained neural network having a
reduced memory footprint (i.e., which requires a reduced amount of
memory to store its parameters) and at the same time is capable of
operating with sufficiently high performance.
[0066] Applicant has found that this can be achieved by properly
modifying the regularization function of the cost function used for
training a neural network.
[0067] In this regard, Applicant has devised a regularization function which promotes low values of the weights in the neural network, without reducing the performance of the network itself, by taking into account a rate of change of the neural network output caused by variations of the weights.
[0068] An aspect of the present invention relates to a method.
[0069] According to an embodiment of the present invention, the
method comprises providing a neural network having a set of weights
and being configured to receive an input data structure for
generating a corresponding output array according to values of said
set of weights.
[0070] According to an embodiment of the present invention, the
method further comprises training the neural network to obtain a
trained neural network.
[0071] According to an embodiment of the present invention, said
training comprising setting values of the set of weights by means
of a gradient descent algorithm which exploits a cost function
comprising a loss term and a regularization term.
[0072] According to an embodiment of the present invention, the
method further comprises deploying the trained neural network on a
device through a communication network.
[0073] According to an embodiment of the present invention, the
method further comprises using the deployed trained neural network
on the device.
[0074] According to an embodiment of the present invention, the
regularization term is based on a rate of change of elements of the
output array caused by variations of the set of weights values.
[0075] According to an embodiment of the present invention, said
regularization term is based on a sum of penalties each one
penalizing a corresponding weight of the set of weights, each
penalty being based on the product of a first factor and a second
factor.
[0076] According to an embodiment of the present invention, said
first factor is based on a power of said corresponding weight,
particularly a square of the corresponding weight.
[0077] According to an embodiment of the present invention, said
second factor is based on a function of how sensitive the output
array is to a change in the corresponding weight.
[0078] According to an embodiment of the present invention, said
function corresponds to the average of absolute values of
derivatives of output elements of the output array with respect to
the weight.
[0079] According to an embodiment of the present invention, said
training comprises, for each one among a plurality of training
input data structures, comparing the output array generated by the
neural network according to said training input data structure with
a corresponding target output array having only one nonzero
element.
[0080] According to an embodiment of the present invention, said
function corresponds to the absolute value of the derivative, with
respect to the weight, of an element of the output array
corresponding to said one nonzero element of the target output
array.
[0081] According to an embodiment of the present invention, said
training comprises calculating a corresponding updated weight from
each weight of the set of weights by subtracting from said weight:
[0082] a first term based on the derivative of the loss term with
respect to the weights, and [0083] a second term based on the
product of said weight and a further function, said further
function being equal to one minus said function if said function is
not higher than one, and being equal to zero if said function is
higher than one.
[0084] According to an embodiment of the present invention, said
training further comprises setting to zero weights of the set of
weights having a value lower than a corresponding threshold.
[0085] According to an embodiment of the present invention, the method further comprises setting said threshold to a selected one between (both options are sketched below): [0086] a threshold value based on the mean of the non-zero weights; [0087] a threshold value such that a ratio of a first set size to a second set size is equal to a constant, wherein said first set size is the number of nonzero weights whose absolute values are smaller than or equal to said threshold value and said second set size is equal to the number of all nonzero weights.
[0088] According to an embodiment of the present invention, said
deploying the trained neural network on a device comprises sending
non-zero weights of the trained neural network to the device
through said communication network.
[0089] According to an embodiment of the present invention, said
using the deployed trained neural network on the device comprises
using the deployed trained neural network with an application for a
visual object classification running on the device.
[0090] According to an embodiment of the present invention, said
device is a mobile device.
[0091] According to an embodiment of the present invention, said
device is a processing device of a control system of a self-driving
vehicle.
BRIEF DESCRIPTION OF THE DRAWINGS
[0092] These and other features and advantages of the present
invention will be made evident by the following description of some
exemplary and non-limitative embodiments thereof, to be read in
conjunction with the attached drawings, wherein:
[0093] FIG. 1 illustrates a portion of a CNN adapted to be used in
an object classification procedure wherein the concepts according
to embodiments of the present invention can be applied;
[0094] FIG. 2 is a flow chart illustrating in terms of functional
blocks a training procedure directed to set the weights of the
layers of the CNN of FIG. 1 according to an embodiment of the
present invention;
[0095] FIGS. 3-5 are flow charts illustrating in terms of
functional blocks the main phases of sub-procedures of the training
procedure of FIG. 2 according to an embodiment of the present
invention;
[0096] FIG. 6 illustrates in terms of very simplified functional
blocks an exemplary application scenario of the solution according
to embodiments of the present invention.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS OF THE INVENTION
[0097] FIG. 1 illustrates a portion of a CNN 100 adapted to be used
in an object classification procedure wherein the concepts
according to embodiments of the present invention can be
applied.
[0098] The CNN 100 is configured to receive as input a digital
image x (input image) depicting an object, and to select an
appropriate class for the object depicted in the input image x
among a plurality of C predefined image classes IC(c) (c=1, 2, . . . , C), such as for example:
[0099] IC(1)=person image class;
[0100] IC(2)=cat image class;
[0101] IC(3)=dog image class;
[0102] IC(4)=car image class;
[0103] IC(5)=house image class;
[0104] . . . .
[0105] For this purpose, the CNN 100 is designed to process the
input image x in order to generate a corresponding classification
array which provides an indication about a selected image class
IC(c) among the available predefined ones. The classification array
will be denoted by y(x;W). Here, W represents the whole set of weights w of the CNN, W={w_1, . . . , w_K}, where it is assumed that the CNN has a total of K weights. The CNN is organized in layers, and the l-th layer contains a subset of the weights W_l ⊂ W.
[0106] For example, the classification array y(x;W) comprises C elements y_c(x;W), each one corresponding to an image class IC(c) and having a value indicating the probability that the input image x depicts an object belonging to that image class IC(c). Said value may either represent directly such a probability, or it may be successively transformed into such a probability by applying a mapping such as the softmax function; see Bridle, J. S. (1990). "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters". In Advances in Neural Information Processing Systems (pages 211-217).
[0107] In order to simplify the description, the CNN 100 of FIG. 1
is adapted to process an input image having a single channel (e.g.,
a gray-scale image), in which each layer of the CNN 100 provides
for a single kernel, and in which each layer outputs a single
channel. However, similar considerations apply if the input image
has more than one channel and more than one kernel are used in at
least some layers.
[0108] The input image x is a digital image having h×h pixels (e.g., h may be equal to 112). Similar considerations apply if the input image x has a different resolution (i.e., if it includes a different number of pixels). In this regard, even if reference has been made to a square input image, similar considerations apply in case the input image has a different shape, such as a rectangular shape.
[0109] The CNN 100 comprises an ordered sequence of L layers 120(l) (l=1, 2, . . . L), with the generic layer 120(l) of the sequence which is configured to:
[0110] receive from the previous layer 120(l-1) of the sequence a corresponding input structure 110(l-1) comprising a corresponding set of h(l-1)×h(l-1) features;
[0111] process said received input structure 110(l-1) by exploiting a kernel 130(l) comprising a corresponding set of k(l)×k(l) weights w; and
[0112] generate a corresponding output structure 110(l) comprising a corresponding set of (h(l)-k(l)+1)×(h(l)-k(l)+1) features.
[0113] The first layer 120(1) of the CNN 100 is configured to receive as input structure the input image x comprising a corresponding set of h(0)×h(0) pixels.
[0114] Each layer 120(l) of the sequence is a convolutional layer configured to carry out a convolution procedure for generating an output structure 110(l) from the received input structure 110(l-1) using the kernel 130(l), as is well known to those skilled in the art.
[0115] Some of the convolutional layers 120(l) of the sequence may be followed by a corresponding max-pooling layer (not illustrated), which is configured to carry out a subsampling procedure directed to generate a subsampled version of the structure 110(l) received from the convolutional layer 120(l). The subsampling procedure provides for spanning a movable selection window over the structure 110(l) in order to select corresponding sets of features, and for generating, for each selected set of features, a corresponding feature having the highest value among the ones of the selected set. The purpose of this subsampling procedure is to allow for some degree of translation invariance and to reduce the computational requirements for the following layers 120(l) of the sequence. Similar considerations apply if the subsampling procedure is carried out in a different way, such as for example by calculating the average among the values of the selected set of features.
[0116] The CNN 100 further comprises r additional layers 150(1),
150(2), . . . , 150(r) of the fully-connected type, i.e.,
non-convolutional layers designed to generate output structures
from input structures wherein each output value of the output
structure is a function of all the input values of the input
structure. The additional layers 150(1), 150(2), . . . , 150(r) act
as final classifiers having a number of output neurons equal to the
number of possible predefined image classes IC(c), so that each
output neuron is associated to a specific one among the predefined
image classes IC(c).
[0117] The first additional layer 150(1) is designed to receive as
input structure the output structure 110(L) generated by the last
layer 120(L), while the last additional layer 150(r) is designed to
generate as output structure the classification array y(x;W).
[0118] FIG. 2 is a flow chart illustrating in terms of functional blocks a training procedure 200 directed to set the weights W_l of the layers of the CNN 100 according to an embodiment of the present invention.
[0119] In this regard, it has to be appreciated that while in the present description reference will be made to CNNs (i.e., convolutional neural networks), the training procedure 200 according to an embodiment of the present invention can be directly applied to non-convolutional neural networks of any kind.
[0120] The training procedure 200 comprises a sequence of three
main sub-procedures 210, 220, 230.
[0121] The first sub-procedure 210 is directed to initially train
the CNN 100 according to a standard backpropagation training
procedure which exploits a cost function J(W;x,t(x))--wherein t(x)
is the target classification array t(x) corresponding to the input
image x--without any regularization function.
[0122] The second sub-procedure 220 is directed to further train
the CNN 100 which was subjected to the training procedure of the
first sub-procedure 210 with a backpropagation training procedure
which exploits a cost function J(W;x,t(x)) having a regularization
function R(W;x) according to an embodiment of the present
invention. As will be described in greater detail in the following
of the present description, according to an embodiment of the
present invention the regularization function R(W;x) depends both
on the weights W and on the input image x, unlike already known
regularization functions R(W), which depend only on the set of
weights W.
[0123] The third sub-procedure 230 is directed to further train the
CNN 100 which was subjected to the training procedure of the second
sub-procedure 220 with a backpropagation training procedure
exploiting the same cost function J(W;x,t(x)) used in the second
sub-procedure 220. However, unlike the first two sub-procedures
210, 220, according to an embodiment of the present invention,
during the third sub-procedure 230 a pruning operation is also
performed, directed to set to zero those weights w that have a
value lower than a corresponding threshold.
[0124] The purpose of the first sub-procedure 210 (which exploits a cost function without a regularization function) is to speed up the process and reach a result faster. However, according to a further embodiment of the present invention, the first sub-procedure 210 may be skipped. Thus, according to this further embodiment of the present invention, the training procedure 200 may directly start with the second sub-procedure 220 (which exploits a cost function comprising the regularization function according to an embodiment of the present invention).
[0125] Skipping the first sub-procedure 210 may be useful in case the training procedure 200 is applied to an already trained CNN. This is a quite common case in the Deep Learning field, wherein the starting point is often an already trained CNN which is then further trained to adapt it to different classes. Another case in which the first sub-procedure 210 could be advantageously skipped is when the CNN to be trained has just been initialized with random weights. In this case, however, it may be useful to set the decay rate parameter λ (see equation (3)) to a lower value than usual.
[0126] The sub-procedures 210, 220, 230 will now be described in greater detail making reference to FIGS. 3-5. The three sub-procedures 210, 220, 230 provide for updating the weights W each time the forward and backward phases have been carried out on a set TI(p) (p=1, 2, 3 . . . ) of input images x (in this case also referred to as "input training images") belonging to a training database. Each set TI(p) contains |TI(p)| images.
[0127] It has to be appreciated that in the following of the
description it is assumed that the weights W are updated every time
forward and backward phases have been carried out on a single input
training image of the training database, generating a gradient of
the cost function (i.e., the set TI(p) contains only a single
image). However, similar considerations apply in case the weights W
are updated after processing a generic set TI(p) of training
images, in which case a gradient is computed for each of the
training images in TI(p) and the weights in W are updated by means
of the average gradient across TI(p). It follows that the
considerations apply also in case a single set TI(p) of input
training images is present comprising all the training images of
the training database, so that the weights W are updated every time
forward and backward phases have been carried out on all the
training images of the training database.
[0128] FIG. 3 is a flow chart illustrating in terms of functional
blocks the main phases of the sub-procedure 210.
[0129] The first phase of the sub-procedure 210 provides for selecting, from a set TI(p) of input training images belonging to the training database, an input training image x, and providing such input training image x as input data structure to the CNN 100 to be trained (block 310). The training database comprises known training images, i.e., images for which it is known what class IC(c=c*) the object depicted therein belongs to among the available classes IC(c).
[0130] In the next phase (block 320), the CNN 100 processes the received input training image x in order to generate a corresponding output classification array y(x;W).
[0131] As already mentioned above, having C different classes IC(c), the output classification array y(x;W) is an array comprising C elements y_c(x;W), each one corresponding to an image class IC(c) and having a value that either represents the probability that the input image x depicts an object belonging to that image class IC(c), or may be transformed into such a probability by applying a mapping such as the softmax function.
[0132] In the following phase (block 330), a cost function J(W;x,t(x))=η L(y(x;W),t(x)) is calculated (see equation 3). In order to calculate the cost function J(W;x,t(x)), any known loss function L(y(x;W),t(x)) can be used, such as the MSE loss function, for which

$$L(y(x;W), t(x)) = \frac{1}{C}\sum_{c=1}^{C}\left(y_c(x;W) - t_c(x)\right)^2 \qquad (4)$$

[0133] or the cross-entropy loss function, for which

$$L(y(x;W), t(x)) = -\sum_{c=1}^{C} t_c(x)\,\log\left(y_c(x;W)\right) \qquad (5)$$

wherein t(x) is the classification array corresponding to an input training image x belonging to class IC(c*) and comprises C elements t_c(x), c=1 to C, wherein t_c(x)=1 for c=c* and t_c(x)=0 for c≠c*.
[0134] Phases corresponding to blocks 310, 320, 330 correspond to
the forward portion of the sub-procedure 210.
[0135] The backward portion of the sub-procedure 210 (block 340) provides for calculating the gradient of the cost function J(W;x,t(x)) with respect to the weights W for all the layers, recursively from the last layer to the first layer, according to the following equation:

$$\nabla_W J(W;x,t(x)) = \left[\frac{\partial J(W;x,t(x))}{\partial w_1}, \frac{\partial J(W;x,t(x))}{\partial w_2}, \ldots\right] \qquad (6)$$
[0136] Then, the set of weights W of the CNN 100 is updated (block 370). The updating of the weights W is carried out using a gradient descent operation in order to minimize the cost function J(W;x,t(x)) according to the following equation:

$$w^{p+1} = w^p - \frac{\partial J(W;x,t(x))}{\partial w} \qquad (7)$$

wherein w^p is the generic weight of the array W^p of weights w of the CNN 100 in which the training images x of the p-th set TI(p) are used, and w^{p+1} is the generic updated weight w of the CNN 100.
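Equation (7) amounts to the following one-line update (a sketch; here the learning rate η is already folded into J via equation (3)):

```python
import numpy as np

def sgd_step(w, grad_J):
    """One gradient descent update per equation (7): w^{p+1} = w^p - dJ/dw."""
    return w - grad_J

w = np.random.randn(3, 3)       # one layer's kernel
grad_J = np.random.randn(3, 3)  # gradient from the backward phase (block 340)
w = sgd_step(w, grad_J)
```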
[0137] At this point, a check is made (block 380) in order to check
whether the considered image x is the last one in the training set
or not.
[0138] In case the considered image is not the last one (exit block
N of block 380), a next training image x is selected (block 390),
and all the previously described operations are carried out using
this new training image (returning to block 310) producing a
version of the CNN 100 having the updated weights W.sup.p+2.
[0139] In case the considered image is the last one (exit block Y
of block 380), it means that all the training images of the
training database have been used. In this case, it is said that an
epoch of the training phase has passed.
[0140] The sub-procedure 210 described herein may be repeated for
several epochs, such as tens of epochs, in order to provide the
entire training database to the CNN 100 several times.
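For illustration, one epoch of the sub-procedure 210 may be sketched as
follows (a sketch under the assumption that cnn is a PyTorch module and
train_set yields (x, t(x)) pairs; these names are hypothetical, and the
MSE loss of equation (4) is used merely as an example):

    import torch

    def train_one_epoch_210(cnn, train_set, eta=0.1):
        # One epoch of sub-procedure 210 (blocks 310-390).
        for x, t in train_set:                  # blocks 310/390: next image
            y = cnn(x)                          # block 320: forward pass
            J = eta * ((y - t) ** 2).mean()     # block 330: J = eta*L (eqs. 3, 4)
            cnn.zero_grad()
            J.backward()                        # block 340: gradients (eq. 6)
            with torch.no_grad():
                for w in cnn.parameters():      # block 370: update (eq. 7)
                    w -= w.grad
        return cnn

Note that equation (7) carries no separate learning rate because the
factor η is already included in the cost function J.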
[0141] FIG. 4 is a flow chart illustrating in terms of functional
blocks the main phases of the sub-procedure 220 according to an
embodiment of the present invention.
[0142] According to an embodiment of the present invention, the
sub-procedure 220 is substantially the same as the sub-procedure
210 previously described with reference to FIG. 3, with the main
difference that the cost function J(W;x,t(x)) used is different,
including a regularization function R(W;x) which depends on the
input image x and the array of weights W.
[0143] The first phase of the sub-procedure 220 provides for
selecting an input training image x, and providing such input
training image as input data structure to the CNN 100 to be trained
(block 410).
[0144] In the next phase (block 420), the CNN 100 processes the
received input training image x in order to generate a
corresponding output classification array y.
[0145] According to an embodiment of the present invention, the
training is based on the following cost function
J(W;x,t(x)) = \eta L(y(x;W), t(x)) + \lambda R(W;x)   (8)
(see equation 3). This cost function comprises a regularization
function R(W;x) according to an embodiment of the present
invention.
[0146] The regularization function R(W;x) is a function of the
input image x and the array of weights W. According to an
embodiment of the present invention, the regularization function
promotes low values of the weights in the CNN 100 whenever this may
be obtained without reducing the performance of the CNN 100. The
regularization function R(W;x) according to an embodiment of the
present invention influences the cost function J(W;x,t(x)) by
providing a selective penalty to any weight of the CNN 100.
[0147] The regularization function is a sum of costs (or penalties)
assigned to each weight in proportion to a product which comprises
two factors. The first factor is the squared value of the weight;
the second factor is a function of how sensitive the output is to a
change in the weight, i.e., the second factor provides a
quantification of how much the output varies with respect to a
change in the weight. The resulting penalty will be large if both
factors are large: this will occur when the weight has a large
value and the sensitivity to change is low. Conversely, the penalty
will be small or even zero when the weight has a small value or the
sensitivity to change is high. The regularization function is
continuous in the weight w and takes on nonnegative penalty values.
In an embodiment, the regularization function is a sum of the
penalties from all weights in W:
R(W;x) = \frac{1}{2} \sum_{w \in W} w^2 \, H(S(w;x,W))   (9)
[0148] Here, the two factors discussed above are the squared weight
w.sup.2 and the function H of sensitivity S. The functions H and S
will be defined below.
[0149] In order to introduce such a regularization function, it is
necessary to define a function S(w;x,W), hereinafter referred to as
"sensitivity function". This function quantifies the rate of change
in the output of the CNN 100--e.g., the output classification array
y(x;W)--as caused by variations of the weight w. The sensitivity
function can be defined as the average of the absolute derivatives
of the elements y.sub.c(x;W) of the classification array y(x;W)
with respect to the weight w, provided that the weight array is W
and the input image is x:

S(w;x,W) = \frac{1}{C} \sum_{c=1}^{C} \left| \frac{\partial y_c(x;W)}{\partial w} \right|   (10)
[0150] Given a specific weight configuration W*, of which the
weight in consideration is w*, the higher the value of the
sensitivity S(w*;x,W*), the higher the variation of the output
classification array y(x;W*) in the presence of variations of the
weight w in the neighborhood of w*. Hence, if the value of
S(w*;x,W*) is large, then a change in the weight w* may lead to
substantial changes in the output classification array y(x;W).
Conversely, if the value of S(w*;x,W*) is small, then a change in
the weight w* has a small impact on the output classification
array.
[0151] It has to be observed that S(w;x,W) is a local measure, in
the sense that a weight may have different sensitivity values at
two different weight values w=a and w=b.
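One possible way to evaluate the sensitivity function of equation (10)
is through automatic differentiation, as in the following sketch
(PyTorch; it performs one backward pass per output element, which is
adequate for illustration though not optimized):

    import torch

    def sensitivity(cnn, x):
        # S(w;x,W) of equation (10): the average over the C output
        # elements of the absolute derivative of y_c(x;W) with respect
        # to each weight.
        params = list(cnn.parameters())
        y = cnn(x).flatten()
        C = y.numel()
        S = [torch.zeros_like(p) for p in params]
        for c in range(C):
            grads = torch.autograd.grad(y[c], params, retain_graph=True)
            for s, g in zip(S, grads):
                s += g.abs() / C    # accumulate (1/C) * |dy_c/dw|
        return S   # one sensitivity tensor per weight tensor of the CNN

Each entry of the returned tensors is the sensitivity of the
corresponding weight at the current weight values, consistent with the
observation above that S(w;x,W) is a local measure.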
[0152] According to an embodiment of the present invention, the
regularization function R(W;x) is a function whose derivative with
respect to the weight w is equal to:
\frac{\partial R(W;x)}{\partial w} = w \, H(S(w;x,W))   (11)

where

H(S(w;x,W)) = \begin{cases} 1 - S(w;x,W), & 0 < S(w;x,W) \le 1 \\ 0, & S(w;x,W) > 1 \end{cases}   (12)
[0153] In view of the above, the next phase of the procedure
according to an embodiment of the present invention (block 440)
provides for calculating the derivative of the cost function
J(W;x,t(x)) with respect to the weight w according to the following
equation:
\frac{\partial J(W;x,t(x))}{\partial w} = \eta \frac{\partial L(y(x;W),t(x))}{\partial w} + \lambda \frac{\partial R(W;x)}{\partial w} = \eta \frac{\partial L(y(x;W),t(x))}{\partial w} + \lambda w H(S(w;x,W))   (13)
[0154] The updating of each weight w in the set of weights W is
carried out using a gradient descent operation in order to minimize
the cost function J(W;x,t(x)) according to the following
equation:

w^{p+1} = w^p - \eta \frac{\partial L(y(x;W^p),t(x))}{\partial w} - \lambda w^p H(S(w^p;x,W^p))   (14)

[0155] wherein W^p is the array of weights w of the CNN 100 in
which the training images x of the p-th set TI(p) are used, w^p
is the generic weight of the array of weights W^p, and
w^{p+1} is the generic updated weight w of the CNN 100. At this
point, the weights are updated into W^{p+1} (block 470).
[0156] According to equation (14), the updated weight w^{p+1} is
obtained from the previous weight w^p. Said operation amounts
to subtracting a first term proportional to the loss function
derivative (i.e., proportional to the classification error produced
by the CNN 100), and subtracting a second term which may range from
0 (H(S(w;x,W)) = 0) to λw^p (H(S(w;x,W)) = 1). The rationale is
as follows. If the network output is insensitive to a change in a
given weight w^p, then the value of S is small and consequently
the value of H is near 1; the regularization therefore contributes
to changing w^p towards zero by an amount of approximately
λw^p. Conversely, if the network output is very sensitive
to a change in a given weight, then the value of S is high and
consequently the value of H is small or zero; the term
λw^p H(S(w^p;x,W^p)) is therefore also small or
zero, and the regularization does not contribute to a change in the
weight. The result of this process is that all the weights that are
not useful for classifying objects are pushed toward zero.
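Combining equations (12) and (14), the update of block 470 may be
sketched as follows (a sketch reusing the hypothetical sensitivity
helper above; the MSE loss is again used only as an example):

    import torch

    def h_of_s(S):
        # Equation (12): H(S) = 1 - S for 0 < S <= 1, and 0 for S > 1.
        return torch.clamp(1.0 - S, min=0.0)

    def regularized_update(cnn, x, t, eta=0.1, lam=1e-5):
        # Blocks 420-470: one update step according to equation (14).
        y = cnn(x)
        L = ((y - t) ** 2).mean()      # loss term L(y(x;W), t(x))
        cnn.zero_grad()
        L.backward()
        S = sensitivity(cnn, x)        # equation (10), evaluated locally
        with torch.no_grad():
            for w, s in zip(cnn.parameters(), S):
                # w <- w - eta*dL/dw - lam*w*H(S)   (equation 14)
                w -= eta * w.grad + lam * w * h_of_s(s)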
[0157] From this discussion it should be clear that the
regularization factor λ should have a value in the interval between
0 and 1. Since the sensitivity is a local measure, it is
advantageous that the regularization factor λ be closer to zero
than to one; in this case, any weight with a high sensitivity is
changed towards zero by only a small fraction, such that a new
sensitivity value may be computed over the new value of the weight.
If the sensitivity value becomes large after some iterations of
gradient descent, then the update due to the regularization will
stop. If the sensitivity value of a given weight never becomes
large over the course of the iterations of gradient descent, then
the value of this weight is likely to approach zero.
[0158] At this point, a check is made (block 480) to determine
whether the considered training image x is the last one.
[0159] In case the considered training image x is not the last one
(exit branch N of block 480), a next training image x is selected
(block 490), and all the previously described operations are
carried out using this training image (returning to block 410),
inputting it to a version of the CNN 100 having the updated weights
W^{p+1}.
[0160] In case the considered training image x is the last one
(exit branch Y of block 480), it means that all the training images
of the training database have been used. In this case, it is said
that an epoch of the training phase has passed.
[0161] Like the sub-procedure 210, the sub-procedure 220 described
herein may also be repeated for several epochs, such as tens of
epochs, in order to provide the entire training database to the CNN
100 several times.
[0162] Applicant has found that by minimizing the abovementioned
cost function J(W;x,t(x)) = ηL(y(x;W),t(x)) + λR(W;x), a
significant number of weights w assume values close to zero. The
absolute values of other weights remain much larger, namely those
whose sensitivity has remained large towards the end of the
training process. It should be clear that in a weighted sum of
pixels (or features), in which some weights are very small and
other weights are relatively large, the terms that have very small
weights as factors contribute little to the sum. Furthermore, if
those very small weights are substituted by zero, the resulting
weighted sum is a good approximation to the original sum obtained
with all weights unaltered.
[0163] This observation suggests a separation of the weights into
two disjoint subsets, one containing the small weights and one
containing the relatively large weights. Such a separation may be
achieved by comparing the absolute values of the weights to a
threshold.
[0164] This has led the Applicant to develop a method of pruning,
i.e., a method that sets exactly to zero all those weights whose
absolute values are below a given threshold.
[0165] FIG. 5 is a flow chart illustrating in terms of functional
blocks the main phases of the sub-procedure 230 which provides for
the abovementioned pruning according to an embodiment of the
present invention.
[0166] The first phase of the sub-procedure 230 provides for
defining (block 510) a pruning threshold TH to be used for setting
to zero those weights w that have assumed a sufficiently low
value.
[0167] Empirically it has been observed that, in the absence of a
pruning procedure, the weights W of the CNN 100 are distributed
according to a Gaussian distribution with mean zero and standard
deviation σ. In such a distribution, the probability that a weight
assumes an absolute value smaller than a given threshold TH is a
function of σ and TH. In the alternative case that some
weights have been set to zero and an update step like the one in
equation (14) and block 470 is successively applied, it may be
observed that some weights remain exactly zero while the other
weights are clustered around a positive and a negative value. In
this alternative case, the distribution of all the weights does not
have a simple parametric form. The Applicant has therefore
developed a nonparametric procedure in order to carry out the
further pruning of weights.
[0168] By observing histograms of the absolute values |w| of those
weights that are different from zero, the Applicant has found that
the distribution of absolute values of these weights is
approximately unimodal, i.e. it has one peak, and that this peak
occurs near the average absolute value of nonzero weights.
Consequently, the threshold that determines which weights will be
set to zero can advantageously be proportional to this average
absolute value. Alternatively, the threshold may be set so as to
separate the nonzero weights into two sets whose numbers of
elements have a constant proportion.
[0169] According to an exemplary embodiment of the present
invention, the pruning threshold is set to:

TH = \theta \cdot \operatorname{mean}_{w \in W,\, w \neq 0} |w|   (15)

[0170] wherein the mean value is calculated by considering all
those weights of the CNN 100 that have values different from zero,
and θ is a positive constant smaller than 1, for example θ = 1/10.
[0171] According to a second exemplary embodiment of the present
invention, the pruning threshold is set to

TH, such that P(|w| \le TH) = \theta   (16)

wherein P(|w| ≤ TH) indicates the ratio of two set sizes: one
is the number of nonzero weights whose absolute values are smaller
than or equal to TH, and the other is the number of all nonzero
weights.
[0172] The next phase (block 520) provides for setting to zero the
weights w of the CNN 100 whose absolute value is lower than the
pruning threshold TH.
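By way of illustration, blocks 510-520 may be sketched as follows (with
θ = 0.1 as in the exemplary embodiment; a commented line shows the
alternative threshold of equation (16)):

    import torch

    def prune(cnn, theta=0.1):
        # Block 510: compute the pruning threshold TH of equation (15).
        with torch.no_grad():
            nonzero = torch.cat([w[w != 0].abs().flatten()
                                 for w in cnn.parameters()])
            TH = theta * nonzero.mean()              # equation (15)
            # Alternative, equation (16):
            # TH = torch.quantile(nonzero, theta)
            # Block 520: set to zero the weights below the threshold.
            for w in cnn.parameters():
                w[w.abs() < TH] = 0.0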
[0173] Since the sensitivity function S(w;x,W) used for calculating
the gradient of the regularization term R(W;x) is a local measure,
its validity is ensured for very small variations of W. Since the
pruning process forces some weights to be zero, the pruning
threshold TH has been advantageously chosen in order to minimize
this type of perturbation due to the pruning itself.
[0174] At this point, the CNN 100 is trained for one epoch (block
530) using a training procedure equal to the one corresponding to
sub-procedure 220.
[0175] A certain decrease in performance (e.g. with respect to
classification accuracy) has to be expected as a side effect of the
pruning operation. A complete method therefore includes a stopping
criterion; for example, the process may be interrupted when the
performance reaches a predefined lower bound. Such a mechanism is
described below.
[0176] The classification accuracy may be defined with respect to a
set of images, called a validation set. The validation set does
not normally contain any of the images used for training. As for
the training set, each image in the validation set must be equipped
with a target class, such that validation image x has target
classification array t(x). The classification accuracy for the
validation set VS can now be defined as a percentage, where the
numerator is the number of images classified correctly by the CNN,
and the denominator is the number of images contained in the
validation set. The classification accuracy is clearly a function
of the weights W in the CNN.
[0177] The stopping criterion may now be defined. Suppose that a
reference performance has been fixed, for example by measuring the
classification performance A(W^F) of a previously trained CNN that
has not been pruned (this performance can reasonably be taken as an
upper bound on the performance of a pruned sparse CNN). Before
training starts, an acceptable performance loss may be set, for
example a = 0.01, corresponding to a performance loss of 1%.

[0178] After the CNN 100 has been trained for one epoch, resulting
in weights W, its performance may be compared to the reference
performance (block 540). If

A(W) < (1 - a) A(W^F)   (17)

then the training is interrupted, and the weights resulting from
the previous training epoch are retained (exit branch Y of block
540). Otherwise, the weights and the classification performance
value are stored, and a new epoch of training may be started (exit
branch N of block 540, returning back to block 510).
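The overall loop with the stopping criterion of equation (17) may then
be sketched as follows (the helper prune, and hypothetical helpers
train_one_epoch_220 and accuracy, are assumed as outlined above; A_ref
stands for the reference performance A(W^F)):

    def train_and_prune(cnn, val_set, A_ref, a=0.01):
        # Iterate pruning (blocks 510-520) and retraining (block 530)
        # until the accuracy drops below the bound of equation (17)
        # (block 540).
        previous = {k: v.clone() for k, v in cnn.state_dict().items()}
        while True:
            prune(cnn)
            train_one_epoch_220(cnn)
            A = accuracy(cnn, val_set)
            if A < (1 - a) * A_ref:          # equation (17): stop
                cnn.load_state_dict(previous)  # keep previous epoch's weights
                return cnn
            previous = {k: v.clone() for k, v in cnn.state_dict().items()}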
[0179] In an alternative embodiment of the present invention, the
sensitivity function has a different definition. Here, the average
absolute derivative is replaced by the absolute derivative only for
that element of the output classification array that corresponds to
the class of the target image. This alternative sensitivity
function may thus be written as
S(w;x,W) = \sum_{c=1}^{C} t_c(x) \left| \frac{\partial y_c(x;W)}{\partial w} \right|   (18)
since the target classification array t(x) has value 1 in the
element that corresponds to the correct class, while all other
elements are 0.
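Under the same assumptions as the earlier sensitivity sketch, the
variant of equation (18) may be computed with a single backward pass,
since only the target element of the output array is differentiated:

    import torch

    def sensitivity_target(cnn, x, t):
        # Equation (18): |d y_{c*} / dw|, where c* is the target class.
        # Since t(x) is one-hot, (t * y).sum() selects the element y_{c*}.
        y = cnn(x).flatten()
        y_target = (t * y).sum()
        grads = torch.autograd.grad(y_target, list(cnn.parameters()))
        return [g.abs() for g in grads]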
[0180] Compared to known solutions, the above described training
procedure 200 according to embodiments of the present invention
makes it possible to obtain a CNN 100 having a reduced number of
parameters (weights w) different from zero, and therefore requiring
a reduced memory occupation.
[0181] The reduced memory occupation of CNN 100 trained with the
training procedure according to the embodiments of the present
invention is particularly advantageous in different application
scenarios.
[0182] For example, a CNN having a reduced memory occupation can be
advantageously deployed on devices having a limited storage memory
capability due to constraints related to power consumption, price
and physical dimensions, such as in scenarios in which CNNs are
deployed on mobile devices provided with an application for visual
object classification.
[0183] Furthermore, a CNN having a reduced memory occupation can be
advantageously deployed on devices by means of a communication
network, such as a mobile communication network having reduced
bandwidth capabilities. For example, some scenarios require a
periodic remote deployment of complex CNNs on the device, for
example for network updating purposes (e.g., the updating of the
weights W of a CNN deployed on a processing device of a control
system of a self-driving vehicle, which requires periodic updating
to improve the automatic driving performance).
[0184] FIG. 6 illustrates in terms of very simplified functional
blocks an exemplary application scenario of the solution according
to embodiments of the present invention. The considered application
scenario provides for the training of a CNN on a server module 610
and then for the deployment of the trained CNN on a mobile device
620 running an application for object classification purposes.
[0185] The (untrained) CNN, identified in the figure with reference
100', is provided to the server module 610 for being trained. As
schematized in FIG. 6, the server module 610 comprises a training
engine 630, such as a process running on a processor, a processor,
an object, an executable, a thread of execution, a program, and/or
a computing device software application implementing the training
procedure 200 according to the embodiment of the invention. The
training engine 630 is further coupled with a training database 640
comprising a training dataset of training images belonging to
respective known classes; the training engine 630 uses this dataset
to train the CNN 100' with the training procedure 200 according to
the embodiment of the invention, obtaining a corresponding trained
CNN 100''.
[0186] The set of weights W of the trained CNN 100'' is then
transmitted to the mobile device 620 by means of a communication
network 650, such as a mobile communication network.
[0187] The deployment of the trained CNN 100'' is carried out by
storing the set of weights W into a memory unit 660 of the mobile
device 620.
[0188] At this point, the deployed trained CNN 100'' can be used by
a classification engine 670 of the mobile device 620, such as a
process running on a processor, a processor, an object, an
executable, a thread of execution, a program, and/or a computing
device software application at the mobile device 620 for
implementing classification of images x, such as images captured by
a camera module 680 of the mobile device 620, and generating a
corresponding classification array y(x;W).
[0189] The proposed training procedure 200 according to the
embodiments of the invention described above has been compared with
the method proposed by Han et al. over a number of well-known
neural network architectures and over a number of different image
datasets.
[0190] For each trained model the resulting sparsity and
corresponding memory footprint have been measured under the
assumption that every parameter (weight w) is stored as a 4-byte
single-precision floating point number according to the IEEE-754
standard. The following comparisons show that the training
procedure 200 according to the embodiments of the invention
improves the network sparsity and thus reduces the deployed model
footprint without impairing its performance.
[0191] First Experimental Result. LeNet300 (MNIST)
[0192] The training procedure 200 according to the described
embodiment of the present invention has been compared over the
MNIST dataset using the LeNet-300 architecture. The MNIST dataset
consists of 28×28 8-bit grayscale images of handwritten
digits (C = 10 classes). The database consists of 50000 training
samples and 10000 test samples.
[0193] The LeNet-300 architecture is a simple fully connected
network designed for handwritten character recognition and consists
of three fully connected layers with 300, 100 and 10 neurons. The
CNN has been trained both using the training procedure 200
according to the embodiment of the present invention in which the
sensitivity function is the one disclosed in equation 10 and using
the training procedure 200 according to the embodiment of the
invention in which the sensitivity function is the one disclosed in
equation 18. The following table shows, for each layer of the
LeNet-300 network architecture, the original number of parameters
and the percentage of remaining parameters for each considered
method (lower figures indicate sparser network topologies and so
lower memory requirements). Out of about 266 k parameters in the
original architecture, the method proposed by Han et al. is able to
reduce such figure to about 21 k parameters. The results show that
the embodiment of the invention corresponding to equation 10
achieves better sparsity; however, the embodiment of the invention
corresponding to equation 18 achieves lower overall error (about
0.1% lower in our experiments) and requires less training time (1 k
versus 2.5 k training iterations, respectively).
TABLE-US-00002 Per-layer parameters of LeNet-300, with remaining
parameters for each method expressed as [# remaining parameters / #
parameters LeNet-300 × 100]%.

Layer                      LeNet-300        Han et al.   Eq. 10 embodiment      Eq. 18 embodiment
                           [# parameters]                (η=0.1, λ=10^-5,       (η=0.1, λ=10^-4,
                                                          θ=0.1)                 θ=0.1)
fc1-300                    235K             8%           2.25%                  4.78%
fc2-100                    30K              9%           11.93%                 24.75%
fc3-10                     1K               26%          69.3%                  73.8%
Total size [# parameters]  266K             21.76K       9.55K                  19.39K
error [%]                  1.8%             1.6%         1.65%                  1.56%
[0194] Second Experimental Result. LeNet5 (MNIST)
[0195] The previous experiment has been repeated with the MNIST
database over the LeNet5 network architecture. The LeNet5
architecture is a simple convolutional network for handwritten
character recognition consisting of two convolutional layers and
two fully connected layers, for a total of about 431 k learnable
parameters.
[0196] The LeNet5 network is trained both using the training
procedure 200 according to the embodiment of the present invention
in which the sensitivity function is the one disclosed in equation
10 and using the training procedure 200 according to the embodiment
of the invention in which the sensitivity function is the one
disclosed in equation 18. The performance of the trained network is
then compared in the following table with the performance of Han et
al. The table shows trends similar to the results shown for the
LeNet-300 architecture. Out of 430 k parameters in the original
architecture, the method proposed by Han et al. is able to reduce
such figure to about 37 k parameters. The proposed training
procedure 200 in its two different embodiments is able to reduce
such figure to just 9 k and 11 k parameters for the embodiments
corresponding to equations 10 and 18, respectively. The results
show that the embodiment of the invention according to equation 10
achieves better sparsity; however, the embodiment of the invention
according to equation 18 achieves the same error.
TABLE-US-00003 Per-layer parameters of LeNet5 (MNIST), with remaining
parameters for each method expressed as [# remaining parameters / #
parameters LeNet5 × 100]%.

Layer                      LeNet5           Han et al.   Eq. 10 embodiment      Eq. 18 embodiment
                           [# parameters]                (η=0.1, λ=10^-4,       (η=0.1, λ=10^-4,
                                                          θ=0.1)                 θ=0.1)
conv1                      0.5K             66%          67.6%                  72.6%
conv2                      25K              12%          11.76%                 12.04%
fc3                        400K             8%           0.90%                  1.26%
fc4                        5K               19%          31.04%                 37.42%
Total size [# parameters]  431K             36.28K       8.43K                  10.28K
error [%]                  0.8%             0.8%         0.8%                   0.8%
[0197] Third Experimental Result. LeNet5 (Fashion-MNIST)
[0198] The previous experiment has been repeated over the
Fashion-MNIST dataset
(https://github.com/zalandoresearch/fashion-mnist). Such dataset
consists of monochrome images sized 28×28, like the original
MNIST dataset. However, the images contain items such as bags,
clothing, shoes, etc.
[0199] The goal of this further experiment is to assess whether the
proposed training procedure 200 is also able to sparsify networks
operating on denser signals. In fact, the MNIST dataset is a sparse
dataset, i.e., a large number of pixels of the input images have a
value equal to 0. It is well known that sparse signals tend to
promote network sparsity during training. Conversely, the
Fashion-MNIST database is significantly less sparse, thus achieving
a sparse network architecture is more challenging. The MNIST
dataset is rather sparse as pixels have values around either 0
(black pixel) or 255 (white pixel). Conversely, the Fashion-MNIST
dataset carries much more information at intermediate pixel
intensity values representing different shades of gray.
[0200] The following table shows trends similar to the results
shown for the LeNet-5 architecture when trained over the original
MNIST dataset. Out of 430 k parameters in the original
architecture, the method proposed by Han et al. is able to reduce
such figure to about 46 k parameters. The proposed training
procedure 200 in its two different embodiments is able to reduce
such figure to just 37 k and 61 k parameters for the embodiments
corresponding to equations 10 and 18, respectively. The results
show that the embodiment corresponding to equation 10 obtains
better sparsity; however, the embodiment corresponding to equation
18 achieves the same error. This experiment confirms that images
that are less sparse yield networks that naturally have more
meaningful parameters. Nevertheless, the proposed training
procedure 200 is still able to increase the network sparsity even
with non-sparse signals.
TABLE-US-00004 Per-layer parameters of LeNet5 (Fashion-MNIST), with
remaining parameters for each method expressed as [# remaining
parameters / # parameters LeNet5 × 100]%.

Layer                      LeNet5           Han et al.   Eq. 10 embodiment      Eq. 18 embodiment
                           [# parameters]                (η=0.1, λ=10^-5,       (η=0.1, λ=10^-5,
                                                          θ=0.1)                 θ=0.1)
conv1                      0.5K             73%          76.2%                  84.4%
conv2                      25K              30.32%       32.56%                 49.26%
fc3                        400K             8.63%        6.50%                  11.2%
fc4                        5K               56.86%       44.02%                 65.32%
Total size [# parameters]  431K             45.24K       36.72K                 60.72K
error [%]                  8.8%             8.5%         8.5%                   8.5%
[0201] Fourth Experimental Result. VGG-16 (ImageNet)
[0202] As a final experiment, the previous experiment has been
repeated with the ImageNet database over the VGG-16 network
architecture.

[0203] The ImageNet dataset consists of 224×224 24-bit color
images of 1000 different types of objects (C = 1000 classes). The
database consists of 1M training samples and 100 k test samples.
With respect to the MNIST dataset, the ImageNet dataset represents
more closely practical large-scale object recognition problems.
[0204] The VGG-16 architecture is a far more complex convolutional
network architecture for generic object recognition,
consisting of 13 convolutional layers and three fully connected
layers as shown below in the table, for a total of about 138M
learnable parameters. Concerning the number of parameters, the
VGG-16 has about 250 times more parameters than the LeNet5
architecture, and represents more closely the scale of the networks
practically deployed to solve large-scale object recognition
problems.
[0205] The VGG-16 network has been trained with the proposed
training procedure 200 both according to the embodiment
corresponding to equation 10 and according to the embodiment
corresponding to equation 18. The performance of the trained
network in both cases is then compared to that of Han et al. in the
following table. The table shows trends similar to the results
shown for the LeNet-300 architecture. Out of 138M parameters in the
original architecture, the method proposed by Han et al. is able to
reduce such figure to about 10.3M parameters. The proposed training
procedure 200 in its two different embodiments is able to reduce
such figure to about 11.3M and 9.8M parameters. Finally, the
experiment shows that the proposed training procedure 200 is able
to reduce the network error by about 1% while increasing the
network sparsity.
TABLE-US-00005 Per-layer parameters of VGG-16 (ImageNet), with
remaining parameters for each method expressed as [# remaining
parameters / # parameters VGG-16 × 100]%.

Layer                      VGG-16           Han et al.   Eq. 10 embodiment      Eq. 18 embodiment
                           [# parameters]                (η=10^-3, λ=10^-6,     (η=10^-3, λ=10^-5,
                                                          θ=0.1)                 θ=0.1)
conv1_1                    2K               58%          97.80%                 96.35%
conv1_2                    37K              22%          90.47%                 80.87%
conv2_1                    74K              34%          87.81%                 81.49%
conv2_2                    148K             36%          84.96%                 81.41%
conv3_1                    295K             53%          83.44%                 77.68%
conv3_2                    590K             24%          81.92%                 71.81%
conv3_3                    590K             42%          80.85%                 69.25%
conv4_1                    1M               32%          71.07%                 62.03%
conv4_2                    2M               27%          62.96%                 51.2%
conv4_3                    2M               34%          62.34%                 51.91%
conv5_1                    2M               35%          60.47%                 57.09%
conv5_2                    2M               29%          59.66%                 57.08%
conv5_3                    2M               36%          59.63%                 47.75%
fc6                        103M             4%           1.08%                  1.13%
fc7                        17M              4%           6.27%                  8.35%
fc8                        4M               23%          35.43%                 14.81%
Total size [# parameters]  138M             10.35M       11.34M                 9.77M
Error [Top-1% - Top-5%]    31.50%-11.32%    31.34%-10.88%  29.29%-9.8%          30.92%-10.06%
[0206] The previous description presents and discusses in detail
several embodiments of the present invention; nevertheless, several
changes to the described embodiments, as well as different
embodiments of the invention, are possible without departing from
the scope defined by the appended claims.
* * * * *