U.S. patent application number 17/250928 was filed with the patent office on 2019-11-18 and published on 2021-11-11 for a system and method for automated precision configuration for deep neural networks.
This patent application is currently assigned to Deeplite Inc. The applicant listed for this patent is Deeplite Inc. The invention is credited to MohammadHossein ASKARIHEMMAT, Olivier MASTROPIETRO, Ehsan SABOORI, and Davis Mangan SAWYER.
Application Number: 17/250928
Publication Number: 20210350233
Family ID: 1000005739536
Publication Date: 2021-11-11

United States Patent Application 20210350233
Kind Code: A1
SABOORI; Ehsan; et al.
November 11, 2021
System and Method for Automated Precision Configuration for Deep Neural Networks
Abstract
There is provided a system and method of automated precision
configuration for deep neural networks. The method includes
obtaining an input model and one or more constraints associated
with an application and/or target device or process used in the
application configured to utilize a deep neural network; learning
an optimal low-precision configuration of the architecture using the
constraints, a training data set, and a validation data set;
and deploying the optimal configuration on the target device or
process for use in the application.
Inventors: SABOORI; Ehsan (Montreal, CA); SAWYER; Davis Mangan (Montreal, CA); ASKARIHEMMAT; MohammadHossein (Montreal, CA); MASTROPIETRO; Olivier (Montreal, CA)
Applicant: Deeplite Inc., Montreal, CA
Assignee: Deeplite Inc., Montreal, QC
Family ID: 1000005739536
Appl. No.: 17/250928
Filed: November 18, 2019
PCT Filed: November 18, 2019
PCT No.: PCT/CA2019/051643
371 Date: March 29, 2021
Related U.S. Patent Documents: U.S. Provisional Application No. 62/769,403, filed Nov. 19, 2018
Current U.S. Class: 1/1
Current CPC Class: G06N 3/08 (20130101); G06N 3/0454 (20130101); G06K 9/6231 (20130101)
International Class: G06N 3/08 (20060101); G06N 3/04 (20060101); G06K 9/62 (20060101)
Claims
1. A method of automated precision configuration for deep neural
networks, the method comprising: obtaining an input model and one
or more constraints associated with an application and/or target
device or process used in the application configured to utilize a
deep neural network; learning an optimal low-precision
configuration of the optimal architecture using the input model,
constraints, a training data set, and a validation data set; and
deploying the optimal configuration on the target device or process
for use in the application.
2. The method of claim 1, wherein the optimal configuration is
learned using a policy to generate an optimized model from the
input model.
3. The method of claim 2, wherein the optimal low-precision
configuration of the optimal architecture is learned using the
policy to generate a quantized network, the method further
comprising: fine tuning the quantized network with a knowledge
distillation process; evaluating the fine-tuned network; applying a
reward function; and iterating for at least one additional
quantized network and selecting the optimal low-precision
configuration.
4. The method of claim 3, wherein selecting the optimal
low-precision configuration comprises selecting a precision
configuration that achieves the best reward as determined by the
reward function, for the constraints on the target device or
process.
5. The method of claim 1, wherein learning the optimal
low-precision configuration comprises exploiting low precision
weights using reinforcement learning to learn the optimal
low-precision configuration across the deep neural network.
6. The method of claim 5, wherein each layer comprises a different
precision.
7. The method of claim 1, wherein the constraints comprise at least
one of: accuracy, power, cost, supported precision, speed.
8. The method of claim 7, wherein a computation constraint
comprises a bit budget.
9. The method of claim 1, wherein the application is an artificial
intelligence-based application.
10. A non-transitory computer readable medium comprising computer
executable instructions for automated design space exploration for
deep neural networks, the computer executable instructions
comprising instructions for: obtaining an input model and one or
more constraints associated with an application and/or target
device or process used in the application configured to utilize a
deep neural network; learning an optimal low-precision
configuration of the optimal architecture using the input model,
constraints, a training data set, and a validation data set; and
deploying the optimal configuration on the target device or process
for use in the application.
11. A deep neural network optimization engine configured to perform
automated design space exploration for deep neural networks, the
engine comprising a processor and memory, the memory comprising
computer executable instructions for: obtaining an input model and
one or more constraints associated with an application and/or
target device or process used in the application configured to
utilize a deep neural network; learning an optimal low-precision
configuration of the optimal architecture using the input model,
constraints, a training data set, and a validation data set; and
deploying the optimal configuration on the target device or process
for use in the application.
12. The engine of claim 11, wherein the optimal configuration is
learned using a policy to generate an optimized model from the
input model.
13. The engine of claim 12, wherein the optimal low-precision
configuration of the optimal architecture is learned using the
policy to generate a quantized network, further comprising
instructions for: fine tuning the quantized network with a
knowledge distillation process; evaluating the fine-tuned network;
applying a reward function; and iterating for at least one
additional quantized network and selecting the optimal
low-precision configuration.
14. The engine of claim 13, wherein selecting the optimal
low-precision configuration comprises selecting a precision
configuration that achieves the best reward as determined by the
reward function, for the constraints on the target device or
process.
15. The engine of claim 11, wherein learning the optimal
low-precision configuration comprises exploiting low precision
weights using reinforcement learning to learn the optimal
low-precision configuration across the deep neural network.
16. The engine of claim 15, wherein each layer comprises a
different precision.
17. The engine of claim 11, wherein the constraints comprise at
least one of: accuracy, power, cost, supported precision,
speed.
18. The engine of claim 17, wherein a computation constraint
comprises a bit budget.
19. The engine of claim 11, wherein the application is an
artificial intelligence-based application.
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
[0001] This application claims priority to U.S. Provisional Patent
Application No. 62/769,403 filed on Nov. 19, 2018, the contents of
which are incorporated herein by reference.
TECHNICAL FIELD
[0002] The following relates to systems and methods for automated
precision configuration for deep neural networks, for example by
enabling low bit-precision weights and activations to be used
effectively.
BACKGROUND
[0003] In modern intelligent applications and devices, deep neural
networks (DNNs) have become ubiquitous when solving complex
computer tasks, such as recognizing objects in images and
translating natural language. The success of these networks has
been largely dependent on high performance computing machinery,
such as Graphics Processing Units (GPUs) and server-class Central
Processing Units (CPUs). Consequently, the adoption of DNNs to
solve real-world problems is typically limited to scenarios where
such computing is available. Recently, many new computer processors
specifically designed for artificial intelligence (AI) applications
have emerged. These dedicated processors, such as Field
Programmable Gate Arrays (FPGAs), Application Specific Integrated
Circuits (ASICs) and analog computers offer the promise of more
efficient and accessible AI products and services. However,
designing DNN models optimized for these new processors remains a
significant challenge for AI engineers and application developers.
Significant domain expertise and trial-and-error is often required
to create an optimized DNN for a specialized hardware. One of the
main challenges is how to enable a precision configuration for a
given DNN architecture that maintains accuracy and optimizes for
memory, energy and latency performance on a given hardware
architecture. The task of quantizing individual layers of a DNN,
which can contain dozens of layers, often results in sub optimal
performance in a real-world environment. Thus, there is significant
interest in automating the task of enabling a precision
configuration for an entire DNN architecture that considers the
properties of the hardware architecture to optimize memory, energy
and latency as well as maintain a desired level of accuracy on the
given dataset.
[0004] To address these problems, there has been a widespread push
in academia and industry to make deep learning models more
efficient by considering the properties of the hardware
architecture in the model optimization process. Many techniques
have been proposed for manual quantization of DNNs that show lower
bit precision models are feasible for accurate inferencing on new
input data.
[0005] Prior solutions include a variety of core quantization
techniques for various DNN model architectures, as well as having
efficient kernels for computation in reduced precision like ARM
CMSIS, Intel MKL-DNN and Nvidia TensorRT. The main approach to
model quantization is by uniform precision reduction across all
layers of a DNN, for example from 32 bit Floating Point to 16 bit,
or to 8 bit INT. It has been observed that once a model is trained,
a lower bit precision is acceptable for the weights and activations
of a DNN model to correctly compute the inference label for a given
input. For this reason, many developers and hardware providers are
developing in-house or add-on quantization methods that can naively
convert the weights and activations of a DNN model to a supported
precision for the target hardware (HW). However, when this process
is applied and the model is attempted to run on a different HW, the
result can often be slower, or the model may be incompatible with
the new HW. Additionally, these uniform quantization approaches are
often found to sacrifice too much accuracy or limit network
performance on complex and large data sets.
[0006] At present, two fundamental challenges exist with current
quantization techniques, namely: 1) that hand-crafted features and
domain expertise is required for automated quantization 2) that
time-consuming fine-tuning is often necessary to maintain
accuracy.
[0007] There exists a need for scalable, automated processes for
model quantization on diverse DNN architectures and hardware
back-ends. Generally, it is found that the current capacity for
model quantization is outpaced by the rapid development of new DNNs
and disparate hardware platforms that aim to increase the
applicability and efficiency of deep learning workloads.
[0008] It is an object of the following to address at least one of
the above-mentioned challenges.
SUMMARY
[0009] It is recognized that a general approach that is agnostic to
both the architecture and target hardware(s) is needed to optimize
DNNs, making them faster, smaller and energy-efficient for use in
daily life. The following relates to deep learning algorithms, for
example, deep neural networks. A method for automated precision
configuration, specifically quantization of DNN weights and
activations, is described. The following relates to the design of a
learning process to leverage trade-offs in different deep neural
network precision configurations using computation constraints and
hardware properties as inputs. The learning process trains an
optimizer agent to adapt large, full precision networks into
smaller networks of similar performance that satisfy target
constraints in a platform-aware way. By design, the learning
process and agent are agnostic to both the network architecture and
the target hardware platform.
[0010] In one aspect, there is provided a method of automated
precision configuration for deep neural networks, the method
comprising: obtaining an input model and one or more constraints
associated with an application and/or target device or process used
in the application configured to utilize a deep neural network;
learning an optimal low-precision configuration of the optimal
architecture using the input model, constraints, a training data
set, and a validation data set; and deploying the optimal
configuration on the target device or process for use in the
application.
[0011] In another aspect, there is provided a computer readable
medium comprising computer executable instructions for automated
design space exploration for deep neural networks, the computer
executable instructions comprising instructions for performing the
above method.
[0012] In yet another aspect, there is provided a deep neural
network optimization engine configured to perform automated
precision configuration for deep neural networks, the engine
comprising a processor and memory, the memory comprising computer
executable instructions for performing the above method.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] One or more embodiments will now be described with reference
to the appended drawings wherein:
[0014] FIG. 1 is a schematic diagram of a system for optimizing a
DNN for use in a target device or process used in an artificial
intelligence (AI) application;
[0015] FIG. 2 is a block diagram of an example of a DNN
optimization engine;
[0016] FIG. 3 is a graph comparing energy consumption and
computation costs for various example network designs;
[0017] FIG. 4 is a flow chart illustrating a process for optimizing
an input DNN for deployment on a target device or process; and
[0018] FIG. 5 is a flow chart illustrating operations performed in
learning an optimal low precision configuration.
DETAILED DESCRIPTION
[0019] AI should be accessible and beneficial to various
applications in everyday life. With the emergence of deep learning
on embedded and mobile devices, DNN application designers are faced
with stringent power, memory and cost requirements, which often
lead to inefficient solutions and may prevent people from
moving to these devices. The system described below can be used to
make deep learning applicable, affordable and scalable by bridging
the gap between DNNs and hardware back-ends. To do so, a scalable,
DNN-agnostic engine is provided, which can enable a platform-aware
optimization. The engine targets information inefficiency in the
implementation of DNNs, making them applicable for low-end devices.
To provide such functionality, the engine:
[0020] is configured to be architecture independent, allowing the engine to support different DNN architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), etc.;
[0021] is configured to be framework agnostic, enabling developers to readily apply the engine to a project without additional engineering overhead; and
[0022] is configured to be hardware agnostic, helping end-users to readily change the back-end hardware or port a model from one hardware to another.
[0023] One of the core challenges with model optimization for DNN
inference is evaluating which precision configuration is
best-suited for a given application. The engine described herein
uses an AI-driven optimizer to overcome the drawbacks of manual
model quantization. Based on computation constraints, i.e. a "bit
budget", a software agent selectively changes the bit precision of
different layers in the model. Information inefficiencies and novel
supported bit-precisions for AI hardware are leveraged to
effectively quantize the layers of a network in a platform-aware
way.
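For illustration, a computation constraint of this kind can be pictured as a total "bit budget" over all layer weights, together with a check that a candidate per-layer precision assignment respects it. The sketch below is hypothetical (the names within_bit_budget, layer_bits, etc. are not from this description) and only illustrates the idea of a bit budget, not the engine's actual constraint handling.

```python
# Illustrative sketch (not the patented implementation): a "bit budget"
# expressed as the total number of weight bits allowed across all layers,
# plus a check that a per-layer precision assignment stays within it.
def within_bit_budget(layer_params, layer_bits, bit_budget):
    """layer_params: weight count per layer; layer_bits: chosen bit-width
    per layer; bit_budget: maximum total weight bits allowed."""
    total_bits = sum(n * b for n, b in zip(layer_params, layer_bits))
    return total_bits <= bit_budget

# Example: a 3-layer model with a candidate mixed-precision assignment
# and a budget of half the bits of a uniform INT8 model.
layer_params = [25_000, 1_200_000, 400_000]
layer_bits = [8, 4, 8]
budget = 8 * sum(layer_params) // 2
print(within_bit_budget(layer_params, layer_bits, budget))
```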
[0024] Turning now to the figures, FIG. 1 illustrates a DNN
optimization engine 10 which is configured, as described below, to
take an initial DNN 12 and generate or otherwise determine an
optimized DNN 14 to be used by or deployed upon a target device or
process 16, the "target 16" for brevity. The target 16 is used in
or purposed for an AI application 18 that uses the optimized DNN
14. The AI application 18 has one or more application constraints
19 that dictate how the optimized DNN 14 is generated or
chosen.
[0025] FIG. 2 illustrates an example of an architecture for the DNN
optimization engine 10. The engine 10 in this example configuration
includes a model converter 22 which can interface with a number of
frameworks 20, an intermediate representation model 24, a design
space exploration module 26, a quantizer 28, and mapping algorithms
30 that can include algorithms for both heterogeneous hardware 32
and homogeneous hardware 34. The engine 10 also interfaces with
a target hardware (HW) platform 16. The design space exploration
module 26, quantizer 28, and mapping algorithms 30 adopt, apply,
consider, or otherwise take into account the constraints 19. In
this example, the constraints include accuracy, power, cost,
supported precision, speed, among others that are possible as shown
in dashed lines. FIG. 2 illustrates a framework with maximum re-use
in mind, so that new AI frameworks 20, new DNN architectures and
new hardware architectures can be easily added to a platform
utilizing the engine 10. The engine 10 addresses inference
optimization of DNNs by leveraging state-of-the-art algorithms and
methodologies to make DNNs applicable for any device 16. This
provides an end-to-end framework to optimize DNNs from different
deep learning framework front-ends down to low-level machine code
for multiple hardware back-ends.
[0026] For the model converter 22, the engine 10 is configured to
support multiple frameworks 20 (e.g. TensorFlow, Pytorch, etc.) and
DNN architectures (e.g. CNN, RNN, etc.), to facilitate applying the
engine's capabilities on different projects with different AI
frameworks 20. To do so, two layers are included, namely: a) the
model converter 22, which contains each AI framework's
specifications and DNN parsers to produce the intermediate
representation model (IRM) 24 from the original model; and b) the
IRM 24 which represents all DNN models in a standard format.
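As a rough, hedged analogy only (the patent's IRM 24 is its own representation and is not ONNX), the idea of converting a framework-specific model into a standard format can be sketched with the PyTorch ONNX exporter:

```python
# Sketch: export a framework-specific model to a framework-neutral graph
# format, loosely analogous to the model converter 22 producing the IRM 24.
# Assumes PyTorch and torchvision are installed; the IRM itself is not ONNX.
import torch
import torchvision.models as models

model = models.resnet18().eval()               # any framework-specific model
dummy_input = torch.randn(1, 3, 224, 224)      # example input shape
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=11)
```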
[0027] The engine 10 also provides content aware optimization, by
providing a two-level intermediate layer composed of: a) the design
space exploration module 26, which is an intermediate layer for
finding a smaller architecture with similar performance as the
given model to reduce memory footprint and computation (described
in greater detail below); and b) the quantizer 28, which is a
low-level layer for quantizing the network to gain further
computation speedup.
[0028] Regarding the design space exploration module 26, DNNs are
heavily dependent on the design of hyper-parameters like the number
of hidden layers, nodes per layer and activation functions, which
have traditionally been optimized manually. Moreover, hardware
constraints 19 such as memory and power should be considered to
optimize the model effectively. Given that design spaces can easily
exceed thousands of candidate solutions, it can be intractable to
find a near-optimal solution manually.
[0029] Quantizing DNNs has the potential to decrease complexity and
memory footprint and to facilitate deployment on edge devices.
However, precision is typically considered at the design level of an
entire model, making it difficult to treat as a tunable
hyper-parameter. Moreover, exploring efficient precision requires
tight integration between network design, training and
implementation, which is not always feasible. Typical
implementations of low precision DNNs use uniform precision across
all layers of the network, whereas mixed precision can lead to better
performance. The engine 10 described herein exploits low precision
weights using reinforcement learning to learn an optimal precision
configuration across the neural network where each layer may have
different precision to get the best out of the target platform 16.
Besides mixed-precision, the engine 10 also supports uniform
precision, fixed-point, dynamic fixed-point and binary/ternary
networks.
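A minimal sketch of the kind of low-precision weight representation discussed above is given below: symmetric uniform quantization of a weight tensor to a chosen bit-width, where different layers may be assigned different widths. It is illustrative only and is not the engine's quantizer 28.

```python
# Sketch: symmetric uniform quantization of a float weight array to n bits.
# Different layers can receive different bit-widths (mixed precision).
import numpy as np

def quantize_weights(w, n_bits):
    """Return signed n-bit integer codes and the de-quantized approximation."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
for bits in (8, 4, 2):  # candidate per-layer precisions
    _, w_hat = quantize_weights(w, bits)
    print(bits, "bits, mean abs error:", float(np.mean(np.abs(w - w_hat))))
```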
[0030] It is also recognized that a major challenge lies in
enabling support for multiple hardware back-ends while keeping
compute, memory and energy footprints at their lowest. Content
aware optimization alone is not considered to be enough to solve
the challenge of supporting different hardware back-ends, because
primitive operations like convolution or matrix multiplication may
be mapped and optimized in very different ways for each hardware
back-end. These hardware-specific optimizations can vary drastically
in terms of memory layout, parallelization and threading patterns,
caching access patterns and choice of hardware primitives.
[0031] The platform aware optimization layer that includes the
mapping algorithms 30 is configured to address this challenge. This
layer contains standard transformation primitives commonly found in
commodity hardware such as CPUs, GPUs, FPGAs, etc. This additional
layer provides a toolset to optimize DNNs for FPGAs and
automatically map them onto FPGAs for model inference. This
automated toolset can save design time significantly. Importantly,
many homogeneous and heterogeneous multicore architectures have
recently been introduced to continually improve system performance.
Compared to homogeneous multicore systems, heterogeneous ones offer
more computation power and more efficient energy consumption because
they utilize specialized cores for specific functions, and each
computational unit provides distinct resource efficiencies when
executing different inference phases of deep models (e.g. a binary
network on an FPGA, the full precision part on a GPU/DSP, regular
arithmetic operations on a CPU, etc.). The engine 10 provides
optimization primitives targeted at heterogeneous hardware 32, by
automatically splitting the DNN's computation across different
hardware cores to maximize energy efficiency and minimize execution
time on the target hardware 16.
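The mapping idea can be pictured, in a purely hypothetical sketch, as assigning each layer to the back-end best suited to its precision; the rule below is invented for illustration and does not reflect the actual mapping algorithms 30.

```python
# Hypothetical sketch: assign each layer to a hardware back-end based on
# its chosen bit-width, to illustrate splitting computation across
# heterogeneous cores. The mapping rule is illustrative only.
def assign_backend(layer_bits):
    if layer_bits <= 2:
        return "FPGA"   # binary/ternary layers
    if layer_bits <= 8:
        return "CPU"    # regular low-precision integer arithmetic
    return "GPU/DSP"    # full or half precision layers

layer_plan = {"conv1": 32, "conv2": 8, "conv3": 2, "fc": 8}
print({name: assign_backend(bits) for name, bits in layer_plan.items()})
```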
[0032] Using platform aware optimization techniques in combination
with content aware optimization techniques achieves significant
performance cost reduction across different hardware platforms
while delivering the same inference accuracy as state-of-the-art
deep learning approaches.
[0033] For example, assume an application that needs to run a CNN
on low-end hardware with 60 MB of memory. The model size is 450 MB
and it must meet a 10 ms critical response time for each inference
operation. The model is 95% accurate; however, 90% accuracy is also
acceptable. CNN designers usually use GPUs to train and run their
models, but they would now need to deal with memory and computation
power limitations, a new hardware architecture, and satisfying all
constraints (such as memory and accuracy) at the same time. Finding
a solution for the target hardware manually may be infeasible or may
require tremendous engineering effort. In contrast, using the engine
10 and specifying the constraints 19, a user can effectively produce
the optimized model by finding a feasible solution, reducing time to
market and engineering effort, as illustrated in the chart shown in
FIG. 3.
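To make the example concrete, the constraints 19 could be written down as a small configuration, together with a rough estimate of model size under a candidate average bit-width; the field names and arithmetic below are illustrative, not the engine's API.

```python
# Sketch: the example's constraints as a configuration plus a rough model
# size estimate. Field names and the estimate are illustrative only.
constraints = {
    "memory_mb": 60,        # target hardware memory
    "latency_ms": 10,       # critical response time per inference
    "min_accuracy": 0.90,   # 95% original accuracy, 90% acceptable
}

def estimated_size_mb(num_params, avg_bits):
    return num_params * avg_bits / 8 / 1e6

num_params = 450e6 / 4      # a 450 MB model stored as 32-bit floats
for avg_bits in (32, 8, 2):
    size = estimated_size_mb(num_params, avg_bits)
    fits = size <= constraints["memory_mb"]
    print(f"{avg_bits} bits -> {size:.1f} MB, fits memory: {fits}")
```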
[0034] Referring now to FIG. 4, the engine 10 provides a quantizer
28 which formulates the quantization problem as a multi-objective
design space exploration 42 for DNNs with respect to the supported
precisions of the target hardware 16. A reinforcement learning-based
agent 50 (see also FIG. 5) exploits low precision weights by
learning an optimal precision configuration across the neural
network, where the precision assigned to each layer may differ
(mixed precision), to get the best out of the target platform 16
when the model is then deployed on the target platform 16 at step
46.
[0035] The engine 10 provides for automated optimization of deep
learning algorithms. The engine 10 also employs an efficient
process for design space exploration 26 of DNNs that can satisfy
target computation constraints 19 such as speed, model size,
accuracy, power consumption, etc. There is provided a learning
process for training optimizer agents that automatically explore
design trade-offs starting with large, initial DNNs to produce
compact DNN designs in a data-driven way. Once an engineer has
trained an initial deep neural network on a training data set to
achieve a target accuracy for a task, they would then need to
satisfy other constraints for the real-world production environment
and computing hardware. The proposed process makes this possible by
automatically producing an optimized DNN model suitable for the
production environment and hardware 16. Referring to FIG. 5, the
agent 50 receives as inputs an initial DNN or teacher model 40,
training data set 52 and target constraints 19. This can be done
using the existing deep learning frameworks, without the need to
introduce a new framework and the associated engineering overhead.
The agent 50 then generates a new precision configuration from the
initial DNN based on target constraints 19. The agent 50 receives a
reward based on the performance of the adapted model measured on
the training data set 52, guiding the process towards a feasible
design. The learning process can converge on a feasible precision
configuration using minimal computing resources, time and human
expert interaction. This process overcomes the disadvantages of
manual optimization, which is often limited to certain DNN
architectures, applications, hardware platforms and requires domain
expertise. The process is a universal method to leverage trade-offs
in different DNN precision configurations and to ensure that target
computation constraints are met. Furthermore, the process benefits
end-users with multiple DNNs in production, each requiring updates
and re-training at various intervals by providing a fast,
lightweight and flexible method for designing new and compact DNNs.
This approach advances current approaches by enabling
resource-efficient DNNs that economize data centers, are available
for use on low-end, affordable hardware and are accessible to a
wider audience aiming to use deep learning algorithms in daily
environments.
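One way to picture the reward described above is as a score that favors accuracy and penalizes configurations violating the target constraints; the exact reward used by the agent 50 is not specified here, so the weighting below is a hedged, illustrative sketch.

```python
# Sketch: a reward that favors accurate models and penalizes exceeding the
# bit budget. The actual reward used by the agent 50 is not given here.
def reward(accuracy, total_bits, bit_budget, violation_penalty=1.0):
    overshoot = max(0.0, (total_bits - bit_budget) / bit_budget)
    return accuracy - violation_penalty * overshoot

print(reward(accuracy=0.91, total_bits=9e8, bit_budget=1e9))    # within budget
print(reward(accuracy=0.94, total_bits=1.5e9, bit_budget=1e9))  # penalized
```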
[0036] In step 42, shown in FIG. 5, a policy 53 exploits low
precision weights by learning an optimal precision configuration
across the neural network where the precision assigned to each
layer may be different. The precisions supported by the target
hardware (e.g. INT8, INT16, FP16, etc.) and the bit budget need to
be defined as constraints 19 for this step 42. As shown in FIG. 5,
the agent 50 observes a state that is generated through applying
steps 58-64. The reinforcement learning policy repeatedly generates a set
of precision configurations, with respect to supported precisions
and bit-budget, to create new networks by altering layers'
precisions. This step 42 produces a quantized network at step 58
that is fine-tuned via knowledge distillation at step 60 on the
training data set 52 and subsequently evaluated at step 62 for
accuracy on the validation data set 54. The agent 50 then updates
the policy 53 based on the reward achieved by the new architecture.
Over a series of iterations, the agent 50 will select the precision
configuration that achieves the best reward determined by the
reward function 64, for the given constraints 19 on the target
computing hardware platform 16. Once this model has been selected,
the user can deploy the optimized model in production on their
specified hardware(s).
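The iteration over steps 58-64 can be sketched at a high level as follows. The policy object and every helper passed in (quantize, finetune, evaluate, reward_fn) stand in for components of the engine (policy 53, quantizer 28, knowledge distillation step 60, evaluation step 62, reward function 64); they are placeholders, not existing APIs.

```python
# High-level sketch of the loop in FIG. 5 (steps 58-64). The callables
# passed in are placeholders for engine components, not real library calls.
def search_precision_config(policy, quantize, finetune, evaluate, reward_fn,
                            teacher, train_set, val_set, constraints,
                            iterations=100):
    best_config, best_reward = None, float("-inf")
    for _ in range(iterations):
        config = policy.propose(constraints)              # per-layer precisions
        student = quantize(teacher, config)               # step 58
        student = finetune(student, teacher, train_set)   # step 60 (distillation)
        accuracy = evaluate(student, val_set)             # step 62
        r = reward_fn(accuracy, config, constraints)      # step 64
        policy.update(config, r)                          # update policy 53
        if r > best_reward:
            best_config, best_reward = config, r
    return best_config                                    # model to deploy
```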
[0037] To reuse weights, the engine 10 leverages the class of
function-preserving transformations, which initialize the new
network to represent the same function as the given network but with
a different parameterization that can be further trained to improve
performance. Knowledge distillation at step 60 is employed as a
component of the training process to accelerate the training of the
student network, especially for large networks.
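A minimal sketch of a standard knowledge distillation loss (soft targets from the teacher combined with the ordinary task loss) is shown below, assuming PyTorch; the temperature and mixing weight are illustrative choices, and the fine-tuning procedure of step 60 is not otherwise detailed here.

```python
# Sketch: a standard knowledge distillation loss of the kind used to
# fine-tune a quantized student against a full-precision teacher (step 60).
# Temperature T and mixing weight alpha are illustrative choices.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```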
[0038] The transformation actions may lead to defective networks
(e.g. an unrealistic kernel size, number of filters, etc.). It is
not worthwhile to train these networks, as they cannot learn
properly. To improve the training process, an apparatus is employed
to detect these defective networks early and cut off the learning
process by assigning them a negative reward.
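A small, hypothetical sketch of such a check is shown below: obviously invalid configurations are rejected with a fixed negative reward before any training; the specific checks are invented for illustration.

```python
# Hypothetical sketch: reject defective configurations before training and
# give them a negative reward so the agent learns to avoid them.
def reward_or_penalty(config, evaluate_fn, penalty=-1.0):
    if any(bits < 1 for bits in config.get("layer_bits", [])):
        return penalty          # unrealistic precision assignment
    if config.get("kernel_size", 3) < 1 or config.get("num_filters", 1) < 1:
        return penalty          # unrealistic layer shape
    return evaluate_fn(config)  # otherwise train and evaluate normally
```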
[0039] For simplicity and clarity of illustration, where considered
appropriate, reference numerals may be repeated among the figures
to indicate corresponding or analogous elements. In addition,
numerous specific details are set forth in order to provide a
thorough understanding of the examples described herein. However,
it will be understood by those of ordinary skill in the art that
the examples described herein may be practiced without these
specific details. In other instances, well-known methods,
procedures and components have not been described in detail so as
not to obscure the examples described herein. Also, the description
is not to be considered as limiting the scope of the examples
described herein.
[0040] It will be appreciated that the examples and corresponding
diagrams used herein are for illustrative purposes only. Different
configurations and terminology can be used without departing from
the principles expressed herein. For instance, components and
modules can be added, deleted, modified, or arranged with differing
connections without departing from these principles.
[0041] It will also be appreciated that any module or component
exemplified herein that executes instructions may include or
otherwise have access to computer readable media such as storage
media, computer storage media, or data storage devices (removable
and/or non-removable) such as, for example, magnetic disks, optical
disks, or tape. Computer storage media may include volatile and
non-volatile, removable and non-removable media implemented in any
method or technology for storage of information, such as computer
readable instructions, data structures, program modules, or other
data. Examples of computer storage media include RAM, ROM, EEPROM,
flash memory or other memory technology, CD-ROM, digital versatile
disks (DVD) or other optical storage, magnetic cassettes, magnetic
tape, magnetic disk storage or other magnetic storage devices, or
any other medium which can be used to store the desired information
and which can be accessed by an application, module, or both. Any
such computer storage media may be part of the engine 10, any
component of or related to the engine, etc., or accessible or
connectable thereto. Any application or module herein described may
be implemented using computer readable/executable instructions that
may be stored or otherwise held by such computer readable
media.
[0042] The steps or operations in the flow charts and diagrams
described herein are just for example. There may be many variations
to these steps or operations without departing from the principles
discussed above. For instance, the steps may be performed in a
differing order, or steps may be added, deleted, or modified.
[0043] Although the above principles have been described with
reference to certain specific examples, various modifications
thereof will be apparent to those skilled in the art as outlined in
the appended claims.
* * * * *