U.S. patent application number 16/673901 was filed with the patent office on 2019-11-04 and published on 2021-05-06 as publication number 20210132552 for a method and system for directly tuning PID parameters using a simplified actor-critic approach to reinforcement learning.
The applicant listed for this patent is Honeywell International Inc. The invention is credited to Bhushan Gopaluni, Nathan Lawrence, Philip D. Loewen, Gregory E. Stewart.
Publication Number: 20210132552
Application Number: 16/673901
Family ID: 1000004547056
Publication Date: 2021-05-06
United States Patent Application: 20210132552
Kind Code: A1
Lawrence; Nathan; et al.
May 6, 2021
METHOD AND SYSTEM FOR DIRECTLY TUNING PID PARAMETERS USING A
SIMPLIFIED ACTOR-CRITIC APPROACH TO REINFORCEMENT LEARNING
Abstract
A method and system for reinforcement learning can include an
actor-critic framework comprising an actor and a critic, the actor
comprising an actor network and the critic comprising a critic
network; and a controller comprising a neural network embedded in
the actor-critic framework and which can be tuned according to
reinforcement learning based tuning including anti-windup
tuning.
Inventors: Lawrence; Nathan (North Vancouver, CA); Loewen; Philip D. (North Vancouver, CA); Gopaluni; Bhushan (Vancouver, CA); Stewart; Gregory E. (North Vancouver, CA)

Applicant: Honeywell International Inc. (Morris Plains, NJ, US)
Family ID: 1000004547056
Appl. No.: 16/673901
Filed: November 4, 2019
Current U.S. Class: 1/1
Current CPC Class: G05B 6/02 (20130101); G05B 13/027 (20130101)
International Class: G05B 6/02 (20060101) G05B 006/02; G05B 13/02 (20060101) G05B 013/02
Claims
1. A system for reinforcement learning, comprising: an actor-critic
framework comprising an actor and a critic, the actor comprising an
actor network and the critic comprising a critic network; and a
controller comprising a neural network embedded in the actor-critic
framework and which is tuned according to reinforcement learning
based tuning including anti-windup tuning.
2. The system of claim 1 wherein the controller comprises
parameters that include an anti-windup parameter.
3. The system of claim 1 wherein the controller allows for
constraining of individual parameters.
4. The system of claim 1 wherein the actor network is initialized
with gains, which are already in use or known to be
stabilizing.
5. The system of claim 1 wherein the controller comprises a PID
(Proportional Integral Derivative) controller.
6. The system of claim 5 wherein weights associated with the actor
are initialized with selected PID gains.
7. The system of claim 5 wherein the PID controller comprises a
PD (Proportional-Derivative) portion.
8. The system of claim 5 wherein the PID controller comprises an
integral portion.
9. The system of claim 5 wherein the PID controller comprises a PD
(Proportional-Derivative) portion and an integral portion.
10. A system for reinforcement learning, comprising: at least one
processor; and a non-transitory computer-usable medium embodying
computer program code, said computer-usable medium capable of
communicating with said at least one processor, said computer
program code comprising instructions executable by said at least
one processor and configured for: providing an actor-critic
framework comprising an actor and a critic, the actor comprising an
actor network and the critic comprising a critic network; and
tuning a controller comprising a neural network embedded in the
actor-critic framework, wherein the tuning of the controller
comprises reinforcement learning based tuning including anti-windup
tuning.
11. The system of claim 10 wherein the controller comprises
parameters that include an anti-windup parameter.
12. The system of claim 10 wherein the controller allows for
constraining of individual parameters.
13. The system of claim 10 wherein the instructions are further
configured for initializing the actor network with gains, which are
already in use or known to be stabilizing.
14. The system of claim 10 wherein the controller comprises a PID
(Proportional Integral Derivative) controller.
15. The system of claim 14 wherein the instructions are further
configured for initializing weights associated with the actor with
selected PID gains.
16. A method for reinforcement learning, comprising: providing an
actor-critic framework comprising an actor and a critic, the actor
comprising an actor network and the critic comprising a critic
network; and tuning a controller comprising a neural network
embedded in the actor-critic framework, wherein the tuning of the
controller comprises reinforcement learning based tuning including
anti-windup tuning.
17. The method of claim 16 wherein the controller comprises
parameters that include an anti-windup parameter.
18. The method of claim 16 wherein the controller allows for
constraining of individual parameters.
19. The method of claim 16 further comprising initializing the
actor network with gains that are already in use or known to be
stabilizing.
20. The method of claim 16 further comprising initializing weights
associated with the actor with selected PID gains.
Description
TECHNICAL FIELD
[0001] Embodiments are generally related to the field of machine
learning including Deep Reinforcement Learning (DRL). Embodiments
also relate to neural networks and Proportional Integral Derivative
(PID) control. Embodiments further relate to the direct tuning of
PID parameters using an actor-critic framework.
BACKGROUND
[0002] Model-based control methods such as Model Predictive Control
(MPC) or Proportional Integral Derivative (PID) control rely on the
accuracy of the available plant model. However, gradual changes in
the plant result in decreased performance of the controllers. Model
reidentification is costly and time-consuming, often making this
procedure impractical. As such, controllers will often be tuned for
robustness over performance to ensure they are still operational
under model uncertainty.
[0003] Reinforcement Learning (RL) is a branch of machine learning
in which the objective is to learn an optimal policy through
interactions with a stochastic environment modeled as a Markov
Decision Process (MDP). Only somewhat recently has RL been
successfully applied in the process industry. The first successful
implementations of RL methods in process control utilized
approximate dynamic programming (ADP) methods for optimal control
of discrete-time nonlinear systems. While these results illustrate
the applicability of RL in controlling discrete-time nonlinear
processes, they are also limited to processes for which at least a
partial model is available or can be derived through system
identification.
[0004] Recently, several data-based approaches have been proposed
to address the limitations of model-based RL in control. For
example, a data-based learning algorithm has been proposed to
derive an improved control policy for discrete-time nonlinear
systems using ADP with an identified process model. Another
proposal involves a Q-learning algorithm to learn an improved
control policy in a model-free manner using only input-output data.
While these methods remove the requirement for having an exact
model, they still present several issues. For example, proposed
solutions are still based on ADP, so their performance relies on the
accuracy of the identified model. Note that as utilized herein, the
term "model-free" relates to the plant, meaning the disclosed
algorithm does not assume any information or structure about the
plant. There are two types of models: models of the plant, and
models (e.g., neural networks) in the machine learning algorithm
(that have nothing to do with control or the plant). The term
"model-free" as utilized herein can relate to not using a model for
the plant.
[0005] Other approaches to RL-based control include using a fixed
control strategy such as PID. With applications to process control,
some solutions have developed a model-free algorithm to dynamically
assign the PID gains from a pre-defined collection derived from
Internal Model Control. Other approaches, on the other hand, may
involve dynamically tuning a PID controller in a continuous
parameter space using the actor-critic method, where the actor is
the PID controller. This approach is based on Dual Heuristic
Dynamic Programming, where an identified model may be assumed to be
available. The actor-critic method has also been employed in
applications where the PID gains are the actions taken by the actor
at each time-step.
[0006] These methods treat the PID gains as the action of an RL
agent produced by some function approximation method such as a Deep
Neural Network or Quadratic Function Approximation. This point of
view can lead to dynamically changing PID gains. While closed-loop
instability has not been reported in the aforementioned approaches,
it is known in the hybrid systems literature that switching between
control strategies, even stabilizing ones, can destabilize the
closed-loop system.
BRIEF SUMMARY
[0007] The following summary is provided to facilitate an
understanding of some of the features of the disclosed embodiments
and is not intended to be a full description. A full appreciation
of the various aspects of the embodiments disclosed herein can be
gained by taking the specification, claims, drawings, and abstract
as a whole.
[0008] It is, therefore, one aspect of the disclosed embodiments to
provide for an improved machine learning method and system.
[0009] It is another aspect of the disclosed embodiments to provide
for a method and system, which allows for the direct tuning of PID
parameters using an actor-critic framework.
[0010] The aforementioned aspects and other objectives can now be
achieved as described herein. In an embodiment, a system for
reinforcement learning can include an actor-critic framework
comprising an actor and a critic, the actor comprising an actor
network and the critic comprising a critic network; and a
controller comprising a neural network embedded in the actor-critic
framework and which can be tuned according to reinforcement
learning based tuning including anti-windup tuning.
[0011] In an embodiment, the controller can include parameters
comprising an anti-windup parameter.
[0012] In an embodiment, the controller can allow for constraining
of individual parameters.
[0013] In an embodiment, the actor network can be initialized with
gains, which are already in use or known to be stabilizing.
[0014] In an embodiment, the controller can include a PID
(Proportional Integral Derivative) controller.
[0015] In an embodiment, the weights associated with the actor can
be initialized with selected PID gains.
[0016] In an embodiment, the PID controller can include a
PD (Proportional-Derivative) portion and an integral portion.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The accompanying figures, in which like reference numerals
refer to identical or functionally-similar elements throughout the
separate views and which are incorporated in and form a part of the
specification, further illustrate the present invention and,
together with the detailed description of the invention, serve to
explain the principles of the present invention.
[0018] FIG. 1 illustrates a block diagram of a closed-loop system
based on a plant model that includes a neural network and a plant,
in accordance with an embodiment;
[0019] FIG. 2 illustrates a block diagram of a closed-loop system
that includes a PID controller and an actuator, in accordance with
an embodiment;
[0020] FIG. 3 illustrates a block diagram of a parameterized form
of an actor and a critic in the context of an actor-critic
framework, in accordance with an embodiment;
[0021] FIG. 4 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0022] FIG. 5 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0023] FIG. 6 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0024] FIG. 7 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0025] FIG. 8 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0026] FIG. 9 illustrates graphs depicting simulation results based
on the training of actor and critic networks, in accordance with an
embodiment;
[0027] FIG. 10 illustrates a schematic view of a data-processing
system, in accordance with an embodiment; and
[0028] FIG. 11 illustrates a schematic view of a software system
including a module, an operating system, and a user interface, in
accordance with an embodiment.
DETAILED DESCRIPTION
[0029] The particular values and configurations discussed in these
non-limiting examples can be varied and are cited merely to
illustrate one or more embodiments and are not intended to limit
the scope thereof.
[0030] Subject matter will now be described more fully hereinafter
with reference to the accompanying drawings, which form a part
hereof, and which show, by way of illustration, specific example
embodiments. Subject matter may, however, be embodied in a variety
of different forms and, therefore, covered or claimed subject
matter is intended to be construed as not being limited to any
example embodiments set forth herein; example embodiments are
provided merely to be illustrative. Likewise, a reasonably broad
scope for claimed or covered subject matter is intended. Among
other issues, subject matter may be embodied as methods, devices,
components, or systems. Accordingly, embodiments may, for example,
take the form of hardware, software, firmware, or a combination
thereof. The following detailed description is, therefore, not
intended to be interpreted in a limiting sense.
[0031] Throughout the specification and claims, terms may have
nuanced meanings suggested or implied in context beyond an
explicitly stated meaning. Likewise, phrases such as "in one
embodiment" or "in an example embodiment" and variations thereof as
utilized herein may not necessarily refer to the same embodiment
and the phrase "in another embodiment" or "in another example
embodiment" and variations thereof as utilized herein may or may
not necessarily refer to a different embodiment. It is intended,
for example, that claimed subject matter include combinations of
example embodiments in whole or in part.
[0032] In general, terminology may be understood, at least in part,
from usage in context. For example, terms such as "and," "or," or
"and/or" as used herein may include a variety of meanings that may
depend, at least in part, upon the context in which such terms are
used. Generally, "or" if used to associate a list, such as A, B, or
C, is intended to mean A, B, and C, here used in the inclusive
sense, as well as A, B, or C, here used in the exclusive sense. In
addition, the term "one or more" as used herein, depending at least
in part upon context, may be used to describe any feature,
structure, or characteristic in a singular sense or may be used to
describe combinations of features, structures, or characteristics
in a plural sense. Similarly, terms such as "a," "an," or "the",
again, may be understood to convey a singular usage or to convey a
plural usage, depending at least in part upon context. In addition,
the term "based on" may be understood as not necessarily intended
to convey an exclusive set of factors and may, instead, allow for
existence of additional factors not necessarily expressly
described, again, depending at least in part on context.
[0033] Note that as utilized herein the term plant can relate to a
"plant" in the context of control theory. A plant in this context
can be the combination of process and an actuator and may also be
considered as a transfer function indicating the relationship
between an input signal and the output signal of a system without
feedback, commonly determined by physical properties of the system.
An example may be an actuator and its transfer from the input of the
actuator to its physical displacement. In a system with feedback,
the plant still may have the same transfer function, but a control
unit and a feedback loop (with their respective transfer functions)
may be added to the system.
[0034] FIG. 1 illustrates a block diagram of a closed-loop system
100 based on a plant model that includes a neural network 102 and a
plant 104, in accordance with an embodiment. The closed-loop system
100 can be implemented based on Deep Reinforcement Learning and an
actor-critic architecture to develop a model-free, input-output
controller for set-point tracking problems of discrete-time
nonlinear processes. An ReLU Deep Neural Network (DNN) can
parameterize both the actor and critic in such an actor-critic
architecture. At the end of the training, the closed loop system
100 can include the plant 104 together with the neural network 102
as a feedback controller. Note that the neural network 102 may be
implemented as a DNN. Thus, in FIG. 1, the neural network 102 can
be thought of as a "black box" in the sense that even with an exact
plant model, it is unclear whether the closed-loop system is
internally stable.
[0035] The disclosed embodiments thus relate to a simple
interpretation of the actor-critic framework by expressing a PID
controller as a shallow neural network. The PID gains can be the
weights of the actor network. The critic is the Q-function
associated with the actor, and can be parameterized by a DNN. The
disclosed embodiments can apply a Deep Deterministic Policy
Gradient algorithm and can include a significant simplification of
a model-free approach to control. The disclosed embodiments can be
extended to include a tuning parameter for Anti-Windup
compensation. Finally, the simplicity of the disclosed actor
network allows us to initialize training with pre-existing PID
gains as well as to incorporate individual constraints on each
parameter. The actor can be therefore initialized as an
operational, interpretable, and industrially accepted controller
that can be then updated in an optimal direction after each
roll-out in the plant.
[0036] FIG. 2 illustrates a block diagram of a closed-loop system
120 that includes a PID controller comprising a PD control block
122, a k.sub.i block 124, and an actuator 132, in accordance with
an embodiment. In the closed-loop system 120 shown in FIG. 2, the
PID controller, including the PD control block 122 and the k.sub.i
block 124, can be subject to an input $e_y = \bar{y} - y$, where
$\bar{y}$ is the reference. That is, the input $e_y$ can be fed to
both the PD control block 122 and the $k_i$ block 124. The PD
control block 122 supplies an output signal that can be fed as
input to a summation unit 130.
[0037] The PID controller is thus split into two pieces: a PD
(proportional-derivative) portion and an I (Integral) portion. The
PD control block 122 leading to a summation block 130 concerns the
first split, PD. The k.sub.i block 124 provides an output signal
that can be supplied as input to a summation unit 126. The output
signal from the summation unit 126 can be fed as input to a 1/s
block 128, which in turn can provide a signal that is fed to, and
completes, the summation block 130. The output signal from the
summation block 130 can be the signal fed to the actuator 132. The
output of the actuator 132 can be the saturated output signal from
the summation block 130. The difference between the output signal
from the summation block 130 and the actuator 132 can be evaluated
at the summation block 134. The output from summation block 134 is
fed to the block 136, the output of which can be then fed to, and
completes the summation block 126.
[0038] A parallel form of a PID controller can be implemented as
shown in Equation (1):
$$u(t) = k_p\, e(t) + k_i \int_0^t e(\tau)\, d\tau + k_d\, \frac{d}{dt} e(t). \quad (1)$$
[0039] In Equation (1) above, if we refer to the reference signal at
time $t$ as $\bar{y}(t)$, then $e_y(t) := \bar{y}(t) - y(t)$. To implement
the PID controller it may be necessary to discretize the parameter $u$ in
Equation (1). In such a case, we can let $\Delta t > 0$ be a fixed
sampling time and then define $I_y(t_n) = \sum_{i=1}^{n} e_y(t_i)\,\Delta t$,
where $0 = t_0 < t_1 < \ldots < t_n$, and

$$D(t_n) = \frac{e(t_n) - e(t_{n-1})}{\Delta t}.$$

The parameter $u$ can then be used to refer to the discretized
version of Equation (1), which can be written as follows, as shown
in Equation (2):

$$u(t_n) := k_p e(t_n) + k_i I_y(t_n) + k_d D(t_n). \quad (2)$$
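By way of a non-limiting illustration, the discretization in Equation (2) can be sketched in a few lines of Python; the names `PIDState` and `pid_step` are illustrative only and do not appear in the specification.

```python
# Illustrative sketch (not part of the specification) of the discretized PID law
# in Equation (2); dt is the fixed sampling time Delta-t.
from dataclasses import dataclass

@dataclass
class PIDState:
    integral: float = 0.0    # I_y(t_n): running sum of errors scaled by dt
    prev_error: float = 0.0  # e_y(t_{n-1}): used for the finite-difference derivative

def pid_step(kp: float, ki: float, kd: float, error: float, state: PIDState, dt: float) -> float:
    """One evaluation of u(t_n) = kp*e(t_n) + ki*I_y(t_n) + kd*D(t_n), per Equation (2)."""
    state.integral += error * dt                   # I_y(t_n)
    derivative = (error - state.prev_error) / dt   # D(t_n)
    state.prev_error = error
    return kp * error + ki * state.integral + kd * derivative
```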
[0040] The problem of tuning a PID controller can be handled
utilizing a variety of approaches. For example, strategies include
heuristics with look-up tables, optimization methods, relay tuning,
or some combination of one or more of these strategies.
[0041] The PID controller can become saturated when it has output
constraints and can be given a setpoint outside of the operating
region. If the actuator constraints are two scalars $\alpha < \beta$,
the saturation function can be defined as shown in Equation (3) below:

$$\operatorname{sat}(u) = \begin{cases} u, & \text{if } \alpha \le u \le \beta \\ \alpha, & \text{if } u < \alpha \\ \beta, & \text{if } u > \beta \end{cases} \quad (3)$$
[0042] If saturation persists, the controller can operate in an
open-loop and the integrator can continue to accumulate error at a
non-diminishing rate; that is, the integrator can experience
windup. This can create a nonlinearity in the PID controller and
may destabilize the closed-loop system 100. Methods for mitigating
the effects of windup can be referred to by the term
anti-windup.
[0043] While there are many approaches to anti-windup design, the
disclosed approach focuses on back-calculation, which can function
in discrete-time by feeding back into the control signal a scaled
sum of past deviations of the actuator signal from the unsaturated
signal. The nonnegative scaling constant, $\rho$, can govern how
quickly the PID controller unsaturates (that is, returns to the
region $[\alpha, \beta]$). Precisely, we can define
$e_u(t) := \operatorname{sat}(u(t)) - u(t)$ and
$I_u(t_n) := \sum_{i=1}^{n-1} e_u(t_i)\,\Delta t$; we then redefine
the PID controller from Equation (2) as follows, as shown in
Equation (4):

$$u(t_n) := k_p e(t_n) + k_i I_y(t_n) + k_d D(t_n) + \rho I_u(t_n) \quad (4)$$
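As a non-limiting sketch, the saturation of Equation (3) and the back-calculation term of Equation (4) might be combined in Python as follows; the helper names and the explicit state variables are assumptions made for illustration.

```python
# Illustrative sketch of Equations (3)-(4): saturation plus back-calculation anti-windup.
# alpha and beta are the actuator limits; rho >= 0 is the anti-windup constant.
def sat(u: float, alpha: float, beta: float) -> float:
    """Equation (3)."""
    return min(max(u, alpha), beta)

def pid_antiwindup(kp, ki, kd, rho, e, I_y, D, I_u):
    """Equation (4): u(t_n) = kp*e + ki*I_y + kd*D + rho*I_u."""
    return kp * e + ki * I_y + kd * D + rho * I_u

# During a roll-out, the caller would accumulate the anti-windup integrator, e.g.:
#   u_hat = pid_antiwindup(kp, ki, kd, rho, e, I_y, D, I_u)
#   u = sat(u_hat, alpha, beta)
#   I_u += (u - u_hat) * dt   # e_u = sat(u) - u is nonzero only while saturated
```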
[0044] From Equation (3) it is clear that if the PID controller is
operating within its constraints, then Equation (4) is equal to
Equation (2); otherwise the difference $\operatorname{sat}(u) - u$
can add negative feedback to the PID controller if $u > \beta$, or
positive feedback if $u < \alpha$. Further, Equation (4) equals
Equation (2) when $\rho = 0$; the recovery time of the PID
controller to the operating region $[\alpha, \beta]$ is therefore
slower the closer $\rho$ is to zero and more aggressive when $\rho$
is large.
[0045] As previously described, the actor (PID controller) can be
updated after each roll-out with the environment. We are, however,
free to change the PID gains at each timestep, as the RL problem is
originally formulated and implemented. There are two main reasons for avoiding
this. One is that the PID controller can be designed for set-point
tracking and may be an inherently intelligent controller that
simply needs to be improved subject to the user-defined objective
(e.g., reward function). Second, when the PID gains are free to
change at each time-step, the learned policy can essentially
function as a gain scheduler. The closed loop stability therefore
can become more difficult to analyze, even if all the gains are
stabilizing.
[0046] We now turn to the subject of PID in the
reinforcement-learning framework. The disclosed PID tuning can stem
from a state-space representation of Equation (2), as shown below
in Equations (5)-(7):

$$\begin{bmatrix} e_y(t_{n+1}) \\ I_y(t_{n+1}) \\ D(t_{n+1}) \\ I_u(t_{n+1}) \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ -1/\Delta t & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} e_y(t_n) \\ I_y(t_n) \\ D(t_n) \\ I_u(t_n) \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ \Delta t & 0 \\ 1/\Delta t & 0 \\ 0 & \Delta t \end{bmatrix} \begin{bmatrix} e_y(t_{n+1}) \\ e_u(t_{n+1}) \end{bmatrix} \quad (5)$$

$$\hat{u}(t_{n+1}) = \begin{bmatrix} k_p & k_i & k_d & \rho \end{bmatrix} \begin{bmatrix} e_y(t_{n+1}) \\ I_y(t_{n+1}) \\ D(t_{n+1}) \\ I_u(t_{n+1}) \end{bmatrix} \quad (6)$$

$$u(t_{n+1}) = \operatorname{sat}(\hat{u}(t_{n+1})) \quad (7)$$
[0047] Equation (5) simply describes the computations that may be
necessary for implementing a PID controller with a fixed sampling
time. On the other hand, Equation (6) parameterizes the PID
controller. We can therefore take Equation (6) and Equation (7)
above to be a shallow neural network, where $[k_p, k_i, k_d, \rho]$
is a vector of trainable weights and the saturation function of
Equation (7) is a nonlinear activation. In the next section we
explain how Reinforcement Learning (RL) can be used to train these
weights without a process model.
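As a non-limiting sketch, the state-space recursion of Equation (5) and the shallow-network view of Equations (6)-(7) can be written directly in Python with NumPy; the function names are illustrative, and a fixed sampling time `dt` and actuator limits `alpha`, `beta` are assumed.

```python
import numpy as np

def state_update(s: np.ndarray, e_y_next: float, e_u_next: float, dt: float) -> np.ndarray:
    """Equation (5): s = [e_y, I_y, D, I_u] advanced one sampling period."""
    A = np.array([[0.0,      0.0, 0.0, 0.0],
                  [0.0,      1.0, 0.0, 0.0],
                  [-1.0/dt,  0.0, 0.0, 0.0],
                  [0.0,      0.0, 0.0, 1.0]])
    B = np.array([[1.0,    0.0],
                  [dt,     0.0],
                  [1.0/dt, 0.0],
                  [0.0,    dt]])
    return A @ s + B @ np.array([e_y_next, e_u_next])

def actor(s: np.ndarray, K: np.ndarray, alpha: float, beta: float) -> float:
    """Equations (6)-(7): a one-layer network whose weights K = [kp, ki, kd, rho]
    act on the state, with the saturation serving as the activation."""
    u_hat = float(K @ s)
    return float(np.clip(u_hat, alpha, beta))
```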
[0048] The fundamental components of RL are the policy, the
objective, and the environment. We can assume that a Markov
Decision Process with action space $U$ and state space $S$ may model
the environment. Here, $s_t \in S$ can refer to the left-hand side
of Equation (5) and $u_t \in U$ can refer to the left-hand side of
Equation (6). The vector of weights parameterizing Equation (6) can
be referred to as $K$. Formally, the PID controller with anti-windup
compensation in Equation (6) can be given by the mapping
$\mu(\cdot, K): S \to U$ such that $u_t = \mu(s_t, K)$.
[0049] The controller can interact with the environment, which can
therefore be modeled as an initial distribution $p(s_1)$ together
with a transition distribution $p(s_{t+1} \mid s_t, u_t)$. These
interactions are goal-oriented. That is, each interaction with the
environment can be scored with a scalar value called the reward. A
goal of RL can be to find a controller that maximizes the
expectation of future rewards across state-action pairs.
[0050] We can define the state $s_t$ as in Equation (6) and the
reward function to be as shown in Equation (8) below:

$$r(s_t, u_t) = -\left(|e_y(t)|^{p} + \lambda |u_t|\right), \quad (8)$$

[0051] where $p = 1$ or $2$ and $\lambda \ge 0$ are fixed during
training. We can use the notation $h \sim p^{\mu}(\cdot)$ to denote
an arbitrary trajectory $h = (s_1, u_1, r_1, \ldots, s_T, u_T, r_T)$
generated by the policy $\mu$, where $T$ is a random variable
referred to as the terminal time and $r_t$ is shorthand for the
reward at time $t$.
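For illustration only, the reward of Equation (8) reduces to a one-line function; `p` and `lam` correspond to the exponent $p$ and weight $\lambda$ fixed during training.

```python
# Illustrative sketch of Equation (8); p is 1 or 2 and lam >= 0, both fixed during training.
def reward(e_y: float, u: float, p: int = 1, lam: float = 0.0) -> float:
    return -(abs(e_y) ** p + lam * abs(u))
```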
[0052] The desirability of a PID controller with gains $K$ can be
measured in terms of the expected cumulative reward over
trajectories $h$:

$$J(\mu(\cdot, K)) = \mathbb{E}_{h \sim p^{\mu}(\cdot)}\left[\left. \sum_{t=1}^{\infty} \gamma^{t-1}\, r(s_t, \mu(s_t, K)) \,\right|\, s_0 \right] \quad (9)$$

[0053] where $s_0 \in S$ is a starting state, and
$0 \le \gamma \le 1$ is a discount factor. Our strategy is to
iteratively maximize $J$ via stochastic gradient ascent, as
maximizing $J$ corresponds to finding the optimal PID gains. This
objective may require several additional concepts, which are
outlined in the next section.
[0054] Equation (9) above can be referred to as the value function
for policy $\mu$. Closely related to the value function is the
Q-function, which considers state-action pairs in the conditional
expectation:

$$Q(s_t, u_t) := \mathbb{E}_{h \sim p^{\mu}(\cdot)}\left[\left. \sum_{t=1}^{\infty} \gamma^{t-1}\, r(s_t, \mu(s_t, K)) \,\right|\, s_t, u_t \right] \quad (10)$$
[0055] In continuous state and action spaces, we may not be able to
precisely evaluate Equation (10). Instead, we can approximate $Q$
iteratively using a deep neural network with training data from
Replay Memory. Replay Memory can be a fixed-size collection of
tuples of the form $(s_t, u_t, s_{t+1}, r_t)$. Concretely, we can
write a parametrized Q-function,
$Q(\cdot, \cdot, W_c): S \times U \to \mathbb{R}$, where $W_c$ is a
collection of weights. This approximate Q-function can be referred
to as the critic. One of our objectives can therefore be to minimize
the loss:

$$\mathcal{L}_t(W_c) = \mathbb{E}_{s_t \sim \rho^{\beta}(\cdot),\, u_t \sim \beta(\cdot \mid s_t)}\left[\left(q_t - Q(s_t, u_t, W_c)\right)^2\right],$$
[0056] where $q_t$ refers to a target for the value
$Q(s_t, u_t, W_c)$. Ideally, $q_t = Q^*(s_t, u_t)$, but since $Q^*$
may be unavailable, we can use the bootstrap approximation as
follows:

$$Q(s_t, u_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, u_t)}\left[ r(s_t, u_t) + \gamma\, Q(s_{t+1}, u_{t+1}) \right] \approx r(s_t, u_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}, K), W_c) = q_t. \quad (11)$$

[0057] The quantity given by Equation (11) above can be tractable
since each term can be held in Replay Memory or computed with $\mu$
or the DNN approximation of $Q$.
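As a non-limiting sketch, the bootstrap target of Equation (11) and the squared error minimized by the critic can be expressed as follows, assuming generic callables `critic(s, u, Wc)` and `policy(s, K)` standing in for the DNN critic and the PID actor; these names are illustrative.

```python
# Illustrative sketch of the target q_t in Equation (11) and the critic's squared TD error,
# averaged over a mini-batch of (s, u, s_next, r) tuples drawn from Replay Memory.
def td_target(r, s_next, K, Wc, gamma, critic, policy):
    """q_t ~= r(s_t, u_t) + gamma * Q(s_{t+1}, mu(s_{t+1}, K), W_c)."""
    return r + gamma * critic(s_next, policy(s_next, K), Wc)

def critic_loss(batch, K, Wc, gamma, critic, policy):
    """Mean of (q_t - Q(s_t, u_t, W_c))^2 over the sampled batch."""
    errors = [
        (td_target(r, s_next, K, Wc, gamma, critic, policy) - critic(s, u, Wc)) ** 2
        for (s, u, s_next, r) in batch
    ]
    return sum(errors) / len(errors)
```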
[0058] FIG. 3 illustrates a block diagram of an actor 133 and a
critic 135 in an actor-critic framework, in accordance with an
embodiment. The actor 133 is shown at the left side of FIG. 3 and
depicts the input passing through PID parameters, leading to an
action. On the right side of FIG. 3 is the critic 135, a DNN (Deep
Neural Network) that takes as inputs the input and output of the
actor 133.
[0059] The deterministic actor-critic method can be the basis of
the DRL controller. More precisely, the actor-critic method can be
a combination of policy gradient methods and Q-learning via a
temporal difference (TD) update. The actor can be the PID
controller given by Equation (7) and the critic can be an
approximation of the Q-function given in Equation (11).
[0060] Returning to our objective of maximizing Equation (9), we
can employ a stochastic gradient method on both the actor and the
critic. To perform this update, we can use a policy gradient
theorem for deterministic policies to approximate the gradient of
$J$ in terms of the critic $Q^{\mu}(\cdot, \cdot, W_c)$, as follows:

$$\hat{\nabla}_K J(\mu(\cdot, K)) = \mathbb{E}_{s_t \sim \rho_{\gamma}^{\beta}(\cdot)}\left[ \nabla_u Q^{\mu}(s_t, u, W_c)\big|_{u = \mu(s_t, K)}\, \nabla_K \mu(s_t, K) \right], \quad (12)$$

[0061] where
$\rho_{\gamma}^{\beta}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, p(s_t = s \mid s_0, \mu)$
is a discounted state visitation distribution. Note that Equation
(9) is maximized only when the policy parameters $K$ are optimal,
which can then lead to the following update scheme:

$$W_{t+1} \leftarrow W_t + \alpha_{a,t}\, \hat{\nabla}_K J(\mu(\cdot, K))\big|_{K = W_t}. \quad (13)$$
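By way of a non-limiting sketch, the sampled gradient-ascent step of Equations (12)-(13) for the linear actor of Equation (6) might look as follows; `grad_u_Q` stands in for the critic's action gradient, and the saturation's flat regions are ignored for brevity (so that the gradient of $\mu(s, K)$ with respect to $K$ is simply the state $s$). These simplifications are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch of Equations (12)-(13): ascend the approximate policy gradient.
def actor_update(K, batch_states, grad_u_Q, Wc, lr, alpha_lim, beta_lim):
    g = np.zeros_like(K)
    for s in batch_states:
        u = float(np.clip(K @ s, alpha_lim, beta_lim))  # mu(s, K), Equations (6)-(7)
        g += grad_u_Q(s, u, Wc) * s                     # chain rule of Equation (12)
    return K + lr * g / len(batch_states)               # gradient ascent, Equation (13)
```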
[0062] We can update the parameters in Equation (11) for the critic
using batch gradient descent, where our batch data come from a
cache of tuples of the form $(s_t, u_t, s_{t+1}, r(s_t, s_{t+1}, u_t))$.
Hence, it is important that our state properly captures the dynamics
of the system it represents, so as to make meaningful parameter
updates.
[0063] Since the actor network may be simply a PID controller, we
are able to incorporate known information about the plant it is
controlling. For instance, we are able to initialize the actor
network with gains that are already in use or known to be
stabilizing. The idea is that these gains will be updated by
stochastic gradient ascent in the approximate direction leading to
the greatest expected reward.
[0064] If a rough model of the process is known, we can estimate
the region of PID gains in $\mathbb{R}^3$ for which closed-loop stability
is attained. One method for achieving this can involve
considering the boundary of the stabilizing gains set that includes
the pairs (k.sub.p, k.sub.i), (k.sub.p, k.sub.d), or (k.sub.i,
k.sub.d).
[0065] One advantage of the disclosed approach is that the weights
for the actor can be initialized with hand-picked PID gains. For
example, if a plant such as the plant 104 shown in FIG. 1 is
operating with known gains k.sub.p, k.sub.i, and k.sub.d, then
these gains can be used to initialize the actor. The quality of the
gain updates then relies on the quality of the value function
used in Equation (13). The value function can be parameterized by a
deep neural network and can be therefore initialized randomly. Both
the actor and critic parameters can be updated after each roll-out
with the environment. However, depending on the number of timesteps
in each roll-out, this can lead to slow learning. Therefore, we can
continually update the critic during the roll-out using batch data
from Replay Memory.
[0066] Equation (14) and the Algorithm 1 shown below present an
example of a DRL algorithm:
$$\overline{\frac{\partial Q^{\mu}}{\partial u}}(s, u, W) := \frac{\partial Q^{\mu}}{\partial u}(s, u, W) \times \begin{cases} \dfrac{u_H - u}{u_H - u_L}, & \text{if } \dfrac{\partial Q^{\mu}}{\partial u}(s, u, W) > 0 \\[1ex] \dfrac{u - u_L}{u_H - u_L}, & \text{otherwise} \end{cases} \quad (14)$$
[0067] That is, as shown below, Algorithm 1 is an example of a deep
reinforcement learning (DRL) controller.
Algorithm 1: Deep Reinforcement Learning Controller

 1: Output: Optimal PID controller μ(s, K)
 2: Initialize: Actor K to tuning parameters
 3: Initialize: Critic W_c to random weights
 4: Initialize: Target weights K' ← K and W_c' ← W_c
 5: Initialize: Replay Memory (RM) with random policies
 6: for each episode do
 7:   Initialize: e(0), I(0), D(0)
 8:   Set y_sp ← set-point from the user
 9:   for each step t of episode 0, 1, ..., T − 1 do
10:     Set s ← e_t, I_t, D_t
11:     Set u_t ← μ(s, K) + (exploration noise)
12:     Take action u_t, observe y_{t+1} and r
13:     Set s' ← e_{t+1}, I_{t+1}, D_{t+1}
14:     Store tuple (s, u_t, s', r) in RM
15:     Uniformly sample M tuples from RM
16:     for i = 1 to M do
17:       Set ỹ^(i) ← r^(i) + γ Q^μ(s'^(i), μ(s'^(i), K'), W_c')
18:     Set W_c ← W_c + (α_c / M) Σ_{i=1}^{M} (ỹ^(i) − Q^μ(s^(i), u^(i), W_c)) ∇_{W_c} Q^μ(s^(i), u^(i), W_c)
19:     for i = 1 to M do
20:       Calculate ∇_u Q^μ(s^(i), u, W_c)|_{u=u^(i)}
21:       Clip ∇_u Q^μ(s^(i), u, W_c)|_{u=u^(i)} using Equation (14)
22:     Set K ← K + (α_a / M) Σ_{i=1}^{M} ∇_K μ(s^(i), K) ∇_u Q^μ(s^(i), u, W_c)|_{u=u^(i)}
23:     Set K' ← τ K + (1 − τ) K'
24:     Set W_c' ← τ W_c + (1 − τ) W_c'
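As a heavily hedged, non-limiting sketch, the outer loop of Algorithm 1 together with the gradient clipping of Equation (14) could be organized as below; the environment object, the exploration noise, and the omitted critic/actor update steps (lines 16-24 of Algorithm 1) are assumptions made for illustration and would be filled in with the machinery sketched earlier.

```python
import random
import numpy as np

def clip_action_gradient(dQ_du: float, u: float, u_low: float, u_high: float) -> float:
    """Equation (14): scale the critic's action gradient toward the feasible input range."""
    scale = (u_high - u) / (u_high - u_low) if dQ_du > 0 else (u - u_low) / (u_high - u_low)
    return dQ_du * scale

def run_episode(K, Wc, replay, env, gamma, M, noise_std, u_low, u_high):
    """One episode of Algorithm 1 (lines 6-24), with the update steps left as comments."""
    s = env.reset()                                   # e(0), I(0), D(0)
    for _ in range(env.horizon):
        u = float(np.clip(K @ s, u_low, u_high)) + np.random.normal(0.0, noise_std)
        s_next, r = env.step(u)                       # observe y_{t+1} and the reward r
        replay.append((s, u, s_next, r))              # store the tuple in Replay Memory
        batch = random.sample(replay, min(M, len(replay)))
        # Lines 16-18: form targets with the target networks and update W_c.
        # Lines 19-22: clip the action gradients with clip_action_gradient and update K.
        # Lines 23-24: Polyak-average the target weights K' and W_c'.
        s = s_next
    return K, Wc
```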
[0068] In the following non-limiting examples, the RMSprop
optimizer was used to train the actor and SGD with momentum to
train the critic. The actor and critic networks were trained using
TensorFlow and the processes were simulated in discrete time with
the Control Systems Library for Python. The hyperparameters in
Algorithm 1 used across all examples are as follows: mini-batch
size M = 256, Replay Memory size of 10^5, and discount factor
γ = 0.99.
[0069] In a first example, we can consider the following
continuous-time transfer function:
$$G(s) = \frac{2\, e^{-s}}{6 s + 1}. \quad (15)$$
[0070] In this example, we consider a PI controller initialized
with gains k.sub.p=0.2; k.sub.i=0.05. We can discretize Equation
(15) with timesteps of 0.1 seconds.
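For illustration, the plant of Equation (15) could be set up and discretized with the python-control library (mentioned above as the simulation tool); approximating the pure delay e^{-s} with a Padé approximation is an assumption made for this sketch rather than something stated in the specification.

```python
# Illustrative sketch: build G(s) = 2 e^{-s} / (6 s + 1) from Equation (15) and
# discretize it with a 0.1 s sampling time, as in paragraph [0070].
import control

G = control.tf([2], [6, 1])                  # 2 / (6 s + 1)
num, den = control.pade(1.0, 3)              # 3rd-order Pade approximation of e^{-s} (assumed)
delay = control.tf(num, den)
Gd = control.sample_system(G * delay, 0.1)   # zero-order-hold discretization, dt = 0.1 s
print(Gd)
```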
[0071] A second example concerns the double integrator,
G(s)=1/s.sup.2. Consider the following collection of transfer
functions:
$$\mathcal{P} = \{\, G(s)\, e^{-\tau s} : 0 \le \tau \le 0.1 \,\} \quad (16)$$
[0072] In a third example, we can incorporate an anti-windup tuning
parameter. Consider the following transfer function, as shown in
Equation (17):
$$G(s) = \frac{1}{(s+1)^3}. \quad (17)$$
[0073] FIG. 4 illustrates graphs 142, 144, pertaining to example 1
(equation 15), depicting simulation results based on the training
of actor and critic networks, in accordance with an embodiment.
Graph 142 plots data indicative of output data versus time
(seconds). Graph 144 plots data indicative of input data versus
time (seconds).
[0074] FIG. 5 illustrates graphs 152, 154, pertaining to example 1
(equation 15), depicting simulation results based on the training
of actor and critic networks, in accordance with an embodiment.
Graph 152 plots data indicative of proportional gain with respect
to episode number. Graph 154 plots data indicative of integral gain
with respect to episode number.
[0075] FIG. 6 illustrates graphs 162, 164, pertaining to example 2
(equation 16), depicting simulation results based on the training
of actor and critic networks, in accordance with an embodiment.
Graph 162 plots data indicative of output data versus time
(seconds), and graph 164 plots data indicative of input data versus
time (seconds).
[0076] FIG. 7 illustrates graphs 172, 174, 176, pertaining to
example 2 (equation 16), depicting simulation results based on the
training of actor and critic networks, in accordance with an
embodiment. Graph 172 plots data indicative of proportional gain
with respect to episode numbers. Graph 174 plots data indicative of
integral gain with respect to episode numbers. Graph 176 plots data
indicative of derivative gain with respect to episode numbers.
[0077] FIG. 8 illustrates graphs 182, 184, pertaining to example 3
(equation 17), depicting simulation results based on the training
of actor and critic networks, in accordance with an embodiment.
Graph 182 plots data indicative of output versus time-steps. Graph
184 plots data indicative of input versus time-steps.
[0078] FIG. 9 illustrates graphs 192, 194, 196, pertaining to
example 3 (equation 17), depicting simulation results based on the
training of actor and critic networks, in accordance with an
embodiment. Graph 192 plots data indicative of proportional gain with
respect to episode numbers. Graph 194 plots data indicative of
integral gain with respect to episode numbers. Graph 196 plots data
indicative of anti-windup gain with respect to episode numbers.
[0079] As can be appreciated by one skilled in the art, embodiments
can be implemented in the context of a method, data processing
system, or computer program product. Accordingly, embodiments may
take the form of an entirely hardware embodiment, an entirely
software embodiment or an embodiment combining software and
hardware aspects all generally referred to herein as a "circuit" or
"module." Furthermore, embodiments may in some cases take the form
of a computer program product on a computer-usable storage medium
having computer-usable program code embodied in the medium. Any
suitable computer readable medium may be utilized including hard
disks, USB Flash Drives, DVDs, CD-ROMs, optical storage devices,
magnetic storage devices, server storage, databases, etc.
[0080] Computer program code for carrying out operations of the
present invention may be written in an object oriented programming
language (e.g., Java, C++, etc.). The computer program code,
however, for carrying out operations of particular embodiments may
also be written in procedural programming languages, such as the
"C" programming language or in a visually oriented programming
environment, such as, for example, Visual Basic.
[0081] The program code may execute entirely on the user's
computer, partly on the user's computer, as a stand-alone software
package, partly on the user's computer and partly on a remote
computer or entirely on the remote computer. In the latter
scenario, the remote computer may be connected to a user's computer
through a bidirectional data communications network such as a local
area network (LAN) or a wide area network (WAN), a wireless local
area network (WLAN), wireless data network e.g., Wi-Fi, Wimax,
802.xx, and/or a cellular network or the bidirectional connection
may be made to an external computer via most third party supported
networks (for example, through the Internet utilizing an Internet
Service Provider).
[0082] The embodiments are described at least in part herein with
reference to flowchart illustrations and/or block diagrams of
methods, systems, and computer program products and data structures
according to embodiments of the invention. It will be understood
that each block or feature of the illustrations, and combinations
of blocks or features, can be implemented by computer program
instructions. These computer program instructions may be provided
to a processor of, for example, a general-purpose computer,
special-purpose computer, or other programmable data processing
apparatus to produce a machine, such that the instructions, which
execute via the processor of the computer or other programmable
data processing apparatus, create means for implementing the
functions/acts specified in the block or blocks or elsewhere
herein. To be clear, the disclosed embodiments can be implemented
in the context of, for example a special-purpose computer or a
general-purpose computer, or other programmable data processing
apparatus or system. For example, in some embodiments, a data
processing apparatus or system can be implemented as a combination
of a special-purpose computer and a general-purpose computer.
[0083] These computer program instructions may also be stored in a
computer-readable memory that can direct a computer or other
programmable data processing apparatus to function in a particular
manner, such that the instructions stored in the computer-readable
memory produce an article of manufacture including instruction
means which implement the function/act specified in the various
block or blocks, flowcharts, and other architecture illustrated and
described herein.
[0084] The computer program instructions may also be loaded onto a
computer or other programmable data processing apparatus to cause a
series of operational steps to be performed on the computer or
other programmable apparatus to produce a computer implemented
process such that the instructions which execute on the computer or
other programmable apparatus provide steps for implementing the
functions/acts specified in the block or blocks.
[0085] The flowchart and block diagrams in the figures illustrate
the architecture, functionality, and operation of possible
implementations of systems, methods, and computer program products
according to various embodiments of the present invention. In this
regard, each block in the flowchart or block diagrams may represent
a module, segment, or portion of instructions, which comprises one
or more executable instructions for implementing the specified
logical function(s). In some alternative implementations, the
functions noted in the block may occur out of the order noted in
the figures. For example, two blocks shown in succession may, in
fact, be executed concurrently, or the blocks may sometimes be
executed in the reverse order, depending upon the functionality
involved. It will also be noted that each block of the block
diagrams and/or flowchart illustration, and combinations of blocks
in the block diagrams and/or flowchart illustration, can be
implemented by special purpose hardware-based systems that perform
the specified functions or acts or carry out combinations of
special purpose hardware and computer instructions.
[0086] FIGS. 10-11 are shown only as exemplary diagrams of
data-processing environments in which example embodiments may be
implemented. It should be appreciated that FIGS. 10-11 are only
exemplary and are not intended to assert or imply any limitation
with regard to the environments in which aspects or embodiments of
the disclosed embodiments may be implemented. Many modifications to
the depicted environments may be made without departing from the
spirit and scope of the disclosed embodiments.
[0087] As illustrated in FIG. 10, some embodiments may be
implemented in the context of a data-processing system 400 that can
include, for example, one or more processors such as a CPU (Central
Processing Unit) 341 and/or another processor 349 (e.g., a
microprocessor, microcontroller, etc.), a memory 342, an input/output
controller 343, a peripheral USB (Universal Serial Bus) connection
347, a keyboard 344 and/or another input device 345 (e.g., a
pointing device, such as a mouse, track ball, pen device, etc.), a
display 346 (e.g., a monitor, touch screen display, etc) and/or
other peripheral connections and components.
[0088] As illustrated, the various components of data-processing
system 400 can communicate electronically through a system bus 351
or similar architecture. The system bus 351 may be, for example, a
subsystem that transfers data between, for example, computer
components within data-processing system 400 or to and from other
data-processing devices, components, computers, etc. The
data-processing system 400 may be implemented in some embodiments
as, for example, a server in a client-server based network (e.g.,
the Internet) or in the context of a client and a server (i.e.,
where aspects are practiced on the client and the server).
[0089] In some example embodiments, data-processing system 400 may
be, for example, a standalone desktop computer, a laptop computer,
a smartphone, a tablet computing device, a networked computer
server, and so on, wherein each such device can be operably
connected to and/or in communication with a client-server based
network or other types of networks (e.g., cellular networks, Wi-Fi,
etc). The data-processing system 400 can communicate with other
devices such as, for example, an electronic device 110.
Communication between the data-processing system 400 and the
electronic device 110 can be bidirectional, as indicated by the
double arrow 402. Such bidirectional communications may be
facilitated by, for example, a computer network, including wireless
bidirectional data communications networks.
[0090] FIG. 11 illustrates a computer software system 450 for
directing the operation of the data-processing system 400 depicted
in FIG. 10. Software application 454, stored for example in the
memory 342 can include one or more modules such as module 452. The
computer software system 450 also can include a kernel or operating
system 451 and a shell or interface 453. One or more application
programs, such as software application 454, may be "loaded" (i.e.,
transferred from, for example, mass storage or another memory
location into the memory 342) for execution by the data-processing
system 400. The data-processing system 400 can receive user
commands and data through the interface 453; these inputs may then
be acted upon by the data-processing system 400 in accordance with
instructions from operating system 451 and/or software application
454. The interface 453 in some embodiments can serve to display
results, whereupon a user 459 may supply additional inputs or
terminate a session. The software application 454 can include
module(s) 452, which can, for example, implement instructions,
steps or operations such as those discussed herein. Module 452 may
also be composed of a group of modules and/or sub-modules.
[0091] The following discussion is intended to provide a brief,
general description of suitable computing environments in which the
system and method may be implemented. The disclosed embodiments can
be described in the general context of computer-executable
instructions, such as program modules, being executed by a single
computer. In most instances, a "module" can constitute a software
application, but can also be implemented as both software and
hardware (i.e., a combination of software and hardware).
[0092] Generally, program modules include, but are not limited to,
routines, subroutines, software applications, programs, objects,
components, data structures, etc., that can perform particular
tasks or which can implement particular data types and
instructions. Moreover, those skilled in the art will appreciate
that the disclosed method and system may be practiced with other
computer system configurations, such as, for example, hand-held
devices, multi-processor systems, data networks,
microprocessor-based or programmable consumer electronics,
networked PCs, minicomputers, mainframe computers, servers, and the
like.
[0093] Note that the term module as utilized herein may refer to a
collection of routines and data structures that performs a
particular task or implements a particular data type. Modules may
be composed of two parts: an interface, which lists the constants,
data types, variables, and routines that can be accessed by other
modules or routines, and an implementation, which may be private
(e.g., accessible only to that module) and which can include source
code that actually implements the routines in the module. The term
module can also relate to an application, such as a computer
program designed to assist in the performance of a specific task,
such as implementing the operations associated with the example
Algorithm 1 previously discussed herein.
[0094] It can be appreciated that the technical solutions described
herein are rooted in computer technology, particularly using
reinforcement learning frameworks. The technical solutions
described herein can improve such computer technology by providing
the one or more advantages described throughout the present
disclosure, such as improving the performance of an incremental control
system and devices such as a controller (e.g., a PID controller).
The tuning of a PID controller is a challenge across many
industries. There are often many more PID controllers in a mill or
plant than there are competent persons to tune them. Therefore,
having an automated loop-tuning method could improve process
control and thus improve throughput, yield, or quality, while
saving time and effort.
[0095] The disclosed embodiments can utilize a machine learning
approach referred to as reinforcement learning to experiment on a
process and find optimal PID tuning parameters. The disclosed
embodiments include (a) the inclusion of a fourth PID tuning
parameter--the anti-windup parameter in the tuning algorithm, (b)
direct use of the PID controller itself as the `actor` within the
reinforcement learning approach, and (c) episodic switching of PID
parameters where the PID parameters are not updated at every
controller update, but instead can be set for a longer period of
time to gather more data about the system performance with the
improved parameters.
[0096] Note that the term machine learning as utilized herein can
relate to methods, systems, and devices for data analysis, which
can automate analytical model building. Machine learning is a
branch of artificial intelligence based on the concept that systems
can learn from data, identify patterns and make decisions with
minimal human intervention. The use of machine learning can lead to
technical solutions that improve the underlying computer technology,
such as increased efficiencies in computer memory management,
data-processing, and energy savings.
[0097] It will be appreciated that variations of the
above-disclosed and other features and functions, or alternatives
thereof, may be desirably combined into many other different
systems or applications. It will also be appreciated that various
presently unforeseen or unanticipated alternatives, modifications,
variations or improvements therein may be subsequently made by
those skilled in the art which are also intended to be encompassed
by the following claims.
* * * * *