U.S. patent application number 14/309641 was filed with the patent office on 2014-06-19 and published on 2015-12-24 under publication number 20150370227 for controlling a target system.
The applicants listed for this patent are Hany F. Bassily, Siegmund Düll, Michael Müller, Clemens Otte, and Steffen Udluft. Invention is credited to Hany F. Bassily, Siegmund Düll, Michael Müller, Clemens Otte, and Steffen Udluft.
United States Patent Application 20150370227
Kind Code: A1
Bassily; Hany F.; et al.
December 24, 2015
Controlling a Target System
Abstract
For controlling a target system, such as a gas or wind turbine
or another technical system, a pool of control policies is used.
The pool of control policies, which includes a plurality of control
policies, and weights for weighting each control policy of the
plurality of control policies are received.
control policies is weighted by the weights to provide a weighted
aggregated control policy. The target system is controlled using
the weighted aggregated control policy, and performance data
relating to a performance of the controlled target system is
received. The weights are adjusted based on the received
performance data to improve the performance of the controlled
target system. The plurality of control policies is reweighted by
the adjusted weights to adjust the weighted aggregated control
policy.
Inventors: Bassily; Hany F. (Oviedo, FL); Otte; Clemens (München, DE); Düll; Siegmund (München, DE); Müller; Michael (München, DE); Udluft; Steffen (Eichenau, DE)
Applicants:
Bassily; Hany F. (Oviedo, FL, US)
Otte; Clemens (München, DE)
Düll; Siegmund (München, DE)
Müller; Michael (München, DE)
Udluft; Steffen (Eichenau, DE)
Family ID: 53274489
Appl. No.: 14/309641
Filed: June 19, 2014
Current U.S. Class: 700/48; 700/28; 700/47
Current CPC Class: G06N 5/04 (20130101); G05B 13/027 (20130101)
International Class: G05B 13/02 (20060101)
Claims
1. A method for controlling a target system by a processor based on
a pool of control policies, the method comprising: receiving the
pool of control policies, the pool of control policies comprising a
plurality of control policies; receiving weights for weighting each
control policy of the plurality of control policies; weighting the
plurality of control policies by the weights to provide a weighted
aggregated control policy; controlling the target system using the
weighted aggregated control policy; receiving performance data
relating to a performance of the controlled target system;
adjusting the weights by the processor based on the received
performance data to improve the performance of the controlled
target system; and reweighting the plurality of control policies by
the adjusted weights to adjust the weighted aggregated control
policy.
2. The method of claim 1, wherein adjusting the weights comprises
training a neural network run by the processor.
3. The method of claim 2, further comprising: receiving operational
data of at least one source system; and calculating the plurality
of control policies from different data sets of the operational
data.
4. The method of claim 3, wherein calculating the plurality of
control policies comprises training the neural network or a further
neural network.
5. The method of claim 3, wherein calculating the plurality of
control policies comprises using a reward function relating to a
performance of the at least one source system, and wherein adjusting
the weights comprises using the reward function for the adjusting
of the weights.
6. The method of claim 1, wherein the performance data comprises
state data relating to a current state of the target system, and
wherein the weighting of the plurality of control policies, the
reweighting of the plurality of control policies, or the weighting
of the plurality of control policies and the reweighting of the
plurality of control policies depends on the state data.
7. The method as claimed in claim 1, wherein receiving the
performance data comprises receiving the performance data from the
controlled target system, from a simulation model of the target
system, from a policy evaluation, or from any combination
thereof.
8. The method of claim 1, wherein controlling the target system
comprises determining an aggregated control action according to the
weighted aggregated control policy by weighted majority voting, by
forming a weighted mean, by forming a weighted median from action
proposals according to the plurality of control policies, or by any
combination thereof.
9. The method of claim 2, wherein the training of the neural
network is based on a reinforcement learning model.
10. The method of claim 2, wherein the neural network operates as a
recurrent neural network.
11. The method of claim 1, wherein the plurality of control
policies is selected from the pool of control policies in
dependence of a performance evaluation of control policies.
12. The method of claim 1, wherein control policies from the pool
of control policies are included into or excluded from the
plurality of control policies in dependence of the adjusted
weights.
13. The method of claim 1, wherein the controlling, the receiving
of the performance data, the adjusting, and the reweighting are run
in a closed learning loop with the target system.
14. A controller for controlling a target system based on a pool of
control policies, the controller being configured to: receive the
pool of control policies, the pool of control policies comprising a
plurality of control policies; receive weights for weighting each
control policy of the plurality of control policies; weight the
plurality of control policies by the weights to provide a weighted
aggregated control policy; control the target system using the
weighted aggregated control policy; receive performance data
relating to a performance of the controlled target system; adjust
the weights by the processor based on the received performance data
to improve the performance of the controlled target system; and
reweight the plurality of control policies by the adjusted weights
to adjust the weighted aggregated control policy.
15. In a non-transitory computer-readable storage medium that
stores instructions executable by one or more processors to control
a target system based on a pool of control policies, the
instructions comprising: receiving the pool of control policies,
the pool of control policies comprising a plurality of control
policies; receiving weights for weighting each control policy of
the plurality of control policies; weighting the plurality of
control policies by the weights to provide a weighted aggregated
control policy; controlling the target system using the weighted
aggregated control policy; receiving performance data relating to a
performance of the controlled target system; adjusting the weights
by the processor based on the received performance data to improve
the performance of the controlled target system; and reweighting
the plurality of control policies by the adjusted weights to adjust
the weighted aggregated control policy.
16. The non-transitory computer-readable storage medium of claim
15, wherein adjusting the weights comprises training a neural
network run by the processor.
17. The non-transitory computer-readable storage medium of claim
16, wherein the instructions further comprise: receiving
operational data of at least one source system; and calculating the
plurality of control policies from different data sets of the
operational data.
18. The non-transitory computer-readable storage medium of claim
17, wherein calculating the plurality of control policies comprises
training the neural network or a further neural network.
19. The non-transitory computer-readable storage medium of claim
17, wherein calculating the plurality of control policies comprises
using a reward function relating to a performance of the at least
one source system, and wherein adjusting the weights comprises using
the reward function for the adjusting of the weights.
20. The non-transitory computer-readable storage medium of claim
15, wherein the performance data comprises state data relating to a
current state of the target system, and wherein the weighting of
the plurality of control policies, the reweighting of the plurality
of control policies, or the weighting of the plurality of control
policies and the reweighting of the plurality of control policies
depends on the state data.
Description
BACKGROUND
[0001] The control of complex dynamical technical systems (e.g.,
gas turbines, wind turbines, or other plants) may be optimized by
data driven approaches. With that, various aspects of such
dynamical systems may be improved. For example, efficiency,
combustion dynamics, or emissions for gas turbines may be improved.
Additionally, life-time consumption, efficiency, or yaw for wind
turbines may be improved.
[0002] Modern data driven optimization utilizes machine learning
methods for improving control policies (e.g., control strategies)
of dynamical systems with regard to general or specific
optimization goals. Such machine learning methods may outperform
conventional control strategies. For example, if the controlled
system is changing, an adaptive control approach capable of
learning and adjusting a control strategy according to the new
situation and new properties of the dynamical system may be
advantageous over conventional non-learning control strategies.
[0003] However, in order to optimize complex dynamical systems
(e.g., gas turbines or other plants), a sufficient amount of
operational data is to be collected in order to find or learn a
good control strategy. Thus, in case of commissioning a new plant
or upgrading or modifying the plant, it may take some time to
collect sufficient operational data of the new or changed system
before a good control strategy is available. Reasons for such
changes may be wear, changed parts after a repair, or different
environmental conditions.
[0004] Known methods for machine learning include reinforcement
learning methods that focus on data efficient learning for a
specified dynamical system. However, even when using these methods,
it may take some time until a good data driven control strategy is
available after a change of the dynamical system. Until then, the
changed dynamical system operates outside a possibly optimized
envelope. If the change rate of the dynamical system is very high,
only sub-optimal results for a data driven optimization may be
achieved, since a sufficient amount of operational data may never be
available.
SUMMARY AND DESCRIPTION
[0005] The scope of the present invention is defined solely by the
appended claims and is not affected to any degree by the statements
within this summary.
[0006] The present embodiments may obviate one or more of the
drawbacks or limitations in the related art. For example, control
of a target system that allows a more rapid learning of a control
policy (e.g., for a changing target system) is provided.
[0007] Embodiments of a method, a controller, and a computer
program product for controlling a target system (e.g., a gas or
wind turbine or another technical system) by a processor are based
on a pool of control policies. The method, controller, or computer
program product (non-transitory computer readable storage medium
having instructions, which when executed by a processor, perform
actions) is configured to receive the pool of control policies,
which includes a plurality of control policies, and to receive
weights for weighting each of the plurality of control policies.
The plurality of control policies is weighted by the weights to
provide a weighted aggregated control policy. The target system is
controlled using the weighted aggregated control policy, and
performance data relating to a performance of the controlled target
system are received. The weights are adjusted by the processor
based on the received performance data to improve the performance
of the controlled target system. The plurality of control policies
is reweighted by the adjusted weights to adjust the weighted
aggregated control policy.
[0008] One or more of the present embodiments allow for an
effective learning of peculiarities of the target system by
adjusting the weights for the plurality of control policies. Such
weights comprise far fewer parameters than the pool of control
policies. Thus, adjusting the weights may require much less
computing effort and may converge much faster than a training of
the whole pool of control policies. A high level of optimization
may thus be reached in a shorter time. For example, a reaction time
to changes of the target system may be significantly reduced.
Aggregating a plurality of control policies reduces a risk of
accidentally choosing a poor policy, thus increasing the robustness
of the method.
[0009] According to an embodiment, the weights may be adjusted by
training a neural network run by the processor.
[0010] Using a neural network to adjust the weights allows for
efficient learning and flexible adaptation.
[0011] According to a further embodiment, the plurality of control
policies may be calculated from different data sets of operational
data of one or more source systems (e.g., by training a neural
network). The different data sets may relate to different source
systems, to different versions of one or more source systems, to
different policy models, to source systems in different climes, or
to one or more source systems under different conditions (e.g.,
before and after repair, maintenance, changed parts, etc.).
[0012] The one or more source systems may be chosen to be similar
to the target system, so that control policies optimized for the one or
more source systems are expected to perform well for the target
system. Therefore, the plurality of control policies based on one
or more similar source systems are a good starting point for
controlling the target system. Such a learning from similar
situations is often denoted as "transfer learning." Hence, much
less performance data relating to the target system are used in
order to obtain a good aggregated control policy for the target
system. Thus, effective aggregated control policies may be learned
in a short time even for target systems with scarce data.
[0013] The calculation of the plurality of control policies may use
a reward function relating to a performance of the source systems.
That reward function may also be used for adjusting the
weights.
[0014] The performance data may include state data relating to a
current state of the target system. The plurality of control
policies may be weighted and/or reweighted in dependence of the
state data. This allows for a more accurate and more effective
adjustment of the weights. For example, the weight of a control
policy may be increased if a state is recognized where the control
policy turned out to perform well, and vice versa.
[0015] Advantageously, the performance data may be received from
the controlled target system, from a simulation model of the target
system, and/or from a policy evaluation. Performance data from the
controlled target system allows monitoring the actual performance
of the target system and may improve the performance by learning a
particular response characteristic of the target system. A
simulation model of the target system also allows what-if queries
for the reward function. With a policy evaluation, a Q-function may
be set up, allowing an expectation value to be determined for the
reward function.
[0016] An aggregated control action for controlling the target
system may be determined according to the weighted aggregated
control policy by weighted majority voting, by forming a weighted
mean, and/or by forming a weighted median from action proposals
according to the plurality of control policies.
[0017] According to one embodiment, the training of the neural
network may be based on a reinforcement learning model, which
allows an efficient learning of control policies for dynamical
systems.
[0018] For example, the neural network may operate as a recurrent
neural network. This allows for maintaining an internal state
enabling an efficient detection of time dependent patterns when
controlling a dynamical system. Many Partially Observable Markov
Decision Processes may be handled like Markov Decision Processes by
a recurrent neural network.
[0019] The plurality of control policies may be selected from the
pool of control policies in dependence of a performance evaluation
of control policies. The selected control policies may establish an
ensemble of control policies. For example, only those control
policies may be selected from the pool of control policies that
perform well according to a predefined criterion.
[0020] Control policies from the pool of control policies may be
included into the plurality of control policies or excluded from
the plurality of control policies in dependence of the adjusted
weights. This allows improvement of the selection of control
policies contained in the plurality of control policies. So, for
example, control policies with very small weights may be removed
from the plurality of control policies in order to reduce a
computational effort.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] FIG. 1 illustrates an exemplary embodiment including a
target system and a plurality of source systems together with
controllers generating a pool of control policies; and
[0022] FIG. 2 illustrates the target system together with a
controller in greater detail.
DETAILED DESCRIPTION
[0023] FIG. 1 illustrates an exemplary embodiment including a
target system TS and a plurality of source systems S1, . . . , SN.
The target system TS and the plurality of source systems S1, . . .
, SN may be gas or wind turbines or other dynamical systems
including simulation tools for simulating a dynamical system. In
one embodiment, the source systems S1, . . . , SN are chosen to be
similar to the target system TS.
[0024] The source systems S1, . . . , SN may also include the
target system TS at a different time (e.g., before maintenance of
the target system TS or before exchange of a system component,
etc.). Vice versa, the target system TS may be one of the source
systems S1, . . . , SN at a later time.
[0025] Each of the source systems S1, . . . , SN is controlled by a
reinforcement learning controller RLC1, . . . , or RLCN,
respectively. The reinforcement learning controllers RLC1, . . . ,
or RLCN are driven by control policies P1, . . . , or PN,
respectively. The reinforcement learning controllers RLC1, . . . ,
RLCN may each include a recurrent neural network (not shown) for
learning (e.g., optimizing the control policies P1, . . . , PN).
Source system specific operational data OD1, . . . , ODN of the
source systems S1, . . . , SN are collected and stored in databases
DB1, . . . , DBN. The operational data OD1, . . . , ODN are
processed according to the control policies P1, . . . , PN, and the
control policies P1, . . . , PN are refined by reinforcement
learning by the reinforcement learning controllers RLC1, . . . ,
RLCN. The control output of the control policies P1, . . . , PN is
fed back into the respective source system S1, . . . , or SN via a
control loop CL, resulting in a closed learning loop for the
respective control policy P1, . . . , or PN in the respective
reinforcement learning controller RLC1, . . . , or RLCN. The
control policies P1, . . . , PN are fed into a reinforcement
learning policy generator PGEN that generates a pool P of control
policies including the control policies P1, . . . , PN.
[0026] The target system TS is controlled by a reinforcement
learning controller RLC including a recurrent neural network RNN
and an aggregated control policy ACP. The reinforcement learning
controller RLC receives the control policies P1, . . . , PN from
the reinforcement learning policy generator PGEN and generates the
aggregated control policy ACP from the control policies P1, . . . ,
PN.
[0027] The reinforcement learning controller RLC receives
performance data PD relating to a current performance of the target
system TS (e.g., a current power output, a current efficiency,
etc.) from the target system TS. The performance data PD includes
state data SD relating to a current state of the target system TS
(e.g., temperature, rotation speed, etc.). The performance data PD
is input to the recurrent neural network RNN for training of the
recurrent neural network RNN and input to the aggregated control
policy ACP for generating an aggregated control action for
controlling the target system TS via a control loop CL. This
results in a closed learning loop for the reinforcement learning
controller RLC.
[0028] The usage of pre-trained control policies P1, . . . , PN
from several similar source systems S1, . . . , SN gives a good
starting point for a neural model run by the reinforcement learning
controller RLC. With that, the amount of data and/or time required
for learning an efficient control policy for the target system TS
may be reduced considerably.
[0029] FIG. 2 illustrates one embodiment of the target system TS
together with the reinforcement learning controller RLC in greater
detail. The reinforcement learning controller RLC includes a
processor PROC and, as already mentioned above, the recurrent
neural network RNN and the aggregated control policy ACP. The
recurrent neural network RNN implements a reinforcement learning
model.
[0030] The performance data PD(SD) including the state data SD
stemming from the target system TS is input to the recurrent neural
network RNN and to the aggregated control policy ACP. The control
policies P1, . . . , PN are input to the reinforcement learning
controller RLC. The control policies P1, . . . , PN may include the
whole pool P or a selection of control policies from the pool
P.
[0031] The recurrent neural network RNN is adapted to train a
weighting policy WP including weights W1, . . . , WN for weighting
each of the control policies P1, . . . , PN. The weights W1, . . .
, WN are initialized by initial weights IW1, . . . , IWN received
by the reinforcement learning controller RLC (e.g., from the
reinforcement learning policy generator PGEN or from a different
source).
[0032] The aggregated control policy ACP relies on an aggregation
function AF receiving the weights W1, . . . , WN from the recurrent
neural network RNN and on the control policies P1, . . . , PN. Each
of the control policies P1, . . . , PN or a pre-selected part of
the control policies P1, . . . , PN receives the performance data
PD(SD) with the state data SD and calculates from the performance
data PD(SD) and the state data SD a specific action proposal AP1, .
. . , or APN, respectively. The action proposals AP1, . . . , APN
are input to the aggregation function AF, which weights each of the
action proposals AP1, . . . , APN with a respective weight W1, . .
. , or WN to generate an aggregated control action AGGA. The action
proposals AP1, . . . , APN may be weighted (e.g., by majority
voting, by forming a weighted mean, and/or by forming a weighted
median from the control policies P1, . . . , PN). The target system
TS is controlled by the aggregated control action AGGA.
[0033] The performance data PD(SD) resulting from the control of
the target system TS by the aggregated control action AGGA are fed
back to the aggregated control policy ACP and to the recurrent
neural network RNN. From the fed back performance data PD(SD), new
specific action proposals AP1, . . . , APN are calculated by the
control policies P1, . . . , PN. The recurrent neural network RNN
uses a reward function (not shown) relating to a desired
performance of the target system TS for adjusting the weights W1, .
. . , WN in dependence of the performance data PD(SD) fed back from
the target system TS. The weights W1, . . . , WN are adjusted by
reinforcement learning with an optimization goal directed to an
improvement of the desired performance. With the adjusted weights
W1, . . . , WN, an update UPD of the aggregation function AF is
made. The updated aggregation function AF weights the new action
proposals AP1, . . . , APN (e.g., reweights the control policies
P1, . . . , PN) by the adjusted weights W1, . . . , WN in order to
generate a new aggregated control action AGGA for controlling the
target system TS. The above acts implement a closed learning loop
leading to a considerable improvement of the performance of the
target system TS.
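For illustration, the closed learning loop just described may be sketched in Python as follows. This is a simplified sketch, not the implementation of the embodiment: the three control policies and the function target_system_step are hypothetical placeholders, and a simple accept-if-improved perturbation of the weights stands in for the reinforcement learning performed by the recurrent neural network RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder ensemble of pre-trained control policies P1..PN: each maps
# a state vector to a continuous action proposal AP_i.
policies = [lambda s, g=gain: g * float(s.sum()) for gain in (0.5, 1.0, 1.5)]

def target_system_step(state, action):
    """Hypothetical target system TS: returns the next state and a reward
    summarizing the performance data PD(SD). A real plant, a simulation
    model, or a policy evaluation would be queried here instead."""
    next_state = 0.9 * state + 0.1 * action
    reward = -abs(action - 1.0) - 0.01 * float(np.abs(next_state).sum())
    return next_state, reward

weights = np.ones(len(policies)) / len(policies)  # initial weights IW1..IWN
best_weights, best_reward = weights.copy(), -np.inf
state = np.array([0.2, -0.1])

for step in range(200):
    # Action proposals AP1..APN computed from the current performance data.
    proposals = np.array([p(state) for p in policies])
    # Aggregated control action AGGA, here a weighted mean of the proposals.
    agg_action = float(weights @ proposals)
    # Control the target system and receive new performance data.
    state, reward = target_system_step(state, agg_action)
    # Adjust the weights: keep a perturbation if it improved the reward
    # (a crude stand-in for training the weighting policy WP by RL).
    if reward > best_reward:
        best_reward, best_weights = reward, weights.copy()
    weights = np.clip(best_weights + 0.05 * rng.normal(size=len(policies)), 0.0, None)
    weights /= weights.sum()  # reweight the control policies P1..PN

print("adjusted weights W1..WN:", np.round(best_weights, 3))
```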
[0034] A more detailed description of the embodiment is given
below.
[0035] Each control policy P1, . . . , PN is initially calculated
by the reinforcement learning controllers RLC1, . . . , RLCN based
on a set of operational data OD1, . . . , or ODN, respectively. The
set of operational data for a specific control policy may be
specified in multiple ways. Examples for such specific sets of
operational data may be operational data of a single system (e.g.,
a single plant), operational data of multiple plants of a certain
version, operational data of plants before and/or after a repair,
or operational data of plants in a certain clime, in a certain
operational condition, and/or in a certain environmental
condition. Different control policies from P1, . . . , PN may
refer to different policy models trained on a same set of
operational data.
[0036] When applying any of such control policies specific to a
certain source system to a target system, the target system may not
perform optimally, since none of the data sets was representative
of the target system. Therefore, a number of control policies may
be selected from the pool P to form an ensemble of control policies
P1, . . . , PN. Each control policy P1, . . . , PN provides a
separate action proposal AP1, . . . , or APN from the performance
data PD(SD). The action proposals AP1, . . . , APN are aggregated
to calculate the aggregated control action AGGA of the aggregated
control policy ACP. In case of discrete action proposals AP1, . . .
, APN, the aggregation may be performed using majority voting. If
the action proposals AP1, . . . , APN are continuous, a mean or
median value of the action proposals AP1, . . . , APN may be used
for the aggregation.
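The aggregation itself may be illustrated with a short sketch. The function names and the example proposals below are illustrative, not taken from the application; they merely show weighted majority voting for discrete action proposals and a weighted mean or weighted median for continuous ones.

```python
import numpy as np

def aggregate_discrete(proposals, weights):
    """Weighted majority voting: the discrete action whose proposing
    policies carry the largest total weight is selected."""
    totals = {}
    for action, w in zip(proposals, weights):
        totals[action] = totals.get(action, 0.0) + w
    return max(totals, key=totals.get)

def aggregate_continuous(proposals, weights, mode="mean"):
    """Weighted mean or weighted median of continuous action proposals."""
    proposals = np.asarray(proposals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    if mode == "mean":
        return float(weights @ proposals)
    # Weighted median: smallest proposal at which the cumulative weight of
    # the ascending-sorted proposals reaches one half.
    order = np.argsort(proposals)
    cumulative = np.cumsum(weights[order])
    return float(proposals[order][np.searchsorted(cumulative, 0.5)])

# Example: three policies propose actions, weighted 0.5 / 0.3 / 0.2.
print(aggregate_discrete(["open", "close", "open"], [0.5, 0.3, 0.2]))       # open
print(aggregate_continuous([10.0, 14.0, 30.0], [0.5, 0.3, 0.2]))            # 15.2
print(aggregate_continuous([10.0, 14.0, 30.0], [0.5, 0.3, 0.2], "median"))  # 10.0
```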
[0037] The reweighting of the control policies P1, . . . , PN by
the adjusted weights W1, . . . , WN allows for a rapid adjustment
of the aggregated control policy ACP, for example, if the target
system TS changes. The reweighting depends on the recent
performance data PD(SD) generated while interacting with the target
system TS. Since the weighting policy WP has fewer free parameters
(e.g., the weights W1, . . . , WN) than a control policy usually
has, less data is needed to adjust to a new situation or to a
modified system. The weights W1, . . . , WN may be adjusted using
the current performance data PD(SD) of the target system and/or
using a model of the target system (e.g., implemented by an
additional recurrent neural network) and/or using a policy
evaluation.
[0038] According to a simple implementation, each control policy
P1, . . . , PN may be globally weighted (e.g., over a complete
state space of the target system TS). A weight of zero may indicate
that a particular control policy is not part of the ensemble of
policies.
[0039] Additionally or alternatively, the weighting by the
aggregation function AF may depend on the system state (e.g., on
the state data SD of the target system TS). This may be used to
favor good control policies with high weights within one region of
the state space of the target system TS. Within other regions of
the state space, those control policies may not be used at all.
[0040] P_i, i=1, . . . , N may denote a control policy from the
set of stored control policies P1, . . . , PN, and s may be a
vector denoting a current state of the target system TS. A weight
function f(P_i, s) may assign a weight W_i (of the set W1, . . .
, WN) to the respective control policy P_i dependent on the
current state denoted by s (e.g., W_i = f(P_i, s)). A possible
approach may be to calculate the weights W_i based on distances
(e.g., according to a pre-defined metric of the state space)
between the current state s and states stored together with P_i
in a training set including states where P_i performed well.
Uncertainty estimates (e.g., provided by a probabilistic policy)
may also be included in the weight calculation.
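A minimal sketch of this distance-based, state-dependent weighting is given below. It assumes that each policy P_i is stored together with a small set of states in which it performed well; the Gaussian kernel and its length scale are illustrative choices that the application does not prescribe.

```python
import numpy as np

def state_dependent_weights(current_state, reference_states, length_scale=1.0):
    """Assign a weight W_i = f(P_i, s) to each policy P_i from the distance
    between the current state s and the states stored together with P_i.
    Closer reference states yield larger weights; a Gaussian kernel turns
    the smallest distance into a weight."""
    s = np.asarray(current_state, dtype=float)
    weights = []
    for states_i in reference_states:            # one array of states per policy
        dists = np.linalg.norm(np.asarray(states_i, dtype=float) - s, axis=1)
        weights.append(np.exp(-(dists.min() / length_scale) ** 2))
    weights = np.array(weights)
    return weights / weights.sum()               # normalized weights W1..WN

# Example with two policies and a two-dimensional state space.
refs = [np.array([[0.0, 0.0], [0.1, 0.2]]),      # states where P1 performed well
        np.array([[1.0, 1.0]])]                  # states where P2 performed well
print(state_dependent_weights([0.05, 0.1], refs))  # P1 receives most of the weight
```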
[0041] In one embodiment, the global and/or state dependent
weighting is optimized using reinforcement learning. The action
space of such a reinforcement learning problem is the space of the
weights W1, . . . , WN, while the state space is that of the
target system TS. For a pool of, for example,
ten control policies, the action space is only ten dimensional and,
therefore, allows a rapid optimization with comparably little input
data and little computational effort. Meta actions may be used to
reduce the dimensionality of the action space even further. Delayed
effects are mitigated by using the reinforcement learning
approach.
[0042] The adjustment of the weights W1, . . . , WN may be carried
out by applying a measured performance of the ensemble of control
policies P1, . . . , PN to a reward function. The reward function
may be chosen according to the goal of maximizing efficiency,
maximizing output, minimizing emissions, and/or minimizing wear of
the target system TS. For example, a reward function used to train
the control policies P1, . . . , PN may be used for training and/or
initializing the weighting policy WP.
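As a purely hypothetical illustration of such a reward function, the following sketch combines output, efficiency, emissions, and wear terms; the quantities and coefficients are not taken from the application and would in practice come from the plant's instrumentation and the operator's optimization goals.

```python
def reward(power_output, efficiency, emissions, wear,
           w_power=1.0, w_eff=10.0, w_emis=5.0, w_wear=2.0):
    """Hypothetical reward: higher output and efficiency increase the
    reward, while emissions and wear decrease it. The same function could
    be used both for training the pooled control policies P1..PN and for
    training or initializing the weighting policy WP."""
    return (w_power * power_output + w_eff * efficiency
            - w_emis * emissions - w_wear * wear)

# Example: two candidate operating points of the target system TS.
print(reward(power_output=150.0, efficiency=0.38, emissions=0.9, wear=0.2))
print(reward(power_output=148.0, efficiency=0.40, emissions=0.6, wear=0.1))
```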
[0043] With the trained weights W1, . . . , WN, the aggregated
control action AGGA may be computed according to AGGA = AF(s, AP1,
. . . , APN, W1, . . . , WN), with AP_i = P_i(s), i=1, . . . , N.
[0044] The elements and features recited in the appended claims may
be combined in different ways to produce new claims that likewise
fall within the scope of the present invention. Thus, whereas the
dependent claims appended below depend from only a single
independent or dependent claim, it is to be understood that these
dependent claims can, alternatively, be made to depend in the
alternative from any preceding or following claim, whether
independent or dependent, and that such new combinations are to be
understood as forming a part of the present specification.
[0045] While the present invention has been described above by
reference to various embodiments, it should be understood that many
changes and modifications can be made to the described embodiments.
It is therefore intended that the foregoing description be regarded
as illustrative rather than limiting, and that it be understood
that all equivalents and/or combinations of embodiments are
intended to be included in this description.
* * * * *