U.S. patent number 8,447,706 [Application Number 12/675,555] was granted by the patent office on 2013-05-21 for "Method for computer-aided control and/or regulation using two neural networks wherein the second neural network models a quality function and can be used to control a gas turbine". This patent grant is currently assigned to Siemens Aktiengesellschaft. The grantees listed for this patent are Daniel Schneegaß and Steffen Udluft. Invention is credited to Daniel Schneegaß and Steffen Udluft.
United States Patent 8,447,706
Schneegaß, et al.
May 21, 2013

Method for computer-aided control and/or regulation using two neural networks wherein the second neural network models a quality function and can be used to control a gas turbine
Abstract
A method for a computer-aided control of a technical system is provided. The method involves use of a cooperative learning method and artificial neural networks. In this context, feed-forward networks are linked to one another such that the architecture as a whole meets an optimality criterion. As an estimator, the network fits the expected rewards to the rewards actually observed, so that exclusively observations which have actually been made are used in optimum fashion to determine a quality function. The optimum action in respect of the quality function is modeled by a further neural network, which supplies the optimum action selection rule for the given control problem. The method is specifically used to control a gas turbine.
Inventors: Schneegaß, Daniel (Amsterdam, NL); Udluft, Steffen (Eichenau, DE)
Applicant: Schneegaß, Daniel (Amsterdam, NL); Udluft, Steffen (Eichenau, DE)
Assignee: Siemens Aktiengesellschaft (München, DE)
Family ID: 39986234
Appl. No.: 12/675,555
Filed: August 26, 2008
PCT Filed: August 26, 2008
PCT No.: PCT/EP2008/061115
371(c)(1),(2),(4) Date: February 26, 2010
PCT Pub. No.: WO2009/033944
PCT Pub. Date: March 19, 2009
Prior Publication Data
US 20100205974 A1, Aug 19, 2010
Foreign Application Priority Data
Sep 6, 2007 [DE] 10 2007 042 440
Current U.S. Class: 706/12
Current CPC Class: G05B 13/027 (20130101); G06N 3/08 (20130101); F02C 9/00 (20130101); G06N 3/0454 (20130101); F05D 2270/709 (20130101); F05D 2270/707 (20130101)
Current International Class: G06F 11/00 (20060101)
Field of Search: 706/12,45,62
References Cited

U.S. Patent Documents

Foreign Patent Documents
DE 69917711, Jun 2005
DE 102007017259, Oct 2008
EP 0936351, Aug 1999
JP 2007065929, Mar 2007
WO 2005081076, Sep 2005
Other References
Antos, Andras et al.; "Learning near-optimal policies with
bellman-residual minimization based fitted policy iteration and a
single sample path" In: Proc. of the Conference on Learning Theory,
2006; pp. 574-588. cited by applicant .
Schneegass, D. et al., "Neural Rewards Regression for Near-optimal
Policy Identification in Markovian and Partial Observable
Environments", in: Verleysen, M. (ed.), Proc. of the ESANN, 2007; pp.
301-306. cited by applicant .
Bram Bakker; "Reinforcement Learning by Backpropagation through an
LSTM Model/Critic"; IEEE International Symposium on Approximate
Dynamic Programming and Reinforcement Learning, Apr. 2007, pp.
127-134, XP031095237. cited by applicant .
Venayagamoorthy G K et al: "Adaptive Critic Designs for Optimal
Control of Power Systems" Intelligent Systems Application to Power
Systems, International Conference on, Proceedings of the 13th Nov.
6-10, 2005 Piscataway, NJ, USA, IEEE, Nov. 6, 2005, pp. 136-148,
XP010897199. cited by applicant .
Mohagheghi, Salman et al: "Making the Power Grid more intelligent",
Bulk Power System Dynamics and Control--VII. Revitalizing
Operational Reliability, 2007 IREP Symposium, IEEE, PI, Aug. 1,
2007, pp. 1-10, XP031195591. cited by applicant .
Schneegass, Daniel et al: "Improving Optimality of Neural Rewards
Regression for Data-Efficient Batch Near-Optimal Policy
Identification", Artificial Neural Networks - ICANN 2007 (Lecture
Notes in Computer Science), Springer Berlin Heidelberg, vol. 4668,
Sep. 9, 2007, pp. 109-118, XP019069348. cited by applicant .
Riedmiller, Martin: "Neural Fitted Q Iteration--First Experiences
with a Data Efficient Neural Reinforcement Learning Method". ECML
2005: pp. 317-328. cited by applicant .
Eiji Uchibe et al.: "Reinforcement learning under constraints
generated by multiple reward functions", IEICE Technical Report,
the Institute of Electronics, Information and Communication
Engineers, Jun. 9, 2006, vol. 106, No. 102, pp. 1-6; Magazine.
cited by applicant .
Sachiyo Arai, "Multiagent Reinforcement Learning Frameworks: Steps
toward Practical Use", Journal of the Japanese Society for Artificial
Intelligence, the Japanese Society for Artificial Intelligence,
Jul. 1, 2001, vol. 16, No. 4, pp. 476-481; Magazine. cited by
applicant.
Primary Examiner: Vincent; David
Claims
The invention claimed is:
1. A method for computer-aided control of a technical system,
comprising: characterizing a dynamic behavior of the technical
system for a plurality of time points by a state of the technical
system and an action carried out on the technical system for each
time point, wherein an action at a time point results in a
sequential state of the technical system at a next time point;
learning an action selection rule with a plurality of data records,
wherein each data record comprises the state of the technical
system at the time point, the action carried out in the state and
the sequential state, and wherein an evaluation is assigned to each
data record, the learning of the action selection rule comprising:
modeling of a quality function by a first neural network comprising
the states and actions of the technical system as parameters;
learning the first neural network based on an optimality criterion,
which is a function of the evaluations of the data records and the
quality function, an optimum action in respect of the quality
function being modeled by a second neural network, which is learned
based on the quality function; and controlling the technical system
such that the actions to be carried out on the technical system are
selected using the action selection rule based upon the second
neural network.
2. The method as claimed in claim 1, wherein the quality function
is modeled by the first neural network such that an evaluation
function is tailored to the evaluations of the data records.
3. The method as claimed in claim 1, wherein the optimum action in
respect of the quality function is the action which maximizes the
quality function.
4. The method as claimed in claim 1, wherein the first neural
network forms a feed-forward network with an input layer comprising
a respective state of the technical system and the action to be
carried out in the respective state, one or more hidden layers and
an output layer comprising the quality function.
5. The method as claimed in claim 1, wherein the second neural
network forms a feed-forward network with an input layer comprising
a respective sequential state of the technical system, one or more
hidden layers and an output layer comprising the optimum action in
the sequential state in respect of the quality function.
6. The method as claimed in claim 1, wherein a backpropagation
method is used to learn the first neural network and the second
neural network.
7. The method as claimed in claim 1, wherein the optimality
criterion is selected such that an optimum dynamic behavior of the
technical system is parameterized.
8. The method as claimed in claim 1, wherein the optimality
criterion is the minimization of the Bellman residual.
9. The method as claimed in claim 1, wherein the optimality
criterion is the reaching of the fixed point of the Bellman
iteration.
10. The method as claimed in claim 1, wherein the optimality
criterion is the minimization of a modified Bellman residual, the
modified Bellman residual comprising an auxiliary function, which
is a function of the state of the technical system and the action
to be carried out in the respective state.
11. The method as claimed in claim 10, wherein the auxiliary
function is modeled by a third neural network, which is learned
based upon the optimality criterion, the third neural network
forming a feed-forward network with an input layer comprising a
respective state of the technical system and the action to be
carried out in the respective state, one or more hidden layers and
an output layer comprising the auxiliary function.
12. The method as claimed in claim 1, wherein the optimality
criterion comprises an adjustable parameter, and wherein the
optimality criterion is adapted based upon a change of the
adjustable parameter.
13. The method as claimed in claim 1, wherein a state of the
technical system comprises one or more variables, in particular
observed state variables of the technical system.
14. The method as claimed in claim 1, wherein an action to be
carried out on the technical system comprises one or more action
variables.
15. The method as claimed in claim 1, wherein the states are states
of the technical system hidden in the data records, which are
generated by a recurrent neural network with the aid of source data
records, the source data records respectively comprising an
observed state of the technical system, an action carried out in
the observed state and the resulting sequential state.
16. The method as claimed in claim 15, wherein the dynamic behavior
of the technical system is modeled by the recurrent neural network,
the recurrent neural network being formed by at least one input
layer comprising the observed states of the technical system and
the actions carried out on the technical system, at least one
hidden recurrent layer comprising the hidden states and at least
one output layer comprising the observed states of the technical
system.
17. The method as claimed in claim 16, wherein the recurrent neural
network is learned using a backpropagation method.
18. The method as claimed in claim 1, wherein the technical system
is a gas turbine.
19. The method as claimed in claim 18, wherein the method is used
to control a gas turbine, the states of the technical system and/or
the actions to be carried out in the respective states comprising
one or more of the following variables: gross output of the gas
turbine; one or more pressures and/or temperatures in the gas
turbine or in the area around the gas turbine; combustion chamber
accelerations in the gas turbine; and one or more adjustment
parameters at the gas turbine, in particular valve settings and/or
fuel ratios and/or preliminary vane positions.
20. A non-transitory computer readable medium storing a program
code for implementing a method for computer-aided control of a
technical system when the program is running on a computer, the
method comprising: characterizing a dynamic behavior of the
technical system for a plurality of time points by a state of the
technical system and an action carried out on the technical system
for each time point, wherein an action at a time point results in a
sequential state of the technical system at a next time point;
learning an action selection rule with a plurality of data records,
wherein each data record comprises the state of the technical
system at the time point, the action carried out in the state and
the sequential state, and wherein an evaluation is assigned to each
data record, the learning of the action selection rule comprising:
modeling of a quality function by a first neural network comprising
the states and actions of the technical system as parameters;
learning the first neural network based on an optimality criterion,
which is a function of the evaluations of the data records and the
quality function, an optimum action in respect of the quality
function being modeled by a second neural network, which is learned
based on the quality function; and controlling the technical system
such that the actions to be carried out on the technical system are
selected using the action selection rule based upon the second
neural network.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application is the US National Stage of International
Application No. PCT/EP2008/061115 filed Aug. 26, 2008, and claims
the benefit thereof. The International Application claims the
benefits of German Application No. 10 2007 042 440.1 DE filed Sep.
6, 2007. All of the applications are incorporated by reference
herein in their entirety.
FIELD OF INVENTION
The invention relates to a method for computer-aided control and/or
regulation of a technical system and a corresponding computer
program product.
SUMMARY OF INVENTION
When controlling complex technical systems it is often desirable to
select the actions to be carried out on the technical system in
such a manner that an advantageous desired dynamic behavior of the
technical system is achieved. The dynamic behavior is however often
not simple to predict in the case of complex technical systems, so
corresponding computer-aided prediction methods are required, to
estimate the future behavior of the technical system and to select
appropriate actions for regulating or controlling the technical
system correspondingly.
The control of technical systems today is frequently based on
expert knowledge, in other words automatic regulation of the system
is established on the basis of such expert knowledge. However
approaches are also known, with which technical systems are
controlled with the aid of known methods for what is known as
reinforcement learning. The known methods cannot however be applied
generally to any technical systems and often do not furnish
sufficiently good results.
An object of the invention is to create a method for computer-aided
control and/or regulation of a technical system, which can be
applied generally to any technical systems and furnishes good
results.
In the inventive method the dynamic behavior of a technical system
is observed for a number of time points, with the dynamic behavior
for each time point being characterized by a state of the technical
system and an action carried out on the technical system, a
respective action at a respective time point resulting in a
sequential state of the technical system at the next time
point.
In order to achieve optimum control or regulation of the technical
system, an action selection rule is learned based on data records,
each data record comprising the state of the technical system at a
respective time point, the action carried out in the time point and
the sequential state and an evaluation being assigned to each data
record.
A state of the technical system here is in particular a state
vector with one or more variables, the variables being for example
observed state variables of the technical system. Similarly an
action to be carried out on the technical system can also consist
of a corresponding vector with a number of action variables, the
action variables in particular representing adjustable parameters
on the technical system.
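Purely as an illustrative sketch, not part of the claimed method, such a data record could be represented in code as follows; the field names and values are hypothetical:

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class Transition:
        """One data record: state, action carried out, sequential state, evaluation."""
        state: np.ndarray       # state vector s_t of the technical system
        action: np.ndarray      # action vector a_t (adjustable parameters)
        next_state: np.ndarray  # sequential state s_{t+1}
        reward: float           # evaluation assigned to the data record

    # A data set is simply a list of such observed transitions.
    dataset: List[Transition] = [
        Transition(state=np.array([0.70, 0.20]),
                   action=np.array([0.05]),
                   next_state=np.array([0.72, 0.19]),
                   reward=0.83),
    ]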
The inventive method is characterized by a specific variant of the
learning of the action selection rule, comprising the following
steps: i) modeling of a quality function by a first neural network
comprising the states and actions of the technical system as
parameters; ii) learning the first neural network based on an
optimality criterion, which is a function of the evaluations of the
data records and the quality function, an optimum action in respect
of the quality function being modeled by a second neural network,
which is learned based on the quality function.
With such a method an optimum action selection rule is determined
in a simple and effective manner by appropriate learning of the
first and second neural networks, said action selection rule being
a function of evaluations of the data records, the action selection
rule being embodied such that the action with the best evaluation
is selected where possible in a state. The actual regulation or
control of the technical system then takes place with the learned
action selection rule in that actions to be carried out on the
technical system are selected with the learned action selection
rule based on the learned second neural network. The inventive
method has been verified based on test data records and it has
proved that very good results are achieved with the method.
The inventive method represents an extension of the method
described in document [1], the document [1] being a German patent
application submitted by the same applicant as the present
application. The content of that document is hereby incorporated by
reference into the present application. Compared with
the method in document [1], the method according to the present
invention has the advantage that a second neural network is used,
which learns the optimum action based on the quality function, so
that the action selection rule learned with the method is defined
in a simple manner by a learned second neural network, with which,
based on a state of the technical system, the optimum action in
said state can be calculated. The method is thus not restricted to
discrete actions but the second neural network can in particular
also model continuous actions. The inventive method also allows
data efficiency to be increased, in other words good results for
the appropriate control or regulation of the technical system can
be achieved based on an optimality criterion even with a smaller
number of data records.
In one preferred embodiment of the inventive method the quality
function is modeled by the first neural network such that an
evaluation function is tailored to the evaluations of the data
records.
In a further embodiment of the inventive method the optimum action
in respect of the quality function, which is modeled by the second
neural network, is determined such that the optimum action
maximizes the quality function.
In a particularly preferred embodiment of the inventive method the
first neural network forms a feed-forward network with an input
layer comprising a respective state of the technical system and the
action that can be carried out in the respective state, one or more
hidden layers and an output layer comprising the quality function.
Similarly the second neural network is preferably also embodied as
a feed-forward network, this feed-forward network comprising the
following layers: an input layer comprising a respective sequential
state of the technical system; one or more hidden layers with
hidden variables; an output layer comprising the optimum action in
the sequential state in respect of the quality function.
The abovementioned feed-forward networks are also referred to as
multilayer perceptrons and are sufficiently known structures of
artificial neural networks from the prior art.
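As a non-authoritative sketch of such a pair of multilayer perceptrons (PyTorch is used here purely for illustration; the layer sizes are assumptions and are not taken from the patent):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """First network: input layer (state, action), hidden layer, output Q(s, a)."""
        def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 20):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

    class PolicyNetwork(nn.Module):
        """Second network: input layer (sequential state), hidden layer, output action."""
        def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 20):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)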
The backpropagation method known sufficiently from the prior art is
preferably used in the inventive method to learn the first and/or
second neural network.
The optimality criterion can be selected differently in the
inventive method, with the optimality criterion that parameterizes
an optimum dynamic behavior of the technical system preferably
being used. Possible optimality criteria are for example the
minimization of the Bellman residual or the reaching of the fixed
point of the Bellman iteration. The Bellman residual and the
Bellman iteration are known to the person skilled in the art in the
field of reinforcement learning and are therefore not explained
further here.
Instead of or in addition to the Bellman residual and the reaching
of the fixed point of the Bellman equation, the minimization of a
modified Bellman residual can also be used as an optimality
criterion, the modified Bellman residual comprising an auxiliary
function, which is a function of the respective state of the
technical system and the actions that can be carried out in the
respective state. One possible embodiment of this Bellman residual
is described in the detailed description of the application. The
modified Bellman residual is referred to there as L_aux. In
order to use this modified Bellman residual in the inventive
method, the auxiliary function is preferably modeled by a third
neural network, which is learned on the basis of the optimality
criterion, the third neural network forming a feed-forward network
with an input layer comprising a respective state of the technical
system and the action that can be carried out in the respective
state, one or more hidden layers and an output layer comprising the
auxiliary function. In the inventive method the learning of this
third neural network takes place in parallel with the learning of
the first and second neural networks.
In a particularly preferred embodiment of the inventive method the
optimality criterion comprises an adjustable parameter, the change
in which causes the optimality criterion to be adapted. This
provides a flexible means of tailoring the inventive method to the
most appropriate optimality criterion for the predetermined data
record.
In a further embodiment of the inventive method the history of past
states and actions of the technical system can be taken into
account appropriately. This is achieved in that the states in the
data records are hidden states of the technical system, which are
generated by a recurrent neural network with the aid of source data
records, the source data records respectively comprising an
observed state of the technical system, an action carried out in
the observed state and the resulting sequential state. The dynamic
behavior of the technical system in particular is modeled with the
recurrent neural network, the recurrent neural network being formed
by at least one input layer comprising the observed states of the
technical system and the actions carried out on the technical
system, at least one hidden recurrent layer comprising the hidden
states of the technical system and at least one output layer
comprising the observed states of the technical system. The
recurrent neural network is in turn learned using an appropriate
learning method, in particular also using the known backpropagation
method.
Any technical systems can be controlled and regulated using the
inventive method but a preferred area of application is turbines,
in particular gas turbines. When controlling or regulating a gas
turbine the states of the technical system and/or the actions that
can be carried out in the respective states are one or more of the
following variables: gross output of the gas turbine; one or more
pressures and/or temperatures in the gas turbine or in the area
around the gas turbine; combustion chamber accelerations in the gas
turbine; one or more adjustment parameters in the gas turbine, in
particular valve settings and/or fuel ratios and/or preliminary
vane positions.
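For illustration only, a hypothetical state and action vector for a gas turbine might look as follows; the variable names and numbers are invented for this sketch and are not plant data:

    # Hypothetical turbine state: values are illustrative only.
    turbine_state = {
        "gross_output_mw": 155.0,            # gross output of the gas turbine
        "compressor_pressure_bar": 16.2,     # a pressure in the gas turbine
        "exhaust_temperature_c": 545.0,      # a temperature in the gas turbine
        "chamber_acceleration_rms": 0.012,   # combustion chamber acceleration
    }

    # Hypothetical action: adjustments of the named parameters.
    turbine_action = {
        "fuel_ratio_delta": -0.01,           # change of a fuel ratio
        "pilot_valve_delta": 0.5,            # change of a valve setting
    }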
As well as the method described above the invention also relates to
a computer program product with a program code stored on a
machine-readable medium for implementing the inventive method when
the program is running on a computer.
BRIEF DESCRIPTION OF THE DRAWINGS
Exemplary embodiments of the invention are described in detail
below with reference to the accompanying Figures, in which:
FIG. 1 shows a schematic diagram of a first embodiment of the
inventive method; and
FIG. 2 shows a schematic diagram of a second embodiment of the
inventive method.
DETAILED DESCRIPTION OF INVENTION
The embodiments of the inventive method described below are based
on a set of data records, which were observed, i.e. measured or
determined by experiment, for any technical system. One
particularly preferred application of a technical system here is
the control of a gas turbine, for which data in the form of state
variables of the turbine is present, for example the gross output
of the gas turbine, one or more pressures and/or temperatures in
the gas turbine, combustion chamber accelerations and the like.
Data records relating to a plurality of successive time points are
present here, each data record being characterized by a state,
which is generally a state vector with a number of state variables,
by an action, which represents the change in state variables or
other adjustable parameters of the technical system, and by a
sequential state, which shows the values of the state variables
after the action has been carried out. Also present for each data
record is an evaluation or reward, showing the quality of the
action at the respective time point for the control of the
technical system. The evaluation here is preferably embodied such
that the best or optimum control of the technical system is
achieved by actions with high evaluations or rewards at the various
time points during operation of the technical system.
In the embodiments of the inventive method described below an
action selection rule is learned based on the observed data records
of the technical system using a reinforcement learning method, it
being possible for the technical system then to be operated
appropriately with said action selection rule. The action selection
rule here indicates for a state of the technical system, which is
the best action to be carried out in this state. The technical
system here is considered as a stochastic dynamic system, the
reinforcement learning method for determining the action selection
rule being considered to be a regression task, in which a reward
function is tailored to the observed data records.
In the learning method described below a search is carried out for
the action selection rule which can be used optimally to control the
technical system. The states, actions and sequential states are
considered mathematically here as observations of what is known as a
Markov decision process. A Markov decision process is generally
defined by a state space S, a set of actions A, which can be
selected in the various states, and the dynamics, which are
considered as a transition probability distribution
P_T: S × A × S → [0,1] that is a function of the instantaneous state
s, the selected action a and the sequential state s'. The transition
from a state to the sequential state is characterized by what are
known as rewards R(s,a,s'), which are functions of the instantaneous
state, the action and the sequential state. The rewards are defined
by a reward probability distribution P_R with the expected value of
the reward

E_R(s,a,s') = ∫ r P_R(r | s,a,s') dr,  with s, s' ∈ S and a ∈ A.
According to the embodiment of the inventive method described below
a search is carried out for the maximum of a discounting Q function,
which corresponds to the quality function within the meaning of the
claims and is defined as follows by the Bellman equation, which is
known sufficiently from the prior art:

Q^π(s,a) = E_{s'}[ R(s,a,s') + γ Q^π(s', π(s')) ]   (1).

Maximization here takes place in the so-called rule space
Π = (S → A) across all possible states s and actions a, where
0 < γ < 1 is the discounting factor, s' the sequential state of s
and π ∈ Π the action selection rule used.
Maximization is carried out according to the invention described
here using a regression method based on neural networks; the
regression method uses a gradient with respect to the optimum action
selection rule (i.e. the selection rule that maximizes the Q
function) and is also referred to as Policy Gradient Neural Rewards
Regression. In contrast to the method according to document [1],
there is no search here specifically for discrete actions which
maximize the quality function. Instead the action already assumed to
be optimum beforehand is used as the input for the Q function, the
optimum action being calculated by a neural feed-forward network.
The architecture of the method used is shown in FIG. 1 and is
described in more detail below.
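A minimal sketch of this idea, assuming the QNetwork and PolicyNetwork classes sketched earlier: the action proposed by the policy network for the sequential state is fed into the Q network, so no explicit search over discrete actions is required.

    import torch

    q_net = QNetwork(state_dim=3, action_dim=1)        # models Q(s, a)
    pi_net = PolicyNetwork(state_dim=3, action_dim=1)  # models the optimum action

    # A small batch of transitions (random placeholders for illustration).
    s  = torch.randn(32, 3)   # states s_i
    a  = torch.randn(32, 1)   # actions a_i actually carried out
    s1 = torch.randn(32, 3)   # sequential states s_{i+1}
    r  = torch.randn(32)      # observed rewards r_i
    gamma = 0.9               # discounting factor

    q_sa = q_net(s, a)                   # Q(s_i, a_i)
    a1   = pi_net(s1)                    # action assumed to be optimum in s_{i+1}
    v_s1 = q_net(s1, a1)                 # Q(s_{i+1}, pi(s_{i+1})) ~ max_a' Q(s_{i+1}, a')
    residual = q_sa - gamma * v_s1 - r   # term entering the (modified) Bellman residual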
In the embodiments of the inventive method described below a
technical system is considered in which both the states of the
system and the actions that can be carried out in a respective state
are continuous. The dynamics of the system are probabilistic here.
In the embodiments in FIG. 1 and FIG. 2 the Q function described
above is described by an individual feed-forward network or
multilayer perceptron N(s,a) = Q(s,a), comprising an input layer I
with the state s and the action a, a hidden layer H and an output
layer O with the quality function Q. The backpropagation algorithm
known sufficiently from the prior art is used to model the reward
function according to equation (1), the optimization problem being
resolved by minimizing a modified (quadratic) Bellman residual
across all l observed transitions, where l is the number of data
records in the observed data of the technical system. The modified
Bellman residual used in the embodiment described here is based on
the publication [2] and is a better approximation of the true
Bellman residual. It is also referred to below as the auxiliary
residual L_aux and is as follows:
L_aux = Σ_{i=1}^{l} ( Q(s_i,a_i) - γ Q(s_{i+1}, π(s_{i+1})) - r_i )² - β Σ_{i=1}^{l} ( h(s_i,a_i) - γ Q(s_{i+1}, π(s_{i+1})) - r_i )² + Ω(θ)   (2)
Here θ represents the parameters of the artificial neural
feed-forward network N(s,a) and in particular comprises the
weighting matrices between the individual neuron layers in the
feed-forward network. Ω is an appropriate regularization term.
r_i represents the observed reward or evaluation in a state s_i from
the data records, and the s_{i+1} are unbiased estimators of the
state variables of the sequential state.
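As a hedged illustration of equation (2), the auxiliary residual could be evaluated on a batch of transitions roughly as follows, building on the batch tensors of the earlier sketch; h_net is assumed to be a third feed-forward network with the same inputs as the Q network, and the values of beta and the quadratic stand-in for the regularization term Ω(θ) are assumptions:

    def auxiliary_residual(q_net, pi_net, h_net, s, a, s1, r,
                           gamma=0.9, beta=0.5, reg_weight=1e-4):
        """Sketch of the modified Bellman residual L_aux of equation (2)."""
        target = gamma * q_net(s1, pi_net(s1)) + r    # gamma*Q(s_{i+1}, pi(s_{i+1})) + r_i
        main_term = (q_net(s, a) - target) ** 2       # squared residual of Q
        aux_term = (h_net(s, a) - target) ** 2        # squared residual of the auxiliary function h
        omega = reg_weight * sum((p ** 2).sum() for p in q_net.parameters())  # stand-in for Omega(theta)
        return (main_term - beta * aux_term).sum() + omega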
It is known that minimization of the Bellman residual on the one
hand has the advantage that it represents a readily controllable
learning problem, as it is related to supervised learning. On the
other hand minimization of the Bellman residual tends, in the
stochastic case, to also minimize higher-order terms of the
discounted sum of future rewards if no further uncorrelated data
records can be provided for each transition. In general, solutions
are biased toward Q functions that are smoother over the sequential
states of the stochastic transitions. If s_{i+1} and r_i are
unbiased estimates of the sequential states and rewards, the
expression (Q(s_i, a_i) - γ V(s_{i+1}) - r_i)² is not an unbiased
estimate of the true quadratic Bellman residual
(Q(s,a) - (TQ)(s,a))², but of (Q(s,a) - (TQ)(s,a))² + (T'Q)(s,a)².
T and T' are defined here as follows:

(TQ)(s,a) = E_{s'}[ R(s,a,s') + γ max_{a'} Q(s',a') ]

(T'Q)(s,a)² = Var_{s'}[ R(s,a,s') + γ max_{a'} Q(s',a') ]

T is also referred to as the Bellman operator.
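This statement follows from the usual decomposition of an expected squared error into a squared bias and a variance term; written out for clarity, with the shorthand X = R(s,a,s') + γ max_{a'} Q(s',a'), a random variable over s':

    \mathbb{E}_{s'}\!\left[\left(Q(s,a) - X\right)^{2}\right]
      = \left(Q(s,a) - \mathbb{E}_{s'}[X]\right)^{2} + \mathrm{Var}_{s'}(X)
      = \left(Q(s,a) - (TQ)(s,a)\right)^{2} + (T'Q)(s,a)^{2}.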
As an alternative to using double trajectories, the above modified
Bellman residual from document [2] is used as a better approximation
of the true Bellman residual. The optimization task is thus the
solution Q̂ = argmin_{Q ∈ H_Q} max_{h ∈ H_h} L_aux. The idea of the
modified Bellman residual is to find an h which approximates the
Bellman operator across the observations. This gives:

L̂ = E_{s'}[ ( Q(s,a) - γ Q(s', π(s')) - R(s,a,s') )² ] - E_{s'}[ ( h(s,a) - γ Q(s', π(s')) - R(s,a,s') )² ]
  = ( Q(s,a) - (TQ)(s,a) )² - ( h(s,a) - (TQ)(s,a) )²

This is the true loss function with an additional error term due to
the suboptimal approximation of h, if H_h is not able to approximate
the Bellman operator with arbitrary accuracy.
This technique allows the true Bellman residual to be bounded from
above, if the error of h in respect of TQ can be bounded. It is easy
to see that L̂ ≤ L holds within a saddle point of L_aux if
H_Q = H_h; otherwise h would not give the minimum of L̂. An optimum
of L_aux would therefore be provided by any fixed point of the
Bellman iteration, if such a point exists, as only in this instance
can Q approximate the Bellman operator as well as h, and L_aux = 0.
In contrast to the proposal in the publication [2], in the
embodiment of the invention described here H_h was either selected
to be a much more powerful function class than H_Q or chosen taking
into account prior knowledge of the true Bellman operator, so that
L̂ essentially provides a better estimate of T'Q². Since such an
estimate of the variance is not always unbiased, the method
converges on a biased estimator of the true Bellman residual, which
only minimizes the function Q̂* ∈ H_Q within the function space but
clearly provides a better approximation than estimators known from
the prior art.
The following gradients Δθ, Δω and Δψ result from the above Bellman
residual L_aux according to equation (2); they represent the
derivatives of the residual L_aux with respect to θ, ω and ψ
respectively, and their zero positions have to be determined to
solve the optimization task:
Δθ ∝ Σ_{i=1}^{l} [ ( Q(s_i,a_i) - γ Q(s_{i+1}, π(s_{i+1})) - r_i ) ( d/dθ Q(s_i,a_i) - ργ d/dθ Q(s_{i+1}, π(s_{i+1})) ) + βργ ( h(s_i,a_i) - γ Q(s_{i+1}, π(s_{i+1})) - r_i ) d/dθ Q(s_{i+1}, π(s_{i+1})) ] + dΩ(θ)/dθ   (3)

Δω ∝ αβ Σ_{i=1}^{l} ( h(s_i,a_i) - γ Q(s_{i+1}, π(s_{i+1})) - r_i ) d/dω h(s_i,a_i)   (4)

Δψ ∝ ε Σ_{i=1}^{l} d/dψ π(s_{i+1}) · d/dπ(s_{i+1}) Q(s_{i+1}, π(s_{i+1}))   (5)
Here ω denotes the corresponding parameters which describe the
auxiliary function h, which is modeled as a feed-forward network;
0 ≤ β ≤ 1 serves to control the influence of the auxiliary function
h, and α ≥ 1 is the extent of the optimization of h compared with Q.
ψ represents the parameters of a feed-forward network π (FIG. 1),
which is used to model the optimum selection rule. Instead of this
modified Bellman residual L_aux it is also possible to use the
standard Bellman residual or the reaching of the fixed point of the
Bellman equation to solve the optimization problem. The above
equation (4) is then no longer required and the term starting with
the factors βργ no longer occurs in equation (3). The parameter β
also allows a gradual transition. The optimality criterion according
to the conventional minimization of the Bellman residual is then
obtained for ρ = 1, while the reaching of the fixed point of the
Bellman iteration is obtained for ρ = 0.
In the embodiment described here the function h is achieved in the
architecture according to FIG. 1 by inserting the auxiliary network
AN next to the main network MN. The auxiliary network AN models the
function h with a neural network which, like the network N(s,a), has
an input layer I comprising the state s and the action a, a hidden
layer H and an output layer O comprising the function h. The
auxiliary network AN performs two tasks: it maximizes L̂ in relation
to θ and minimizes L̂ in relation to ω. The main network MN and the
auxiliary network AN here are learned simultaneously.
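A hedged sketch of how such simultaneous learning could look in code, building on the earlier sketches; the optimizer choice, learning rates and the settings of ρ, β and α are illustrative assumptions. The parameter ρ is applied here by scaling (or, for ρ = 0, detaching) the gradient that flows back through the successor-state branch, the weight decay is a crude stand-in for Ω(θ), and instead of the small ε weighting of the patent the rule P is simply given its own optimizer, which is a simplification.

    import torch

    h_net = QNetwork(state_dim=3, action_dim=1)   # auxiliary network with the same inputs as Q
    rho, beta, gamma, alpha = 0.0, 0.5, 0.9, 2.0  # illustrative settings

    opt_q  = torch.optim.Adam(q_net.parameters(),  lr=1e-3, weight_decay=1e-4)  # weight decay ~ Omega(theta)
    opt_pi = torch.optim.Adam(pi_net.parameters(), lr=1e-3)
    opt_h  = torch.optim.Adam(h_net.parameters(),  lr=alpha * 1e-3)             # alpha: stronger optimization of h

    for step in range(200):
        # Branch 1: regression target; the gradient towards the rule P is blocked (a' detached).
        a1_b  = pi_net(s1).detach()
        v_s1  = q_net(s1, a1_b)
        v_mix = rho * v_s1 + (1.0 - rho) * v_s1.detach()   # rho = 1: Bellman residual, rho = 0: fixed point
        main_term = (q_net(s, a) - gamma * v_mix - r) ** 2
        aux_term  = (h_net(s, a).detach() - gamma * v_mix - r) ** 2  # only the successor Q carries theta-gradient
        loss_theta = (main_term - beta * aux_term).sum()
        opt_q.zero_grad()
        loss_theta.backward()
        opt_q.step()

        # Auxiliary network h: minimize its own residual (i.e. maximize L_aux with respect to omega).
        with torch.no_grad():
            target = gamma * q_net(s1, pi_net(s1)) + r
        loss_omega = ((h_net(s, a) - target) ** 2).sum()
        opt_h.zero_grad()
        loss_omega.backward()
        opt_h.step()

        # Branch 2: adapt the rule P so that it maximizes the Q function for the sequential state.
        loss_psi = -q_net(s1, pi_net(s1)).sum()
        opt_pi.zero_grad()
        loss_psi.backward()
        opt_pi.step()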
The architecture of the main network MN shown in FIG. 1 shows the
policy gradient regression used in the invention. The main network
MN consists of a left part, which models the quality function Q as
an individual neural network N(s,a). The states s and the actions a
are continuous here. The left part of the network is coupled to the
right part by way of the reward function R, with the gradient flow
being controlled by the parameters -γ and ρ, which occur in the
above gradient equation (3). In the right part of the main network
MN, in branches 1 and 2, is the optimum action selection rule P in
the form of the neural network π with corresponding parameters ψ and
the neural network N(s',a') for the sequential state. The parameters
ψ of the neural network π are learned in respect of the optimum
selection rule P. The optimum selection rule here is the
maximization of the quality function Q. V' in FIG. 1 stands for
V(s') = Q(s', π(s')) and reflects the maximum Q function for the
sequential state s' and the sequential action a'. The branch 2 of
the right part of the main network MN shows the maximization of the
Q function in respect of the ε term according to the above equation
(5). A sufficiently small ε allows only the rule P to contribute to
the maximization of the Q function.
Therefore according to FIG. 1 a method results in which an
individual neural network N(s,a) is used for continuous states s and
actions a. For the sequential states Q(s', π(s')) is calculated, it
being assumed in this process that π: S → A with the parameters ψ
tends to implement the optimum rule P, in other words to maximize
the Q function. Therefore Q(s', π(s')) is close to
max_{a'} Q(s',a'). This is achieved by maximizing the Q function for
the sequential states at the same time as the regression in respect
of the rewards. A type of Batch On-Policy Iteration or Batch
Actor-Critic Iteration is thus implemented, in which the intrinsic
interaction between the Q function and the rule P is used. The
gradient flow control technique known from the prior art is used
here in conjunction with shared weights, this technique being
sufficient to construct the appropriate architecture. In the network
element for the sequential state shown in FIG. 1 with the reference
character 1 the gradient flow through the network is cut off, as
shown by the number 0 at the dashed arrow pointing to the action a'.
The rule P therefore does not influence the regression in respect of
the rewards. In an extended network element, shown in FIG. 1 with
the reference character 2, a sufficiently small ε allows just the
rule P to contribute to the maximization of the Q function. A
backpropagation method with shared weights known sufficiently from
the prior art is used to calculate the optimum selection rule π. The
gradient backflow blocking technique is also used, having been
developed by the inventors and already published. The gradient
backflow blocking technique is thus part of the prior art and is
sufficiently known to those skilled in the art. It is therefore not
described in more detail here. After the convergence of the
backpropagation method the optimum selection rule can be determined
with the aid of the feed-forward network π, without using the Q
function as an interim result.
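Once learning has converged, the control step itself is inexpensive: the learned policy network is evaluated directly, without querying the Q network. A minimal usage sketch, again based on the hypothetical networks above:

    # After convergence, the rule P is given directly by the policy network.
    current_state = torch.tensor([[0.70, 0.20, 0.10]])   # illustrative observed state
    with torch.no_grad():
        recommended_action = pi_net(current_state)        # action selection without the Q function
    print(recommended_action)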
The method described above does not take into account the history of
past states, which means that its mode of operation cannot be
guaranteed if there is no Markov decision process. In a second
embodiment of the inventive method this history can however be taken
into account. This is done by generating the data records that are
used to learn the neural networks in turn from source data records.
A source data record here is a data record of the kind used directly
in the embodiment in FIG. 1 to learn the neural networks. In the
second embodiment, which is shown in FIG. 2, these source data
records are used to model the dynamic behavior of the technical
system with a recurrent neural network RNN, which comprises an input
layer I', a hidden layer H' and an output layer
O'. This neural network comprises the internal states x_t, z_t,
t ∈ {t-τ, ..., t+1} and the weighting matrices F, G, H, J. The
transitions between the states are defined here by the following
equations:

x_t = tanh(F s_t + J z_{t-1})

z_t = G a_t + H x_t

A matrix M, which maps the internal state onto the external state,
can be used to achieve the sequential state by complying with the
following condition:

||M z_t - s_{t+1}||² = min.
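A hedged sketch of this state-space model; the matrices F, G, H, J and the readout M are plain learnable parameters here, and the dimensions are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RecurrentStateModel(nn.Module):
        """Sketch of the recurrent network: x_t = tanh(F s_t + J z_{t-1}), z_t = G a_t + H x_t."""
        def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 10):
            super().__init__()
            self.F = nn.Linear(state_dim, hidden_dim, bias=False)
            self.J = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.G = nn.Linear(action_dim, hidden_dim, bias=False)
            self.H = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.M = nn.Linear(hidden_dim, state_dim, bias=False)   # readout: M z_t ~ s_{t+1}

        def forward(self, states, actions):
            # states: (T, state_dim), actions: (T, action_dim); returns predictions and hidden states.
            z = torch.zeros(self.J.in_features)
            predictions, hidden = [], []
            for s_t, a_t in zip(states, actions):
                x_t = torch.tanh(self.F(s_t) + self.J(z))   # internal state x_t
                z = self.G(a_t) + self.H(x_t)               # internal state z_t
                hidden.append(x_t)
                predictions.append(self.M(z))               # should approximate s_{t+1}
            return torch.stack(predictions), torch.stack(hidden)

    # The model is fitted by minimizing ||M z_t - s_{t+1}||^2 over the source data records
    # (e.g. with backpropagation through time); the hidden states x_t then replace s and s'
    # as inputs to the feed-forward networks of FIG. 1.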
Known algorithms are used to determine the parameters of the
recurrent neural network (i.e. the weighting matrices of the
network) according to the above equations, such that the recurrent
neural network generates the observed data records in the
respective time point very efficiently. The recurrent neural
network here is learned in turn using a backpropagation method
known sufficiently from the prior art. Modeling of the dynamic
behavior by means of the recurrent neural network RNN is
sufficiently known to those skilled in the art and is therefore not
described in detail here. In contrast to the method in FIG. 1, the
hidden states x_t and x_{t+1} are now included as input variables in
the input layer I instead of the states s and s'. Otherwise the
learning method in FIG. 2 corresponds to the learning method in FIG.
1, which can be seen from the fact that the part above the recurrent
neural network RNN is identical to FIG. 1. No further description of
the learning of the neural networks will therefore be given. A
further parameter μ is included in the recurrent neural network
according to FIG. 2, being used to adjust the learning rate of the
recurrent neural network compared with the learning rate of the
individual feed-forward networks. Matrices M are also used, these
being described in more detail below.
The architecture shown in FIG. 2 allows only one recurrent neural
network to be used both for the Q function and for the value
function. The recurrent neural network here is able to approximate
a deterministic dynamic system with any accuracy, which is however
not possible for a stochastic dynamic system. However this does not
represent any restriction, as the structure of the internal hidden
state can be considered as a transformation to an appropriate
feature space for the stochastic reinforcement learning method by
the feed-forward networks. In the deterministic instance this
feature space is identical to the exact description of all
information, which determines the future, while in the general
stochastic instance the internal state must be generated such that
the expected future is predicted. Therefore in the recurrent neural
network in FIG. 2 an autonomous future element is used when internal
states are predicted without observations from the Markov process.
In FIG. 2 these are the states x_t for t > i+1, and the following
applies: x_t = tanh(N z_{t-1}), t > i+1.
This is sufficient, as the Markov characteristic can be
reconstructed by means of the knowledge of the expected future
states. The recurrent architecture according to FIG. 2 as described
above allows the automatic reconstruction of an underlying Markov
process of any order in principle. The user of the inventive method
can thus utilize more information from the data and has to provide
less expert information.
The embodiments according to FIG. 1 and FIG. 2 of the inventive
method described above have been tested on problems known from the
prior art. These tests showed that the method furnishes good
results. In particular, the method was also compared with the method
according to document [1], and better results were achieved with the
inventive method.
The method described in the foregoing offers an
information-efficient solution for general optimum control problems
in any technical fields, it being possible to overcome even complex
control problems with few available data records, such problems
having proved impossible to resolve satisfactorily with
conventional methods.
Bibliography
[1] German patent application 10 2007 017 259.3
[2] Andras Antos, Csaba Szepesvari, Remi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In Proc. of the Conference on Learning Theory, pages 574-588, 2006.
* * * * *