U.S. patent application number 15/329690 was filed with the patent office on 2017-07-27 for inverse reinforcement learning by density ratio estimation.
This patent application is currently assigned to Okinawa Institute of Science and Technology School Corporation. The applicant listed for this patent is Okinawa Institute of Science and Technology School Corporation. Invention is credited to Kenji DOYA, Eiji UCHIBE.
United States Patent Application 20170213151 (Appl. No. 15/329690)
Kind Code: A1
Family ID: 55263500
Inventors: UCHIBE; Eiji; et al.
Published: July 27, 2017
INVERSE REINFORCEMENT LEARNING BY DENSITY RATIO ESTIMATION
Abstract
A method of inverse reinforcement learning for estimating cost
and value functions of behaviors of a subject includes acquiring
data representing changes in state variables that define the
behaviors of the subject; applying a modified Bellman equation
given by Eq. (1) to the acquired data:
q(x)+γV(y)-V(x)=-ln(π(y|x)/p(y|x)) (1), where q(x) and V(x)
denote a cost function and a value function, respectively, at state
x, γ represents a discount factor, and p(y|x) and π(y|x) denote
state transition probabilities before and after learning,
respectively; estimating a density ratio π(y|x)/p(y|x) in Eq. (1);
estimating q(x) and V(x) in Eq. (1) using the least squares method
in accordance with the estimated density ratio π(y|x)/p(y|x); and
outputting the estimated q(x) and V(x).
Inventors: UCHIBE; Eiji (Okinawa, JP); DOYA; Kenji (Okinawa, JP)
Applicant: Okinawa Institute of Science and Technology School Corporation (Okinawa, JP)
Assignee: Okinawa Institute of Science and Technology School Corporation (Okinawa, JP)
Family ID: 55263500
Appl. No.: 15/329690
Filed: August 7, 2015
PCT Filed: August 7, 2015
PCT No.: PCT/JP2015/004001
371 Date: January 27, 2017
Related U.S. Patent Documents
Application Number: 62034510; Filing Date: Aug 7, 2014
Current U.S. Class: 1/1
Current CPC Class: G06N 20/00 20190101; G06N 7/005 20130101
International Class: G06N 99/00 20060101 G06N099/00; G06N 7/00 20060101 G06N007/00
Claims
1. A method of inverse reinforcement learning for estimating cost
and value functions of behaviors of a subject, comprising:
acquiring data representing changes in state variables that define
the behaviors of the subject; applying a modified Bellman equation
given by Eq. (1) to the acquired data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)), (1)
where q(x) and V(x) denote a cost function and a value function,
respectively, at state x, and γ represents a discount factor,
and p(y|x) and π(y|x) denote state transition probabilities
before and after learning, respectively; estimating a density ratio
π(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x) in Eq. (1)
using the least squares method in accordance with the estimated
density ratio π(y|x)/p(y|x); and outputting the estimated q(x)
and V(x).
2. The method according to claim 1, wherein the step of estimating
the ratio π(y|x)/p(y|x) includes using uLSIF.
3. The method according to claim 1, wherein the step of estimating
the ratio π(y|x)/p(y|x) includes using Least-Squares
Conditional Density Estimation (LSCDE).
4. The method according to claim 1, wherein the step of estimating
the ratio π(y|x)/p(y|x) includes using logistic
regression.
5. The method according to claim 1, wherein the step of estimating
the ratio π(y|x)/p(y|x) includes using a Gaussian process.
6. The method according to claim 1, wherein the step of estimating
the cost function q(x) and the value function V(x) includes using the
least squares method with regularization.
7. A non-transitory storage medium storing instructions to cause a
processor to perform an algorithm for inverse reinforcement
learning for estimating cost and value functions of behaviors of a
subject, said instructions causing the processor to perform the
following steps: acquiring data representing changes in state
variables that define the behaviors of the subject; applying a
modified Bellman equation given by Eq. (1) to the acquired data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)), (1)
where q(x) and V(x) denote a cost function and a
value function, respectively, at state x, and γ represents a
discount factor, and p(y|x) and π(y|x) denote state transition
probabilities before and after learning, respectively; estimating a
density ratio π(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x)
in Eq. (1) using the least squares method in accordance with the
estimated density ratio π(y|x)/p(y|x); and outputting the estimated
q(x) and V(x).
8. A system for inverse reinforcement learning for estimating cost
and value functions of behaviors of a subject, comprising: a data
acquisition unit to acquire data representing changes in state
variables that define the behaviors of the subject; a processor
with a memory, the processor and the memory being configured to:
apply a modified Bellman equation given by Eq. (1) to the acquired
data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)), (1)
where q(x) and V(x) denote a cost function
and a value function, respectively, at state x, and γ
represents a discount factor, and p(y|x) and π(y|x) denote state
transition probabilities before and after learning, respectively;
estimate a density ratio π(y|x)/p(y|x) in Eq. (1); and estimate
q(x) and V(x) in Eq. (1) using the least squares method in
accordance with the estimated density ratio π(y|x)/p(y|x); and
an output interface that outputs the estimated q(x) and V(x).
9. A system for predicting a preference in topics of articles that a
user is likely to read from a series of articles the user selected
during Internet web surfing, comprising: the system for inverse
reinforcement learning as set forth in claim 8, implemented in a
computer connected to the Internet, wherein said subject is the
user, and said state variables that define the behaviors of the
subject include topics of articles selected by the user while
browsing each webpage, and wherein the processor causes an
interface through which the user is browsing Internet websites to
display a recommended article for the user to read in accordance
with the estimated cost and value functions.
Description
TECHNICAL FIELD
[0001] The present invention relates to inverse reinforcement
learning, and more particularly, to a system and method of inverse
reinforcement learning. This application hereby incorporates by
reference U.S. Provisional Application No. 62/034,510, filed Aug.
7, 2014, in its entirety.
BACKGROUND ART
[0002] Understanding human behavior from observation is
crucial for developing artificial systems that can interact with
human beings. Since our decision-making processes are influenced by
the rewards/costs associated with selected actions, the problem can
be formulated as estimating the rewards/costs from observed
behaviors.
[0003] The idea of inverse reinforcement learning was originally
proposed by Ng and Russell (2000) (NPL 14). The OptV algorithm
proposed by Dvijotham and Todorov (2010) (NPL 6) is a prior work
showing that the policy of the demonstrator is approximated
by the value function, which is a solution of the linearized
Bellman equation.
[0004] Generally speaking, Reinforcement Learning (RL) is a
computational framework for investigating decision-making processes
of both biological and artificial systems that can learn an optimal
policy by interacting with an environment. There exist several open
questions in RL, and one of the critical problems is how to design
and prepare an appropriate reward/cost function. It is easy to
design a sparse reward function, which gives a positive reward when
the task is accomplished and zero otherwise, but that makes it hard
to find an optimal policy.
[0005] In some situations, it is easier to prepare examples of a
desired behavior than to handcraft an appropriate reward/cost
function. Recently, several methods of Inverse Reinforcement
Learning (IRL) (Ng & Russell, 2000, NPL 14) and apprenticeship
learning (Abbeel & Ng, 2004, NPL 1) have been proposed in order
to derive a reward/cost function from a demonstrator's performance
and to implement imitation learning. However, most of the existing
studies (Abbeel & Ng, 2004, NPL 1; Ratliff et al., 2009, NPL
16; Ziebart et al., 2008, NPL 26) require a routine to solve
forward reinforcement learning problems with estimated reward/cost
functions. This process is usually very time-consuming even when
the model of the environment is available.
[0006] Recently, the concept of the Linearly solvable Markov Decision
Process (LMDP) (Todorov, 2007; 2009, NPLs 23-24) was introduced,
which is a sub-class of the Markov Decision Process obtained by
restricting the form of the cost function. This restriction plays an
important role in IRL. LMDP is also known as the KL-control and path-integral
approaches (Kappen et al., 2012, NPL 10; Theodorou et al., 2010,
NPL 21) and similar ideas are proposed in the field of control
theory (Fleming and Soner, 2006, NPL 7). Model-free IRL algorithms
based on the path-integral method are proposed by Aghasadeghi &
Bretl (2011) (NPL 2); Kalakrishnan et al. (2013) (NPL 8). Since the
likelihood of the optimal trajectory is parameterized by the cost
function, the parameters of the cost can be optimized by maximizing
likelihood. However, their methods require the entire trajectory
data. A model-based IRL method is proposed by Dvijotham and Todorov
(2010) (NPL 6) based on the framework of LMDP, in which the
likelihood of the optimal state transition is represented by the
value function. As opposed to path-integral approaches to IRL, it
can be optimized from any dataset of state transitions. A major
drawback is the need to evaluate an integral that cannot be solved
analytically. In practice, they discretized the state space to
replace the integral with a sum, but this is not feasible in
high-dimensional continuous problems.
CITATION LIST
Non Patent Literature
[0007] NPL 1: Abbeel, P. and Ng, A. Y. Apprenticeship learning via
inverse reinforcement learning. In Proc. of the 21st International
Conference on Machine Learning, 2004. [0008] NPL 2: Aghasadeghi, N.
and Bretl, T. Maximum entropy inverse reinforcement learning in
continuous state spaces with path integrals. In Proc. of IEEE/RSJ
International Conference on Intelligent Robots and Systems,
pp. 1561-1566, 2011. [0009] NPL 3: Boularias, A., Kober, J., and
Peters, J. Relative entropy inverse reinforcement learning. In
Proc. of the 14th International Conference on Artificial
Intelligence and Statistics, volume 15, 2011. [0010] NPL 4:
Deisenroth, M. P., Rasmussen, C. E., and Peters, J. Gaussian process
dynamic programming. Neurocomputing, 72(7-9):1508-1524, 2009.
[0011] NPL 5: Doya, K. Reinforcement learning in continuous time
and space. Neural Computation, 12:219-245, 2000. [0012] NPL 6:
Dvijotham, K. and Todorov, E. Inverse optimal control with linearly
solvable MDPs. In Proc. of the 27th International Conference on
Machine Learning, 2010. [0013] NPL 7: Fleming, W. H. and Soner, H.
M. Controlled Markov Processes and Viscosity Solutions. Springer,
second edition, 2006. [0014] NPL 8: Kalakrishnan, M., Pastor, P.,
Righetti, L., and Schaal, S. Learning objective functions for
manipulation. In Proc. of IEEE International Conference on Robotics
and Automation, pp. 1331-1336, 2013. [0015] NPL 9: Kanamori, T.,
Hido, S., and Sugiyama, M. A Least-squares Approach to Direct
Importance Estimation. Journal of Machine Learning Research,
10:1391-1445, 2009. [0016] NPL 10: Kappen, H. J., Gomez, V., and
Opper, M. Optimal control as a graphical model inference problem.
Machine Learning, 87(2):159-182, 2012. [0017] NPL 11: Kinjo, K.,
Uchibe, E., and Doya, K. Evaluation of linearly solvable Markov
decision process with dynamic model learning in a mobile robot
navigation task. Frontiers in Neurorobotics, 7(7), 2013. [0018] NPL
12: Levine, S. and Koltun, V. Continuous inverse optimal control
with locally optimal examples. In Proc. of the 27th International
Conference on Machine Learning, 2012. [0019] NPL 13: Levine, S.,
Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement
learning with Gaussian processes. Advances in Neural Information
Processing Systems 24, pp. 19-27. 2011. [0020] NPL 14: Ng, A. Y.
and Russell, S. Algorithms for inverse reinforcement learning. In
Proc. of the 17th International Conference on Machine Learning,
2000. [0021] NPL 15: Rasmussen, C. E. and Williams, C. K. I.
Gaussian Processes for Machine Learning. MIT Press, 2006. [0022]
NPL 16: Ratliff, N. D., Silver, D., and Bagnell, J. A. Learning to
search: Functional gradient techniques for imitation learning.
Autonomous Robots, 27(1): 25-53, 2009. [0023] NPL 17: Stulp, F. and
Sigaud, O. Path integral policy improvement with covariance matrix
adaptation. In Proc. of the 10th European Workshop on Reinforcement
Learning, 2012. [0024] NPL 18: Sugimoto, N. and Morimoto, J.
Phase-dependent trajectory optimization for periodic movement using
path integral reinforcement learning. In Proc. of the 21st Annual
Conference of the Japanese Neural Network Society, 2011. [0025] NPL
19: Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya,
H., and Okanohara, D. Least-squares conditional density estimation.
IEICE Transactions on Information and Systems, E93-D(3): 583-594,
2010. [0026] NPL 20: Sugiyama, M., Suzuki, T., and Kanamori, T.
Density ratio estimation in machine learning. Cambridge University
Press, 2012. [0027] NPL 21: Theodorou, E., Buchli, J., and Schaal,
S. A generalized path integral control approach to reinforcement
learning. Journal of Machine Learning Research, 11: 3137-3181,
2010. [0028] NPL 22: Theodorou, E. A and Todorov, E. Relative
entropy and free energy dualities: Connections to path integral and
KL control. In Proc. of the 51st IEEE Conference on Decision and
Control, pp. 1466-1473, 2012. [0029] NPL 23: Todorov, E.
Linearly-solvable Markov decision problems. Advances in Neural
Information Processing Systems 19, pp. 1369-1376. MIT Press, 2007.
[0030] NPL 24: Todorov, E. Efficient computation of optimal
actions. Proceedings of the National Academy of Sciences of the
United States of America, 106(28): 11478-83, 2009. [0031] NPL 25:
Todorov, E. Eigenfunction approximation methods for
linearly-solvable optimal control problems. In Proc. of the 2nd
IEEE Symposium on Adaptive Dynamic Programming and Reinforcement
Learning, pp. 161-168, 2009. [0032] NPL 26: Ziebart, B. D., Maas,
A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse
reinforcement learning. In Proc. of the 23rd AAAI Conference on
Artificial Intelligence, 2008.
SUMMARY OF INVENTION
Technical Problem
[0033] Inverse reinforcement learning is a framework to solve the
above problems, but as mentioned above, the existing methods have
the following drawbacks: (1) they are intractable when the state is
continuous, (2) the computational cost is expensive, and (3) entire
trajectories of states are required for estimation. The methods
disclosed in this disclosure solve these drawbacks. In particular,
the previous method proposed in NPL 14 does not work well, as many
previous studies have reported. Moreover, the method proposed in NPL 6
cannot solve continuous problems in practice because its
algorithm involves a complicated evaluation of integrals.
[0034] The present invention is directed to a system and method for
inverse reinforcement learning.
[0035] An object of the present invention is to provide a new and
improved inverse reinforcement learning system and method so as to
obviate one or more of the problems of the existing art.
Solution to Problem
[0036] To achieve these and other advantages and in accordance with
the purpose of the present invention, as embodied and broadly
described, in one aspect, the present invention provides a method
of inverse reinforcement learning for estimating cost and value
functions of behaviors of a subject, including: acquiring data
representing changes in state variables that define the behaviors
of the subject; applying a modified Bellman equation given by Eq.
(1) to the acquired data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)) (1)
[0037] where q(x) and V(x) denote a cost function and a value
function, respectively, at state x, and γ represents a
discount factor, and p(y|x) and π(y|x) denote state transition
probabilities before and after learning, respectively; estimating a
density ratio π(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x)
in Eq. (1) using the least squares method in accordance with the
estimated density ratio π(y|x)/p(y|x); and outputting the
estimated q(x) and V(x).
[0038] In another aspect, the present invention provides a
non-transitory storage medium, such as a CD-ROM or other form of
non-transitory storage medium, storing instructions to cause a
processor to perform an algorithm for inverse reinforcement
learning for estimating cost and value functions of behaviors of a
subject, said instructions causing the processor to perform the
following steps: acquiring data representing changes in state
variables that define the behaviors of the subject; applying a
modified Bellman equation given by Eq. (1) to the acquired
data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)) (1)
[0039] where q(x) and V(x) denote a cost function and a value
function, respectively, at state x, and γ represents a
discount factor, and p(y|x) and π(y|x) denote state transition
probabilities before and after learning, respectively; estimating a
density ratio π(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x)
in Eq. (1) using the least squares method in accordance with the
estimated density ratio π(y|x)/p(y|x); and outputting the
estimated q(x) and V(x).
[0040] In another aspect, the present invention provides a system
for inverse reinforcement learning for estimating cost and value
functions of behaviors of a subject, including a data acquisition
unit to acquire data representing changes in state variables that
define the behaviors of the subject; a processor with a memory, the
processor and the memory being configured to: apply a modified
Bellman equation given by Eq. (1) to the acquired data:
q(x) + γV(y) - V(x) = -ln(π(y|x)/p(y|x)) (1)
[0041] where q(x) and V(x) denote a cost function and a value
function, respectively, at state x, and γ represents a
discount factor, and p(y|x) and π(y|x) denote state transition
probabilities before and after learning, respectively; estimate a
density ratio π(y|x)/p(y|x) in Eq. (1); and estimate q(x) and
V(x) in Eq. (1) using the least squares method in accordance with
the estimated density ratio π(y|x)/p(y|x); and an output
interface that outputs the estimated q(x) and V(x).
[0042] In another aspect, the present invention provides a system
for predicting a preference in topics of articles that a user is
likely to read from a series of articles the user selected during
Internet web surfing, including: the system for inverse
reinforcement learning as set forth above, implemented in a
computer connected to the Internet, wherein said subject is the
user, and said state variables that define the behaviors of the
subject include topics of articles selected by the user while
browsing each webpage, and wherein the processor causes an
interface through which the user is browsing Internet websites to
display a recommended article for the user to read in accordance
with the estimated cost and value functions.
Advantageous Effects of Invention
[0043] According to one or more aspects of the present invention,
it becomes possible to perform inverse reinforcement learning
effectively and efficiently. In some embodiments, there is no need
to know the environmental dynamics in advance, and there is no need
to carry out integration.
[0044] Additional or separate features and advantages of the
invention will be set forth in the descriptions that follow and in
part will be apparent from the description, or may be learned by
practice of the invention. The objectives and other advantages of
the invention will be realized and attained by the structure
particularly pointed out in the written description and claims
thereof as well as the appended drawings.
[0045] It is to be understood that both the foregoing general
description and the following detailed description are exemplary
and explanatory, and are intended to provide further explanation of
the invention as claimed.
BRIEF DESCRIPTION OF DRAWINGS
[0046] FIG. 1 shows normalized squared errors for the results of
the swing-up inverted pendulum experiments to which embodiments of
the present invention were applied, for each of the following density
ratio estimation methods: (1) LSCDE-IRL, (2) uLSIF-IRL, (3)
LogReg-IRL, (4) Gauss-IRL, (5) LSCDE-OptV, and (6) Gauss-OptV. As
indicated in the drawing, (a)-(d) differ from each other in terms
of sampling methods and other parameters.
[0047] FIG. 2 is a graph showing cross-validation errors in the
swing-up inverted pendulum experiments for various density ratio
estimation methods.
[0048] FIG. 3 shows an experimental setup for the pole balancing
task for the long pole; left: the start position, middle: the goal
position, and right: state variables.
[0049] FIG. 4 shows learning curves in the pole balancing task
experiment with respect to various subjects according to an
embodiment of the present invention; solid line: long pole, dotted
line: short pole.
[0050] FIG. 5 shows estimated cost functions derived for the pole
balancing task experiment according to the embodiment of the
present invention for Subject Nos. 4, 5, and 7, projected to the
defined subspace.
[0051] FIG. 6 shows negative log likelihood values for the test
datasets in the pole balancing task experiment for Subject Nos. 4
and 7, evaluating the estimated cost functions.
[0052] FIG. 7 schematically shows a framework of inverse
reinforcement learning according to an embodiment of the present
invention that can infer an objective function from observed state
transitions generated by demonstrators.
[0053] FIG. 8 is a schematic block diagram showing an example of
implementation of the inverse reinforcement learning of the present
invention in imitation learning of robot behaviors.
[0054] FIG. 9 is a schematic block diagram showing an example of
implementation of the inverse reinforcement learning of the present
invention in interpreting human behaviors.
[0055] FIG. 10 schematically shows a series of clicking actions by
a web-visitor, showing the visitor's preference in topics during web
surfing.
[0056] FIG. 11 schematically shows an example of an inverse
reinforcement learning system according to an embodiment of the
present invention.
DESCRIPTION OF EMBODIMENTS
[0057] The present disclosure provides a novel inverse
reinforcement learning method and system based on density ratio
estimation under the framework of Linearly solvable Markov Decision
Process (LMDP). In LMDP, the logarithm of the ratio between the
controlled and uncontrolled state transition densities is
represented by the state-dependent cost and value functions. In one
aspect of the present invention, density ratio estimation methods
are used to estimate the transition density ratio, and the least
squares method with regularization is used to estimate the
state-dependent cost and value functions that satisfy the relation.
This method can avoid computing integrals such as the
partition function. As described below, a simple numerical
simulation of a pendulum swing-up was performed, and its
superiority over conventional methods has been demonstrated. The
present inventors further applied the method to human behaviors in
performing a pole balancing task and showed that the estimated cost
functions can predict the performance of the subjects in new trials
or environments in a satisfactory manner.
[0058] One aspect of the present invention is based on the
framework of linearly solvable Markov decision processes like the
OptV algorithm. The present inventors have derived a novel Bellman
equation given by:
-ln(π(y|x)/p(y|x)) = q(x) + γV(y) - V(x),
[0059] where q(x) and V(x) denote the cost and value function at
state x and γ represents a discount factor. p(y|x) and
π(y|x) denote the state transition probabilities before and
after learning, respectively. The density ratio, the left hand side
of the above equation, is efficiently computed from observed
behaviors by density ratio estimation methods. Once the density
ratio is estimated, the cost and value functions can be estimated by
a regularized least-squares method. An important feature is that our
method can avoid computing integrals, which are usually
calculated at high computational cost. The present inventors have
applied this method to human behaviors in performing a pole
balancing task and shown that the estimated cost functions can
predict the performance of the subjects in new trials or
environments, verifying the universal applicability and effectiveness
of this new computational technique in inverse reinforcement
learning, which has well-recognized wide applicability in control
systems, machine learning, operations research, information theory,
etc.
[0060] <1. Linearly Solvable Markov Decision Process>
[0061] <1.1. Forward Reinforcement Learning>
[0062] The present disclosure provides a brief introduction of
Markov Decision Process and its simplification for a discrete-time
continuous-space domain. Let X and U be the continuous state and
continuous action spaces, respectively. At a time step t, a
learning agent observes the environment's current state
[0063] x_t ∈ X
[0064] and executes action
[0065] u_t ∈ U
[0066] sampled from a stochastic policy p(u_t|x_t).
Consequently, an immediate cost c(x_t, u_t) is given from
the environment and the environment makes a state transition
according to a state transition probability P_T(y|x_t,
u_t) from x_t to
[0067] y ∈ X
[0068] under the action u.sub.t. The goal of reinforcement learning
is to construct an optimal policy .pi.(u|x) which minimizes the
given objective function. There exist several objective functions
and the most widely used one is a discounted sum of costs given
by:
V(x) = E[ Σ_{t=0}^∞ γ^t c(x_t, u_t) ], (1)
[0069] where
γ ∈ [0, 1)
[0071] is called the discount factor. It is known that the optimal
value function satisfies the following Bellman equation:
V(x) = min_u [ c(x, u) + γ E_{y~P_T(·|x,u)}[ V(y) ] ] (2)
[0072] Eq. (2) is a nonlinear equation due to the min operator.
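For a small discretized MDP, Eq. (2) can be solved by value iteration, which repeatedly applies the min-operator backup until the value function converges. The following is a minimal sketch; the three-state, two-action costs and transition matrices are hypothetical and chosen only to illustrate the backup:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only to illustrate Eq. (2):
#   V(x) = min_u [ c(x,u) + gamma * E_{y ~ P_T(.|x,u)}[V(y)] ]
gamma = 0.9
c = np.array([[1.0, 2.0],   # cost c(x,u): rows are states, columns actions
              [0.5, 0.1],
              [0.0, 1.0]])
# P[x,u,y] is the state-transition probability P_T(y|x,u); rows sum to one.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.7, 0.3], [0.5, 0.0, 0.5]],
              [[0.0, 0.0, 1.0], [0.2, 0.2, 0.6]]])

V = np.zeros(3)
for _ in range(500):  # value iteration: apply the min-operator backup
    Q = c + gamma * np.einsum('xuy,y->xu', P, V)  # Q(x,u)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)  # converged optimal value function
```

Because the backup is a γ-contraction, the iteration converges geometrically from any initialization.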
[0073] Linearly solvable Markov Decision Process (LMDP) simplifies
Eq. (2) under some assumptions (Todorov, 2007; 2009a, NPLs 23-24).
The key trick of LMDP is to optimize the state transition
probability directly instead of optimizing the policy. More
specifically, two conditional probability density functions are
introduced. One is the uncontrolled probability denoted by p(y|x)
which can be regarded as an innate state transition. p(y|x) is
arbitrary, and it can be constructed by
p(y|x) = ∫ P_T(y|x,u) π_0(u|x) du,
[0074] where π_0(u|x) is a random policy. The other is the
controlled probability denoted by .pi.(y|x) which can be
interpreted as an optimal state transition. Then, the cost function
is restricted to the following form:
c(x,u) = q(x) + KL(π(·|x) ‖ p(·|x)), (3)
[0075] where q(x) and KL(π(·|x) ‖ p(·|x)) denote
the state-dependent cost function and the Kullback-Leibler divergence
between the controlled and uncontrolled state transition densities,
respectively. In this case, the Bellman equation (2) is simplified
to the following equation:
exp(-V(x)) = exp(-q(x)) ∫ p(y|x) exp(-γV(y)) dy (4)
[0076] The optimal controlled probability is given by:
π(y|x) = p(y|x) exp(-γV(y)) / ∫ p(y'|x) exp(-γV(y')) dy'. (5)
[0077] It should be noted that Eq. (4) is still nonlinear even
though the desirability function Z(x) = exp(-V(x)) is introduced,
because of the existence of the discount factor γ. In the
forward reinforcement learning under the framework of LMDP, V(x) is
computed by solving Eq. (4), and then π(y|x) is computed (Todorov,
2009, NPL 25).
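For discrete states the integral in Eq. (4) becomes a sum, and V(x) can be obtained by fixed-point iteration on V(x) = q(x) - ln Σ_y p(y|x) exp(-γV(y)); Eq. (5) then yields the controlled probability. A minimal sketch with a hypothetical three-state LMDP (q and p are invented for illustration):

```python
import numpy as np

gamma = 0.9
q = np.array([0.5, 1.0, 0.0])              # state-dependent cost q(x)
p = np.array([[0.6, 0.3, 0.1],             # uncontrolled transitions p(y|x)
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

V = np.zeros(3)
for _ in range(2000):  # V(x) <- q(x) - ln sum_y p(y|x) exp(-gamma V(y))
    V_new = q - np.log(p @ np.exp(-gamma * V))
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Eq. (5): controlled transition pi(y|x) = p(y|x) exp(-gamma V(y)) / Z(x)
w = p * np.exp(-gamma * V)[None, :]
pi = w / w.sum(axis=1, keepdims=True)
print(pi.round(3))
```

Since the log-sum-exp map is 1-Lipschitz, the iteration is a γ-contraction and converges for any γ < 1.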
[0078] <1.2. Inverse Reinforcement Learning>
[0079] The inverse reinforcement learning (IRL) algorithm under
LMDP was proposed by Dvijotham and Todorov (2010) (NPL 6). In
particular, OptV is quite efficient for discrete state problems.
The advantage of OptV is that the optimal state transition is
explicitly represented by the value function so that the maximum
likelihood method can be applied to estimate the value function.
Suppose that the observed trajectories are generated by the optimal
state transition density (5). The value function is approximated by
the following linear model:
V̂(x) = w_V^T ψ_V(x), (6)
[0080] where w_V and ψ_V(x) denote the learning weights and
basis function vector, respectively.
[0081] Since the controlled probability is given by Eq. (5), the
weight vector w_V can be optimized by maximizing the
likelihood. Suppose that we have a dataset of state
transitions:
D^π = {(x_j^π, y_j^π)}_{j=1}^{N^π},  y_j^π ~ π(·|x_j^π), (7)
[0082] where N^π denotes the number of data from the
controlled probability. Then, the log-likelihood and its derivative
are given by:
L(w_V) = Σ_{j=1}^{N^π} ln π(y_j^π | x_j^π; w_V),
∂L(w_V)/∂w_V = Σ_{j=1}^{N^π} [ -ψ_V(y_j^π) + ∫ π(y | x_j^π; w_V) ψ_V(y) dy ], (8)
[0083] where π(y|x; w_V) is the controlled policy in which
the value function is parameterized by Eq. (6). Once the gradient
is evaluated, the weight vector w_V is updated according to the
gradient ascent method.
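In a discrete-state setting the integral in Eq. (8) reduces to a sum, and the gradient ascent just described can be sketched as follows. The uncontrolled probability, the generating value function, and the tabular (one-hot) basis are all hypothetical; the recovered value function matches the generating one only up to an additive constant:

```python
import numpy as np

rng = np.random.default_rng(0)

gamma, n = 0.9, 4
p = rng.dirichlet(np.ones(n), size=n)        # hypothetical uncontrolled p(y|x)
psi = np.eye(n)                               # tabular basis: psi_V(x) = one-hot

# generate demonstration data from the controlled probability of Eq. (5)
V_true = np.array([0.0, 1.0, 2.0, 0.5])
m = p * np.exp(-gamma * V_true)[None, :]
pi_true = m / m.sum(axis=1, keepdims=True)
xs = rng.integers(0, n, size=5000)
ys = np.array([rng.choice(n, p=pi_true[x]) for x in xs])

w = np.zeros(n)                               # V(x) = w^T psi(x), Eq. (6)
for _ in range(1000):                         # gradient ascent on the likelihood
    V = psi @ w
    m = p * np.exp(-gamma * V)[None, :]
    pi = m / m.sum(axis=1, keepdims=True)
    # dL/dw per sample: gamma * ( E_pi[psi_V(y)|x_j] - psi_V(y_j) )
    grad = gamma * (pi[xs].mean(axis=0) - psi[ys].mean(axis=0))
    w += 0.5 * grad
```

The log-likelihood is concave in w here, so plain gradient ascent suffices for this toy case.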
[0084] After the value function is estimated, the simplified
Bellman equation (4) can be used to retrieve the cost function. It
means that the cost function q(x) is uniquely determined when V(x)
and γ are given, and q(x) is expressed by the basis
functions used in the value function. While the representation of
the cost function is not important in the case of imitation
learning, we want to find a simpler representation of the cost for
analysis. Therefore, the present inventors introduce an
approximator:
q̂(x) = w_q^T ψ_q(x), (9)
[0087] where w_q and ψ_q(x) denote the learning weights
and basis function vector, respectively. The objective function
with L1 regularization to optimize w_q is given by:
J(w_q) = (1/2) Σ_{j=1}^{N^π} [ w_q^T ψ_q(x_j^π) - V̂(x_j^π; w_V) - ln ∫ p(y|x_j^π) exp(-γ V̂(y; w_V)) dy ]^2 + (λ_q/2) ‖w_q‖_1, (10)
[0088] where λ_q is a regularization constant. A simple
gradient descent algorithm is adopted, and J(w_q) is evaluated
at the observed states.
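A discrete-state sketch of minimizing Eq. (10): the integral becomes a sum over next states, and here the L1 term is handled by proximal gradient descent (iterative soft-thresholding) rather than the plain gradient descent mentioned above; the fitted value function and uncontrolled probability are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n = 0.9, 4
p = rng.dirichlet(np.ones(n), size=n)       # hypothetical uncontrolled p(y|x)
V_hat = np.array([0.0, 1.0, 2.0, 0.5])      # hypothetical fitted value function
psi_q = np.eye(n)                            # tabular basis psi_q(x)

xs = rng.integers(0, n, size=1000)           # observed states x_j
# target for q(x_j): V_hat(x_j) + ln sum_y p(y|x_j) exp(-gamma V_hat(y))
target = V_hat[xs] + np.log(p[xs] @ np.exp(-gamma * V_hat))

lam, lr = 0.001, 0.5
w_q = np.zeros(n)
for _ in range(2000):
    resid = psi_q[xs] @ w_q - target         # squared-error part of Eq. (10)
    grad = psi_q[xs].T @ resid / len(xs)
    w_q -= lr * grad
    # proximal step for the L1 penalty: soft-thresholding
    w_q = np.sign(w_q) * np.maximum(np.abs(w_q) - lr * lam, 0.0)
```

With a small regularization constant the solution is close to the unpenalized least-squares fit, which reproduces the simplified Bellman relation of Eq. (4).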
[0089] The most significant problem of Dvijotham and Todorov (2010)
(NPL 6) is the integrals in Eqs. (8) and (10), which cannot be solved
analytically, and they discretized the state space and replaced the
integral with a sum. However, as they suggested, this is infeasible
in high-dimensional problems. In addition, the uncontrolled
probability p(y|x) is not necessarily Gaussian. In at least some
embodiments of the present invention, the Metropolis-Hastings
algorithm is applied to evaluate the gradient of the
log-likelihood, in which the uncontrolled probability p(y|x) is
used as a causal density.
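The Metropolis-Hastings evaluation of the integral term in Eq. (8) can be sketched as follows. This is an illustrative toy, not the disclosed implementation: it assumes a one-dimensional state, a Gaussian uncontrolled density p(y|x) = N(y; x, 1) as the proposal, and a hypothetical three-center Gaussian basis for the value function.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.94
centers = np.array([-1.0, 0.0, 1.0])  # hypothetical basis centers

def psi_V(y):
    # Gaussian basis vector for the value function (assumed form)
    return np.exp(-(y - centers) ** 2 / (2 * 0.5 ** 2))

def V_hat(y, w_V):
    return w_V @ psi_V(y)

def mh_samples(x, w_V, n_keep=2000, n_burn=500):
    """Independence Metropolis-Hastings targeting the controlled density
    pi(y|x) proportional to p(y|x) exp(-gamma V(y)), with the uncontrolled
    density p(y|x) = N(y; x, 1) as the proposal. Because the proposal
    equals p(y|x), the acceptance ratio reduces to
    exp(-gamma (V(y') - V(y)))."""
    y = x
    out = np.empty(n_keep)
    for t in range(n_keep + n_burn):
        y_new = rng.normal(x, 1.0)  # draw from the proposal p(.|x)
        if np.log(rng.uniform()) < -gamma * (V_hat(y_new, w_V) - V_hat(y, w_V)):
            y = y_new
        if t >= n_burn:
            out[t - n_burn] = y
    return out

# Monte Carlo approximation of the integral term of Eq. (8) for one
# observed transition (x_j, y_j):
w_V = np.array([0.2, -0.1, 0.3])
x_j, y_j = 0.0, 0.4
ys = mh_samples(x_j, w_V)
integral_term = np.mean([psi_V(y) for y in ys], axis=0)
grad_contrib = -psi_V(y_j) + integral_term  # one summand of Eq. (8)
```

Summing `grad_contrib` over all observed transitions approximates the full gradient of Eq. (8) for the gradient ascent update of w_V.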
[0090] <2. Inverse Reinforcement Learning by Density Ratio
Estimation>
[0091] <2.1. Bellman Equation for IRL>
[0092] From Equations (4) and (5), the present inventors have
derived the following important relation for the discounted-cost
problems:
q(x) + \gamma V(y) - V(x) = -\ln \frac{\pi(y \mid x)}{p(y \mid x)}. \qquad (11)
[0093] Eq. (11) plays an important role in the IRL algorithms
according to embodiments of the present invention. Similar
equations can be derived for first-exit, average cost, and finite
horizon problems. It should be noted that the left hand side of Eq.
(11) is not a temporal difference error because q(x) is the
state-dependent part of the cost function shown in Eq. (3).
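The relation (11) can be checked numerically on a small discrete LMDP: solve the simplified Bellman equation (4) for V by fixed-point iteration, form the controlled probability as in Eq. (5), and verify the identity for every transition. The five-state transition matrix and cost below are arbitrary toy values introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.94

# Arbitrary toy LMDP: uncontrolled transitions P and state cost q.
P = rng.uniform(size=(n, n))
P /= P.sum(axis=1, keepdims=True)
q = rng.uniform(0.0, 1.0, size=n)

# Fixed-point iteration on the simplified Bellman equation:
#   exp(-V(x)) = exp(-q(x)) * sum_y P(y|x) exp(-gamma V(y))
V = np.zeros(n)
for _ in range(2000):
    V = q - np.log(P @ np.exp(-gamma * V))

# Controlled probability: pi(y|x) proportional to P(y|x) exp(-gamma V(y))
Pi = P * np.exp(-gamma * V)[None, :]
Pi /= Pi.sum(axis=1, keepdims=True)

# Eq. (11): q(x) + gamma V(y) - V(x) = -ln pi(y|x)/p(y|x) for all (x, y)
lhs = q[:, None] + gamma * V[None, :] - V[:, None]
rhs = -np.log(Pi / P)
print(np.max(np.abs(lhs - rhs)))  # near zero up to numerical precision
```

The maximum discrepancy is at machine-precision level once the fixed-point iteration has converged, which illustrates why the right hand side of Eq. (11) can serve as a regression target for q and V.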
[0094] Our IRL problem is still ill-posed, and the cost function is not uniquely determined although the form of the cost function is constrained by Eq. (3) under LMDP. More specifically, if the state-dependent cost function is modified by:
q'(x)=q(x)+C, (12)
[0095] the corresponding value function is changed to:
V'(x) = V(x) + \frac{C}{1 - \gamma}, \qquad (13)
[0096] where C is a constant value. Then, the controlled
probability derived from V(x) is identical to that from V'(x). This
property is useful when estimating the cost function as described
below. In one aspect of the present invention, the disclosed IRL
method consists of two parts. One is to estimate the density ratio
of the right hand side of Eq. (11) described below. The other is to
estimate q(x) and V(x) by the least squares method with
regularization as shown below.
[0097] <2.2. Density Ratio Estimation for IRL>
[0098] Estimating the ratio of controlled and uncontrolled
transition probability densities can be regarded as a problem of
density ratio estimation (Sugiyama et al., 2012, NPL 20). According
to the setting of the problem, the present disclosure considers the
following formulation.
[0099] <2.2.1. General Case>
[0100] First, a general setting is considered. Suppose that we have
two datasets of state transitions: One is D.sup..pi. shown in Eq.
(7) and the other is a dataset from the uncontrolled
probability:
D^p = \{ (x_j^p, y_j^p) \}_{j=1}^{N^p}, \qquad y_j^p \sim p(\cdot \mid x_j^p),
[0101] where N.sup.p denotes the number of data. Then, we are
interested in estimating the ratio .pi.(y|x)/p(y|x) from D.sup.p
and D.sup..pi..
[0102] From Eq. (11), we can consider the following two
decompositions:
-\ln \frac{\pi(y \mid x)}{p(y \mid x)} = \ln p(y \mid x) - \ln \pi(y \mid x) \qquad (14)

= \ln \frac{\pi(x)}{p(x)} - \ln \frac{\pi(x, y)}{p(x, y)}. \qquad (15)
[0103] The first decomposition (14) shows the difference of
logarithms of conditional probability densities. In order to
estimate Eq. (14), the present disclosure considers two
implementations. The first one is LSCDE-IRL, which adopts Least Squares Conditional Density Estimation (LSCDE) (Sugiyama et al., 2010, NPL 19) to estimate π(y|x) and p(y|x). The other is Gauss-IRL, which uses a Gaussian process (Rasmussen & Williams, 2006, NPL 15) to estimate the conditional densities in Eq. (14).
[0104] The second decomposition (15) shows the difference of logarithms of density ratios. The advantage of the second decomposition is that ln π(x)/p(x) can be neglected if π(x) = p(x). This condition may be satisfied according to the setup. Currently, two methods are implemented to estimate π(x)/p(x) and π(x, y)/p(x, y). One is uLSIF-IRL, using the unconstrained Least Squares Importance Fitting (uLSIF) (Kanamori et al., 2009, NPL 9). The other is LogReg, which utilizes a logistic regression in a different way. Section 2.3 below describes their implementations.
[0105] <2.2.2. When p(y|x) is Unknown>
[0106] The state transition probability P.sub.t(y|x, u) is assumed
to be known in advance in the case of standard IRL problems, and
this corresponds to the assumption that the uncontrolled
probability p(y|x) is given in the case of LMDP. This can be
regarded as a model-based IRL. In this case, Eq. (14) is appropriate, and it is enough to estimate the controlled probability π(y|x) from the dataset D^π.
[0107] In some situations, we have neither an analytical model nor
a dataset from the uncontrolled probability density. Then, p(y|x)
is replaced by a uniform distribution, which is an improper
distribution for unbounded variables. Without loss of generality,
p(y|x) is set to 1 since it can be compensated by shifting the cost
and value function by Eqs. (12) and (13).
[0108] <2.3. Density Ratio Estimation Algorithms>
[0109] This section describes density ratio estimation algorithms
appropriate for the IRL method disclosed in this disclosure.
[0110] <2.3.1. uLSIF>
[0111] uLSIF (Kanamori et al., 2009, NPL 9) is a least-squares method for direct density ratio estimation. The goal of uLSIF is to estimate the ratios of two densities, π(x)/p(x) and π(x, y)/p(x, y). Hereafter, the present disclosure explains how to estimate r(z) = π(z)/p(z) from D^p and D^π, where z = (x, y) for simplicity. Let us approximate the ratio by the linear model:

\hat{r}(z) = \alpha^T \phi(z),

[0112] where φ(z) denotes the basis function vector and α denotes the parameters to be learned. The objective function is given by:
J(\alpha) = \frac{1}{2} \int \big( \hat{r}(z) - r(z) \big)^2 p(z)\, dz + \frac{\lambda}{2} \| \alpha \|_2^2 = \frac{1}{2} \alpha^T H \alpha - h^T \alpha + \frac{\lambda}{2} \| \alpha \|_2^2 + \mathrm{const.}, \qquad (16)
[0113] where λ is a regularization constant and

H = \int \phi(z) \phi(z)^T p(z)\, dz \approx \frac{1}{N^p} \sum_{i=1}^{N^p} \phi(z_i^p) \phi(z_i^p)^T, \qquad h = \int \phi(z) \pi(z)\, dz \approx \frac{1}{N^\pi} \sum_{j=1}^{N^\pi} \phi(z_j^\pi).
[0114] It should be noted that H is estimated from D^p while h is estimated from D^π. Eq. (16) can be analytically minimized as

\tilde{\alpha} = (\hat{H} + \lambda I)^{-1} \hat{h},
[0115] but this minimizer ignores the non-negativity constraint of the density ratio. To compensate for this problem, uLSIF modifies the solution by:

\hat{\alpha}(\lambda) = \max\!\big( 0, (\hat{H} + \lambda I)^{-1} \hat{h} \big), \qquad (17)
[0116] where the max operator above is applied in the element-wise
manner.
[0117] As recommended by Kanamori et al. (2009) (NPL 9), a Gaussian
function centered at the states of D.sup..pi. is used as a basis
function described by:
\phi_j(z) = \exp\left( -\frac{\| z - z_j^\pi \|^2}{2 \sigma^2} \right), \qquad (18)
[0118] where σ is a width parameter and z_j^π is a state randomly selected from D^π. The parameters λ and σ are selected by leave-one-out cross-validation.
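The uLSIF computation of Eqs. (16)-(18) can be sketched as below. This is a toy one-dimensional example with fixed hyperparameters; the disclosure selects σ and λ by leave-one-out cross-validation, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_basis(Z, centers, sigma):
    # phi_j(z) = exp(-||z - z_j||^2 / (2 sigma^2)), as in Eq. (18)
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def ulsif(Z_p, Z_pi, sigma=0.5, lam=0.1):
    """uLSIF estimate of r(z) = pi(z)/p(z) from samples of p (Z_p) and
    pi (Z_pi). Basis centers are the pi-samples, as recommended."""
    centers = Z_pi
    Phi_p = gaussian_basis(Z_p, centers, sigma)    # builds H from D^p
    Phi_pi = gaussian_basis(Z_pi, centers, sigma)  # builds h from D^pi
    H = Phi_p.T @ Phi_p / len(Z_p)
    h = Phi_pi.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(0.0, alpha)                 # Eq. (17), element-wise
    return lambda Z: gaussian_basis(Z, centers, sigma) @ alpha

# Toy check: p = N(0,1), pi = N(0.5,1); the estimated ratio should be
# larger where pi has more mass than p.
Z_p = rng.normal(0.0, 1.0, size=(200, 1))
Z_pi = rng.normal(0.5, 1.0, size=(200, 1))
r_hat = ulsif(Z_p, Z_pi)
```

The element-wise `max(0, ...)` step is exactly the non-negativity correction of Eq. (17); without it the linear model can assign negative density ratios.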
[0121] <2.3.2. LSCDE>
[0122] LSCDE (Sugiyama et al., 2010, NPL 19) is regarded as a
special case of uLSIF to estimate a conditional probability density
function. For example, the objective function to estimate
.pi.(y|x)=.pi.(x, y)/.pi.(x) from D.sup..pi. is given by:
J(\alpha) = \frac{1}{2} \int\!\!\int \big( \hat{r}(x, y) - r(x, y) \big)^2 p(x)\, dx\, dy + \frac{\lambda}{2} \| \alpha \|_2^2,
[0123] where
\hat{r}(x, y) = \alpha^T \phi(x, y)
[0124] is a linear model and λ is a regularization constant. Computing H and h in LSCDE is slightly different from that in uLSIF; they are computed as follows:

H = \int \bar{\Phi}(x)\, \pi(x)\, dx \approx \frac{1}{N^\pi} \sum_{j=1}^{N^\pi} \bar{\Phi}(x_j^\pi), \qquad h = \int \phi(x, y)\, \pi(x, y)\, dx\, dy \approx \frac{1}{N^\pi} \sum_{j=1}^{N^\pi} \phi(x_j^\pi, y_j^\pi),
[0125] where \bar{\Phi}(x) is defined as:

\bar{\Phi}(x) \triangleq \int \phi(x, y)\, \phi(x, y)^T\, dy.
[0128] Since the basis function shown in Eq. (18) is used, this
integral can be computed analytically. The estimated weight of
LSCDE is given by Eq. (17). In order to assure that the estimated
ratio is a conditional density, the solution should be normalized
when it is used to estimate the cost and value function.
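A sketch of the LSCDE computation follows, assuming the Gaussian basis of Eq. (18) so that the integral defining \bar{\Phi}(x) has the closed form used in the comments. The data, hyperparameters, and the final normalization step (omitted) are illustrative choices, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(1)

def lscde(X, Y, sigma=0.5, lam=0.1):
    """LSCDE sketch: estimate r(x, y) ~ pi(y|x) from paired samples
    (X, Y) drawn from pi, with Gaussian basis functions centred on the
    sample pairs."""
    n, dy = Y.shape

    def k(A, B):
        # exp(-||a - b||^2 / (2 sigma^2)) for all pairs (a in A, b in B)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))

    Kx, Ky = k(X, X), k(Y, Y)
    # Closed form: integral of exp(-(||y-yi||^2 + ||y-yj||^2)/(2 sigma^2))
    # over y equals (pi sigma^2)^(dy/2) exp(-||yi-yj||^2 / (4 sigma^2)).
    d2y = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    Iy = (np.pi * sigma ** 2) ** (dy / 2) * np.exp(-d2y / (4 * sigma ** 2))
    H = (Kx.T @ Kx / n) * Iy          # average of Phi_bar(x_l) over samples
    h = (Kx * Ky).mean(axis=0)        # average of phi(x_l, y_l)
    alpha = np.maximum(0.0, np.linalg.solve(H + lam * np.eye(n), h))

    def r_hat(x, y):
        return float(alpha @ (k(np.atleast_2d(x), X)[0] * k(np.atleast_2d(y), Y)[0]))
    return r_hat

# Toy check: y = x + small noise, so pi(y|x) concentrates near y = x.
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = X + rng.normal(0.0, 0.3, size=(200, 1))
r_hat = lscde(X, Y)
```

As the text notes, the returned estimate should additionally be normalized over y when a proper conditional density is needed for the cost and value estimation step.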
[0129] <2.3.3. LogReg>
[0130] LogReg is a method of density ratio estimation using a logistic regression. Let us assign a selector variable η = -1 to samples from the uncontrolled probability and η = 1 to samples from the controlled probability:

p(z) = \Pr(z \mid \eta = -1), \qquad \pi(z) = \Pr(z \mid \eta = 1).
[0131] The density ratio can be represented by applying the Bayes rule as follows:

\frac{\pi(z)}{p(z)} = \frac{\Pr(\eta = -1)}{\Pr(\eta = 1)} \cdot \frac{\Pr(\eta = 1 \mid z)}{\Pr(\eta = -1 \mid z)}.
[0132] The first ratio Pr(η = -1)/Pr(η = 1) is estimated by N^p/N^π, and the second ratio is computed after estimating the conditional probability p(η|z) by a logistic regression classifier:

p(\eta \mid z) = \frac{1}{1 + \exp(-\eta\, \alpha^T \phi(z))},
[0133] where η can be regarded as a label. It should be noted that a logarithm of the density ratio is given by a linear model in the case of LogReg:

\ln \frac{\pi(z)}{p(z)} = \alpha^T \phi(z) + \ln \frac{N^p}{N^\pi}.
[0134] The second term ln(N^p/N^π) can be ignored in our IRL formulation shown in Eq. (15). The objective function is derived from the negative regularized log-likelihood expressed by:

J(\alpha) = \sum_{i=1}^{N^p} \ln\!\big( 1 + \exp(\alpha^T \phi(z_i^p)) \big) + \sum_{j=1}^{N^\pi} \ln\!\big( 1 + \exp(-\alpha^T \phi(z_j^\pi)) \big) + \frac{\lambda}{2} \| \alpha \|_2^2.
[0135] A closed-form solution cannot be derived, but the objective can be minimized efficiently by standard nonlinear optimization methods since it is convex.
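A minimal sketch of the LogReg estimator follows, using plain gradient descent on the convex objective above. The Gaussian basis centred on the π-samples, the toy densities, and all hyperparameters are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(2)

def logreg_log_ratio(Z_p, Z_pi, sigma=0.5, lam=0.1, lr=0.1, n_iter=2000):
    """Fit a logistic classifier separating pi-samples (label eta = +1)
    from p-samples (label eta = -1); the score alpha^T phi(z) then models
    ln pi(z)/p(z) up to the constant ln N^p/N^pi, which Eq. (15) allows
    us to ignore."""
    centers = Z_pi  # Gaussian basis centred on the pi-samples (assumed)

    def phi(Z):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))

    Phi_p, Phi_pi = phi(Z_p), phi(Z_pi)
    alpha = np.zeros(len(centers))
    for _ in range(n_iter):
        # Gradient of the convex objective J(alpha) given in the text
        g = (Phi_p.T @ (1.0 / (1.0 + np.exp(-Phi_p @ alpha)))
             - Phi_pi.T @ (1.0 / (1.0 + np.exp(Phi_pi @ alpha)))
             + lam * alpha)
        alpha -= lr * g / (len(Z_p) + len(Z_pi))
    return lambda Z: phi(Z) @ alpha  # estimate of ln pi(z)/p(z)

# Toy check: p = N(0,1), pi = N(1,1); the true log ratio z - 1/2 is
# increasing in z, so the estimate should be larger at z = 1 than z = -1.
Z_p = rng.normal(0.0, 1.0, size=(200, 1))
Z_pi = rng.normal(1.0, 1.0, size=(200, 1))
log_r = logreg_log_ratio(Z_p, Z_pi)
```

Note that the classifier directly outputs the log of the density ratio, which is exactly the quantity needed as the regression target R̂ in Section 2.4.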
[0136] <2.4. Estimating the Cost and Value Functions>
[0137] Once the density ratio .pi.(y|x)/p(y|x) is estimated, the
least squares method with regularization is applied to estimate the
state-dependent cost function q(x) and value function V(x). Suppose that \hat{R}(x, y) is an approximation of the negative log ratio:

\hat{R}(x, y) = -\ln \frac{\pi(y \mid x)}{p(y \mid x)},

[0140] and consider linear approximators of V(x) and q(x) as defined in Eqs. (6) and (9), respectively. The objective function is given by:
J(w_q, w_V) = \frac{1}{2} \sum_{j=1}^{N^\pi} \left[ w_V^T \big( \gamma \psi_V(y_j^\pi) - \psi_V(x_j^\pi) \big) + w_q^T \psi_q(x_j^\pi) - \hat{R}(x_j^\pi, y_j^\pi) \right]^2 + \frac{\lambda_q}{2} \| w_q \|_1 + \frac{\lambda_V}{2} \| w_V \|_2^2,
[0141] where .lamda..sub.q and .lamda..sub.v are regularization
constants. L2 regularization is used for w.sub.v because L2
regularization is an effective means of achieving numerical
stability. On the other hand, L1 regularization is used for w.sub.q
to yield sparse models that are more easily interpreted by the
experimenters. It is possible to use L2 regularization for w.sub.q
if sparseness is not important. In addition, the non-negative
constraints of w_q and w_V are not introduced, because Eq. (12) can be used by setting

C = -\min_x \hat{q}(x)

[0142] to satisfy the non-negativity of the cost function efficiently.
[0143] Theoretically, we can choose arbitrary basis functions. In one embodiment of the present invention, a Gaussian function shown in Eq. (18) is used for simplicity:

\psi_{q,j}(x) \equiv \psi_{V,j}(x) = \exp\left( -\frac{\| x - x_j^\pi \|^2}{2 \sigma^2} \right),

[0144] where σ is a width parameter and the center position x_j^π is randomly selected from D^π.
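The regularized least squares step of this section can be sketched as below, with the L1 term on w_q handled by proximal gradient descent (ISTA soft-thresholding). The random features, true weights, and hyperparameters are synthetic stand-ins used only to exercise the estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

def estimate_cost_value(Psi_x, Psi_y, R_hat, gamma=0.94,
                        lam_q=0.01, lam_V=0.01, lr=0.05, n_iter=5000):
    """Minimise sum_j [w_V^T (gamma psi(y_j) - psi(x_j)) + w_q^T psi(x_j)
    - R_hat_j]^2 with L2 regularization on w_V and L1 on w_q."""
    n, k = Psi_x.shape
    A_V = gamma * Psi_y - Psi_x  # features multiplying w_V, from Eq. (11)
    w_q, w_V = np.zeros(k), np.zeros(k)
    for _ in range(n_iter):
        resid = A_V @ w_V + Psi_x @ w_q - R_hat
        w_V -= lr * (A_V.T @ resid / n + lam_V * w_V)  # L2-regularized step
        w_q -= lr * (Psi_x.T @ resid / n)              # plain gradient step
        # soft-thresholding implements the L1 penalty on w_q (sparsity)
        w_q = np.sign(w_q) * np.maximum(np.abs(w_q) - lr * lam_q, 0.0)
    return w_q, w_V

# Synthetic check: generate R_hat from known weights and hypothetical
# random features, then verify the fit reproduces R_hat closely.
n, k = 200, 5
Psi_x = rng.normal(size=(n, k))
Psi_y = rng.normal(size=(n, k))
w_q_true = np.array([1.0, 0.0, 0.5, 0.0, 0.0])  # sparse true cost weights
w_V_true = rng.normal(size=k)
R_hat = (0.94 * Psi_y - Psi_x) @ w_V_true + Psi_x @ w_q_true
w_q_est, w_V_est = estimate_cost_value(Psi_x, Psi_y, R_hat)
pred = (0.94 * Psi_y - Psi_x) @ w_V_est + Psi_x @ w_q_est
```

The soft-thresholding step is what produces the sparse w_q that the text motivates for interpretability; replacing it with an L2 step recovers ordinary ridge regression.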
[0147] <3. Experiments>
[0148] <3.1. Swing-Up Inverted Pendulum>
[0149] <3.1.1. Task Description>
[0150] To demonstrate and confirm the effectiveness of the
above-described embodiments of the present invention, the present
inventors have studied a swing-up inverted pendulum problem in which the state vector is given by a two-dimensional vector x = [θ, θ̇]^T, where θ and θ̇ denote the angle and the angular velocity of the pole, respectively. The equation of motion is given by the following stochastic differential equation:

dx = \begin{bmatrix} \dot{\theta} \\ m g l \sin\theta - \kappa \dot{\theta} \end{bmatrix} dt + \begin{bmatrix} 0 \\ 1 \end{bmatrix} (u\, dt + \sigma_e\, d\omega),
[0151] where l, m, g, κ, σ_e, and ω denote the length of the pole, the mass, the gravitational acceleration, the coefficient of friction, the scaling parameter for the noise, and the Brownian noise, respectively. As opposed to the previous studies (Deisenroth et al., 2009, NPL 4; Doya, 2000, NPL 5), the applied torque u is not restricted, and it is possible to swing up directly. By discretizing the time axis with step h, the corresponding state transition probability P_T(y|x, u), which is represented by a Gaussian distribution, is obtained. In this simulation, the parameters are given as follows: l = 1 [m], m = 1 [kg], g = 9.8 [m/s^2], κ = 0.05 [kg·m^2/s], h = 0.01 [s], σ_e = 4, and

\gamma = 10^{-0.025} \approx 0.94.
[0152] The present inventors have conducted a series of experiments
by changing (1) the state dependent cost function q(x), (2) the
uncontrolled probability p(y|x), and (3) the datasets D.sup.P and
D.sup..pi. as follows.
[0153] <Cost Function>
[0154] The goal is to keep the pole upright and the following three
cost functions are prepared:
q_{\cos}(x) = 1 - \cos\theta, \qquad q_{\mathrm{quad}}(x) = x^T Q x, \qquad q_{\exp}(x) = 1 - \exp\left( -\frac{1}{2} x^T Q x \right), \qquad (19)

[0155] where Q = diag[1, 0.2]. q_cos(x) is used by Doya (2000) (NPL 5), while q_exp(x) is used by Deisenroth et al. (2009) (NPL 4).
[0156] <Uncontrolled Probability>
[0157] Two densities p_G(y|x) and p_m(y|x) are considered. p_G(y|x) is constructed by using a stochastic policy π(u|x) represented by a Gaussian distribution. Since the equation of motion in discrete time is given by the Gaussian, p_G(y|x) is also Gaussian. In the case of p_m(y|x), a mixture of Gaussian distributions is used as a stochastic policy.
[0158] <Preparation of the Datasets>
[0159] Two sampling methods are considered. One is the uniform
sampling and the other is the trajectory-based sampling. In the
uniform sampling method, x is sampled from a uniform distribution
defined over the entire state space. In other words, p(x) and
.pi.(x) are regarded as a uniform distribution. Then, y is sampled
from the uncontrolled and the controlled probability to construct
D.sup.P and D.sup..pi., respectively. In the trajectory-based
sampling method, p(y|x) and .pi.(y|x) are used to generate
trajectories of states from the same start state x. Then, a pair of
state transitions are randomly selected from the trajectories to
construct D.sup.P and D.sup..pi.. It is expected that p(x) is
different from .pi.(x).
[0160] For each cost function, the corresponding value function is
calculated by solving Eq. (4) and the corresponding optimal
controlled probability is evaluated by Eq. (5). In the previous method (Todorov, 2009b, NPL 25), exp(-V(x)) is represented by a linear model, but this is difficult under the objective function (1) because the discount factor γ makes the linear model complicated. Therefore, the value function is approximated by the linear model shown in Eq. (6), and the Metropolis-Hastings algorithm is used to evaluate the integral.
[0161] The methods according to the embodiments of the present
invention can be compared with OptV because the assumptions of OptV
are identical to those of our methods according to the embodiments
of the present invention. According to the choice of the density
ratio estimation methods, there exist several variants as described
above. More specifically, the following six algorithms are
considered: (1) LSCDE-IRL, (2) uLSIF-IRL, (3) LogReg-IRL, (4)
Gauss-IRL, (5) LSCDE-OptV, which is the OptV method where p(y|x) is
estimated by LSCDE, and (6) Gauss-OptV, where the Gaussian process
method is used to estimate p(y|x).
[0162] We set the number of samples in D^p and D^π at N^p = N^π = 300. The parameters λ_q, λ_V, σ, and γ are optimized by cross-validation from the following regions:

\log \lambda_q, \log \lambda_V \in \mathrm{linspace}(-3, 1, 9), \qquad \log \sigma \in \mathrm{linspace}(-1.5, 1.5, 9), \qquad \log \gamma \in \mathrm{linspace}(-0.2, 0, 9),

[0163] where linspace(x_min, x_max, n) generates a set of n points equally spaced between x_min and x_max.
[0164] <3.1.2. Experimental Results>
[0165] The accuracy of the estimated cost functions is measured by the normalized squared error for the test samples:

\mathrm{error} \triangleq \sum_{j=1}^{N} \frac{\big( q(x_j) - \hat{q}(x_j) \big)^2}{\big( q(x_j) \big)^2},

[0166] where q(x_j) is one of the true cost functions shown in Eq. (19) at state x_j, and \hat{q}(x_j) is the estimated cost function.
compare the accuracy of the IRL methods of the present embodiments;
it is shown that our methods (1)-(4) performed better than OptV
methods (5)-(6) in all settings. More specifically, LogReg-IRL
showed the best performance, but there were no significant
differences among our methods (1)-(3). The error of the cost estimated by Gauss-IRL increased significantly if the stochastic policy π(u|x) was given by the mixture of Gaussians, because the standard Gaussian process cannot represent a mixture of Gaussians.
[0169] FIG. 2 shows the cross-validation error for the discount factor γ, where the other parameters λ_q, λ_V, and σ are set to the optimal values. In this simulation, the cross-validation error was minimum at the true discount factor γ = 10^{-0.025} ≈ 0.94 in all the methods. As shown in FIG. 2, and also as explained for FIG. 1 above, the embodiments of the present invention have been proven to have sufficiently small errors, confirming the effectiveness of the present invention.
[0171] <3.2. Human Behavior Analysis>
[0172] <3.2.1. Task Description>
[0173] In order to evaluate our IRL algorithm in a realistic situation, the present inventors have conducted a dynamic motor control, pole-balancing experiment. FIG. 3 shows the experimental setup. A subject can move the base left, right, up, and down to swing the pole several times and decelerate the pole to balance it at the upright position. The dynamics is described by the six-dimensional state vector

x = [\theta, \dot{\theta}, x, \dot{x}, y, \dot{y}]^T,

[0174] where θ and θ̇ are the angle and angular velocity of the pole, x and y are the horizontal and vertical positions of the base, and ẋ and ẏ are their time derivatives, respectively.
[0181] The task was performed under two conditions: long pole (73 cm) and short pole (29 cm). Each subject had 15 trials to balance the pole in each condition. Each trial ended when the subject could keep the pole upright for 3 seconds or when 40 seconds elapsed. We collected the data from 7 subjects (5 right-handed and 2 left-handed), and the trajectory-based sampling method was used to construct two datasets of controlled probability for the i-th subject: one for training and one for testing. It is assumed that all subjects had a unique uncontrolled probability p(y|x), which was generated by a random policy. This means that the training and test datasets of the uncontrolled probability are shared among subjects. The number of samples in the datasets was 300.
[0190] <3.2.2. Experimental Results>
[0191] FIG. 4 shows the learning curves of the seven subjects, which indicates that the learning processes were quite different among subjects. Subjects Nos. 1 and 3 could not accomplish the task. Since a set of successful trajectories should be used by the IRL algorithms, we picked up the data from the five subjects Nos. 2 and 4-7.
[0192] The experimental results in the case of using LogReg-IRL will be described below (LSCDE-IRL and uLSIF-IRL showed similar results). FIG. 5 shows the estimated cost functions of the subjects 4, 5, and 7 projected onto the subspace (θ, θ̇), while x, y, ẋ, and ẏ are set to zero for visualization. In the case of subject 7, the cost function of the long pole condition was not so different from that of the short pole condition, while there was a significant difference in those of subject 5, who did not perform well in the short pole condition as shown in FIG. 4.
[0199] In order to evaluate the cost functions estimated from the training datasets, we applied forward reinforcement learning to find the optimal controlled transition probability for the estimated cost function and then computed the negative log-likelihood for the test datasets:

\mathrm{NLL}(i) \triangleq -\frac{1}{N_{i,\mathrm{te}}^\pi} \sum_{j=1}^{N_{i,\mathrm{te}}^\pi} \ln \hat{\pi}(y_j \mid x_j),

[0200] where N_{i,te}^π is the number of samples in the test dataset of the i-th subject. FIG. 6 shows the results. In the left panel (a), we used the test dataset of subject 4 in the long pole condition. The minimum negative log-likelihood was achieved by the cost function estimated from the training dataset of the same subject and condition.
[0210] of the same condition. The right panel (b) of FIG. 6 shows
that the test data of the subject 7 in both the long and short pole
conditions were best predicted by the cost function estimated from
the training dataset of the same subject 7 only in the long pole
condition. Thus, the effectiveness and usefulness of the
embodiments of the present invention have been confirmed and
demonstrated by this experiment as well.
[0211] The present disclosure presented a novel inverse
reinforcement learning under the framework of LMDP. One of the
features of the present invention is to show Eq. (11), which means
the temporal difference error is zero for the optimal value
function with the corresponding cost function. Since the right hand
side of Eq. (11) can be estimated from samples by the efficient
methods of density ratio estimation, the IRL of the present invention results in a simple least-squares method with regularization. In addition, the method according to the embodiments of the present invention does not need to compute the integral, which is usually intractable in high-dimensional continuous problems. As a result, the disclosed method is computationally less expensive than OptV.
[0212] LMDP and path integral methods have been receiving attention recently in the fields of robotics and machine learning (Theodorou & Todorov, 2012, NPL 22) because there exist a
number of interesting properties in the linearized Bellman equation
(Todorov, 2009a, NPL 24). They have been successfully applied to
learning of stochastic policies for robots with large degrees of
freedom (Kinjo et al., 2013, NPL 11; Stulp & Sigaud, 2012, NPL
17; Sugimoto and Morimoto, 2011, NPL 18; Theodorou et al., 2010,
NPL 21). The IRL methods according to the embodiments of the
present invention may be integrated with the existing forward
reinforcement learning methods to design complicated
controllers.
[0213] As described above, in at least some aspects of the present
invention, the present disclosure provides a computational
algorithm that can infer the reward/cost function from observed
behaviors effectively. The algorithm of the embodiments of the
present invention can be implemented in general-purpose computer
systems with appropriate hardware and software as well as
specifically designed proprietary hardware/software. Various
advantages according to at least some embodiments of the present
invention include:
[0214] A) Model-free method/system: the method and system according to the embodiments of the present invention do not need to know the environmental dynamics in advance; i.e., the method/system is regarded as a model-free method, and it is not necessary to model the target dynamics explicitly, although some prior art approaches assume that the environmental dynamics is known in advance.
[0215] B) Data efficient: the dataset for the method and system according to the embodiments of the present invention consists of a set of state transitions, while many previous methods require a set of trajectories of states. Thus, in the method and system according to the embodiments of the present invention, it is easier to collect the data.
[0216] C) Computationally efficient (1): the method and system
according to the embodiments of the present invention do not need
to solve a (forward) reinforcement learning problem. In contrast,
some previous methods required solving such a forward reinforcement
learning problem many times with the estimated reward/cost
function. That computation must be performed for each candidate, and it usually takes a long time to find the optimal solution.
[0217] D) Computationally efficient (2): the method and system
according to the embodiments of the present invention use two
optimization algorithms: (a) density ratio estimation and (b)
regularized least squares. In contrast, some previous methods use a stochastic gradient method or a Markov chain Monte Carlo method, which usually takes more time to optimize than least-squares methods.
[0218] As described above, in one aspect, the present invention
provides inverse reinforcement learning that can infer the
objective function from observed state transitions generated by
demonstrators. FIG. 7 schematically shows a framework of the method
according to an embodiment of the present invention. An embodiment
of the inverse reinforcement learning according to the present
invention includes two components: (1) learning the ratio of state
transition probabilities with and without control by density ratio
estimation and (2) estimation of the cost and value functions that
are compatible with the ratio of transition probabilities by a
regularized least squares method. By the use of efficient
algorithms for each step, the embodiments of the present invention
are more efficient in data and computation than other inverse
reinforcement learning methods.
[0219] The industrial applicability and usefulness of inverse reinforcement learning have been well understood and recognized.
Examples of the system/configuration to which the embodiments of
the present invention can be applied are described below.
[0220] <Imitation Learning of Robot Behaviors>
[0221] Programming robots to perform complex tasks is difficult
with standard methods such as motion planning. In many situations,
it is much easier to demonstrate the desired behaviors to the
robot. However, a major drawback of classical imitation learning is
that the obtained controller cannot cope with new situations
because it just reproduces the demonstrated movements. Embodiments
of the present invention can estimate the objective function from the demonstrated behaviors, and then the estimated objective function can be used for learning different behaviors for different situations.
[0222] FIG. 8 schematically shows such an implementation of the
present invention. First, the demonstrator controls a robot to
accomplish a task and the sequence of states and actions is
recorded. Then an inverse reinforcement learning component
according to an embodiment of the present invention estimates the
cost and value functions, which are then given to forward
reinforcement learning controllers for different robots.
[0223] <Interpretation of Human Behaviors>
[0224] Understanding of the human intentions behind behaviors is a
basic issue in building a user-friendly support system. In general,
a behavior is represented by a sequence of states, which are
extracted by the motion tracking system. The cost function
estimated by the inverse reinforcement learning method/system
according to an embodiment of the present invention can be regarded
as a compact representation to explain the given behavioral
dataset. Through pattern classification of the estimated cost
functions, it becomes possible to estimate the user's expertise or
preference. FIG. 9 schematically shows this implementation
according to an embodiment of the present invention.
[0225] <Analysis of the Web Experience>
[0226] In order to increase the likelihood that visitors read the articles presented to them, the designers of online news websites, for example, should investigate the web experiences of visitors from a viewpoint of decision making. In particular, recommendation systems are receiving attention as an important business application for personalized services. However,
previous methods such as collaborative filtering do not consider
the sequences of decision making explicitly. Embodiments of the
present invention can provide a different and effective way to
model the behaviors of visitors during net surfing. FIG. 10 shows
an example of a series of clicking actions by a user, indicating
what topics were accessed in what order by the user. The topic that
the visitor is reading is regarded as the state and clicking the
link is considered as the action. Then, inverse reinforcement
learning according to an embodiment of the present invention can
analyze the decision-making in the user's net surfing. Since the
estimated cost function represents the preference of the visitor,
it becomes possible to recommend a list of articles for the
user.
[0227] As described above, the inverse reinforcement learning
schemes according to embodiments of the present invention are
applicable to a wide variety of industrial and/or commercial
systems. FIG. 11 shows an example of the implementation using a
general computer system and a sensor system. The methods explained
above with mathematical equations can be implemented in such a
general computer system, for example. As shown in the figure, the
system of this example includes a sensor system 111 (an example of
a data acquisition unit) to receive information about state
transitions--i.e., observed behavior--from the object being
observed. The sensor system 111 may include one or more of an image
capturing device with image processing software/hardware,
displacement sensors, velocity sensors, acceleration sensors,
microphone, keyboards, and any other input devices. The sensor
system 111 is connected to a computer 112 having a processor 113
with an appropriate memory 114 so that the received data can be
analyzed according to embodiments of the present invention. The
result of the analysis is outputted to any output system 115, such
as a display monitor, controllers, drivers, etc. (examples of an
output interface), or, an object to be controlled in the case of
utilizing the results for control. The result can be used to
program or transferred to another system, such as another robot or
computer, or website software that responds to user's interaction,
as described above.
[0228] In the case of predicting the user's web article preference
described above, the implemented system may include a system for
inverse reinforcement learning as described in any one of the
embodiments above, implemented in a computer connected to the
Internet. Here, the state variables that define the behaviors of
the user include topics of articles selected by the user while
browsing each webpage. Then, the result of the inverse
reinforcement learning is used to cause an interface through which
the user is browsing Internet websites, such as portable
smartphone, personal computer, etc., to display a recommended
article for the user.
[0229] It will be apparent to those skilled in the art that various
modification and variations can be made in the present invention
without departing from the spirit or scope of the invention. Thus,
it is intended that the present invention cover modifications and
variations that come within the scope of the appended claims and
their equivalents. In particular, it is explicitly contemplated
that any part or whole of any two or more of the embodiments and
their modifications described above can be combined and regarded
within the scope of the present invention.
* * * * *