U.S. patent application number 13/248296 was filed with the patent office on September 29, 2011, and published on April 5, 2012 as publication number 20120084237, for data processing device, data processing method, and program.
This patent application is currently assigned to SONY CORPORATION. Invention is credited to Takashi HASUO, Kenta KAWAMOTO, Kohtaro SABE, Yukiko YOSHIKE.
United States Patent Application: 20120084237
Kind Code: A1
Application Number: 13/248296
Family ID: 45890672
Published: April 5, 2012
Inventors: HASUO; Takashi; et al.
DATA PROCESSING DEVICE, DATA PROCESSING METHOD, AND PROGRAM
Abstract
A data processing device includes: a state value calculation unit which calculates, for each state of a state transition model, a state value whose value is larger for a state with a higher transition probability; an action value calculation unit which calculates, for each state of the state transition model and each action that the agent can perform, an action value whose value increases as the transition probability increases; a target state setting unit which sets a state with great unevenness in the action value, among the states of the state transition model, as a target state to be reached by an action performed by the agent; and an action selection unit which selects an action of the agent so as to move toward the target state.
Inventors: HASUO; Takashi; (Tokyo, JP); SABE; Kohtaro; (Tokyo, JP); KAWAMOTO; Kenta; (Tokyo, JP); YOSHIKE; Yukiko; (Tokyo, JP)
Assignee: SONY CORPORATION, Tokyo, JP
Family ID: 45890672
Appl. No.: 13/248296
Filed: September 29, 2011
Current U.S. Class: 706/12
Current CPC Class: G06N 3/006 20130101; G06N 20/10 20190101; G06N 20/00 20190101
Class at Publication: 706/12
International Class: G06F 15/18 20060101 G06F015/18
Foreign Application Data
Date | Code | Application Number
Oct 4, 2010 | JP | 2010-225156
Claims
1. A data processing device comprising: a state value calculation
unit which calculates a state value having a predetermined state of
a state transition model, in which a state is transited by an
action performed by an agent that can act, set as a reference, of
which the value increases as much as a state with a high transition
probability to a state close to the predetermined state, for each
state of the state transition model based on the state transition
model of each action; an action value calculation unit which
calculates an action value, of which the value increases as a
transition probability to a state with a high state value having
the predetermined state set as a reference increases, for each
state of the state transition model and each action that the agent
can perform, based on the state transition model and the state
value having the predetermined state set as a reference; a target
state setting unit which sets a state with great unevenness in the
action value among states of the state transition model to a target
state that is the target to reach by an action performed by the
agent, based on the action value; and an action selection unit
which selects an action of the agent so as to move toward the
target state.
2. The data processing device according to claim 1, further
comprising: a state recognition unit which recognizes the current
state which is a state where an observation value, which is
observed by the agent from the outside, is observed among states of
the state transition model based on the observation value, wherein
the predetermined state is the current state, and wherein the state
value calculation unit calculates a state value having the current
state set as a reference, of which the value increases as much as a
state with a high transition probability to a state close to the
current state.
3. The data processing device according to claim 2, wherein the
action selection unit calculates a state value having the target state set as a reference, of which the value increases as much as a state with a high transition probability to a state close to the target state, for each state of the state transition model based on the state transition model, calculates an action value, of which the value increases as a transition probability to a state with a high state value having the target state set as a reference increases, for
each state of the state transition model and each action that the
agent can perform based on the state transition model and the state
value having the target state set as a reference, and selects an
action of the agent so as to move toward the target state based on
an action value of the current state.
4. The data processing device according to claim 3, further
comprising: a model updating unit which updates a state transition
model for an action of the agent, in which state transition to the
current state occurs, based on the state transition to the current
state.
5. The data processing device according to claim 4, wherein the
state transition model for a predetermined action indicates a
frequency of transition to a second state by the predetermined
action by the agent in a first state, and wherein the model
updating unit updates the state transition model by increasing the
frequency.
6. The data processing device according to claim 5, wherein the
agent acts in an action environment where the agent acts, assuming
a predetermined space as the action environment, and observes a
position of the agent in the action environment as the observation
value, and wherein the state indicates a small area obtained by
dividing the action environment into such small areas.
7. The data processing device according to claim 6, wherein the
action selection unit determines whether or not the current state
coincides with the target state, and selects an action of the agent
so as to move toward the target state based on an action value of
the current state when the current state does not coincide with the
target state.
8. The data processing device according to claim 7, wherein, when
the current state coincides with the target state, the state value
calculation unit re-calculates a state value having the current
state set as a reference based on the state transition model, the
action value calculation unit re-calculates the action value based
on the state transition model and the state value having the
current state set as a reference, and the target state setting unit
re-sets the target state based on the action value.
9. The data processing device according to claim 2, wherein the
target state setting unit obtains a variance of the action value
for each state of the state transition model, and sets a state to
be reached from the current state by state transitions within a
predetermined number of times among states, in which a variance of
the action value is equal to or higher than a predetermined
threshold value, to the target state.
10. The data processing device according to claim 3, wherein the action selection unit selects an action of the agent so as to move toward the target state based on the action value of the current state with an ε-greedy method or a softmax method.
11. A data processing method of a data processing device,
comprising: calculating a state value having a predetermined state
of a state transition model, in which a state is transited by an
action performed by an agent that can act, set as a reference, of
which the value increases as much as a state with a high transition
probability to a state close to the predetermined state, for each
state of the state transition model based on the state transition
model of each action; calculating an action value, of which the
value increases as a transition probability to a state with a high
state value having the predetermined state set as a reference
increases, for each state of the state transition model and each
action that the agent can perform, based on the state transition
model and the state value having the predetermined state set as a
reference; setting a state with great unevenness in the action
value among states of the state transition model to a target state
that is the target to reach by an action performed by the agent,
based on the action value; and selecting an action of the agent so
as to move toward the target state.
12. A program causing a computer to function as: a state value
calculation unit which calculates a state value having a
predetermined state of a state transition model, in which a state
is transited by an action performed by an agent that can act, set
as a reference, of which the value increases as much as a state
with a high transition probability to a state close to the
predetermined state, for each state of the state transition model
based on the state transition model of each action; an action value
calculation unit which calculates an action value, of which the
value increases as a transition probability to a state with a high
state value having the predetermined state set as a reference
increases, for each state of the state transition model and each
action that the agent can perform, based on the state transition
model and the state value having the predetermined state set as a
reference; a target state setting unit which sets a state with
great unevenness in the action value among states of the state
transition model to a target state that is the target to reach by
action performed by the agent, based on the action value; and an
action selection unit which selects an action of the agent so as to
move toward the target state.
Description
BACKGROUND
[0001] The present disclosure relates to a data processing device,
a data processing method, and a program, and particularly to a data
processing device, a data processing method, and a program which
enable an agent that can autonomously perform various actions
(autonomous agent) to efficiently perform learning of an unknown
environment.
[0002] For example, as a learning method by which an agent that can perform actions, such as a robot acting in the real world or a virtual character acting in a virtual world, learns to act in an unknown environment, there is reinforcement learning, through which the agent learns rules of action stage by stage (Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore, "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research 4 (1996) 237-285).
[0003] In reinforcement learning, an action value is calculated (estimated) for each action U performed by the agent in order to reach a targeted state (target state), in the state recognized based on an observation value observed from the outside (the environment or the like) (the current state).
[0004] When the action values for reaching the target state are
calculated, the agent can perform actions for reaching the target
state by controlling the actions based on the action values.
SUMMARY
[0005] The agent can perform action control based on such action values only after it has reached the target state and the action values used for reaching the target state have been calculated through reinforcement learning.
[0006] Until it reaches the target state, the agent therefore has to perform actions selected at random from, for example, the actions that the agent can perform, which makes it difficult to efficiently perform learning of the unknown environment (reinforcement learning).
[0007] In other words, when there is, for example, a narrow passage
that is hard for the agent to pass through in the environment where
the agent acts (action environment), the agent that performs
randomly selected actions is not able to pass through the narrow
passage, and as a result, it is difficult for the agent to learn
the environment after passing through the narrow passage.
[0008] In addition, when gravity is set in an action environment in which the agent can move upward and downward, for example, it is difficult for an agent that performs randomly selected actions to move to the upper side of the action environment due to the influence of gravity, and as a result, it is difficult to learn the upper side of the action environment.
[0009] The disclosure takes the above circumstances into
consideration, and it is desirable to be able to efficiently learn
an unknown environment.
[0010] According to an embodiment of the disclosure, there is provided a data processing device including, or a program which causes a computer to function as a data processing device including: a state value calculation unit which calculates a state
value having a predetermined state of a state transition model, in
which a state is transited by an action performed by an agent that
can act, set as a reference, of which the value increases as much
as a state with a high transition probability to a state close to
the predetermined state, for each state of the state transition
model based on the state transition model of each action; an action
value calculation unit which calculates an action value, of which
the value increases as a transition probability to a state with a
high state value having the predetermined state set as a reference
increases, for each state of the state transition model and each
action that the agent can perform, based on the state transition
model and the state value having the predetermined state set as a
reference; a target state setting unit which sets a state with
great unevenness in the action value among states of the state
transition model to a target state that is the target to reach by
action performed by the agent, based on the action value; and an
action selection unit which selects an action of the agent so as to
move toward the target state.
[0011] According to another embodiment of the disclosure, there is
provided a data processing method of the data processing device
including calculating a state value having a predetermined state of
a state transition model, in which a state is transited by an
action performed by an agent that can act, set as a reference, of
which the value increases as much as a state with a high transition
probability to a state close to the predetermined state, for each
state of the state transition model based on the state transition
model of each action, calculating an action value, of which the
value increases as a transition probability to a state with a high
state value having the predetermined state set as a reference
increases, for each state of the state transition model and each
action that the agent can perform, based on the state transition
model and the state value having the predetermined state set as a
reference, setting a state with great unevenness in the action
value among states of the state transition model to a target state
that is the target to reach by an action performed by the agent,
based on the action value, and selecting an action of the agent so
as to move toward the target state.
[0012] In the above embodiments, a state value having a
predetermined state of a state transition model, in which a state
is transited by an action performed by an agent that can act, set
as a reference, is calculated of which the value increases as much
as a state with a high transition probability to a state close to
the predetermined state, for each state of the state transition
model based on the state transition model of each action, and an
action value is calculated, of which the value increases as a
transition probability to a state with a high state value having
the predetermined state set as a reference increases, for each
state of the state transition model and each action that the agent
can perform, based on the state transition model and the state
value having the predetermined state set as a reference. In
addition, a state with great unevenness in the action value among
states of the state transition model is set to a target state that
is the target to reach by an action performed by the agent, based
on the action value, and an action of the agent so as to move
toward the target state is selected.
[0013] Furthermore, the data processing device may be an
independent device, or an internal block included in one
device.
[0014] In addition, the program can be transmitted through a
transmission medium, or provided by being recorded on a recording
medium.
[0015] According to the embodiments of the disclosure, it is
possible to efficiently learn an unknown environment where an agent
acts.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1 is a diagram illustrating the overview of a
configuration example of an embodiment of a data processing device
to which the disclosure is applied;
[0017] FIG. 2 is a block diagram showing a configuration example of
an agent;
[0018] FIG. 3 is a diagram illustrating an action environment where
the agent acts and an example of an action that the agent can
perform;
[0019] FIG. 4 is a diagram illustrating an example of a state
transition model of the agent acting in the action environment;
[0020] FIG. 5 is a block diagram showing a configuration example of
a learning unit;
[0021] FIG. 6 is a flowchart illustrating a learning process;
[0022] FIG. 7 is a block diagram showing a configuration example of
an action control unit;
[0023] FIG. 8 is a diagram illustrating a process of a state value
calculation unit;
[0024] FIG. 9 is a diagram showing an example of a variance and an
action value obtained for each small area obtained by partitioning
an action area as a state;
[0025] FIG. 10 is a diagram showing an example of an existence
probability obtained for each small area obtained by partitioning
an action area as a state;
[0026] FIG. 11 is a diagram showing an example of a state value
having a target state set as a reference;
[0027] FIG. 12 is a diagram showing the state where the agent goes
toward the target state;
[0028] FIG. 13 is a flowchart explaining an action control process
for learning;
[0029] FIG. 14 is a diagram illustrating an action of the agent in
an action environment;
[0030] FIG. 15 is a flowchart explaining an action control process
for an autonomous action;
[0031] FIG. 16 is a diagram illustrating an action of the agent in
the action environment;
[0032] FIG. 17 is a diagram illustrating an action of the agent to
reach an action target state while avoiding an avoidance state;
[0033] FIG. 18 is a diagram illustrating an object moving task;
[0034] FIG. 19 is a diagram illustrating a state transition model
when the object moving task is performed;
[0035] FIG. 20 is a flowchart explaining a learning process
performed by the learning unit in the object moving task;
[0036] FIG. 21 is a flowchart explaining an action control process
for an autonomous action performed by an action control unit in the
object moving task;
[0037] FIG. 22 is a flowchart explaining an action control process
for learning performed by an action control unit in the object
moving task;
[0038] FIG. 23 is a flowchart explaining the action control process
for learning performed by an action control unit in the object
moving task;
[0039] FIG. 24 is a diagram illustrating control of a posterior
probability used in obtaining an action value, using a temperature
parameter;
[0040] FIG. 25 is a diagram illustrating learning of GMM performed
when the GMM is adopted as a state of a state transition model;
[0041] FIG. 26 is a diagram showing an example of an action
environment where an agent applied with the extended HMM performs
actions;
[0042] FIGS. 27A and 27B are diagrams showing examples of actions
performed by the agent and observation values obtained by
observation by the agent in an action environment;
[0043] FIG. 28 is a flowchart explaining a learning process of the
learning unit in the agent applied with the extended HMM;
[0044] FIGS. 29A and 29B are diagrams illustrating the extended
HMM;
[0045] FIG. 30 is a flowchart explaining learning of the extended
HMM using a learning data set; and
[0046] FIG. 31 is a block diagram showing a configuration example
of an embodiment of a computer to which the disclosure is
applied.
DETAILED DESCRIPTION OF EMBODIMENTS
[0047] [An Embodiment of a Data Processing Device to which the
Disclosure is Applied]
[0048] FIG. 1 is a diagram illustrating the overview of a
configuration example of an embodiment of a data processing device
to which the disclosure is applied.
[0049] In FIG. 1, the data processing device is, for example, an
agent that performs autonomous actions, and acts in a predetermined
environment by driving an actuator.
[0050] In other words, the agent includes a sensor, and the sensor
senses a physical amount from an environment where the agent acts
(action environment), and sensor signals as observation values
corresponding to the physical amount are output.
[0051] Furthermore, the agent has a state transition model for each
action in which a state is transited by actions performed by the
agent, and the state transition model is updated using observation
values from the sensor (sensor signals) (learning of the state
transition model is performed).
[0052] In addition, the agent includes an actuator. The agent
selects an action that the agent performs based on the state
transition model, and supplies an action signal corresponding to
the action to the actuator.
[0053] The actuator is driven according to the action signal, and
accordingly, the agent performs actions corresponding to the action
signal in the action environment.
[0054] FIG. 2 is a block diagram showing a configuration example of
an agent as the data processing device of FIG. 1.
[0055] The agent includes a sensor 11, a learning unit 12, a model
storage unit 13, an action control unit 14, and an actuator 15.
[0056] The sensor 11 observes a physical amount from the outside,
that is, an action environment, and outputs an observation value
corresponding to the physical amount. The observation value output
by the sensor 11 is supplied to the learning unit 12 and the action
control unit 14.
[0057] Herein, as an observation value output by the sensor 11, for
example, (a coordinate of) a position of the agent within the
action environment is employed.
[0058] The observation value from the sensor 11 as well as the
action signal from the action control unit 14 is supplied to the
learning unit 12.
[0059] The learning unit 12 performs learning of the state transition models, that is, it updates the state transition model for each action stored in the model storage unit 13, using the observation value from the sensor 11 and the action signal from the action control unit 14.
[0060] In other words, the learning unit 12 recognizes the current
state that is a state where the observation value from the sensor
11 is observed out of states of the state transition model, based
on the observation value observed by the agent from the
outside.
[0061] Furthermore, the learning unit 12 recognizes, from the action signals from the action control unit 14, the action of the agent that brought about the state transition to the current state, and updates the state transition model of that action based on the state transition to the current state.
[0062] The model storage unit 13 stores state transition models for
each action that the agent can perform.
[0063] The action control unit 14 controls the action of the agent
based on an observation value from the sensor 11 and the state
transition model stored in the model storage unit 13.
[0064] In other words, the action control unit 14 selects an action
to be performed next (action to be performed in the current state)
among actions that the agent can perform based on the observation
value from the sensor 11 and the state transition model stored in
the model storage unit 13, and supplies an action signal
corresponding to the action to the learning unit 12 and the
actuator 15.
[0065] The actuator 15 is, for example, a motor that drives the feet of the agent or the like, or an object (program) that moves the agent, and is driven according to the action signal from the action control unit 14. By the actuator 15 being driven according to the action signal, the agent performs the action corresponding to the action signal.
[Action Environment and Action of Agent]
[0066] FIG. 3 is a diagram illustrating examples of an action
environment where the agent acts and an action that the agent can
perform.
[0067] In FIG. 3, the action environment is a predetermined space
(plane) defined with the x direction that is the right direction
from the left and the y direction that is the upper direction from
the bottom, and gravity acts in the lower direction (the opposite
direction to the y direction).
[0068] In addition, in the action environment, the position of y=0
is the ground surface, and furthermore, there are provided
platforms at several positions of y>0.
[0069] For the agent, the position (coordinate (x,y)), speed, and
acceleration thereof are defined. The position, speed, and
acceleration of the agent are continuous values.
[0070] In addition, as actions of the agent, an action U.sub.1
increasing the acceleration of the agent by a predetermined value
.alpha. to the right direction (x direction), an action U.sub.2
increasing the acceleration to the left direction (the opposite
direction to the x direction), and an action U.sub.3 increasing the
acceleration to the upper direction (y direction) are defined.
Thus, an action U that the agent can perform is expressed with
discrete values indicating the actions U.sub.1, U.sub.2, and
U.sub.3 in FIG. 3.
[0071] Furthermore, the action U.sub.3 increasing the acceleration
of the agent to the upper direction (y direction) can be performed
only when the speed of the agent to the upper direction is
zero.
[0072] In addition, since gravity acts in the action environment,
when (the bottom) of the agent does not contact the ground surface
or the platform, the swiftness (speed) of the agent toward the
lower direction increases by a predetermined value V per unit time
according to gravity.
[0073] The agent acts within the action environment as above, but
an observation value that the agent observes is only the position
of the agent, and knowledge on the action environment, that is, for
example, the platforms, the ground surface, the position of a wall,
information whether or not the agent collides with the platforms,
or the like, and information of positions to be moved, or the like,
is not given at all.
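The dynamics described above can be summarized in a short sketch. The following is only an illustration, not the patent's implementation; the numerical values of the acceleration increment and the gravity term, the action encoding, and the simplified one-step integration (treating each action as a direct velocity increment) are all assumptions made for the example.

```python
# Minimal sketch (all values and the integration scheme are assumptions) of the
# action-environment dynamics described above: U1/U2/U3 add an increment ALPHA
# to the velocity in the right, left, or upward direction, U3 is allowed only
# when the upward speed is zero, and gravity adds a downward speed GRAVITY_V
# per unit time when the agent is not supported by the ground or a platform.
ALPHA = 0.1        # assumed acceleration increment per action
GRAVITY_V = 0.05   # assumed downward speed added per unit time

def step(position, velocity, action, supported):
    """Advance the agent one time step; collisions with walls/platforms are omitted."""
    x, y = position
    vx, vy = velocity
    if action == 0:                     # U1: accelerate to the right (+x)
        vx += ALPHA
    elif action == 1:                   # U2: accelerate to the left (-x)
        vx -= ALPHA
    elif action == 2 and vy == 0.0:     # U3: accelerate upward, only if upward speed is zero
        vy += ALPHA
    if not supported:                   # gravity acts when not on the ground or a platform
        vy -= GRAVITY_V
    return (x + vx, y + vy), (vx, vy)
```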
[State Transition Model]
[0074] FIG. 4 is a diagram illustrating an example of a state
transition model of the agent acting in the action environment.
[0075] In FIG. 4, as the state of a state transition model of the
agent acting in the action environment, a small area obtained by
dividing the action environment into small areas is employed.
[0076] In other words, in FIG. 4, a small area in a square shape
obtained by dividing the action environment with an equal interval
respectively to the x direction and the y direction represents a
state, and the state is expressed with a discrete value.
[0077] The agent observes the current position as an observation
value, and can recognize the state of the current time (current
state) from the current position.
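As an illustration of this discretization, the following minimal sketch (not taken from the patent; the cell size and grid width are assumed values) maps the observed continuous position to a discrete state index.

```python
# Minimal sketch: map the agent's continuous position (x, y) to the discrete
# state (small area) obtained by dividing the action environment into equal
# square cells, as in FIG. 4. cell_size and grid_width are assumed values.
def position_to_state(x, y, cell_size=1.0, grid_width=20):
    col = int(x // cell_size)
    row = int(y // cell_size)
    return row * grid_width + col

# Example: a position of (3.7, 2.2) falls in column 3, row 2 -> state 43.
print(position_to_state(3.7, 2.2))
```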
[0078] A state transition model P.sub.SS'.sup.U for each action indicates the transition of the state of the agent from a state S (first state) to a state S' (second state, which may be the same as or different from the state S) caused by performing a predetermined action U.
[0079] The state transition model P.sub.SS'.sup.U for the action U
is expressed, for example, by Formula (1).
$P_{SS'}^{U} = P(S'|S,U)$   [Expression 1]
[0080] In Formula (1) here, P(S'|S,U) indicates a transition
probability (probability model) with which the state is transited
to the state S' when the agent performs the action U in the state
S.
[0081] Furthermore, as the state transition model P.sub.SS'.sup.U
for the action U, the frequency of transition to the state S' when
the agent performs the action U in the state S can be employed.
[0082] The frequency of transition to the state S' by performance of the action U in the state S can be converted into the transition probability of transition to the state S' by performance of the action U in the state S, by normalizing it by the sum of the frequencies of transition to each state by performance of the action U in the state S.
[0083] Thus, the frequency of transition to the state S' by performance of the action U in the state S and the transition probability of transition to the state S' by performance of the action U in the state S can be regarded as equivalent.
[0084] Furthermore, herein, the storage (learning) of the state transition model P.sub.SS'.sup.U is performed with the frequency; in a process that uses the state transition model P.sub.SS'.sup.U, the frequency is converted into a transition probability as necessary, and the transition probability is used.
[0085] In addition, hereinbelow, the state transition model
P.sub.SS'.sup.U indicating a transition probability is also
described as a transition probability P.sub.SS'.sup.U.
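A minimal sketch of such a frequency-based model follows; the array layout and sizes are assumptions made for illustration, not the patent's data structures. Frequencies are stored per action and normalized into transition probabilities only when a process requires them, as described in paragraph [0084].

```python
import numpy as np

NUM_STATES = 400   # assumed number of small areas (states)
NUM_ACTIONS = 3    # U1, U2, U3

# counts[u, s, s2] = frequency of transition from state s to state s2 under action u
counts = np.zeros((NUM_ACTIONS, NUM_STATES, NUM_STATES))

def transition_probabilities(counts):
    """Normalize frequencies into P(S'|S,U); rows with no observations stay zero."""
    totals = counts.sum(axis=2, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```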
[Configuration Example of Learning Unit 12]
[0086] FIG. 5 is a block diagram showing a configuration example of
the learning unit 12 of FIG. 2.
[0087] In FIG. 5, the learning unit 12 includes a state recognition
unit 21 and a model updating unit 22.
[0088] The state recognition unit 21 is supplied with (the
coordinate of) the current position of the agent from the sensor 11
as an observation value.
[0089] The state recognition unit 21 recognizes the current state
which is a state where the coordinate is observed (herein, a small
area where the agent is positioned among the small areas obtained
by dividing the action area described in FIG. 4) based on the
coordinate of the current position as the observation value from
the sensor 11, and supplies the result to the model updating unit
22.
[0090] The model updating unit 22 recognizes the action U of the
agent having a state transition to the (latest) current state from
the state recognition unit 21 based on the action signal from the
action control unit 14.
[0091] Then, the model updating unit 22 updates a state transition
model P.sub.SS'.sup.U for the action U of the agent having the
state transition to the (latest) current state S' from the state
recognition unit 21 among state transition models for each action
stored in the model storage unit 13 based on the state transition
to the current state S'.
[0092] In other words, the current state immediately before (one time step before) the latest current state S' supplied from the state recognition unit 21 to the model updating unit 22 (hereinafter also referred to as the previous state) is assumed to be a state S.
[0093] The model updating unit 22 recognizes the previous state S
and the current state S' based on the current state supplied from
the state recognition unit 21, and further recognizes the action U
of the agent that is performed to bring about the state transition
from the previous state S to the current state S' based on the
action signal from the action control unit 14.
[0094] Then, the model updating unit 22 updates the state
transition model P.sub.SS'.sup.U by increasing the frequency
indicated by the state transition model P.sub.SS'.sup.U stored in
the model storage unit 13 by one when the state transition to the
current state S' is implemented by performance of the action U in
the previous state S.
[Learning Process]
[0095] FIG. 6 is a flowchart explaining a process of learning
(learning process) of the state transition model performed by the
learning unit 12 of FIG. 5.
[0096] Furthermore, the learning process of FIG. 6 is performed at
all times while the agent performs actions.
[0097] In Step S11, the model updating unit 22 awaits the output of
an action signal U from the action control unit 14 to acquire
(receive) the action signal U, and recognizes an action U of the
agent performed based on the action signal U, and the process
advances to Step S12.
[0098] Herein, the action signal U is an action signal that causes
the agent to perform the action U.
[0099] In Step S12, the state recognition unit 21 acquires an
observation value (sensor signal) observed by the sensor 11 after
the agent performs the action U corresponding to the action signal
U previously output from the action control unit 14, and the
process advances to Step S13.
[0100] In Step S13, the state recognition unit 21 recognizes the
current state S' based on the observation value from the sensor 11,
and supplies the result to the model updating unit 22, and the
process advances to Step S14.
[0101] In Step S14, the model updating unit 22 updates the state
transition model P.sub.SS'.sup.U indicating the state transition to
the current state S' supplied from the state recognition unit 21 by
performance of the action U one time before in the previous state S
supplied from the state recognition unit 21 one time before, among
the state transition models stored in the model storage unit
13.
[0102] In other words, the model updating unit 22 updates the state
transition model P.sub.SS'.sup.U by increasing the frequency
indicated by the state transition model P.sub.SS'.sup.U by one.
[0103] After the updating of the state transition model
P.sub.SS'.sup.U, the process returns to Step S11 from Step S14, and
the same process is repeated thereafter after awaiting the output
of the action signal from the action control unit 14.
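The update of Steps S11 to S14 reduces to incrementing a single frequency. The sketch below assumes the counts array from the earlier sketch and is only an illustration of this bookkeeping, not the patent's code.

```python
# Minimal sketch of the model update in Step S14: when the action U performed
# in the previous state S leads to the current state S', the frequency stored
# in the state transition model for (S, U, S') is increased by one.
def update_model(counts, prev_state, action, current_state):
    counts[action, prev_state, current_state] += 1

# Example usage within one learning-loop iteration (names are assumptions):
# update_model(counts, prev_state=S, action=U, current_state=S_prime)
```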
[Configuration Example of Action Control Unit 14]
[0104] FIG. 7 is a block diagram showing a configuration example of
the action control unit 14 of FIG. 2.
[0105] In FIG. 7, the action control unit 14 includes a state
recognition unit 31, a state value calculation unit 32, an action
value calculation unit 33, a target state setting unit 34, and an
action selection unit 35.
[0106] The state recognition unit 31 is supplied with (the coordinate of) the current position of the agent as an observation value from the sensor 11.
[0107] The state recognition unit 31 recognizes the current state, which is a state where the coordinate is observed (herein, the small area where the agent is positioned among the small areas obtained by dividing the action area described in FIG. 4), based on the coordinate of the current position as the observation value from the sensor 11, in the same manner as the state recognition unit 21 of FIG. 5, and supplies the result to the state value calculation unit 32 and the action selection unit 35.
[0108] Furthermore, a single state recognition unit may serve as both the state recognition unit 31 and the state recognition unit 21 of FIG. 5.
[0109] The state value calculation unit 32 calculates, for each state of the state transition models (here, each small area obtained by dividing the action area described in FIG. 4), a state value having a predetermined state set as a reference, whose value is larger for a state with a higher transition probability to a state close to the predetermined state, based on the state transition models stored in the model storage unit 13, and supplies the result to the action value calculation unit 33.
[0110] Specifically, the state value calculation unit 32 calculates, for each state S of the state transition models, a state value V(S) having the current state from the state recognition unit 31 set as a reference as the predetermined state, for example, whose value is larger for a state S with a higher transition probability P.sub.SS'.sup.U to a state S' close to the current state, and supplies the result to the action value calculation unit 33.
[0111] The action value calculation unit 33 calculates, for each state S of the state transition models and each action U that the agent can perform, an action value Q(S,U) whose value is larger for a state S and an action U with a higher transition probability to a state S' with a high state value V(S') having the current state set as a reference, based on the state transition models stored in the model storage unit 13 and the state value V(S) having the current state set as a reference from the state value calculation unit 32, and supplies the result to the target state setting unit 34.
[0112] The target state setting unit 34 sets a state with great
unevenness in the action value Q(S,U) among states of the state
transition models to a target state that is the target of the agent
to reach by performance of actions based on the action value Q(S,U)
from the action value calculation unit 33, and supplies the target
state to the action selection unit 35.
[0113] The action selection unit 35 selects the action U of the
agent so as to move toward the target state out of actions that the
agent can perform based on the state transition models stored in
the model storage unit 13 and the target state from the target
state setting unit 34, and outputs the action signal U
corresponding to the action U (action signal U that causes the
agent to perform the action U).
[0114] The action signal U output by the action selection unit 35
is supplied to the learning unit 12 and the actuator 15 (in FIG.
2).
[Process of State Value Calculation Unit 32]
[0115] FIG. 8 is a diagram illustrating the process of the state
value calculation unit 32 of FIG. 7.
[0116] The state value calculation unit 32 calculates the state
value V(S) having the current state set as a reference of which the
value increases as much as the state S with a high transition
probability P.sub.SS'.sup.U to the state S' close to the current
state from the state recognition unit 31 for each state S of the
state transition models.
[0117] In other words, the state value calculation unit 32 calculates the state value V(S) having the current state set as a reference for each state S of the state transition models by repeatedly calculating the recurrence formula of Formula (2), for example, a sufficient number of times determined in advance, which propagates the state value V(S.sub.current) with attenuation, with the state value V(S.sub.current) of the current state S.sub.current set to 1 (1.0).
$V(S) \leftarrow \max_U \sum_{S'} P_{SS'}^{U} \left[ R_{S'} + \gamma V(S') \right]$   [Expression 2]
[0118] Herein, in Formula (2), the summation is taken over all states S', and max indicates the maximum value, over the actions U, of the expression that follows max.
[0119] Furthermore, in Formula (2), .gamma. is an attenuation constant, a real number within the range of 0<.gamma.<1 determined in advance, for propagating the state value V(S.sub.current) of the current state S.sub.current with attenuation.
[0120] In addition, in Formula (2), R.sub.S' indicates a constant set for the state S' (the transition destination of the state transition). If the constant R.sub.S' when the state S' is the current state is denoted R.sub.current, and the constant R.sub.S' when the state S' is other than the current state is denoted R.sub.other, then the constant R.sub.current is 1 and the constant R.sub.other is 0.
[0121] According to the recurrence formula of Formula (2), when the transition probability P.sub.SS'.sup.U is high, when the state value V(S') of the transition destination is high, and when the state S' of the transition destination is the current state (R.sub.S'=R.sub.current), the state value V(S) of the state S of the transition source increases. In other words, the value of the state value V(S) having the current state set as a reference increases as much as the state S with a high transition probability P.sub.SS'.sup.U to the state S' close to the current state.
[0122] Herein, FIG. 8 shows an example of the state value V(S)
having the current state set as a reference.
[0123] When a state is set to a small area obtained by dividing the action area as described in FIG. 4, the closer a small area is to the small area that is the current state, the easier it is to move from that small area to the current state (the transition probability P.sub.SS'.sup.U is high); therefore, in FIG. 8, the value of the state value V(S) having the current state set as a reference tends to increase for states close to the current state.
[0124] Furthermore, in FIG. 8, the state value calculation unit 32
is set to calculate the state value V(S) having the current state
set as a reference, but the state value calculation unit 32 can
calculate the state value V(S) having an arbitrary state other than
the current state (for example, a state selected at random) set as
a reference.
[0125] In addition, the recurrence formula of Formula (2) is
calculated with an assumption that the initial value of V(S) is 0
(in the same manner for a recurrence formula to be described later)
unless specified otherwise.
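A minimal sketch of this recurrence follows; it assumes the transition-probability array of the earlier sketches, and the attenuation constant and iteration count are illustrative values rather than ones given in the patent.

```python
import numpy as np

# Minimal sketch of Formula (2) with the current state as the reference:
#   V(S) <- max_U sum_S' P_{SS'}^U [R_{S'} + gamma * V(S')]
# where R is 1 for the reference (current) state and 0 otherwise.
def state_values(probs, reference_state, gamma=0.9, iterations=50):
    num_actions, num_states, _ = probs.shape
    reward = np.zeros(num_states)
    reward[reference_state] = 1.0
    value = np.zeros(num_states)                    # initial value of V(S) is 0 ([0125])
    for _ in range(iterations):
        backup = probs @ (reward + gamma * value)   # shape (num_actions, num_states)
        value = backup.max(axis=0)                  # maximum over actions U
    return value
```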
[Process of Action Value Calculation Unit 33 and Target State
Setting Unit 34]
[0126] FIGS. 9 and 10 are diagrams illustrating the process of the
action value calculation unit 33 and the target state setting unit
34 of FIG. 7.
[0127] The action value calculation unit 33 calculates an action
value Q(S,U) of which the value increases as much as the action U
and the state S with a high transition probability to the state S'
with a high state value V(S') having the current state set as a
reference for each state S of the state transition models and each
action U that the agent can perform based on the state transition
models stored in the model storage unit 13 and the state value V(S)
having the current state set as a reference from the state value
calculation unit 32.
[0128] In other words, the action value calculation unit 33 calculates an action value Q(S,U) for each state S of the state transition models and each action U that the agent can perform by calculating, for example, Formula (3) using the transition probability (state transition model) P.sub.SS'.sup.U and the state value V(S) having the current state set as a reference.
$Q(S,U) = \sum_{S'} P_{SS'}^{U} V(S')$   [Expression 3]
[0129] According to Formula (3), the value of the action value
Q(S,U) increases as much as the action U and the state S with a
high transition probability P.sub.SS'.sup.U to the state S' with a
high state value V(S') having the current state set as a
reference.
[0130] The action value calculation unit 33 supplies each state S
and the action value Q(S,U) for each action U to the target state
setting unit 34.
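A one-line sketch of Formula (3), assuming the same array layout as the earlier sketches:

```python
# Minimal sketch of Formula (3): Q(S,U) = sum_S' P_{SS'}^U V(S').
# probs has shape (num_actions, num_states, num_states); value has shape (num_states,).
def action_values(probs, value):
    return probs @ value   # Q[u, s] = sum_s' P[u, s, s'] * V[s']
```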
[0131] The target state setting unit 34 sets a state with large
unevenness in the action value Q(S,U) to the target state among the
states of the state transition models based on the action value
Q(S,U) from the action value calculation unit 33.
[0132] In other words, the target state setting unit 34 obtains,
for example, a variance W(S) as unevenness in the action value
Q(S,U) for each state S according to Formulas (4) and (5) based on
the action value Q(S,U) from the action value calculation unit
33.
$Q_{av}(S,U) = \dfrac{Q(S,U)}{\sum_U Q(S,U)}$   [Expression 4]
$W(S) = E[Q_{av}(S,U)^{2}] - (E[Q_{av}(S,U)])^{2}$   [Expression 5]
[0133] Herein, Q.sub.av(S,U) indicates a probability (random variable) obtained by normalizing the action value Q(S,U) for the state S, and the summation in Formula (4) is taken over the actions U.
[0134] In addition, in Formula (5), E[ ] indicates the expectation value of the value (random variable) in the brackets [ ].
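A minimal sketch of Formulas (4) and (5) follows. The expectation E[ ] is taken here as a uniform mean over the actions U, and the guard against all-zero rows is an added safeguard; both are illustrative choices, not details given in the patent.

```python
import numpy as np

# Minimal sketch: normalize Q(S,U) over actions into Q_av(S,U) (Formula (4))
# and compute the variance W(S) = E[Q_av^2] - (E[Q_av])^2 (Formula (5)).
def action_value_variance(q):
    """q has shape (num_actions, num_states); returns W(S) with shape (num_states,)."""
    totals = q.sum(axis=0, keepdims=True)
    q_av = np.divide(q, totals, out=np.zeros_like(q), where=totals != 0)
    return (q_av ** 2).mean(axis=0) - q_av.mean(axis=0) ** 2
```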
[0135] When the variance W(S) for the state S is high, the unevenness in the action values Q(S,U) of the actions U performed in the state S is great; thus, it is highly possible that there is an action that the agent has not performed in the state S, and further, it is highly possible that the agent has little experience of reaching the state S (that is, of the state S having become the current state).
[0136] In addition, it is highly possible that learning (updating) of the state transition model is insufficient for a state S that the agent has little experience of reaching.
[0137] Furthermore, for a state that can be reached only from a state S that the agent has little experience of reaching, there is not only a possibility that learning of the state transition model is insufficient but also a possibility that it is itself a state the agent has little experience of reaching.
[0138] Conversely, when the agent reaches a state S that it has little experience of reaching, or a state that it has no experience of reaching, and then performs learning (updating) of the state transition models (FIG. 6) for those states, it becomes possible to efficiently learn the action environment, which is an unknown environment.
[0139] Thus, when the target state setting unit 34 obtains a
variance W(S) as unevenness of the action value Q(S,U) for each
state S, the target state setting unit selects a state with a high
variance W(S), that is, a state of which a variance W(S) is equal
to or higher than a predetermined threshold value as a candidate of
the target state.
[0140] FIG. 9 shows an example of the variance W(S) of the action
value Q(S,U) obtained for each small area obtained by dividing the
action area as a state.
[0141] In FIG. 9, as the predetermined threshold value, for
example, 1 is employed to select the candidates of the target
states.
[0142] The target state setting unit 34 sets the target state from
the candidates of the target states, after the selection of the
candidates of the target states.
[0143] As a method of setting the target state from the candidates
of the target states, for example, there are methods of selecting
one candidate among the candidates of the target states at random
and setting the candidates to the target state, and of setting a
candidate with the maximum variance W(S) to the target state.
[0144] However, with the method of selecting one candidate at random among the candidates of the target states and the method of setting the candidate with the maximum variance W(S) to the target state, it may be difficult to reach the target state from the current state.
[0145] Thus, the target state setting unit 34 sets, as the target state, a candidate that can be reached from the current state by state transitions within a predetermined number of times, among the candidates of the target states.
[0146] In other words, the target state setting unit 34 obtains an
existence probability T(S) of being in (or reaching) the current
state by the state transitions within a predetermined number of
times for each state S based on the state transition model
P.sub.SS'.sup.U stored in the model storage unit 13 by repeatedly
calculating, for example, the recurrence formula of Formula (6) a
predetermined number of times.
$T(S) \leftarrow \max_U \sum_{S'} P_{SS'}^{U} T(S')$   [Expression 6]
[0147] In Formula (6) here, if the initial value of the existence probability T(S') of the current state is denoted T.sub.current and the initial value of the existence probability T(S') of a state other than the current state is denoted T.sub.other, then the initial value T.sub.current is 1, and the initial value T.sub.other is 0.
[0148] FIG. 10 shows an example of the existence probability T(S)
obtained for each small area obtained by dividing the action area
as a state.
[0149] A state where the existence probability T(S) is greater than
0 is a state that can be reached from the current state by state
transitions within a predetermined number of times (hereinbelow,
referred to as a reachable state), and the target state setting
unit 34 selects a reachable state among the candidates of the
target states, for example, at random, and sets it to the target
state.
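The candidate selection and reachability test can be combined as in the following minimal sketch; the variance threshold, the number of state transitions, and the random choice among reachable candidates are illustrative assumptions, and the helper reuses the transition-probability array of the earlier sketches.

```python
import numpy as np

# Minimal sketch of Formula (6) and the learning-target selection: propagate
# the existence probability T(S) from the current state for a fixed number of
# transitions, then choose at random a candidate state (variance W(S) at or
# above a threshold) whose T(S) is greater than 0 (a reachable state).
def select_learning_target(probs, w, current_state, threshold=1.0, steps=10, seed=None):
    rng = np.random.default_rng(seed)
    t = np.zeros(w.shape[0])
    t[current_state] = 1.0                 # initial value: T_current = 1, T_other = 0
    for _ in range(steps):
        t = (probs @ t).max(axis=0)        # T(S) <- max_U sum_S' P_{SS'}^U T(S')
    candidates = np.flatnonzero((w >= threshold) & (t > 0))
    return int(rng.choice(candidates)) if candidates.size else None
```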
[0150] As described above, since a state with a high variance W(S) of the action value Q(S,U) is set to the target state in the target state setting unit 34, it becomes easy for the agent, by performing actions so as to reach such a target state, to reach states that it has little experience of reaching and states that it has no experience of reaching, and it is therefore possible, through learning (updating) of the state transition models for those states, to efficiently learn the action environment, which is an unknown environment.
[0151] As described above, herein, the target state set based on the variance W(S) of the action value Q(S,U) in the target state setting unit 34 is a state set for efficiently learning the action environment, which is an unknown environment, by making it easy for the agent to reach states with little experience of reaching and states with no experience of reaching (in other words, by making it easy for the agent to accumulate unknown experience), and is hereinbelow also referred to as a learning target state.
[Process of Action Selection Unit 35]
[0152] FIGS. 11 and 12 are diagrams illustrating a process of the
action selection unit 35 of FIG. 7.
[0153] The action selection unit 35 selects an action U of the
agent so as to move toward the target state among actions that the
agent can perform based on the state transition models stored in
the model storage unit 13 and the target state from the target
state setting unit 34, and outputs an action signal U corresponding
to the action U (action signal U that causes the agent to perform
the action U).
[0154] In other words, the action selection unit 35 calculates a
state value V(S) having the target state set as a reference of
which the value increases as much as the state S with a high
transition probability P.sub.SS'.sup.U to the state S' close to the
target state from the target state setting unit 34 for each state S
of the state transition model.
[0155] Specifically, the action selection unit 35 calculates the state value V(S) having the target state set as a reference for each state S of the state transition models by repeatedly calculating the recurrence formula of Formula (2), for example, a sufficient number of times determined in advance, which propagates the state value V(S.sub.goal) with attenuation, with the state value V(S.sub.goal) of the target state S.sub.goal set to 1 (1.0), in the same manner as the state value calculation unit 32 (FIG. 7).
[0156] Furthermore, according to Formula (2), when the state value
V(S) having the target state set as a reference is to be
calculated, as a constant R.sub.S' of Formula (2), 1 is used for
the target state, and 0 is used for a state other than the target
state.
[0157] In other words, in Formula (2), if it is assumed that the
constant R.sub.S' when the state S' is the target state is
indicated by R.sub.goal and the constant R.sub.S' when the state S'
is a state other than the target state is indicated by R.sub.other,
the constant R.sub.goal is 1 and the constant R.sub.other is 0.
[0158] According to the recurrence formula of Formula (2), when the transition probability P.sub.SS'.sup.U is high, when the state value V(S') of the transition destination is high, and when the state S' of the transition destination is the target state (R.sub.S'=R.sub.goal), the state value V(S) of the state S of the transition source increases. In other words, the value of the state value V(S) having the target state set as a reference increases as much as the state S with a high transition probability P.sub.SS'.sup.U to the state S' close to the target state.
[0159] Herein, FIG. 11 shows an example of the state value V(S)
having the target state set as a reference.
[0160] After the calculation of the state value V(S) having the target state set as a reference, the action selection unit 35 calculates, for each state S of the state transition models and each action U that the agent can perform, the action value Q(S,U) whose value increases as much as the action U and the state S with a high transition probability P.sub.SS'.sup.U to the state S' with a high state value V(S') having the target state set as a reference, based on that state value V(S) and the state transition models stored in the model storage unit 13.
[0161] In other words, the action selection unit 35 calculates the
action value Q(S,U) for each state S of the state transition model
and each action U that the agent can perform by calculating, for
example, the above-described Formula (3) using the transition
probability (state transition model) P.sub.SS'.sup.U and the state
value V(S) having the target state set as a reference.
[0162] According to Formula (3), the value of the action value
Q(S,U) increases as much as the action U and the state S with a
high transition probability P.sub.SS'.sup.U to the state S' with a
high state value V(S') having the target state set as a
reference.
[0163] When the action value Q(S,U) is obtained for each state S and each action U, the action selection unit 35 selects the action U that gives the maximum value among the action values Q(S,U) for the current state S from the state recognition unit 31 as the action .pi.(S,U) performed in the current state S according to, for example, Formula (7).
$\pi(S,U) = \operatorname{argmax}_U Q(S,U)$   [Expression 7]
[0164] Herein, in Formula (7), argmax indicates the action U that gives the maximum value among the action values Q(S,U) for the current state S (the action U of the maximum action value Q(S,U)).
[0165] The action selection unit 35 repeats selecting the action U
that gives the maximum value among the action values Q(S,U) for the
current state S as the action .pi.(S,U) performed in the current
state S every time the current state S is supplied from the state
recognition unit 31, and as a result, the agent performs actions so
as to move toward the target state.
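Putting the pieces together, the action selection toward the target state can be sketched as below; it reuses the state_values and action_values helpers from the earlier sketches, which is an assumption about how the pieces fit, not the patent's structure.

```python
import numpy as np

# Minimal sketch of [0154]-[0165]: recompute state values with the target state
# as the reference, obtain action values by Formula (3), and pick the action
# with the maximum action value for the current state (Formula (7)).
def select_action_toward_target(probs, current_state, target_state,
                                gamma=0.9, iterations=50):
    v_goal = state_values(probs, target_state, gamma, iterations)  # reference = target state
    q = action_values(probs, v_goal)                               # Formula (3)
    return int(np.argmax(q[:, current_state]))                     # Formula (7)
```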
[0166] FIG. 12 shows the state where the agent moves toward the
target state by repeating the action U=.pi.(S,U) that gives the
maximum value among the action values Q(S,U) for the current state
S.
[0167] Furthermore, the target state setting unit 34 can set the above-described learning target state as a target state, and can also set, as a target state, a state given from the outside based on, for example, an operation by a user or the like.
[0168] Herein, the state given from the outside as a target state is a state given in order to make the agent act autonomously until the agent reaches that state, and hereinbelow the state is also referred to as an action target state, in order to discriminate it from the learning target state.
[0169] When the target state supplied from the target state setting
unit 34 to the action selection unit 35 is the action target state,
the action selection unit 35 can select the action U that gives the
maximum value among the action values Q(S,U) for the current state
S as the action .pi.(S,U) performed in the current state S as
described above.
[0170] On the other hand, when the target state supplied from the target state setting unit 34 to the action selection unit 35 is the learning target state, the action selection unit 35 can select the action U that gives the maximum value among the action values Q(S,U) for the current state S as the action .pi.(S,U) performed in the current state S, or it can select the action .pi.(S,U) performed in the current state S based on the action values Q(S,U) for the current state S by, for example, an .epsilon.-greedy method.
[0171] In the .epsilon.-greedy method, according to Formula (8), the action U that gives the maximum value among the action values Q(S,U) for the current state S is selected as the action .pi.(S,U) performed in the current state S with a probability 1-.epsilon., and one of the actions that the agent can perform is selected at random, with a probability .epsilon., as the action .pi.(S,U) performed in the current state S.
$\pi(S,U) = \begin{cases} \operatorname{argmax}_U Q(S,U) & (\text{with probability } 1-\epsilon) \\ \text{random}(U) & (\text{with probability } \epsilon) \end{cases}$   [Expression 8]
[0172] Furthermore, when the target state supplied from the target
state setting unit 34 to the action selection unit 35 is the
learning target state, the action selection unit 35 can select the
action .pi.(S,U) performed in the current state S based on, for
example, the action value Q(S,U) for the current state S by a
softmax method, in addition to the above.
[0173] In the softmax method, each action U is selected as the
action .pi.(S,U) performed in the current state S at random with a
probability corresponding to the action value Q(S,U) of each action
U for the current state S.
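Both selection rules can be sketched as follows; epsilon and the softmax temperature are illustrative values, not parameters specified in the patent.

```python
import numpy as np

# Minimal sketch of the epsilon-greedy rule of Formula (8) and the softmax
# alternative of [0173], applied to q_current, the action values Q(S,U) of the
# current state S (one value per action U).
def epsilon_greedy(q_current, epsilon=0.1, seed=None):
    rng = np.random.default_rng(seed)
    if rng.random() < 1.0 - epsilon:
        return int(np.argmax(q_current))            # greedy action
    return int(rng.integers(len(q_current)))        # random action

def softmax_select(q_current, temperature=1.0, seed=None):
    rng = np.random.default_rng(seed)
    z = (np.asarray(q_current) - np.max(q_current)) / temperature
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_current), p=p))
```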
[Action Control Process]
[0174] FIG. 13 is a flowchart explaining a process of action
control of the agent for learning the action environment (action
control process for learning) performed by the action control unit
14 of FIG. 7.
[0175] In the action control process for learning, in order to
proceed learning (updating) of the state transition models stored
in the model storage unit 13, in other words, in order to learn the
entire unknown action environment, a state with a high possibility
that the agent has little experience of reaching is set to the
learning target state, and actions of the agent are controlled so
as to move toward the learning target state.
[0176] Furthermore, before the agent performs the action control process for learning in FIG. 13 for the first time, the agent performs innate actions, selected at random or in compliance with a rule determined in advance, for example, and performs a certain degree of learning of the action environment by the learning process (FIG. 6) performed during those innate actions.
[0177] Accordingly, before the agent performs the action control process for learning for the first time, the agent has acquired state transition models (state transition models indicating nonzero frequencies) within the range of states that the agent has reached through the innate actions.
[0178] In Step S21, the state recognition unit 31 awaits the output
of an observation value (sensor signal) observed after the agent
performed an action corresponding to the action signal previously
output by the action selection unit 35 from the sensor 11, and
acquires the observation value.
[0179] Furthermore, the state recognition unit 31 recognizes the
current state based on the observation value from the sensor 11,
and supplies the result to the state value calculation unit 32 and
the action selection unit 35, and the process advances from Step
S21 to Step S22.
[0180] In Step S22, the state value calculation unit 32 calculates
a state value V(S) having the current state set as a reference for
each state S of state transition models using the state transition
model P.sub.SS'.sup.U according to the recurrence formula of the
above-described Formula (2), and supplies the result to the action
value calculation unit 33, and the process advances to Step
S23.
[0181] In Step S23, the action value calculation unit 33 calculates
an action value Q(S,U) for each state S of the state transition
models and each action U that the agent can perform based on the
state value V(S) from the state value calculation unit 32 having
the current state set as a reference according to the
above-described Formula (3), and supplies the result to the target
state setting unit 34, and the process advances to Step S24.
[0182] In Step S24, the target state setting unit 34 obtains a
variance W(S) of the action value Q(S,U) for each state S based on
the action value Q(S,U) from the action value calculation unit 33
according to the above-described Formulas (4) and (5), and the
process advances to Step S25.
[0183] In Step S25, the target state setting unit 34 obtains
candidates of the target states (candidate states), that is,
selects a state of which the variance W(S) of the action value
Q(S,U) is equal to or higher than a predetermined threshold value
as a candidate of the target state, based on the variance W(S) of
the action value Q(S,U), and the process advances to Step S26.
[0184] In Step S26, the target state setting unit 34 obtains, for
each state S, an existence probability T(S) of being in (or
reaching) the state S from the current state by state transitions
within a predetermined number of times, based on the state
transition models P.sub.SS'.sup.U stored in the model storage unit
13 according to the recurrence formula of the above-described
Formula (6), and the process advances to Step S27.
[0185] In Step S27, the target state setting unit 34 selects one of
states (reachable states) of which the existence probability T(S)
is greater than 0 (has a positive value) out of the candidates of
the target states, for example, at random, and sets the result to
the learning target state.
[0186] Then, the target state setting unit 34 supplies the learning
target state to the action selection unit 35, and the process
advances from Step S27 to Step S28.
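The selection of the learning target state in Steps S24 to S27 can be sketched as follows; this is only an illustrative reading of the text in which the unevenness of Formulas (4) and (5) is taken as the variance of Q(S,U) over actions, and Q, T, states, actions, and the threshold are assumed inputs.

```python
import random

def select_learning_target(Q, states, actions, T, threshold):
    # Keep as candidates the states whose action-value variance is at or
    # above the threshold (Steps S24 and S25).
    candidates = []
    for s in states:
        values = [Q.get((s, u), 0.0) for u in actions]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance >= threshold:
            candidates.append(s)
    # Among the candidates, keep only reachable states (existence
    # probability T(S) > 0) and pick one at random (Steps S26 and S27).
    reachable = [s for s in candidates if T.get(s, 0.0) > 0.0]
    return random.choice(reachable) if reachable else None
```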
[0187] In Step S28, the action selection unit 35 calculates a state
value V(S) having the learning target state from the target state
setting unit 34 set as a reference for each state S of the state
transition models according to the recurrence formula of the
above-described Formula (2), and the process advances to Step
S29.
[0188] In Step S29, the action selection unit 35 uses the state
value V(S) having the learning target state set as a reference to
calculate an action value Q(S,U) for each state S of the state
transition models and each action U that the agent can perform
according to the above-described Formula (3), and the process
advances to Step S30.
[0189] In Step S30, the action selection unit 35 selects an action
U performed in the current state S based on the action value Q(S,U)
for the current state S from the state recognition unit 31 among
the action values Q(S,U) for each state S of the state transition
models and each action U that the agent can perform by, for
example, the .epsilon.-greedy method or the softmax method, and
outputs an action signal U corresponding thereto.
[0190] The action signal U output by the action selection unit 35
is supplied to the learning unit 12 and the actuator 15.
[0191] The learning unit 12 performs the above-described learning
process (in FIG. 6) using the action signal U from the action
selection unit 35.
[0192] In addition, the actuator 15 is driven according to the
action signal U from the action selection unit 35, and accordingly,
the agent performs the action U according to the action signal
U.
[0193] When the agent performs the action U according to the action
signal U, the process advances from Step S30 to Step S31, and the
state recognition unit 31 awaits the output of an observation value
observed after the action U of the agent from the sensor 11, and
acquires the observation value.
[0194] Furthermore, the state recognition unit 31 recognizes the
current state based on the observation value from the sensor 11,
and supplies the result to the state value calculation unit 32 and
the action selection unit 35, and the process advances from Step
S31 to Step S32.
[0195] In Step S32, the action selection unit 35 determines whether
or not the current state from the state recognition unit 31
coincides with the (latest) learning target state from the target
state setting unit 34 and whether or not a predetermined time t1
has passed after the (latest) learning target state was supplied
from the target state setting unit 34.
[0196] In Step S32, when it is determined that the current state
from the state recognition unit 31 does not coincide with the
learning target state from the target state setting unit 34 and
that the predetermined time t1 has not passed after the learning
target state was supplied from the target state setting unit 34,
the process returns to Step S30, and thereafter, the same process
is repeated.
[0197] In addition, in Step S32, when it is determined that the
current state from the state recognition unit 31 coincides with the
learning target state from the target state setting unit 34, that
is, when the agent reaches the learning target state, or that the
predetermined time t1 has passed after the learning target state
was supplied from the target state setting unit 34, that is, when
the agent was not able to reach the learning target state within
the predetermined time t1, the process advances to Step S33, and
the action selection unit 35 determines whether or not the
condition of ending the action control to end the action control
process for learning is satisfied.
[0198] Herein, as the condition of ending the action control to end
the action control process for learning, there is, for example, a
command performed by a user so as to end the action control process
for learning, passage of a predetermined time t2 which is
sufficiently longer than the predetermined time t1 after the action
control process for learning is started, or the like.
[0199] In Step S33, when it is determined that the condition of
ending the action control is not satisfied, the process returns to
Step S22, and thereafter, the same process is repeated.
[0200] In addition, in Step S33, when it is determined that the
condition of ending the action control is satisfied, the action
control unit 14 ends the action control process for learning.
[0201] As described above, the agent calculates the state value
V(S) having a predetermined state such as the current state set as
a reference using the state transition model P.sub.SS'.sup.U,
calculates the action value Q(S,U) for each state of the state
transition model and each action U that the agent can perform based
on the state value V(S), sets the state S with a high variance W(S)
as unevenness in the action value Q(S,U) to the learning target
state, and performs actions toward the learning target state.
[0202] As described above, there is a high possibility that the
state S with a high variance W(S) as unevenness in the action value
Q(S,U) is a state that the agent has little experience of reaching,
and that learning (updating) of the state transition model is
insufficient for such a state S.
[0203] Furthermore, for a state that can be transited to only from
the state S that the agent has little experience of reaching, there
is also a possibility that learning of the state transition model
is insufficient and that the state is a state that the agent has no
experience of reaching.
[0204] Accordingly, by setting, in the agent, the state S with the
high variance W(S) of the action value Q(S,U) to the learning
target state and performing actions toward the learning target
state, the agent reaches (or tends to reach) the state that the
agent has little experience of reaching and the state that the
agent has no experience of reaching, and as a result, learning
(updating) of the state transition model is performed for such a
state, and therefore, it is possible to thoroughly learn the entire
action environment with efficiency.
[0205] In other words, the agent performs movement actions
thoroughly within the action environment, and as a result, the
agent can efficiently learn the entire action environment.
[0206] FIG. 14 is a diagram illustrating an action in the action
environment by the agent of FIG. 2 which is performed by the action
control process for learning in FIG. 13.
[0207] In the action control of the past, action control based on
an action value becomes possible only after the agent reaches the
target state and an action value for reaching the target state is
calculated with reinforcement learning, and thus, the agent has to
perform actions selected, for example, at random among the actions
that the agent can perform until the agent reaches the target
state.
[0208] In addition, for the agent performing the action selected at
random, it is difficult to reach the target state, that is, to
perform learning to reach the target state, due to the complexity
of the unknown action environment, or the like.
[0209] In other words, for example, when there is a narrow passage
which is difficult for the agent to pass through in the action
environment, the agent performing the actions selected at random is
not able to pass through the narrow passage, and not able to learn
the environment after passing through the narrow passage.
[0210] In addition, for example, when gravity is set in an action
environment where the agent can move to the upper and lower sides,
it is difficult for the agent performing the actions selected at
random to move to the upper side in the action environment due to
gravity, and the agent is not able to learn the environment in the
upper side of the action environment.
[0211] Furthermore, for example, when there is a bias in the
actions performed by the agent performing the actions selected at
random, a bias may also occur in learning of the action
environment.
[0212] On the other hand, according to action control (new action
control) by the action control process (in FIG. 13) for learning,
since the state S with a high variance W(S) of the action value
Q(S,U) is set to the learning target state, the agent reaches (or
tends to reach) the state in which the agent has little experience
of reaching and the state that the agent has no experience of
reaching, and as a result, learning (updating) of the state
transition model is performed for such a state, and therefore, it
is possible to thoroughly learn the entire action environment with
efficiency.
[0213] FIG. 15 is a flowchart explaining a process of action
control of the agent for autonomously acting in the action
environment (action control process for autonomous actions)
performed by the action control unit 14 of FIG. 7.
[0214] In the action control process for autonomous actions, the
state given from the outside based on, for example, an operation of
a user, or the like, is set to the action target state, and the
action of the agent is controlled so as to move toward the action
target state.
[0215] In Step S41, the target state setting unit 34 sets the state
given from the outside based on, for example, an operation of a
user to the action target state, and supplies the result to the
action selection unit 35.
[0216] Herein, as the action target state, a state that the agent
has reached is set. By the action control process for learning (in
FIG. 13), when learning of the entire action environment ends, that
is, when the agent has reached all states of the action
environment, an arbitrary state of the action environment can be
set as the action target state.
[0217] In Step S41, furthermore, the state recognition unit 31
recognizes the current state based on the observation value from
the sensor 11, and supplies the result to the action selection unit
35, and the process advances to Step S42.
[0218] In Step S42, the action selection unit 35 calculates the
state value V(S) having the action target state from the target
state setting unit 34 set as a reference for each state S of the
state transition models using the state transition model
P.sub.SS'.sup.U according to the recurrence formula of the
above-described Formula (2), and the process advances to Step S43.
[0219] In Step S43, the action selection unit 35 calculates an
action value Q(S,U) for each state S of the state transition models
and each action U that the agent can perform using the state value
V(S) having the action target state set as a reference according to
the above-described Formula (3), and the process advances to Step
S44.
[0220] In Step S44, the action selection unit 35 selects the action
U which gives the maximum value among the action values Q(S,U) for
the current state S from the state recognition unit 31, among the
action values Q(S,U) for each state S of the state transition
models and each action U that the agent can perform, as the action
.pi.(S,U) performed in the current state S, and outputs an action
signal U corresponding thereto.
[0221] The action signal U output from the action selection unit 35
is supplied to the learning unit 12 and the actuator 15.
[0222] The actuator 15 is driven according to the action signal U
from the action selection unit 35, and accordingly, the agent
performs the action U (=.pi.(S,U)) according to the action signal
U.
[0223] Furthermore, even while the action control process for
autonomous actions is performed, the above-described learning
process (in FIG. 6) can be performed in the learning unit 12 using
the action signal U from the action selection unit 35 in the same
manner as the action control process for learning (in FIG. 13).
[0224] When the agent performs the action U according to the action
signal U, the process advances from Step S44 to Step S45, and the
state recognition unit 31 awaits the output of the observation
value observed after the action U of the agent from the sensor 11,
and acquires the observation value.
[0225] Furthermore, the state recognition unit 31 recognizes the
current state based on the observation value from the sensor 11,
and supplies the result to the action selection unit 35, and the
process advances from Step S45 to Step S46.
[0226] In Step S46, the action selection unit 35 determines whether
or not the target state setting unit 34 sets a new action target
state.
[0227] In Step S46, when it is determined that the target state
setting unit 34 sets a new action target state, that is, when a
user performs an operation so as, for example, to change the
(action) target state, the target state setting unit 34 sets a new
action target state based on the operation, and the result is
supplied to the action selection unit 35, the process returns to
Step S42, the action selection unit 35 calculates a state value
V(S) having the new action target state set as a reference, and
thereafter, the same process is repeated.
[0228] In addition, in Step S46, when it is determined that the
target state setting unit 34 does not set a new action target
state, the process advances to Step S47, and the action selection
unit 35 determines whether or not a condition of ending action
control to end the action control process for autonomous actions is
satisfied.
[0229] Herein, as the condition of ending the action control to end
the action control process for autonomous actions, there is, for
example, a command performed by a user so that the action control
process for autonomous actions ends, coincidence of the current
state with the action target state, or the like.
[0230] In Step S47, when it is determined that the condition of
ending the action control is not satisfied, the process returns to
Step S44, and thereafter, the same process is repeated.
[0231] In addition, in Step S47, when it is determined that the
condition of ending the action control is satisfied, the action
control unit 14 ends the action control process for autonomous
actions.
[0232] FIG. 16 is a diagram illustrating an action in the action
environment of the agent of FIG. 2 performed by the action control
process for autonomous actions of FIG. 15.
[0233] In the action control of the past, action control based on
an action value becomes possible only after the agent reaches the
target state and an action value for reaching the target state is
calculated with reinforcement learning, and thus, if the target
state is changed, it is necessary for the agent to perform
reinforcement learning again to calculate an action value for
reaching the target state after the change.
[0234] On the other hand, in action control by the action control
process for autonomous actions (new action control), since a state
value V(S) having the action target state set as a reference and
further an action value Q(S,U) for reaching the action target state
are calculated using the state transition model P.sub.SS'.sup.U
(with which learning is performed at all times), a state value V(S)
having the new action target state set as a reference and further
an action value Q(S,U) for reaching the new action target state are
easily calculated even when the action target state is changed to
the new action target state, and it is possible to cause the agent
to perform actions toward the new action target state.
[0235] Furthermore, when there is a state that the agent has to
avoid (hereinafter, referred to as an avoidance state) in the
action environment, and the avoidance state is given to the agent,
the action selection unit 35 can select an action to reach the
action target state while avoiding the avoidance state in the
action control process for autonomous actions.
[0236] FIG. 17 is a diagram illustrating an action of the agent to
reach the action target state while avoiding the avoidance
state.
[0237] In order to avoid reaching the avoidance state, the action
selection unit 35 uses 1 for the target state, a negative value,
for example, -0.3, for the avoidance state, and 0 for a state other
than the target state and the avoidance state as the constant
R.sub.S' of Formula (2), in the calculation of the state value V(S)
having the action target state set as a reference using the state
transition model P.sub.SS'.sup.U according to the recurrence
formula of Formula (2).
[0238] In other words, in Formula (2), if the constant R.sub.S'
when the state S' is the target state is indicated by R.sub.goal,
the constant R.sub.S' when the state S' is the avoidance state is
indicated by R.sub.unlike, and the constant R.sub.S' when the state
S' is a state other than the target state and the avoidance state
is indicated by R.sub.other, the constant R.sub.goal is 1, the
constant R.sub.unlike is -0.3, and the constant R.sub.other is
0.
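A minimal sketch of this calculation, assuming the recurrence of Formula (2) is ordinary value iteration and that P[(s, u)] maps destination states to transition probabilities (the names, the attenuation constant gamma, and the number of sweeps are illustrative assumptions), is:

```python
def state_values_with_avoidance(P, states, actions, target, avoid,
                                gamma=0.9, sweeps=100):
    # R_goal = 1 for the target state, R_unlike = -0.3 for the avoidance
    # state, and R_other = 0 elsewhere, as described above.
    R = {s: 0.0 for s in states}
    R[target] = 1.0
    R[avoid] = -0.3
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            # V(S) <- max_U sum_S' P_{SS'}^U [R_S' + gamma V(S')]
            V[s] = max(
                sum(p * (R.get(s2, 0.0) + gamma * V.get(s2, 0.0))
                    for s2, p in P.get((s, u), {}).items())
                for u in actions)
    return V
```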
[0239] Herein, FIG. 17 shows an example of the state value V(S)
having the target state set as a reference when the constant
R.sub.goal is set to 1, the constant R.sub.unlike to -0.3, and the
constant R.sub.other to 0, respectively.
[0240] With the setting above, the action selection unit 35
calculates the state value V(S) having the target state set as a
reference, and then, calculates an action value Q(S,U) for each
state S of the state transition models and each action U that the
agent can perform using the state value V(S) according to Formula
(3).
[0241] Then, the action selection unit 35 selects an action U that
gives the maximum value in the action values Q(S,U) for the current
state S among action values Q(S,U) for each state S and each action
U as an action U performed in the current state S.
[0242] As described above, in Formula (2), by employing a negative
value as the constant R.sub.unlike for the avoidance state, the
state value V(S) for the avoidance state, having the target state
set as a reference, and further the action value Q(S,U) of an
action toward the avoidance state, which is obtained using that
state value V(S), become relatively small, and as a result, the
agent performs actions so as to move toward the target state while
avoiding the avoidance state, as shown by the arrow of FIG. 17.
[Application Example to Object Moving Task]
[0243] The learning process of the learning unit 12 and the action
control process of the action control unit 14 can be applied to a
task in which the agent simply moves in the action environment as
described above (hereinafter, also referred to as a simple movement
task), and also to a task in which the agent moves an object
(hereinafter, also referred to as an object moving task), for
example, in the action environment.
[0244] FIG. 18 is a diagram illustrating the object moving
task.
[0245] In the object moving task, an object that can be moved
exists in addition to the agent in the action environment.
[0246] In FIG. 18, the action environment is an area (ground
surface) on a two-dimensional plane, and the agent and the object
move in the area.
[0247] Now, in FIG. 18, if the upper direction is assumed to be the
north, the agent can move in any of the directions of the east, the
west, the south, the north, the northeast, the southeast, the
southwest, and the northwest, by a predetermined distance with one
action by, so to speak, its own efforts.
[0248] In addition, the agent can move (push) the object in the
direction in which the agent moves when the agent contacts the
object.
[0249] The object is not able to move by itself, but is moved only
by being pushed by the agent.
[0250] FIG. 19 is a diagram illustrating a state transition model
when the object moving task is performed.
[0251] In FIG. 19, in regard to the object moving task, a small
area obtained by dividing the action environment into small areas
is employed as a state of a state transition model for each action
in the same manner as in the simple movement task.
[0252] However, in regard to the object moving task, there are an
agent state S(agt) and an object state S(obj) as a state of a state
transition model for each action.
[0253] In addition, in the object moving task, the agent observes
the current position thereof as an observation value, and can
recognize the current state thereof based on the current position
thereof, in the same manner as in the simple movement task.
[0254] Furthermore, in the object moving task, the agent observes
the position of the object as an observation value, and can
recognize the current state of the object based on the current
position of the object.
[0255] In addition, in regard to the object moving task, as a state
transition model P.sub.SS'.sup.U for each action, a state
transition model (hereinafter, also referred to as an agent
transition model) P.sub.S(agt)S(agt)'.sup.U indicating that the
state of the agent is transited to the state S(agt)' by performing
a predetermined action U in the state S(agt), an object transition
model P.sub.S(obj)S(obj)'.sup.U, and an agent-object transition
model P.sub.S(agt)S(obj)'.sup.U are stored in the model storage
unit 13.
[0256] Herein, the object transition model
P.sub.S(obj)S(obj)'.sup.U indicates that the state of the object is
transited from a state S(obj) to a state S(obj)' by the agent
performing a predetermined action U.
[0257] In addition, the agent-object transition model
P.sub.S(agt)S(obj)'.sup.U indicates that the state of the object
is transited to the state S(obj)' by the agent performing a
predetermined action U in the state S(agt).
[0258] As the object transition model P.sub.S(obj)S(obj)'.sup.U, a
frequency (or a transition probability) that the state of the
object is transited to the state S(obj)' by the agent performing a
predetermined action U when the state of the object is the state
S(obj) can be employed, in the same manner as in, for example, the
agent transition model P.sub.S(agt)S(agt)'.sup.U.
[0259] Furthermore, as the agent-object transition model
P.sub.S(agt)S(obj)'.sup.U, a frequency (or a transition
probability) that the state of the object is transited to the state
S(obj)' by the agent performing a predetermined action U in the
state S(agt) can be employed, in the same manner as in, for
example, the agent transition model P.sub.S(agt)S(agt)'.sup.U.
[0260] In the object moving task, a target state is set for the
object, and actions of the agent are controlled based on the agent
transition model P.sub.S(agt)S(agt)'.sup.U, the object transition
model P.sub.S(obj)S(obj)'.sup.U, and the agent-object transition
model P.sub.S(agt)S(obj)'.sup.U.
[Learning Process in Object Moving Task]
[0261] FIG. 20 is a flowchart explaining a learning process
performed by the learning unit 12 in the object moving task.
[0262] Furthermore, the learning process of FIG. 20 is performed
all the time while the agent performs actions, in the same manner
as in, for example, the learning process of FIG. 6.
[0263] In Step S61, the learning unit 12 awaits the output of an
action signal U from the action control unit 14, acquires
(receives) the action signal U, and recognizes an action U of the
agent performed based on the action signal U, and the process
advances to Step S62.
[0264] In Step S62, the learning unit 12 acquires an observation
value observed in the sensor 11 after the agent performed the
action U corresponding to the action signal U previously output by
the action control unit 14, and the process advances to Step S63.
[0265] In Step S63, the learning unit 12 recognizes the current
state of the agent S(agt)' and the current state of the object
S(obj)' based on the observation value from the sensor 11, and the
process advances to Step S64.
[0266] In Step S64, the learning unit 12 updates the agent
transition model P.sub.S(agt)S(agt)'.sup.U, the object transition
model P.sub.S(obj)S(obj)'.sup.U, and the agent-object transition
model P.sub.S(agt)S(obj)'.sup.U stored in the model storage unit 13
based on the current state of the agent S(agt)', the previous state
S(agt) that is the current state one time before, the current state
of the object S(obj)', and the previous state S(obj) that is the
current state one time before.
[0267] In other words, the learning unit 12 updates the agent
transition model P.sub.S(agt)S(agt)'.sup.U by increasing the
frequency as the agent transition model P.sub.S(agt)S(agt)'.sup.U
which indicates that the state of the agent is transited to the
current state S(agt)' by performing the action U of one time before
in the previous state S(agt), by 1.
[0268] Furthermore, the learning unit 12 updates the object
transition model P.sub.S(obj)S(obj)'.sup.U by increasing the
frequency as the object transition model P.sub.S(obj)S(obj)'.sup.U,
which indicates that the state of the object is transited from the
previous state S(obj) to the current state S(obj)' by the agent
performing the action U of one time before, by 1.
[0269] In addition, the learning unit 12 updates the agent-object
transition model P.sub.S(agt)S(obj)'.sup.U by increasing the
frequency as the agent-object transition model
P.sub.S(agt)S(obj)'.sup.U, which indicates that the state of the
object is transited to the current state S(obj)' by the agent
performing the action U of the one time before in the previous
state S(agt) that is the current state one time before, by 1.
[0270] After the updating of the agent transition model
P.sub.S(agt)S(agt)'.sup.U, the object transition model
P.sub.S(obj)S(obj)'.sup.U, and the agent-object transition model
P.sub.S(agt)S(obj)'.sup.U, the process returns from Step S64 to
Step S61, and thereafter, the same process is repeated after
waiting for the output of the action signal from the action control
unit 14.
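A minimal sketch of these updates, assuming each model is kept as a simple frequency table keyed by (source state, action, destination state) and that all variable names are illustrative, is:

```python
from collections import defaultdict

agent_model = defaultdict(int)         # frequencies for P_{S(agt)S(agt)'}^U
object_model = defaultdict(int)        # frequencies for P_{S(obj)S(obj)'}^U
agent_object_model = defaultdict(int)  # frequencies for P_{S(agt)S(obj)'}^U

def update_models(prev_agent, prev_obj, action, cur_agent, cur_obj):
    # Each observed transition increments the corresponding frequency by 1
    # (Step S64).
    agent_model[(prev_agent, action, cur_agent)] += 1
    object_model[(prev_obj, action, cur_obj)] += 1
    agent_object_model[(prev_agent, action, cur_obj)] += 1
```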
[Action Control Process in Object Moving Task]
[0271] FIG. 21 is a flowchart explaining an action control process
for autonomous actions performed by the action control unit 14 (of
FIG. 2) in the object moving task.
[0272] In the object moving task, a state given from the outside
based on, for example, an operation of a user, or the like, is set
to an action target state and actions of the agent are controlled
so as to move toward the action target state in the action control
process for autonomous actions, in the same manner as in the case
of FIG. 15.
[0273] However, a state of the object is set as the action target
state.
[0274] In Step S71, the action control unit 14 sets the state of
the object given from the outside based on, for example, an
operation of a user, or the like, to the action target state, and
the process advances to Step S72.
[0275] For example, if the user performs an operation so that the
state corresponding to the position where the object is to be moved
is set to the target state, the action control unit 14 sets the
state of the object according to the operation of the user to the
action target state.
[0276] In Step S72, the action control unit 14 calculates a state
value V.sub.obj(S(obj)) having the action target state set as a
reference for each state S(obj) of the object transition models
using the object transition model P.sub.S(obj)S(obj)'.sup.U
according to Formula (9), which is the same as the recurrence
formula of the above-described Formula (2), and the process
advances to Step S73.
V_{obj}(S(obj)) \leftarrow \max_U \sum_{S(obj)'} P_{S(obj)S(obj)'}^{U} \left[ R_{S(obj)'} + \gamma V_{obj}(S(obj)') \right]   [Expression 9]
[0277] Herein, in Formula (9), .SIGMA..sub.S(obj)' indicates that
summation for all states S(obj)' of the object is performed, and
max indicates the maximum value among the values before max
obtained for each action U.
[0278] Furthermore, in Formula (9), .gamma. is the same attenuation
constant as in the case of Formula (2).
[0279] In addition, in Formula (9), R.sub.S(obj)' indicates a
constant set for the state of the object S(obj)' (of the transition
destination of state transition). If a constant R.sub.S(obj)' when
the state S(obj)' is the action target state is indicated by
R.sub.goal, and a constant R.sub.S(obj)' when the state S(obj)' is
a state other than the action target state is indicated by
R.sub.other, the constant R.sub.goal is 1 and the constant
R.sub.other is 0.
[0280] In Step S73, the action control unit 14 uses the object
transition model P.sub.S(obj)S(obj)'.sup.U and the state value
V.sub.obj(S(obj)) having the action target state set as a reference
to calculate an action value Q.sub.obj(S(obj),U) for each state
S(obj) of the object transition models and each action U that the
agent can perform according to Formula (10), which is the same as
Formula (3) described above, and the process advances to Step S74.
Q_{obj}(S(obj),U) = \sum_{S(obj)'} P_{S(obj)S(obj)'}^{U} V_{obj}(S(obj)')   [Expression 10]
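A minimal sketch of Formulas (9) and (10), assuming P[(s, u)] maps destination object states to transition probabilities and that gamma, the number of sweeps, and all names are illustrative, is:

```python
def object_state_values(P, states, actions, goal, gamma=0.9, sweeps=100):
    # Formula (9): V_obj(S(obj)) <- max_U sum P [R + gamma V_obj], with
    # R_goal = 1 at the action target state and R_other = 0 elsewhere.
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = max(
                sum(p * ((1.0 if s2 == goal else 0.0) + gamma * V.get(s2, 0.0))
                    for s2, p in P.get((s, u), {}).items())
                for u in actions)
    return V

def object_action_values(P, V, states, actions):
    # Formula (10): Q_obj(S(obj), U) = sum_S(obj)' P V_obj(S(obj)').
    return {(s, u): sum(p * V.get(s2, 0.0)
                        for s2, p in P.get((s, u), {}).items())
            for s in states for u in actions}
```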
[0281] In Step S74, the action control unit 14 awaits the output of
the observation value observed after the action U of the agent from
the sensor 11 to acquire the observation value, and recognizes the
current states of the agent and the object based on the observation
value, and the process advances to Step S75.
[0282] In Step S75, the action control unit 14 obtains an action U*
that gives the maximum value in an action value
Q.sub.obj(S(obj-current),U) for the current state S(obj-current)
based on the action value Q.sub.obj(S(obj-current),U) for the
current state of the object S(obj-current) among action values
Q.sub.obj(S(obj),U) for each state S(obj) of the object transition
models and each action U that the agent can perform, and the
process advances to Step S76.
[0283] In other words, in Step S75, the action U* is obtained
according to Formula (11).
U^{*} = \operatorname*{argmax}_U Q_{obj}(S(obj\text{-}current),U)   [Expression 11]
[0284] Herein, in Formula (11), argmax indicates the action U that
gives the maximum value in the action value
Q.sub.obj(S(obj-current),U) for the current state of the object
S(obj-current).
[0285] In Step S76, the action control unit 14 obtains a state of
the object S(obj*) of which the transition probability (frequency)
P.sub.S(obj-current)S(obj)'.sup.U* indicated by the object
transition model is at the maximum, among states of the object that
are the transition destinations from the current state of the
object S(obj-current) when the agent performs the action U*, and
the process advances to Step S77.
[0286] In other words, in Step S76, the state of the object S(obj*)
is obtained according to Formula (12).
S(obj^{*}) = \operatorname*{argmax}_{S(obj)'} P_{S(obj\text{-}current)S(obj)'}^{U^{*}}   [Expression 12]
[0287] Herein, in Formula (12), argmax indicates the state of the
object S(obj)', which is the transition destination, with the
maximum transition probability P.sub.S(obj-current)S(obj)'.sup.U*
of the state transition from the current state of the object
S(obj-current).
[0288] The state S(obj*), which is the state of the object S(obj)'
as the transition destination obtained based on Formula (12), is
the state with the highest transition probability
P.sub.S(obj-current)S(obj)'.sup.U* among the states of the object
S(obj)' as the transition destinations in the state transition from
the current state of the object S(obj-current) occurring by
performance of the action U* with the highest action value
Q.sub.obj(S(obj-current),U), that is, the state with the highest
possibility as the transition destination in the state transition
of the object occurring by the agent performing the action U*.
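A minimal sketch of Formulas (11) and (12), with Q_obj and P_obj assumed to be dictionaries as in the earlier sketches and ties broken arbitrarily, is:

```python
def best_action_and_destination(Q_obj, P_obj, obj_current, actions):
    # Formula (11): the action U* that maximizes Q_obj for the current object state.
    u_star = max(actions, key=lambda u: Q_obj.get((obj_current, u), 0.0))
    # Formula (12): the destination object state S(obj*) with the largest
    # transition probability (frequency) under U*.
    destinations = P_obj.get((obj_current, u_star), {})
    obj_star = max(destinations, key=destinations.get) if destinations else None
    return u_star, obj_star
```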
[0289] In Step S77, the action control unit 14 calculates a state
value V.sub.agt(S(agt)) having the state of the object S(obj*) set
as a reference for each state S(agt) of the agent by repeatedly
calculating the recurrence formula of Formula (13) by the
predetermined (satisfactory) number of times using the agent-object
transition model P.sub.S(agt)S(obj*).sup.U of which the transition
destination is the state of the object S(obj*) among agent-object
transition models P.sub.S(agt)S(obj)'.sup.U and the agent
transition model P.sub.S(agt)S(agt)'.sup.U, and the process
advances to Step S78.
V_{agt}(S(agt)) \leftarrow \max_U \left\{ P_{S(agt)S(obj^{*})}^{U} + \sum_{S(agt)'} \gamma P_{S(agt)S(agt)'}^{U} V_{agt}(S(agt)') \right\}   [Expression 13]
[0290] Herein, in Formula (13), .SIGMA..sub.S(agt)' indicates that
summation for all states S(agt)' of the agent is performed, and
.gamma. is the same attenuation constant as in the case of Formula
(2).
[0291] The value of the state value V.sub.agt(S(agt)) having the
state of the object S(obj*) set as a reference obtained by Formula
(13) increases for a state S(agt) in which the agent can perform an
action U with a high transition probability (transition probability
indicated by the agent-object transition model)
P.sub.S(agt)S(obj*).sup.U, that is, an action U with which the
state of the object is transited to the state S(obj*) when the
agent performs the action U in the state of the agent S(agt).
[0292] In the state value V.sub.agt(S(agt)) having the state of the
object S(obj*) set as a reference, it can be said that the state
value V.sub.obj(S(obj)) having the action target state set as a
reference obtained according to Formula (9), so to speak,
propagates through the transition probability
P.sub.S(agt)S(obj*).sup.U of the state transition to the state of
the object S(obj*) close to the action target state.
[0293] In Step S78, the action control unit 14 calculates an action
value Q.sub.agt(S(agt),U) for each state S(agt) of the agent
transition models and each action U that the agent can perform
using the agent-object transition model P.sub.S(agt)S(obj*).sup.U
of which the transition destination is the state of the object
S(obj*) among agent-object transition models
P.sub.S(agt)S(obj)'.sup.U, the agent transition model
P.sub.S(agt)S(agt)'.sup.U and the state value V.sub.agt(S(agt))
having the state of the object S(obj*) set as a reference according
to Formula (14), and the process advances to Step S79.
Q_{agt}(S(agt),U) = P_{S(agt)S(obj^{*})}^{U} + \sum_{S(agt)'} P_{S(agt)S(agt)'}^{U} V_{agt}(S(agt)')   [Expression 14]
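A minimal sketch of Formulas (13) and (14), assuming P_ao[(s_agt, u)] gives the probability that the object moves to S(obj*) when action u is performed in agent state s_agt, P_agt[(s_agt, u)] maps destination agent states to probabilities, and gamma and the number of sweeps are illustrative, is:

```python
def agent_state_values(P_ao, P_agt, agent_states, actions, gamma=0.9, sweeps=100):
    # Formula (13): the transition probability toward S(obj*) acts as an
    # immediate reward that is propagated backward over agent states.
    V = {s: 0.0 for s in agent_states}
    for _ in range(sweeps):
        for s in agent_states:
            V[s] = max(
                P_ao.get((s, u), 0.0)
                + gamma * sum(p * V.get(s2, 0.0)
                              for s2, p in P_agt.get((s, u), {}).items())
                for u in actions)
    return V

def agent_action_values(P_ao, P_agt, V, agent_states, actions):
    # Formula (14): Q_agt(S(agt), U) = P_{S(agt)S(obj*)}^U + sum P V_agt.
    return {(s, u): P_ao.get((s, u), 0.0)
                    + sum(p * V.get(s2, 0.0)
                          for s2, p in P_agt.get((s, u), {}).items())
            for s in agent_states for u in actions}
```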
[0294] In Step S79, the action control unit 14 selects an action U
that gives the maximum value in an action value Q.sub.agt(S(agt),U)
for the current state S(agt) as the action U performed by the agent
in the current state S(agt) based on the action value
Q.sub.agt(S(agt),U) of the agent in the current state S(agt) among
action values Q.sub.agt(S(agt),U) for each state S(agt) of the
agent transition models and each action U that the agent can
perform, and outputs an action signal U corresponding thereto, and
the process advances to Step S80.
[0295] Herein, the action signal U output by the action control
unit 14 is supplied to the learning unit 12 and the actuator
15.
[0296] The actuator 15 is driven according to the action signal U
from the action control unit 14, and accordingly, the agent
performs the action U according to the action signal U.
[0297] Furthermore, the learning unit 12 can perform the
above-described learning process (of FIG. 20) using the action
signal U from the action control unit 14 while the action control
process for autonomous actions is performed.
[0298] In Step S80, the action control unit 14 determines whether
or not a new action target state (the state of the object S(obj))
is set.
[0299] In Step S80, when it is determined that the new action
target state is set, that is, for example, when a user performs an
operation so as to change the action target state, and the action
control unit 14 sets the new action target state based on the
operation, the process returns to Step S72, and the action control
unit 14 calculates the state value V.sub.obj(S(obj)) having the new
action target state set as a reference, and thereafter, the same
process is repeated.
[0300] In addition, in Step S80, when it is determined that the new
action target state is not set, the process advances to Step S81,
and the action control unit 14 awaits the output of the observation
value observed after the action U of the agent from the sensor 11
to acquire the observation value.
[0301] Furthermore, the action control unit 14 recognizes the
current states of the agent and the object based on the observation
value from the sensor 11, and the process advances from Step S81 to
Step S82.
[0302] In Step S82, the action control unit 14 determines whether
or not a condition of ending action control to end the action
control process for autonomous actions is satisfied, in the same
manner as in Step S47 of FIG. 15.
[0303] In Step S82, when it is determined that the condition of
ending the action control is not satisfied, the process advances to
Step S83, and the action control unit 14 determines whether or not
the current state of the object is changed from the previous state
of the object to another state (a state other than the previous
state).
[0304] In Step S83, when it is determined that the current state of
the object is changed from the previous state of the object to
another state, that is, when the object is moved by the action of the
agent, and as a result, the state of the object is changed before
and after the action of the agent, the process returns to Step S75,
and the action control unit 14 obtains an action U* that gives the
maximum value in the action value Q.sub.obj(S(obj-current),U) for
the current state of the object S(obj-current) after the change,
and thereafter, the same process is repeated.
[0305] In addition, in Step S83, when it is determined that the
current state of the object is not changed from the previous state
of the object to another state, that is, when the agent acted but the
object is not moved, or when the object is moved by the action of
the agent but the state of the object is not changed before and
after the movement, the process returns to Step S79, and
thereafter, the same process is repeated.
[0306] On the other hand, in Step S82, when it is determined that
the condition of ending the action control is satisfied, the action
control unit 14 ends the action control process for autonomous
actions.
[0307] FIGS. 22 and 23 are flowcharts explaining an action control
process for learning performed by the action control unit 14 (of
FIG. 2) in the object moving task.
[0308] In the action control process for learning of the object
moving task, a learning target state is set so that the object
easily reaches a state that the object has little experience of
reaching or a state that the object has no experience of reaching,
the action of the agent is controlled so that the state of the
object moves toward the learning target state in the same manner as
in the case of FIG. 13, and accordingly, learning of the agent
transition model P.sub.S(agt)S(agt)'.sup.U, the object transition
model P.sub.S(obj)S(obj)'.sup.U, and the agent-object transition
model P.sub.S(agt)S(obj)'.sup.U is efficiently performed in the
learning process of FIG. 20.
[0309] Furthermore, the agent performs innate actions performed in
compliance with rules determined, for example, at random or in
advance before the agent performs the action control process for
learning of FIGS. 22 and 23 for the first time, and performs a
certain degree of learning for the action environment by the
learning process (of FIG. 20) performed between the innate
actions.
[0310] Thus, the agent acquires the agent transition model
P.sub.S(agt)S(agt)'.sup.U, the object transition model
P.sub.S(obj)S(obj)'.sup.U, and the agent-object transition model
P.sub.S(agt)S(obj)'.sup.U, which indicate a frequency other than 0,
within the range of states of the agent and the object that the
agent has reached by the innate actions before performing the
action control process for learning of FIGS. 22 and 23 for the
first time.
[0311] In Step S101, the action control unit 14 awaits the output
of the observation value observed after the agent performed the
action corresponding to the action signal previously output, from
the sensor 11 to acquire the observation value.
[0312] Furthermore, the action control unit 14 recognizes the
current states of the agent and the object based on the observation
value from the sensor 11, and the process advances from Step S101
to Step S102.
[0313] In Step S102, the action control unit 14 calculates a state
value V.sub.obj(S(obj)) having the current state of the object
S(obj-current) set as a reference for each state S(obj) of the
object transition models using the object transition model
P.sub.S(obj)S(obj)'.sup.U, according to the recurrence formula of
Formula (9) described above, and the process advances to Step
S103.
[0314] Herein, in calculation of the state value V.sub.obj(S(obj))
having the current state of the object S(obj-current) set as a
reference according to the recurrence formula of Formula (9), if a
constant R.sub.S(obj)' when the state S(obj)' is the current state
S(obj-current) is indicated by R.sub.current, and a constant
R.sub.S(obj)' when the state S(obj)' is a state other than the
current state S(obj-current) is indicated by R.sub.other, the
constant R.sub.current is 1 and the constant R.sub.other is 0.
[0315] In Step S103, the action control unit 14 calculates an
action value Q.sub.obj(S(obj),U) for each state S(obj) of the
object transition models and each action U that the agent can
perform based on the state value V.sub.obj(S(obj)) having the
current state of the object S(obj-current) set as a reference
according to Formula (10) described above, and the process advances
to Step S104.
[0316] In Step S104, the action control unit 14 obtains a variance
W(S(obj)) of the action value Q.sub.obj(S(obj),U) for each state of
the object S(obj) based on the action value Q.sub.obj(S(obj),U) as
described in Formulas (4) and (5) above, and the process advances
to Step S105.
[0317] In Step S105, the action control unit 14 obtains candidates
of the learning target state, that is, selects states of the object
of which the variance W(S(obj)) of the action value
Q.sub.obj(S(obj),U) is equal to or higher than a predetermined
threshold value as candidates of the learning target state based on
the variance W(S(obj)) of the action value Q.sub.obj(S(obj),U), and
the process advances to Step S106.
[0318] In Step S106, the action control unit 14 obtains, for each
state of the object S(obj), an existence probability T(S) of being
in (or reaching) the state S(obj) from the current state of the
object S(obj-current) by state transitions within a predetermined
number of times, based on the object transition model
P.sub.S(obj)S(obj)'.sup.U stored in the model storage unit 13 by
repeatedly calculating the recurrence formula as described in
Formula (6) above, and the process advances to Step S107.
[0319] In Step S107, the action control unit 14 selects one state
of which the existence probability T(S) is greater than 0 (a
positive value) (a reachable state) from the candidates of the
learning target state, for example, at random, and sets the
selection to the learning target state.
[0320] Then, the process advances from Step S107 to Step S111 of
FIG. 23, and thereafter, the action of the agent is controlled so
that the state of the object is toward the learning target
state.
[0321] In other words, FIG. 23 is a flowchart continuing from FIG.
22.
[0322] In Step S111, the action control unit 14 calculates a state
value V.sub.obj(S(obj)) having the learning target state set as a
reference for each state S(obj) of the object transition models
using the object transition model P.sub.S(obj)S(obj)'.sup.U
according to Formula (9) described above, and the process advances
to Step S112.
[0323] Herein, in calculating the state value V.sub.obj(S(obj))
having the learning target state set as a reference according to
Formula (9), if a constant R.sub.S(obj)' when the state S(obj)' is
the learning target state is indicated by R.sub.goal, and a
constant R.sub.S(obj)' when the state S(obj)' is a state other than
the learning target state is indicated by R.sub.other, the constant
R.sub.goal is 1 and the constant R.sub.other is 0.
[0324] In Step S112, the action control unit 14 calculates an
action value Q.sub.obj(S(obj),U) for each state of the object
transition models S(obj) and each action U that the agent can
perform using the object transition model P.sub.S(obj)S(obj)'.sup.U
and the state value V.sub.obj(S(obj)) having the learning target
state set as a reference according to Formula (10) described above,
and the process advances to Step S113.
[0325] In Step S113, the action control unit 14 obtains an action
U* that gives the maximum value in an action value
Q.sub.obj(S(obj-current),U) for the current state S(obj-current)
based on the action value Q.sub.obj(S(obj-current),U) for the
current state of the object S(obj-current) among action values
Q.sub.obj(S(obj),U) for each state S(obj) of the object transition
models and each action U that the agent can perform, and the
process advances to Step S114.
[0326] In Step S114, the action control unit 14 obtains a state of
the object S(obj*) of which the transition probability (frequency)
P.sub.S(obj-current)S(obj)'.sup.U* indicated by the object
transition model is at the maximum, among states of the object that
are the transition destinations from the current state of the
object S(obj-current) when the agent performs the action U*, and
the process advances to
Step S115.
[0327] In Step S115, the action control unit 14 calculates a state
value V.sub.agt(S(agt)) having the state of the object S(obj*) set
as a reference for each state S(agt) of the agent by repeatedly
calculating the recurrence formula of Formula (13) by the
predetermined (satisfactory) number of times using the agent-object
transition model P.sub.S(agt)S(obj*).sup.U of which the transition
destination is the state of the object S(obj*) among agent-object
transition models P.sub.S(agt)S(obj)'.sup.U and the agent
transition model P.sub.S(agt)S(agt)'.sup.U, and the process
advances to Step S116.
[0328] In Step S116, the action control unit 14 calculates an
action value Q.sub.agt(S(agt),U) for each state S(agt) of the agent
transition models and each action U that the agent can perform
according to the above-described Formula (14) using the
agent-object transition model P.sub.S(agt)S(obj*).sup.U of which
the transition destination is the state of the object S(obj*) among
agent-object transition models P.sub.S(agt)S(obj)'.sup.U, the agent
transition model P.sub.S(agt)S(agt)'.sup.U, and the state value
V.sub.agt(S(agt)) having the state of the object S(obj*) set as a
reference, and the process advances to Step S117.
[0329] In Step S117, the action control unit 14 selects an action U
performed by the agent in the current state S(agt) based on the
action value Q.sub.agt(S(agt),U) for the current state of the agent
S(agt) among action values Q.sub.agt(S(agt),U) for each state
S(agt) of the agent transition models and each action U that the
agent can perform, with the .epsilon.-greedy method or the softmax
method, for example, in the same manner as in Step S30 of FIG. 13,
and outputs an action signal U corresponding thereto.
[0330] Herein, the action signal U output from the action control
unit 14 is supplied to the learning unit 12 and the actuator
15.
[0331] The actuator 15 is driven according to the action signal U
from the action control unit 14, and accordingly, the agent
performs the action U according to the action signal U.
[0332] Furthermore, the learning unit 12 can perform the
above-described learning process (of FIG. 20) using the action
signal U from the action control unit 14 while the action control
process for learning is performed.
[0333] When the agent performs the action U according to the action
signal U, the process advances from Step S117 to Step S118, and the
action control unit 14 awaits the output of the observation value
observed after the action U of the agent from the sensor 11 to
acquire the observation value.
[0334] Furthermore, the action control unit 14 recognizes the
current states of the agent and the object based on the observation
value from the sensor 11, and the process advances from Step S118
to Step S119.
[0335] In Step S119, the action control unit 14 determines whether
or not the current state of the object coincides with the (latest)
learning target state and whether or not a predetermined time t1
has passed after the (latest) learning target state was set.
[0336] In Step S119, when it is determined that the current state
of the object does not coincide with the learning target state and
that the predetermined time t1 has not passed after the learning
target state was set, the process advances to Step S120, and the
action control unit 14 determines whether or not the current state
of the object is changed to another state (a state other than the
previous state) from the previous state of the object.
[0337] In Step S120, when it is determined that the current state
of the object is changed to other state from the previous state of
the object, that is, when the object is moved by the action of the
agent, and as a result, the state of the object is changed before
and after the action of the agent, the process returns to Step
S113, and the action control unit 14 obtains an action U* that
gives the maximum value in the action value
Q.sub.obj(S(obj-current),U) for the current state of the object
S(obj-current) after the change, and thereafter, the same process
is repeated.
[0338] In addition, in Step S120, when it is determined that the
current state of the object is not changed from the previous state
of the object to another state, that is, when the agent acted but the
object is not moved, or when the object is moved by the action of
the agent but the state of the object is not changed before and
after the movement, the process returns to Step S117, and
thereafter, the same process is repeated.
[0339] On the other hand, in Step S119, when it is determined that
the current state of the object coincides with the learning target
state, that is, when the agent reaches the learning target state,
or when it is determined that the predetermined time t1 has passed
after the learning target state was set, that is, when the agent
was not able to reach the learning target state within the
predetermined time t1, the process advances to Step S121, and
the action control unit 14 determines whether or not a condition of
ending action control to end the action control process for
learning is satisfied, in the same manner as in Step S33 of FIG.
13.
[0340] In Step S121, when it is determined that the condition of
ending the action control is not satisfied, the process returns to
Step S102 of FIG. 22, and thereafter, the same process is
repeated.
[0341] In addition, in Step S121, when it is determined that the
condition of ending the action control is satisfied, the action
control unit 14 ends the action control process for learning.
[Other Example of State of State Transition Model]
[0342] In the above, a small area obtained by dividing an action
environment into small areas is employed as the state of the state
transition models (the agent transition model, the object
transition model, and the agent-object transition model)
P.sub.SS'.sup.U, but the state of the state transition models can
also be realized using another model, for example, a latent
variable model such as the GMM (Gaussian Mixture Model), the HMM
(Hidden Markov Model), or the like.
[0343] In other words, as the state of the state transition model
P.sub.SS'.sup.U, for example, the state of the GMM or the HMM can
be employed.
[0344] When the state of the GMM or the HMM is employed as the
state of the state transition model P.sub.SS'.sup.U, an action
value used in selecting an action can be obtained based on a
posterior probability in the action control unit 14.
[0345] In other words, when a small area obtained by dividing the
action environment is employed as the state of the state transition
model P.sub.SS'.sup.U, an action value Q(S,U) for each state S of
the state transition model is obtained as an action value used in
selecting an action, but when the GMM or the HMM is employed as the
state of the state transition model P.sub.SS'.sup.U, an action
value Q(O,U) used for performing an action U can be obtained as an
action value used in selecting an action when an observation value
O is observed in the action control unit 14.
[0346] When the observation value O is observed, the action value
Q(O,U) used for performing the action U can be obtained according
to, for example, Formula (15).
Q(O,U) = \sum_{S} P(U|S)\, P(S|O) = \sum_{S} \left( \sum_{S'} P_{SS'}^{U} V(S') \right) P(S|O)   [Expression 15]
[0347] Herein, in Formula (15), P(S|O) indicates a probability
(posterior probability) of being in the state S when the
observation value O is observed. When the state of the HMM is
employed as the state of the state transition model
P.sub.SS'.sup.U, the probability P(S|O) can be obtained using time
series data of the observation value, that is, time series data O
of observation values observed from a time going back by a
predetermined time from when the latest observation value was
observed, up to the latest time.
[0348] In addition, in Formula (15), P(U|S) is a probability with
which the action U is performed in the state S. Furthermore,
.SIGMA. indicates the summation for the state S', and thus, the
probability P(U|S) is obtained by taking the sum of the product
P.sub.SS'.sup.UV(S') of the transition probability (transition
probability indicated by a state transition model) P.sub.SS'.sup.U
and a state value V(S') of the state S' of the transition
destination for all states S' of the transition destination.
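A minimal sketch of Formula (15), assuming posterior maps states to P(S|O), P[(s, u)] maps destination states to transition probabilities, and V maps states to state values (all names are illustrative), is:

```python
def action_value_from_observation(posterior, P, V, states, action):
    # Q(O, U) = sum_S P(U|S) P(S|O), with P(U|S) = sum_S' P_{SS'}^U V(S').
    q = 0.0
    for s in states:
        p_u_given_s = sum(p * V.get(s2, 0.0)
                          for s2, p in P.get((s, action), {}).items())
        q += p_u_given_s * posterior.get(s, 0.0)
    return q
```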
[0349] Furthermore, when the state of the HMM is employed as the
state of the state transition model P.sub.SS'.sup.U, the transition
probability a.sub.ij of the HMM is extended to a transition
probability a.sub.ij(U) for each action U performed by the agent,
and the transition probability a.sub.ij(U) for each action U can
be used as the transition probability P.sub.SS'.sup.U of Formula
(15).
[0350] Herein, the HMM in which the transition probability a.sub.ij
is extended to the transition probability a.sub.ij(U) for each
action U is referred to as an extended HMM. The extended HMM will
be described later.
[0351] FIG. 24 is a diagram illustrating how the posterior
probability used in obtaining the action value Q(O,U) is controlled
using a so-called temperature parameter .beta. when the state of the
state transition model P.sub.SS'.sup.U is expressed using a latent
variable model.
[0352] When the observation value O is observed, the action value
Q(O,U) used for performing the action U can be obtained according to
Formula (16), instead of Formula (15).
Q(O, U) = \sum_{S} P(U \mid S)\, \frac{P(S \mid O)^{\beta}}{\sum_{S} P(S \mid O)^{\beta}}   [Expression 16]
[0353] Herein, in Formula (16), the temperature parameter .beta. is
a value in the range of 0<.beta..ltoreq.1.
[0354] According to Formula (16), the action value Q(O,U) is
obtained using, as a posterior probability, the value
P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. obtained by raising the
posterior probability P(S|O) of Formula (15) to the power of .beta.
and normalizing the result.
[0355] P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. as a posterior
probability can be controlled by the temperature parameter .beta.,
and thus, according to the temperature parameter .beta., it is
possible to control the ambiguity of whether or not the agent is in
the state S when the observation value O is observed.
[0356] Furthermore, when the temperature parameter .beta. is set to
1, P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. as a posterior
probability is equal to the posterior probability P(S|O) of Formula
(15).
[0357] When the temperature parameter .beta. is set to a value less
than 1, for example, 0.2 or the like, appropriate actions are more
likely to be performed even when the current state is a state in
which the agent has little experience, that is, a state in which the
agent has little experience of taking various actions.
[0358] In other words, in a state corresponding to a circumstance
where the agent contacts the wall in the action environment, for
example, if the agent has a great deal of experience of bumping into
the wall but little experience of taking other actions, the agent is
highly likely to continue to bump into the wall under the action
value Q(O,U) of Formula (15).
[0359] On the other hand, according to the action value Q(O,U) of
Formula (16) with the temperature parameter .beta. set to a value
less than 1, for example, 0.2 or the like, it becomes easier for the
agent to perform an action that the agent has experienced in a state
other than the state corresponding to the circumstance where the
agent contacts the wall in the action environment, for example, in a
state where the agent has a great deal of experience of taking
various actions, and accordingly, it becomes easier for the agent to
perform actions (proper actions) other than the action of bumping
into the wall.
[0360] FIG. 24 shows a posterior probability P(S|O) and a posterior
probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. when the
temperature parameter .beta. is set to 0.2.
[0361] In FIG. 24, in regard to the posterior probability P(S|O),
the posterior probability P(S|O) in a state with insufficient
experience is set to 0.8, and the posterior probability P(S|O) in a
state with sufficient experience is set to 0.1, so the difference
between the posterior probability P(S|O) in a state with
insufficient experience and the posterior probability P(S|O) in a
state with sufficient experience is very large.
[0362] The action value Q(O,U) obtained using such a posterior
probability P(S|O) is strongly affected by a posterior probability
P(S|O) with a high value, that is, the posterior probability P(S|O)
in a state with insufficient experience.
[0363] On the other hand, in regard to the posterior probability
P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta., the posterior
probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. in a state
with insufficient experience is set to 0.4, and the posterior
probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. in a state
with sufficient experience is set to 0.3, so the difference between
the posterior probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta.
in a state with insufficient experience and the posterior
probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. in a state
with sufficient experience is not that great (the ambiguity of being
in each state is great).
[0364] The action value Q(O,U) obtained using such a posterior
probability P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. is affected
by the posterior probabilities P(S|O).sup..beta./.SIGMA.
P(S|O).sup..beta. in a state with insufficient experience and in a
state with sufficient experience to the same degree.
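The effect of the temperature parameter .beta. in Formula (16) can
be checked with the numeric values of FIG. 24. The following is a
minimal sketch that only raises the posterior probability to the
power of .beta. and normalizes it; the function name is
hypothetical.

import numpy as np

def temper_posterior(posterior, beta):
    # raise P(S|O) to the power of beta and normalize (Formula (16))
    p = posterior ** beta
    return p / p.sum()

# posterior of FIG. 24: one state with insufficient experience (0.8)
# and two states with sufficient experience (0.1 each)
posterior = np.array([0.8, 0.1, 0.1])
print(temper_posterior(posterior, beta=1.0))  # [0.8, 0.1, 0.1]
print(temper_posterior(posterior, beta=0.2))  # approx. [0.43, 0.28, 0.28]

The tempered values are close to the 0.4 and 0.3 of FIG. 24, so the
action value Q(O,U) is affected by both kinds of states to roughly
the same degree.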
[0365] Furthermore, the agent may obtain the action value Q(O,U)
using the posterior probability P(S|O) by default, and obtain the
action value Q(O,U) using the posterior probability
P(S|O).sup..beta./.SIGMA. P(S|O).sup..beta. only while the agent is
(highly likely to be) in a state with insufficient experience.
[0366] For example, a user can teach the agent that the agent is in
the state with insufficient experience.
[0367] In addition, when the agent remains in the same state for a
certain time period or longer, it is highly possible that the agent
has insufficient experience of performing an action for transiting
the state to another state. Thus, the agent determines whether or
not it remains in the same state for a certain time period or
longer, and can determine that it is in a state with insufficient
experience while it remains in the same state for a certain time
period or longer, as sketched below.
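The determination described above can be sketched as follows; the
threshold, the function name, and the way the recognized states are
collected are assumptions of the illustration.

def select_beta(recent_states, same_state_threshold=10,
                beta_default=1.0, beta_low=0.2):
    # if the most recently recognized states are all the same for a
    # certain time period or longer, the agent is likely to be in a
    # state with insufficient experience, so the smaller temperature
    # parameter is used in Formula (16)
    tail = recent_states[-same_state_threshold:]
    if len(tail) == same_state_threshold and len(set(tail)) == 1:
        return beta_low
    return beta_default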
[0368] FIG. 25 is a diagram illustrating learning of the GMM
performed when the GMM is employed as the state of the state
transition model P.sub.SS'.sup.U.
[0369] When the GMM is employed as the state of the state transition
model P.sub.SS'.sup.U, learning of the GMM as a state, that is,
learning of the Gaussian distribution as the probability
distribution in which the observation value O is observed in the
GMM, is performed using the observation value O that is a continuous
value observed by the agent.
[0370] Learning data, that is, the observation value O used in
learning of the GMM, is acquired (observed) at the movement
destination of the agent that moves (acts) in the action
environment.
[0371] Therefore, if gravity is set in an action environment in
which the agent performing random actions can move, for example,
upward and downward, there are many opportunities to move to the
lower side of the action environment and few opportunities to move
to the upper side. Thus, a great deal of learning data is acquired
for the lower side of the action environment, but only a little
learning data can be acquired for the upper side of the action
environment. As a result, bias occurs in the density of the learning
data acquired from the action environment (the density of positions
where the observation value O that is the learning data is
observed).
[0372] In other words, learning data is acquired at close positions
in the lower side of the action environment, but learning data is
acquired at scattered positions in the upper side of the action
environment.
[0373] As described above, when there is bias in the density of
learning data acquired from the action environment, bias occurs
also in a state as the GMM obtained by learning using such learning
data (also in a configuration of a model including a plurality of
GMMs indicating the agent acting in the action environment).
[0374] In other words, for the agent performing random actions under
the past action control, the dispersion of the Gaussian distribution
indicating the distribution of the observation value O observed in
each state as the GMM obtained by learning becomes small in states
corresponding to the lower side of the action environment and
becomes great in states corresponding to the upper side of the
action environment, as shown in FIG. 25, according to the bias in
the density of the learning data acquired from the action
environment.
[0375] On the other hand, according to action control by the action
control process for learning (new action control), since the agent
performs actions of movement thoroughly in the action environment
as described in FIG. 13, (the observation value O that will serve
as) learning data is acquired thoroughly from the action
environment.
[0376] As a result, dispersion of Gaussian distribution indicating
distribution of the observation value O observed in the state as
the GMM obtained by learning is uniform in the entire action
environment with little bias (if any), as shown in FIG. 25.
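As an illustration only (this is not the learning algorithm of the
embodiment itself), a GMM can be fitted to position data gathered by
the agent with scikit-learn, for example; the bias described above
then appears directly in the sizes of the fitted covariances. The
file name and the number of components are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

# positions (observation values O) gathered while the agent acts;
# under random actions the lower side of the action environment is
# sampled densely and the upper side sparsely
positions = np.loadtxt('positions.txt')        # shape (num_samples, 2)
gmm = GaussianMixture(n_components=20, covariance_type='full')
gmm.fit(positions)
# covariances are small near the densely sampled lower side and
# large near the sparsely sampled upper side
print(gmm.covariances_)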
[Extended HMM]
[0377] Next, the extended HMM mentioned above will be
described.
[0378] FIG. 26 is a diagram showing an example of an action
environment where the agent of FIG. 2 to which the extended HMM is
applied performs actions.
[0379] In FIG. 26, the action environment is a maze in a
two-dimensional plane, and the agent can move along the white
portion in the drawing as a passage.
[0380] FIGS. 27A and 27B show examples of actions performed by the
agent and observation values observed by the agent in the action
environment.
[0381] The agent assumes the areas divided into square shapes by the
dotted lines in the action environment shown in FIG. 26 as units for
observing observation values (observation units), and performs
actions of moving in units of the observation units.
[0382] FIG. 27A shows types of actions performed by the agent.
[0383] In FIG. 27A, the agent can perform five actions U.sub.1 to
U.sub.5 in total including an action U.sub.1 to move in the upper
(north) direction by observation units, an action U.sub.2 to move
to the right (east) direction by observation units, an action
U.sub.3 to move to the lower (south) direction by observation
units, an action U.sub.4 to move to the left (west) direction by
observation units, and an action U.sub.5 of no movement (doing
nothing).
[0384] FIG. 27B schematically shows types of observation values
observed by the agent with observation units.
[0385] In the present embodiment, the agent observes any one of 15
types of observation values (symbols) O.sub.1 to O.sub.15 in the
observation units.
[0386] An observation value O.sub.1 is observed as an observation
unit with walls in the upper, lower, and left sides and a passage
in the right side, and an observation value O.sub.2 is observed as
an observation unit with walls in the upper, left, and right sides
and a passage in the lower side.
[0387] An observation value O.sub.3 is observed as an observation
unit with walls in the upper and left sides and a passage in the
lower and right side, and an observation value O.sub.4 is observed
as an observation unit with walls in the upper, lower, and right
sides, and a passage in the left side.
[0388] An observation value O.sub.5 is observed as an observation
unit with walls in the upper and lower sides and a passage in the
left and right sides, and an observation value O.sub.6 is observed
as an observation unit with walls in the upper and right sides and
a passage in the lower and left sides.
[0389] An observation value O.sub.7 is observed as an observation
unit with walls in the upper side and a passage in the lower, left,
and right sides, and an observation value O.sub.8 is observed as an
observation unit with walls in the lower, left, and right sides and
a passage in the upper side.
[0390] An observation value O.sub.9 is observed as an observation
unit with walls in the lower and left sides and a passage in the
upper and right sides, and an observation value O.sub.10 is
observed as an observation unit with walls in the left and right
sides and a passage in the lower and upper sides.
[0391] An observation value O.sub.11 is observed as an observation
unit with walls in the left side and a passage in the upper, lower,
and right sides, and an observation value O.sub.12 is observed as
an observation unit with walls in the lower and right sides and a
passage in the upper and left sides.
[0392] An observation value O.sub.13 is observed as an observation
unit with walls in the lower side and a passage in the upper, left,
and right sides, and an observation value O.sub.14 is observed as
an observation unit with walls in the right side and a passage in
the upper, lower, and left sides.
[0393] An observation value O.sub.15 is observed as an observation
unit with a passage of all upper, lower, left, and right sides.
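The correspondence between wall configurations and the 15
observation values described above can be held as a lookup table,
for instance as follows; the tuple ordering (upper, lower, left,
right) is an assumption of this sketch.

# walls around an observation unit, given as a tuple of booleans
# (upper, lower, left, right); True means a wall, False a passage
WALL_TO_SYMBOL = {
    (True,  True,  True,  False): 1,   # O_1: passage only on the right
    (True,  False, True,  True ): 2,   # O_2: passage only on the lower side
    (True,  False, True,  False): 3,
    (True,  True,  False, True ): 4,
    (True,  True,  False, False): 5,
    (True,  False, False, True ): 6,
    (True,  False, False, False): 7,
    (False, True,  True,  True ): 8,
    (False, True,  True,  False): 9,
    (False, False, True,  True ): 10,
    (False, False, True,  False): 11,
    (False, True,  False, True ): 12,
    (False, True,  False, False): 13,
    (False, False, False, True ): 14,
    (False, False, False, False): 15,  # O_15: passage on all four sides
}

symbol = WALL_TO_SYMBOL[(True, False, True, False)]   # O_3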
[0394] Furthermore, herein, both the action U.sub.m (m=1, 2, . . . ,
M, where M is the total number of (types of) actions) and the
observation value O.sub.k (k=1, 2, . . . , K, where K is the total
number of observation values) are discrete values.
[0395] FIG. 28 is a flowchart explaining a learning process
performed by the learning unit 12 in the agent of FIG. 2 to which
the extended HMM is applied.
[0396] In Step S141, the learning unit 12 awaits the output of the
current observation value (observation value of the current time t)
o.sub.t by the sensor 11, which is observed from the action
environment, to acquire the observation value o.sub.t, and the
process advances to Step S142.
[0397] Herein, the observation value o.sub.t of the (current) time
t is any one of 15 observation values O.sub.1 to O.sub.15 shown in
FIG. 27B in the embodiment.
[0398] In Step S142, the learning unit 12 awaits the output of an
action signal u.sub.t for an action u.sub.t, which is selected to be
performed at the time t either by action control of the action
control unit 14 (of FIG. 2) using the observation value o.sub.t or
at random, to acquire the action signal u.sub.t, and the process
advances to Step S143.
[0399] Herein, the action u.sub.t of the time t is any one of five
actions U.sub.1 to U.sub.5 shown in FIG. 27A in the embodiment.
[0400] In addition, the actuator 15 (of FIG. 2) is driven according
to the action signal u.sub.t output from the action control unit
14, and accordingly, the agent performs the action u.sub.t.
[0401] In Step S143, the learning unit 12 stores a set of the
observation value o.sub.t of the time t acquired from the sensor 11
and the action signal u.sub.t of the time t acquired from the action
control unit 14 as a learning data set used in learning of the
extended HMM, by adding the set to the history of learning data
sets, and the process advances to Step S144.
[0402] In Step S144, the learning unit 12 determines whether or not
a learning condition for performing learning of the extended HMM is
satisfied.
[0403] Herein, as the learning condition for learning the extended
HMM, addition of a predetermined number of new learning data sets
(learning data sets not used in learning of the extended HMM),
which is one or greater, to the history, or the like can be
employed.
[0404] In Step S144, when it is determined that the learning
condition is not satisfied, the process returns to Step S141, and
the learning unit 12 awaits the output of an observation value
o.sub.t+1 of a time t+1 observed after the agent performed an
action u.sub.t from the sensor 11, to acquire the observation value
o.sub.t+1 output from the sensor, and thereafter, the same process
is repeated.
[0405] In addition, in Step S144, when it is determined that the
learning condition is satisfied, the process advances to Step S145,
and the learning unit 12 performs learning (updating) of the
extended HMM using the learning data sets stored as history.
[0406] Then, after the end of learning of the extended HMM, the
process returns from Step S145 to Step S141, and thereafter, the
same process is repeated.
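The flow of FIG. 28 can be outlined as follows. This is a rough
sketch; sensor, action_control, and learn_extended_hmm stand in for
the sensor 11, the action control unit 14, and the learning of Step
S145, and the learning condition (a fixed number of new learning
data sets) is an assumption.

def action_learning_loop(sensor, action_control, learn_extended_hmm,
                         new_sets_for_learning=100):
    history = []                                 # history of learning data sets
    new_sets = 0
    while True:
        o_t = sensor.observe()                   # Step S141
        u_t = action_control.select_action(o_t)  # Step S142 (or random)
        history.append((o_t, u_t))               # Step S143
        new_sets += 1
        if new_sets >= new_sets_for_learning:    # Step S144
            learn_extended_hmm(history)          # Step S145
            new_sets = 0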
[0407] FIGS. 29A and 29B are diagrams illustrating the extended
HMM.
[0408] In the extended HMM, a (state) transition probability of the
general (past) HMM is extended to a transition probability for each
action performed by the agent.
[0409] In other words, FIG. 29A shows a transition probability of
the general HMM.
[0410] Now, as the HMM for the extended HMM, an ergodic HMM in
which state transition from a state to an arbitrary state is
possible is employed. In addition, the number of states of the HMM
is set to N.
[0411] In the general HMM, a transition probability a.sub.ij of the
N.times.N state transitions from the N states S.sub.i to the N
states S.sub.j is included as a model parameter.
[0412] All transition probabilities of the general HMM can be
expressed by a two-dimensional table in which a transition
probability a.sub.ij of state transition from a state S.sub.i to a
state S.sub.j is arranged in the i-th unit from the top and the
j-th unit from the left.
[0413] Herein, the table of the transition probabilities of the HMM
(including the extended HMM) is also referred to as the transition
probability A.
[0414] FIG. 29B shows the transition probability A of the extended
HMM.
[0415] In the extended HMM, transition probabilities exist for each
action U.sub.m performed by the agent.
[0416] Herein, a transition probability of state transition from a
state S.sub.i to a state S.sub.j for an action U.sub.m is described
also as a.sub.ij(U.sub.m).
[0417] A transition probability a.sub.ij(U.sub.m) indicates a
probability with which state transition from a state S.sub.i to a
state S.sub.j occurs when the agent performs an action U.sub.m.
[0418] All transition probabilities of the extended HMM can be
expressed with a three-dimensional table in which a transition
probability a.sub.ij(U.sub.m) of state transition from a state
S.sub.i to a state S.sub.j for an action U.sub.m is arranged in the
i-th unit from the top, the j-th unit from the left, and the m-th
unit from the front side in the depth direction.
[0419] Herein, in the three-dimensional table of the transition
probability A, the axis of the vertical direction is referred to as
an i-axis, the axis of the horizontal direction as the j-axis, and
the axis of the depth direction as the m-axis, or the action axis,
respectively.
[0420] In addition, a plane, which is obtained by cutting the
three-dimensional table of the transition probability A with a
plane perpendicular to the action axis at a position m of the
action axis and is constituted by transition probabilities a.sub.ij
(U.sub.m), is referred to also as a transition probability plane
for an action U.sub.m.
[0421] Furthermore, a plane, which is obtained by cutting the
three-dimensional table of the transition probability A with a
plane perpendicular to the i-axis at a position I of the i-axis and
is constituted by transition probabilities a.sub.ij(U.sub.m), is
referred to also as an action plane for a state S.sub.I.
[0422] The transition probabilities a.sub.ij(U.sub.m) constituting
an action plane for a state S.sub.I indicate the probability of
performing each action U.sub.m when state transition having the
state S.sub.I as the transition source occurs.
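The three-dimensional table of the transition probability A can be
held, for example, as a NumPy array indexed by (action, transition
source state, transition destination state); the transition
probability plane and the action plane then correspond to simple
slices. The sizes below are only examples.

import numpy as np

N, M = 100, 5                      # numbers of states and of actions
A = np.zeros((M, N, N))            # A[m, i, j] = a_ij(U_{m+1})

# transition probability plane for an action U_m: the cut
# perpendicular to the action axis at position m
plane_for_U1 = A[0]                # shape (N, N)

# action plane for a state S_I: the cut perpendicular to the
# i-axis at position I
action_plane_for_S1 = A[:, 0, :]   # shape (M, N)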
[0423] Furthermore, the extended HMM includes an initial state
probability .pi..sub.i of being in a state S.sub.i at the first
time t=1 and output probability distribution (herein, discrete
probability value) b.sub.i (O.sub.k) that is probability
distribution for observing an observation value O.sub.k in the
state S.sub.i, in addition to transition probabilities a.sub.ij
(U.sub.m) for each action as model parameters, in the same manner
as the general HMM.
[0424] FIG. 30 is a flowchart explaining learning of the extended
HMM performed by the learning unit 12 (of FIG. 2) using the
learning data sets stored as history in Step S145 of FIG. 28.
[0425] In Step S151, the learning unit 12 initializes the extended
HMM.
[0426] In other words, the learning unit 12 initializes the initial
state probability .pi..sub.i, the transition probability
a.sub.ij(U.sub.m) (for each action), and the output probability
distribution b.sub.i(O.sub.k) that are model parameters in the
extended HMM.
[0427] Furthermore, if the number (total number) of states of the
extended HMM is set to N, the initial state probability .pi..sub.i
is initialized to, for example, 1/N. Herein, if the action
environment, which is a maze in a two-dimensional plane, is
constructed from a.times.b observation units for its width and
length respectively, then, with an integer .DELTA. set as a margin,
(a+.DELTA.).times.(b+.DELTA.) can be employed as the number N of
states of the extended HMM.
[0428] In addition, the transition probability a.sub.ij(U.sub.m)
and the output probability distribution b.sub.i(O.sub.k) are
initialized to, for example, random values that can be obtained as
probability values.
[0429] Herein, the initialization of the transition probability
a.sub.ij(U.sub.m) is performed so that the sum of transition
probabilities a.sub.ij(U.sub.m) in each row of the transition
probability plane for each action U.sub.m, which is
(a.sub.i,1(U.sub.m)+a.sub.i,2(U.sub.m)+ . . . +a.sub.i,N(U.sub.m))
is 1.0.
[0430] In the same manner, the initialization of the output
probability distribution b.sub.i(O.sub.k) is performed so that, for
each state S.sub.i, the sum of the output probabilities with which
the observation values O.sub.1, O.sub.2, . . . , O.sub.K are
observed in the state S.sub.i, which is
(b.sub.i(O.sub.1)+b.sub.i(O.sub.2)+ . . . +b.sub.i(O.sub.K)), is
1.0.
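The initialization of Step S151 can be sketched as follows, with N,
M, and K denoting the numbers of states, actions, and observation
values; the uniform initial state probability and the random
row-stochastic matrices follow the description above.

import numpy as np

def init_extended_hmm(N, M, K, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)              # initial state probability 1/N
    a = rng.random((M, N, N))             # a_ij(U_m), random values
    a /= a.sum(axis=2, keepdims=True)     # each row sums to 1.0
    b = rng.random((N, K))                # b_i(O_k), random values
    b /= b.sum(axis=1, keepdims=True)     # each row sums to 1.0
    return pi, a, b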
[0431] Furthermore, when so-called additional learning is performed,
the initial state probability .pi..sub.i, the transition probability
a.sub.ij(U.sub.m), and the output probability distribution
b.sub.i(O.sub.k) of the extended HMM that were obtained in the
learning performed immediately before the additional learning and
stored in the model storage unit 13 are used as initial values
without change, and the initialization of Step S151 is not
performed.
[0432] After Step S151, the process advances to Step S152, and then,
in Step S152 and thereafter, learning of the extended HMM is
performed in which an initial state probability .pi..sub.i, a
transition probability a.sub.ij(U.sub.m) for each action, and an
output probability distribution b.sub.i(O.sub.k) are estimated using
the learning data sets stored as history, according to the
Baum-Welch re-estimation method (more precisely, a method extending
the re-estimation method to actions).
[0433] In other words, in Step S152, the learning unit 12
calculates a forward probability .alpha..sub.t+1(j) and a backward
probability .beta..sub.t(i).
[0434] Herein, in the extended HMM, if an action u.sub.t is
performed at a time t, state transition is performed from the
current state S.sub.i to a state S.sub.j, and at the next time t+1,
an observation value o.sub.t+1 is observed in the state S.sub.j
after the state transition.
[0435] In the extended HMM, the forward probability
.alpha..sub.t+1(j) indicates the probability P(o.sub.1, o.sub.2, . .
. , o.sub.t+1, u.sub.1, u.sub.2, . . . , u.sub.t,
s.sub.t+1=j|.LAMBDA.) of being in a state S.sub.j at the time t+1
when the sequence of action signals of the learning data set stored
as history (action sequence) u.sub.1, u.sub.2, . . . , u.sub.t is
observed and the sequence of observation values (observation value
sequence) o.sub.1, o.sub.2, . . . , o.sub.t+1 is observed, in a
model .LAMBDA. that is the current extended HMM (the extended HMM
specified by the initial state probability .pi..sub.i, the
transition probability a.sub.ij(U.sub.m), and the output probability
distribution b.sub.i(O.sub.k) that are initialized or currently
stored in the model storage unit 13), and is expressed by Formula
(17).
\alpha_{t+1}(j) = P(o_1, o_2, \ldots, o_{t+1}, u_1, u_2, \ldots, u_t, s_{t+1} = j \mid \Lambda) = \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}(u_t)\, b_j(o_{t+1})   [Expression 17]
[0436] Furthermore, a state s.sub.t indicates the state at a time t,
and is any one of the states S.sub.1 to S.sub.N when the number of
states of the extended HMM is N. In addition, the formula
s.sub.t+1=j indicates that the state s.sub.t+1 at the time t+1 is
the state S.sub.j.
[0437] The forward probability .alpha..sub.t+1(j) of Formula (17)
indicates the probability that, when the action sequence u.sub.1,
u.sub.2, . . . , u.sub.t-1 and the observation value sequence
o.sub.1, o.sub.2, . . . , o.sub.t in the learning data set are
observed and the agent is in the state s.sub.t at the time t, state
transition occurs by performing the action u.sub.t and the
observation value o.sub.t+1 is observed in the state S.sub.j at the
time t+1.
[0438] Furthermore, the initial value .alpha..sub.1(j) of the
forward probability .alpha..sub.t+1(j) is expressed by Formula (18).
\alpha_1(j) = \pi_j\, b_j(o_1)   [Expression 18]
[0439] The initial value .alpha..sub.1(j) of Formula (18) indicates
a probability of observing the observation value o.sub.1 in the
state S.sub.j at first (time t=1).
[0440] In addition, in the extended HMM, the backward probability
.beta..sub.t(i) is a probability P of being in a state S.sub.i in
the time t and thereafter, observing an action sequence u.sub.t+1,
u.sub.t+2, . . . , u.sub.T-1 of the learning data set and observing
an observation value sequence o.sub.t+1, o.sub.t+2, . . . ,
o.sub.T, which is P(o.sub.t+1, o.sub.t+2, . . . , o.sub.T,
u.sub.t+1, u.sub.t+2, . . . , u.sub.T-1, s.sub.t=i|.LAMBDA.), in a
model .LAMBDA. that is the current extended HMM, and expressed by
Formula (19).
\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T, u_{t+1}, u_{t+2}, \ldots, u_{T-1}, s_t = i \mid \Lambda) = \sum_{j=1}^{N} a_{ij}(u_t)\, b_j(o_{t+1})\, \beta_{t+1}(j)   [Expression 19]
[0441] Furthermore, T indicates the number of observation values
(sequence length) of the observation sequence in the learning data
set.
[0442] The backward probability .beta..sub.t(i) of Formula (19)
indicates the probability that, when the agent is in a state S.sub.j
at a time t+1 and thereafter the action sequence u.sub.t+1,
u.sub.t+2, . . . , u.sub.T-1 of the learning data set is observed
and the observation value sequence o.sub.t+2, o.sub.t+3, . . . ,
o.sub.T is observed, the state s.sub.t at the time t is the state
S.sub.i; that is, state transition occurs by performing the action
u.sub.t in the state S.sub.i at the time t, so that the state
s.sub.t+1 at the time t+1 becomes the state S.sub.j and the
observation value o.sub.t+1 is observed.
[0443] Furthermore, the initial value .beta..sub.T(i) of the
backward probability .beta..sub.t(i) is expressed by Formula
(20).
\beta_T(i) = 1   [Expression 20]
[0444] The initial value .beta..sub.T(i) of Formula (20) indicates
that the probability of being in the state S.sub.i finally (time
t=T) is 1.0, that is, of necessarily being in the state S.sub.i
finally.
[0445] In the extended HMM, as shown in Formulas (17) and (19),
using a transition probability a.sub.ij(U.sub.m) for each action as
the transition probability from a state S.sub.i to a state S.sub.j
is what differs from the general HMM.
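The forward probability of Formula (17) and the backward probability
of Formula (19) can be computed as in the following sketch, with the
observation values and action signals given as integer index
sequences and the model parameters held in the array layout used
above; for long sequences, scaling or log-space computation would be
needed in practice.

import numpy as np

def forward_backward(pi, a, b, obs, act):
    # pi[i]: initial state probability, a[m, i, j]: a_ij(U_m),
    # b[i, k]: b_i(O_k); obs[t], act[t]: integer index sequences
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * b[:, obs[0]]                      # Formula (18)
    for t in range(T - 1):                            # Formula (17)
        alpha[t + 1] = (alpha[t] @ a[act[t]]) * b[:, obs[t + 1]]
    beta[T - 1] = 1.0                                 # Formula (20)
    for t in range(T - 2, -1, -1):                    # Formula (19)
        beta[t] = a[act[t]] @ (b[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta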
[0446] In Step S152, after the forward probability
.alpha..sub.t+1(j) and the backward probability .beta..sub.t(i) are
calculated, the process advances to Step S153, and the learning
unit 12 re-estimates the initial state probability .pi..sub.i, the
transition probability a.sub.ij (U.sub.m) for each action U.sub.m,
and the output probability distribution b.sub.i(O.sub.k) that are
model parameters .LAMBDA. of the extended HMM using the forward
probability .alpha..sub.t+1(j) and the backward probability
.beta..sub.t(i).
[0447] Herein, the re-estimation of the model parameters is
accompanied by the extension of the transition probability to the
transition probability a.sub.ij(U.sub.m) for each action U.sub.m,
and the Baum-Welch re-estimation method is extended accordingly to
perform the re-estimation.
[0448] In other words, in the model .LAMBDA. of the current extended
HMM, when the action sequence U=u.sub.1, u.sub.2, . . . , u.sub.T-1
and the observation value sequence O=o.sub.1, o.sub.2, . . . ,
o.sub.T are observed, the probability .xi..sub.t+1(i,j,U.sub.m) of
state transition to the state S.sub.j at the time t+1 occurring by
performing the action U.sub.m in the state S.sub.i at the time t is
expressed by Formula (21) using the forward probability
.alpha..sub.t(i) and the backward probability .beta..sub.t+1(j).
\xi_{t+1}(i, j, U_m) = P(s_t = i, s_{t+1} = j, u_t = U_m \mid O, U, \Lambda) = \frac{\alpha_t(i)\, a_{ij}(U_m)\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O, U \mid \Lambda)} \quad (1 \le t \le T-1)   [Expression 21]
[0449] Furthermore, the probability .gamma..sub.t(i,U.sub.m) with
which the action u.sub.t=U.sub.m is performed in the state S.sub.i
at the time t can be calculated as a probability obtained by
marginalizing the probability .xi..sub.t+1(i,j,U.sub.m) with respect
to the state S.sub.j at the time t+1, and is expressed by Formula
(22).
\gamma_t(i, U_m) = P(s_t = i, u_t = U_m \mid O, U, \Lambda) = \sum_{j=1}^{N} \xi_{t+1}(i, j, U_m) \quad (1 \le t \le T-1)   [Expression 22]
[0450] The learning unit 12 performs re-estimation of the model
parameters .LAMBDA. of the extended HMM, using the probability
.xi..sub.t+1(i,j,U.sub.m) of Formula (21), and the probability
.gamma..sub.t(i,U.sub.m) of Formula (22).
[0451] Herein, if a value obtained after re-estimation of the model
parameters .LAMBDA. is indicated, using a single quotation mark ('),
as a model parameter .LAMBDA.', an estimation value .pi.'.sub.i of
the initial state probability that is a model parameter .LAMBDA.' is
obtained according to Formula (23).
\pi'_i = \frac{\alpha_1(i)\, \beta_1(i)}{P(O, U \mid \Lambda)} \quad (1 \le i \le N)   [Expression 23]
[0452] In addition, an estimation value a'.sub.ij(U.sub.m) of a
transition probability for each action that is a model parameter
.LAMBDA.' is obtained according to Formula (24).
a'_{ij}(U_m) = \frac{\sum_{t=1}^{T-1} \xi_{t+1}(i, j, U_m)}{\sum_{t=1}^{T-1} \gamma_t(i, U_m)} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}(U_m)\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}(U_m)\, b_j(o_{t+1})\, \beta_{t+1}(j)}   [Expression 24]
[0453] Herein, the numerator of the estimation value
a'.sub.ij(U.sub.m) of a transition probability in Formula (24)
indicates an expectation value of the number of state transitions
to the state S.sub.j after performing the action u.sub.t=U.sub.m in
the state S.sub.i, and the denominator thereof indicates an
expectation value of the number of state transitions after
performing the action u.sub.t=U.sub.m in the state S.sub.i.
[0454] An estimation value b'.sub.j(O.sub.k) of output probability
distribution that is a model parameter .LAMBDA.' is obtained
according to Formula (25).
b'_j(O_k) = \frac{\sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{m=1}^{M} \xi_{t+1}(i, j, U_m, O_k)}{\sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{m=1}^{M} \xi_{t+1}(i, j, U_m)} = \frac{\sum_{t=1}^{T-1} \alpha_{t+1}(j)\, b_j(O_k)\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_{t+1}(j)\, \beta_{t+1}(j)}   [Expression 25]
[0455] Herein, the numerator of the estimation value
b'.sub.j(O.sub.k) of the output probability distribution in Formula
(25) indicates an expectation value of the number of times the
observation value O.sub.k is observed in the state S.sub.j after the
state transition to the state S.sub.j is performed, and the
denominator thereof indicates an expectation value of the number of
state transitions to the state S.sub.j.
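Under the same assumptions, one re-estimation step corresponding to
Formulas (21) to (25) can be sketched as follows. For the output
probability distribution, the sketch uses the standard Baum-Welch
form that counts the times the observation value O.sub.k is actually
observed in the state S.sub.j; no scaling is included, so the sketch
is only illustrative for short sequences.

import numpy as np

def reestimate(pi, a, b, obs, act, alpha, beta):
    # pi[i], a[m, i, j], b[i, k] as above; alpha, beta from the
    # forward-backward sketch; obs, act: integer index sequences
    T, N = len(obs), len(pi)
    M, K = a.shape[0], b.shape[1]
    obs = np.asarray(obs)
    likelihood = alpha[T - 1].sum()               # P(O, U | Lambda)
    xi = np.zeros((T - 1, M, N, N))               # Formula (21)
    for t in range(T - 1):
        xi[t, act[t]] = (np.outer(alpha[t], b[:, obs[t + 1]] * beta[t + 1])
                         * a[act[t]]) / likelihood
    gamma = xi.sum(axis=3)                        # Formula (22)
    new_pi = alpha[0] * beta[0] / likelihood      # Formula (23)
    new_a = xi.sum(axis=0) / np.maximum(           # Formula (24)
        gamma.sum(axis=0)[:, :, None], 1e-300)
    # Formula (25): count the times the observation value O_k is
    # observed in the state S_j (standard Baum-Welch form)
    post = alpha[1:] * beta[1:]                   # alpha_{t+1}(j) beta_{t+1}(j)
    new_b = np.zeros((N, K))
    for k in range(K):
        new_b[:, k] = post[obs[1:] == k].sum(axis=0)
    new_b /= np.maximum(post.sum(axis=0)[:, None], 1e-300)
    return new_pi, new_a, new_b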
[0456] In Step S153, after re-estimating the estimation values
.pi.'.sub.i, a'.sub.ij(U.sub.m), and b'.sub.j(O.sub.k) of the
initial state probability, the transition probability, and the
output probability distribution, which are the model parameters
.LAMBDA.', the learning unit 12 causes the model storage unit 13 to
store, in the form of overwriting, the estimation value .pi.'.sub.i
as a new initial state probability .pi..sub.i, the estimation value
a'.sub.ij(U.sub.m) as a new transition probability
a.sub.ij(U.sub.m), and the estimation value b'.sub.j(O.sub.k) as a
new output probability distribution b.sub.j(O.sub.k), and the
process advances to Step S154.
[0457] In Step S154, it is determined whether or not the model
parameters of the extended HMM, that is, the (new) initial state
probability .pi..sub.i, transition probability a.sub.ij(U.sub.m),
and output probability distribution b.sub.j(O.sub.k) stored in the
model storage unit 13 converge.
[0458] In Step S154, when it is determined that the model
parameters of the extended HMM have not converged yet, the process
returns to Step S152, and the same process is repeated using the
new initial state probability .pi..sub.i, transition probability
a.sub.ij(U.sub.m), and output probability distribution
b.sub.j(O.sub.k) stored in the model storage unit 13.
[0459] In addition, in Step S154, when it is determined that the
model parameters of the extended HMM have converged, that is, when,
for example, the model parameters of the extended HMM are
substantially unchanged before and after the re-estimation of Step
S153, the process of learning the extended HMM ends.
[0460] As the state transition model P.sub.SS'.sup.U of Formula
(15) (and Formula (16)), a transition probability a.sub.ij(U) for
each action U of the extended HMM obtained by learning as above can
be used, and in that case, the state of the state transition model
P.sub.SS'.sup.U coincides with the state of the extended HMM.
[Description of Computer to which the Disclosure is Applied]
[0461] Next, the series of processes described above can be
performed by hardware, and also performed by software. When the
series of processes is performed by software, a program
constituting the software is installed in a general-purpose
computer, or the like.
[0462] Thus, FIG. 31 shows a configuration example of an embodiment
of a computer in which a program for executing the series of
processes described above is installed.
[0463] The program can be recorded in advance on a hard disk 205 or
a ROM 203 as a recording medium included in the computer.
[0464] Alternatively, the program can be stored (recorded) in a
removable recording medium 211. Such a removable recording medium
211 can be provided as package software. As the removable recording
medium 211 here, for example, a flexible disc, a CD-ROM (Compact
Disc Read Only Memory), an MO (Magneto Optical) disc, a DVD (Digital
Versatile Disc), a magnetic disk, a semiconductor memory, or the
like can be used.
[0465] Furthermore, the program can be installed in the computer
from the removable recording medium 211 as described above,
downloaded to the computer through a communication network, or a
broadcasting network, and installed in the hard disk 205 included
therein. In other words, the program is wirelessly transferred to
the computer from a downloading site through a satellite for
digital satellite broadcasting, or can be transferred by wires to
the computer through a network such as a LAN (Local Area Network),
or the Internet.
[0466] The computer includes a CPU (Central Processing Unit) 202,
and the CPU 202 is connected to an input and output interface 210
through a bus 201.
[0467] If a user inputs a command by operating an input unit 207 or
the like through the input and output interface 210, the CPU 202
executes a program stored in the ROM (Read Only Memory) 203
accordingly. Alternatively, the CPU 202 loads a program stored in
the hard disk 205 into a RAM (Random Access Memory) 204 to execute
it.
[0468] Accordingly, the CPU 202 performs processes according to the
above-described flowcharts, or processes performed by the
configurations of the above-described block diagrams. In addition,
the CPU 202 causes the output unit 206 to output, the communication
unit 208 to transmit, or the hard disk 205 to record the process
results through, for example, the input and output interface 210,
as necessary.
[0469] Furthermore, the input unit 207 includes a keyboard, a
mouse, a microphone, or the like. In addition, the output unit 206
includes an LCD (Liquid Crystal Display), a speaker, or the
like.
[0470] Herein, in the present specification, it is not necessary
that a process performed by a computer based on a program is
performed in time series following the order described as a
flowchart. In other words, a process performed by a computer based
on a program includes a process executed in parallel or
individually (for example, a parallel process or a process by an
object).
[0471] In addition, a program may be processed by one computer
(processor), or processed in a distributed manner by a plurality of
computers. Furthermore, a program may be executed by being
transferred to a remote computer.
[0472] The present disclosure contains subject matter related to
that disclosed in Japanese Priority Patent Application JP
2010-225156 filed in the Japan Patent Office on Oct. 4, 2010, the
entire contents of which are hereby incorporated by reference.
[0473] It should be understood by those skilled in the art that
various modifications, combinations, sub-combinations and
alterations may occur depending on design requirements and other
factors insofar as they are within the scope of the appended claims
or the equivalents thereof.
* * * * *