U.S. patent application number 13/131317 was published by the patent office on 2011-09-22 for a toy exhibiting bonding behavior.
This patent application is currently assigned to STELLENBOSCH UNIVERSITY. Invention is credited to Johan Adam Du Preez, Ludwig Carl Schwardt.
Application Number: 13/131317
Publication Number: 20110230114
Family ID: 42225297
Publication Date: 2011-09-22

United States Patent Application 20110230114
Kind Code: A1
Du Preez; Johan Adam; et al.
September 22, 2011
TOY EXHIBITING BONDING BEHAVIOR
Abstract
A toy capable of exhibiting bonding behavior toward a user and a
method of simulating such behavior. The toy includes input sensors
for receiving interactive input from users, an output apparatus for
communicating with users, a processor and memory containing machine
instructions causing the processor to receive interactive input,
process received input and send control signals to the output
apparatus. The processor classifies received input as either
positive or negative and adjusts an accumulated input, stored in
the memory, in accordance with the classification. The control
signals, in turn, are dependent on the accumulated input.
Inventors: Du Preez; Johan Adam (Stellenbosch, ZA); Schwardt; Ludwig Carl (Newlands, ZA)
Assignee: STELLENBOSCH UNIVERSITY (Western Cape Province, ZA)
Family ID: 42225297
Appl. No.: 13/131317
Filed: November 27, 2009
PCT Filed: November 27, 2009
PCT No.: PCT/IB2009/007585
371 Date: May 26, 2011
Current U.S. Class: 446/175; 446/268
Current CPC Class: A63H 2200/00 20130101; A63H 3/28 20130101
Class at Publication: 446/175; 446/268
International Class: A63H 3/28 20060101 A63H003/28
Foreign Application Data

Date | Code | Application Number
Nov 27, 2008 | ZA | 2008/04571
Mar 5, 2009 | ZA | 2008/08880
Claims
1-18. (canceled)
19. A toy comprising: a body that includes at least one input
sensor for receiving an input from a human user, at least one
output apparatus by means of which the toy interacts with the user,
a processor in communication with the input sensor and the output
apparatus, and a memory in communication with the processor,
wherein the processor is programmed to classify each received input
as either positive or negative, to adjust an accumulated input
stored in the memory in accordance with the classification, and to
send control signals to the output apparatus that are dependent on
the accumulated input, the toy thereby exhibiting increased
bonding behavior in response to a series of predominantly positive
inputs over time, and decreased bonding behavior in response to a
series of predominantly negative inputs over time.
20. The toy according to claim 19, in which the received input
corresponds to human interaction with the toy including one
or more of sound, motion and image.
21. The toy according to claim 20, in which the processor
classifies sound associated with shouting and motion associated
with physical abuse as negative inputs.
22. The toy according to claim 19, which includes at least first
and second input sensors, the first sensor of which is a microphone
configured to detect sound and sound amplitude and the second
sensor of which is an accelerometer configured to detect motion and
acceleration of the toy.
23. The toy according to claim 19, in which the accumulated input
is representative, at least to some degree, of a voice of a
preferred user of the toy.
24. The toy according to claim 22, in which the processor is
programmed to determine a degree of similarity between a voice
input received by means of the microphone and the accumulated
input.
25. The toy according to claim 24, in which the accumulated input
is adjusted to become increasingly representative of a user when
the received input is classified as positive, and to become
less representative of a preferred user or remain unchanged when
the degree of similarity is low or the received input is classified
as negative.
26. The toy according to claim 19, in which the processor is
programmed to classify a voice input received by means of the input
sensor at an amplitude above a predefined maximum voice amplitude
as a negative input, and below the predefined maximum voice
amplitude as a positive input.
27. The toy according to claim 19, in which the processor is
programmed to classify a motion input detected by means of the
input sensor at an acceleration above a predefined maximum
acceleration threshold as a negative input, and below the
predefined maximum acceleration threshold as a positive input.
28. The toy according to claim 19, in which the processor is
programmed to determine a degree of positivity or negativity, as
the case may be, of a received input and to adjust the accumulated
input proportionate to the degree of positivity or negativity.
29. The toy according to claim 23, in which the toy includes timing
means in communication with the processor and in which the
processor is programmed to classify an absence of received input,
for longer than a predefined period of time, as a negative input
and to adjust the accumulated input to become less representative
of the voice of the preferred user in response thereto.
30. The toy according to claim 24, in which the output apparatus
include one or both of a sound transducer and movement actuators
and in which the processor is programmed to send control signals to
the output apparatus more frequently and/or of a higher quality,
when the degree of similarity of a received voice input is high,
and in which the processor is programmed to send control signals to
the output apparatus less frequently and/or of a lower quality,
when the degree of similarity of the received voice input is
low.
31. The toy according to claim 19, in which the accumulated input
comprises a collection of characteristics extracted from a voice
associated with a generic background speaker, each characteristic
having a variable weight associated therewith so that the collection
of weighted characteristics is representative of the voice of a
preferred user.
32. The toy according to claim 31, in which the variable weights
associated with the characteristics are adjusted in order to make
the accumulated input increasingly or less representative of the
voice of the preferred user.
33. The toy according to claim 31, in which the accumulated input
is adjusted to become increasingly representative of the voice of
at least one alternative user as the accumulated input becomes less
representative of the voice of a current preferred user, the
alternative user becoming a new preferred user when the accumulated
input becomes more representative of the voice of the alternative
user than that of the current preferred user.
34. A method of simulating bonding behaviour in a toy toward a
human including the steps of: storing an accumulated input
representative of a preferred user in a memory associated with the
toy, receiving an input from a user by means of at least one input
sensor incorporated in the toy, classifying the input as either
positive or negative, adjusting the accumulated input to become
increasingly representative of the preferred user, in response to a
positive input, and less representative of the preferred user, in
response to a negative input, and issuing control signals to output
apparatus of the toy in response to the input, the control
signals being dependent, at least to some extent, on the
accumulated input.
35. The method according to claim 34, including the steps of
classifying a received voice input above a predefined amplitude as
a negative input, classifying a received motion input outside a
predefined acceleration range as a negative input, and classifying
an absence of received input for longer than a predetermined period
of time as a negative input.
36. The method according to claim 34, including the step of
determining a degree of similarity of a received voice input to
that of a preferred user, and issuing control signals to the output
apparatus of the toy which are proportional to the degree of
similarity.
Description
[0001] This application is a National Stage completion of
PCT/IB2009/007585 filed Nov. 27, 2009, which claims priority from
South African patent application serial no. 2008/08880 filed Mar.
5, 2009 and South African patent application serial no. 2008/04571
filed Nov. 27, 2008.
FIELD OF THE INVENTION
[0002] This invention relates to an interactive toy, more
specifically a doll, capable of exhibiting bonding behaviour
towards natural persons which mimics the bonding that naturally
occurs between a parent and child. The invention extends to a
method for simulating bonding behaviour by a toy towards a natural
person or persons.
BACKGROUND TO THE INVENTION
[0003] Toys, in particular dolls, are owned by people the world
over, and have been for hundreds of years. Children use dolls to
play with, for companionship and also sometimes to invoke a sense
of security. Children, especially young children, often develop a
very strong bond with their dolls, which may even play a part in
the child's development. Dolls are also owned by adults for
numerous reasons, be it as collector's items, for their aesthetic
qualities or emotional attachment.
[0004] Along with technological advances made over the past years,
dolls have developed and have become increasingly sophisticated
and, in fact, more life-like. The inventor is, for example, aware
of dolls that are capable of simulating limited human behaviour,
such as crying, sleeping, talking and even simulating humanly
bodily functions such as eating and excreting bodily waste. The
inventor is furthermore aware that electronic appliances, for
example, microphones, sound transducers, movement actuators and the
like have been incorporated into dolls.
[0005] United States patent application number US2007/0128979,
entitled "Interactive Hi-tech Doll", for example, discloses a doll
which produces human-like facial expressions, recognizes certain
words when they are spoken by humans, and which is able to carry on
a limited conversation with a living person based on certain
pre-defined question and answer scenarios. The doll's recognition
of the spoken words is based on speech and voice recognition
technology controlled by a processor incorporated in the doll, and
allows the doll to be trained to identify the voice of a specific
person, as well as assign a specific role, such as that of its
mother, to the person. The doll is equipped with movement actuators
in its face, allowing movement of its eyes, mouth and cheeks to
exhibit certain pre-defined facial expressions concurrently with
spoken words or separately to simulate human emotions. The limited
conversational skills are based on basic voice and speech
recognition techniques which are widely known in the field. In each
scenario, the doll will ask a pre-recorded question and expect to
receive a specific answer. If it receives the expected answer the
doll reacts favorably and if it receives any unexpected answer, it
reacts less favorably. There is, however, no mention in the
application that the doll has long-term learning capabilities.
Instead, its behavior appears to be governed by a state machine
that responds primarily to the current user input and its built-in
clock.
[0006] It is an object of this invention to provide an interactive
toy, more specifically a doll, capable of simulating bonding
behaviour towards a person, which is an improvement over the prior
art outlined above.
SUMMARY OF THE INVENTION
[0007] In accordance with this invention there is provided a toy
comprising a body that includes at least one input sensor for
receiving an input from a human user, at least one output apparatus
by means of which the toy interacts with the user, a processor in
communication with the input sensor and the output apparatus, and a
memory in communication with the processor, the toy being
characterized in that the processor is programmed to classify each
received input as either positive or negative, to adjust an
accumulated input stored in the memory in accordance with the
classification, and to send control signals to the output apparatus
that are dependent on the accumulated input, the toy thereby
exhibiting increased bonding behaviour in response to a series of
predominantly positive inputs over time, and decreased bonding
behaviour in response to a series of predominantly negative inputs
over time.
[0008] Further features of the invention provide for the received
input to correspond to human interaction with the toy including one
or more of sound, motion and image; for the processor to classify
sound associated with shouting and motion associated with physical
abuse as negative inputs; for the toy to include at least two input
sensors, a first of which is a microphone configured to detect
voice and voice amplitude and a second of which is an accelerometer
configured to detect motion and acceleration of the toy; for the
accumulated input to be representative, at least to some degree, of
the voice of a preferred user of the toy; for the processor to be
programmed to determine a degree of similarity between a voice
input received by the microphone and the accumulated input;
for the accumulated input to be adjusted to become increasingly
representative of a user when the received input is classified as
positive, and for it to become less representative of a preferred
user or remain unchanged when the degree of similarity is low or
the received input is classified as negative; for the processor to
be programmed to classify a received voice input at an amplitude
above a predefined maximum voice amplitude as a negative input, and
below it as a positive input; for the processor to be programmed to
classify a detected motion input at an acceleration above a
predefined maximum acceleration threshold as a negative input, and
below it as a positive input; and for the processor to be
programmed to determine a degree of positivity or negativity, as
the case may be, of a received input and to adjust the accumulated
input proportionate to the degree of positivity or negativity.
[0009] Still further features of the invention provide for the toy
to include timing means connected to the processor and for the
processor to be programmed to classify an absence of received input
for longer than a predefined period of time as negative input and
to adjust the accumulated input to become less representative of
the preferred user in response thereto; and for the output
apparatus to include one or both of a sound transducer and movement
actuators and for the processor to be programmed to send control
signals to the output apparatus more frequently and/or of a higher
quality, when the degree of similarity of a received voice input is
high, and for the processor to be programmed to send control
signals to the output apparatus less frequently and/or of a lower
quality, when the degree of similarity of the received voice input
is low.
[0010] Yet further features of the invention provide for the
accumulated input to comprise a collection of characteristics
extracted from a voice associated with a generic background
speaker, each characteristic having a variable weight associated
therewith so that the collection of weighted characteristics is
representative of the voice of a preferred user; for the weights
associated with the characteristics to be adjusted in order to make
the accumulated input increasingly or less representative of the
voice of the preferred user; and for the accumulated input to be
adjusted to become increasingly representative of the voice of at
least one alternative user as the accumulated input becomes less
representative of the voice of a current preferred user, the
alternative user becoming a new preferred user when the accumulated
input becomes more representative of the voice of the alternative
user than that of the current preferred user.
[0011] The invention also provides a method of simulating bonding
behavior in a toy towards a human including the steps of storing an
accumulated input representative of a preferred user in a memory
associated with the toy, receiving an input from a user by means of
at least one input sensor incorporated in the toy, classifying the
input as either positive or negative, adjusting the accumulated
input to become increasingly representative of the preferred user
in response to a positive input and less representative of the
preferred user in response to a negative input, and issuing control
signals to output apparatus of the toy in response to the input,
the control signals being dependent on the accumulated input.
[0012] Further features of the invention provide for the method to
include the steps of classifying a received voice input above a
predefined amplitude as a negative input, classifying a received
motion input beyond a predefined acceleration range as a negative
input, and classifying an absence of received input for longer than
a predetermined period of time as a negative input; and determining
a degree of similarity of a received voice input to that of a
preferred user and issuing control signals to the output apparatus
of the toy which are proportional to the degree of similarity.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention will now be described, by way of example only
with reference to the accompanying representations in which:
[0014] FIG. 1 is a schematic representation of the internal
components of a toy doll capable of exhibiting bonding behaviour
towards a human being according to a first embodiment of the
invention;
[0015] FIG. 2 is a schematic representation of an alternative
embodiment of the toy doll of FIG. 1; and
[0016] FIG. 3 is a flow diagram showing the macro behaviour of a
toy doll according to the invention.
DETAILED DESCRIPTION WITH REFERENCE TO THE DRAWINGS
[0017] FIG. 1 of the accompanying drawings shows the internal
functional components (10) of a toy doll in accordance
with a first embodiment of the invention. The doll has a body,
not shown in the drawings, as it can take on any number of
appearances, for example those of infants, toddlers, animals or
even toy characters. The components (10) are conveniently located
inside the doll, for example in a chest cavity of the body, where
they are protected by the body. Access may be provided in strategic
positions on the body in order to access certain parts of the
components that may need periodic replacement or maintenance, for
example a power supply or battery pack.
[0018] The components (10) include the following to support the
required behaviour: a digital central processing unit (CPU) (12),
which includes timing means (14), in this example a digital timer,
a storage unit (16) in the form of a non-volatile memory module,
input sensors (18) to detect an input, in this embodiment a
microphone (20) and accelerometer (22), and output apparatus (24)
to communicate with a user. The output apparatus in this embodiment
include a sound transducer (26) and movement actuators (28)
connected to the limbs (not shown) of the toy. It should be
appreciated that the movement actuators (28) can be connected to
any limbs of the toy in order to control their movement. The CPU
(12) is connected to the input sensors (18) and output apparatus
(24) with an input interface (30) and output interface (32),
respectively. The input interface (30) includes an
analog-to-digital (A/D) converter (34) and the output interface
(32) includes a digital-to-analog (D/A) converter (36). Machine
instructions in the form of software (not shown) are stored in the
memory (16) or on additional memory modules (38) to drive the input
interface (30) and output interface (32) and their respective A/D
and D/A converters. The machine instructions also include
instructions causing the CPU to receive input via the input
sensors, process received inputs, and send control signals to the
output apparatus.
[0019] Additional software governing the behaviour of the toy is
also stored in the memory (16) along with an accumulated input
variable in the form of a digital model (not shown) which comprises
a collection of characteristics or properties extracted from the
voice and/or behaviour of users, including a current preferred user
along with a reference for how the characteristics of the preferred
user are to be distinguished from those of other users in general. The
accumulated input is representative, to a variable extent, of the
current preferred user and is stored in the non-volatile memory
module (16). The software further includes voice and speech
recognition functionality and other feature extraction software
allowing the processor to analyse a received input and determine the
degree to which it corresponds to the digital model of the current
preferred user, thus yielding a degree of similarity of the
received voice input to that of the preferred user as represented
by the accumulated input.
[0020] The memory (16) furthermore contains software allowing the
CPU to analyse an input detected by the input sensors (18) and
classify the input as being either positive or negative in nature
and also to assign a degree of positivity or negativity to the
received input. If the interaction with the current user, as
received via the input, is deemed positive, the input is used to
provide further learning of the properties of the current user and
the accumulated input is updated with such further properties. It
will be appreciated that the addition of further properties of the
current user to the accumulated input, insofar as the input is
classified as positive, makes the accumulated input increasingly
representative of the current user and therefore represents an
increasingly strong bond to the current user. If the current user
also closely represents the preferred user, the accumulated input
will become increasingly representative of the preferred user
simulating an increasingly strong bond to it, but if the current
user does not represent the preferred user, the toy will diminish
its bond with the preferred user and increase its bond with the
current user. It is therefore possible for the current user to
become the preferred user by continuous positive interaction with
the toy.
[0021] If the interactions with the toy are deemed negative, then,
to the degree that the current user matches the properties of the
preferred user contained in the accumulated input, an unlearning
process gradually degrades the accumulated input so that it becomes
less representative of the preferred user and more representative
of other users or a general background user.
[0022] The degree of learning or unlearning, as the case may be,
may be proportional to the degree to which the interaction from the
user is classified as positive or negative. The machine
instructions (software) include threshold values for received voice
input amplitude as well as detected motion input acceleration. If
voice is received having an amplitude above the amplitude threshold
value, such voice will be classified as negative input in that it
corresponds to shouting or noise. Acceleration above the maximum
threshold will likewise be classified as negative input in that it
corresponds to physical abuse, throwing or falling. It is also
foreseeable that the software may allow the CPU (12) to identify
standard deviations in pitch patterns of sound inputs as singing
and standard accelerations between predefined minimum and maximum
thresholds as rocking, which will be interpreted as positive
inputs.
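The threshold logic of paragraph [0022] can be sketched as follows. This is an illustrative outline only: the numeric thresholds, units and function names are assumptions, as the application itself fixes no values.

```python
# Hedged sketch of the input classification described above.
# MAX_VOICE_AMPLITUDE and MAX_ACCELERATION are assumed values
# in arbitrary units.

MAX_VOICE_AMPLITUDE = 0.8   # above this, voice is treated as shouting
MAX_ACCELERATION = 3.0      # above this, motion is treated as abuse or falling

def classify_voice(amplitude: float) -> str:
    """Voice above the maximum amplitude corresponds to shouting (negative)."""
    return "negative" if amplitude > MAX_VOICE_AMPLITUDE else "positive"

def classify_motion(acceleration: float) -> str:
    """Acceleration above the maximum threshold corresponds to physical
    abuse, throwing or falling (negative); below it, to normal handling
    or rocking (positive)."""
    return "negative" if acceleration > MAX_ACCELERATION else "positive"
```

A fuller version would additionally test for the singing and rocking bands mentioned above and grade the degree of positivity rather than returning a binary label.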
[0023] To the extent that the interactions from a user are deemed
positive and the characteristics of the current user matches that
of the preferred user closely, in other words there is a high
degree of similarity between the voice of the current user and that
of the preferred user (represented by the accumulated input),
positive responses from the toy, as dictated by instructions sent
to the output apparatus (26) by the CPU (12), will increase, in
frequency and/or in quality. Conversely, if the characteristics of
the current user do not match that of the preferred user, positive
responses from the toy, as dictated by the instructions sent to the
output apparatus (26) by the CPU (12) will decrease in frequency
and/or in quality.
[0024] In addition to the inputs such as speech and motion as
detected by the sensors (18), the software also causes the CPU (12)
to monitor the timer (14) and identify a lack of interaction with
the toy for longer than a specified period. This corresponds to
neglect of the toy and will be classified as negative input and
influence the accumulated input accordingly, resulting in an
unlearning of the preferred user.
[0025] The macro behaviour of the toy can be explained more simply
with reference to the flow diagram shown in FIG. 3. In FIG. 3, when
an input is detected by one of the input sensors (18) at a step
(40), the CPU (12) classifies the input as positive or negative and
measures its degree of positivity or negativity, as the case may
be. The CPU (12) also determines the degree of similarity of the
voice associated with a voice input to that of the preferred user;
in the drawing this step is referred to as the quality of match to
the bonded user. If the input was classified as positive, this is
identified at a step (42), and the CPU (12) is instructed to learn
or reinforce the properties of the current user, by making the
accumulated input increasingly representative of the preferred
user, proportional to the degree of positivity of the received
input at a step (44), after which the CPU (12) sends instructions to
the output apparatus (24), proportional to the degree of similarity
of the current user to the preferred user and the positivity of the
input at a step (46).
[0026] If the input is identified as negative at step (42), the CPU
(12) determines whether the current user is also the current
preferred user or whether the input is identified as neglect at a
step (48). If the current user is not the current preferred user and
the input is also not identified as neglect, the CPU (12) again sends
instructions to the output apparatus (24), proportional to the
degree of similarity of the current user to the preferred user and
the negativity of the input at step (46). If, however, the current
user is identified as the current preferred user or the input is
identified as neglect at step (48), the CPU (12) is instructed to
unlearn the properties of the current user proportional to the
degree of negativity of the input at a step (50), after which the
CPU (12) sends instructions to the output apparatus (24),
proportional to the degree of similarity of the current user to the
preferred user and the negativity of the input at step (46).
[0027] On completion of the instructions sent to the output
apparatus at step (46), the CPU (12) waits for the next input to be
received or for the timer to indicate a lack of interaction.
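The learn/unlearn branching of FIG. 3 can be condensed into a small update rule. This is a hypothetical sketch: the name `update_bond`, the scalar bond state and the learning rate are all assumptions, the patent's accumulated input being a richer voice model rather than a single number.

```python
# 'bond' stands in for the accumulated input as a scalar in [0, 1];
# positivity is the degree of positivity/negativity in [-1, 1] and
# similarity the quality of match to the bonded user in [0, 1].

def update_bond(bond: float, positivity: float, similarity: float,
                is_neglect: bool = False, rate: float = 0.1) -> float:
    if positivity >= 0:
        # Step (44): learn, proportional to positivity and match quality.
        bond += rate * positivity * similarity
    elif similarity > 0.5 or is_neglect:
        # Step (50): unlearn only when the negative input comes from the
        # preferred user, or when it is neglect (positivity is negative here).
        bond += rate * positivity
    return min(1.0, max(0.0, bond))
```

Negative input from a stranger leaves the bond untouched, matching the branch at step (48) where only the preferred user's negative input, or neglect, triggers unlearning.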
[0028] An alternative embodiment of the invention is shown in FIG.
2. In the figure, like numerals indicate like features to the
embodiment illustrated in FIG. 1. The embodiment of FIG. 2 again
includes a digital central processing unit (CPU) (12), which
includes a digital timer (14), a storage unit (16) in the form of a
non-volatile memory module, and input sensors (18) to detect an
input, in this embodiment a microphone (20) and an accelerometer
(22). This embodiment
additionally includes a digital image recorder (50) which, in this
embodiment, is a digital camera. The embodiment also includes
output apparatus (24) to communicate with the user. The output
apparatus again include a sound transducer (26) and movement
actuators (28) connected to the limbs (not shown) of the toy. The
CPU (12) is connected to the input sensors (18) and output
apparatus (24) with an input interface (30) and output interface
(32), respectively. The input interface (30) includes an
analog-to-digital (A/D) converter (34) and the output interface (32)
includes a digital-to-analog (D/A) converter (36). Machine
instructions in the form of software (not shown) are stored in the
memory (16) or on additional memory modules (38) to drive the input
interface (30) and output interface (32) and their respective A/D
and D/A converters.
[0029] It should be appreciated that in this embodiment of the
invention, the digital camera (50) may be used to periodically
capture an image of a user, for example when interaction from a
user is detected. This image may be used in combination with a
voice recording or separately, to recognise the face of the
preferred user. Complicated image recognition software is available
that may be employed to compare a digital image to an image of the
preferred user stored in the memory (16). As is described above and
further below for voice recognition, the image recognition software
may be used to determine a degree of similarity between an image
taken with the camera (50) of the preferred user, and an image
taken of a current user at a later stage. The control signals sent
by the CPU (12) to the output apparatus (24) may again be dependent
on the degree of similarity between the images of the current user
and that of the preferred user.
[0030] The above description provides a general overview of the
working of the toy. What follows is a more detailed analysis of the
algorithms employed by the software and executed by the CPU (12).
The algorithms, whether implemented in software or hardware (and
possibly not resident in the memory (16)), execute on the CPU (12)
to evaluate the interactions with the current user and, based on
these, change its internal representation (the accumulated input)
of the preferred user as well as determine the nature of its
interactions with the user.
[0031] The input from the user, in this case speech, is sampled
when detected and made available to the CPU in a digital format.
This signal is then digitally processed to determine its relevant
information content. Although various alternatives are possible, in
this embodiment it is sub-divided into a sequence of 30 ms frames
overlapping each other by 50%. Each frame is shaped by a windowing
function, and its power level as well as Mel Frequency Cepstral
Coefficients (MFCCs) are determined (various other analyses such as
RASTA PLP can also be used). This is augmented with the pitch
frequency at that given time. All this information is combined into
a feature vector x(n) which summarises the relevant speech
information for that frame. The index n denotes the specific frame
number where this vector was determined. With the information
available the signal can be divided into silence and speech
segments, for which several implementations are known.
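The framing stage described above might look as follows in outline; the sampling rate, the Hamming window choice and the function names are assumptions, and MFCC and pitch extraction are omitted (a full x(n) would append those coefficients to each frame):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 30) -> np.ndarray:
    """Split a signal into 30 ms frames overlapping by 50%, each shaped
    by a windowing function (here a Hamming window)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop = frame_len // 2                            # 50% overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

def frame_power(frames: np.ndarray) -> np.ndarray:
    """Per-frame power level, one element of the feature vector x(n)."""
    return np.mean(frames ** 2, axis=1)
```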
[0032] Similarly the input obtained from accelerometers can be
collected in another feature vector y(n) summarising the motion of
the toy.
[0033] From x(n) both the signal power (amplitude) as well as the
pitch frequency are known as a function of time. The loudness of
the voice is directly determined from this power. If the loudness
remains between pre-established minimum and maximum thresholds, the
interaction is considered to be positive. The total absence of
voice during a predetermined interval will be considered as neglect
and therefore negative, while the presence of overly loud voice
above the maximum threshold will be considered as shouting and
therefore also negative.
[0034] These aspects can be combined into a quality measure over a
given period, presented as a value -1 ≤ Q ≤ 1, where 0 is
taken as neutral.
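One way the loudness cues might be folded into such a quality measure Q is sketched below. The piecewise mapping is an assumption; the text above only fixes the sign conventions (shouting and neglect negative, in-band loudness positive).

```python
def quality(loudness: float, min_thresh: float, max_thresh: float,
            silent_too_long: bool) -> float:
    """Return a quality measure Q in [-1, 1], with 0 taken as neutral."""
    if silent_too_long:
        return -1.0          # neglect is fully negative
    if loudness > max_thresh:
        return -1.0          # shouting is fully negative
    if loudness < min_thresh:
        return 0.0           # too quiet to judge: neutral
    # Within the band the input is positive, peaking midway between
    # the two thresholds.
    mid = (min_thresh + max_thresh) / 2
    return 1.0 - abs(loudness - mid) / (mid - min_thresh)
```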
[0035] To determine the identity of a speaker, statistical models
are used to describe both the target speaker as well as a generic
background speaker. Although the description here concerns a
particular implementation of modelling speaker characteristics and
using this for determining the match between an unknown speech
sample and a particular speaker, other techniques for doing so are
not excluded. The exact technique or implementation is not critical
to this patent and there are several candidates available from the
broad fields of speaker recognition and machine learning (pattern
recognition) in general. The use of Support Vector Machines (SVM)
or other popular pattern classification approaches can conceivably
also be used instead of what is described here.
[0036] A generic background speaker is represented with a Gaussian
Mixture Model (GMM), referred to here as a Universal Background
Model (UBM). In its most simplified form such a mixture can
collapse to a single Gaussian density, thereby reducing
computational requirements greatly. The UBM is typically
collectively trained from the speech of a large number of
speakers.
[0037] This UBM is then adapted to the speech of a specific target
speaker, in this embodiment the preferred user, via a process such
as Maximum a Posteriori (MAP) adaptation, Maximum-Likelihood Linear
Regression (MLLR), or Maximum-Likelihood Eigendecomposition (MLED).
The trained UBM parameters form a stable initial model estimate,
which is then reweighted in some fashion to more closely resemble
the characteristics of the preferred user. This results in the
preferred speaker model. This approach is discussed in more detail
below.
[0038] Having a UBM and a target speaker model available allows one
to evaluate the closeness of the match of an unknown segment of
speech to the model of the preferred user. This is done by
evaluating the logarithmic score of this speech segment to both the
models of the background speakers (UBM) and the preferred user (as
represented by the accumulated input). The difference between those
scores approximates the log-likelihood-ratio (LLR) score and
directly translates to how well the preferred user matches with the
current speech. Mathematically the LLR score of the nth frame,
s(n), is expressed as:
s(x(n)) = log(f_T(x(n))) - log(f_U(x(n))),
where f denotes either a Gaussian or GMM probability density
function and the subscripts T and U respectively denote the target
and UBM speaker.
[0039] Basing a decision on a single frame is precarious. Typically
N frames are collected before doing so, with N chosen such that it
corresponds to a time duration in the range of 10 to 30 seconds.
The score for such a segment is then given by
s(X) = Σ_{n=0}^{N-1} s(x(n)),
with X = {x(0), . . . , x(N-1)}. A larger score indicates a higher
likelihood that the speech originated from the preferred user (a
high degree of similarity), with a value of zero indicating that
the speech cannot be distinguished from that of the generic
background speaker (a low degree of similarity). Once again there
are several other alternatives for this. Test normalization (TNORM)
is another notable example that replaces the single UBM with a
number of background speaker models.
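The per-frame LLR and the N-frame segment score s(X) can be sketched as follows (illustrative Python using single diagonal-covariance Gaussians for the target and UBM; a full GMM would weight several such densities, and the function names are assumptions):

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance multivariate Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def llr_score(frames, target, ubm):
    """Segment score s(X) = sum over n of
    log f_T(x(n)) - log f_U(x(n)); target and ubm are (mean, var)."""
    return sum(log_gauss_diag(x, *target) - log_gauss_diag(x, *ubm)
               for x in frames)
```

With identical target and UBM models the score is zero, matching the case above where the speech cannot be distinguished from the generic background speaker.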
[0040] A multi-dimensional Gaussian density consists of a
mean/centroid vector m and a covariance matrix C. MAP adaptation of
the Gaussian centroid vector specifically leads to a weighted
combination of the existing prior centroid and the newly observed
target feature vectors, while leaving the covariance matrices
unchanged and intact. This idea is adapted here to allow the system
to learn the characteristics of a recent speaker while
simultaneously also gradually unlearning the characteristics of
earlier speakers in a computationally efficient manner.
[0041] The adaptation of a single target Gaussian centroid is
described first and is later extended to the adaptation of Gaussian
centroids embedded in a GMM. Before first use of the toy, the
target centroid is cloned from the UBM. The preferred user is
therefore indistinguishable from the generic background speaker at
this stage. Therefore
m_T(n) = m_U, for n = -1,
where once again T denotes the target, U denotes the UBM, and the
quantity n denotes the adaptation time step. Note that the target
centroid is a function of time n, whereas the UBM centroid remains
constant. A target feature vector is now observed which is derived
from the speech of a user, which is denoted by x(n). The target
centroid is then adapted using the recursion
m_T(n) = λ·x(n) + (1 - λ)·m_T(n-1),
with λ a small positive constant and n = 0, 1, 2, . . . . This
difference equation represents a digital lowpass filter with a DC
gain of 1. The smaller the value of λ, the more emphasis is
placed on the existing centroid value and the less on the newly
observed feature vector. Therefore λ effectively
controls the length of memory that the system has of past
centroids. The effective length of this memory can be determined by
noting how long it takes for the impulse response of this filter to
subside to about 10% of the original impulse height. The following
table summarizes this:
TABLE 1. Effective memory length for different values of λ. The
duration in minutes is based on 15 ms time steps.

  λ                    10^-3     10^-4      10^-5
  Number of steps       2301     23025     230257
  Number of minutes     0.58       5.8         58
[0042] Therefore, for λ = 10^-5 about one hour of sustained
speech is required to unlearn the previous speaker and bond to a
new preferred speaker. Such a learning rate can be modulated by the
quality of the interaction by setting it as

λ = 10^-5 · (1 + Q)/2.
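The adaptation recursion and its quality-modulated learning rate can be sketched as follows (illustrative Python; function names are assumptions, and the memory-length helper applies the 10% impulse-response criterion used for Table 1):

```python
import math

def adapt_centroid(m_prev, x, lam):
    """m_T(n) = lam*x(n) + (1 - lam)*m_T(n-1): a first-order lowpass
    filter with DC gain 1, i.e. exponential forgetting of old speakers."""
    return [lam * xi + (1 - lam) * mi for xi, mi in zip(x, m_prev)]

def memory_steps(lam, decay=0.1):
    """Number of steps until the filter's impulse response falls to
    `decay` of its initial height, from (1 - lam)^n = decay."""
    return math.ceil(math.log(decay) / math.log(1 - lam))

def quality_modulated_lam(q, base=1e-5):
    """lam = base*(1 + Q)/2: high-quality interaction (Q -> 1) learns
    at the full base rate, negative interaction (Q -> -1) freezes it."""
    return base * (1 + q) / 2
```

For λ = 10^-4 this gives 23025 steps, i.e. the 5.8 minutes of Table 1 at 15 ms per step.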
[0043] A more sophisticated system uses a Gaussian Mixture Model
(GMM), consisting of K Gaussian component models, instead of a
single Gaussian density as discussed above. If the likelihood of
feature vector x(n) given the ith Gaussian component is given by
f.sub.i(x(n)), the likelihood resulting from the GMM will be the
weighted sum
f(x(n)) = Σ_{i=1}^{K} w_i·f_i(x(n)),
with w.sub.i the mixture weights and i=1, 2, . . . , K. When
updating such a model, a target feature vector x(n) will now be
proportionally associated with the various Gaussian components,
instead of entirely with only one Gaussian. These proportionality
constants are known as responsibilities and can be determined
as
r_i(n) = w_i·f_i(x(n)) / Σ_{j=1}^{K} w_j·f_j(x(n)).
[0044] Adaptation of the GMM is correspondingly done by
proportionally using the feature vector to update each of the
Gaussian components. This changes the original updated recursion
to:
m_{T,i}(n) = λ·r_i(n)·x(n) + (1 - λ·r_i(n))·m_{T,i}(n-1).
[0045] Using this method of adaptation will maintain the bonding of
an existing user as long as that user sustains interaction. If,
however, another user starts to interact with the toy, the memory
of the original user will gradually fade and be replaced by that of
the new one, which is precisely the desired behavior.
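The responsibility computation and the per-component centroid update can be sketched as follows (illustrative Python with diagonal-covariance components; the function names and data layout are assumptions):

```python
import math

def responsibilities(x, weights, means, variances):
    """r_i = w_i*f_i(x) / sum_j w_j*f_j(x) for diagonal Gaussians."""
    likes = []
    for w, m, v in zip(weights, means, variances):
        log_f = sum(-0.5 * (math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj)
                    for xj, mj, vj in zip(x, m, v))
        likes.append(w * math.exp(log_f))
    total = sum(likes)
    return [like / total for like in likes]

def adapt_gmm_centroids(means, x, resp, lam):
    """Per-component update
    m_{T,i}(n) = lam*r_i*x(n) + (1 - lam*r_i)*m_{T,i}(n-1)."""
    return [[lam * r * xj + (1 - lam * r) * mj
             for xj, mj in zip(x, m)]
            for m, r in zip(means, resp)]
```

A component with responsibility near zero is left essentially unchanged, so the feature vector is shared proportionally among the Gaussians rather than assigned to only one.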
[0046] When the current preferred user is neglecting interaction
with the toy we also want him/her to fade from the toy's memory, in
other words for the toy to unlearn his/her voice characteristics.
This is achieved by periodically inserting extra feature vectors
x_i = m_{U,i}, originating from the UBM centroids, into the
adaptation process. Their corresponding responsibility constants
should be r_i = w_i.
[0047] This will move the target model away from the
characteristics of the preferred user, and closer to the generic
background speaker. However, the effect of these vectors should be
much less pronounced than that of the true target speaker input
vectors. They should therefore be inserted after roughly every 20
(or more) time frames, making this unlearning process approximately
20 times slower than the learning process. This serves two
purposes. Firstly, the target model is continually being stabilised
towards the UBM, providing some extra robustness against extraneous
environmental noise, and secondly, should the user ignore the toy
for an extended period, the toy will gradually "forget" this
user.
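The periodic unlearning step can be sketched as follows (illustrative Python; the function name and the way the period is tested are assumptions — the text only specifies inserting the UBM centroids with responsibilities r_i = w_i after roughly every 20 frames):

```python
def unlearn_step(target_means, ubm_means, weights, lam, frame_idx, period=20):
    """Every `period` frames, feed the UBM centroids back into the
    adaptation with responsibilities r_i = w_i, nudging the target
    model toward the generic background speaker."""
    if frame_idx % period != 0:
        return target_means      # no unlearning on ordinary frames
    return [[lam * w * uj + (1 - lam * w) * tj
             for uj, tj in zip(u, t)]
            for t, u, w in zip(target_means, ubm_means, weights)]
```

Because the step fires only once per period, the drift toward the UBM runs roughly 20 times more slowly than the learning recursion, as described above.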
[0048] If the preferred user engages in "abusive" behaviour, we
want to rapidly fade that user from the toy's memory. The preferred
user is recognized by a high identification score s(X) and the
presence of abuse is typified by a high negative value of the
interaction quality Q. Their combined presence accelerates the
above unlearning process by immediately applying this procedure,
but with a hugely increased value of
λ = (1/3) · max(0, 2/(1 + e^(-s(X))) - 1).
[0049] This will rapidly move the target model back to the UBM
while still taking into account the uncertainty that the speech
actually arose from the preferred speaker.
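The accelerated unlearning rate can be sketched as follows (illustrative Python; this assumes the reading λ = (1/3)·max(0, 2/(1 + e^(-s(X))) - 1), under which λ is zero for non-positive identification scores and saturates at 1/3 for a highly confident match):

```python
import math

def abuse_lam(s_X):
    """Unlearning rate driven by the identification score s(X):
    zero unless the score is positive, approaching 1/3 as the
    score (and hence confidence in the speaker's identity) grows."""
    return max(0.0, 2.0 / (1.0 + math.exp(-s_X)) - 1.0) / 3.0
```

The sigmoid term 2/(1 + e^(-s(X))) - 1 weights the rate by how certain the system is that the abusive speech really came from the preferred user.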
[0050] To the extent that a) the interactions are deemed positive
and b) the match with the preferred user is strong, positive
interactions from the toy will increase in both frequency and
quality. These are expressed through the spoken responses of the
toy, any facial expression control, as well as the movements made
by its limbs.
[0051] Although the description here concerns particular
implementations for detecting a quiet soothing voice versus
shouting, as well as a soft rocking motion versus throwing or
falling, other implementations for doing so, as well as other types
of gestures to be considered, are not excluded. The exact technique
or implementation is not critical to this patent.
[0052] Furthermore, although not described here, similar processes
can be devised for distinguishing the face of the preferred
individual from that of a generic face representation. One approach
for this is by measuring how the preferred face deviates from the
generic face provided by the first components of an eigenface
representation.
[0053] It should be appreciated that the above description is by
way of example only and that numerous modifications, adaptations,
and other implementations are possible. For example, substitutions,
additions, or modifications may be made to the elements illustrated
in the drawings, and the methods described herein may be modified
by substituting, reordering, or adding stages to the disclosed
methods. Furthermore, any elements described as being of a digital
nature may equally well be implemented with analog circuitry if
appropriate changes are made to the hardware of the toy.
Accordingly, the above detailed description does not limit the
invention.
* * * * *