U.S. patent application number 13/131317 was published by the patent office on 2011-09-22 for a toy exhibiting bonding behavior.
This patent application is currently assigned to STELLENBOSCH UNIVERSITY. Invention is credited to Johan Adam Du Preez, Ludwig Carl Schwardt.
Application Number: 13/131317
Publication Number: 20110230114
Family ID: 42225297
Publication Date: 2011-09-22

United States Patent Application 20110230114
Kind Code: A1
Du Preez; Johan Adam; et al.
September 22, 2011
TOY EXHIBITING BONDING BEHAVIOR
Abstract
A toy capable of exhibiting bonding behavior toward a user and a
method of simulating such behavior. The toy includes input sensors
for receiving interactive input from users, an output apparatus for
communicating with users, a processor and memory containing machine
instructions causing the processor to receive interactive input,
process received input and send control signals to the output
apparatus. The processor classifies received input as either
positive or negative and adjusts an accumulated input, stored in
the memory, in accordance with the classification. The control
signals, in turn, are dependent on the accumulated input.
Inventors: Du Preez; Johan Adam (Stellenbosch, ZA); Schwardt; Ludwig Carl (Newlands, ZA)
Assignee: STELLENBOSCH UNIVERSITY (Western Cape Province, ZA)
Family ID: 42225297
Appl. No.: 13/131317
Filed: November 27, 2009
PCT Filed: November 27, 2009
PCT No.: PCT/IB2009/007585
371 Date: May 26, 2011
Current U.S. Class: 446/175; 446/268
Current CPC Class: A63H 2200/00 20130101; A63H 3/28 20130101
Class at Publication: 446/175; 446/268
International Class: A63H 3/28 20060101 A63H003/28
Foreign Application Data

Date | Code | Application Number
Nov 27, 2008 | ZA | 2008/04571
Mar 5, 2009 | ZA | 2008/08880
Claims
1-18. (canceled)
19. A toy comprising: a body that includes at least one input
sensor for receiving an input from a human user, at least one
output apparatus by means of which the toy interacts with the user,
a processor in communication with the input sensor and the output
apparatus, and a memory in communication with the processor,
wherein the processor is programmed to classify each received input
as either positive or negative, to adjust an accumulated input
stored in the memory in accordance with the classification, and to
send control signals to the output apparatus that are dependent on
the accumulated input, the toy thereby exhibiting increased
bonding behavior in response to a series of predominantly positive
inputs over time, and decreased bonding behavior in response to a
series of predominantly negative inputs over time.
20. The toy according to claim 19, in which the received input
corresponds to human interaction with the toy including one
or more of sound, motion and image.
21. The toy according to claim 20, in which the processor
classifies sound associated with shouting and motion associated
with physical abuse as negative inputs.
22. The toy according to claim 19, which includes at least first
and second input sensors, the first sensor of which is a microphone
configured to detect sound and sound amplitude and the second
sensor of which is an accelerometer configured to detect motion and
acceleration of the toy.
23. The toy according to claim 19, in which the accumulated input
is representative, at least to some degree, of a voice of a
preferred user of the toy.
24. The toy according to claim 22, in which the processor is
programmed to determine a degree of similarity between a voice
input received by means of the microphone and the accumulated
input.
25. The toy according to claim 24, in which the accumulated input
is adjusted to become increasingly representative of a user when
the received input is classified as positive, and to become
less representative of a preferred user or remain unchanged when
the degree of similarity is low or the received input is classified
as negative.
26. The toy according to claim 19, in which the processor is
programmed to classify a voice input received by means of the input
sensor at an amplitude above a predefined maximum voice amplitude
as a negative input, and below the predefined maximum voice
amplitude as a positive input.
27. The toy according to claim 19, in which the processor is
programmed to classify a motion input detected by means of the
input sensor at an acceleration above a predefined maximum
acceleration threshold as a negative input, and below the
predefined maximum acceleration threshold as a positive input.
28. The toy according to claim 19, in which the processor is
programmed to determine a degree of positivity or negativity, as
the case may be, of a received input and to adjust the accumulated
input proportionate to the degree of positivity or negativity.
29. The toy according to claim 23, in which the toy includes timing
means in communication with the processor and in which the
processor is programmed to classify an absence of received input,
for longer than a predefined period of time, as a negative input
and to adjust the accumulated input to become less representative
of the voice of the preferred user in response thereto.
30. The toy according to claim 24, in which the output apparatus
include one or both of a sound transducer and movement actuators
and in which the processor is programmed to send control signals to
the output apparatus more frequently and/or of a higher quality,
when the degree of similarity of a received voice input is high,
and in which the processor is programmed to send control signals to
the output apparatus less frequently and/or of a lower quality,
when the degree of similarity of the received voice input is
low.
31. The toy according to claim 19, in which the accumulated input
comprises a collection of characteristics extracted from a voice
associated with a generic background speaker, each characteristic
having a variable weight associated therewith so that the collection
of weighted characteristics is representative of the voice of a
preferred user.
32. The toy according to claim 31, in which the variable weights
associated with the characteristics are adjusted in order to make
the accumulated input increasingly or less representative of the
voice of the preferred user.
33. The toy according to claim 31, in which the accumulated input
is adjusted to become increasingly representative of the voice of
at least one alternative user as the accumulated input becomes less
representative of the voice of a current preferred user, the
alternative user becoming a new preferred user when the accumulated
input becomes more representative of the voice of the alternative
user than that of the current preferred user.
34. A method of simulating bonding behaviour in a toy toward a
human including the steps of: storing an accumulated input
representative of a preferred user in a memory associated with the
toy, receiving an input from a user by means of at least one input
sensor incorporated in the toy, classifying the input as either
positive or negative, adjusting the accumulated input to become
increasingly representative of the preferred user, in response to a
positive input, and less representative of the preferred user, in
response to a negative input, and issuing control signals to output
apparatus of the toy in response to the input, the control
signals being dependent, at least to some extent, on the
accumulated input.
35. The method according to claim 34, including the steps of
classifying a received voice input above a predefined amplitude as
a negative input, classifying a received motion input outside a
predefined acceleration range as a negative input, and classifying
an absence of received input for longer than a predetermined period
of time as a negative input.
36. The method according to claim 34, including the step of
determining a degree of similarity of a received voice input to
that of a preferred user, and issuing control signals to the output
apparatus of the toy which are proportional to the degree of
similarity.
Description
[0001] This application is a National Stage completion of
PCT/IB2009/007585 filed Nov. 27, 2009, which claims priority from
South African patent application serial no. 2008/08880 filed Mar.
5, 2009 and South African patent application serial no. 2008/04571
filed Nov. 27, 2008.
FIELD OF THE INVENTION
[0002] This invention relates to an interactive toy, more
specifically a doll, capable of exhibiting bonding behaviour
towards natural persons which mimics the bonding that naturally
occurs between a parent and child. The invention extends to a
method for simulating bonding behaviour by a toy towards a natural
person or persons.
BACKGROUND TO THE INVENTION
[0003] Toys, in particular dolls, are owned by people the world
over, and have been for hundreds of years. Children use dolls to
play with, for companionship and also sometimes to invoke a sense
of security. Children, especially young children, often develop a
very strong bond with their dolls, which may even play a part in
the child's development. Dolls are also owned by adults for
numerous reasons, be it as collector's items, for their aesthetic
qualities or emotional attachment.
[0004] Along with technological advances made over the past years,
dolls have developed and have become increasingly sophisticated
and, in fact, more life-like. The inventor is, for example, aware
of dolls that are capable of simulating limited human behaviour,
such as crying, sleeping, talking and even simulating humanly
bodily functions such as eating and excreting bodily waste. The
inventor is furthermore aware that electronic appliances, for
example, microphones, sound transducers, movement actuators and the
like have been incorporated into dolls.
[0005] United States patent application number US2007/0128979,
entitled "Interactive Hi-tech Doll", for example, discloses a doll
which produces human-like facial expressions, recognizes certain
words when they are spoken by humans, and which is able to carry on
a limited conversation with a living person based on certain
pre-defined question and answer scenarios. The doll's recognition
of the spoken words is based on speech and voice recognition
technology controlled by a processor incorporated in the doll, and
allows the doll to be trained to identify the voice of a specific
person, as well as assign a specific role, such as that of its
mother, to the person. The doll is equipped with movement actuators
in its face, allowing movement of its eyes, mouth and cheeks to
exhibit certain pre-defined facial expressions concurrently with
spoken words or separately to simulate human emotions. The limited
conversational skills are based on basic voice and speech
recognition techniques which are widely known in the field. In each
scenario, the doll will ask a pre-recorded question and expect to
receive a specific answer. If it receives the expected answer the
doll reacts favorably and if it receives any unexpected answer, it
reacts less favorably. There is, however, no mention in the
application that the doll has long-term learning capabilities.
Instead, its behavior appears to be governed by a state machine
that responds primarily to the current user input and its built-in
clock.
[0006] It is an object of this invention to provide an interactive
toy, more specifically a doll, capable of simulating bonding
behaviour towards a person, which is an improvement over the prior
art outlined above.
SUMMARY OF THE INVENTION
[0007] In accordance with this invention there is provided a toy
comprising a body that includes at least one input sensor for
receiving an input from a human user, at least one output apparatus
by means of which the toy interacts with the user, a processor in
communication with the input sensor and the output apparatus, and a
memory in communication with the processor, the toy being
characterized in that the processor is programmed to classify each
received input as either positive or negative, to adjust an
accumulated input stored in the memory in accordance with the
classification, and to send control signals to the output apparatus
that are dependent on the accumulated input, the toy thereby
exhibiting increased bonding behaviour in response to a series of
predominantly positive inputs over time, and decreased bonding
behaviour in response to a series of predominantly negative inputs
over time.
[0008] Further features of the invention provide for the received
input to correspond to human interaction with the toy including one
or more of sound, motion and image; for the processor to classify
sound associated with shouting and motion associated with physical
abuse as negative inputs; for the toy to include at least two input
sensors, a first of which is a microphone configured to detect
voice and voice amplitude and a second of which is an accelerometer
configured to detect motion and acceleration of the toy; for the
accumulated input to be representative, at least to some degree, of
the voice of a preferred user of the toy; for the processor to be
programmed to determine a degree of similarity between a voice
input received by the microphone and the accumulated input;
for the accumulated input to be adjusted to become increasingly
representative of a user when the received input is classified as
positive, and for it to become less representative of a preferred
user or remain unchanged when the degree of similarity is low or
the received input is classified as negative; for the processor to
be programmed to classify a received voice input at an amplitude
above a predefined maximum voice amplitude as a negative input, and
below it as a positive input; for the processor to be programmed to
classify a detected motion input at an acceleration above a
predefined maximum acceleration threshold as a negative input, and
below it as a positive input; and for the processor to be
programmed to determine a degree of positivity or negativity, as
the case may be, of a received input and to adjust the accumulated
input proportionate to the degree of positivity or negativity.
[0009] Still further features of the invention provide for the toy
to include timing means connected to the processor and for the
processor to be programmed to classify an absence of received input
for longer than a predefined period of time as negative input and
to adjust the accumulated input to become less representative of
the preferred user in response thereto; and for the output
apparatus to include one or both of a sound transducer and movement
actuators and for the processor to be programmed to send control
signals to the output apparatus more frequently and/or of a higher
quality, when the degree of similarity of a received voice input is
high, and for the processor to be programmed to send control
signals to the output apparatus less frequently and/or of a lower
quality, when the degree of similarity of the received voice input
is low.
[0010] Yet further features of the invention provide for the
accumulated input to comprise a collection of characteristics
extracted from a voice associated with a generic background
speaker, each characteristic having a variable weight associated
therewith so that the collection of weighted characteristics is
representative of the voice of a preferred user; for the weights
associated with the characteristics to be adjusted in order to make
the accumulated input increasingly or less representative of the
voice of the preferred user; and for the accumulated input to be
adjusted to become increasingly representative of the voice of at
least one alternative user as the accumulated input becomes less
representative of the voice of a current preferred user, the
alternative user becoming a new preferred user when the accumulated
input becomes more representative of the voice of the alternative
user than that of the current preferred user.
[0011] The invention also provides a method of simulating bonding
behavior in a toy towards a human including the steps of storing an
accumulated input representative of a preferred user in a memory
associated with the toy, receiving an input from a user by means of
at least one input sensor incorporated in the toy, classifying the
input as either positive or negative, adjusting the accumulated
input to become increasingly representative of the preferred user
in response to a positive input and less representative of the
preferred user in response to a negative input, and issuing control
signals to output apparatus of the toy in response to the input,
the control signals being dependent on the accumulated input.
[0012] Further features of the invention provide for the method to
include the steps of classifying a received voice input above a
predefined amplitude as a negative input, classifying a received
motion input beyond a predefined acceleration range as a negative
input, and classifying an absence of received input for longer than
a predetermined period of time as a negative input; and determining
a degree of similarity of a received voice input to that of a
preferred user and issuing control signals to the output apparatus
of the toy which are proportional to the degree of similarity.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention will now be described, by way of example only
with reference to the accompanying representations in which:
[0014] FIG. 1 is a schematic representation of the internal
components of a toy doll capable of exhibiting bonding behaviour
towards a human being according to a first embodiment of the
invention;
[0015] FIG. 2 is a schematic representation of an alternative
embodiment of the toy doll of FIG. 1; and
[0016] FIG. 3 is a flow diagram showing the macro behaviour of a
toy doll according to the invention.
DETAILED DESCRIPTION WITH REFERENCE TO THE DRAWINGS
[0017] FIG. 1 of the accompanying drawings shows the internal
functional components (10) of a toy doll in accordance
with a first embodiment of the invention. The doll has a body,
not shown in the drawings, as it can take on any number of
appearances, for example those of infants, toddlers, animals or
even toy characters. The components (10) are conveniently located
inside the doll, for example in a chest cavity of the body, where
they are protected by the body. Access may be provided in strategic
positions on the body in order to access certain parts of the
components that may need periodic replacement or maintenance, for
example a power supply or battery pack.
[0018] The components (10) include the following to support the
required behaviour: a digital central processing unit (CPU) (12),
which includes timing means (14), in this example a digital timer,
a storage unit (16) in the form of a non-volatile memory module,
input sensors (18) to detect an input, in this embodiment a
microphone (20) and accelerometer (22), and output apparatus (24)
to communicate with a user. The output apparatus in this embodiment
include a sound transducer (26) and movement actuators (28)
connected to the limbs (not shown) of the toy. It should be
appreciated that the movement actuators (28) can be connected to
any limbs of the toy in order to control their movement. The CPU
(12) is connected to the input sensors (18) and output apparatus
(24) with an input interface (30) and output interface (32),
respectively. The input interface (30) includes an
analog-to-digital (A/D) converter (34) and the output interface
(32) includes a digital-to-analog (D/A) converter (36). Machine
instructions in the form of software (not shown) are stored in the
memory (16) or on additional memory modules (38) to drive the input
interface (30) and output interface (32) and their respective A/D
and D/A converters. The machine instructions also include
instructions causing the CPU to receive input via the input
sensors, process received inputs, and send control signals to the
output apparatus.
[0019] Additional software governing the behaviour of the toy is
also stored in the memory (16) along with an accumulated input
variable in the form of a digital model (not shown) which comprises
a collection of characteristics or properties extracted from the
voice and/or behaviour of users, including a current preferred user
along with a reference for how the characteristics of the preferred
user are to be distinguished from those of other users in general. The
accumulated input is representative, to a variable extent, of the
current preferred user and is stored in the non-volatile memory
module (16). The software further includes voice and speech
recognition functionality and other feature extraction software
allowing the processor to analyse a received input and determine the
degree to which it corresponds to the digital model of the current
preferred user, thus yielding a degree of similarity of the
received voice input to that of the preferred user as represented
by the accumulated input.
[0020] The memory (16) furthermore contains software allowing the
CPU to analyse an input detected by the input sensors (18) and
classify the input as being either positive or negative in nature
and also to assign a degree of positivity or negativity to the
received input. If the interaction with the current user, as
received via the input, is deemed positive, the input is used to
provide further learning of the properties of the current user and
the accumulated input is updated with such further properties. It
will be appreciated that the addition of further properties of the
current user to the accumulated input, insofar as the input is
classified as positive, makes the accumulated input increasingly
representative of the current user and therefore represents an
increasingly strong bond to the current user. If the current user
also closely represents the preferred user, the accumulated input
will become increasingly representative of the preferred user
simulating an increasingly strong bond to it, but if the current
user does not represent the preferred user, the toy will diminish
its bond with the preferred user and increase its bond with the
current user. It is therefore possible for the current user to
become the preferred user by continuous positive interaction with
the toy.
[0021] If the interactions with the toy are deemed negative, then,
to the degree that the current user matches the properties of the
preferred user contained in the accumulated input, an unlearning
process gradually degrades the accumulated input so that it becomes
less representative of the preferred user and more representative
of other users or a general background user.
[0022] The degree of learning or unlearning, as the case may be,
may be proportional to the degree to which the interaction from the
user is classified as positive or negative. The machine
instructions (software) include threshold values for received voice
input amplitude as well as detected motion input acceleration. If
voice is received having an amplitude above the amplitude threshold
value, such voice will be classified as negative input in that it
corresponds to shouting or noise. Acceleration above the maximum
threshold will likewise be classified as negative input in that it
corresponds to physical abuse, throwing or falling. It is also
foreseeable that the software may allow the CPU (12) to identify
standard deviations in pitch patterns of sound inputs as singing
and standard accelerations between predefined minimum and maximum
thresholds as rocking, which will be interpreted as positive
inputs.
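The threshold logic of paragraph [0022] can be sketched as follows. This is an illustrative outline only: the numeric thresholds, units and function names are assumptions, as the application itself fixes no values.

```python
# Hedged sketch of the input classification described above.
# MAX_VOICE_AMPLITUDE and MAX_ACCELERATION are assumed values
# in arbitrary units.

MAX_VOICE_AMPLITUDE = 0.8   # above this, voice is treated as shouting
MAX_ACCELERATION = 3.0      # above this, motion is treated as abuse or falling

def classify_voice(amplitude: float) -> str:
    """Voice above the maximum amplitude corresponds to shouting (negative)."""
    return "negative" if amplitude > MAX_VOICE_AMPLITUDE else "positive"

def classify_motion(acceleration: float) -> str:
    """Acceleration above the maximum threshold corresponds to physical
    abuse, throwing or falling (negative); below it, to normal handling
    or rocking (positive)."""
    return "negative" if acceleration > MAX_ACCELERATION else "positive"
```

A fuller version would additionally test for the singing and rocking bands mentioned above and grade the degree of positivity rather than returning a binary label.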
[0023] To the extent that the interactions from a user are deemed
positive and the characteristics of the current user matches that
of the preferred user closely, in other words there is a high
degree of similarity between the voice of the current user and that
of the preferred user (represented by the accumulated input),
positive responses from the toy, as dictated by instructions sent
to the output apparatus (26) by the CPU (12), will increase, in
frequency and/or in quality. Conversely, if the characteristics of
the current user do not match that of the preferred user, positive
responses from the toy, as dictated by the instructions sent to the
output apparatus (26) by the CPU (12) will decrease in frequency
and/or in quality.
[0024] In addition to the inputs such as speech and motion as
detected by the sensors (18), the software also causes the CPU (12)
to monitor the timer (14) and identify a lack of interaction with
the toy for longer than a specified period. This corresponds to
neglect of the toy and will be classified as negative input and
influence the accumulated input accordingly, resulting in an
unlearning of the preferred user.
[0025] The macro behaviour of the toy can be explained more simply
with reference to the flow diagram shown in FIG. 3. In FIG. 3, when
an input is detected by one of the input sensors (18) at a step
(40), the CPU (12) classifies the input as positive or negative and
measures its degree of positivity or negativity, as the case may
be. The CPU (12) also determines the degree of similarity of the
voice associated with a voice input to that of the preferred user;
in the drawing this step is referred to as the quality of match to
the bonded user. If the input was classified as positive, this is
identified at a step (42), and the CPU (12) is instructed to learn
or reinforce the properties of the current user, by making the
accumulated input increasingly representative of the preferred
user, proportional to the degree of positivity of the received
input at a step (44), after which the CPU (12) sends instructions to
the output apparatus (24), proportional to the degree of similarity
of the current user to the preferred user and the positivity of the
input at a step (46).
[0026] If the input is identified as negative at step (42), the CPU
(12) determines whether the current user is also the current
preferred user or whether the input is identified as neglect at a
step (48). If the current user is not the current preferred user and
the input is also not identified as neglect, the CPU (12) again sends
instructions to the output apparatus (24), proportional to the
degree of similarity of the current user to the preferred user and
the negativity of the input at step (46). If, however, the current
user is identified as the current preferred user or the input is
identified as neglect at step (48), the CPU (12) is instructed to
unlearn the properties of the current user proportional to the
degree of negativity of the input at a step (50), after which the
CPU (12) sends instructions to the output apparatus (24),
proportional to the degree of similarity of the current user to the
preferred user and the negativity of the input at step (46).
[0027] On completion of the instructions sent to the output
apparatus at step (46), the CPU (12) waits for the next input to be
received or for the timer to indicate a lack of interaction.
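The learn/unlearn branching of FIG. 3 can be condensed into a small update rule. This is a hypothetical sketch: the name `update_bond`, the scalar bond state and the learning rate are all assumptions, the patent's accumulated input being a richer voice model rather than a single number.

```python
# 'bond' stands in for the accumulated input as a scalar in [0, 1];
# positivity is the degree of positivity/negativity in [-1, 1] and
# similarity the quality of match to the bonded user in [0, 1].

def update_bond(bond: float, positivity: float, similarity: float,
                is_neglect: bool = False, rate: float = 0.1) -> float:
    if positivity >= 0:
        # Step (44): learn, proportional to positivity and match quality.
        bond += rate * positivity * similarity
    elif similarity > 0.5 or is_neglect:
        # Step (50): unlearn only when the negative input comes from the
        # preferred user, or when it is neglect (positivity is negative here).
        bond += rate * positivity
    return min(1.0, max(0.0, bond))
```

Negative input from a stranger leaves the bond untouched, matching the branch at step (48) where only the preferred user's negative input, or neglect, triggers unlearning.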
[0028] An alternative embodiment of the invention is shown in FIG.
2. In the figure, like numerals indicate like features to the
embodiment illustrated in FIG. 1. The embodiment of FIG. 2 again
includes a digital central processing unit (CPU) (12), which
includes a digital timer (14), a storage unit (16) in the form of a
non-volatile memory module, and input sensors (18) to detect an
input, in this embodiment a microphone (20) and an accelerometer
(22). This embodiment
additionally includes a digital image recorder (50) which, in this
embodiment, is a digital camera. The embodiment also includes
output apparatus (24) to communicate with the user. The output
apparatus again include a sound transducer (26) and movement
actuators (28) connected to the limbs (not shown) of the toy. The
CPU (12) is connected to the input sensors (18) and output
apparatus (24) with an input interface (30) and output interface
(32), respectively. The input interface (30) includes an
analog-to-digital (A/D) converter (34) and the output interface (32)
includes a digital-to-analog (D/A) converter (36). Machine
instructions in the form of software (not shown) are stored in the
memory (16) or on additional memory modules (38) to drive the input
interface (30) and output interface (32) and their respective A/D
and D/A converters.
[0029] It should be appreciated that in this embodiment of the
invention, the digital camera (50) may be used to periodically
capture an image of a user, for example when interaction from a
user is detected. This image may be used in combination with a
voice recording or separately, to recognise the face of the
preferred user. Complicated image recognition software is available
that may be employed to compare a digital image to an image of the
preferred user stored in the memory (16). As is described above and
further below for voice recognition, the image recognition software
may be used to determine a degree of similarity between an image
taken with the camera (50) of the preferred user, and an image
taken of a current user at a later stage. The control signals sent
by the CPU (12) to the output apparatus (24) may again be dependent
on the degree of similarity between the images of the current user
and that of the preferred user.
[0030] The above description provides a general overview of the
working of the toy. What follows is a more detailed analysis of the
algorithms employed by the software and executed by the CPU (12).
The algorithms, whether implemented in software or hardware (and
possibly not resident in the memory (16)), execute on the CPU (12)
to evaluate the interactions with the current user and, based on
these, change its internal representation (the accumulated input)
of the preferred user as well as determine the nature of its
interactions with the user.
[0031] The input from the user, in this case speech, is sampled
when detected and made available to the CPU in a digital format.
This signal is then digitally processed to determine its relevant
information content. Although various alternatives are possible, in
this embodiment it is sub-divided into a sequence of 30 ms frames
overlapping each other by 50%. Each frame is shaped by a windowing
function, and its power level as well as Mel Frequency Cepstral
Coefficients (MFCCs) are determined (various other analyses such as
RASTA PLP can also be used). This is augmented with the pitch
frequency at that given time. All this information is combined into
a feature vector x(n) which summarises the relevant speech
information for that frame. The index n denotes the specific frame
number where this vector was determined. With the information
available the signal can be divided into silence and speech
segments, for which several implementations are known.
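The framing stage described above might look as follows in outline; the sampling rate, the Hamming window choice and the function names are assumptions, and MFCC and pitch extraction are omitted (a full x(n) would append those coefficients to each frame):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 30) -> np.ndarray:
    """Split a signal into 30 ms frames overlapping by 50%, each shaped
    by a windowing function (here a Hamming window)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop = frame_len // 2                            # 50% overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

def frame_power(frames: np.ndarray) -> np.ndarray:
    """Per-frame power level, one element of the feature vector x(n)."""
    return np.mean(frames ** 2, axis=1)
```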
[0032] Similarly the input obtained from accelerometers can be
collected in another feature vector y(n) summarising the motion of
the toy.
[0033] From x(n) both the signal power (amplitude) as well as the
pitch frequency are known as a function of time. The loudness of
the voice is directly determined from this power. If the loudness
remains between pre-established minimum and maximum thresholds, the
interaction is considered to be positive. The total absence of
voice during a predetermined interval will be considered as neglect
and therefore negative, while the presence of overly loud voice
above the maximum threshold will be considered as shouting and
therefore also negative.
[0034] These aspects can be combined into a quality measure over a
given period, presented as a value -1 ≤ Q ≤ 1, where 0 is
taken as neutral.
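One way the loudness cues might be folded into such a quality measure Q is sketched below. The piecewise mapping is an assumption; the text above only fixes the sign conventions (shouting and neglect negative, in-band loudness positive).

```python
def quality(loudness: float, min_thresh: float, max_thresh: float,
            silent_too_long: bool) -> float:
    """Return a quality measure Q in [-1, 1], with 0 taken as neutral."""
    if silent_too_long:
        return -1.0          # neglect is fully negative
    if loudness > max_thresh:
        return -1.0          # shouting is fully negative
    if loudness < min_thresh:
        return 0.0           # too quiet to judge: neutral
    # Within the band the input is positive, peaking midway between
    # the two thresholds.
    mid = (min_thresh + max_thresh) / 2
    return 1.0 - abs(loudness - mid) / (mid - min_thresh)
```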
[0035] To determine the identity of a speaker, statistical models
are used to describe both the target speaker as well as a generic
background speaker. Although the description here concerns a
particular implementation of modelling speaker characteristics and
using this for determining the match between an unknown speech
sample and a particular speaker, other techniques for doing so are
not excluded. The exact technique or implementation is not critical
to this patent and there are several candidates available from the
broad fields of speaker recognition and machine learning (pattern
recognition) in general. The use of Support Vector Machines (SVM)
or other popular pattern classification approaches can conceivably
also be used instead of what is described here.
[0036] A generic background speaker is represented with a Gaussian
Mixture Model (GMM), referred to here as a Universal Background
Model (UBM). In its most simplified form such a mixture can
collapse to a single Gaussian density, thereby reducing
computational requirements greatly. The UBM is typically
collectively trained from the speech of a large number of
speakers.
[0037] This UBM is then adapted to the speech of a specific target
speaker, in this embodiment the preferred user, via a process such
as Maximum a Posteriori (MAP) adaptation, Maximum-Likelihood Linear
Regression (MLLR), or Maximum-Likelihood Eigendecomposition (MLED).
The trained UBM parameters form a stable initial model estimate,
which is then reweighted in some fashion to more closely resemble
the characteristics of the preferred user. This results in the
preferred speaker model. This approach is discussed in more detail
below.
[0038] Having a UBM and a target speaker model available allows one
to evaluate the closeness of the match of an unknown segment of
speech to the model of the preferred user. This is done by
evaluating the logarithmic score of this speech segment to both the
models of the background speakers (UBM) and the preferred user (as
represented by the accumulated input). The difference between those
scores approximates the log-likelihood-ratio (LLR) score and
directly translates to how well the preferred user matches with the
current speech. Mathematically the LLR score of the nth frame,
s(n), is expressed as:
s(x(n)) = log(f_T(x(n))) - log(f_U(x(n))),
where f denotes either a Gaussian or GMM probability density
function and the subscripts T and U respectively denote the target
and UBM speaker.
[0039] Basing a decision on a single frame is precarious. Typically
N frames are collected before doing so, with N chosen such that it
corresponds to a time duration in the range of 10 to 30 seconds.
The score for such a segment is then given by
s(X) = Σ_{n=0}^{N-1} s(x(n)),
with X = {x(0), . . . , x(N-1)}. A larger score indicates a higher
likelihood that the speech originated from the preferred user (a
high degree of similarity), with a value of zero indicating that
the speech cannot be distinguished from that of the generic
background speaker (a low degree of similarity). Once again there
are several other alternatives for this. Test normalization (TNORM)
is another notable example that replaces the single UBM with a
number of background speaker models.
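The per-frame LLR and the N-frame segment score s(X) can be sketched as follows (illustrative Python using single diagonal-covariance Gaussians for the target and UBM; a full GMM would weight several such densities, and the function names are assumptions):

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance multivariate Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def llr_score(frames, target, ubm):
    """Segment score s(X) = sum over n of
    log f_T(x(n)) - log f_U(x(n)); target and ubm are (mean, var)."""
    return sum(log_gauss_diag(x, *target) - log_gauss_diag(x, *ubm)
               for x in frames)
```

With identical target and UBM models the score is zero, matching the case above where the speech cannot be distinguished from the generic background speaker.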
[0040] A multi-dimensional Gaussian density consists of a
mean/centroid vector m and a covariance matrix C. MAP adaptation of
the Gaussian centroid vector specifically leads to a weighted
combination of the existing prior centroid and the newly observed
target feature vectors, while leaving the covariance matrices
unchanged and intact. This idea is adapted here to allow the system
to learn the characteristics of a recent speaker while
simultaneously also gradually unlearning the characteristics of
earlier speakers in a computationally efficient manner.
[0041] The adaptation of a single target Gaussian centroid is
described first and is later extended to the adaptation of Gaussian
centroids embedded in a GMM. Before first use of the toy, the
target centroid is cloned from the UBM. The preferred user is
therefore indistinguishable from the generic background speaker at
this stage. Therefore
m_T(n) = m_U, for n = -1,
where once again T denotes the target, U denotes the UBM, and the
quantity n denotes the adaptation time step. Note that the target
centroid is a function of time n, whereas the UBM centroid remains
constant. A target feature vector is now observed which is derived
from the speech of a user, which is denoted by x(n). The target
centroid is then adapted using the recursion
m_T(n) = λ·x(n) + (1 - λ)·m_T(n-1),
with λ a small positive constant and n = 0, 1, 2, . . . . This
difference equation represents a digital lowpass filter with a DC
gain of 1. The smaller the value of λ, the more emphasis is
placed on the existing centroid value and the less on the newly
observed feature vector. Therefore λ effectively
controls the length of memory that the system has of past
centroids. The effective length of this memory can be determined by
noting how long it takes for the impulse response of this filter to
subside to about 10% of the original impulse height. The following
table summarizes this:
TABLE 1. Effective memory length for different values of λ. The
duration in minutes is based on 15 ms time steps.

  λ                    10^-3     10^-4      10^-5
  Number of steps       2301     23025     230257
  Number of minutes     0.58       5.8         58
[0042] Therefore, for λ = 10^-5 about one hour of sustained
speech is required to unlearn the previous speaker and bond to a
new preferred speaker. Such a learning rate can be modulated by the
quality of the interaction by setting it as

λ = 10^-5 · (1 + Q)/2.
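The adaptation recursion and its quality-modulated learning rate can be sketched as follows (illustrative Python; function names are assumptions, and the memory-length helper applies the 10% impulse-response criterion used for Table 1):

```python
import math

def adapt_centroid(m_prev, x, lam):
    """m_T(n) = lam*x(n) + (1 - lam)*m_T(n-1): a first-order lowpass
    filter with DC gain 1, i.e. exponential forgetting of old speakers."""
    return [lam * xi + (1 - lam) * mi for xi, mi in zip(x, m_prev)]

def memory_steps(lam, decay=0.1):
    """Number of steps until the filter's impulse response falls to
    `decay` of its initial height, from (1 - lam)^n = decay."""
    return math.ceil(math.log(decay) / math.log(1 - lam))

def quality_modulated_lam(q, base=1e-5):
    """lam = base*(1 + Q)/2: high-quality interaction (Q -> 1) learns
    at the full base rate, negative interaction (Q -> -1) freezes it."""
    return base * (1 + q) / 2
```

For λ = 10^-4 this gives 23025 steps, i.e. the 5.8 minutes of Table 1 at 15 ms per step.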
[0043] A more sophisticated system uses a Gaussian Mixture Model
(GMM), consisting of K Gaussian component models, instead of a
single Gaussian density as discussed above. If the likelihood of
feature vector x(n) given the ith Gaussian component is given by
f.sub.i(x(n)), the likelihood resulting from the GMM will be the
weighted sum
f(x(n)) = Σ_{i=1}^{K} w_i·f_i(x(n)),
with w.sub.i the mixture weights and i=1, 2, . . . , K. When
updating such a model, a target feature vector x(n) will now be
proportionally associated with the various Gaussian components,
instead of entirely with only one Gaussian. These proportionality
constants are known as responsibilities and can be determined
as
r_i(n) = w_i·f_i(x(n)) / Σ_{j=1}^{K} w_j·f_j(x(n)).
[0044] Adaptation of the GMM is correspondingly done by
proportionally using the feature vector to update each of the
Gaussian components. This changes the original updated recursion
to:
m_{T,i}(n) = λ·r_i(n)·x(n) + (1 - λ·r_i(n))·m_{T,i}(n-1).
[0045] Using this method of adaptation will maintain the bonding of
an existing user as long as that user sustains interaction. If,
however, another user starts to interact with the toy, the memory
of the original user will gradually fade and be replaced by that of
the new one, which is precisely the desired behavior.
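The responsibility computation and the per-component centroid update can be sketched as follows (illustrative Python with diagonal-covariance components; the function names and data layout are assumptions):

```python
import math

def responsibilities(x, weights, means, variances):
    """r_i = w_i*f_i(x) / sum_j w_j*f_j(x) for diagonal Gaussians."""
    likes = []
    for w, m, v in zip(weights, means, variances):
        log_f = sum(-0.5 * (math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj)
                    for xj, mj, vj in zip(x, m, v))
        likes.append(w * math.exp(log_f))
    total = sum(likes)
    return [like / total for like in likes]

def adapt_gmm_centroids(means, x, resp, lam):
    """Per-component update
    m_{T,i}(n) = lam*r_i*x(n) + (1 - lam*r_i)*m_{T,i}(n-1)."""
    return [[lam * r * xj + (1 - lam * r) * mj
             for xj, mj in zip(x, m)]
            for m, r in zip(means, resp)]
```

A component with responsibility near zero is left essentially unchanged, so the feature vector is shared proportionally among the Gaussians rather than assigned to only one.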
[0046] When the current preferred user is neglecting interaction
with the toy we also want him/her to fade from the toy's memory, in
other words for the toy to unlearn his/her voice characteristics.
This is achieved by periodically inserting extra feature vectors
x_i = m_{U,i}, originating from the UBM centroids, into the
adaptation process. Their corresponding responsibility constants
should be r_i = w_i.
[0047] This will move the target model away from the
characteristics of the preferred user, and closer to the generic
background speaker. However, the effect of these vectors should be
much less pronounced than that of the true target speaker input
vectors. They should therefore be inserted after roughly every 20
(or more) time frames, making this unlearning process approximately
20 times slower than the learning process. This serves two
purposes. Firstly, the target model is continually being stabilised
towards the UBM, providing some extra robustness against extraneous
environmental noise, and secondly, should the user ignore the toy
for an extended period, the toy will gradually "forget" this
user.
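The periodic unlearning step can be sketched as follows (illustrative Python; the function name and the way the period is tested are assumptions — the text only specifies inserting the UBM centroids with responsibilities r_i = w_i after roughly every 20 frames):

```python
def unlearn_step(target_means, ubm_means, weights, lam, frame_idx, period=20):
    """Every `period` frames, feed the UBM centroids back into the
    adaptation with responsibilities r_i = w_i, nudging the target
    model toward the generic background speaker."""
    if frame_idx % period != 0:
        return target_means      # no unlearning on ordinary frames
    return [[lam * w * uj + (1 - lam * w) * tj
             for uj, tj in zip(u, t)]
            for t, u, w in zip(target_means, ubm_means, weights)]
```

Because the step fires only once per period, the drift toward the UBM runs roughly 20 times more slowly than the learning recursion, as described above.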
[0048] If the preferred user engages in "abusive" behaviour, we
want to rapidly fade that user from the toy's memory. The preferred
user is recognized by a high identification score s(X) and the
presence of abuse is typified by a high negative value of the
interaction quality Q. Their combined presence accelerates the
above unlearning process by immediately applying this procedure,
but with a hugely increased value of
λ = (1/3) · max(0, 2/(1 + e^(-s(X))) - 1).
[0049] This will rapidly move the target model back to the UBM
while still taking into account the uncertainty that the speech
actually arose from the preferred speaker.
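The accelerated unlearning rate can be sketched as follows (illustrative Python; this assumes the reading λ = (1/3)·max(0, 2/(1 + e^(-s(X))) - 1), under which λ is zero for non-positive identification scores and saturates at 1/3 for a highly confident match):

```python
import math

def abuse_lam(s_X):
    """Unlearning rate driven by the identification score s(X):
    zero unless the score is positive, approaching 1/3 as the
    score (and hence confidence in the speaker's identity) grows."""
    return max(0.0, 2.0 / (1.0 + math.exp(-s_X)) - 1.0) / 3.0
```

The sigmoid term 2/(1 + e^(-s(X))) - 1 weights the rate by how certain the system is that the abusive speech really came from the preferred user.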
[0050] To the extent that a) the interactions are deemed positive
and b) the match with the preferred user is strong, positive
interactions from the toy will increase in both frequency and
quality. These are expressed through the spoken responses of the
toy, any facial expression control, as well as the movements made
by its limbs.
[0051] Although the description here concerns particular
implementations for detecting a quiet soothing voice versus
shouting, as well as a soft rocking motion versus throwing or
falling, other implementations for doing so, as well as other types
of gestures to be considered, are not excluded. The exact technique
or implementation is not critical to this patent.
[0052] Furthermore, although not described here, similar processes
can be devised for distinguishing the face of the preferred
individual from that of a generic face representation. One approach
for this is by measuring how the preferred face deviates from the
generic face provided by the first components of an eigenface
representation.
[0053] It should be appreciated that the above description is by
way of example only and that numerous modifications, adaptations,
and other implementations are possible. For example, substitutions,
additions, or modifications may be made to the elements illustrated
in the drawings, and the methods described herein may be modified
by substituting, reordering, or adding stages to the disclosed
methods. Furthermore, any elements described as being of a digital
nature may equally well be implemented with analog circuitry if
appropriate changes are made to the hardware of the toy.
Accordingly, the above detailed description does not limit the
invention.
* * * * *