U.S. patent application number 11/291231 was filed with the patent office on November 30, 2005, and published on 2007-05-31, for methods and apparatus for use in speech recognition systems for identifying unknown words and for adding previously unknown words to vocabularies and grammars of speech recognition systems.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Ramesh A. Gopinath, Dimitri Kanevsky, Mahesh Viswanathan.
United States Patent Application 20070124147, Kind Code A1
Gopinath; Ramesh A.; et al.
Published May 31, 2007
Application Number: 11/291231
Family ID: 38088633
Methods and apparatus for use in speech recognition systems for
identifying unknown words and for adding previously unknown words
to vocabularies and grammars of speech recognition systems
Abstract
The present invention concerns methods and apparatus for
identifying and assigning meaning to words not recognized by a
vocabulary or grammar of a speech recognition system. In an
embodiment of the invention, the word may be in an acoustic
vocabulary of the speech recognition system, but may be
unrecognized by an embedded grammar of a language model of the
speech recognition system. In another embodiment of the invention,
the word may not be recognized by any vocabulary associated with
the speech recognition system. In embodiments of the invention, at
least one hypothesis is generated for an utterance not recognized
by the speech recognition system. If the at least one hypothesis
meets at least one predetermined criterion, one or more words
corresponding to the at least one hypothesis is added to the
vocabulary of the speech recognition system. In other embodiments
of the invention, before adding the word to the vocabulary of the
speech recognition system, the at least one hypothesis may be
presented to the user of the speech recognition system to determine
if that is what the user intended when the user spoke.
Inventors: Gopinath; Ramesh A.; (Millwood, NY); Kanevsky; Dimitri; (Ossining, NY); Viswanathan; Mahesh; (Yorktown Heights, NY)
Correspondence Address: HARRINGTON & SMITH, PC, 4 RESEARCH DRIVE, SHELTON, CT 06484-6212, US
Assignee: International Business Machines Corporation
Family ID: 38088633
Appl. No.: 11/291231
Filed: November 30, 2005
Current U.S. Class: 704/257; 704/E15.021
Current CPC Class: G10L 15/063 20130101; G10L 15/19 20130101; G10L 2015/0631 20130101; G10L 15/183 20130101
Class at Publication: 704/257
International Class: G10L 15/18 20060101 G10L015/18
Claims
1. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations, the speech recognition operations comprising: detecting
at least a target word known to an acoustic vocabulary but unknown
to an embedded grammar of a language model of the speech
recognition system; assigning a language model probability to the
target word; calculating a sum of an acoustic and language model
confidence score for the target word and words already included in
the embedded grammar of the language model; and if the sum of the
acoustic and language model probability for the target word is
greater than the sum of the acoustic and language model probability
for the words already included in the embedded grammar, adding the
target word to the language model.
2. The signal-bearing medium of claim 1 where the operations
further comprise: after calculating the sum and prior to adding the
target word to the embedded grammar of the language model, asking
confirmation of the target word from a user of the speech
recognition system; and receiving confirmation for the target word
from the user of the speech recognition system.
3. The signal-bearing medium of claim 2 wherein confirmation
comprises confirmation of the spelling of the target word.
4. The signal-bearing medium of claim 2 wherein confirmation
comprises confirmation of the pronunciation of the target word.
5. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations, the speech recognition operations comprising: detecting
an utterance having a low acoustic score within an acoustic
vocabulary of the speech recognition system indicating that the
utterance may correspond to an out-of-vocabulary word; generating
at least one new word hypothesis comprised of at least one of a
phone- or syllable sequence using confidence scores derived from
probabilities contained in a database of viable phone and syllable
sequences; and if the at least one new word hypothesis meets a
pre-determined criterion, adding a word corresponding to the at
least one new word hypothesis to the vocabulary of the speech
recognition system.
6. The signal-bearing medium of claim 5 wherein the pre-determined
criterion corresponds to confirmation by a user of the speech
recognition system wherein the operations further comprise: prior
to adding at least one word to the acoustic vocabulary of the
speech recognition system, presenting the new word hypothesis to a
user of the speech recognition system seeking confirmation that the
new word hypothesis corresponds to at least one word intended by
the user when the user spoke; and whereby the new word is added to
the vocabulary of the speech recognition system only if
confirmation is received from the user.
7. The signal-bearing medium of claim 6 wherein the utterance
corresponds to a multi-word command, and wherein the operations
further comprise: adding the command to an embedded grammar of a
language model associated with the speech recognition system.
8. The signal-bearing medium of claim 7 wherein the operations
further comprise: adding information received from a user of the
speech recognition system to memory indicating at least one action
to be performed when the command is detected by the speech
recognition system.
9. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations in a speech recognition system, the speech recognition
operations comprising: detecting an utterance not recognized by at
least a first one of an acoustic vocabulary, embedded grammar, and
viable phone/syllable sequence library of the speech recognition
system; generating at least one hypothesis for the utterance,
wherein the hypothesis is based on information derived from a
second one of an acoustic vocabulary, embedded grammar and viable
phone/syllable sequence library of the speech recognition system;
calculating a confidence score for the at least one hypothesis and
for members of the first one of the acoustic vocabulary, embedded
grammar and viable phone/syllable sequence library of the speech
recognition system; comparing the confidence scores calculated for
the at least one hypothesis and for members of the first one of the
acoustic vocabulary, embedded grammar and viable phone/syllable
sequence library of the speech recognition system; and adding
information to the first one of an acoustic vocabulary, embedded
grammar and viable phone/syllable sequence corresponding to the
hypothesis if a pre-determined criterion based on the comparison is
met.
10. The signal-bearing medium of claim 9 wherein the utterance
corresponds to a phone sequence, and wherein the first one of the
acoustic vocabulary, embedded grammar and viable phone/syllable
sequence library corresponds to a particular viable phone/syllable
sequence library.
11. The signal-bearing medium of claim 9 wherein the utterance
corresponds to a word, and wherein the first one of the acoustic
vocabulary, embedded grammar and viable phone/syllable sequence
library corresponds to a particular acoustic vocabulary.
12. The signal-bearing medium of claim 9 wherein the utterance
corresponds to a command, and wherein the first one of the acoustic
vocabulary, embedded grammar and viable phone/syllable sequence
library corresponds to a particular embedded grammar.
13. The signal-bearing medium of claim 9 wherein the at least one
criterion corresponds to confirmation by a user of the speech
recognition system, wherein the operations further comprise: prior
to adding information corresponding to the at least one hypothesis
to the first one of the acoustic vocabulary, embedded grammar and
viable phone/syllable sequence library of the speech recognition
system, seeking confirmation that the hypothesis corresponds to
what the user intended when the user spoke; and whereby the
information is added only if confirmation is received from the user
of the speech recognition system.
14. The signal-bearing medium of claim 9 wherein the operations
further comprise: using biometric information to assist in
identifying the utterance as unrecognized by the first one of the
acoustic vocabulary, embedded grammar and viable phone/syllable
sequence library of the speech recognition system.
15. The signal-bearing medium of claim 14 wherein the
biometric information comprises speech biometric information.
16. The signal-bearing medium of claim 14 wherein the biometric
information comprises data derived from video information.
17. A speech recognition system comprising: a speech input for
receiving speech from a user of the speech recognition system; an
open set comprised of at least one open vocabulary and at least one
open embedded grammar associated with a language model implemented
in the speech recognition system; a hierarchical mapping system for
identifying utterances not recognized by at least one of the open
vocabulary and open embedded grammar of the speech recognition
system; for generating hypotheses for the unrecognized utterances
using confidence scores based at least in part on one of viable
phone/syllable sequence information, acoustic vocabulary
information and grammar information; and for adding information
corresponding to the hypotheses to at least one of the open
vocabulary and embedded grammar of the speech recognition system if
a pre-determined criterion is met; and a confidence score system
for generating confidence scores for use by the hierarchical
mapping system.
18. The speech recognition system of claim 17 further comprising: a
user behavior biometrics detector for generating data to assist the
hierarchical mapping system in identifying utterances that a user
expects not to be recognized by the speech recognition system.
19. The speech recognition system of claim 17 further comprising: a
confirmation system for providing the hypotheses corresponding to
the unrecognized utterances to a user of the speech recognition
system, and for receiving confirmation from the user if the
hypotheses correspond to what the user intended when the user spoke
the unrecognized utterances.
20. The speech recognition system of claim 17 further comprising: a
user input system for receiving data from the user of the speech
recognition system, wherein the data is associated with the
information corresponding to the hypotheses added to at least one
of the open acoustic vocabulary and open embedded grammar of the
speech recognition system when a pre-determined criterion is
met.
21. The speech recognition system of claim 20 wherein the data
concerns at least one action to be performed.
Description
TECHNICAL FIELD
[0001] The invention concerns methods and apparatus for use in
speech recognition systems and more particularly concerns methods
and apparatus for identifying and assigning meaning to new words
and utterances. The new words and utterances may be known
beforehand, but used in a new way unknown to an embedded grammar of
a language model incorporated in a speech recognition system, or
may be entirely unknown to the speech recognition system.
BACKGROUND
[0002] Speech recognition systems are finding increasing use,
particularly in voice-controlled user interfaces. Voice-controlled
user interfaces are familiar to anyone who performs banking and
credit card transactions by telephone. In the past, telephonic
banking and credit card service transactions were performed either
through interaction with a human agent or by using a keypad of a
telephone; now, with increasing frequency, telephonic banking and
credit card service transactions may be performed using voice
commands.
[0003] Voice-activated user interfaces are also finding increasing
use in portable electronic devices like cellular telephones and
personal digital assistants ("PDAs") with telephonic capabilities.
For example, in cellular telephones with voice-activated user
interface capability, a user can enter a voice command "Call Bob
Smith" in order to initiate a telephone call to a target person
("Bob Smith"). This eliminates the need for the user to enter a
telephone number, or to access a contact list containing the
telephone number, thereby saving keystrokes. The elimination of
keystrokes often enables hands-free modes of operation, which is
particularly advantageous when the telephone call is initiated by
someone operating an automobile. There is increasing pressure to
restrict the operation of cellular telephones by drivers of
automobiles, particularly cellular telephones that require hand
operation.
[0004] Thus, the ability to initiate an operation (e.g., a
telephone call) by issuing a voice command to a voice-controlled
user interface is particularly advantageous because it saves time
and effort previously expended by entering commands using keys or
other hand-operated input devices. This advantage ends, though, as
soon as a user enters a command not recognized by a speech
recognition system associated with a voice-controlled user
interface. In such circumstances, a user is often thrust back to
old, more tedious modes of operation where a command has to be
entered using a combination of keystrokes.
[0005] In such situations, where a cellular telephone user is
seeking to initiate a telephone call, the user would either have to
enter the telephone number directly, or add it to a contact list.
Since users of productivity-enhancement devices like cellular
telephones and PDAs value the ability of these devices to "grow"
with the user by, for example, being able to record and save an
extensive and ever-expanding contact list, the fact that this
ability may only be partially implemented (if at all) through voice
commands is viewed as a particular limitation of voice-activated
user interface systems incorporated in such devices. If a user has
an extensive contact list, the user might not even initiate a
telephone call using the voice command feature, because the user
might forget whether the person to be called is even in the contact
list and thus capable of being recognized by a voice-activated user
interface operating in combination with the contact list.
[0006] A further problem is apparent in this description of the
prior art. In conventional speech recognition systems, the
vocabularies and grammars are fixed. Accordingly, when the user is
thrust back upon a keystroke-mode of operation in order to enter
new commands, the user will have to enter the new commands with
keystrokes every time the new commands are to be performed, since
the vocabularies and grammars are fixed. There is no benefit to the
speech recognition system associated with the user giving meaning
to a command unrecognized by the speech recognition system using
keystrokes, since the information entered using keystrokes does not
modify the capabilities of the speech recognition system.
[0007] Accordingly, those skilled in the art desire speech
recognition systems with the ability to "grow." In particular,
those skilled in the art desire speech recognition systems with the
ability to identify new words previously unknown to the speech
recognition system and to add them to one or more vocabularies and
grammars associated with the speech recognition system. In
addition, those skilled in the art desire voice activated user
interfaces with the ability to learn new commands. Further, when it
is necessary to enter commands using keystrokes, those skilled in
the art seek speech recognition systems that can be re-programmed
through interaction with keys, keyboards, and other command entry
controls of an electronic device, so that the speech recognition
system benefits from the efforts expended in such activities.
SUMMARY OF THE PREFERRED EMBODIMENTS
[0008] The foregoing and other problems are overcome, and other
advantages are realized, in accordance with the following
embodiments of the present invention.
[0009] A first embodiment of the present invention comprises a
signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations, the speech recognition operations comprising: detecting
at least a target word known to an acoustic vocabulary but unknown
to an embedded grammar of a language model of the speech
recognition system; assigning a language model probability to the
target word; calculating a sum of an acoustic and language model
confidence score for the target word and words already included in
the embedded grammar of the language model; and if the sum of the
acoustic and language model probability for the target word is
greater than the sum of the acoustic and language model probability
for the words already included in the embedded grammar, adding the
target word to the language model.
[0010] A second embodiment of the present invention comprises a
signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations, the speech recognition operations comprising: detecting
an utterance having a low acoustic score within an acoustic
vocabulary of the speech recognition system indicating that the
utterance may correspond to an out-of-vocabulary word; generating
at least one new word hypothesis comprised of at least one of a
phone- or syllable sequence using confidence scores derived from
probabilities contained in a database of viable phone and syllable
sequences; and if the at least one new word hypothesis meets a
pre-determined criterion, adding a word corresponding to the at
least one new word hypothesis to the vocabulary of the speech
recognition system.
[0011] A third embodiment of the present invention comprises a
signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform speech recognition
operations in a speech recognition system, the speech recognition
operations comprising: detecting an utterance not recognized by at
least a first one of an acoustic vocabulary, embedded grammar, and
viable phone/syllable sequence library of the speech recognition
system; generating at least one hypothesis for the utterance,
wherein the hypothesis is based on information derived from a
second one of an acoustic vocabulary, embedded grammar and viable
phone/syllable sequence library of the speech recognition system;
calculating a confidence score for the at least one hypothesis and
for members of the first one of the acoustic vocabulary, embedded
grammar and viable phone/syllable sequence library of the speech
recognition system; comparing the confidence scores calculated for
the at least one hypothesis and for members of the first one of the
acoustic vocabulary, embedded grammar and viable phone/syllable
sequence library of the speech recognition system; and adding
information to the first one of an acoustic vocabulary, embedded
grammar and viable phone/syllable sequence corresponding to the
hypothesis if a pre-determined criterion based on the comparison is
met.
[0012] A fourth embodiment of the present invention comprises a
speech recognition system comprising: a speech input for receiving
speech from a user of the speech recognition system; an open set
comprised of at least one open vocabulary and at least one open
embedded grammar associated with a language model implemented in
the speech recognition system; a hierarchical mapping system for
identifying utterances not recognized by at least one of the open
vocabulary and open embedded grammar of the speech recognition
system; for generating hypotheses for the unrecognized utterances
using confidence scores based at least in part on one of viable
phone/syllable sequence information, acoustic vocabulary
information and grammar information; and for adding information
corresponding to the hypotheses to at least one of the open
vocabulary and embedded grammar of the speech recognition system if
a pre-determined criterion is met; and a confidence score system
for generating confidence scores for use by the hierarchical
mapping system.
[0013] In conclusion, the foregoing summary of the alternate
embodiments of the present invention is exemplary and non-limiting.
For example, one of ordinary skill in the art will understand that
one or more aspects or steps from one alternate embodiment can be
combined with one or more aspects or steps from another alternate
embodiment to create a new embodiment within the scope of the
present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The foregoing and other aspects of these teachings are made
more evident in the following Detailed Description of the Preferred
Embodiments, when read in conjunction with the attached Drawing
Figures, wherein:
[0015] FIG. 1 is a block diagram depicting a system embodying
several aspects of the present invention;
[0016] FIG. 2 is a block diagram depicting in greater detail the
hierarchical mapping system of FIG. 1;
[0017] FIG. 3 is a block diagram depicting a phone/syllable mapper
made in accordance with the present invention;
[0018] FIG. 4 is a block diagram depicting a user behavioral
biometrics detector made in accordance with the present invention;
and
[0019] FIG. 5 is a flow chart depicting a method operating in
accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0020] As introduction, an aspect of the present invention will be
described to illustrate problems encountered in the prior art and
how the present invention solves them. Embodiments of the present
invention are generally operative in automated, electronic speech
recognition systems that are used in electronic devices with speech
input capability such as, for example, telephones. The speech
recognition systems typically operate in such electronic devices as
part of a voice-activated user interface. Before the electronic
device can take action in response to a user command, the speech
recognition system has to parse the speech utterance comprising the
command and assign meaning to the speech utterance. In prior art
devices, users are required to operate within relatively narrow
categories of vocabulary and grammar when interacting with a speech
recognition system, because conventional speech recognition systems
are fixed in capability. The speech recognition systems of the
prior art have fixed vocabularies and grammars, meaning that if a
speech utterance is not in a speech recognition system's vocabulary
and grammar, no action or possibly even an incorrect action will be
taken by the voice-activated user interface. This occurs because
the speech utterance is unknown to the speech recognition system
associated with the voice-activated user interface.
[0021] Accordingly, an aspect of the present invention provides a
speech recognition system with open vocabularies and grammars,
allowing the speech recognition system to be programmed with new
words and grammatical constructs (such as, for example, commands)
through interaction with a user. As a result of these interactions,
a voice-activated user interface with which the speech recognition
system is associated can be programmed to perform new actions. To
illustrate the operation of an aspect of the invention an example
will be provided. Assume a user is interacting with a
voice-activated user interface that is incorporated in a telephone,
and speaks a command "Call Morita-san". "Morita" is a Japanese
surname, and "Morita-san" is a way one named "Morita" may be
addressed in Japanese. The speech recognition system is programmed
to recognize the command "Call ______", and also is programmed to
recognize certain names and telephone numbers that are used in
combination with the "Call ______" command. However, in this
particular example, the speech recognition system is initially not
programmed to recognize the name "Morita-san", nor has the user
heretofore uttered the words "Morita-san" in combination with the
command "Call ______". Accordingly, in one aspect of the present
invention, the speech recognition system generates a phonetic
sequence hypothesis for "Morita-san" having a high degree of
probability; presents the hypothesis to the user for confirmation,
including spelling; and after receiving confirmation (and possibly
even a spelling correction) adds the word "Morita-san" to an
embedded grammar associated with the "Call ______" command. In
various embodiments of the invention, additional steps may be
performed. For example, the user may associate a specific telephone
number with the word "Morita-san" as it is being added to the
embedded grammar of the speech recognition system. Once
"Morita-san" has been added to the embedded grammar and the
telephone number has been associated with the new word
"Morita-san", the next time the speech recognition system hears the
command "Call Morita-san" it will automatically call the telephone
number associated with "Morita-san".
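The confirm-and-add flow just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names (EMBEDDED_GRAMMAR, CONTACTS, the hypothesis and confirmation stubs) are assumptions introduced only for this example.

```python
# Hypothetical sketch of growing an embedded grammar after user confirmation.
# EMBEDDED_GRAMMAR and CONTACTS are illustrative stand-ins, not patent structures.

EMBEDDED_GRAMMAR = {"call": {"bob smith"}}   # names valid after the "Call ___" construct
CONTACTS = {"bob smith": "555-0100"}         # name -> associated telephone number

def best_phonetic_hypothesis(utterance):
    # Stand-in for the recognizer's high-probability phone-sequence hypothesis.
    return utterance

def confirm_with_user(hypothesis):
    # Stand-in for presenting the hypothesis (including spelling) for confirmation.
    return True

def handle_call_command(spoken_name, phone_number):
    names = EMBEDDED_GRAMMAR["call"]
    if spoken_name in names:
        return CONTACTS[spoken_name]
    hyp = best_phonetic_hypothesis(spoken_name)
    if confirm_with_user(hyp):
        names.add(hyp)                   # grow the embedded grammar
        CONTACTS[hyp] = phone_number     # associate the number with the new word
    return CONTACTS.get(hyp)

# First use teaches the system; later "Call morita-san" commands resolve directly.
print(handle_call_command("morita-san", "555-0199"))
```

On a later utterance of the same command, the name is found in the grammar and the associated number is returned without any confirmation step.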
[0022] In variants of this embodiment, confidence scores may be
assigned using additional information besides, for example,
phonetic or grammar information. Higher-level models based on
semantic and context information may be used in combination with
phonetic and grammar information to identify unknown words using
confidence scores. For example, regarding context, the speech
recognition system may take into consideration what actions the
user of the speech recognition system had been performing prior to
speaking the unrecognized word. These actions provide context
information which may assist the speech recognition system in
assigning meaning to the unrecognized word.
[0023] In another embodiment of the invention, the speech
recognition system would automatically poll the user of the speech
recognition system to enter by keystrokes the information
associated with the unrecognized command. Assume the user spoke the
same sequence as in the preceding example, "Call Morita-san", and
the system did not recognize either the grammatical construct "Call
______" or the name "Morita-san". In this embodiment of the
invention, the voice-recognition system would ask the user to
illustrate the command by keystrokes and provide the name by
keystrokes. Accordingly, after entry of the illustrative example,
the speech recognition system would then recognize that the "Call
______" construct corresponds to an instruction to perform a
telephone call. In addition, after entry of the name "Morita-san"
(and possibly an associated telephone number), the speech
recognition system would recognize "Morita-san" as someone to be
called at a specific telephone number.
[0024] Further embodiments of the present invention implement
additional features that may be used in combination with the
functionality associated with the foregoing aspects of the present
invention. For example, often a user of a speech recognition system
provides biometric cues identifying when the user is introducing a
new word. The user may slow down her speech to emphasize a word,
may speak more loudly to emphasize a word, or may pause to
emphasize a word. These actions may be used alone or in combination
with physical gestures to emphasize a word. Further embodiments of
the present invention employ audio and visual biometric measuring
systems to help identify when a user of a speech recognition system
is speaking a new word.
[0025] Now further aspects of the present invention, and the
problems they overcome, will be described in greater detail. There
are two typical situations encountered in a speech recognition
system with respect to new words. In a first situation, the speech
recognition system recognizes a word as a valid phonetic sequence
known to at least one acoustic vocabulary of the speech recognition
system. However, the word is used in new way not recognized by an
embedded grammar of a language model incorporated in the speech
recognition system. "Embedded grammar" and "language model" are
constructs used in implementing a speech recognition system; they
refer to the fact that a speech recognition system recognizes and
assigns meaning not only to words, but to combinations of words.
In a voice-activated user interface
incorporating a speech recognition system, "embedded grammar" and
"language model" refer to the functionality of the speech
recognition system that recognizes both responses to queries
initiated by the voice-activated user interface, and to commands
entered by a user of the voice-activated user interface. So in the
first example, a word that is recognized as a valid phonetic
sequence is nonetheless used in such a way that the speech
recognition system cannot assign meaning to the utterance
incorporating the word, since the word is used in a new way. A
typical example would be encountered when a word that is recognized
by a voice-activated user interface as a valid phonetic sequence is
used in a command, wherein the embedded grammar functionality which
ordinarily detects the command is not programmed to recognize and
assign meaning to the command when the command incorporates the new
word. In one aspect of the present invention various methods and
apparatus are provided that enable an embedded grammar of a speech
recognition system to "grow" by adding new words to the embedded
grammar.
[0026] In a more general situation, a sequence of sounds
corresponding to one or more words spoken by a user of a speech
recognition system may be unknown to any vocabulary or language
model of the speech recognition system. In this aspect of the
present invention, various methods and apparatus are provided that
enable a speech recognition system to grow both by adding
previously unknown words to one or more vocabularies of the speech
recognition system, and by adding new grammatical constructs (such
as, for example, new commands) to an embedded grammar of a language
model incorporated in a speech recognition system.
[0027] Embodiments of the present invention responding to the first
circumstance identified above--where a known word is used in a new,
unrecognized context--are handled in the following manner.
Generally, an embedded grammar incorporated in a language model of
a speech recognition system operating in accordance with the
invention is designed to expand by accommodating new uses for words
recognized by other aspects of the speech recognition system (such
as phonetic vocabularies).
[0028] A conventional embedded grammar operates as follows when a
word included in the grammar is spoken: [0029] Construct: {W1} {W2}
[0030] Prepare list of acceptable Li's [0031] L1, L2, . . . are all
list items--part of an embedded grammar [0032] L1, . . . Ln are all
equi-probable (to a first degree of approximation) [0033] For
example, Call <name>, where name may be a list of 50 proper
names [0034] Phrase score for {W1} {W2} {Li} [0035] =Acoustic score
(Li)+Language Model Score (Li|W1W2) As is apparent, a particular
word Li having the highest sum for acoustic score and language
model score is deemed to be the most likely hypothesis for the word
intended by a speaker. No accommodation is made in conventional
methods for words unrecognized by the speech recognition
system.
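The conventional selection among equi-probable list items can be sketched as follows; the names, score values, and helper functions are illustrative assumptions, not part of any disclosed implementation:

```python
import math

def phrase_score(acoustic_score, lm_score):
    """Total score for {W1} {W2} {Li}: acoustic score plus language model score."""
    return acoustic_score + lm_score

def best_list_item(candidates):
    """candidates: (item, acoustic_score, lm_score) tuples.
    Returns the list item Li with the highest combined score."""
    return max(candidates, key=lambda c: phrase_score(c[1], c[2]))[0]

# Example: "Call <name>" with equi-probable names, so the language
# model score (log of 1/n) is identical for every list item and the
# acoustic score alone decides the winner.
names = ["Alice", "Bob", "Carol"]
lm = math.log(1.0 / len(names))
candidates = [("Alice", -12.0, lm), ("Bob", -7.5, lm), ("Carol", -10.2, lm)]
print(best_list_item(candidates))  # the acoustically closest name wins
```

Because no list item is reserved for unknown words, a word outside the list can never be hypothesized, however poorly the list items match acoustically.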
[0036] In contrast, in methods and apparatus of the present
invention, embedded grammars and language models of a speech
recognition system can expand to incorporate words that are
recognized by other aspects of the speech recognition system (such
as, for example, a phonetic vocabulary), but which are not
recognized by a particular embedded grammar as a valid option. A method of the
present invention operates in the following manner:
[0037] (`U` (Word actually spoken) is not in an embedded grammar) Construct: {W1} {W2}
[0038] "Create" an empty list item and assign it a non-zero probability, P{U}<P{Li}
[0039] Word (`U`) recognized by other aspects of the speech recognition system but not by the embedded grammar has a small probability, allowing the grammar room to expand
[0040] For example, "Go to <city not in embedded grammar>"
[0041] P{U}<P{Li}, but
[0042] Acoustic Score (U)+Language Model Score (U)>Acoustic Score (Li)+Language Model Score (Li)
In
this method of the present invention, the sum of the acoustic and
language model scores will favor the word recognized by other
aspects of the speech recognition system (such as a phonetic
vocabulary) but not by the embedded grammar over words that are
recognized by the embedded grammar. This results from the fact that
none of the words initially in the embedded grammar sound like the
word actually spoken. At the same time, the word not in the embedded
grammar is recognized phonetically with a high degree of
probability since the word is in at least one phonetic vocabulary
of the speech recognition system. Accordingly, the speech
recognition system concludes that the most likely hypothesis is
that the speaker intended to use the new word in, for example, the
command spoken, as opposed to any words recognized by the embedded
grammar.
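The scoring just described may be sketched as follows, with all probabilities and scores invented purely for illustration; the function names and values are assumptions rather than the disclosed implementation:

```python
import math

def combined_score(acoustic, lm):
    """Sum of acoustic score and language model score (both in log domain)."""
    return acoustic + lm

def best_hypothesis(in_grammar, oov_word, oov_acoustic, p_oov):
    """in_grammar: list of (word, acoustic_score, log_prior) tuples.
    The out-of-grammar word competes via the small prior p_oov,
    where p_oov is deliberately smaller than any in-grammar prior."""
    scored = [(w, combined_score(a, l)) for w, a, l in in_grammar]
    scored.append((oov_word, combined_score(oov_acoustic, math.log(p_oov))))
    return max(scored, key=lambda s: s[1])[0]

# "Go to <city>": no grammar city sounds like the spoken word, so the
# in-grammar acoustic scores are poor; the phonetically recognized
# word wins despite its smaller prior.
in_grammar = [("Boston", -40.0, math.log(0.05)),
              ("Austin", -38.0, math.log(0.05))]
print(best_hypothesis(in_grammar, "Moscow", -15.0, 0.001))
```

Here the large acoustic advantage of the phonetically recognized word outweighs its small grammar prior, exactly the inequality stated in paragraph [0042].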
[0043] A method operating in accordance with this aspect of the
present invention may be followed by additional steps. For example,
the speech recognition system may synthesize a hypothesis
corresponding to the utterance spoken by the speaker and play it to
the speaker using the word not initially in the embedded grammar
but incorporated in some other vocabulary or grammar of the speech
recognition system. In such an instance the system would seek
confirmation from the speaker that the word is what the speaker
intended. As part of these additional steps, a baseform may be
generated so that pronunciation can be confirmed.
[0044] In the other situation described above where an utterance is
unrecognized by any vocabulary or grammar of a speech recognition
system, the present invention operates on phone sequences to
generate hypotheses for a word or combinations of words spoken by a
user that are unrecognized by the speech recognition system. A
speech recognition system operating in accordance with the present
invention generates a hypothesis and assigns a confidence score to
check if a hypothetical word corresponds to the spoken word with a
high degree of probability. The speech recognition system can seek
confirmation from a speaker to make sure the system reproduced the
correct word. For example, if the speaker spoke the command "Call
Moscow" and the word "Moscow" is not in any vocabulary or grammar
of the speech recognition system, the speech recognition system
would reproduce the sound sequence "moss cow" and compute a
confidence score for the combination of syllables. This aspect of
the present invention operates based on the assumption that it is
possible to understand what a user spoke by identifying sequences
of syllables. In order for the speech recognition system to
implement this aspect of the present invention, the system
incorporates a library that includes possible phones or syllables
that might occur in a user's active vocabulary. In addition, the
system includes decoding graphs indicating how individual phones or
syllables can be combined.
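As a hedged illustration of this aspect, the sketch below segments a phone string against a hypothetical syllable inventory; a practical system would use weighted decoding graphs rather than this simple recursive search:

```python
def segment(phones, inventory):
    """Return one segmentation of `phones` into syllables drawn from
    `inventory`, or None if no complete segmentation exists."""
    if not phones:
        return []
    for i in range(1, len(phones) + 1):
        head = phones[:i]
        if head in inventory:
            rest = segment(phones[i:], inventory)
            if rest is not None:
                return [head] + rest
    return None

# Hypothetical inventory: "Moscow" heard as the syllables "moss" + "cow".
inventory = {"moss", "cow", "mos", "ow"}
print(segment("mosscow", inventory))  # ['moss', 'cow']
```

The partial match "mos" is tried first but fails to cover the remainder, so the search backtracks to the segmentation that spans the whole utterance.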
[0045] In a typical implementation, this second aspect of the
present invention would operate in combination with the first
aspect. For example, in many instances, it would not be necessary
for the system to operate with phone or syllable decoding enabled
at all times, since the user would be speaking words that are
recognized at least by phonetic vocabularies of the speech
recognition system. However, when an utterance is encountered which
is not recognized by any vocabulary or grammar of the speech
recognition system, the phone/syllable decoder of the present
invention would be enabled to assist in decoding of the
utterance.
[0046] Various embodiments of the invention operate to improve the
efficiency of a speech recognition system in identifying new words
based on phonetic methods. For example, in one embodiment a
database of viable phone/syllable sequences and associated
combination probabilities is implemented to assist the speech
recognition system in proposing word or utterance hypotheses with a
high degree of confidence. The combination probabilities may
reflect the likelihood of a two-phone or -syllable sequence, a
three-phone or -syllable sequence, etc. The viable phone/syllable
sequence database can be implemented in many ways in accordance
with the present invention. For example, the viable phone/syllable
sequence database can reflect phone/syllable sequences likely to be
encountered in interactions with a particular user of a speech
recognition system; phone/syllable sequences likely to be
encountered with respect to a set of commands used in combination
with a voice-activated user interface; phone/syllable sequences
likely to be encountered in proper names and surnames;
phone/syllable sequences likely to be encountered in a specific
language; and phone/syllable sequences likely to be encountered in
a subset of languages or all languages.
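One minimal way to realize such a database, assuming invented bigram probabilities for illustration, is a table of two-syllable combination probabilities with a small floor for unseen pairs:

```python
import math

# Hypothetical combination probabilities for syllable pairs, with
# sentence-boundary markers <s> and </s>; values are invented.
bigram_prob = {
    ("<s>", "moss"): 0.02,
    ("moss", "cow"): 0.10,
    ("cow", "</s>"): 0.30,
}

def sequence_log_prob(syllables, bigrams, floor=1e-6):
    """Log-probability of a syllable sequence under a bigram model,
    using a small floor probability for unseen combinations."""
    seq = ["<s>"] + syllables + ["</s>"]
    return sum(math.log(bigrams.get(pair, floor))
               for pair in zip(seq, seq[1:]))

score = sequence_log_prob(["moss", "cow"], bigram_prob)
```

Higher-order tables (three-syllable combinations and beyond) follow the same pattern, trading storage for sharper hypothesis confidence.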
[0047] In further embodiments of the invention, additional
information--such as, for example, speech and body movement
biometric information--is used to identify new words. Apparatus
associated with the speech recognition system detect changes in
speech cadence which may be indicative of a new word. Additional
apparatus associated with the speech recognition system analyze
video data to detect gestures and body movements that may be
indicative of introduction of a new word in the speech of a user of
a speech recognition system.
[0048] FIG. 1 is a block diagram showing a plurality of systems
that selectively may be incorporated in various embodiments of the
present invention. The central system is an hierarchical mapping
system 100 that receives inputs from a plurality of interconnected
systems comprising the speech recognition system 10 of the present
invention. The hierarchical mapping system 100 processes
information received from other blocks and maps user input (such
as, for example, a voice utterance) into a vocabulary subset in a
hierarchical open set 105 of vocabularies. In an example, the
hierarchical mapping process 100 can decode an acoustic utterance
"China" into a word "China" that may belong to one of the system
vocabularies but which is not recognized by a grammar set
associated with a context in which the word "China" appeared. In
the speech recognition system of the present invention, the
hierarchical mapping process 100 adds "China" to the grammar set
associated with the context in which the word "China" appeared and
interprets the utterance (via semantic/context interpreter 120) in
accordance with the context otherwise indicated by the utterance. A
particular advantage of the present invention results
from the fact that open hierarchical set 105 is comprised of open
subsets (grammars and vocabularies)--as a result, these subsets are
dynamic and can be updated with new words and grammatical constructs
in various embodiments of the present invention. Learning module
103 is operable to learn user behavior associated with user
requests (using internet facilities to learn across a plurality of
users) and associate commands to user requests. In one example, a
previously unrecognized command like "Call China" would be
associated with an action to call a specific telephone number after
the speech recognition system learns the word "China" and through
interaction with a user learns to associate the command "Call
China" with the action to call a specific telephone number.
[0049] Confidence score metrics system 104 resolves conflicts
between different words and their membership in different subsets
in the hierarchy. For example, referring back to the "Call China"
example, there may be a word incorporated in a grammar which has a
higher language model score than "China" but which has a lower
acoustic score than "China". The confidence score metrics system
104 operates in such a way as to resolve these conflicts. In various
embodiments of the invention, confidence scores can be assigned for
acoustic models, language models and for semantic models. In
embodiments of the present invention an acoustic score is assigned
for a sequence of phones or syllables via phone/syllable mapper
102. The acoustic representation determined with a high degree of
confidence from this scoring process may not correspond to any
existing word in a set of vocabularies 106, 107, 108 or 109. In
such a situation, if the confidence score block 104 evaluates the
confidence metric for a new phone/syllable sequence as higher than
the score for competitive words--the new sequence of
phones/syllables will be considered as a new word that should be
added to an open vocabulary (e.g., to 109). A meaning for the new
word/phrase is received through one or both of user actions
learning module 103 and semantics/context interpreter block 120.
New commands are also added to a grammar 106 in embodiments of the
present invention. Language model services block 107 provides
language data for sequences: phones, syllables, words and phrases.
This data can be used by the confidence score block 104 to derive
confidence scores. This language data also can be used to compute
language model scores in a decoding process operating within the
hierarchical mapping system 100. User behavior biometric detector
101 provides biometrics data about user behavior (e.g.,
conversational biometrics) that helps to identify whether the
acoustic utterance points to a new word (e.g., hesitation on some
phrases, pauses, speaking stress etc.).
[0050] FIG. 2 is a block diagram depicting in greater detail the
hierarchical mapping system 100 of FIG. 1. The hierarchical mapping
system 100 contains a communications bus 200 through which
different system modules exchange data. Data that enters
hierarchical map system 100 (through bus 200) comprises data
produced by modules previously described with respect to FIG. 1 and
which are connected to 100 (e.g., speech input from 110, phonetic
data in 203 from 102, confidence data in 203 from 104, biometrics
data in 203 from 101 etc.).
[0051] Speech input 201 is directed to the hierarchical speech
recognition system 202. This speech system operates to provide
hierarchical decoding of, for example, phones, syllables, words and
phrases. Hierarchical speech recognition system 202 also produces
data for computation of hierarchical scores in 204.
[0052] Hierarchical score calculator 204 also uses conventional
biometrics information from user biometric detector 101. For
example, if the user hesitates on some acoustic utterance, a score
is added to the confidence score for acoustic information (for
example, as a linear weighted sum). For instance, the duration of
hesitation or the stress value of sounds may be normalized and added
as a weighted sum. Similarly, other scores (semantic, language
model, etc.) are added as a weighted sum in more complex implementations.
The confidence score is computed either for separate words, for
phonetic/syllable sequences, or for membership in some subset (a
grammar, vocabulary etc.) in 205. If a novel sequence of
phones/syllables/phrases is chosen (via the highest confidence
score) it is added by the vocabulary extender 206 to the
appropriate subset.
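The linear weighted combination described above might be sketched as follows; the particular weights and the normalization of the biometric term are assumptions for illustration:

```python
def combined_confidence(acoustic, lm, semantic, hesitation_norm,
                        weights=(0.5, 0.3, 0.1, 0.1)):
    """Linear weighted sum of normalized component scores.
    `hesitation_norm` is a hesitation duration/stress value assumed
    to be scaled into [0, 1] before combination."""
    wa, wl, ws, wb = weights
    return wa * acoustic + wl * lm + ws * semantic + wb * hesitation_norm

# Illustrative normalized component scores for one hypothesis.
conf = combined_confidence(0.8, 0.6, 0.7, 0.9)
```

A strong hesitation term raises the confidence that a genuinely new word was spoken, nudging the system toward adding the sequence to an open subset.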
[0053] FIG. 3 depicts a phone/syllable mapper 102 capable of
operating in accordance with the present invention. Phonetic
decoder 300 is a phonetic/syllable decoder that can be used in an
hierarchical speech recognition system 202 that decodes
phonetically. The phone/syllable decoder 300 uses phone 306 and
syllable vocabularies 309 and phone/syllable language models 308
for decoding. Phone 306 and syllable vocabularies 309 are created
from a database of names 301 (which can include names in different
languages--English 302; Chinese 303; Japanese 304; Indian dialects
305). Other databases for other categories include phone classes
312 and a universal phonetic system 311 (that are applicable to
several or all languages). Language models 308 and phone/syllable
vocabularies 306, 309 are used to create viable phonetic/syllable
sequences 310, which are derived from viable language models stored
in a database or which are created dynamically. These viable
sequences are not yet words in any vocabulary, but have a good
chance of becoming legitimate words to be added to open vocabularies.
[0054] FIG. 4 depicts a user behavioral biometrics detector 101.
User behavioral biometrics detector 101 comprises a speech input
400 and a video input 401. Pause detector 402 operates to detect
pauses in speech; stress volume indicator 403 operates to detect
stresses during speech; and speech speed measure detector 404
operates to detect changes in speech speed. Speech biometrics
interpreter 408 combines information derived from the speech data
by 402, 403 and 404.
[0055] Video data received at input 401 is operated on by head
position detector 405, body movement detector 406, and gesture
detector 407. Head position detector 405 helps to identify whether
a user requested some action from the system by looking at a
device--for example, by looking at a window in a car and asking to
open the window. Information derived by 405, 406 and 407 is
combined by body movements/gesture interpreter 409 to provide a
complete biometrics picture based on user movement.
[0056] FIG. 5 is a flow chart depicting a method of the present
invention. At step 500, a speech recognition system capable of
practicing the methods of the present invention receives an
utterance and processes the utterance. Then, at step 501, the speech
recognition system decodes the acoustic data. Step 502 is a
decision point where the speech recognition system decides whether
the entire acoustic utterance has been decoded. If it has, then at
step 503 the speech recognition system interprets the acoustic
data. At step 506, the speech recognition system reaches another
decision point. At step 506, the speech recognition system decides
whether the entire utterance has been interpreted. If so, at step
507, a command contained in the utterance is executed.
[0057] Returning to step 502, if the entire acoustic utterance
cannot be decoded, the speech recognition system decides whether
the utterance can be decoded in an extended system. If so, it
continues to step 506. If the entire utterance cannot be decoded in
the extended system, the system continues to step 505 which is
another decision point. At step 505, the speech recognition system
determines whether there is additional biometric/context data
available that points to a new word. If so, the speech recognition
system continues to step 520, where user biometric data is
interpreted either implicitly or by asking questions. Then at step
509 the vocabulary is updated. If not, the utterance is interpreted
by interacting with the user.
[0058] One of ordinary skill in the art will understand that the
methods depicted and described herein can be embodied in a tangible
machine-readable memory medium. A computer program fixed in a
machine readable memory medium and embodying a method or methods of
the present invention performs steps of the method or methods when
executed by a digital processing apparatus coupled to the
machine-readable memory medium. Tangible machine-readable memory
media include, but are not limited to, hard drives, CD- or DVD-ROM,
flash memory storage devices, and RAM of a computer system.
[0059] Thus it is seen that the foregoing description has provided
by way of exemplary and non-limiting examples a full and
informative description of the best method and apparatus presently
contemplated by the inventors for implementing a speech recognition
system for identifying, and assigning meaning to, new words and
utterances initially unknown to the speech recognition system. One
skilled in the art will appreciate that the various embodiments
described herein can be practiced individually; in combination with
one or more other embodiments described herein; or in combination
with speech recognition systems differing from those described
herein. Further, one skilled in the art will appreciate that the
present invention can be practiced by other than the described
embodiments; that these described embodiments are presented for the
purposes of illustration and not of limitation; and that the
present invention is therefore limited only by the claims which
follow.
* * * * *