U.S. patent application number 10/227653 was filed with the patent office on September 6, 2002, and published on 2004-12-30, for methods, systems, and programming for performing speech recognition.
Invention is credited to Cohen, Jordan R.; Grabherr, Manfred G.; Johnston, David F.; and Roth, Daniel L.

Application Number: 20040267528 (10/227653)
Family ID: 32684400
Publication Date: 2004-12-30
United States Patent Application: 20040267528
Kind Code: A9
Roth, Daniel L.; et al.
December 30, 2004
Methods, systems, and programming for performing speech
recognition
Abstract
The present invention relates to: speech recognition using
selectable recognition modes; using choice lists in
large-vocabulary speech recognition; enabling users to select word
transformations; speech recognition that automatically turns
recognition off in one or more specified ways; phone key control of
large-vocabulary speech recognition; speech recognition using phone
key alphabetic filtering and spelling; speech recognition that
enables a user to perform re-utterance recognition; the combination
of speech recognition and text-to-speech (TTS) generation; the
combination of speech recognition with handwriting and/or character
recognition; and the combination of large-vocabulary speech
recognition with audio recording and playback.
Inventors: Roth, Daniel L. (Brookline, MA); Cohen, Jordan R. (Gloucester, MA); Johnston, David F. (Arlington, MA); Grabherr, Manfred G. (Medford, MA)
Correspondence Address:
Edward W. Porter
Porter & Associates
Suite 600, One Broadway
Cambridge, MA 02142, US
Prior Publication:
Document Identifier: US 0049388 A1
Publication Date: March 11, 2004
Family ID: 32684400
Appl. No.: 10/227653
Filed: September 6, 2002
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
10227653 | Sep 6, 2002 |
10302053 | Sep 5, 2002 |
60317333 | Sep 5, 2001 |
60317433 | Sep 5, 2001 |
60317431 | Sep 5, 2001 |
60317329 | Sep 5, 2001 |
60317330 | Sep 5, 2001 |
60317331 | Sep 5, 2001 |
60317423 | Sep 5, 2001 |
60317422 | Sep 5, 2001 |
60317421 | Sep 5, 2001 |
60317430 | Sep 5, 2001 |
60317432 | Sep 5, 2001 |
60317435 | Sep 5, 2001 |
60317434 | Sep 5, 2001 |
Current U.S. Class: 704/251
Current CPC Class: G10L 15/19 20130101; G10L 15/22 20130101
Class at Publication: 704/251
International Class: G10L 015/04
Claims
1. A method of speech recognition comprising: providing a user
interface which allows a user to select between generating a first
and a second user input; responding to the generation of the first
user input by performing large vocabulary recognizing on one or
more utterances in a prior language context dependent mode, which
recognizes at least the first word of such recognition depending in
part on a language model context created by a previously recognized
word; and responding to the generation of the second user input by
performing large vocabulary recognizing on one or more utterances
in a prior language context independent mode, which recognizes at
least the first word of such recognition independently of a
language model context created by any previously recognized
word.
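Claim 1's two modes differ only in whether the first word of an utterance is scored against language-model context from previously recognized words. The following is a minimal sketch of that distinction; the bigram and acoustic score tables are invented for illustration and do not come from the application:

```python
# Illustrative only: tiny invented score tables standing in for a real
# large-vocabulary recognizer's acoustic and language models.
BIGRAMS = {("send", "mail"): 0.5, ("send", "male"): 0.1}   # P(word | prior word)
ACOUSTIC = {"mail": 0.35, "male": 0.4}                     # acoustically confusable pair

def score(word, prior_word, use_context):
    acoustic = ACOUSTIC.get(word, 0.0)
    if use_context and prior_word is not None:
        # Prior-language-context-dependent mode (first user input).
        return acoustic * BIGRAMS.get((prior_word, word), 0.01)
    # Prior-language-context-independent mode (second user input).
    return acoustic

def recognize_first_word(candidates, prior_word, use_context):
    return max(candidates, key=lambda w: score(w, prior_word, use_context))
```

With context on, the prior word "send" pulls the decision toward "mail"; with context off, the slightly better acoustic score of "male" wins. That behavioral difference is what the two user inputs select between.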
2. A method as in innovation 1 wherein: the user interface includes
a first button and a second button; the first user input is
generated by pressing the first button; and the second user input
is generated by pressing the second button.
3. A method as in innovation 1 wherein the prior language context
independent mode uses language model context created by the first
and any successively recognized words of an utterance in selecting
the second and successive words, if any, recognized for an
utterance.
4. A method as in innovation 1 further including providing
recognized words output by the prior language context dependent and
independent modes as a text input to another program.
5. A method as in innovation 4 wherein said method is performed by
a software input panel in Microsoft Windows CE.
6. A method of speech recognition comprising: providing a user
interface which allows a user to select between generating a first
and a second user input; responding to the generation of the first
user input by recognizing one or more utterances as one or more
words in a given vocabulary in a continuous speech recognition
mode; and responding to the generation of the second user input by
recognizing one or more utterances as one or more words in the same
given vocabulary in a discrete speech recognition mode.
7. A method as in innovation 6 wherein the given vocabulary is a
large vocabulary.
8. A method as in innovation 6 wherein the given vocabulary is an
alphabetic input vocabulary.
9. A method as in innovation 6 wherein: said user interface allows
a user to select between generating a third and a fourth input
independently from the selection of the first and second input; and
said method further includes responding to said third and fourth
inputs, respectively, by selecting as said given vocabulary a first
vocabulary or a second vocabulary.
10. A method as in innovation 9 wherein said first and second
vocabulary are a large vocabulary of words and an alphabetic input
vocabulary.
11. A method as in innovation 9 wherein said first and second
vocabulary are two different alphabetic input vocabularies.
12. A method as in innovation 6 wherein: the user interface
provided includes a first button and a second button; the first
user input is generated by pressing the first button; and the
second user input is generated by pressing the second button.
13. A method as in innovation 12 wherein: pressing the first and
second buttons causes their respective recognition modes to recognize
from substantially the time of the pressing of such a button until
the next end of utterance is detected; wherein the discrete
recognition is substantially limited to the recognition of one or
more candidates for a single word matching said utterance and the
continuous recognition mode is not so limited.
14. A method as in innovation 6 wherein acoustic models used to
represent words in the discrete recognition mode are different than
the acoustic models used to represent the same words in the
continuous recognition mode.
15. A method of speech recognition comprising: providing a user
interface which allows a user to select between generating a first
and a second user input; responding to the generation of the first
user input by recognizing one or more utterances as one or more
words in a first alphabetic entry vocabulary; and responding to the
generation of the second user input by recognizing one or more
utterances as one or more words in a second alphabetic entry
vocabulary.
16. A method as in innovation 15 wherein: the first alphabetic
entry vocabulary includes the names of each letter of the alphabet
and the second alphabetic entry vocabulary does not; and the second
alphabetic entry vocabulary includes one or more words that start
with each letter of the alphabet and the first alphabetic entry
vocabulary does not.
17. A method as in innovation 15 wherein said user interface
provides a separate button for generating said first and second
inputs.
18. A method as in innovation 17 wherein touching of each of said
buttons turns on recognition in the button's associated alphabetic
entry mode.
19. A method as in innovation 15 wherein said user interface
enables: a user to select a filtering mode in which word choices
for the recognition of a given word are limited to words whose
spelling matches a sequence of one or more characters input by the
user; a user to enter said one or more filtering characters by
voice recognition using either said first or second alphabetic
entry modes; and said first and second inputs select between
whether such recognition of filtering characters is performed using
said first or second alphabetic entry modes.
20. A method of speech recognition comprising: providing a user
interface which allows a user to select between generating a first,
a second, and a third user input; responding to the generation of
the first user input by recognizing one or more utterances as one
or more words in a first, general purpose large vocabulary; and
responding to the generation of the second user input by
recognizing one or more utterances as one or more words in a
second, alphabetic entry vocabulary; and responding to the
generation of the third user input by recognizing one or more
utterances as one or more words in a third vocabulary, which
represents non-spelled text inputs; and sequentially receiving output
from recognition in any of the three vocabularies and
placing that output into a common text.
21. A method as in innovation 20 wherein the third vocabulary is a
digits vocabulary.
22. A method as in innovation 20 wherein the third vocabulary is a
vocabulary of punctuation marks.
23. A method as in innovation 20 wherein the user interface
provides a different button for the selection of each of the first,
second, and third inputs.
24. A method as in innovation 23 wherein pressing the button
associated with one of said three vocabularies turns on recognition
using that vocabulary.
25. A method of performing word recognition comprising: receiving a
word input signal containing non-textual user input representing a
sequence of one or more words; performing word recognition upon the
input signal to produce a choice list of best scoring recognition
candidates, each comprised of a sequence of one or more words
and/or numbers, found by the recognizer to have a relatively high
probability of corresponding to the input signal; producing
user-perceivable output representing a choice list of best scoring
recognition candidates, with the candidates being ordered in said
choice list according to a character ordering of a sequence of
characters corresponding to the one or more words associated with
each candidate in the list; providing a user interface which
enables a user to select one of the character-ordered recognition
candidates from the choice list; responding to user selection of
one of the recognition candidates from the choice list by treating
the selected candidate as the one or more words and/or numbers that
correspond to the word input signal.
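One way to read claims 25 and 26 together: the displayed list is character-ordered, but the single best-scoring candidate may be pinned at a fixed position (here the top) regardless of where it would fall alphabetically. A hedged sketch with invented candidates and scores:

```python
def make_choice_list(candidates):
    """candidates: list of (text, recognition_score) pairs.
    Returns the best-scoring candidate first (a position independent of
    the character ordering, per claim 26), then the remaining candidates
    in alphabetical (character) order, per claim 25."""
    best = max(candidates, key=lambda c: c[1])[0]
    ordered = sorted(text for text, _ in candidates if text != best)
    return [best] + ordered
```

For example, a best-scoring "zebra" stays at the top even though it would sort last alphabetically.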
26. A method as in innovation 25 wherein: the word recognition
selects a best scoring recognition candidate; and the best scoring
candidate is placed in a position in said user-perceivable output
that is independent of where the character sequence corresponding
to the one or more words associated with the best scoring candidate
would, according to said character ordering, fall in the
character-ordered list.
27. A method as in innovation 25 wherein: the word input signal is
a representation of an utterance of a spoken word; and the word
recognition is speech recognition.
28. A method as in innovation 25 wherein the user perceivable
output includes showing a character-ordered list of said best
scoring recognition candidates on a visual display.
29. A method as in innovation 28 wherein: said choice list includes
more recognition candidates than fit on the display at one time;
and the choice list is scrollable, so that a user can select to
move the list relative to the display, so as to see more
recognition candidates on the list than fit on the display at one
time.
30. A method as in innovation 28 wherein: the character-ordered
list is an alphabetically ordered list; and the display of an
individual recognition candidates in the list includes a sequence
of one or more alphabetically spelled words.
31. A method as in innovation 30 wherein: said choice list includes
more recognition candidates than fit on the display at one time;
and the choice list is scrollable, so that a user can select to
move the list relative to the display, so as to see more
recognition candidates on the list than fit on the display at one
time.
32. A method as in innovation 31 wherein: said choice list has
two alphabetically ordered sub-lists; the first sub-list includes the
highest scoring choice candidates that fit on the display at one
time; and the second sub-list includes other best scoring choice
candidates.
33. A method as in innovation 32 wherein the second sub-list has
more candidates than fit on the display at one time.
34. A method as in innovation 30 further including: providing a
user interface that allows the user to select a filtering sequence
of one or more letter-indications after said display of the
character-ordered list of best scoring recognition candidates; and
responding to the selection of said filtering sequence by
generating and showing on said display a new alphabetized choice
list of recognition candidates, which new choice list is limited to
candidates whose sequence of one or more characters start with said
filtering sequence; and providing a user interface that enables a
user to select one of the alphabetized recognition candidates from
the new choice list; responding to a user selection of one of the
recognition candidates in the new choice list by treating the
selected candidate as the one or more words and/or numbers that
correspond to the word input signal.
35. A method as in innovation 34 wherein said responding to the
selection of said filtering sequence by generating and showing a
new alphabetized choice list includes: detecting whether
the number of recognition candidates is below a desired number;
when a detection is made that the number of recognition candidates
is below the desired number, selecting from a vocabulary list one
or more additional candidates that start with the filtering
sequence for inclusion in said new alphabetized choice list.
36. A method as in innovation 35 wherein: said new alphabetized
choice list includes more recognition candidates than fit on the
display at one time; and the choice list is scrollable, so that a
user can select to move the list relative to the display, so as to
see more recognition candidates on the list than fit on the display
at one time.
37. A method as in innovation 34 wherein: the method is performed
on a telephone having a telephone keypad; the user interface that
allows the user to input said letter-indicating inputs allows the
user to enter such inputs by pressing one or more keys of said
telephone keypad, with the pressing of a given telephone pad key
indicating that the corresponding letter in the sequence of one or more
characters associated with a desired recognition candidate is one
of a set of multiple letters associated with the given key; and the
new candidate list is limited to candidates whose sequence of one
or more words start with an initial sequence of letters
corresponding to the sequence of letter-indicating inputs, in which
each letter of the initial sequence of letters corresponds to one
of the set of letters indicated by a corresponding
letter-indicating input in said sequence of letter-indicating
inputs.
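The telephone-keypad filtering of claims 37 and 57 is ambiguous by design: one key press stands for any of the letters printed on that key, so a digit sequence narrows the candidate list without uniquely spelling a prefix. A minimal sketch, assuming the standard ITU keypad letter layout; the candidate words are invented:

```python
# Letters associated with each numbered phone key (standard ITU E.161 layout).
KEY_LETTERS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
               "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}

def matches_keys(word, keys):
    """True if each of the word's first len(keys) letters is among the
    letters associated with the corresponding pressed key."""
    if len(word) < len(keys):
        return False
    return all(ch in KEY_LETTERS[k] for ch, k in zip(word.lower(), keys))

def filter_candidates(candidates, keys):
    """Limit recognition candidates to those whose initial letters are
    consistent with the sequence of letter-indicating key presses."""
    return [w for w in candidates if matches_keys(w, keys)]
```

Pressing "2" alone keeps "cat", "bat", "act", and "car" (all start with a, b, or c), while "227" keeps only "car", since the third press requires a letter from p, q, r, or s.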
38. A method as in innovation 37 wherein: said new choice list
includes more recognition candidates than fit on the display at one
time; and the choice list is scrollable, so that a user can select
to move the list relative to the display, so as to see more
recognition candidates on the list than fit on the display at one
time.
39. A method as in innovation 34 wherein: the user interface that
allows the user to select a sequence of one or more
letter-indications allows a user to select a desired number of
characters from the start of a string of alphabetic characters
contained within a selected one of the recognition candidates
displayed in a choice list; and said user interface responds to
such a selection by using the selected one or more characters as
all or part of said sequence of one or more letter-indications.
40. A method as in innovation 30 further including: providing a
user interface that allows the user to indicate the selection of a
location on a displayed alphabetized choice list between listed
candidates or between a listed candidate and the beginning or end
of the list; and responding to such a selection by redisplaying a
new alphabetized choice list limited to recognition candidates
having spellings between the two candidates or between the
candidates and the beginning or end of the alphabet,
respectively.
41. A method as in innovation 28 wherein: the input signal
represents the utterance of one or more sequential numbers; and the
choice list is a numerically ordered list of recognition candidates
displayed as numbers.
42. A method as in innovation 30 wherein: said input signal
represents the utterance of a phone number; said word recognition
is speech recognition; and said responding to a user selection of a
recognition candidate causes the phone number displayed for the
selected recognition candidate to be automatically dialed.
43. A method as in innovation 28 wherein: the input signal
represents the utterance of one or more names from contact
information; and the choice list represents a plurality of best
scoring names from the contact information, ordered
alphabetically.
44. A method as in innovation 43 wherein: said choice list includes
more recognition candidates than fit on the display at one time;
and the choice list is scrollable, so that a user can select to
move the list relative to the display, so as to see more
recognition candidates on the list than fit on the display at one
time.
45. A method of performing word recognition comprising: receiving a
word input signal containing non-textual user input representing a
sequence of one or more words; performing word recognition upon the
input signal to produce a choice list of best scoring recognition
candidates, each comprised of a sequence of one or more words
and/or numbers, found by the recognizer to have a relatively high
probability of corresponding to the input signal; showing the
choice list in a user scrollable display, with the choice list
having more recognition candidates than fit on the display at one
time so that only a sub-portion of the choice list is displayed at
one time; responding to user input selecting to scroll the choice
list up or down by moving the choice list relative to the display
up or down, respectively, so as to change the portion of the choice
list shown on the display.
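The scrollable choice list of claim 45 is essentially a fixed-size window over a candidate list longer than the display, clamped at both ends. A sketch (the class and method names are illustrative, not from the application):

```python
class ScrollableChoiceList:
    """Shows a fixed-size window onto a candidate list that holds more
    recognition candidates than fit on the display at one time."""
    def __init__(self, candidates, window_size):
        self.candidates = candidates
        self.window = window_size
        self.top = 0  # index of the first visible candidate

    def visible(self):
        """The sub-portion of the choice list currently on the display."""
        return self.candidates[self.top:self.top + self.window]

    def scroll_down(self, n=1):
        # Clamp so the window never moves past the end of the list.
        self.top = min(max(len(self.candidates) - self.window, 0), self.top + n)

    def scroll_up(self, n=1):
        self.top = max(0, self.top - n)
```

Passing n greater than 1 to either method gives the multiple-candidate scroll input of claim 47.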
46. A method as in innovation 45 wherein the word input signal is a
representation of an utterance of a spoken word and the word
recognition is speech recognition.
47. A method as in innovation 45 wherein: said user input selecting
to scroll the choice list up or down includes a multiple-candidate
scroll input; and said responding to user input includes responding
to each multiple-candidate scroll input by moving the choice list
up or down relative to the display by multiple recognition
candidates.
48. A method as in innovation 45 wherein: the method is performed
on a cell phone; and the display is the display of a cell
phone.
49. A method as in innovation 48 wherein: the showing of the choice
list on the cell phone display includes displaying a different number in
association with each recognition candidate in the portion of the
choice list shown on the display at one time; providing a user
interface which enables a user to select one of the recognition
candidates from the choice list by pressing a numbered phone key on
said cell phone corresponding to a desired recognition candidate;
and responding to a user selection of one of the recognition
candidates from the choice list by treating the selected candidate
as the one or more words and/or numbers that correspond to the word
input signal.
50. A method as in innovation 45 wherein: each recognition
candidate has associated with it a character string; and the
recognition candidates in the scrollable choice list are ordered by
the character ordering in which their respective character strings
occur.
51. A method as in innovation 45 wherein the recognition candidates
in the scrollable choice list are ordered by their recognition
score against the word signal.
52. A method as in innovation 45 further including responding to
user input selecting to scroll the choice list right or left by
moving the choice list relative to the display right or left,
respectively, so as to change the portion of individual choices in
the choice list that are shown on the display.
53. A method of performing word recognition comprising: receiving a
word input signal containing non-textual user input representing a
sequence of one or more words; receiving a sequence of one or more
filter input signals, each containing non-textual user input
representing a sequence of one or more characters; responding to
the one or more filter input signals by producing a filter,
representing one or more possible character sequences, each having
one or more characters, found to have possibly corresponded to the
filter input signal; generating a list of recognition candidates
starting with one of the character sequences represented by the
filter, including one or more candidates from word recognition of
the input signal when one or more such word recognition candidates
starting with one of the character sequences represented by the
filter have a recognition probability above a certain minimum
level; producing user-perceivable output representing: said list of
best scoring recognition candidates; and a character sequence
represented by said filter corresponding to the initial characters
of one of the list of best scoring recognition candidates; enabling
a user to select one of the recognition candidates from said list
and/or to select a character from said filter; responding to
selection of one of the recognition candidates from the choice list
by treating the selected candidate as the one or more words that
correspond to the word input signal; responding to selection of a
filter character by displaying a choice list of other characters in
the possible character sequences represented by the filter that
correspond to the selected character's position in the
user-perceivable filter; enabling a user to choose one of the
characters in the character choice list; responding to a choice of
a character in the character choice list by: limiting the possible
character sequences represented by the filter to ones having the
chosen character in the selected character's position; and repeating
said generation of a list of recognition candidates using the
filter as limited by the chosen character.
54. A method as in innovation 53 wherein the limiting of the
possible character sequences represented by the filter includes
limiting such character sequences to ones having the characters, if
any, that occur before the selected character in the
user-perceivable filter.
55. A method as in innovation 53 wherein: said generation of a list
of recognition candidates limits the recognition candidates to
those starting with only a single character sequence represented by
the filter; and the user-perceivable output representing said
candidate list includes said single character sequence as the
user-perceivable filter.
56. A method as in innovation 53 wherein said generation of a list
of recognition candidates limits the recognition candidates to
those starting with any of a plurality of character sequences
represented by the filter.
57. A method as in innovation 53 wherein: the filter input signals
correspond to a sequence of one or more phone key presses, where
each pressed phone key has an associated set of letters; and the
responding to the filter input signals produces a filter
representing one or more sequences of characters, where each such
sequence has one character for each such key press, with each such
character corresponding to one of the set of letters associated
with the corresponding key press.
58. A method as in innovation 53 wherein: the filter input signals
correspond to a sequence of one or more utterances each of a
sequence of one or more letter indications; and the responding to
the filter input signals includes performing speech recognition
upon the sequence of one or more utterances to produce a filter
representing one or more sequences of characters corresponding to
the characters recognized from said utterances.
59. A method of performing word recognition comprising: receiving a
word input signal containing non-textual user input representing a
sequence of one or more words; performing word recognition upon the
input signal to produce a choice list of best scoring recognition
candidates, each comprised of a sequence of one or more words
and/or numbers, found by the recognizer to have a relatively high
probability of corresponding to the input signal; showing the
choice list in a user scrollable display; responding to user input
selecting to scroll the choice list right or left by moving the
choice list relative to the display right or left, respectively, so
as to change the portion of individual choices in the choice list
that are shown on the display.
60. A method as in innovation 59 wherein said method is practiced
on a cell phone and the user input selecting to scroll horizontally
is the pressing of a button or key on the cell phone.
61. A method of performing word recognition comprising: receiving a
word input signal representing one or more words; performing word
recognition upon the signal to produce one or more best scoring
words corresponding to the word input signal; providing a user
interface enabling a user to select from among a plurality of word
transformation commands each having different type of
transformation associated with it; responding to the user's
selection of one of the word transformation commands by
transforming a currently selected word to a corresponding, but
different, word spelled with a different sequence of letters from a
through z using the selected command's associated
transformation.
62. A method as in innovation 61 wherein at least one of the word
transformation commands transforms the currently selected word to a
different grammatical form.
63. A method as in innovation 62 wherein at least one of the word
transformation commands transforms the currently selected word to a
different tense.
64. A method as in innovation 62 wherein at least one of the word
transformation commands transforms the currently selected word to a
plural or singular form.
65. A method as in innovation 62 wherein at least one of the word
transformation commands transforms the currently selected word to a
possessive or non-possessive form.
66. A method as in innovation 61 wherein at least one of the word
transformation commands transforms the currently selected word to a
homonym of the selected word.
67. A method as in innovation 61 wherein at least one of the word
transformation commands transforms the currently selected word by
changing its ending to one of a set of common word endings.
68. A method as in innovation 61 wherein the word recognition
produces a choice list of best scoring recognition candidates, each
comprised of one or more words, found by the recognizer to have a
relatively high probability of corresponding to the word signal;
and the user interface outputs the recognition candidates of the
choice list in user perceivable form; and the user interface
enables a user to select a choice from one of the recognition
candidates output on the choice list and to have a selected
one of the transformation commands performed upon the selected
choice, and to have the resulting transformed word produced as
output of the recognition process.
69. A method as in innovation 61 wherein the word recognition is
speech recognition performed on a telephone; and the user interface
enables a user to select a selected one of the transformation
commands by pressing a phone key.
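Claims 61 through 69 describe a family of commands, each mapping a selected word to a differently spelled relative (plural, tense, possessive, homonym). At its simplest each command is a lookup; the tables below are invented examples, not data from the application, and a real system would use morphology rules rather than fixed tables:

```python
# Hypothetical per-command transformation tables (claim 61's "plurality of
# word transformation commands"). Entries are illustrative only.
TRANSFORMS = {
    "pluralize":  {"word": "words", "box": "boxes"},
    "past_tense": {"walk": "walked", "run": "ran"},
    "possessive": {"user": "user's"},
    "homonym":    {"there": "their", "their": "there"},
}

def transform(word, command):
    """Apply the selected transformation command to the currently
    selected word; leave the word unchanged if no mapping is known."""
    table = TRANSFORMS.get(command, {})
    return table.get(word, word)
```

On a telephone, as in claim 69, each command could simply be bound to a phone key.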
70. A method of performing word recognition comprising: receiving a
word input signal representing one or more words; performing word
recognition upon the signal to produce one or more best scoring
words corresponding to the word input signal; providing a user
interface enabling a user to select from among a plurality of word
transformation commands; responding to the user's selection of one
of the word transformation commands by transforming a currently
selected word between an alphabetic representation and a
non-alphabetic representation.
71. A method as in innovation 70 wherein the word recognition
produces a choice list of best scoring recognition candidates, each
comprised of one or more words, found by the recognizer to have a
relatively high probability of corresponding to the signal; and the
user interface outputs the recognition candidates of the choice
list in user perceivable form; and the user interface enables a
user to select a word from one of the recognition candidates output
on the choice list and to have the transformation for changing
between an alphabetic and a non-alphabetic representation performed
upon that selected word, and to have the resulting transformed word
produced as output of the recognition process.
72. A method of performing word recognition comprising: receiving a
word input signal representing one or more words; performing word
recognition upon the signal to produce one or more best scoring
words corresponding to the word input signal; providing a user
interface enabling a user to select to display a list of
transformations upon a word produced by said recognition;
responding to the user's selection by producing a choice list of
said transformed words corresponding to the recognized word; the
user interface enables a user to select one of the transformed
words in the choice list; and responding to the selection of a
transformed word by having the selected transformed word produced
as output of the recognition process.
73. A method as in innovation 72 wherein: the choice list of
transformed words is shown on a user scrollable display, with the
choice list having more transformed words than fit on the display
at one time so that only a sub-portion of the choice list is
displayed at one time; responding to user input selecting to scroll
the choice list up or down by moving the choice list relative to
the display up or down, respectively, so as to change the portion
of the choice list shown on the display.
74. A method as in innovation 72 wherein the user interface: places
words output by the recognition process into a text; and allows the
user to select from among one or more words in the text the word
for which the transformation choice list is to be produced.
75. A method as in innovation 72 wherein the user interface:
produces a choice list of best scoring word candidates from a word
recognition; and allows the user to select from among one or more
words in the best scoring choice list the word for which the
transformation choice list is to be produced.
76. A method as in innovation 72 wherein the words in the
transformed word list include the one or more homonyms, if any, of
the word for which the transformation choice list is produced.
77. A method as in innovation 72 wherein the words in the
transformed word list include one or more different
representations, if any, of the word for which the transformation
choice list is produced.
78. A method as in innovation 72 wherein the words in the
transformed word list include one or more different grammatical
forms, if any, of the word for which the transformation choice list
is produced.
79. A method of performing word recognition comprising: responding
to a command input from a user to start recognition by: turning
large vocabulary speech recognition on after the receipt of the
command; subsequently automatically turning the large vocabulary
speech recognition off and leaving it off until receiving another
command input from a user to start recognition.
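Claims 79 through 81 amount to push-to-talk with automatic shut-off: recognition starts on an explicit command and turns itself off after a timeout or the first detected end of utterance, staying off until the next start command. A sketch with an injectable clock so the timeout behavior is testable (all names are illustrative):

```python
import time

class PushToTalkRecognizer:
    """Recognition is on only between a start command and an automatic
    shut-off: a lapsed time period (claim 80) or the first detected
    end of utterance (claim 81)."""
    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout = timeout_s
        self.clock = clock
        self.on_since = None  # None means recognition is off

    def start_command(self):
        self.on_since = self.clock()

    def end_of_utterance(self):
        self.on_since = None  # off at the first end of utterance

    def is_listening(self):
        if self.on_since is None:
            return False
        if self.clock() - self.on_since > self.timeout:
            self.on_since = None  # off after the time period lapses
            return False
        return True
```

Injecting the clock is just a testing convenience; on a device the default monotonic clock would be used.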
80. A method as in innovation 79 wherein the turning off of speech
recognition occurs automatically after the lapse of a given
period of time.
81. A method as in innovation 79 wherein the turning off of speech
recognition occurs automatically after the detection of the first
end of utterance after the turning on of the speech
recognition.
82. A method as in innovation 79 wherein the command input which
causes the turning on of speech recognition is a non-acoustic
input.
83. A method as in innovation 82 wherein the speech recognition is
turned off in response to the next end of utterance detection made
by the speech recognition and is left off until the next
non-acoustic user input to start recognition.
84. A method as in innovation 83 wherein the speech recognition is
continuous speech recognition.
85. A method as in innovation 83 wherein the speech recognition is
discrete speech recognition.
86. A method as in innovation 83 further comprising: outputting a
user perceivable representation of the one or more words recognized
as a best choice for the utterance preceding the end of utterance
detection; providing a user interface allowing a user to provide
correction input to correct errors in the best choice output in
response to the recognition of an utterance; responding to receipt
of a start recognition command input after the outputting of the
best choice recognized for an utterance before any correction input
has been received for said best choice by: confirming said best
choice as correct; and repeating said method again for a new
utterance starting with receipt of the start recognition
command.
87. A method as in innovation 86 further including responding to
such a confirmation of an utterance by including one or more of
the recognized words as being part of the current context used to
calculate a language model score for subsequent speech
recognition.
88. A method as in innovation 86 further including responding to
such a confirmation of an utterance by using one or more of the
recognized words as data for altering the language model.
89. A method as in innovation 86 further including responding to
such a confirmation of an utterance as corresponding to a given
recognized word by labeling acoustic data from the utterance for
use in updating one or more acoustic models used in the recognition
of the given recognized word.
90. A method as in innovation 83 further including allowing a user
to select between a first mode in which recognition turns off after
the next end of utterance detected after receiving the non-acoustic
input, and a second mode which does not turn off recognition after
said next end of utterance detection.
91. A method as in innovation 90 wherein, in said second mode,
recognition is automatically turned off in response to a lapse of
time longer than the normal lapse between utterances in
conversation.
92. A method as in innovation 83 wherein: the method is performed
by software running on a handheld computing device; and the
non-acoustic input is the pressing of a button, including a GUI
button.
93. A method as in innovation 92 wherein the handheld computing
device is a cellphone; and the buttons are cellphone buttons.
94. A method as in innovation 83 wherein the method is performed by
software running on a computer which is part of an automotive
vehicle.
95. A method as in innovation 82 wherein the start recognition
command input is the pressing of a hardware or software button; and
the recognition is automatically turned off within less than a
second after the pressing of the button ceases.
96. A method as in innovation 82 wherein: said method provides a
user interface having a plurality of speech mode selection buttons,
each for selecting a different speech recognition mode, available
for selection by the user at one time; and the non-acoustic input
which causes the turning on of speech recognition is the pressing of
one of said buttons; and the method responds to the pressing of a
speech mode button by turning on speech recognition in its
associated mode and subsequently automatically turning off said
recognition.
97. A method as in innovation 96 wherein: the speech recognition
mode associated with one of said buttons is said large vocabulary
recognition; the recognition mode associated with another of said
buttons is a mode which performs recognition with a vocabulary for
alphabetic entry.
98. A method as in innovation 96 wherein: the speech recognition
mode associated with one of said buttons is continuous recognition;
the recognition mode associated with another of said buttons is
discrete recognition.
99. A method as in innovation 96 wherein the handheld computing
device is a cellphone; and the buttons are cellphone buttons.
100. A method of speech recognition comprising: providing a user
interface which provides a button which responds to touch lasting
less than a first duration as a click, and a touch lasting longer
than a second duration as a press; responding to a press by causing
speech recognition to be performed on sound for a duration that
varies as a function of the length of the press; and responding to
a click by causing speech recognition to be performed on sound for
a duration that is independent of the length of the click.
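The click-versus-press distinction of this innovation can be sketched as follows; the thresholds, the fixed listening window, and the function name are illustrative assumptions, not values from the application:

```python
# Illustrative sketch: a short touch is a click, which starts recognition
# for a fixed duration; a long touch is a press, and recognition runs for
# a duration that varies with the press length. Thresholds are hypothetical.

CLICK_MAX_S = 0.3           # touches shorter than this count as clicks
PRESS_MIN_S = 0.5           # touches longer than this count as presses
CLICK_RECOGNITION_S = 60.0  # fixed listening window after a click

def recognition_duration(touch_seconds):
    """Return how long recognition should listen, or None if the touch
    falls between the click and press thresholds (ignored)."""
    if touch_seconds < CLICK_MAX_S:
        return CLICK_RECOGNITION_S   # independent of the click's length
    if touch_seconds > PRESS_MIN_S:
        return touch_seconds         # varies with the press's length
    return None
```

The fixed sixty-second window echoes the "at least one minute" variant of innovation 104; a real implementation would end the click-initiated window at the next detected end of utterance, per innovation 101.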
101. A method as in innovation 100 wherein: said responding to a
click causes speech recognition to be performed on sound received
from substantially the time of the click until the next end of
utterance detected; and said responding to a press causes speech
recognition to be performed on sound received during the period of
the press.
102. A method as in innovation 101 wherein recognition performed in
response to a click is discrete recognition and recognition
performed in response to a press is continuous recognition.
103. A method as in innovation 102 wherein the user interface
allows a user to select between: a mode in which recognition in
response to a click and recognition in response to a press are both
either continuous or discrete; and a mode wherein recognition
performed in response to a click is discrete recognition and
recognition performed in response to a press is continuous.
104. A method as in innovation 100 wherein: said responding to a
click causes speech recognition to be performed on sound received
from substantially the time of the click for a period of at least
one minute; and said responding to a press causes speech
recognition to be performed on sound received during the period of
the press and for not more than one second afterward.
105. A method as in innovation 100 wherein: the user interface has
a plurality of speech mode selection buttons, each for selecting a
different speech recognition mode, available for selection by the
user at one time; the user interface responds to a touch of each of
the mode selection buttons lasting less than a first duration as a
click, and a touch of such a button lasting longer than a second
duration as a press; the method responds to a press of a mode
button by causing speech recognition to be performed in the
button's associated mode on sound for a duration that varies as a
function of the length of the press; and responding to a click of a
mode button by causing speech recognition to be performed in the
button's associated mode on sound for a duration that is
independent of the length of the click.
106. A method as in innovation 105 wherein: the recognition mode
associated with a first of said mode buttons is a mode which
performs recognition with a large vocabulary; and the recognition
mode associated with a second of said mode buttons is a mode which
performs recognition with an alphabetic entry vocabulary.
107. A method as in innovation 105 wherein the speech recognition
mode associated with one of said mode buttons is continuous
recognition; and the recognition mode associated with another of
said mode buttons is discrete recognition.
108. A method as in innovation 105 wherein: the method is practiced
on a cellphone; and numbered cellphone buttons act as said mode
buttons.
109. A computing device that functions as a telephone comprising: a
user perceivable output device; a set of phone keys including at
least a standard twelve key phone key pad; one or more
microprocessors; microprocessor readable memory; a microphone or
audio input from which said telephone can receive electronic
representations of sound; a speaker or audio output for enabling an
electric representation of sound produced in said telephone to be
transduced into a corresponding sound; transmitting and receiving
circuitry; programming recorded in the memory including: telephone
programming having instructions for performing telephone functions
including making and receiving calls; and speech recognition
programming including instructions for: performing large vocabulary
speech recognition upon electronic representations of sound
received from the microphone or audio input; and responding to
presses of one or more of the phone keys to control the operation
of the speech recognition.
110. A computing device as in innovation 109 wherein the device
is a cellphone.
111. A computing device as in innovation 109 wherein the device is
a cordless phone.
112. A computing device as in innovation 109 wherein the device is
a landline phone.
113. A computing device as in innovation 109 wherein the speech
recognition programming includes instructions for: responding to a
given utterance by performing speech recognition to produce a
choice list of best scoring speech recognition candidates each
comprised of one or more words found by the recognizer to have a
relatively high probability of corresponding to the given utterance
or part of an utterance; producing user perceivable output
indicating a plurality of the choice list candidates and
associating a separate phone key with each of such choices; and
responding to a press of a phone key associated with a choice list
candidate by selecting the associated candidate as the output for
the given utterance.
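A minimal sketch of the key-to-candidate association recited in innovations 113 and 114, with hypothetical function names and an assumed set of numbered keys:

```python
# Illustrative sketch: associate each displayed choice-list candidate with
# a numbered phone key, so that a key press selects its candidate.

def build_key_map(candidates, keys=(1, 2, 3, 4, 5, 6)):
    """Associate each candidate with a separate numbered phone key."""
    return {key: cand for key, cand in zip(keys, candidates)}

def select_by_key(key_map, key):
    """A press of an associated key selects that candidate as the output
    for the given utterance; unassociated keys select nothing."""
    return key_map.get(key)
```

Keys left out of the map can retain other recognition functions at the same time, as in innovation 115.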
114. A computing device as in innovation 113 wherein the speech
recognition programming includes instructions for using a plurality
of numbered phone keys as said phone keys associated with choice
list candidates.
115. A computing device as in innovation 114 wherein the speech
recognition programming includes instructions for, at the same time
some of the numbered phone keys are associated with choice list
candidates, using other numbered phone keys for other speech
recognition functions.
116. A computing device as in innovation 113 wherein the speech
recognition programming includes instructions for: operating in a
first mode which responds to presses of each of a set of phone keys
by selecting an associated choice list candidate; and operating in
a second mode which responds to presses of each of the same set of
phone keys as a letter identification input.
117. A computing device as in innovation 116 wherein the speech
recognition programming includes instructions for using said letter
identifications for alphabetic filtering of the choice list.
118. A computing device as in innovation 109 wherein the speech
recognition programming includes instructions for: producing a
recognition output corresponding to a sequence of one or more
recognized words in response to the recognition of a given
utterance; placing the recognition output into a text sequence
previously containing a sequence of zero or more words stored in
the memory at a current cursor location in the text sequence; and
moving the cursor location forward and backward, respectively, in
the text sequence in response to the pressing of different ones out
of the phone keys.
119. A computing device as in innovation 118 wherein the
instructions for moving the current text location include
instructions for moving the current text location forward and
backward a whole word at a time, respectively, in response to the
pressing of one of two phone keys associated with word-at-a-time
motion, one associated with word forward motion and one associated
with word backward motion.
120. A computing device as in innovation 119 wherein the
instructions for moving the current text location forward and
backward a whole word at a time includes instructions for:
responding, under a first condition, to the pressing of the key
associated with word forward or backward motion, respectively, by
selecting the whole word after or before the prior cursor location;
and responding, under a second condition, to the pressing of the
key associated with word forward or backward motion by placing a
non-selection cursor immediately behind or before, respectively,
the prior cursor location; whereby the same two keys can be used to
move a word at a time in text, and either to make the cursor
correspond to the selection of a whole word or a non-selection cursor
before or after a word.
121. A computing device as in innovation 120 wherein said second
condition includes one in which the pressing of one of said
word-at-a-time keys is received as the next input after the
pressing of the other of said two word-at-a-time keys.
122. A computing device as in innovation 118 wherein: the user
perceivable output device is a display; the speech recognition
programming includes instructions for displaying all or a portion
of the text sequence across successive lines on the display; and
the instructions for moving the current text location include
instructions for moving the current text location up a line and
down a line, respectively, in response to the pressing of different
ones of the phone keys.
123. A computing device as in innovation 118 wherein the
instructions for moving the current text location include
instructions for moving the current text location to the start and
to the end of a sequence of words including all or part of the
words in the text sequence, respectively, in response to the
pressing of different ones of the phone keys.
124. A computing device as in innovation 118 wherein the speech
recognition programming includes instructions for: responding to
the press of one phone key by starting an extendable selection at
the current text location; and responding to the pressing of
different ones of the phone keys associated with moving the current
text location forward and backward, respectively, by extending the
selection forward and backward, respectively, by the amount
associated with such keys.
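The extendable selection of innovation 124, combined with the word-at-a-time motion of innovation 119, can be sketched as follows (class and method names are hypothetical, and the text is modeled simply as a list of words):

```python
# Illustrative sketch: one key starts an extendable selection at the
# current text location; word-forward/backward keys then extend it.

class TextCursor:
    def __init__(self, words, position=0):
        self.words = words    # the text sequence, as a word list
        self.pos = position   # current text location (word index)
        self.anchor = None    # selection start, once begun

    def start_selection(self):
        """Key press that starts an extendable selection at the cursor."""
        self.anchor = self.pos

    def word_forward(self):
        """Word-at-a-time forward motion; extends the selection if begun."""
        self.pos = min(self.pos + 1, len(self.words))

    def word_backward(self):
        """Word-at-a-time backward motion; shrinks or extends likewise."""
        self.pos = max(self.pos - 1, 0)

    def selection(self):
        """The words currently covered by the extendable selection."""
        if self.anchor is None:
            return []
        lo, hi = sorted((self.anchor, self.pos))
        return self.words[lo:hi]
```
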
125. A computing device as in innovation 118 wherein the
programming includes instructions for generating an audio output by
a text-to-speech process of one or more words at the current text
location after that current location has been moved in response to
the pressing of one of the phone keys.
126. A computing device as in innovation 118 wherein: the user
perceivable output device is a display; the speech recognition
programming includes instructions for showing on the display one or
more words at the current location after that current location has
been moved in response to the pressing of one of the phone
keys.
127. A computing device as in innovation 109 wherein the speech
recognition programming includes instructions for responding to a
selection of a given one of the phone keys by entering a help mode
which responds to a subsequent phone key press by providing in user
perceivable form an explanation about the function associated with
the subsequently pressed phone key before entering the help
mode.
128. A computing device as in innovation 127 wherein: the
instructions for responding to presses of one or more phone keys to
control operation of speech recognition define a hierarchical
command structure in which a user can navigate and select commands
by a sequence of one or more phone keys; and the instructions for
entering a help mode include instructions for responding to each
key press in a sequence of two or more key presses after entering
said help mode by providing, in user perceivable form, an
explanation about the function the key press would have in a
similar sequence of key presses in the hierarchical command structure
if that key sequence had been entered before entering the help
mode.
129. A computing device as in innovation 109 wherein the speech
recognition programming includes instructions for responding to a
pressing of a first phone key by outputting a user perceivable list
indicating the functions associated with each of a plurality of
individual phone keys at the current time.
130. A computing device as in innovation 129 wherein the user
perceivable output includes the generation of an audio signal
saying the list of function indications.
131. A computing device as in innovation 129 wherein: the phone
keys include said first key and a set of one or more navigation
keys; and the speech recognition programming includes instructions
for operating in a text mode where: the navigation keys allow user
perceivable navigation of recognized text; other phone keys have a
set of functions mapped to them for controlling entry and editing
of said text; and a press of the first key is responded to by
entering command list mode where navigation keys allow user
perceivable navigation of a list of the functions associated with
each of a plurality of phone keys in the text mode.
132. A computing device as in innovation 131 wherein: the command
list mode's user-perceivable list of functions includes the
associations of phone key numbers with a plurality of functions in
the list; and speech recognition programming includes instructions
for responding to pressing of a numbered phone key associated with
a function in said list during operation of the command list mode
by returning to the text mode and selecting its associated
function.
133. A computing device as in innovation 131 wherein: the speech
recognition programming includes instructions for use in the
command list mode for: responding to one or more presses of
navigational keys by moving a function selection relative to the
user-perceivable list of functions; and responding to a press of a
selection phone key by returning to the text mode and selecting its
associated function.
134. A computing device as in innovation 133 wherein the command
list includes functions in addition to those that can be selected
by pressing of phone keys in the text mode, which additional
functions can be selected in the command list mode by said
navigation and selection.
135. A computing device as in innovation 133 wherein: the command
list lists functions that are associated with the navigation keys
in the text mode; said text-mode navigational key functions are
different than those associated with the navigation keys in command
list mode; and the text mode navigational key functions can be
selected in the command list mode by said navigation and
selection.
136. A computing device as in innovation 131 wherein: said phone
keys include a menu key; said programming recorded in the memory
includes instructions for responding to a press of the menu key in
each of a plurality of modes other than said text mode by displaying a
list of functions selectable by phone key that were not selectable
by the same phone keys immediately before the pressing of the menu
key; and said first key used in said text mode to select the
command list mode is the menu key.
137. A computing device as in innovation 109 wherein the speech
recognition programming includes instructions for operating in a
text mode during which: the navigation keys allow user perceivable
navigation of recognized text; and a plurality of the numbered
phone keys function at one time as key mapping keys, each of which
selects a different key mapping mode that maps a different set of
functions to a plurality of said numbered phone keys; whereby a
user can quickly select a desired key mapping from a plurality of
such mappings by pressing a numbered phone key, greatly increasing
the speed with which the user can select one from among a
relatively large number of commands from the text mode.
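The key-mapping scheme of innovation 137 can be sketched as follows; the two mappings and the function names are invented purely for illustration:

```python
# Illustrative sketch: in text mode, several numbered keys each act as a
# key-mapping key; pressing one remaps a different set of functions onto
# the numbered phone keys. Mappings and function names are hypothetical.

KEY_MAPPINGS = {
    1: {2: "capitalize", 3: "delete-word", 4: "undo"},  # editing mapping
    2: {2: "play-audio", 3: "record", 4: "stop"},       # audio mapping
}

def select_mapping(mapping_key):
    """Pressing a key-mapping key activates its set of key functions."""
    return KEY_MAPPINGS.get(mapping_key, {})

def function_for(mapping_key, numbered_key):
    """The function a numbered key would perform in the chosen mapping."""
    return select_mapping(mapping_key).get(numbered_key)
```

A menu mode as in innovation 138 would simply display the active mapping's key-to-function table for navigation.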
138. A computing device as in innovation 137 wherein the speech
recognition programming includes instructions for responding to the
pressing of one of said key mapping keys by entering an associated
menu mode where navigation keys allow user-perceivable navigation
of a menu that indicates the functions associated with each of a
plurality of numbered phone keys in the pressed mapping key's
associated key mapping mode.
139. A method of performing large vocabulary speech recognition
comprising: receiving a filtering sequence of one or more key-press
signals each of which indicates which of a plurality of keys has
been selected by a user, where each of the keys represents two or
more letters; receiving an acoustic representation of a sound;
performing speech recognition upon the acoustic representation
which scores word candidates as a function of the match between the
acoustic representation of the sound and acoustic models of words;
wherein: the scoring of word candidates favors word candidates
containing a sequence of one or more alphabetic characters
corresponding to the filtering sequence of key-press signals, where
a candidate word is considered to contain a character sequence
corresponding to the filtering sequence if each sequential
character in the character sequence corresponds to one of the
letters represented by its corresponding sequential key-press
signal.
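A simplified sketch of the alphabetic filtering in this innovation, using the standard phone keypad letter groupings. As a simplification, filtering here is applied as a hard constraint on already-scored candidates rather than as the score adjustment the claim recites; the function names are hypothetical.

```python
# Illustrative sketch of ambiguous-key alphabetic filtering: each pressed
# key represents several letters, and a candidate word matches if its
# initial letters fall, position by position, in the pressed keys' groups.

KEY_LETTERS = {
    2: "abc", 3: "def", 4: "ghi", 5: "jkl",
    6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz",
}

def matches_filter(word, key_sequence):
    """True if the word contains a character sequence corresponding to
    the filtering sequence of key presses."""
    if len(word) < len(key_sequence):
        return False
    return all(ch in KEY_LETTERS[key]
               for ch, key in zip(word, key_sequence))

def filter_candidates(candidates, key_sequence):
    """Keep (in place of 'favor') candidates consistent with the filter."""
    return [w for w in candidates if matches_filter(w, key_sequence)]
```

Per innovation 143, this filtering could be re-run after each successive key press, progressively narrowing the candidate set.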
140. A method as in innovation 139 further including: responding to
an additional utterance made in association with a given key press
signal in said filtering sequence by performing speech recognition
upon the associated utterance; and responding to the recognition of
the key press's associated utterance as a letter identifying word
by causing the set of letters represented by the key press in the
filtering sequence to be limited to a letter identified by the
recognized letter identifying word.
141. A method as in innovation 140 further including: responding to
a key press signal by displaying in user-perceivable form a set of
words containing one or more words starting with each letter
represented by the pressed key; and favoring the recognition of an
utterance made after the display of the pressed key's associated
letter identifying words as corresponding to one of said displayed
words.
142. A method as in innovation 139 further including providing a
user interface which: outputs a plurality of the word candidates
produced by said speech recognition in a user-perceivable form in a
choice list; and allows a user to select one of the output
candidates as the desired word; and responds to the user
selection of one of the output candidates by selecting it as the
recognized word for the recognition.
143. A method as in innovation 139 wherein said receiving of a
filtering sequence and said performing of speech recognition
favoring candidates containing characters corresponding to the
filter sequence can be performed repeatedly for a given acoustic
representation in response to the receipt of successive key-press
signals in said filtering sequence.
144. A method as in innovation 139 wherein the preferential scoring
of word candidates is performed by selecting from word candidates
previously selected by the recognition process those candidates
that contain a sequence of one or more characters corresponding to the
filtering sequence.
145. A method as in innovation 139 wherein the preferential scoring
of word candidates is performed by performing the speech
recognition upon the acoustic representation a second time during
which word candidates are favored which contain a sequence of one
or more characters corresponding to the received filtering
sequence.
146. A method as in innovation 139 wherein the sequence of key
press signals is received before the initial recognition of the
acoustic representation is complete and the alphabetic favoring of
word candidates is performed during the initial recognition.
147. A method as in innovation 139 wherein the method is performed
by software running on a telephone and the keys are keys of a
telephone keypad.
148. A method as in innovation 147 wherein the telephone is a cell
phone.
149. A method as in innovation 139 wherein the preferential scoring
of word candidates is performed by performing the speech
recognition upon an acoustic representation of a second utterance
of the desired word in which word candidates are favored which
contain a sequence of one or more characters corresponding to the
received filtering sequence.
150. A method as in innovation 149 wherein the preferential scoring
of word candidates is performed by scoring word candidates against
both the original and second utterance of a desired word.
151. A method as in innovation 139 wherein the scoring of word
candidates not only favors word candidates containing a sequence of
one or more alphabetic characters corresponding to the filtering
sequence, but also takes into account language model scores.
152. A method as in innovation 151 wherein the language models used
in conjunction with such filtering sequences in the scoring of word
candidates are context dependent language models.
153. A method of performing large vocabulary speech recognition
comprising: receiving a key-press sequence of one or more telephone
key-press signals, each of which indicates which of a plurality of
keys has been selected by a user; decoding the key-press sequence
by using the number of presses of a given key which occur within a
given time of each other to select which of the multiple letters
associated with the given key is the desired letter; storing the
sequence of one or more letters decoded from said key-press
sequence as an alphabetic filtering sequence; receiving an acoustic
representation of a sound; performing speech recognition upon the
acoustic representation which scores word candidates as a function
of the match between the acoustic representation of the sound and
acoustic models of words; wherein: the scoring of word candidates
favors word candidates containing a sequence of one or more
alphabetic characters corresponding to the letters of said
alphabetic filtering sequence.
154. A method of performing large vocabulary speech recognition to
input a sequence of one or more alphabetic characters comprising:
pressing a sequence of one or more selected phone keys, each of
which represents two or more letters; uttering a corresponding
sequence of one or more letter identifying words; performing speech
recognition upon the utterance of each of the letter identifying
words, with the recognition of each such utterance favoring the
recognition of a letter identifying word identifying one of the
two or more letters represented by the utterance's associated phone
key; and treating the sequence of one or more letters identified by
the letter identifying word associated with each phone key press as
alphabetic input from the user.
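A sketch of this innovation's combination of a key press with a spoken letter identifying word. The word list here uses NATO-alphabet words purely for illustration; per innovation 155, most vocabulary words starting with a letter could serve, and the function names are hypothetical.

```python
# Illustrative sketch: each phone key press narrows the letter-identifying
# vocabulary to words starting with one of that key's letters, so the
# recognized spoken word yields an unambiguous letter.

KEY_LETTERS = {2: "abc", 3: "def", 4: "ghi", 5: "jkl",
               6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz"}

# Hypothetical letter identifying words (one choice among many).
LETTER_WORDS = {"alpha": "a", "bravo": "b", "charlie": "c",
                "delta": "d", "echo": "e", "foxtrot": "f"}

def allowed_words_for_key(key):
    """Favor (here: restrict to) words identifying the key's letters."""
    return {w for w, letter in LETTER_WORDS.items()
            if letter in KEY_LETTERS[key]}

def decode(key, recognized_word):
    """Treat the recognized letter identifying word as alphabetic input;
    a word inconsistent with the pressed key decodes to nothing."""
    if recognized_word in allowed_words_for_key(key):
        return LETTER_WORDS[recognized_word]
    return None
```

The restriction of the recognition vocabulary per key press is what makes a three-way-ambiguous key press plus one short utterance resolve to a single letter.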
155. A method as in innovation 154 wherein: the method is used in
conjunction with a large vocabulary recognition system; and a
majority of the words which start with a given letter in the
vocabulary of the large vocabulary recognition system can be used
as a letter identifying word for the given letter.
156. A method as in innovation 154 wherein: the letter identifying
word associated with each of a majority of letters belongs to a
limited set of five or fewer letter identifying words which start
with that given letter; and the recognition of an utterance of a
letter identifying word favors the recognition of one of the
limited set of letter identifying words identifying one of the two
or more letters represented by the utterance's associated phone
key.
157. A method as in innovation 156 further including: responding to
a key press signal by displaying in user-perceivable form a set of
letter identifying words containing one or more words starting with
each letter represented by the pressed key; and favoring the
recognition of an utterance made after the display of the pressed
key's associated letter identifying words as corresponding to one
of said displayed words.
158. A method as in innovation 156 wherein: the method is performed
on a telephone having a display; and the outputting of the subset
of letter identifying words is performed by displaying such words
on the telephone's display.
159. A method of performing large vocabulary speech recognition on
a device having telephone keys, said method comprising: performing
large vocabulary speech recognition upon one or more utterances to
produce a corresponding output text containing one or more words
which have been recognized by said speech recognition; receiving a
sequence of one or more phone key-press signals and interpreting
said sequence of presses as corresponding to a sequence of one or
more alphabetic characters; and outputting said sequence of one or
more alphabetic characters into said output text.
160. A method as in innovation 159 wherein the device is a
cellphone.
161. A method as in innovation 159 wherein: the sequence of one or
more key-press signals, by itself, is treated by the process as
being ambiguous, in the sense that individual key press signals each
represent two or more letters; and information from sources other
than such key presses is used to select which of the one or more
letters associated with a key press in the sequence is to be
interpreted as corresponding to each such key press.
162. A method as in innovation 161 wherein the information from
sources other than such key presses includes language model
information.
163. A method as in innovation 162 wherein the information from
sources other than such key presses includes context dependent
language model information.
164. A method as in innovation 159: wherein the sequence of one or
more key-press signals, by itself, is treated by the process as
being ambiguous, in the sense that individual key press signals each
represent two or more letters; and further including: outputting a
plurality of the word candidates whose spellings correspond to the
key-press signals in a user-perceivable form in a choice list;
allowing a user to select one of the output candidates as the
desired word; and responding to the user selection of one of the
output candidates by selecting it as the recognized word for the
recognition.
165. A method as in innovation 159 wherein the interpretation of
the sequence of key presses includes decoding the key-press
sequence by using the number of presses of a given key which occur
within a given time of each other to select which of the multiple
letters associated with the given key is the desired letter.
166. A method of speech recognition comprising: receiving an
original utterance of one or more words; performing an original
speech recognition upon the original utterance; producing a user
perceivable output representing one or more sequences of one or
more words selected by the recognition as most likely corresponding
to the utterance; providing a user interface that allows a user to
select to perform a re-utterance recognition upon a part of the
original utterance corresponding to all or a selected part of the
user perceivable output; and responding to a user selection to
perform a re-utterance recognition upon all or a part of the
original utterance by: treating a second utterance received in
association with the selection as a re-utterance of the selected
portion of the original utterance; and performing speech
recognition upon the re-utterance to select one or more sequences
of one or more words considered to most likely match the
re-utterance based on the scoring of the one or more words against
both the re-utterance and the selected portion of the original
utterance.
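The re-utterance recognition of this innovation can be sketched as follows; the score functions are stand-ins for real acoustic and language model scoring, and the function name is hypothetical:

```python
# Illustrative sketch: score each candidate word sequence against both
# the original utterance and the re-utterance, and rank by the combined
# score, so evidence from the two utterances is pooled.

def rescore_with_reutterance(candidates, score_original, score_reutterance):
    """candidates: list of candidate word sequences; the score_* arguments
    are functions mapping a candidate to a higher-is-better match score.
    Returns candidates sorted by combined score, best first."""
    def combined(cand):
        return score_original(cand) + score_reutterance(cand)
    return sorted(candidates, key=combined, reverse=True)
```

A candidate that matches only one of the two utterances well is thus outranked by one that matches both, which is the point of requiring the re-utterance.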
167. A method as in innovation 166 wherein: the original
recognition of the original utterance is by continuous speech
recognition; and the re-utterance is recognized by discrete speech
recognition.
168. A method as in innovation 167 wherein the number of utterances
detected within a re-utterance recognized by discrete recognition is
used to determine the number of words allowable in sequences of one
or more words recognized for the original utterance after the
re-utterance.
169. A method as in innovation 166 wherein both the original
utterance and the re-utterance are recognized by discrete speech
recognition.
170. A method as in innovation 166 wherein both the original
utterance and the re-utterance are recognized by continuous speech
recognition.
171. A method as in innovation 166 wherein the selection of a
sequence of one or more words considered to most likely match both
the re-utterance and the selected portion of the original utterance
is used to update acoustic models with data from the selected
portion of the original utterance.
172. A method as in innovation 166 wherein: the user interface
allows a user to select one or more word filtering inputs, each
indicating that the desired output has certain characteristics, to
be used in conjunction with the re-utterance recognition; and the
process of selecting one or more sequences as most likely
matching both the re-utterance and the original utterance also uses
the selected filtering inputs to favor the selection of any
recognition candidates having the selected characteristics.
173. A method as in innovation 172 wherein the user interface
allows a user to select alphabetic filtering inputs indicating that
the desired output contains a word containing a sequence of one or
more specified letters.
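By way of illustration only, and not as part of the claimed subject matter, the re-utterance rescoring of innovation 166 can be sketched as scoring each candidate word sequence against both the original utterance and the re-utterance and ranking by the combined score. The candidate strings and score values below are hypothetical stand-ins for real acoustic scores:

```python
# Illustrative sketch of re-utterance rescoring (innovation 166): each
# candidate is scored against both utterances, and the summed log-score
# ranks the candidates.

def rescore(candidates, score_original, score_reutterance):
    """Rank candidates by summed log-scores against both utterances."""
    combined = {
        cand: score_original[cand] + score_reutterance[cand]
        for cand in candidates
    }
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical log-probability scores for three candidate outputs:
cands = ["recognize speech", "wreck a nice beach", "recognise speech"]
orig = {"recognize speech": -12.0, "wreck a nice beach": -11.5,
        "recognise speech": -13.0}
reutt = {"recognize speech": -10.0, "wreck a nice beach": -14.0,
         "recognise speech": -11.0}
best = rescore(cands, orig, reutt)[0]
```

Note that the candidate that scored best against the original utterance alone is displaced once the re-utterance evidence is combined.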
174. A computing device for performing large vocabulary speech
recognition comprising: microprocessor readable memory; a microphone
or audio input for providing an electronic signal representing an
utterance to be recognized; a speaker or audio output for enabling
an electric representation of sound produced in said device to be
transduced into a corresponding sound; programming recorded in the
memory including: speech recognition programming including
instructions for performing large vocabulary speech recognition
that responds to the electronic representations of a sequence of
one or more utterances received from the microphone or audio input
by producing a text output corresponding to the one or more words
recognized as corresponding to the utterances; and TTS programming
for providing TTS output to said speaker or audio output saying one
or more words of said text recognized by the speech recognition for
the utterance; and shared speech modeling data stored in the memory
that are used both by said speech recognition programming to
recognize words corresponding to spoken utterances and by said TTS
programming to generate sounds corresponding to the speaking of a
sequence of one or more words.
175. A computing device as in innovation 174 wherein said shared
speech modeling data includes letter to sound rules.
176. A computing device as in innovation 174 wherein said shared
speech modeling data includes a mapping between words and one or
more phonetic spellings for each of at least several thousand
vocabulary words.
177. A computing device as in innovation 176 wherein said mappings
include an indication of the different phonetic spellings
appropriate for certain words when they occur as different parts of
speech.
178. A computing device as in innovation 177 wherein said shared
speech modeling data includes language modeling information
indicating which parts of speech for one or more words are more
likely to occur in a given language context.
179. A computing device as in innovation 174 wherein the device is
a handheld device.
180. A computing device as in innovation 179 wherein the device is
a cell phone.
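By way of illustration only, and not as part of the claimed subject matter, the shared speech-modeling data of innovations 174-176 can be sketched as a single word-to-phoneme lexicon consulted both by the recognizer and by the TTS engine. The tiny lexicon and the phoneme symbols are hypothetical assumptions:

```python
# Illustrative sketch of shared speech-modeling data (innovations
# 174-176): one lexicon mapping words to one or more phonetic
# spellings, used by both the recognizer and the TTS engine.

LEXICON = {
    "read": ["R IY D", "R EH D"],   # multiple phonetic spellings per word
    "hello": ["HH AH L OW"],
}

def pronunciations_for_recognition(word):
    """Recognizer side: all phonetic spellings to score audio against."""
    return LEXICON.get(word, [])

def pronunciation_for_tts(word, variant=0):
    """TTS side: pick one phonetic spelling to synthesize."""
    prons = LEXICON.get(word)
    return prons[variant] if prons else None
```

Because both components read the same table, adding a word once makes it both recognizable and speakable, which is the point of sharing the data.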
181. A computing device for performing large vocabulary speech
recognition comprising: microprocessor readable memory; a microphone
or audio input for providing an electronic signal representing an
utterance to be recognized; a speaker or audio output for enabling
an electric representation of sound produced in said device to be
transduced into a corresponding sound; and programming recorded in
the memory including speech recognition programming including
instructions for: performing large vocabulary speech recognition
upon electronic representations of utterances received from the
microphone or audio input to produce a text output; providing TTS
output to said speaker or audio output saying one or more words of
said text output; recognizing utterances which are voice commands
as commands; and providing TTS or recorded audio output to said
speaker or audio output saying the name of a recognized command.
182. A computing device as in innovation 181 wherein the device is
a handheld device.
183. A computing device as in innovation 182 wherein the device is
a cell phone.
184. A computing device for performing large vocabulary speech
recognition comprising: microprocessor readable memory; a microphone
or audio input for providing an electronic signal representing an
utterance to be recognized; a speaker or audio output for enabling
an electric representation of sound produced in said device to be
transduced into a corresponding sound; and programming recorded in
the memory including speech recognition programming including
instructions for performing large vocabulary speech recognition
that responds to the electronic representation of each of a
sequence of one or more utterances received from the microphone or
audio input by: producing a text output corresponding to one or
more words recognized as corresponding to the utterance; and then
providing TTS output to said speaker or audio output saying one or
more words of said text recognized by the speech recognition for
the utterance.
185. A computing device as in innovation 184 wherein said speech
recognition is discrete speech recognition and said TTS output says
the text word which is recognized in response to each
utterance.
186. A computing device as in innovation 184 wherein said speech
recognition is continuous speech recognition and said TTS output
says the one or more text words recognized in response to each
utterance after the end of the utterance.
187. A computing device as in innovation 184 wherein the device is
a handheld device.
188. A computing device as in innovation 187 wherein the device is
a cell phone.
189. A computing device for performing large vocabulary speech
recognition comprising: microprocessor readable memory; a microphone
or audio input for providing an electronic signal representing an
utterance to be recognized; a speaker or audio output for enabling
an electric representation of sound produced in said device to be
transduced into a corresponding sound; and programming recorded in
the memory including speech recognition programming including
instructions for: performing large vocabulary speech recognition
upon an electronic representation of utterances received from the
microphone or audio input to produce a text output; responding to
text navigation commands by moving a cursor backward and forward in
the one or more words of said text output; and responding to each
movement in response to one of said navigational commands by
providing a TTS output to said speaker or audio output saying one
or more words either starting or ending with the location of the
cursor after said movement.
190. A computing device as in innovation 189 wherein said
programming further includes instructions for responding to a
selection expansion command by: recording the cursor location at
the time the command is received as a selection start; starting a
selection at the selection start; and entering a selection
expansion mode in which the response to one of said navigational
commands further includes causing the selection to extend from the
selection start to the cursor location after the cursor movement
made in response to said navigation command.
191. A computing device as in innovation 190 wherein said
programming further includes instructions for responding to a play
selection command by providing a TTS output to said speaker or
audio output saying the one or more words that are currently in the
selection.
192. A computing device as in innovation 189 wherein said saying of
one or more words starts speaking words of said text starting at
the current cursor position and continues speaking them until an
end of a unit of text larger than a word is reached or until a user
input is received to terminate such playback.
193. A computing device as in innovation 189 wherein the device is
a handheld device.
194. A computing device as in innovation 193 wherein the device is
a cell phone.
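By way of illustration only, and not as part of the claimed subject matter, the spoken cursor feedback of innovations 189-192 can be sketched as a word cursor whose every movement hands the words at its new position to a TTS routine. The class, the TTS stub, and the word limit are hypothetical assumptions:

```python
# Illustrative sketch of TTS feedback on cursor movement (innovations
# 189-192): navigation commands move a word cursor, and the words
# starting at the new cursor position are spoken.

class TextCursor:
    def __init__(self, words):
        self.words = words
        self.pos = 0

    def move(self, delta):
        """Move cursor by delta words; return the word now at the cursor."""
        self.pos = max(0, min(len(self.words) - 1, self.pos + delta))
        return self.words[self.pos]

    def speak_from_cursor(self, tts, limit=5):
        """Say up to `limit` words starting at the cursor (innovation 192
        would instead stop at a larger text unit or on user input)."""
        return tts(" ".join(self.words[self.pos:self.pos + limit]))

spoken = []
cur = TextCursor("the quick brown fox jumps".split())
cur.move(+2)                         # navigation command: forward two words
cur.speak_from_cursor(spoken.append)
```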
195. A computing device for performing large vocabulary speech
recognition comprising: microprocessor readable memory; a microphone
or audio input for providing an electronic signal representing an
utterance to be recognized; a speaker or audio output for enabling
an electric representation of sound produced in said device to be
transduced into a corresponding sound; and programming recorded in
the memory including speech recognition programming including
instructions for: performing large vocabulary speech recognition
upon electronic representations of uttered sound received from the
microphone or audio input to produce a choice list of recognition
candidates, each comprised of a sequence of one or more words,
selected by the recognition as scoring best against said uttered
sound; and providing spoken output to said speaker or audio output
saying the one or more words of one of the recognition candidates
in the choice list.
196. A computing device as in innovation 195 wherein said
programming includes instructions for: responding to choice
navigation commands by moving which of the recognition candidates
in the list of choices is currently selected; and responding to
each movement in response to one of said navigational commands by
providing spoken output saying the one or more words in the
currently selected recognition candidate.
197. A computing device as in innovation 195 wherein: said spoken
output says the words of a plurality of recognition candidates in
said list and contains a spoken indication of a choice input signal
associated with each of said plurality of candidates; and said
programming further includes instructions for responding to receipt
of one of said choice input signals by selecting the associated
recognition candidate as the output for said uttered sound.
198. A computing device as in innovation 197 wherein: said device
has a telephone keypad; said choice input signals include phone key
numbers; and said responding to receipt of one of said choice input
signals includes responding to the pressing of numbered phone keys
as said choice input signals.
199. A computing device as in innovation 197 wherein said spoken
output says the best scoring recognition candidate first.
200. A computing device as in innovation 195 wherein said
programming includes instructions for responding to the receipt of
filtering input by: producing a filtered choice list of recognition
candidates, each comprised of a sequence of one or more words that
agree with said filtering input and which have been selected by the
recognition as scoring best against said uttered sound; and
providing spoken output to said speaker or audio output saying the
one or more words of one of the recognition candidates in the
filtered choice list.
201. A computing device as in innovation 200 wherein said
programming further includes instructions for providing spoken
output saying the current value of the filter.
202. A computing device as in innovation 201 wherein the filtering
input is a sequence of letters and the spoken output says the
letters in the filter sequence.
203. A computing device as in innovation 195 wherein the spoken
output includes the spelling of one or more choices.
204. A computing device as in innovation 195 wherein the device is
a handheld device.
205. A computing device as in innovation 204 wherein the device is
a cell phone.
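By way of illustration only, and not as part of the claimed subject matter, the spoken choice list of innovations 195-202 can be sketched as announcing each candidate together with the phone key that selects it, with an optional letter filter narrowing the list. The candidate words, the TTS stub, and the prefix-matching rule are hypothetical assumptions:

```python
# Illustrative sketch of a spoken choice list with phone-key selection
# and alphabetic filtering (innovations 195-202).

def announce_choices(choices, tts):
    """Say each candidate with the phone-key number that selects it."""
    for i, cand in enumerate(choices, start=1):
        tts(f"press {i} for {cand}")

def filter_choices(choices, letters):
    """Keep candidates containing a word that starts with `letters`."""
    return [c for c in choices
            if any(w.startswith(letters) for w in c.split())]

said = []
choices = ["their", "there", "they're"]
announce_choices(choices, said.append)       # spoken list with key numbers
filtered = filter_choices(choices, "ther")   # filtering input narrows list
```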
206. A method of word recognition comprising: receiving a
handwritten representation of all or a part of a given sequence of
one or more words to be recognized; receiving a spoken
representation of said sequence of one or more words; performing
handwriting recognition upon the handwritten representation and
speech recognition upon the spoken representation; and selecting
one or more best scoring recognition candidates, each comprised of
a sequence of one or more words, based on the scoring of
recognition candidates against both the handwritten and spoken
representations.
207. A method of word recognition comprising: receiving a spoken
representation of a given sequence of one or more words to be
recognized; receiving a filtering input consisting of handwriting
or character drawing input; using handwriting or character
recognition, respectively, to define a filter representing one or
more sequences of characters selected by said recognition as most
likely corresponding to said filtering input; and using a
combination of said filter and speech recognition performed on said
spoken representation to select one or more recognition candidates,
each consisting of a sequence of one or more words, selected as a
function of the closeness of their match against the spoken
representation and whether or not they match one of the one or more
character sequences associated with said filter.
208. A method as in innovation 207 wherein said filtering input
consists of handwriting.
209. A method as in innovation 208 wherein: said filter represents
a plurality of sequences of characters; and said selection of
recognition candidates selects a plurality of best scoring
recognition candidates, different ones of which can match different
sequences of characters represented by said filter.
210. A method as in innovation 209 wherein said plurality of
character sequences represented by one filter and used in said
selection of recognition candidates can be of different character
length.
211. A method as in innovation 208 wherein: said filter represents
only one sequence of characters, which is used for filtering;
and said selection of recognition candidates selects a plurality of
best scoring recognition candidates, all of which match said one
character sequence.
212. A method as in innovation 207 wherein said filtering input
consists of one or more separate character drawings.
213. A method as in innovation 212 wherein: said filter represents
a plurality of sequences of characters; and said selection of
recognition candidates selects a plurality of best scoring
recognition candidates, different ones of which can match different
sequences of characters represented by said filter.
214. A method as in innovation 212 wherein: said filter represents
only one sequence of characters, which is used for filtering;
and said selection of recognition candidates selects a plurality of
best scoring recognition candidates, all of which match said one
character sequence.
215. A method as in innovation 207: further including: receiving a
spoken representation of a second sequence of one or more words to
be recognized; using speech recognition to output a corresponding
sequence of one or more words into a sequential body of text;
responding to user input with a pointing device that touches a
sequence of one or more words in said body of text by selecting the
touched sequence as a sequence to be corrected; treating the
portion of the spoken representation of said second sequence of
words as said given sequence of words; and then receiving said
filtering input; using said handwriting or character recognition to
define said filter; and using said combination of the filter and
speech recognition to select one or more recognition
candidates.
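By way of illustration only, and not as part of the claimed subject matter, the filtering of innovations 207-214 can be sketched as follows: character recognition of the filtering input yields one or more candidate character sequences, which then constrain which speech-recognition candidates may be selected. The candidate words, scores, and the prefix-matching rule are hypothetical assumptions:

```python
# Illustrative sketch of innovation 207: an ambiguous character-
# recognition filter (here two possible sequences) constrains the
# speech recognizer's candidate list, and surviving candidates are
# ranked by their speech scores.

def select_candidates(speech_scores, filter_sequences, n_best=2):
    """Rank speech candidates, keeping only those whose text starts
    with one of the character sequences in the filter."""
    allowed = [
        (cand, score) for cand, score in speech_scores.items()
        if any(cand.startswith(seq) for seq in filter_sequences)
    ]
    allowed.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in allowed[:n_best]]

# Hypothetical speech log-scores for four candidate words:
speech_scores = {"meet": -4.0, "meat": -4.5, "mete": -6.0, "net": -3.5}
# Character recognition of the handwritten filter was ambiguous
# between "me" and "ne" (innovation 209: a filter representing a
# plurality of character sequences):
filter_sequences = ["me", "ne"]
```

Note that, as in innovation 209, different selected candidates may match different character sequences represented by the same filter.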
216. A method of word recognition comprising: receiving a
handwritten representation of a given sequence of one or more words
to be recognized; receiving a filtering input consisting of one or
more utterances representing a sequence of one or more letter
identifying words; using speech recognition to define a filter
representing one or more sequences of characters selected by said
recognition as most likely corresponding to said filtering input;
and using a combination of said filter and handwriting recognition
performed on said handwritten representation to select one or more
recognition candidates, each consisting of a sequence of one or
more words, selected as a function of the closeness of their match
against the handwritten representation and whether or not they
match one of the one or more character sequences associated with said
filter.
217. A method as in innovation 216 wherein: the filtering input is
a sequence of continuously spoken letter identifying words; and the
speech recognition is continuous speech recognition.
218. A method as in innovation 216 wherein: the filtering input is
a sequence of discretely spoken letter identifying words; and the
speech recognition is discrete speech recognition.
219. A method as in innovation 216 wherein: said filter represents
a plurality of sequences of characters; and said selection of
recognition candidates selects a plurality of best scoring
recognition candidates, different ones of which can match different
sequences of characters represented by said filter.
220. A method as in innovation 219 wherein said plurality of
character sequences represented by one filter and used in said
selection of recognition candidates can be of different character
length.
221. A method as in innovation 220 wherein: the filtering input is
a sequence of continuously spoken letter names; and the speech
recognition is continuous speech recognition.
222. A method as in innovation 216 wherein: said filter represents
only one sequence of characters, which is used for filtering;
and said selection of recognition candidates selects a plurality of
best scoring recognition candidates, all of which match said one
character sequence.
223. A method as in innovation 216 further including providing a
user interface which enables a user to select whether the filtering
input is recognized with discrete or continuous recognition.
224. A method as in innovation 216 further including providing a
user interface which enables a user to select whether the filtering
input is recognized in a mode which favors the recognition of
letter names or of non-letter name letter identifying words.
225. A method of word recognition comprising: receiving a
handwritten representation of a given sequence of one or more words
to be recognized; performing handwriting recognition upon said
handwritten representation to produce one or more best scoring
recognition candidates, each of which contains one or more words
selected as having a likelihood of corresponding to the one or more
words of said handwritten representation; then receiving a spoken
representation of a given sequence of one or more words to be
recognized; performing speech recognition upon said spoken
representation to produce one or more best scoring recognition
candidates, each of which contains one or more words selected as
having a likelihood of corresponding to the one or more words of
said spoken representation; using information in one of said speech
recognition's best scoring candidates to correct the prior
recognition of said handwritten representation.
226. A method as in innovation 225 wherein said using of speech
recognition information to correct handwriting recognition includes
replacing a best scoring recognition candidate produced by the
handwriting recognition with a best scoring recognition candidate
produced by the speech recognition.
227. A method as in innovation 225 wherein said using of speech
recognition information to correct handwriting recognition includes
interpreting one of the recognition candidates produced by the
speech recognition as a command, and performing said command in
correction of a best scoring recognition candidate produced by the
handwriting recognition.
228. A hand held computing device for performing large vocabulary
speech recognition comprising: one or more processing devices;
memory readable by the processing devices; a microphone or audio
input for providing an electronic signal representing a sound
input; a speaker or audio output for enabling an electric
representation of sound produced in said device to be transduced
into a corresponding sound; programming recorded in one or more of
the memory devices including: speech recognition programming for
performing large vocabulary speech recognition that responds to the
electronic representations of the sound of a sequence of one or
more utterances received from the microphone or audio input by
producing a text output corresponding to the one or more words
recognized as corresponding to the utterances; and audio recording
programming for recording an electronically readable representation
of said sound in one or more of said memory devices; and audio
playback programming for playing back said recorded sound
representation and providing a corresponding audio signal to said
speaker or audio output; wherein the device's programming has
instructions for enabling a user to select between two of the
following three possible modes of recording sound input as it is
received: a first mode that places text output in response to
speech recognition of said sound input into a user navigable
document at a current cursor location, without a representation of
a recording of said sound input; a second mode that places a
representation of a recording of said sound input into said user
navigable document at said cursor without text output in response
to speech recognition of said sound input; and a third mode that places text
output in response to speech recognition of said sound input into
the user navigable document at the current cursor location, with
the words of the text output themselves representing the portions of a
recording of the sound input from which each such word has been
recognized; and wherein the audio playback programming includes
instructions for enabling a user to select to play recorded sound
represented by the sound representations placed in the document by
the second and third recording modes by having the cursor located
on such representations when in a playback mode.
229. A device as in innovation 228 wherein the device's programming
includes instructions for enabling a user to switch back and forth
between the second mode and either the first or third mode with
less than one second's delay for each switch.
230. A device as in innovation 228 wherein the device's programming
further includes instructions for enabling a user to select a
portion of audio recorded without corresponding recognition to have
speech recognition performed on the selected portion of audio
recording so as to produce a text output corresponding to the
selected audio.
231. A device as in innovation 228 wherein the device's programming
further includes instructions for enabling a user to select a
sub-portion of text output by speech recognition in the third mode
that has recorded sound associated with its words and to have the
recorded sound associated with the selected text removed.
232. A device as in innovation 228 wherein the device's programming
further includes instructions for enabling a user to select a
sub-portion of text output by speech recognition in the third mode
that has recorded sound associated with its words and to have the
selected text removed and to replace its location in the document
with the type of representation of the recorded sound produced by
recording in the second mode.
233. A device as in innovation 228 wherein the representations of
sound placed in the document by the second recording mode are
audiographic representations that vary in length as a function of
the duration of the respective portions of recorded sound they
represent.
234. A computing device as in innovation 228 wherein the device is
a handheld device.
235. A computing device as in innovation 234 wherein the device is
a cell phone.
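By way of illustration only, and not as part of the claimed subject matter, the three recording modes of innovation 228 can be sketched as producing three kinds of document element: text only, an audio marker only, or text whose words are each linked to the audio they were recognized from. The element shapes and names below are hypothetical assumptions:

```python
# Illustrative sketch of the three recording modes of innovation 228.

def record(document, mode, words, audio_ref):
    """Append the representation that the selected mode produces."""
    if mode == 1:                      # text without a sound recording
        document.extend({"text": w} for w in words)
    elif mode == 2:                    # sound recording without text
        document.append({"audio": audio_ref})
    elif mode == 3:                    # text, each word linked to audio
        document.extend({"text": w, "audio": audio_ref} for w in words)
    return document

def playable_at(element):
    """Playback can play elements that carry an audio reference."""
    return "audio" in element

doc = []
record(doc, 1, ["hello"], "clip0")    # first mode: text only
record(doc, 2, [], "clip1")           # second mode: audio marker only
record(doc, 3, ["world"], "clip2")    # third mode: text linked to audio
```

Placing the cursor on an element for which `playable_at` is true corresponds to selecting sound recorded by the second or third mode for playback.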
236. A hand held computing device for performing large vocabulary
speech recognition comprising: one or more processing devices;
memory readable by the processing devices; a microphone or audio
input for providing an electronic signal representing a sound
input; a speaker or audio output for enabling an electric
representation of sound produced in said device to be transduced
into a corresponding sound; programming recorded in one or more of
the memory devices including: speech recognition programming for
performing large vocabulary speech recognition that responds to the
electronic representations of the sound of a sequence of one or
more utterances received from the microphone or audio input by
producing a text output corresponding to the one or more words
recognized as corresponding to the utterances; and audio recording
programming for recording an electronically readable representation
of said sound in one or more of said memory devices; and audio
playback programming for playing back said recorded sound
representation and providing a corresponding audio signal to said
speaker or audio output; wherein the device's programming further
includes instructions for enabling a user to select a portion of
audio recorded without corresponding recognition and to have speech
recognition performed on the selected portion of audio recording so
as to produce a text output corresponding to the selected
audio.
237. A hand held computing device for performing large vocabulary
speech recognition comprising: one or more processing devices;
memory readable by the processing devices; a microphone or audio
input for providing an electronic signal representing a sound
input; a speaker or audio output for enabling an electric
representation of sound produced in said device to be transduced
into a corresponding sound; programming recorded in one or more of
the memory devices including: speech recognition programming for
performing large vocabulary speech recognition that responds to the
electronic representations of the sound of a sequence of one or
more utterances received from the microphone or audio input by
producing a text output corresponding to the one or more words
recognized as corresponding to the utterances; and audio recording
programming for recording an electronically readable representation
of said sound in one or more of said memory devices; and audio
playback programming for playing back said recorded sound
representation and providing a corresponding audio signal to said
speaker or audio output; wherein said device's programming further
includes instructions for: enabling a user to associate recorded
portions of text output by said speech recognition with portions of
the recorded sound representation that have not previously been
labeled by voice; enabling a user to select to cause text output by
said speech recognition to be used as a text search string; and
performing a search for recorded text output that matches the
search string; whereby the user can select to find a portion of
recorded sound representation by searching for its associated
recorded text.
238. A computing device for performing large vocabulary speech
recognition comprising: one or more processing devices; memory
readable by the processing devices; a microphone or audio input for
providing an electronic signal representing a sound input; a
speaker or audio output for enabling an electric representation of
sound produced in said device to be transduced into a corresponding
sound; programming recorded in one or more of the memory devices
including: speech recognition programming for performing large
vocabulary speech recognition that responds to the electronic
representations of the sound of a sequence of one or more
utterances received from the microphone or audio input by
producing a text output corresponding to the one or more words
recognized as corresponding to the utterances; and audio recording
programming for recording an electronically readable representation
of said sound in one or more of said memory devices; audio playback
programming for playing back said recorded sound representation and
providing a corresponding audio signal to said speaker or audio
output; and instructions for switching back and forth between said
audio playback and said speech recognition with one user input
causing each such switch, with successive audio playbacks starting
slightly before the end of the prior playback.
239. A computing device as in innovation 238 wherein said
instructions for switching back and forth between said audio
playback and said speech recognition make both such switches in
response to a user selection of the same input device.
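By way of illustration only, and not as part of the claimed subject matter, innovations 238-239 can be sketched as a single input that toggles between playback and recognition, with each new playback starting slightly before where the previous one stopped. The class name, the overlap value, and the time units are hypothetical assumptions:

```python
# Illustrative sketch of innovations 238-239: one input toggles between
# audio playback and speech recognition; successive playbacks start
# slightly before the end of the prior playback. Times in seconds.

OVERLAP = 2.0  # assumed rewind, so resumed playback repeats some context

class PlaybackRecognizeToggle:
    def __init__(self):
        self.mode = "recognize"
        self.play_pos = 0.0   # where the last playback stopped

    def toggle(self):
        """A single user input switches modes (innovation 239).
        Returns the playback start time when entering playback."""
        if self.mode == "recognize":
            self.mode = "playback"
            # start slightly before the end of the prior playback
            return max(0.0, self.play_pos - OVERLAP)
        self.mode = "recognize"
        return None

    def stop_playback_at(self, t):
        self.play_pos = t

t = PlaybackRecognizeToggle()
start1 = t.toggle()          # enter playback from the beginning
t.stop_playback_at(10.0)     # playback stops at 10.0 s
t.toggle()                   # back to recognition (dictate a correction)
start2 = t.toggle()          # resume playback slightly before 10.0 s
```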
240. A computing device that functions as a cell phone comprising:
a user perceivable output device; a set of phone keys including at
least a standard twelve key phone key pad; one or more processing
devices; memory readable by the processing devices; a microphone or
audio input from which said telephone can receive electronic
representations of sound; a speaker or audio output for enabling an
electric representation of sound produced in said device to be
transduced into a corresponding sound; transmitting and receiving
circuitry; programming recorded in the memory including: telephone
programming having instructions for performing telephone functions
including making and receiving calls; and speech recognition
programming for performing large vocabulary speech recognition that
responds to the electronic representations of the sound of a
sequence of one or more utterances received from the microphone or
audio input by producing a text output corresponding to the
one or more words recognized as corresponding to the utterances;
and audio recording programming for recording an electronically
readable representation of said sound in one or more of said memory
devices; audio playback programming for playing back said recorded
sound representation and providing a corresponding audio signal to
said speaker or audio output.
241. A computing device as in innovation 240 wherein said playback
programming includes instructions for: enabling a user to select a
sub-portion of said recorded sound representation; and enabling a
user to select to play a selected sub-portion of said sound
representation to the other side of a cellular telephone call.
242. A computing device as in innovation 240 wherein said recording
programming includes instructions for: enabling a user to select to
record an electronically readable representation of one or both
sides of a cellular phone conversation.
243. A computing device as in innovation 240 wherein the device's
programming further includes instructions for enabling a user to
associate recorded portions of text output by said speech
recognition with portions of the recorded sound representation that
have not previously been labeled by voice.
244. A computing device as in innovation 243 wherein the device's
programming further includes instructions for: enabling a user to
select to cause text output by said speech recognition to be used
as a text search string; and performing a search for recorded text
output corresponding to said search string; whereby said user can
select to find a portion of recorded sound representation by
searching for its associated recorded text.
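By way of illustration only, and not as part of the claimed subject matter, the labeling and search of innovations 243-244 can be sketched as associating recognized text with spans of the recording and then locating a span by a recognized search string. The data shapes and names are hypothetical assumptions:

```python
# Illustrative sketch of innovations 243-244: spans of a sound
# recording are labeled with recognized text, and a search string
# (itself produced by speech recognition) locates a labeled span.

def label(labels, text, start, end):
    """Associate recognized text with a span of the recording (seconds)."""
    labels.append({"text": text, "start": start, "end": end})

def find_recording(labels, query):
    """Return the (start, end) of the first span whose label matches."""
    for entry in labels:
        if query.lower() in entry["text"].lower():
            return (entry["start"], entry["end"])
    return None

labels = []
label(labels, "budget meeting notes", 0.0, 42.5)
label(labels, "call with Alice", 42.5, 90.0)
span = find_recording(labels, "alice")   # locate recording by its label
```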
245. A computing device as in innovation 240 wherein the device's
programming further includes instructions for enabling a user to
select a sub-portion of said recorded sound representation which
had not previously been recognized and to have said large
vocabulary speech recognition performed upon said selected
sub-portion.
246. A computing device as in innovation 245 wherein: said speech
recognition programming includes instructions for performing speech
recognition at different levels of quality, with the higher quality
recognition taking more time to recognize a given length of sound;
and said instructions for enabling a user to select to have speech
recognition performed on a selected sub-portion of recorded sound
includes instructions for enabling the selected recorded sound to
be recognized at said higher quality.
247. A computing device as in innovation 245 wherein said speech
recognition programming includes instructions for: marking the time
alignment between individual recognized words in text output by
said speech recognition and the portions of the recorded sound
associated with each recognized word in said text; and enabling a
user to select a sequence of one or more words and to have the
recorded sound associated with those words played back.
248. A computing device as in innovation 240 wherein the device's
programming further includes instructions for switching back and
forth between audio playback and speech recognition, with
successive audio playbacks starting slightly before the end of the
prior playback.
Description
RELATED APPLICATIONS
[0001] This application is a child application of, and claims the
priority of, a parent application, U.S. patent application Ser. No.
10/302,053, entitled "Methods, Systems, and Programming For
Performing Speech Recognition", filed on Sep. 5, 2002 by Daniel L.
Roth et al., and through this parent application, the present
application claims the priority of the following United States
provisional applications, all of which were filed on Sep. 5, 2001,
and all of which were referenced in a priority claim contained in
the parent application:
[0002] U.S. Provisional Patent App. No. 60/317,333, entitled
"Systems, Methods, and Programming For Speech Recognition Using
Selectable Recognition Modes" by Daniel L. Roth et al.
[0003] U.S. Provisional Patent App. No. 60/317,433, entitled
"Systems, Methods, and Programming For Speech Recognition Using
Automatic Recognition Turn Off" by Daniel L. Roth et al.
[0004] U.S. Provisional Patent App. No. 60/317,431, entitled
"Systems, Methods, and Programming For Speech Recognition Using
Ambiguous Or Phone Key Spelling And/Or Filtering" by Daniel L. Roth
et al.
[0005] U.S. Provisional Patent App. No. 60/317,329, entitled
"Systems, Methods, and Programming For Phone Key Control Of Speech
Recogonition" by Daniel L. Roth et al.
[0006] U.S. Provisional Patent App. No. 60/317,330, entitled
"Systems, Methods, and Programming For Word Recognition Using
Choice Lists" by Daniel L. Roth et al.
[0007] U.S. Provisional Patent App. No. 60/317,331, entitled
"Systems, Methods, and Programming For Word Recognition Using Word
Transformation Commands" by Daniel L. Roth et al.
[0008] U.S. Provisional Patent App. No. 60/317,423, entitled
"Systems, Methods, and Programming For Word Recognition Using
Filtering Commands" by Daniel L. Roth et al.
[0009] U.S. Provisional Patent App. No. 60/317,422, entitled
"Systems, Methods, and Programming For Speech Recognition Using
Phonetic Models" by Daniel L. Roth et al.
[0010] U.S. Provisional Patent App. No. 60/317,421, entitled
"Systems, Methods, and Programming For Large Vocabulary Speech
Recognition In Handheld Computing Devices" by Daniel L. Roth et
al.
[0011] U.S. Provisional Patent App. No. 60/317,430, entitled
"Systems, Methods, and Programming For Combined Speech And
Handwriting Recognition" by Daniel L. Roth et al.
[0012] U.S. Provisional Patent App. No. 60/317,432, entitled
"Systems, Methods, and Programming For Performing Re-Utterance
Recognition" by Daniel L. Roth et al.
[0013] U.S. Provisional Patent App. No. 60/317,435, entitled
"Systems, Methods, and Programming For Combined Speech Recognition
And Text-To-Speech Generation" by Daniel L. Roth et al.
[0014] U.S. Provisional Patent App. No. 60/317,434 entitled
"Systems, Methods, and Programming For Sound Recording" by Daniel
L. Roth et al.
FIELD OF THE INVENTION
[0015] The present invention relates to methods, systems, and
programming for performing speech recognition.
BACKGROUND OF THE INVENTION
[0016] Discrete large-vocabulary speech recognition systems have
been available for use on desktop personal computers for
approximately 10 years by the time of the writing of this patent
application. Continuous large-vocabulary speech recognition systems
have been available for use on such computers for approximately
five years by this time. Such speech recognition systems have
proven to be of considerable worth. In fact, much of the text of
the present patent application is being prepared by the use of a
large vocabulary continuous speech recognition system.
[0017] As used in this specification and the claims that follow,
when we refer to a large-vocabulary speech recognition system, we
mean one that has the ability to recognize a given utterance as
being any one of at least two thousand different vocabulary words,
depending upon which of those words has corresponding phonetic
models that most closely match the given spoken word.
[0018] As indicated by FIG. 1, large-vocabulary speech recognition
typically functions by having a user 100 speak into a microphone
102, which in the example of FIG. 1 is a microphone of a cellular
telephone 104. The microphone transduces the variation in air
pressure over time caused by the utterance of words into a
corresponding waveform represented by an electronic signal 106. In
many speech recognition systems this waveform signal is converted
by digital signal processing performed either by a computer
processor or by a special digital signal processor 108, into a time
domain representation. Often the time domain representation
comprises a plurality of parameter frames 112, each of which
represents properties of the sound represented by the waveform 106
at each of a plurality of successive time periods, such as every
one-hundredth of a second.
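By way of a rough, hypothetical sketch (not part of the disclosed embodiments), the framing step described above can be expressed as follows. The frame length, rate, and the single energy parameter are assumptions chosen only for illustration; actual recognizers compute richer spectral parameters per frame.

```python
# Illustrative sketch: slice a digitized waveform into successive
# parameter frames, one per hundredth of a second. Each frame here
# holds a single assumed parameter, average energy.

def waveform_to_frames(samples, sample_rate, frame_seconds=0.01):
    """Return one parameter frame per frame_seconds of audio."""
    frame_len = int(sample_rate * frame_seconds)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        window = samples[start:start + frame_len]
        energy = sum(s * s for s in window) / frame_len
        frames.append({"energy": energy})
    return frames
```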
[0019] As indicated in FIG. 2, the time domain, or frame,
representation of an utterance to be recognized is then matched
against a plurality of possible sequences of phonetic models 200
corresponding to different words in a large vocabulary. In most
large-vocabulary speech recognition systems, individual words 202
are each represented by a corresponding phonetic spelling 204,
similar to the phonetic spellings found in most dictionaries. Each
phoneme in a phonetic spelling has one or more phonetic models 200
associated with it. In many systems the models 200 are
phoneme-in-context models, which model the sound of their
associated phoneme when it occurs in the context of the preceding
and following phoneme in a given word's phonetic spelling. The
phonetic models are commonly composed of a sequence of one or
more probability models, each of which represents the probability
of different parameter values for each of the parameters used in
the frames of the time domain representation 110 of an utterance to
be recognized.
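The matching of parameter frames against phonetic models can be sketched hypothetically as follows. This simplification, offered only for illustration, reduces each phoneme model to a single Gaussian over one assumed parameter and divides a word's frames evenly among its models as a crude stand-in for true time alignment.

```python
import math

# Illustrative sketch of scoring parameter frames against the
# phonetic model sequences of vocabulary words; the best-scoring
# word is selected as the recognition result.

def log_prob(value, mean, var=1.0):
    """Log probability of a parameter value under a Gaussian model."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def score_word(frames, models):
    """Divide frames evenly among models and sum their log probabilities."""
    per_model = max(1, len(frames) // len(models))
    total = 0.0
    for i, frame in enumerate(frames):
        model = models[min(i // per_model, len(models) - 1)]
        total += log_prob(frame, model)
    return total

def recognize(frames, vocabulary):
    """Return the word whose phonetic model sequence best matches."""
    return max(vocabulary, key=lambda w: score_word(frames, vocabulary[w]))
```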
[0020] One of the major trends in personal computing in recent
years has been the increased use of smaller and often more portable
computing devices.
[0021] Originally most personal computing was performed upon
desktop computers of the general type represented by FIG. 3. Then
there was an increase in usage of even smaller personal computers
in the form of laptop computers, which are not shown in the
drawings because laptop computers have roughly the same type of
computational capabilities and user interface as desktop computers.
Most current large-vocabulary speech recognition systems have been
designed for use on such systems.
[0022] Recently there has been an increase in the use of new types
of computers such as the tablet computer shown in FIG. 4, the
personal digital assistant computer shown in FIG. 5, cell phones
which have increased computing power, shown in FIG. 6, wrist phone
computers represented in FIG. 7, and a wearable computer which
provides a user interface with a screen and eyetracking and/or
audio output provided from a head wearable device as indicated in
FIG. 8.
[0023] Because of recent increases in computing power, such new
types of devices can have computational power equal to that of the
first desktops on which discrete large vocabulary recognition
systems were provided and, in some cases, as much computational
power as was provided on desktop computers that first ran large
vocabulary continuous speech recognition. The computational
capacities of such smaller and/or more portable personal computers
will only grow as time goes by.
[0024] One of the more important challenges involved in providing
effective large-vocabulary speech recognition on ever more portable
computers is that of providing a user interface that makes it
easier and faster to create, edit, and use speech recognition on
such devices.
SUMMARY OF THE INVENTION
[0025] One aspect of the present invention relates to speech
recognition using selectable recognition modes. This includes
innovations such as: allowing a user to select between recognition
modes with and without language context; allowing a user to select
between continuous and discrete large-vocabulary speech recognition
modes; allowing a user to select between at least two different
alphabetic entry speech recognition modes; and allowing a user to
select from among four or more of the following recognition modes
when creating text: a large-vocabulary mode, a letters recognizing
mode, a numbers recognizing mode, and a punctuation recognizing
mode.
[0026] Another aspect of the invention relates to using choice
lists in large-vocabulary speech recognition. This includes
innovations such as: providing character-ordered choice lists;
providing vertically scrollable choice lists; providing
horizontally scrollable choice lists; and providing choice lists on
characters in an alphabetic filter used to limit recognition
candidates.
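One hypothetical way to build such a character-ordered, vertically scrollable choice list is sketched below; the page size and the simple case-insensitive ordering are assumptions for illustration only.

```python
# Illustrative sketch: order recognition candidates alphabetically
# and split them into pages that a small screen could scroll
# through vertically.

def choice_list_pages(candidates, page_size=4):
    """Return alphabetically ordered candidates in scrollable pages."""
    ordered = sorted(set(candidates), key=str.lower)
    return [ordered[i:i + page_size]
            for i in range(0, len(ordered), page_size)]
```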
[0027] Another aspect of the invention relates to enabling users to
select word transformations. This includes innovations such as
enabling a user to choose one from a plurality of transformations
to perform upon a recognized word so as to change it in a desired
way, such as to change from singular to plural, to give the word a
gerund form, etc. It also includes innovations such as enabling a
user to select to transform a selected word between an alphabetic
and non-alphabetic form. It also includes innovations such as
providing a user with a choice list of transformed words
corresponding to a recognized word and allowing the user to select
one of the transformed words as output.
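A hypothetical sketch of offering such transformations is shown below; the particular inflection rules and transformation names are assumptions for illustration, not the disclosed method.

```python
# Illustrative sketch: produce a choice list of transformed forms
# of a recognized word, from which a user could select one as
# output. The rules below are simplified assumptions.

def transformations(word):
    """Return a small assumed set of alternate forms of a word."""
    return {
        "plural": word + ("es" if word.endswith("s") else "s"),
        "gerund": (word[:-1] if word.endswith("e") else word) + "ing",
        "capitalized": word.capitalize(),
    }
```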
[0028] Another aspect of the invention relates to speech
recognition that automatically turns recognition off in one or more
specified ways. This includes innovations such as a
large-vocabulary speech recognition command that turns on
recognition and then automatically turns such recognition off until
receiving another command to turn recognition back on. It also
includes the innovation of speech recognition in which pressing a
button causes recognition for a duration determined by the length
of time of such a press, and in which clicking the same button
causes recognition for a length of time independent of the length
of such a click.
[0029] Another aspect of the invention relates to phone key control
of large-vocabulary speech recognition. This includes the
innovations of using phone keys to select a word from a choice
list; of using them to select a help mode that provides explanation
about a subsequently pressed key; and of using them to select a
list of functions currently associated with phone keys. It also
includes the innovation of speech recognition having a text
navigation mode in which multiple numbered phone keys concurrently
have multiple different key mappings associated with them, and the
pressing of such a key causes the functions associated with the
numbered phone keys to change to the mapping associated with the
pressed key.
[0030] Another aspect of the invention relates to speech
recognition using phone key alphabetic filtering and spelling. By
alphabetic filtering we mean favoring the speech recognition of
words including a sequence of letters, normally an initial sequence
of letters, corresponding to a sequence of letters indicated by
user input. This aspect of the invention includes the innovation of
using as filtering input the pressing of phone keys, where each key
press is ambiguous in that it indicates that a corresponding
character location in a desired word corresponds to one of a
plurality of letters identified with that phone key. This aspect of
the invention also includes the innovation of using as filtering
input a sequence of phone key presses in which the number of zero
or more repeated presses of a given key provides a non-ambiguous
indication of which of multiple letters associated with the key are
intended for use in the filter. This aspect of the invention also
includes the innovation of using such ambiguous and non-ambiguous
phone key input for spelling text that can be used in addition to
text produced by speech recognition.
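The ambiguous phone key filtering described above can be sketched hypothetically as follows, using the standard keypad mapping of digits to letters; the vocabulary and matching rule are simplified assumptions for illustration.

```python
# Illustrative sketch of ambiguous phone-key alphabetic filtering:
# each pressed digit stands for any of its associated letters, and
# the filter keeps only vocabulary words whose initial letters are
# consistent with the key sequence.

KEY_LETTERS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

def matches_keys(word, keys):
    """True if the word's initial letters fit the ambiguous key presses."""
    if len(word) < len(keys):
        return False
    return all(word[i] in KEY_LETTERS[k] for i, k in enumerate(keys))

def filter_vocabulary(vocabulary, keys):
    """Keep only words consistent with the pressed key sequence."""
    return [w for w in vocabulary if matches_keys(w, keys)]
```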
[0031] Another aspect of the invention relates to speech
recognition that enables a user to perform re-utterance
recognition, in which speech recognition is performed upon both a
second saying of a sequence of one or more words and upon an earlier
saying of the same sequence to help the speech recognition better
select one or more best scoring text sequences for the
utterances.
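One hypothetical way to combine recognition scores from the two sayings is sketched below; the equal weighting of the two utterances and the log-probability-style scores are assumptions for illustration only.

```python
# Illustrative sketch of re-utterance recognition: scores from a
# first and a second saying of the same words are combined so that
# candidates scoring well on both utterances are preferred.

def combine_scores(first_scores, second_scores):
    """Sum per-candidate scores from both utterances, best first."""
    shared = set(first_scores) & set(second_scores)
    combined = {c: first_scores[c] + second_scores[c] for c in shared}
    return sorted(combined, key=combined.get, reverse=True)
```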
[0032] Another aspect of the invention relates to the combination
of speech recognition and text-to-speech (TTS) generation. This
includes the innovation of having speech recognition and TTS
software sharing resources such as phonetic spellings and
letter-to-sound rules. It also includes the innovation of a large
vocabulary speech recognition system that has at least one mode
which automatically uses TTS to say recognized text after its
recognition and uses TTS or recorded audio to say the names of
recognized commands after their recognition. This aspect of the
invention also includes the innovation of a large vocabulary system
that automatically repeats recognized text using TTS after each
utterance. This aspect also includes the innovation of a large
vocabulary system that enables a user to move back or forward in
recognized text, with one or more words at the current location
after each such move being said by TTS. This aspect also includes
the innovation of a large vocabulary system that uses speech
recognition to produce a choice list and provides TTS output of one
or more of that list's choices.
[0033] Another aspect of the invention relates to the combination
of speech recognition with handwriting and/or character
recognition. This includes the innovation of selecting one or more
best-scoring recognition candidates as a function of recognition of
both handwritten and spoken representations of a sequence of one or
more words to be recognized. It also includes the innovation of
using character or handwriting recognition of one or more letters
to alphabetically filter speech recognition of one or more words.
It also includes the innovations of using speech recognition of one
or more letter-identifying words to alphabetically filter
handwriting recognition, and of using speech recognition to correct
handwriting recognition of one or more words.
[0034] Another aspect of the invention relates to the combination
of large-vocabulary speech recognition with audio recording and
playback. It includes the innovation of a handheld device with both
large-vocabulary speech recognition and audio recording in which
users can switch between at least two of the following modes of
recording sound input: one which records audio without
corresponding speech recognition output; one that records audio
with corresponding speech recognition output; and one that records
the audio's speech recognition output without corresponding audio.
This aspect of the invention also includes the innovation of a
handheld device that has both large-vocabulary speech recognition
and audio recording capability and that enables a user to select a
portion of previously recorded sound and to have speech recognition
performed upon it. It also includes the innovation of a
large-vocabulary speech recognition system that enables a user to
use large-vocabulary speech recognition to provide a text label for
a portion of sound that is recorded without corresponding speech
recognition output, and the innovation of a system that enables a
user to search for a text label associated with portions of
unrecognized recorded sound by uttering the label's words,
recognizing the utterance, and searching for text containing those
words. This aspect of the invention also includes the innovation of
a large vocabulary system that allows users to switch between
playing back previously recorded audio and performing speech
recognition with a single input, with successive audio playbacks
automatically starting slightly before the end of prior playback.
This aspect of the invention also includes the innovation of a cell
phone that has both large vocabulary speech recognition and audio
recording and playback capabilities.
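The overlapping-playback behavior described above can be sketched hypothetically as follows; the 1.5 second overlap is an assumed value chosen only for illustration.

```python
# Illustrative sketch: each successive playback starts slightly
# before the point where the prior playback ended, so the user
# hears a little surrounding context when switching back from
# speech recognition to audio playback.

OVERLAP = 1.5  # assumed: seconds of prior audio replayed for context

def next_playback_start(prior_end_seconds):
    """Return where the next playback should begin, clamped at zero."""
    return max(0.0, prior_end_seconds - OVERLAP)
```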
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] These and other aspects of the present invention will become
more evident upon reading the following description of the
preferred embodiment in conjunction with the accompanying
drawings:
[0036] FIG. 1 is a schematic illustration of how spoken sound can
be converted into acoustic parameter frames for use by speech
recognition software.
[0037] FIG. 2 is a schematic illustration of how speech recognition,
using phonetic spellings, can be used to recognize words
represented by a sequence of parameter frames such as those shown
in FIG. 1, and how the time alignment between phonetic models of
the word can be used to time align those words against the original
acoustic signal from which the parameter frames have been
derived.
[0038] FIGS. 3 through 8 show a progression of different types of
computing platforms upon which many aspects of the present
invention can be used, and illustrate the trend toward smaller
and/or more portable computing devices.
[0039] FIG. 9 illustrates a personal digital assistant, or PDA,
device having a touch screen displaying a software input panel, or
SIP, embodying many aspects of the present invention, that allows
entry by speech recognition of text into application programs
running on such a device.
[0040] FIG. 10 is a highly schematic illustration of many of the
hardware and software components that can be found in a PDA of the
type shown in FIG. 9.
[0041] FIG. 11 is a blowup of the screen image shown in FIG. 9,
used to point out many of the specific elements of the speech
recognition SIP shown in FIG. 9.
[0042] FIG. 12 is similar to FIG. 11 except that it also
illustrates a correction window produced by the speech recognition
SIP and many of its graphical user interface elements.
[0043] FIGS. 13 through 17 provide a highly simplified pseudocode
description of the responses that the speech recognition SIP makes
to various inputs, particularly inputs received from its graphical
user interface.
[0044] FIG. 18 is a highly simplified pseudocode description of the
recognition duration logic used to determine the length of time for
which speech recognition is turned on in response to the pressing
of one or more user interface buttons, either in the speech
recognition SIP shown in FIG. 9 or in the cellphone embodiment
shown starting at FIG. 59.
[0045] FIG. 19 is a highly simplified pseudocode description of a
help mode that enables a user to see a description of the function
associated with each element of the speech recognition SIP of FIG.
9 merely by touching it.
[0046] FIGS. 20 and 21 are screen images produced by the help mode
described in FIG. 19.
[0047] FIG. 22 is a highly simplified pseudocode description of a
display Choice List routine used in various forms by both the
speech recognition SIP of FIG. 9 and the cellphone embodiment of
FIG. 59 to display correction windows.
[0048] FIG. 23 is a highly simplified pseudocode description of the
get Choices routine used in various forms by both the speech
recognition SIP and the cellphone embodiment to generate one or
more choice lists for use by the display choice list routine of FIG.
22.
[0049] FIGS. 24 and 25 illustrate the utterance list data structure
used by the get Choices routine of FIG. 23.
[0050] FIG. 26 is a highly simplified pseudocode description of a
filter Match routine used by the get Choices routine to limit
correction window choices to match filtering input, if any, entered
by a user.
[0051] FIG. 27 is a highly simplified pseudocode description of a
word Form List routine used in various forms by both the speech
recognition SIP and the cellphone embodiment to generate a word
form correction list that displays alternate forms of a given word
or selection.
[0052] FIGS. 28 and 29 provide a highly simplified pseudocode
description of a filter Edit routine used in various forms by both
the speech recognition SIP and cellphone embodiment to edit a
filter string used by the filter Match routine of FIG. 26 in
response to alphabetic filtering information input from a user.
[0053] FIG. 30 provides a highly simplified pseudocode description
of a filter Character Choice routine used in various forms by both
the speech recognition SIP and cellphone embodiment to display
choice lists for individual characters of a filter string.
[0054] FIGS. 31 through 35 illustrate a sequence of interactions
between a user and the speech recognition SIP, in which the user
enters and corrects the recognition of words using a one-at-a-time
discrete speech recognition method.
[0055] FIG. 36 shows how a user of the SIP can correct a
mis-recognition shown at the end of FIG. 35 by scrolling through
the choice list provided in the correction window until finding a
desired word and then using a capitalization button to capitalize it
before entering it into text.
[0056] FIG. 37 shows how a user of the SIP can correct such a
mis-recognition by selecting part of an alternate choice in the
correction window and using it as a filter for selecting the
desired speech recognition output.
[0057] FIG. 38 shows how a user of the SIP can select two
successive alphabetically ordered alternate choices in the
correction window to cause the speech recognizer's output to be
limited to output starting with a sequence of characters located
between the two selected choices in the alphabet.
[0058] FIG. 39 illustrates how a user of the SIP can use the speech
recognition of letter names to input filtering characters and how a
filter character choice list can be used to correct errors in the
recognition of such filtering characters.
[0059] FIG. 40 illustrates how a user of the SIP recognizer can
enter one or more characters of a filter string using the
international communication alphabet and how the SIP interface can
show the user the words of that alphabet.
[0060] FIG. 41 shows how a user can select an initial sequence of
characters from an alternate choice in the correction window and
then use the international communication alphabet to add characters to
that sequence so as to complete the spelling of a desired
output.
[0061] FIGS. 42 and 43 illustrate a sequence of user
interactions in which the user enters and edits text into the SIP
using continuous speech recognition.
[0062] FIG. 45 illustrates how the user can correct a
mis-recognition by spelling all or part of the desired output using
continuous letter name recognition as an ambiguous (or multivalued)
filter, and how the user can use filter character choice lists to
rapidly correct errors produced in such continuous letter name
recognition.
[0063] FIG. 46 illustrates how the speech recognition SIP also
enables a user to input characters by drawn character
recognition.
[0064] FIG. 47 is a highly simplified pseudocode description of a
character recognition mode used by the SIP when performing drawn
character recognition of the type shown in FIG. 46.
[0065] FIG. 48 illustrates how the speech recognition SIP lets a
user input text using handwriting recognition.
[0066] FIG. 49 is a highly simplified pseudocode description of the
handwriting recognition mode used by the SIP when performing
handwriting recognition of the type shown in FIG. 48.
[0067] FIG. 50 illustrates how the speech recognition system
enables a user to input text with a software keyboard.
[0068] FIG. 51 illustrates a filter entry mode menu that can be
selected to choose from different methods of entering filtering
information, including speech recognition, character recognition,
handwriting recognition, and software keyboard input.
[0069] FIGS. 52 through 54 illustrate how either character
recognition, handwriting recognition, or software keyboard input
can be used to filter speech recognition choices produced in the
SIP's correction window.
[0070] FIGS. 55 and 56 illustrate how the SIP allows speech
recognition of words or filtering characters to be used to correct
handwriting recognition input.
[0071] FIG. 58 is a highly simplified description of an alternate
embodiment of the display choice list routine of FIG. 22 in which
the choice list produced orders choices only by recognition score,
rather than by alphabetical ordering as in FIG. 22.
[0072] FIG. 59 illustrates a cellphone that embodies many aspects
of the present invention.
[0073] FIG. 60 provides a highly simplified block diagram of the
major components of a typical cellphone such as that shown in FIG.
59.
[0074] FIG. 61 is a highly simplified block diagram of various
programming and data structures contained in one or more mass
storage devices on the cellphone of FIG. 59.
[0075] FIG. 62 illustrates that the cellphone of FIG. 59 allows
traditional phone dialing by the pressing of numbered phone
keys.
[0076] FIG. 63 is a highly simplified pseudocode description of the
command structure of the cellphone of FIG. 59 when in its top level
phone mode, as illustrated by the screen shown in the top of FIG.
62.
[0077] FIG. 64 illustrates how a user of the cellphone of FIG. 59
can access and quickly view the commands of a main menu by pressing
the menu key on the cellphone.
[0078] FIGS. 65 and 66 provide a highly simplified pseudocode
description of the operation of the main menu illustrated in FIG.
64.
[0079] FIGS. 67 through 74 illustrate command mappings of the
cellphone's numbered keys in each of various important modes and
menus associated with a speech recognition text editor that
operates on the cellphone of FIG. 59.
[0080] FIG. 75 illustrates how a user of the cellphone's text editing
software can rapidly see the function associated with one or more
keys in a non-menu mode by pressing the menu button and scrolling
through a command list that can be used substantially in the same
manner as a menu of the type shown in FIG. 64.
[0081] FIGS. 76 through 78 provide a highly simplified pseudocode
description of the responses of the cellphone's speech recognition
program when in its text window (editor) mode.
[0082] FIGS. 79 and 80 provide a highly simplified pseudocode
description of an entry mode menu that can be accessed from
various speech recognition modes to select among various ways to
enter text.
[0083] FIGS. 81 through 83 provide a highly simplified pseudocode
description of the correction Window routine used by the cellphone
to display a correction window and to respond to user input when
such correction window is shown.
[0084] FIG. 84 is a highly simplified pseudocode description of an
edit navigation menu that allows a user to select various ways of
navigating with the cellphone's navigation keys when the edit
mode's text window is displayed.
[0085] FIG. 85 is a highly simplified pseudocode description of a
correction window navigation menu that allows the user to select
various ways of navigating with the cellphone's navigation keys
when in a correction window, and also to select from among
different ways the correction window can respond to the selection
of an alternate choice in a correction window.
[0086] FIGS. 86 through 88 provide highly simplified pseudocode
descriptions of three slightly different embodiments of the key
Alpha mode, which enables a user to enter a letter by saying a word
starting with that letter and which responds to the pressing of a
phone key by substantially limiting such recognition to words
starting with one of the three or four letters associated with the
pressed key.
[0087] FIGS. 89 and 90 provide a highly simplified pseudocode
description of some of the options available under the edit
options menu that is accessible from many of the modes of the
cellphone's speech recognition programming.
[0088] FIGS. 91 and 92 provide a highly simplified description of a
word type menu that can be used to limit recognition choices to a
particular type of word, such as a particular grammatical type of
word.
[0089] FIG. 93 provides a highly simplified pseudocode description
of an entry preference menu that can be used to set default
recognition settings for various speech recognition functions, or
to set recognition duration settings.
[0090] FIG. 94 provides a highly simplified pseudocode description
of text to speech playback operation available on the
cellphone.
[0091] FIG. 95 provides a highly simplified pseudocode description
of how the cellphone's text to speech generation uses programming
and data structures also used by the cellphone's speech
recognition.
[0092] FIG. 96 is a highly simplified pseudocode description of the
cellphone's transcription mode that makes it easier for a user to
transcribe audio recorded on the cellphone using the device's
speech recognition capabilities.
[0093] FIG. 97 is a highly simplified pseudocode description of
programming that enables the cellphone's speech recognition editor
to be used to enter and edit text in dialogue boxes presented on
the cellphone, as well as to change the state of controls such as
list boxes, check boxes, and radio buttons in such dialog boxes.
[0094] FIG. 98 is a highly simplified pseudocode description of a
help routine available on the cellphone to enable a user to rapidly
find descriptions of various locations in the cellphone's command
structure.
[0095] FIGS. 99 and 100 illustrate examples of help menus of the
type displayed by the programming of FIG. 98.
[0096] FIGS. 101 and 102 illustrate how a user can use the help
programming of FIG. 98 to rapidly search for, and receive
descriptions of, the functions associated with various portions of
the cellphone's command structure.
[0097] FIGS. 103 and 104 illustrate a sequence of interactions
between a user and the cellphone's speech recognition editor's user
interface in which the user enters and corrects text using
continuous speech recognition.
[0098] FIG. 105 illustrates how a user can scroll horizontally in
a correction window displayed on the cellphone.
[0099] FIG. 107 illustrates operation of the key Alpha mode shown
in FIG. 86.
[0100] FIGS. 108 and 109 illustrate how the cellphone's speech
recognition editor allows the user to address and enter and edit
text in an e-mail message that can be sent by the cellphone's
wireless communication capabilities.
[0101] FIG. 110 illustrates how the cellphone's speech recognition
can combine scores from the discrete recognition of one or more
words with scores from a prior continuous recognition of those
words to help produce the desired output.
[0102] FIG. 111 illustrates how the cellphone speech recognition
software can be used to enter a URL for the purposes of accessing a
World Wide Web site using the wireless communication capabilities
of the cellphone.
[0103] FIGS. 112 and 113 illustrate how elements of the cellphone's
speech recognition user interface can be used to navigate World
Wide Web pages and to select items and enter and edit text in the
fields of such web pages.
[0104] FIG. 114 illustrates how elements of the cellphone speech
recognition user interface can be used to enable a user to more
easily read text strings too large to be seen at one time in a text
field displayed on the cellphone's screen, such as a text field of
a web page or dialog box.
[0105] FIG. 115 illustrates the cellphone's find dialog box, how a
user can enter a search string into that dialog box by speech
recognition, how the find function then performs a search for the
entered string, and how the found text can be used to label
audio recorded on the cellphone.
[0106] FIG. 116 illustrates how the dialog box editor programming
shown in FIG. 97 enables speech recognition to be used to select
from among the possible values associated with a list box.
[0107] FIG. 117 illustrates how speech recognition can be used to
dial people by name, and how the audio playback and recording
capabilities of the cellphone can be used during such a cellphone
call.
[0108] FIG. 118 illustrates how speech recognition can be turned on
and off when the cellphone is recording audio to insert text labels
or text comments into recorded audio.
[0109] FIG. 119 illustrates how the cellphone enables a user to
have speech recognition performed on portions of previously
recorded audio.
[0110] FIG. 120 illustrates how the cellphone enables a user to
strip text recognized for a given segment of sound from the audio
recording of that sound.
[0111] FIG. 121 illustrates how the cellphone enables the user to
turn on or off an indication of which portions of a selected
segment of text have associated audio recordings.
[0112] FIGS. 122 through 125 illustrate how the cellphone speech
recognition software allows the user to enter telephone numbers by
speech recognition and to correct the recognition of such numbers
when wrong.
[0113] FIG. 126 is provided to illustrate how many aspects of the
cellphone embodiment shown in FIGS. 59 through 125 can be used in
an automotive environment, including the TTS and duration logic
aspects of the cellphone embodiment.
[0114] FIGS. 127 and 128 illustrate that most of the aspects of the
cellphone embodiment shown in FIGS. 59 through 125 can be used
either on cordless phones or landline phones.
[0115] FIG. 129 provides a highly simplified pseudocode description
of the name dialing programming of the cellphone embodiment, which
is partially illustrated in FIG. 117.
[0116] FIG. 130 provides a highly simplified pseudocode description
of the cellphone's digit dial programming illustrated in FIGS. 122
through 125.
DETAILED DESCRIPTION OF SOME PREFERRED EMBODIMENTS
[0117] FIG. 9 illustrates the personal digital assistant, or PDA,
900 on which many aspects of the present invention can be used. The
PDA shown is similar to that currently being sold as the Compaq
iPAQ H3650 Pocket PC, the Casio Cassiopeia, and the Hewlett-Packard
Jornada 525.
[0118] The PDA 900 includes a relatively high resolution touch
screen 902, which enables the user to select software buttons as
well as portions of text by means of touching the touch screen,
such as with a stylus 904 or a finger. The PDA also includes a set
of input buttons 906 and a two-dimensional navigational control
908.
[0119] In this specification and the claims that follow, a
navigational input device that allows a user to select discrete
units of motion on one or more dimensions will often be considered
to be included in the definition of a button. This is particularly
true with regard to telephone interfaces, in which the up, down,
left, and right inputs of a navigational device will be considered
phone keys or phone buttons.
[0120] FIG. 10 provides a schematic system diagram of a PDA 900. It
shows the touch screen 902 and input buttons 906 (which include the
navigational input 908). It also shows that the device has a
central processing unit such as a microprocessor 1002. The CPU 1002
is connected over one or more electronic communication buses 1004
with read-only memory 1006 (often flash ROM); random access memory
1008; one or more I/O devices 1010; a video controller 1012 for
controlling displays on the touch screen 902; and an audio device
1014 for receiving input from a microphone 1015 and supplying audio
output to a speaker 1016.
[0121] The PDA also includes a battery 1018 for providing it with
portable power; a headphone-in and headphone-out jack 1020, which
is connected to the audio circuitry 1014; a docking connector 1022
for providing a connection between the PDA and another computer
such as a desktop; and an add-on connector 1024 for enabling a user
to add circuitry to the PDA such as additional flash ROM, a modem,
a wireless transceiver 1025, or a mass storage device.
[0122] FIG. 10 shows a mass storage device 1017. In actuality, this
mass storage device could be any type of mass storage device,
including all or part of the flash ROM 1006 or a miniature hard
disk. In such a mass storage device the PDA would normally store an
operating system 1026 for providing much of the basic functionality
of the device. Commonly it would include one or more application
programs, such as a word processor, a spreadsheet, a Web browser,
or a personal information management system, in addition to the
operating system and in addition to the speech recognition related
functionality explained next.
[0123] When the PDA 900 is used with the present invention, it will
normally include speech recognition programming 1030. It includes
programming for performing word matching of the general type
described above with regard to FIGS. 1 and 2. The speech
recognition programming will also normally include one or more
vocabularies or vocabulary groupings 1032 including a large
vocabulary that includes at least two thousand words. Many large
vocabulary systems have a vocabulary of fifty thousand to several
hundred thousand words. For each vocabulary word, the vocabulary
will normally have a text spelling 1034 and one or more vocabulary
groupings 1036 to which the word belongs (for example, the text
output "." might actually be in a large-vocabulary recognition
vocabulary, a spelling vocabulary, and a punctuation vocabulary
grouping in some systems). Each vocabulary word will also normally
have an indication of the one or more parts of speech 1038 in which
the word can be classified, and the phonetic spelling 1040 for the
word for each of those parts of speech.
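The per-word vocabulary data just described might be sketched as follows. This is a hypothetical illustration, not the patent's data layout; the class and field names are invented, and the phonetic spellings are only illustrative:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the per-word vocabulary data described above:
# a text spelling, the vocabulary groupings the word belongs to, and a
# phonetic spelling for each part of speech in which the word can occur.
@dataclass
class VocabularyWord:
    spelling: str                                  # text spelling (1034)
    groupings: set = field(default_factory=set)    # vocabulary groupings (1036)
    phonetics: dict = field(default_factory=dict)  # part of speech (1038) -> phonetic spelling (1040)

# Example: the output "." can belong to several vocabulary groupings at once,
# while a word like "record" has a different pronunciation per part of speech.
period = VocabularyWord(".", {"large", "spelling", "punctuation"})
record = VocabularyWord("record",
                        {"large"},
                        {"noun": "r eh k er d", "verb": "r ih k ao r d"})
```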
[0124] The speech recognition programming commonly includes a
pronunciation guesser 1042 for guessing the pronunciation of new
words that are added to the system and, thus, which do not have a
predefined phonetic spelling. The speech recognition programming
commonly includes one or more phonetic lexical trees 1044. A
phonetic lexical tree is a tree-shaped data structure that groups
together in a common path from the tree's root all phonetic
spellings that start with the same sequence of phonemes. Using such
lexical trees improves recognition performance because it enables
all portions of different words that share the same initial
phonetic spelling to be scored together.
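The grouping behavior described above can be illustrated with a minimal, hypothetical lexical-tree sketch; the dictionary-of-dictionaries layout and the phoneme symbols are assumptions, not the patent's implementation:

```python
# Minimal sketch of a phonetic lexical tree: words whose phonetic spellings
# share an initial phoneme sequence share a single path from the root, so
# that shared prefix need only be scored once during recognition.
def build_lexical_tree(lexicon):
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})  # descend, creating nodes as needed
        node.setdefault("#words", []).append(word)  # words ending at this node
    return root

lexicon = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "dog": ["d", "ao", "g"],
}
tree = build_lexical_tree(lexicon)
```

Because "cat" and "cab" share the initial phonemes k, ae, they share a single path from the root; only their final phonemes branch.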
[0125] Preferably the speech recognition programming will also
include a PolyGram language model 1045 that indicates the
probability of the occurrence of different words in text, including
the probability of words occurring in text given one or more
preceding and/or following words.
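The role of such a language model can be illustrated with a toy bigram sketch. The PolyGram model itself is not specified here, so this is only an assumed example of conditioning a word's probability on the preceding word:

```python
from collections import Counter

# Toy bigram illustration of a language model that scores a word's
# probability given the preceding word, as the PolyGram model is
# described as doing (this sketch is not the patent's implementation).
def train_bigrams(corpus_words):
    unigrams = Counter(corpus_words)
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    def prob(word, prev):
        if unigrams[prev] == 0:
            return 0.0
        # P(word | prev) = count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]
    return prob

words = "the cat sat on the mat the cat ran".split()
prob = train_bigrams(words)
```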
[0126] Commonly the speech recognition programming will store
language model update data 1046, which includes information that
can be used to update the PolyGram language model 1045 just
described. Commonly this language model update data will either
include or contain statistical information derived from text that
the user has created or that the user has indicated is similar to
the text that he or she wishes to generate. In FIG. 10 the speech
recognition programming is shown storing contact information 1048,
which includes names, addresses, phone numbers, e-mail addresses,
and phonetic spellings for some or all of such information. This
data is used to help the speech recognition programming recognize
the speaking of such contact information. In many embodiments of
the invention such contact information will be included in an
external program, such as one of the application programs 1028 or
accessories to the operating system 1026, but, even in such cases,
the speech recognition programming would normally need access to
such names, addresses, phone numbers, e-mail addresses, and
phonetic representations for them.
[0127] The speech recognition programming will also normally include
phonetic acoustic models 1050 which can be similar to the phonetic
models 200 shown in FIG. 2. Commonly the speech recognition
programming also stores acoustic model update data 1052, which
includes information from acoustic signals that have been
previously recognized by the system. Commonly such acoustic model
update data will be in the form of parameter frames, such as the
parameter frames 110 shown in FIGS. 1 and 2, or in the form of
statistical data that has been abstracted from such frames.
[0128] FIG. 11 provides close-up views of the user interface
provided by the touch screen 902 shown in FIG. 9 with the PDA using
a software input panel (or SIP) 1100 embodying many aspects of the
present invention.
[0129] FIG. 12 is similar to FIG. 11 except it shows the touch
screen 902 when the speech recognition SIP is displaying a
correction window 1200.
[0130] FIGS. 13 through 17 represent successive pages of a
pseudocode description of how the speech recognition SIP responds
to various inputs on its graphical user interface. For purposes of
simplicity this pseudocode is represented as one main event loop
1300 in the SIP program which responds to user input.
[0131] In FIGS. 13 through 17 this event loop is described as
having two major switch statements: a switch statement 1301 in FIG.
13 that responds to inputs on the user interface that can be
generated whether or not the correction window 1200 is displayed,
and a switch statement 1542 in FIG. 15 that responds to user inputs
that can only be generated when the correction window 1200 is
displayed.
[0132] If the user presses the Talk button 1102 shown in FIG. 11,
function 1302 of FIG. 13 causes functions 1304 through 1308 to be
performed. Function 1304 tests to see if there is any text in the
SIP buffer shown by the window 1104 in FIG. 11. In the SIP
embodiment shown in the FIGS., the SIP buffer is designed to hold a
relatively small number of lines of text, for which the SIP's
software keeps track of the acoustic input and best choices
associated with the recognition of each word, and of the linguistic
context created by such text. Such a text buffer is used because
the speech recognition SIP often will not have knowledge about the
text in the remote application shown in the window 1106 in FIG. 11
into which the SIP outputs text at the location of the current
cursor 1108 in the application. In other embodiments of the
invention a much larger SIP buffer could be used. In other
embodiments many of the aspects of the present invention will be
used as part of an independent speech recognition text creation
application that will not require the use of a SIP for the
inputting of text. The major advantage of using a speech recognizer
that functions as a SIP is that it can be used to provide input for
almost any application designed to run on a PDA.
[0133] Returning to FIG. 13, function 1304 clears any text from the
SIP buffer 1104 because the Talk button 1102 is provided as a way
for the user to indicate to the SIP that he is dictating text in a new
context. Thus, if the user of the SIP has moved the cursor 1108 in
the application window 1106 of FIG. 11, he should start the next
dictation by pressing the Talk button 1102.
[0134] Function 1306 in FIG. 13 responds to the pressing on the
Talk button by testing to see if the speech recognition system is
currently in correction mode. If so, it exits that mode, removing
any correction window 1200 of the type shown in FIG. 12 that might
be shown.
[0135] The SIP shown in the FIGS. is not in correction mode when a
correction window is displayed but has not been selected to
receive inputs from most buttons of the main SIP interface,
and it is in correction mode when the correction window is displayed
and has been selected to receive inputs from many of such buttons.
This distinction is desirable because the particular SIP shown can
be selected to operate in a one-at-a-time mode in which words are
spoken and recognized discretely, and in which a correction window
is displayed for each word as it is recognized to enable a user to
more quickly see the choice list or provide correction input. In
one-at-a-time mode most forms of user input not specifically
related to making corrections are used to perform the additional
function of confirming the first choice displayed in the current
choice list as the desired word. When the system is not in
one-at-a-time mode, the correction window is usually displayed only
when the user has provided input indicating a desire to correct
previous input. In such cases the correction window is displayed in
correction mode, because it is assumed that, since the user has
chosen to make a correction, most forms of input should be directed
to the correction window.
[0136] It should be appreciated that in systems that only use
one-at-a-time recognition, or those that do not use it at all,
there would be no need to have the added complication of switching
into and out of correction mode.
[0137] Returning to function 1306, it removes any current
correction window because the pressing of the Talk button 1302
indicates a desire to start new dictation, rather than an interest
in correcting old dictation.
[0138] Function 1308 of FIG. 13 responds to the pressing of the
Talk button by causing SIP buffer recognition to start according to
a previously selected current recognition duration mode. This
recognition takes place without any prior language context for the
first word. Preferably language model context will be derived from
words recognized in response to one pressing of the Talk button and
used to provide a language context for the recognition of the
second and subsequent words in such recognition.
[0139] FIG. 18 is a schematic representation of the recognition
duration programming 1800 that enables a user to select different
modes of activating speech recognition in response to the pressing
or clicking of any button in the SIP interface that can be used to
start speech recognition. In the shown embodiment there is a
plurality of buttons, including the Talk button, each of which can
be used to start speech recognition. This enables a user to both
select a given mode of recognition and to start recognition in that
mode with a single pressing of a button.
[0140] Function 1802 helps determine which functions of FIG. 18 are
performed, depending on the current recognition duration mode. The
mode can have been set in multiple different ways, including by
default and by selection under the Entry Preference option in the
function menu shown in FIG. 46.
[0141] If the Press Only recognition duration type has been
selected, function 1804 will cause functions 1806 and 1808 to
recognize speech sounds that are uttered during the pressing of a
speech button. This recognition duration type is both simple and
flexible because it enables a user to control the length of
recognition by one simple rule: recognition occurs during and only
during the pressing of a speech button. Preferably utterance and/or
end of utterance detection is used during any recognition mode, to
decrease the likelihood that background noises will be recognized
as utterances.
[0142] If the current recognition duration type is the Press And
Click To Utterance End type, function 1810 will cause functions
1812 and 1814 to respond to the pressing of a speech button by
recognizing speech during that press. In this case the "pressing"
of a speech button is defined as the pushing of such a button for
longer than a given duration, such as, for example, longer than
one-quarter or one-third of a second. If the user pushes on a
speech button for a shorter period of time, that push will be
treated as a "click" rather than as a "press," and functions 1816
and 1818 will initiate recognition starting from the time of that
click until the next end of utterance detection.
[0143] The Press And Click To Utterance End recognition duration
type has the benefit of enabling the use of one button to rapidly
and easily select between a mode that allows a user to select a
variable length extended recognition, and a mode that recognizes
only a single utterance.
[0144] If the current recognition duration type is the Press
Continuous, Click Discrete To Utterance End type, function 1820
causes functions 1822 through 1828 to be performed. If the speech
button is clicked, as just defined, functions 1822 and 1824 perform
discrete recognition until the next end of utterance. If, on the
other hand, the speech button is pressed, as previously defined,
functions 1826 and 1828 perform continuous recognition as long as
the speech button remains pressed.
[0145] This recognition duration type has the benefit of making it
easy for users to quickly switch between continuous and discrete
recognition merely by using different types of presses on a given
speech button. In the SIP embodiment shown, the other recognition
duration types do not switch between continuous and discrete
recognition.
[0146] If the current recognition duration type is the Click To
Timeout type, function 1830 causes functions 1832 to 1840 to be
performed. If the speech button is clicked, functions 1833 through
1836 normally toggle recognition between off and on. Function 1834
responds to a click by testing to see whether or not speech
recognition is currently on. If so, and if the speech button being
clicked is other than one that changes vocabulary, it responds to
the click by turning off speech recognition. On the other hand, if
speech recognition is off when the speech button is clicked,
function 1836 turns speech recognition on until a timeout duration
has elapsed. The length of this timeout duration can be set by the
user under the Entry Preferences option in the function menu 4602
shown in FIG. 46. If the speech button is pressed for longer than a
given duration, as described above, functions 1838 and 1840 will
cause recognition to be on during the press but to be turned off at
its end.
[0147] This recognition duration type provides a quick and easy way
for users to select with one button between toggling speech
recognition on and off, and causing speech recognition to be turned
on only during an extended press of a speech button.
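The four recognition duration types described above might be summarized by a dispatch sketch such as the following. The function and constant names are hypothetical, the 0.3-second press/click threshold merely follows the "one-quarter or one-third of a second" example in the text, and the vocabulary-button exception of function 1834 is omitted for brevity:

```python
PRESS_THRESHOLD = 0.3  # seconds; a shorter push counts as a "click" (assumed value)

# Sketch of the recognition-duration dispatch of FIG. 18: given the current
# duration type and how long the speech button was held, decide when
# recognition runs and whether it is continuous or discrete.
def recognition_action(duration_type, hold_seconds, recognizing=False):
    is_click = hold_seconds < PRESS_THRESHOLD
    if duration_type == "press_only":
        return "recognize_during_press"
    if duration_type == "press_and_click_to_utterance_end":
        return "recognize_to_utterance_end" if is_click else "recognize_during_press"
    if duration_type == "press_continuous_click_discrete":
        return "discrete_to_utterance_end" if is_click else "continuous_during_press"
    if duration_type == "click_to_timeout":
        if is_click:
            # a click toggles recognition: off if on, on until timeout if off
            return "turn_off" if recognizing else "recognize_until_timeout"
        return "recognize_during_press"
    raise ValueError(duration_type)
```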
[0148] Returning to function 1308 of FIG. 13, it can be seen that
the selection of different recognition duration types can allow the
user to select how the Talk button and other speech buttons
initiate recognition.
[0149] If the user selects the Clear button 1112 shown in FIG. 11,
functions 1309 through 1314 remove any correction window which
might be displayed and clear the contents of the SIP buffer without
sending any deletions to the operating system text input. As stated
above, in the speech SIP shown, the SIP text window 1104, shown in
FIG. 11, is designed to hold a relatively small body of text. As
text is entered or edited in the SIP buffer, characters are
supplied to the operating system of the PDA, causing corresponding
changes to be made to text in the application window 1106 shown in
FIG. 11. The Clear button enables a user to clear text from the SIP
buffer, to prevent it from being overloaded, without causing
corresponding deletions to be made to text in the application
window.
[0150] The Continue button 1114 shown in FIG. 11 is intended to be
used when the user wants to dictate a continuation of the last
dictated text, or text which is to be inserted at the current
location in the SIP buffer window 1104, shown in FIG. 11. When this
button is pressed, function 1316 causes functions 1318 through 1330
to be performed. Function 1318 removes any correction window,
because the pressing of the Continue button indicates that the user
has no interest in using the correction window. Next, function 1322
tests if the current cursor in the SIP buffer window has a prior
language context that can be used to help in predicting the
probability of the first word or words of any utterance recognized
as a result of the pressing of the Continue button. If so, it
causes that language context to be used. If not, and if there is
currently no text in the SIP buffer, function 1326 uses the last
one or more words previously entered in the SIP buffer as the
language context at the start of recognition initiated by the
Continue button. Next, function 1330 starts SIP buffer recognition,
that is, recognition of text to be output to the cursor in the SIP
buffer, using the current recognition duration mode.
[0151] If the user selects the Backspace button 1116 shown in FIG.
11, functions 1332 through 1336 will be performed. Function 1334
tests if the SIP is currently in the correction mode. If so, it
enters the backspace into the filter editor of the correction
window. The correction window 1200 shown in FIG. 12 includes a
first choice window 1202. As will be described below in greater
detail, the correction window interface allows the user to select
and edit one or more characters in the first choice window as being
part of a filter string which identifies a sequence of initial
characters belonging to the desired recognition word or words. If
the SIP is in the correction mode, pressing backspace will delete
from the filter string any characters currently selected in the
first choice window, and if no characters are so selected, will
delete the character to the left of the filter cursor 1204.
[0152] If the SIP is not currently in the correction mode, function
1336 will respond to the pressing of the Backspace button by
entering a backspace character into the SIP buffer and outputting
that same character to the operating system so that the same change
can be made to the corresponding text in the application window
1106 shown in FIG. 11.
[0153] If the user selects the New Paragraph button 1118 shown in
FIG. 11, functions 1338 through 1342 of FIG. 13 will exit
correction mode, if the SIP is currently in it, and they will enter
a New Paragraph character into the SIP buffer and provide
corresponding output to the operating system.
[0154] As indicated by functions 1344 through 1348, the SIP
responds to user selection of a Space button 1120 in substantially
the same manner that it responds to a backspace, that is, by
entering it into the filter editor if the SIP is in correction
mode, and otherwise outputting it to the SIP buffer and the
operating system.
[0155] If the user selects one of the Vocabulary Selection buttons
1122 through 1132 shown in FIG. 11, functions 1350 through 1370 of
FIG. 13, and functions 1402 through 1416 of FIG. 14, will set the
appropriate recognition mode's vocabulary to the vocabulary
corresponding to the selected button and start speech recognition
in that mode according to the current recognition duration mode and
other settings for the recognition mode.
[0156] If the user selects the Name Recognition button 1122,
functions 1350 and 1356 set the current mode's recognition
vocabulary to the name recognition vocabulary and start recognition
according to the current recognition duration settings and other
appropriate speech settings. With all of the vocabulary buttons
besides the Name and Large Vocabulary buttons, these functions will
treat the current recognition mode as either filter or SIP buffer
recognition, depending on whether the SIP is in correction mode.
This is because these other vocabulary buttons are associated with
vocabularies used for inputting sequences of characters that are
appropriate for defining a filter string or for direct entry into
the SIP buffer. The large vocabulary and the name vocabulary,
however, are considered inappropriate for filter string editing
and, thus, in the disclosed embodiment the current recognition mode
is considered to be either re-utterance or SIP buffer recognition,
depending on whether the SIP is in correction mode. In other
embodiments, name and large vocabulary recognition could be used
for editing a multiword filter.
[0157] In addition to the standard response associated with the
pressing of a vocabulary button, if the AlphaBravo Vocabulary
button is pressed, functions 1404 through 1406 cause a list of all
the words used by the International Communication Alphabet (or ICA)
to be displayed, as illustrated by numeral 4002 in FIG. 40.
[0158] If the user selects the Continuous/Discrete Recognition
button 1134 shown in FIG. 11, functions 1418 through 1422 of FIG.
14 are performed. These toggle between continuous recognition mode,
which uses continuous speech acoustic models and allows multiword
recognition candidates to match a given single utterance, and a
discrete recognition mode, which uses discrete recognition acoustic
models and only allows single word recognition candidates to be
recognized for a single utterance. The function also starts speech
recognition using either discrete or continuous recognition, as has
just been selected by the pressing of the Continuous/Discrete
button.
[0159] If the user selects the function key 1110 by pressing it,
functions 1424 and 1426 call the function menu 4602 shown in FIG.
46. This function menu allows the user to select from other options
besides those available directly from the buttons shown in FIGS. 11
and 12.
[0160] If the user selects the Help button 1136 shown in FIG. 11,
functions 1432 and 1434 of FIG. 14 call help mode.
[0161] As shown in FIG. 19, when the help mode is entered in
response to an initial pressing of the Help button, a function 1902
displays a help window 2000 providing information about using the
help mode, as illustrated in FIG. 20. During subsequent operation
of the help mode, if the user touches a portion of the SIP
interface, functions 1904 and 1906 display a help window with
information about the touched portion of the interface that
continues to be displayed as long as the user continues that touch.
This is illustrated in FIG. 21, in which the user has used the
stylus 904 to press the Filter button 1218 of the correction
window. In response, a help window 2100 is shown that explains the
function of the Filter button. If during the help mode a user
double-clicks on a portion of the display, functions 1908 and 1910
display a help window that stays up until the user presses another
portion of the interface. This enables the user to use the scroll
bar 2102 shown in the help window of FIG. 21 to scroll through and
read help information too large to fit on the help window 2100 at
one time.
[0162] Although not shown in FIG. 19, help windows can also have a
Keep Up button 2100 to which a user can drag from an initial down
press on a portion of the SIP user interface of interest, so as to
keep the help window up until the touching of another
portion of the SIP user interface.
[0163] When, after the initial entry of the help mode, the user
again touches the Help button 1136 shown in FIGS. 11, 20, and 21,
functions 1912 and 1914 remove any help windows and exit the help
mode, turning off the highlighting of the Help button.
[0164] If a user taps on a word in the SIP Buffer, functions 1436
through 1438 of FIG. 14 make the selected word the current
selection and call the display Choice List routine shown in FIG. 22
with the tapped word as the current selection and with acoustic
data associated with the recognition of the tapped word, if any,
the first entry in an utterance list, which holds acoustic data
associated with the current selection.
[0165] As shown in FIG. 22, the display Choice List routine is
called with the following parameters: a selection parameter; a
filter string parameter; a filter range parameter; a word type
parameter; and a Not Choice List flag. The selection parameter
indicates the text in the SIP buffer for which the routine has been
called. The filter string indicates a sequence of one or more
characters indicating elements that define the set of one or more
possible spellings with which the desired recognition output
begins. The filter range parameter defines two character sequences,
which bound a section of the alphabet in which the desired
recognition output falls. The word type parameter indicates that
the desired recognition output is of a certain type, such as a
desired grammatical type. The Not Choice List flag indicates a list
of one or more words that the user's actions indicate are not a
desired word.
[0166] Function 2202 of the display Choice List routine calls a get
Choices routine, shown in FIG. 23, with the filter string and
filter range parameters with which the display Choice List routine
has been called and with an utterance list associated with the
selection parameter.
[0167] As shown in FIGS. 24 and 25, the utterance list 2404 stores
sound representations of one or more utterances that have been
spoken as part of the desired sequence of one or more words
associated with the current selection. As previously stated, when
function 2202 of FIG. 22 calls the get Choices routine, it places in
the utterance list a representation 2400, shown in FIG. 24, of that
portion of the sound 2402 from which the words of the current
selection have been recognized. As was indicated in FIG. 2, the
process of speech
recognition time-aligns acoustic models against representations of
an audio signal. The recognition system preferably stores these
time alignments so that when corrections or playback of selected
text are desired it can find the corresponding audio
representations from such time alignments.
[0168] In FIG. 24 the first entry 2400 in the utterance list is
part of a continuous utterance 2402. The present invention enables
a user to add additional utterances of a desired sequence of one or
more words to a selection's utterance list, and recognition can be
performed on all these utterances together to increase the chance of
correctly recognizing a desired output. As shown in FIG. 24, such
additional utterances can include both discrete utterances, such as
entry 2400A, as well as continuous utterances, such as entry 2400B.
Each additional utterance contains information as indicated by the
numerals 2406 and 2408 that indicates whether it is a continuous or
discrete utterance and the vocabulary mode in which it was
dictated.
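An utterance-list entry of the kind just described might be sketched as follows. The class and field names are hypothetical, and the audio field stands in for whatever sound representation (waveform or parameter frames) a given embodiment stores:

```python
from dataclasses import dataclass

# Sketch of an utterance-list entry: the stored sound representation for a
# selection plus the flags described in the text (numerals 2406 and 2408):
# whether the utterance was continuous or discrete, and its vocabulary mode.
@dataclass
class UtteranceEntry:
    audio: bytes      # waveform or parameter frames for the utterance
    continuous: bool  # continuous (True) vs. discrete (False) utterance
    vocabulary: str   # vocabulary mode in which it was dictated

# An original continuous utterance plus a later discrete re-utterance,
# both available for joint recognition of the same selection.
utterance_list = [
    UtteranceEntry(b"\x00\x01", continuous=True, vocabulary="large"),
    UtteranceEntry(b"\x02\x03", continuous=False, vocabulary="large"),
]
```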
[0169] In FIGS. 24 and 25, the acoustic representations of
utterances in the utterance list are shown as waveforms. It should
be appreciated that in many embodiments, other forms of acoustic
representation will be used, including parameter frame
representations such as the representation 110 shown in FIGS. 1 and
2.
[0170] FIG. 25 is similar to FIG. 24, except that in it, the
original utterance list entry is a sequence of discrete utterances.
It shows that additional utterance entries used to help correct the
recognition of an initial sequence of one or more discrete
utterances can also include either discrete or continuous
utterances, 2500A and 2500B, respectively.
[0171] As shown in FIG. 23, the getChoices routine 2300 includes a
function 2302 which tests to see if there has been a prior
recognition for the selection for which this routine has been
called that has been performed with the current utterance list and
filter values (that is, filter string and filter range values). If
so, it causes function 2304 to return with the choices from that
prior recognition, since there have been no changes in the
recognition parameters since the time the prior recognition was
made.
[0172] If the test of function 2302 is not met, function 2306 tests
to see if the filter range parameter is null. If it is not null,
function 2308 tests to see if the filter range is more specific
than the current filter string, and, if so, it changes the filter
string to the common letters of the filter range. If not, function
2312 nulls the filter range, since the filter string contains more
detailed information than it does.
[0173] As will be explained below, a filter range is selected when
a user selects two choices on a choice list as an indication that
the desired recognition output falls between them in the alphabet.
When the user selects two choices that share initial letters,
function 2310 causes the filter string to correspond to those
shared letters. This is done so that when the choice list is
displayed, the shared letters will be indicated to the user as one
which has been confirmed as corresponding to the initial characters
of the desired output.
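The reconciliation of a filter range with a filter string described in functions 2306 through 2312 can be sketched as follows. This is a minimal Python illustration; the function name and the representation of a filter range as a pair of boundary words are hypothetical, not part of the disclosed embodiment.

```python
import os

def reconcile_filter(filter_range, filter_string):
    # Sketch of functions 2306-2312: reconcile a filter range (a pair of
    # alphabetically ordered boundary words) with the current filter string.
    if filter_range is None:
        return filter_range, filter_string
    start, end = filter_range
    # The letters shared by both boundary words are confirmed initial
    # characters of the desired output.
    shared = os.path.commonprefix([start, end])
    if len(shared) > len(filter_string):
        # The range is more specific: adopt its shared letters as the string.
        return filter_range, shared
    # The filter string already carries more detail: null the range.
    return None, filter_string
```

For example, a range of ("table", "tablet") shares the prefix "table", which would replace a shorter filter string such as "ta".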
[0174] It should be appreciated that when the user performs a
command that selects either a new filter range or filter string, if
the newly selected one of these two parameters has values that
contradict values in the other, the value of the older of these two
parameters will be nulled.
[0175] If there are any candidates from a prior recognition of the
current utterance list, function 2316 causes functions 2318 and 2320
to be performed. Function 2318 calls a filter Match routine shown
in FIG. 26 for each such prior recognition candidate with the
candidate's prior recognition score and the current filter
definitions, and function 2320 deletes those candidates returned as
a result of such calls that have scores below a certain
threshold.
[0176] As indicated in FIG. 26, the filter Match routine 2600
performs filtering upon word candidates. In the embodiment of the
invention shown, this filtering process is extremely flexible,
since it allows filters to be defined by filter strings, filter
range, or word type. It is also flexible because it allows a
combination of word type and either filter string or filter range
specifications, and because it allows ambiguous filtering,
including ambiguous filters where elements in a filter string are
not only ambiguous as to the value of their associated characters
but also ambiguous as to the number of characters in their
associated character sequences.
[0177] When we say a filter string or a portion of a filter string
is ambiguous, we mean that a plurality of possible character
sequences can be considered to match it. Ambiguous filtering is
valuable when used with a filter string input, which, although
reliably recognized, does not uniquely define a single character,
such as is the case with ambiguous phone key filtering of the type
described below with regard to a cell phone embodiment of many
aspects of the present invention.
[0178] Ambiguous filtering is also valuable with filter string
input that cannot be recognized with a high degree of certainty,
such as recognition of letter names, particularly if the
recognition is performed continuously. In such cases, not only is
there a high degree of likelihood that the best choice for the
recognition of the sequence of characters will include one or more
errors, but also there is a reasonable probability that the number
of characters recognized in a best-scoring recognition candidate
might differ from the number spoken. But spelling all or the
initial characters of a desired output is a very rapid and
intuitive way of inputting filtering information, even though the
best choice from such recognition will often be incorrect,
particularly when dictating under adverse conditions.
[0179] The filter Match routine is called for each individual word
candidate. It is called with that word candidate's prior
recognition score, if any, or else with a score of 1. It returns a
recognition score equal to the score with which it has been called
multiplied by the probability that the candidate matches the
current filter values.
[0180] Functions 2602 through 2606 of the filter Match routine test
to see if the word type parameter has been defined, and, if so and
if the word candidate is not of the defined word type, it returns
from the filter Match function with a score of 0, indicating that
the word candidate is clearly not compatible with current filter
values.
[0181] Functions 2608 through 2614 test to see if a current value
is defined for the filter range. If so, and if the current word
candidate is alphabetically between the starting and ending words
of that filter range, they return with an unchanged score value.
Otherwise they return with a score value of 0.
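The range test of functions 2608 through 2614 amounts to a simple alphabetic comparison, sketched below in Python (the function name and calling convention are hypothetical).

```python
def range_score(candidate, score, filter_range):
    # Sketch of functions 2608-2614: a candidate passes unchanged only if
    # it falls alphabetically between the two boundary words of the range.
    if filter_range is None:
        return score
    start, end = filter_range
    return score if start <= candidate <= end else 0.0
```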
[0182] Function 2616 determines if there is a defined filter
string. If so, it causes functions 2618 through 2653 to be
performed. Function 2618 sets the current candidate character, a
variable that will be used in the following loop, to the first
character in the word candidate for which filter Match has been
called. Next, a loop 2620 is performed until the end of the filter
string is reached by its iterations. This loop includes functions
2622 through 2651.
[0183] The first function in each iteration of this loop is the
test by step 2622 to determine the nature of the next element in
the filter string. In the embodiment shown, three types of filter
string elements are allowed: an unambiguous character, an ambiguous
character, and an ambiguous element representing a set of ambiguous
character sequences, which can be of different lengths.
[0184] An unambiguous character unambiguously identifies a letter
of the alphabet or other character, such as a space. It can be
produced by unambiguous recognition of any form of alphabetic
input, but it is most commonly associated with letter or ICA word
recognition, keyboard input, or non-ambiguous phone key input in
phone implementations. Any recognition of alphabetic input can be
treated as unambiguous merely by accepting a single best scoring
spelling output by the recognition as an unambiguous character
sequence.
[0185] An ambiguous character is one which can have multiple letter
values, but which has a definite length of one character. As stated
above, this can be produced by the ambiguous pressing of keys in
a telephone embodiment, or by speech or character recognition of
letters. It can also be produced by continuous recognition of
letter names in which all the best scoring character sequences have
the same character length.
[0186] An ambiguous length element is commonly associated with the
output of continuous letter name recognition or handwriting
recognition. It represents multiple best-scoring letter sequences
against handwriting or spoken input, some of which sequences can
have different lengths.
[0187] If the next element in the filter string is an unambiguous
character, function 2624 causes functions 2626 through 2630 to be
performed. Function 2626 tests to see if the current candidate
character matches the current unambiguous character. If not, the
call to filter Match returns with a score of 0 for the current word
candidate. If so, function 2630 increments the position of the
current candidate character.
[0188] If the next element in the filter string is an ambiguous
character, function 2632 causes functions 2634 through 2636 to be
performed. Function 2634 tests to see if the current character
fails to match one of the recognized values of the ambiguous
character. If so, function 2636 returns from the call to
filterMatch with a score of 0. Otherwise, functions 2638 through
2642 alter the current word candidate's score as a function of the
probability of the ambiguous character matching the current
candidate character's value, and then increment the current
candidate character's position.
[0189] If the next element in the filter string is an ambiguous
length element, function 2644 causes a loop 2646 to be performed
for each character sequence represented by the ambiguous length
element. This loop comprises functions 2648 through 2652. Function
2648 tests to see if there is a matching sequence of characters
starting at the current candidate's character position that matches
the current character sequence of the loop 2646. If so, function
2649 alters the word candidate's score as a function of the
probability of the recognized matching sequence represented by the
ambiguous length element, and then function 2650 increments the
current position of the current candidate character by the number
of the characters in the matching ambiguous length element sequence.
If there is no sequence of characters starting at the current word
candidate's character position that matches any of the sequences of
characters associated with the ambiguous length element, functions
2651 and 2652 return from the call to filter Match with a score of
0.
[0190] If the loop 2620 is completed, the current word candidate
will have matched against the entire filter string. In this case,
function 2653 returns from filter Match with the current word's
score produced by the loop 2620.
[0191] If the test of step 2616 finds that there is no filter
string defined, step 2654 merely returns from filter Match with the
current word candidate's score unchanged.
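The matching loop of functions 2616 through 2653 can be sketched as follows. This is a simplified Python illustration under assumed representations: an unambiguous character as ("char", letter), an ambiguous character as ("ambig", dict of letter probabilities), and an ambiguous length element as ("seq", dict of sequence probabilities). None of these names appear in the disclosed embodiment.

```python
def filter_match(candidate, score, filter_string):
    # Sketch of the loop 2620: walk the filter string, advancing a
    # position in the candidate word and scaling its score.
    pos = 0
    for kind, value in filter_string:
        if kind == "char":
            # Unambiguous character: must match exactly (functions 2626-2630).
            if pos >= len(candidate) or candidate[pos] != value:
                return 0.0
            pos += 1
        elif kind == "ambig":
            # Ambiguous character: scale by the probability of the matched
            # value (functions 2634-2642).
            if pos >= len(candidate) or candidate[pos] not in value:
                return 0.0
            score *= value[candidate[pos]]
            pos += 1
        elif kind == "seq":
            # Ambiguous length element: try each represented sequence
            # (loop 2646); a fuller version would take the best-scoring match.
            for seq, prob in value.items():
                if candidate.startswith(seq, pos):
                    score *= prob
                    pos += len(seq)
                    break
            else:
                return 0.0
    # Loop completed: the candidate matched the entire filter (function 2653).
    return score
```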
[0192] Returning now to function 2318 of FIG. 23, it can be seen
that the call to filter Match for each word candidate will return a
score for the candidate. These are the scores that are used to
determine which word candidates to delete in function 2320.
[0193] Once these deletions have taken place, function 2322 tests
to see if the number of prior recognition candidates left after the
deletions, if any, of function 2320 is below a desired number of
candidates. Normally this desired number would represent a desired
number of choices for use in a choice list. If the number of prior
recognition candidates is below such a desired number, functions
2324 through 2336 are performed. Function 2324 performs speech
recognition upon every one of the one or more entries in the
utterance list 2400, shown in FIGS. 24 and 25. As indicated by
functions 2326 and 2328, this recognition process includes a test
to determine if there are both continuous and discrete entries in
the utterance list, and, if so, limits the number of possible word
candidates in recognition of the continuous entries to a number
corresponding to the number of individual utterances detected in
one or more of the discrete entries. The recognition of function
2324 also includes recognizing each entry in the utterance list
with either continuous or discrete recognition, depending upon the
respective mode that was in effect when each was received, as
indicated by the continuous or discrete recognition indication 2406
shown in FIGS. 24 and 25. As indicated by 2332, the recognition of
each utterance list entry also includes using the filter Match
routine previously described and using a language model in
selecting a list of best-scoring acceptable candidates for the
recognition of each such utterance. In the filter Match routine,
the vocabulary indicator 2408 shown in FIGS. 24 and 25 for the most
recent utterance in the utterance list is used as a word type
filter to reflect any indication by the user that the desired word
sequence is limited to one or more words from a particular
vocabulary. The language model used is a PolyGram language model,
such as a bigram or trigram language model, which uses any prior
language contexts that are available in helping to select the
best-scoring candidates.
[0194] After the recognition of one or more entries in the
utterance list has been performed, if there is more than one entry
in the utterance list, functions 2334 and 2336 pick a list of
best-scoring recognition candidates for the utterance list based on a
combination of scores from different recognitions. It should be
appreciated that in some embodiments of this aspect of the
invention, combination of scoring could be used from the
recognition of the different utterances so as to improve the
effectiveness of the recognition using more than one utterance.
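One simple way such a combination of scores could work is sketched below: each candidate is credited with the product of its scores across the separate recognitions, with a small floor score for recognitions in which it did not appear. This is an illustrative assumption, not the disclosed scoring rule.

```python
from collections import defaultdict

def combine_recognitions(recognition_results, n_best=6):
    # Sketch of functions 2334-2336: merge the scored candidate lists
    # (dicts of word -> score) produced by recognizing each utterance
    # list entry, and return the best-scoring combined candidates.
    floor = 1e-6  # hypothetical floor for a candidate missing from a list
    vocab = set()
    for result in recognition_results:
        vocab |= set(result)
    combined = defaultdict(lambda: 1.0)
    for result in recognition_results:
        for word in vocab:
            combined[word] *= result.get(word, floor)
    return sorted(combined, key=combined.get, reverse=True)[:n_best]
```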
[0195] If the number of recognition candidates produced by
functions 2314 through 2336 is less than the desired number, and if
there is a non-null filter string or filter range definition,
functions 2338 and 2340 use filter Match to select a desired number
of additional choices from the vocabulary associated with the most
recent entry in the utterance list, or the current recognition
vocabulary if there are no entries in the utterance list.
[0196] If there are no candidates from either recognition or the
current vocabulary by the time the get Choices routine of FIG. 23
reaches function 2342, function 2344 uses the best-scoring
character sequences that match the current filter string as
choices, up to the desired number of choices. When the filter
string contains nothing but unambiguous characters, only the single
character sequence that matches those unambiguous characters will
be selected as possible choices. However, where there are ambiguous
characters and ambiguous length elements in the filter string,
there will be a plurality of such character sequence choices. And
where ambiguous characters or ambiguous length elements have
different probabilities associated with different possible
corresponding sequences of one or more characters, the choices
produced by function 2344 will be scored correspondingly by a
scoring mechanism corresponding to that shown in functions 2616
through 2653 of FIG. 26.
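The expansion of a filter string into spelled choices performed by function 2344 can be sketched as follows, reusing an assumed element representation of ("char", letter), ("ambig", dict), and ("seq", dict); the function name and format are hypothetical.

```python
def filter_choices(filter_string, desired=6):
    # Sketch of function 2344: when neither recognition nor the vocabulary
    # yields candidates, expand the filter string itself into its
    # best-scoring matching character sequences.
    choices = [("", 1.0)]
    for kind, value in filter_string:
        if kind == "char":
            # An unambiguous character contributes exactly one continuation.
            value = {value: 1.0}
        # Extend every partial spelling by every alternative, multiplying
        # in that alternative's probability.
        choices = [(prefix + s, score * p)
                   for prefix, score in choices
                   for s, p in value.items()]
    choices.sort(key=lambda c: c[1], reverse=True)
    return choices[:desired]
```

With only unambiguous characters, exactly one sequence results; ambiguous elements multiply out into a scored list of alternatives.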
[0197] When the call to get Choices returns, a list of choices
produced by recognition, by selection from a vocabulary according
to a filter, or by selection from a list of possible filter matches
will normally be returned.
[0198] Returning now to FIG. 22, when the call to get Choices in
function 2202 returns to the display Choice List routine, function
2204 tests to see if any filter has been defined for the current
selection, if there has been any utterance added to the current
selection's utterance list, and if the selection for which display
Choice List has been called is not in the not Choice List, which
includes a list of one or more words that the user's inputs
indicate are not desired as recognition candidates. If these
conditions are met, function 2206 makes that selection the first
choice for display in the correction window, which the routine is
to create. Next, function 2210 removes any other candidates from
the list of candidates produced by the call to the get Choices
routine that are contained in the not Choice List. Next, if the
first choice has not already been selected by function 2206,
function 2212 makes the best-scoring candidate returned by the call
to get Choices the first choice for the subsequent correction
window display. If there is no single best-scoring recognition
candidate, alphabetical order can be used to select the candidate
which is to be the first choice. Next, function 2218 selects those
characters of the first choice which correspond to the filter
string, if any, for special display. As will be described below, in
the preferred embodiments, characters in the first choice which
correspond to an unambiguous filter are indicated in one way, and
characters in the first choice which correspond to an ambiguous
filter are indicated in a different way so that the user can
appreciate which portions of the filter string correspond to which
type of filter elements. Next, function 2220 places a filter cursor
before the first character of the first choice that does not
correspond to the filter string. When there is no filter string
defined, this cursor will be placed before the first character of
the first choice.
[0199] Next, function 2222 causes steps 2224 through 2228 to be
performed if the getChoices routine returned any candidates other
than the current first choice. In this case, function 2224 creates
a first-character-ordered choice list from a set of the
best-scoring such candidates that will all fit in the correction
window at one time. If there are any more recognition candidates,
functions 2226 and 2228 create a second-character-ordered choice
list of up to a preset number of screens for all such choices from
the remaining best-scoring candidates.
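The division of candidates into a first and a second alphabetically ordered choice list, described in functions 2224 through 2228, can be sketched as below (names and the screen-size parameter are hypothetical).

```python
def split_choice_lists(candidates, first_choice, screen_size=6):
    # Sketch of functions 2224-2228: the best-scoring candidates that fit
    # in the correction window at one time form the first choice list;
    # the remaining best-scoring candidates form the second list. Both
    # lists are presented in alphabetical order.
    others = [(w, s) for w, s in candidates if w != first_choice]
    others.sort(key=lambda c: c[1], reverse=True)
    first_list = sorted(w for w, _ in others[:screen_size])
    second_list = sorted(w for w, _ in others[screen_size:])
    return first_list, second_list
```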
[0200] When all this has been done, function 2230 displays a
correction window showing the current first choice, an indication
of which of its characters correspond to the filter, an
indication of the current filter cursor location, and the
first choice list. In FIG. 12 the first choice 1206 is shown in the
first choice window 1202 and the filter cursor 1204 is shown before
the first character of the first choice, since there currently has
not been any filter defined.
[0201] It should be appreciated that the display Choice List
routine can be called with a null value for the current selection
as well as for a text selection which has no associated utterances.
In this case, it will respond to alphabetic input by performing
word completion based on the operation of functions 2338 and 2340.
It allows a user to select choices for the recognition of an utterance
without the use of filtering or re-utterances, to use filtering
and/or re-utterances to help correct a prior recognition, to
perform word completion upon alphabetic filtering input, and, if
desired, to help such an alphabetic completion process by the entering
of a subsequent utterance, to spell a word which is not in the
current vocabulary with alphabetic input, to mix and match
different forms of alphabetic input including forms which are
unambiguous, ambiguous with regard to character, and ambiguous with
regard to length.
[0202] Returning now to FIG. 14, we've explained how functions 1436
and 1438 respond to a tap on a word in the SIP buffer by calling
the display Choice List routine, which in turn, causes a correction
window such as the correction window 1200 shown in FIG. 12 to be
displayed. The ability to display a correction window with its
associated choice list merely by tapping on a word provides a fast
and convenient way for enabling a user to correct single word
errors.
[0203] If the user double taps on a selection in the SIP buffer,
functions 1440 through 1444 escape from any current correction
window that might be displayed, and start SIP buffer recognition
according to current recognition duration modes and settings using
the current language context of the current selection. The
recognition duration logic responds to the duration of the key
press type associated with such a double-click in determining
whether to respond as if there has been either a press or a click
for the purposes described above with regard to FIG. 18. The output
of any such recognition will replace the current selection.
Although not shown in the FIGS., if the user double taps on a word
in the SIP buffer, it is treated as the current selection for the
purpose of function 1444.
[0204] If the user taps in any portion of the SIP buffer which does
not include text, such as between words or before or after the text
in the buffer, function 1446 causes functions 1448 to 1452 to be
performed. Function 1448 plants a cursor at the location of the
tap. If the tap is located at any point in the SIP buffer window
which is after the end of the text in the SIP buffer, the cursor
will be placed after the last word in that buffer. If the tap is a
double tap, functions 1450 and 1452 start SIP buffer recognition at the
new cursor location according to the current recognition duration
modes and other settings, using the duration of the second touch of
the double tap for determining whether it is to be responded to as
a press or a click.
[0205] FIG. 15 is a continuation of the pseudocode described above
with regard to FIGS. 13 and 14.
[0206] If the user drags across part of one or more words in the
SIP buffer, functions 1502 and 1504 call the display Choice List
routine described above with regard to FIG. 22 with all of the
words that are all or partially dragged across as the current
selection and with the acoustic data associated with the
recognition of those words, if any, as the first entry in the
utterance list.
[0207] If the user drags across an initial part of an individual
word in the SIP buffer, functions 1506 and 1508 call the display
Choice List function with that word as the selection, with that
word added to the not Choice List, with the dragged initial portion
of the word as the filter string, and with the acoustic data
associated with that word as the first entry in the utterance list.
This programming interprets the fact that a user has dragged across
only the initial part of a word as an indication that the entire
word is not the desired choice, as indicated by the fact that the
word is added to the not Choice List.
[0208] If a user drags across the ending of an individual word in
the SIP buffer, functions 1510 and 1512 call the display Choice
List routine with the word as the selection, with the selection added
to the not Choice List, with the undragged initial portion of the
word as the filter string, and with the acoustic data associated
with the selected word as the first entry in the utterance list.
[0209] If an indication is received that the SIP buffer has more
than a certain amount of text, functions 1514 and 1516 display a
warning to the user that the buffer is close to full. In the
disclosed embodiment this warning informs the user that the buffer
will be automatically cleared if more than an additional number of
characters are added to the buffer, and requests that the user
verify that the text currently in the buffer is correct and then
press talk or continue, which will clear the buffer.
[0210] If an indication is received that the SIP buffer has
received text input, function 1518 causes steps 1520 through 1528
to be performed. Function 1520 tests to see if the cursor is
currently at the end of the SIP buffer. If not, function 1522
outputs to the operating system a number of backspaces equal to the
distance from the last letter of the SIP buffer to the current
cursor position within that buffer. Next, function 1526 causes the
text input, which can be composed of one or more characters, to be
output into the SIP buffer at its current cursor location. Steps
1527 and 1528 output the same text sequence and any following text
in the SIP buffer to the text input of the operating system.
[0211] The fact that function 1522 feeds backspaces to the
operating system before the recognized text is sent to the OS, as
well as the fact that function 1528 feeds any text following the
received text to the operating system causes any change made to the
text of the SIP buffer that corresponds to text previously supplied
to the application window to also be made to that text in the
application window.
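The keystroke-level synchronization described in functions 1520 through 1528 can be sketched as follows (the function name and return convention are hypothetical; "\b" stands in for a backspace keystroke sent to the operating system).

```python
def sync_to_os(buffer, cursor, new_text):
    # Sketch of functions 1520-1528: to insert text at the cursor, send
    # enough backspaces to erase everything after the cursor in the OS
    # copy of the text (function 1522), then resend the new text followed
    # by the tail of the buffer (functions 1526-1528), so the application
    # window mirrors the SIP buffer.
    backspaces = len(buffer) - cursor
    tail = buffer[cursor:]
    keystrokes = "\b" * backspaces + new_text + tail
    updated = buffer[:cursor] + new_text + tail
    return updated, keystrokes
```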
[0212] If the SIP program is in one-at-a-time mode when an
indication of new SIP buffer text input is received, function 1536
tests to see if the text input has been generated in response to
speech recognition. If so, function 1537 calls the display Choice
List routine for the recognized text, and function 1538 turns off
correction mode. Normally, the calling of the display Choice List
routine switches the system to correction mode, but function 1538
prevents this from being the case when one-at-a-time mode is being
used. As has been described above, this is because in one-at-a-time
mode, a correction window is displayed automatically each time
speech recognition is performed upon an utterance of the word, and
thus there is a relatively high likelihood that a user intends
input supplied to the non-correction window aspects of the SIP
interface to be used for purposes other than input into the
correction window. On the other hand, when the correction window is
being displayed as a result of specific user input indicating a
desire to correct one or more words, correction mode is entered so
that certain non-correction window inputs will be directed to the
correction window.
[0213] Function 1539 tests to see if the following set of
conditions is met: the SIP is in one-at-a-time mode, a correction
window is displayed, but the system is not in correction mode. This
is the state of affairs which normally exists after each utterance
of the word in one-at-a-time mode. If these conditions exist,
function 1540 responds to any of the inputs described above in FIGS. 13, 14,
and 15 by confirming recognition of the first choice in the
correction window for purposes of causing that choice to be
introduced as text output into the SIP buffer and to the operating
system for purposes of updating the current language context for
the recognition of one or more subsequent words, for the purpose of
providing data for use in updating the language model, and for the
purpose of providing data for updating acoustic models. This
enables a user to confirm the prior recognition of the word in
one-at-a-time mode by any one of a large number of inputs which can
be used to also advance the recognition process.
[0214] It should be appreciated that if the user is in
one-at-a-time mode and generates inputs indicating a desire to
correct the word shown in a choice list, the SIP will be set to the
correction mode, and subsequent input during the continuation of
that mode will not cause operation of function 1540.
[0215] Function 1542 in FIG. 15 indicates the start of the portion
of the main response loop of the SIP program, which relates to
inputs received when a correction window is displayed. This portion
extends through the remainder of FIG. 15 and all of FIGS. 16 and
17.
[0216] If the Escape button 1210 of a correction window shown in
FIG. 12 is pressed, functions 1544 and 1546 cause the SIP program
to exit the correction window without changing the current
selection.
[0217] If the Delete button 1212 of the correction window shown in
FIG. 12 is pressed, functions 1548 and 1550 delete the current
selection in the SIP buffer and send an output to the operating
system, which causes a corresponding change to be made to any text
in the application window corresponding to that in the SIP
buffer.
[0218] If the New button 1214 shown in FIG. 12 is pressed,
function 1552 causes functions 1553 to 1556 to be performed.
Function 1553 deletes the current selection in the SIP buffer
corresponding to the correction window and sends output to the
operating system so as to cause a corresponding change to text in
the application window. Function 1554 sets the recognition mode to
the new utterance default, which will normally be the large
vocabulary recognition mode, and can be set by the user to be
either continuous or discrete recognition mode. Function 1556
starts SIP buffer recognition using the current recognition
duration mode and other recognition settings. SIP buffer
recognition is recognition that provides an input to the SIP
buffer, according to the operation of functions 1518 to 1538,
described above.
[0219] FIG. 16 continues the illustration of the response of the
main loop of the SIP program to input received during the display
of a correction window.
[0220] If the re-utterance button 1216 of FIG. 12 is pressed,
function 1602 causes functions 1603 through 1610 to be performed.
Function 1603 sets the SIP program to the correction mode if it is
not currently in it. This will happen if the correction window has
been displayed as a result of a discrete word recognition in
one-at-a-time mode and the user responds by pressing a button in
the correction window, in this case the Re-utterance button,
indicating an intention to use the correction window for correction
purposes. Next, function 1604 sets the recognition mode to the
current recognition mode associated with re-utterance recognition.
Then function 1606 receives one or more utterances according to the
current re-utterance recognition duration mode and other
recognition settings, including vocabulary. Next function 1608 adds
the one or more utterances received by function 1606 to the
utterance list for the correction window selection, along with an
indication of the vocabulary mode at the time of those utterances,
and whether continuous or discrete recognition is in effect. This
causes the utterance list 2400 shown in FIGS. 24 and 25 to have an
additional utterance.
[0221] Then function 1610 calls the display Choice List routine of
FIG. 22, described above. This in turn will call the get Choices
function described above regarding FIG. 23 and will cause functions
2306 through 2336 to perform re-utterance recognition using the new
utterance list entry.
[0222] If the Filter button 1218 shown in FIG. 12 is pressed,
function 1612 of FIG. 16 causes functions 1613 to 1620 to be
performed. Function 1613 enters the correction mode if the SIP
program is not currently in it, as described above with regard to
function 1603. Function 1614 tests to see whether the current filter
entry mode is a speech recognition mode, and, if so, causes function
1616 to start filter recognition according to the current filter
recognition duration mode and settings. This causes any input
generated by such recognition to be directed to the cursor of the
current filter string. If, on the other hand, the current filter
entry mode is a non-speech recognition entry window mode, functions
1618 and 1620 call the appropriate entry window. As described
below, in the embodiment of the invention shown, these non-speech
entry window modes correspond to a character recognition entry
mode, a handwriting recognition entry mode, and a keyboard entry
mode.
[0223] If the user presses the Word Form button 1220 shown in FIG.
12, functions 1622 through 1624 cause the correction mode to be
entered if the SIP program is not currently in it, and cause the
word form list routine of FIG. 27 to be called for the current
first choice word. Until a user provides input to the correction
window that causes a redisplay of the correction window, the
current first choice will normally be the selection for which the
correction window has been called. This means that by selecting one
or more words in the SIP buffer and by pressing the Word Form
button in the correction window, a user can rapidly select a list
of alternate forms for any such a selection.
[0224] FIG. 27 illustrates the function of the word form list
routine. If a correction window is already displayed when it is
called, functions 2702 and 2704 treat the current best choice as
the selection for which the word form list will be displayed. If
the current selection is one word, function 2706 causes functions
2708 through 2714 to be performed. If the current selection has any
homonyms, function 2708 places them at the start of the word form
choice list. Next, step 2710 finds the root form of the selected
word, and function 2712 creates a list of alternate grammatical
forms for the word. Then function 2714 alphabetically orders all
these grammatical forms in the choice list after any homonyms,
which may have been added to the list by function 2708.
[0225] If, on the other hand, the selection is composed of multiple
words, function 2716 causes functions 2718 through functions 2728
to be performed. Function 2718 tests to see if the selection has
any spaces between its words. If so, function 2720 adds a copy of
the selection to the choice list, which has no such spaces between
its words, and function 2722 adds a copy of the selection with the
spaces replaced by hyphens. Although not shown in FIG. 27,
additional functions can be performed to replace hyphens with
spaces or with the absence of spaces. If the selection has multiple
elements subject to the same spelled/non-spelled transformation,
function 2726 adds a copy of the selection and all prior choices
with such transformations applied to the choice list. For example, this will
transform a series of number names into a numerical equivalent, or
reoccurrences of the word "period" into corresponding punctuation
marks. Next, function 2728 alphabetically orders the choice
list.
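The construction of the word form choice list by functions 2706 through 2728 can be sketched as follows. This minimal Python illustration assumes the homonyms and grammatical forms are supplied by a lookup not shown here; the function name and arguments are hypothetical.

```python
def word_form_list(selection, homonyms, grammatical_forms):
    # Sketch of functions 2706-2728. For a single word, any homonyms lead
    # the list (function 2708) and alternate grammatical forms follow in
    # alphabetical order (functions 2710-2714). For a multiword selection,
    # joined and hyphenated variants are offered instead (functions
    # 2718-2722), alphabetically ordered (function 2728).
    words = selection.split(" ")
    if len(words) == 1:
        return homonyms + sorted(grammatical_forms)
    return sorted(["".join(words), "-".join(words)])
```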
[0226] Once the choice list has been created either for a single
word or a multiword selection, function 2730 displays a correction
window showing the selection as the first choice, the filter cursor
at the start of the first choice, and a scrollable choice list. In
some embodiments where the selection is a single word whose
grammatical forms all share a single initial sequence of
characters, the filter cursor could be placed after that common
sequence, with the common sequence indicated as an unambiguous
filter string.
[0227] In some embodiments of the invention, the word form list
provides one single alphabetically ordered list of optional word
forms. In other embodiments, options can be ordered in terms of
frequency of use, or there could be a first and a second
alphabetically ordered choice list, with the first choice list
containing a set of the most commonly selected optional forms which
will fit in the correction window at one time, and the second list
containing less commonly used word forms.
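The two-list option could be sketched as follows (a hypothetical helper; `freq` stands for whatever frequency-of-use data the recognizer maintains, and `screen_size` for the number of choices that fit in the correction window at one time):

```python
def split_choice_lists(forms, freq, screen_size=6):
    """Split word forms into a first list of the most commonly used
    forms that fit on one screen and a second list of the rest, each
    alphabetically ordered."""
    by_freq = sorted(forms, key=lambda f: freq.get(f, 0), reverse=True)
    first = sorted(by_freq[:screen_size])   # most common forms, A-Z
    second = sorted(by_freq[screen_size:])  # less common forms, A-Z
    return first, second
```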
[0228] As will be demonstrated below, the word form list provides a
very rapid way of correcting a very common type of speech
recognition error, that is, an error in which the first choice is a
homonym of the desired word or is an alternate grammatical form of
it.
[0229] If the user presses the Capitalization button 1222 shown in
FIG. 12, functions 1626 through 1628 will enter the correction mode
if the system is currently not in it and will call the capitalized
cycle function for the correction window's current first choice.
The capitalized correction cycle will cause a sequence of one or
more words which do not all have initial capitalization to have
initial capitalization of each word, will cause a sequence of one
or more words which all have initial capitalization to be changed
to an all capitalized form, and will cause a sequence of one or
more words which have an all capitalized form to be changed to an
all lower case form. By repeatedly pressing the Capitalization
button, a user can rapidly select between these forms.
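The three-state cycle just described could be sketched as follows (a hypothetical helper, not code from the application):

```python
def capitalization_cycle(words):
    """Cycle a sequence of words through the capitalization states
    of the Capitalization button: not-all-initial-caps -> initial
    caps on each word -> ALL CAPS -> all lower case."""
    if all(w.isupper() for w in words):
        return [w.lower() for w in words]          # ALL CAPS -> lower
    if all(w[:1].isupper() and not w.isupper() for w in words):
        return [w.upper() for w in words]          # Initial Caps -> ALL CAPS
    return [w.capitalize() for w in words]         # otherwise -> Initial Caps
```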
[0230] If the user selects the Play button 1224 shown in FIG. 12,
functions 1630 and 1632 cause an audio playback of the first entry
in the utterance list associated with the correction window's
associated selection, if any such entry exists. This enables a user
to hear exactly what was spoken with regard to a mis-recognized
sequence of one or more words. Although not shown, the preferred
embodiments enable a user to select a setting which causes such
audio to be played automatically when a correction window is first
displayed.
[0231] If the Add Word button 1226 shown in FIG. 12 is pressed when
it is not displayed in a grayed state, functions 1634 and 1636 call
a dialog box that allows a user to enter the current first choice
word into either the active or backup vocabulary. In this
particular embodiment of the SIP recognizer, the system uses a
subset of its total vocabulary as the active vocabulary that is
available for recognition during the normal recognition using the
large vocabulary mode. Function 1636 allows a user to make a word
that is normally in the backup vocabulary part of the active
vocabulary. It also allows a word that is in neither vocabulary,
but which has been spelled in the first choice window by use of
alphabetic input, to be added to either the active or backup
vocabulary. It should be appreciated that in other
embodiments of the invention having greater hardware resources,
there would be no need for distinction between an active and a
backup vocabulary.
[0232] The Add Word button 1226 will only be in a non-grayed state
when the first choice word is not currently in the active
vocabulary. This provides an indication to the user that he or she
may want to add the first choice to either the active or backup
vocabulary.
[0233] If the user selects the Check button 1228 shown in FIG. 12,
functions 1638 through 1648 remove the current correction window
and output its first choice to the SIP buffer and feed to the
operating system a sequence of keystrokes necessary to make a
corresponding change to text in the application window.
[0234] If the user taps one of the choices 1230 shown in the
correction window of FIG. 12, functions 1650 through 1653 remove
the current correction window, and output the selected choice to
the SIP buffer and feed the operating system a sequence of
keystrokes necessary to make the corresponding change in the
application window.
[0235] If the user taps on one of the Choice Edit buttons 1232
shown in FIG. 12, function 1654 causes functions 1656 through 1658
to be performed. Function 1656 changes to correction mode if the
system is not already in it, and makes the choice associated with
the tapped Choice Edit button both the first choice and the current
filter string. Then function 1658 calls the display Choice List
routine with the new filter string. As
will be described below, this enables a user to select a choice
word or sequence of words as the current filter string and then to
edit that filter string, normally by deleting any characters from
its end which disagree with the desired word.
[0236] If the user drags across one or more initial characters of
any choice, including the first choice, functions 1664 through 1666
change the system to correction mode if it is not in it, and call
the display Choice List with the dragged choice added to the choice
list and with the dragged initial portion of the choice as the
filter string. These functions allow a user to indicate that a
current choice is not the desired first choice but that the dragged
initial portion of it should be used as a filter to help find the
desired choice.
[0237] FIG. 17 provides the final continuation of the list of
functions which the SIP recognizer performs in response to
correction window input.
[0238] If the user drags across the ending of a choice, including
the first choice, functions 1702 and 1704 enter the correction mode
if the system is currently not already in it, and call display
Choice List with the partially dragged choice added to the not
Choice List and with the undragged initial portion of the choice as
the filter string.
[0239] If the user drags across two choices in the choice list,
functions 1706 through 1708 enter the correction mode if the system
is not currently in it, and call display Choice List with the two
choices added to the not Choice List and with the two choices as
the beginning and ending words in the definition of the current
filter range.
[0240] If the user taps between characters on the first choice,
functions 1710 through 1712 enter the correction mode if the SIP is
not already in it, and move the filter cursor to the tapped
location. No call is made to display Choice List at this time
because the user has not yet made any change to the filter.
[0241] If the user enters a backspace by pressing the Backspace
button 1116 when in correction mode, as described above with regard
to function 1334 of FIG. 13, function 1714 causes functions 1718
through 1720 to be performed. Function 1718 calls the filter edit
routine of FIGS. 28 and 29 when a backspace is input.
[0242] As will be illustrated with regard to FIG. 28, the filter
edit routine 2800 is designed to give the user flexibility in the
editing of a filter with a combination of unambiguous, ambiguous
fixed length, and/or ambiguous length filter elements.
[0243] This routine includes a function 2802, which tests to see
whether there are any characters in the choice with which the
routine has been called before the current location of the filter
cursor. If so, function 2804 defines the filter string with which
the routine has been called as the old filter string, and function
2806 makes the characters of that choice which precede the filter
cursor the new filter string, with all the characters in that
string unambiguously defined. This enables any part of a first
choice that precedes the location of an edit to be automatically
confirmed as correct filter characters.
[0244] Next, function 2807 tests to see if the input with which
the filter edit routine has been called is a backspace. If so, it causes
functions 2808 through 2812 to be performed. Functions 2808 and
2810 delete the last character of the new filter string if the
filter cursor is a non-selection cursor. If the filter cursor
corresponds to a selection of one or more characters in the current
first choice, those characters have already been excluded from the
new filter by the operation of function 2806 just described.
Then function 2812 clears the old filter string because when the
input to the filter edit is a backspace it is assumed that no
portions of the prior filter to the right of the location of the
backspace are intended for future inclusion in the filter. This
deletes any ambiguous as well as unambiguous elements in the filter
string which might have been previously to the right of the
location of the filter cursor.
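The backspace behavior described for functions 2806 through 2812 might be sketched as follows (function and parameter names are hypothetical):

```python
def backspace_filter_edit(choice, cursor, is_selection):
    """Sketch of the backspace branch of the filter edit routine:
    the characters before the filter cursor become an unambiguous
    new filter; a backspace then deletes the character just before
    a plain (non-selection) cursor, and everything to the right of
    the cursor is discarded by clearing the old filter string."""
    new_filter = choice[:cursor]      # confirmed by the edit location
    if not is_selection:              # plain cursor: delete one character
        new_filter = new_filter[:-1]
    old_filter = ""                   # backspace clears the remainder
    return new_filter, old_filter
```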
[0245] If the input with which the filter edit routine is called is
one or more unambiguous characters, functions 2814 and 2816 add the
one or more unambiguous characters to the end of the new filter
string.
[0246] If the input to the filter edit routine is a sequence of one
or more ambiguous characters of fixed length, functions 2818 and
2820 place an element representing each ambiguous
character in the sequence at the end of the new filter.
[0247] If the input to the filter edit routine is an ambiguous
length element, function 2822 causes functions 2824 through 2832 to
be performed. Function 2824 selects the best-scoring sequence of
letters associated with the ambiguous input, which, if added to the
prior unambiguous part of the filter, would correspond to all or an
initial part of a vocabulary word. It should be remembered that
when this function is performed, all of the prior portions of the
new filter string will have been confirmed by the operation of
function 2806, described above. Next, function 2826 tests to see if
any of the sequences selected by function 2824 scores above a
certain minimum score. If none does, function 2828 selects the
best-scoring letter sequences independent of vocabulary. This is
done because if the condition of the test in function 2826 is not
met, it indicates that the ambiguous filter is being used to spell
a word that is not in the vocabulary. Next, functions 2830 and 2832 associate the
character sequences selected by the operation of functions 2824
through function 2828 with a new ambiguous filter element, and they
add that new ambiguous filter element to the end of the new filter
string.
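Functions 2824 through 2832 might be sketched as follows, assuming the vocabulary-independent fallback applies when no vocabulary-consistent sequence scores above the minimum (all names and the scoring data shape are hypothetical):

```python
def ambiguous_element_sequences(scored_seqs, prefix, vocab, min_score=0.01):
    """Keep the best-scoring letter sequences that, appended to the
    confirmed filter prefix, begin some vocabulary word; if none
    scores above min_score, fall back to the best sequences
    regardless of vocabulary (the user is likely spelling an
    out-of-vocabulary word). `scored_seqs` maps each candidate
    letter sequence to its recognition score."""
    in_vocab = {
        seq: score for seq, score in scored_seqs.items()
        if any(w.startswith(prefix + seq) for w in vocab)
    }
    if max(in_vocab.values(), default=0.0) >= min_score:
        chosen = in_vocab
    else:
        chosen = scored_seqs  # vocabulary-independent fallback
    # Best-scoring first; the result becomes one ambiguous filter element.
    return sorted(chosen, key=chosen.get, reverse=True)
```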
[0248] Next, a loop 2834 is performed for each filter element in
the old filter string. This loop comprises the functions 2836
through 2850 shown in the remainder of FIG. 28 and the functions
2900 through 2922 shown in FIG. 29.
[0249] If the current old filter string element of the loop 2834 is
an ambiguous, fixed length element that extends beyond a new fixed
length element which has been added to the new filter string by
functions 2814 through 2820, functions 2836 and 2838 add the old
element to the end of the new filter string. This is done because
editing of a filter string
other than by use of the Backspace button does not delete
previously entered filter information that corresponds to part of
the prior filter to the right of the new edit.
[0250] If the current old element of the loop 2834 is an ambiguous,
fixed length element that extends beyond some sequences in a new
ambiguous length element that has been added to the end of the new
filter string by operation of functions 2822 through 2832, function
2840 causes functions 2842 through 2850 to be performed. Function
2842 performs a loop for each character sequence represented by the
new ambiguous length element that has been added to the filter
string. The loop performed for each such character sequence of the
new ambiguous length element includes a loop 2844 performed for
each character sequence which agrees with the current old ambiguous
fixed length element of the loop 2834. This inner loop 2844
includes a function 2846, which tests to see if the old element
matches and extends beyond the current sequence in the new element.
If so, function 2848 adds to the list of character sequences
represented by the new ambiguous length element a new sequence of
characters corresponding to the current sequence from the new
element plus the portion of the sequence from the old element that
extends beyond that current sequence from the new element.
[0251] If the current old element is an ambiguous length element
that contains any character sequences that extend beyond a new
fixed length element that has been added to the new filter,
function 2900 of FIG. 29 causes functions 2902 through 2910 to be
performed.
[0252] Function 2902 is a loop which is performed for each sequence
represented by the old ambiguous length element. It is composed of
a test 2904 that checks to see if the current sequence from the old
element matches and extends beyond the new fixed length element. If
so, function 2906 creates a new character sequence corresponding to
that extension from the old element that extends beyond the new.
After this loop has been completed, a function 2908 tests to see if
any new sequences have been created by function 2906, and if so, it
causes function 2910 to add a new ambiguous length element to the
end of the new filter, after the new fixed length element. This
new ambiguous length element represents the possibility of each of
the sequences created by function 2906. Preferably a probability
score is associated with each such new sequence based on the
relative probability scores of each of the character sequences
which were found by the loop 2902 to match the current new fixed
length element.
[0253] If the current old element is an ambiguous length element
that has some character sequences that extend beyond some character
sequences in a new ambiguous length element, function 2912 causes
functions 2914 through 2920 to be performed. Function 2914 is a
loop that is performed for each character sequence in the new
ambiguous length element. It is composed of an inner loop 2916
which is performed for each character sequence in the old ambiguous
length element. This inner loop is composed of functions 2918 and
2920, which test to see if the character sequence from the old
element matches and extends beyond the current character sequence
from the new element. If so, they associate with the new ambiguous
length element a new character sequence corresponding to the
current sequence from the new element plus the extension from the
current old element's character sequence.
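The extension logic shared by the merge loops of FIGS. 28 and 29 can be sketched as follows (a simplified, hypothetical helper operating on plain character strings rather than full filter elements):

```python
def extend_sequences(new_seqs, old_seqs):
    """Whenever an old filter sequence matches and extends beyond a
    sequence in the new element, the new ambiguous element also
    represents the new sequence plus the old extension."""
    merged = list(new_seqs)
    for new in new_seqs:
        for old in old_seqs:
            if old.startswith(new) and len(old) > len(new):
                # new sequence plus the portion of old beyond it
                merged.append(new + old[len(new):])
    return merged
```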
[0254] Once all the functions in the loop 2834 are completed,
function 2924 returns from the call to filter edit with the new
filter string which has been created by that call.
[0255] It should be appreciated that in many embodiments of various
aspects of the invention a different, and often simpler,
filter-editing scheme can be used. But it should be appreciated
that one of the major advantages of the filter edit scheme shown in
FIGS. 28 and 29 is that it enables one to enter an ambiguous filter
quickly, such as by continuous letter recognition, and then to
subsequently edit it by more reliable alphabetic entry modes, or
even by subsequent continuous letter recognition. For example, this
scheme would allow a filter entered by the continuous letter
recognition to be all or partially replaced by input from discrete
letter recognition, ICA word recognition, or even handwriting
recognition. Under this scheme, when a user edits an earlier part
of the filter string, the information contained in the latter part
of the filter string is not destroyed unless the user indicates
such an intent, which in the embodiment shown is by use of the
backspace character.
[0256] Returning now to FIG. 17, when the call to filter edit in
function 1718 returns, function 1720 calls display Choice List for
the selection with the new filter string that has been returned by
the call to filter edit.
[0257] Whenever filtering input is received, either by the results
of recognition performed in response to the pressing of the filter
key described above with regard to function 1612 of FIG. 16, or by
any other means, functions 1722 through 1738 are performed.
[0258] Function 1724 tests to see if the system is in one-at-a-time
recognition mode and if the filter input has been produced by
speech recognition. If so, it causes functions 1726 to 1730 to be
performed. Function 1726 tests to see if a filter character choice
window, such as window 3906 shown in FIG. 39, is currently
displayed. If so, function 1728 closes that filter choice window
and function 1730 calls filter edit with the first choice filter
character as input. This causes all previous characters in the
filter string to be treated as an unambiguously defined filter
sequence. Regardless of the outcome of the test of function 1726, a
function 1732 calls filter edit for the new filter input which is
causing operation of function 1722 and the functions listed below
it. Then, function 1734 calls display Choice List for the current
selection and the new filter string. Then, if the system is in
one-at-a-time mode, functions 1736 and 1738 call the filter
character choice routine with the filter string returned by filter
edit and with the newly recognized filter input character as the
selected filter character.
[0259] FIG. 30 illustrates the operation of the filter character
choice subroutine 3000. It includes a function 3002 which tests to
see if the selected filter character with which the routine has
been called corresponds to either an ambiguous character or an
unambiguous character in the current filter string having multiple
best choice characters associated with it. If this is the case,
function 3004 sets a filter character choice list equal to all
characters associated with that character. If the number of
characters is more than will fit on the filter character choice
list at one time, the choice list can have scrolling buttons to
enable the user to see such additional characters. Preferably the
choices are displayed in alphabetical order to make it easier for
the user to rapidly scan for a desired character. The filter
character choice routine of FIG. 30 also includes a function 3006
which tests to see if the selected filter character corresponds to
a character of an ambiguous length filter string element in the
current filter string. If so, it causes functions 3008 through 3014
to be performed. Function 3008 tests to see if the selected filter
character is the first character of the ambiguous length element.
If so, function 3010 sets the filter character choice list equal to
all the first characters in any of the ambiguous element's
associated character sequences. If the selected filter character
does not correspond to the first character of the ambiguous length
element, functions 3012 and 3014 set the filter character choice
list equal to all characters in any character sequences represented
by the ambiguous element that are preceded by the same characters
as in the selected filter character in the current first choice.
Once either functions 3002 and 3004 or functions 3006 through 3014
have created a filter character choice list, function 3016 displays
that choice list in a window, such as the window 3906 shown in FIG.
39.
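The construction of such a filter character choice list might be sketched as follows (a hypothetical helper; `element_seqs` stands for the character sequences represented by an ambiguous length element, and `prefix` for the first choice's characters preceding the selected position):

```python
def filter_character_choices(element_seqs, position, prefix):
    """Collect, in alphabetical order, every character that can
    occur at `position` in the element's character sequences, given
    that a sequence must agree with the first choice's preceding
    characters (`prefix`) before that position."""
    chars = {
        seq[position] for seq in element_seqs
        if len(seq) > position and seq[:position] == prefix
    }
    return sorted(chars)
```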
[0260] If the SIP program receives a selection by a user of a
filter character choice in a filter character choice window,
function 1740 causes functions 1742 through 1746 to be performed.
Function 1742 closes the filter choice window in which such a
selection has been made. Function 1744 calls the filter edit
function for the current filter string with the character that has
been selected in the filter choice window as the new input. Then
function 1746 calls the display Choice List routine with the new
filter string returned by filter edit.
[0261] If the user drags upward from a character in a filter
string, as shown in the correction windows 4526 and 4538 of FIG.
45, function 1747 causes functions 1748 through 1750 to be performed.
Function 1748 calls the filter character choice routine for the
character which has been dragged upon, which causes a filter
character choice window to be generated for it if there are any
other character choices associated with that character. If the drag
is released over a filter choice character in this window, function
1749 generates a selection of the filter character choice over
which the release takes place. Thus it causes the operation of the
functions 1740 through 1746 which have just been described. If the
drag is released other than on a choice in the filter character
choice window, function 1750 closes the filter choice window.
[0262] If a re-utterance is received other than by pressing of the
Re-utterance button, as described above with regard to functions
1602 and 1610, such as by pressing the Large Vocabulary button or
the Name Vocabulary button during correction mode, as described
above with regard to functions 1350, 1356 and 1414 and 1416 of
FIGS. 13 and 14, respectively, function 1752 of FIG. 17 causes
functions 1754 and 1756 to be performed. Function 1754 adds any
such new utterance to the correction window's selection's utterance
list, and function 1756 calls the display Choice List routine for
the selection so as to perform re-recognition using the new
utterance.
[0263] Turning now to FIGS. 31 through 41, we will provide an
illustration of how the user interface which has just been
described can be used to dictate a sequence of text. In this
particular sequence, the interface is illustrated as being in the
one-at-a-time mode, which is a discrete recognition mode that
causes a correction window with a choice list to be displayed every
time a discrete utterance is recognized.
[0264] In FIG. 31, numeral 3100 points to the screenshot of the PDA
screen showing the user tapping the Talk button 1102 to commence
dictation starting in a new linguistic context. As indicated by the
highlighting of the Large Vocabulary button 1132, the SIP
recognizer is in the large vocabulary mode. The sequence of
separated dots on the Continuous/Discrete button 1134 indicates
that the recognizer is in a discrete recognition mode. It is
assumed the SIP is in the Press And Click To End Of Utterance
Recognition duration mode described with regard to numerals 1810 to
1816 of FIG. 18. As a result, the click of the Talk button causes
recognition to take place until the end of the next utterance.
Numeral 3102 represents an utterance by the user of the word
"this". Numeral 3104 points to an image of the screen of the PDA
after a response to this utterance by placing the recognized text
3106 in the SIP text window 1104, outputting this text to the
application window 1106, and by displaying a correction window 1200
which includes the recognized word in the first choice window 1202
and a first choice list 1208.
[0265] In the example of FIG. 31, the user taps the Capitalization
button 1222 as pointed to by the numeral 3108. This causes the PDA
screen to have the appearance pointed to by 3110 in which the
current first choice and the text output in the SIP buffer and the
application window is changed to having initial capitalization.
[0266] In the example the user clicks the Continue button 1104 as
pointed to by numeral 3112 and then utters the word "is" as pointed
to by numeral 3114. In the example, it is assumed this utterance is
mis-recognized as the word "its" causing the PDA screen to have the
appearance pointed to by numeral 3116, in which a new correction
window 1200 is displayed having the mis-recognized word as its
first choice 3118 and a new choice list 1208 for that
recognition.
[0267] FIG. 32 represents a continuation of this example, in which
the user clicks the choice word "is" 3200 in the image pointed to
by numeral 3202. This causes the PDA screen to have the appearance
indicated by the numeral 3204 in which the correction window has
been removed, and corrected text appears in both the SIP buffer
window and the application window.
[0268] In the screenshot pointed to by numeral 3206 the user is
shown tapping the letter name vocabulary button 1130, which changes
the current recognition mode to the letter name vocabulary as is
indicated by the highlighting of the button 1130. As is indicated
above with regard to functions 1410 and 1412, the tapping of this
button commences speech recognition according to the current
recognition duration mode. This causes the system to recognize the
subsequent utterance of the letter name "e" as pointed to by
numeral 3208.
[0269] In order to emphasize the ability of the present interface
to quickly correct recognition mistakes, the example assumes that
the system mis-recognizes this letter as the letter "p" 3211, as
indicated by the correction window that is displayed in
one-at-a-time mode in response to the utterance 3208. As can be
seen in the correction window pointed to by 3210, the correct
letter "e" is, however, one of the choices shown in the correction
window. In the view of the correction window pointed to by numeral
3214, the user taps on the choice 3212, which causes the PDA screen
to have the appearance pointed to by numeral 3216 in which the
correct letter is entered both in the SIP buffer and the
application window.
[0270] FIG. 33 illustrates a continuation of this example, in which
the user taps on the Punctuation Vocabulary button 1124. This
changes the recognition vocabulary to the punctuation vocabulary,
as indicated by the highlighting 3302, and starts utterance
recognition. The user's utterance of the word "period", pointed to
by the numeral 3300, gives rise to the correction window pointed to
by 3304, in which the punctuation mark "." is shown in the first
choice window followed by that punctuation mark's name to make it
easier for the user to recognize.
[0271] Since, in the example, this is the correct recognition, the
user confirms it and starts recognition of a new utterance using
the letter name vocabulary by pressing the button 1130, as shown in
the screenshot pointed to by numeral 3306, and saying the utterance 3308 of the
letter "l." This process of entering letters followed by periods is
repeated until the PDA screen has the appearance shown by numeral
3312. At this point it is assumed the user drags across the text
"e.l.v.i.s." as shown in the screenshot 3314 which causes that text
to be selected and which causes the correction window 1200 in the
screenshot 3400 near the upper left-hand corner of FIG. 34 to be
displayed. Since it is assumed that the selected text string is not
in the current vocabulary, there are no alternate choices displayed
in this choice list. In the view of the correction window pointed
to by 3402, the user taps the Word Form button 1220, which calls
the word form list routine described above with regard to FIG. 27.
Since the selected text string includes spaces, it is treated as a
multiple-word selection causing the portion of the routine shown in
FIG. 27 illustrated by functions 2716 through 2728 to be performed.
This produces a choice list, such as that pointed to by 3404,
including a choice 3406 in which the spaces have been removed from
the correction window's selection. In the example, the user taps
the Edit button 1232 next to the choice 3406, as indicated in the
view of the correction window pointed to by numeral 3410. This
causes the choice 3406 to be selected as the first choice, as shown
in the view of the correction window pointed to by 3412.
The user taps on the Capitalization button 1222 until the first
choice becomes all capitalized at which point the correction window
has the appearance indicated in the screenshot 3414. At this point
the user clicks on the Punctuation Vocabulary button 1124 as
pointed to by 3416 and says the utterance "comma" pointed to by
3418. In the example it is assumed that this utterance is correctly
recognized causing a correction window 1200 pointed to by the
numeral 3420 to be displayed and the former first choice
"e.l.v.i.s." to be outputted as text.
[0272] FIG. 35 is a continuation of this example. In it, it is
assumed that the user clicks the Large Vocabulary button as
indicated by numeral 3500, and then says the utterance "the" 3502.
This causes the correction window 3504 to be displayed. The user
responds by confirming this recognition by again pressing the large
vocabulary button as indicated by 3506 and saying the utterance
"embedded" pointed to by 3508. In the example, this causes the
correction window 3510 to be displayed in which the utterance has
been mis-recognized as the word "imbedded" and in which the desired
word is not shown on the first choice list. Starting at this point,
as is indicated by the comment 3512, a plurality of different
correction options will be illustrated.
[0273] FIG. 36 illustrates the correction option of scrolling
through the first and second choice list associated with the
mis-recognition. In the view of the correction window pointed to by
3604, the user is shown tapping the page down scroll button 3600 in
the scroll bar 3602 of the correction window, which causes the
first choice list 3603 to be replaced by the first screenful of the
second choice list 3605 as indicated in the view of the correction
window 3606. As can be seen in this view, the slide bar 3608 of the
correction window has moved down below a horizontal bar 3609, which
defines the position in the scroll bar associated with the end of
the first choice list. In the example, the desired word is not in
the portion of the alphabetically ordered second choice list shown
in view 3606, and thus the user presses the Page Down button of the
scroll bar as indicated by 3610. This causes the correction window
to have the appearance shown in view 3612 in which a new screenful
of alphabetically listed choices is shown. In the example, the
desired word "embedded" is shown on this choice list as is
indicated by the 3616. In the example, the user clicks on this
choice button 3619 associated with this desired choice as shown in
the view of the correction window pointed to by 3618. This causes
the correction window to have the view pointed to by 3620 in which
this choice is displayed in the first choice window. In the
example, the user taps the Capitalized button as pointed to by
numeral 3622 which causes this first choice to have initial
capitalization as shown in the screenshot 3624.
[0274] Thus it can be seen that the SIP user interface provides a
rapid way to allow a user to select from among a relatively large
number of recognition choices. In the embodiment shown, the first
choice list is composed of up to six choices, and the second choice
list can include up to three additional screens of up to 18
additional choices. Since the choices are arranged alphabetically
and since all four screens can be viewed in less than a second,
this enables the user to select from among up to 24 choices very
quickly.
[0275] FIG. 37 illustrates the method of filtering choices by
dragging across an initial part of a choice, as has been described
above with regard to functions 1664 through 1666 of FIG. 16. In the
example of this figure, it is assumed that the first choice list
includes a choice 3702 shown in the view of the correction window
pointed to by 3700, which includes the first six characters of the
desired word "embedded". As is illustrated in the correction window
3704, the user drags across these initial six letters and the
system responds by displaying a new correction window limited to
recognition candidates that start with an unambiguous filter
corresponding to the six characters, as is displayed in the
screenshot 3706. In this screenshot the desired word is the first
choice and the first six unambiguously confirmed letters of the
first choice are shown highlighted as indicated by the box 3708,
and the filter cursor 3710 is also illustrated.
[0276] FIG. 38 illustrates the method of filtering choices by
dragging across two choices in the choice list that has been
described above with regard to functions 1706 through 1708 of FIG.
17. In this example, the correction window 3800 displays the
desired choice "embedded" as it occurs alphabetically between the
two displayed numeral 3802 and 3804. As shown in the view 3806, the
user indicates that the desired word falls in this range of the
alphabet by dragging across these two choices. This causes a new
correction window to be displayed in which the possible choices are
limited to words which occur in the selected range of the alphabet,
as indicated by the screenshot 3808. In this example, it is assumed
that the desired word is selected as the first choice as a result
of the filtering caused by the selection shown in 3806. In this
screenshot the portion of the first choice which forms an initial
portion of the two choices selected in the view 3806 is indicated
as the unambiguously confirmed portion of the filter string 3810,
and the filter cursor 3812 is placed after that confirmed filter
portion.
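The range-filtering behavior described above can be sketched as follows; the function names are illustrative, and treating the range as inclusive is an assumption:

```python
def filter_by_range(candidates, choice_a, choice_b):
    """Keep candidates that fall alphabetically between the two
    dragged-across choices, inclusive."""
    lo, hi = sorted((choice_a, choice_b))
    return sorted(w for w in candidates if lo <= w <= hi)

def confirmed_prefix(choice_a, choice_b):
    """The initial characters shared by both selected choices become
    the unambiguously confirmed portion of the new filter string."""
    n = 0
    while n < min(len(choice_a), len(choice_b)) and choice_a[n] == choice_b[n]:
        n += 1
    return choice_a[:n]
```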
[0277] FIG. 39 illustrates a method in which alphabetic filtering
is used in one-at-a-time mode to help select the desired word
choice. In this example, the user presses the Filter button as
indicated in the correction window view 3900. It is assumed that
the default filter vocabulary is the letter name vocabulary.
Pressing the Filter button starts speech recognition for the next
utterance and the user says the letter "e" as indicated by 3902.
This causes the correction window 3904 to be shown, in which it is
assumed that the filter character has been mis-recognized as the
letter "p." In the embodiment shown, in one-at-a-time mode, alphabetic
input also has a choice list displayed for its recognition. In this
case, it is a filter character choice list window 3906 of the type
described above with regard to the filter character choice
subroutine of FIG. 30. In the example, the user selects the desired
filtering character, the letter "e," as shown in the view 3908,
which causes a new correction window 3910 to be displayed. In the
example, the user decides to enter an additional filtering letter
by again pressing the Filter button as shown in the view 3912, and
then says the utterance "m" 3914. This causes the correction window
3916 to be displayed, which displays the filter character choice
window 3918. In this correction window, the filtering character has
been correctly recognized, and the user can confirm it either by
speaking an additional filtering character or by selecting the
correct letter, as is shown in the window 3916. This confirmation of
the desired filtering character causes a new correction window to
be displayed with the filter string "em" treated as an
unambiguously confirmed filter string. In the example shown in
screenshot 3920, this causes the desired word to be recognized.
[0278] FIG. 40 illustrates a method of alphabetic filtering with
AlphaBravo, or ICA word, alphabetic spelling. In the screenshot
4000, the user taps on the AlphaBravo button 1128. This changes the
alphabet to the ICA word alphabet, as described above by functions
1402 through 1408 of FIG. 14. In this example, it is assumed that
the Display_Alpha_On_Double_Click variable has not been set. Thus
the function 1406 of FIG. 14 will display the list of ICA words
4002 shown in the screenshot 4004 during the press of the
AlphaBravo button 1128. In the example, the user enters the ICA
word "echo," which represents the letter "e," followed by a second
press of the AlphaBravo key, as shown at 4008, and the utterance of
a second ICA word, "Mike," which represents the letter "m". In the
example, the inputting of these two alphabetic filtering characters
successfully creates an unambiguous filter string composed of the
desired letters "em" and produces recognition of the desired word,
"embedded".
[0279] FIG. 41 illustrates a method in which the user selects part
of a choice as a filter and then uses AlphaBravo spelling to
complete the selection of a word which is not in the system's
vocabulary, in this case the made-up word "embedded".
[0280] In this example, the user is presented with the correction
window 4100, which includes one choice 4102, and which includes the
first six letters of the desired word. As shown in the correction
window 4104, the user drags across these first six letters causing
those letters to be unambiguously confirmed characters of the
current filter string. This results in a correction window 4106.
The screenshot 4108 shows the display of this correction window in
which the user drags from the filter button 1218 and releases on
the Discrete/Continuous button 1134, changing it from the discrete
filter dictation mode to the continuous filter dictation mode, as
is indicated by the continuous line on that button shown in the
screenshot 4108. In screenshot 4110, the user presses the
AlphaBravo button again and says an utterance containing the following ICA
words "Echo, Delta, Echo, Sierra, Tango". This causes the current
filter string to correspond to the spelling of the desired word.
Since there are no words in the vocabulary matching this filter
string, the filter string itself becomes the first choice as is
shown in the correction window 4114. In the view of this window
shown at 4116, the user taps on the check button to indicate
selection of the first choice, causing the PDA screen to have the
appearance shown at 4118.
[0281] FIGS. 42 through 44 demonstrate the dictation, recognition,
and correction of continuous speech. In the screenshot 4200 the
user clicks the Clear button 1112 described above with regard to
functions 1310 through 1314 of FIG. 13. This causes the text in the
SIP buffer 1104 to be cleared without causing any associated change
with the corresponding text in the application window 1106, as is
indicated by the screenshot 4204. In the screenshot 4204 the user
clicks the Continuous/Discrete button 1134, which causes it to
change from discrete recognition, indicated on the button by a
sequence of dots in the screenshot 4204, to continuous recognition,
indicated by a continuous line in the screenshot 4208. This starts
speech recognition according
to the current recognition duration mode, and the user says a
continuous utterance of the following words "large vocabulary
interface system from voice signal technologies period", as
indicated by numeral 4206. The system responds by recognizing this
utterance and placing the recognized text in the SIP buffer 1104 and
through the operating system to the application window 1106, as
shown in the screenshot 4208. Because the recognized text is
slightly more than fits within the SIP window at one time, the user
scrolls in the SIP window as shown at numeral 4210 and then taps on
the word "vocabularies" 4214, to cause functions 1436 through 1438
of FIG. 14 to select that word and generate a correction window for
it. In response the correction window 4216 is displayed. In the
example the desired word "vocabulary" 4218 is on the choice list of
this correction window, and in the view of the correction window
4220 the user taps on this word to select it, which replaces
the word "vocabularies" in both the SIP buffer and the
application window with the selected word.
[0282] Continuing now in FIG. 43, this correction is shown by the
screenshot 4300. In the example, the user selects the four mistaken
words "enter faces men rum" by dragging across them as indicated in
view 4302. This causes functions 1502 and 1504 to display a choice
window with the dragged words as the selection, as is indicated by
the view 4304.
[0283] FIG. 44 illustrates how the correction window shown at the
bottom of FIG. 43 can be corrected by a combination of horizontal
and vertical scrolling of the correction window and choices that
are displayed in it. Numeral 4400 points to a view of the same
correction window shown at 4304 in FIG. 43. In this view, not only
a vertical scroll bar 4602 but also a horizontal scroll bar 4402 is
displayed. The user is shown tapping the page
down button 3006 in the vertical scroll bar which causes the
portion of the choice list displayed to move from the display of
the one page alphabetically ordered first choice list shown in the
view 4400 to the first page of the second alphabetically ordered
choice list shown in the view 4404. In the example none of the
recognition candidates in this portion of the second choice list
start with a character sequence matching the desired recognition
output, which is "interface system from." Thus the user again taps
the page down scroll button 3600 as is indicated by numeral 4408.
This causes the correction window to have the appearance shown at
4410 in which two of the displayed choices 4412 start with a
character sequence matching the desired recognition output. In
order to see if the ending of these recognition candidates matched
the desired output the user scrolls a like word on the horizontal
scroll bar 4402 as shown at 4414. This allows the user to see that
the choice 4418 matches the desired output. As is shown at is 4420,
the user taps on this choice and causes it to be inserted into the
dictated text both in the SIP window 1104 and in the application
window 1106 as is shown in the screenshot 4422.
[0284] FIG. 45 illustrates how the use of an ambiguous filter
created by the recognition of continuously spoken letter names and
edited by filter character choice windows can be used to rapidly
correct an erroneous dictation. In this example, the user presses
the talk button 1102 as shown at 4500 and then utters the word
"trouble" as indicated at 4502. In the example it is assumed that
this utterance is mis-recognized as the word "treble" as indicated
at 4504. In the example, the user taps on the word "treble" as
indicated 4506, which causes the correction window shown at 4508 to
be shown. Since the desired word is not shown as any of the choices
the user taps the filter button 1218 as shown at 4510 and makes a
continuous utterance 4512 containing the names of each of the
letters in the desired word "trouble." In this example it is
assumed that the filter recognition mode is set to include
continuous letter name recognition.
[0285] In the example the system responds to recognition of the
utterance 4512 by displaying the choice list 4518. In this example
it is assumed that the result of the recognition of this utterance
is to cause a filter string to be created that is composed of one
ambiguous-length element. As has been described above with regard
to functions 2644 through 2652, an ambiguous-length filter element
allows any recognition candidate whose corresponding initial
characters match one of the character sequences represented by that
ambiguous element.
correction window 4518 the portion of the first choice word 4519
that corresponds to an ambiguous filter element is indicated by the
ambiguous filter indicator 4520. Since the filter uses an ambiguous
element, the choice list displayed contains best-scoring
recognition candidates that start with different initial character
sequences, including ones whose length is less than the portion of
the first choice that corresponds to a matching character sequence
represented by the ambiguous element.
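One way to model the matching rule for an ambiguous-length filter element is sketched below. The two-way prefix test is an interpretation of the rule described above, and the names are hypothetical:

```python
def matches_ambiguous_element(candidate, represented):
    """A candidate passes an ambiguous-length filter element if its
    initial characters agree with one of the character sequences the
    element represents: either the candidate starts with the sequence
    or, for a shorter candidate, the sequence starts with the candidate."""
    return any(candidate.startswith(seq) or seq.startswith(candidate)
               for seq in represented)
```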
[0286] In the example, the user drags upward from the first
character of the first choice, which causes operation of functions
1747 through 1750 described above with regard to FIG. 17. This
causes a filter choice window 4526 to be displayed. As shown in the
correction window 4524, the user drags up to the initial desired
character the letter "t," and releases the drag at that location
which causes functions 1749 and 1740 through 1746 to be performed.
These functions close the filter choice window, add the selected
character to the filter as an unambiguous correction to the prior
ambiguous filter element, and cause a new correction window to be
displayed with the new filter, as is indicated at 4528. As is shown
in this correction window the first choice 4530 is shown with an
unambiguous filter indicator 4532 for its first letter "t" and an
ambiguous filter indicator 4534 for its remaining characters. Next,
as is shown in the view of the same correction window shown at
4536, the user drags upward from the fifth letter "p" of the new
first choice, which causes a new correction window 4538 to be
displayed. When the user releases this drag on the character "p,"
that character and all the characters that precede it in the first
choice are defined unambiguously in the current filter string. This
is indicated in the new correction window 4540, shown as a result
of the selection, in which the first choice 4542 is the desired
word, the unambiguous portion of the filter is indicated by the
unambiguous filter indicator 4544, and the remaining portion of the
ambiguous filter element stays in the filter string by operation of
functions 2900 through 2910, as shown in FIG. 29.
[0287] FIG. 46 illustrates that the SIP recognizer allows the user
to also input text and filtering information by use of a character
recognizer similar to the character recognizer that comes standard
with the Windows CE operating system.
[0288] As shown in the screenshot 4600 of this figure, if the user
drags up from the Function key, functions 1428 and 1430 of FIG. 14
will display a function menu 4602, and if the user releases on the
menu's character recognition entry 4604, the character recognition
mode described in FIG. 47 will be turned on.
[0289] As shown in FIG. 47, this causes function 4702 to display
the character recognition window 4608, shown in FIG. 46, and then
to enter an input loop 4704 which is repeated until the user
selects to exit the window by selecting another input option on the
function menu 4602. When in this loop, if the user touches the
character recognition window, function 4706 records "ink" during
the continuation of such a touch, recording the motion, if any, of
the touch across the surface of the portion of the display touch
screen corresponding to the character recognition window. If the
user releases a touch in this window, functions 4708 through 4714
are performed. Function 4710 performs character recognition on
the "ink" currently in the window. Function 4712 clears the
character recognition window, as indicated by the numeral 4610 in
FIG. 46. And function 4714 supplies the corresponding recognized
character to the SIP buffer and the operating system.
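The loop of FIG. 47 can be sketched as an event-handler class. The callbacks and names below are illustrative stand-ins for the routines the figure describes:

```python
class CharRecognizerWindow:
    """Sketch of the FIG. 47 input loop. `recognize` stands in for the
    actual character recognizer; `emit` delivers the result to the SIP
    buffer and operating system. Both are hypothetical callables."""

    def __init__(self, recognize, emit):
        self.recognize = recognize
        self.emit = emit
        self.ink = []          # recorded touch motion ("ink")

    def on_touch_move(self, point):
        # Record "ink" for as long as the touch continues.
        self.ink.append(point)

    def on_touch_release(self):
        # On release: recognize the ink, clear the window, emit the char.
        if self.ink:
            char = self.recognize(self.ink)
            self.ink = []
            self.emit(char)
```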
[0290] FIG. 48 illustrates that if the user selects the handwriting
recognition option in the function menu shown in the screenshot
4600, a handwriting recognition entry window 4800 will be displayed
in association with the SIP as is shown in screenshot 4802.
[0291] The operation of the handwriting mode is provided in FIG.
49. When this mode is entered function 4902 displays the
handwriting recognition window, and then a loop 4903 is entered
until the user selects to use another input option. In this loop,
if the user touches the handwriting recognition window in any place
other than the delete button 4804 shown in FIG. 48, the motion if
any during the touch is recorded as "ink" by function 4904. If the
user touches down in the "REC" button area 4806 shown in FIG. 48,
function 4905 causes functions 4906 through 4910 to be performed.
Function 4906 performs handwriting recognition on any "ink"
previously entered in the handwriting recognition window. Function
4908 supplies the recognized output to the SIP buffer and the
operating system, and function 4910 clears the recognition window.
If the user presses the Delete button 4804 shown in FIG. 48
functions 4912 and 4914 clear the recognition window of any
"ink."
[0292] It should be appreciated that the use of the recognition
button 4806 allows the user both to instruct the system to
recognize the "ink" that was previously in the handwriting
recognition window and to start the writing of a new word to be
recognized.
[0293] FIG. 50 shows the keypad 5000, which can also be selected
from the function menu.
[0294] Having character recognition, handwriting recognition, and
keyboard input methods rapidly available as part of the speech
recognition SIP is often extremely advantageous because it lets the
user switch back and forth between these different modes in a
fraction of a second depending upon which is most convenient at the
current time. And it allows the outputs of all of these modes to be
used in editing text in the SIP buffer.
[0295] As shown in FIG. 51, in one embodiment of the SIP buffer, if
the user drags up from the filter button 1218, a window 5100 is
displayed that provides the user with filter entry mode
options. These include options of using letter-name speech
recognition, AlphaBravo speech recognition, character recognition,
handwriting recognition, and the keyboard window, as alternative
methods of entering filtering spellings. It also enables a user to
select whether any of the speech recognition modes are discrete or
continuous, and whether the letter name recognition, character
recognition, and handwriting recognition entries are to be treated
as ambiguous in the filter string. This user interface enables the
user to quickly select that filter entry mode which is appropriate
for the current time and place. For example, in a quiet location
where one does not have to worry about offending people by
speaking, continuous letter name recognition is often very useful.
However, in a location where there is a lot of noise, but a user
feels that speech would not be offensive to neighbors, AlphaBravo
recognition might be more appropriate. In a location such as a
library, where speaking might be offensive to others, silent filter
entry methods such as character recognition, handwriting
recognition, or keyboard input might be more appropriate.
[0296] FIG. 52 provides an example of how character recognition can
be quickly selected to filter a recognition. The view 5200 shows a portion
of a correction window in which the user has pressed the filter
button and dragged up, causing the filter entry mode menu 5100
shown in FIG. 51 to be displayed, and then selected the character
recognition option. As is shown in screenshot 5202 this causes the
character recognition entry window 4608 to be displayed in a
location that allows the user to see the entire correction window.
In the screenshot 5202 the user has drawn the character "e" and
when he releases his stylus from the drawing of that character the
letter "e" will be entered into the filter string causing a
correction window 5204 to be displayed in the example. The user
then enters an additional character "m" into the character
recognition window as indicated at 5206, and when he releases his
stylus from the drawing of this letter the recognition of the
character "m" causes the filter string to become "em," as shown at
5208.
[0297] FIG. 53 starts with a partial screenshot 5300 where the user
has tapped and dragged up from the filter key 1218 to cause the
display of the filter entry mode menu, and has selected the
handwriting option. This displays a screen such as 5302 with a
handwriting entry window 4800 displayed at a location that does not
block a view of the correction window. In the screenshot 5302 the
user has handwritten in a continuous cursive script the letters
"embed" and then presses the "REC" button to cause recognition of
those characters. Once he has tapped that button an ambiguous
filter string indicated by the ambiguous filter indicator 5304 is
displayed in the first choice window corresponding to the
recognized characters as shown by the correction window 5306. FIG.
54 shows how the user can use a keypad window 5000 to enter
alphabetic filtering information.
[0298] FIG. 55 illustrates how speech recognition can be used to
correct handwriting recognition. Screenshot 5500 shows a
handwriting entry window 4800 displayed in a position for entering
text into the SIP buffer window 1104. In this screenshot the user
has just finished writing a word. Numerals 5502 through 5510
indicate the handwriting of five additional words. The word in each
of these views is started by a touchdown in the "REC" button so as
to cause recognition of the prior written word. Numeral 5512 points
to a handwriting recognition window where the user makes a final
tap on the "REC" button to cause recognition of the last
handwritten word "speech". In the example of FIG. 55, after this
sequence of handwriting input has been recognized, the SIP buffer
window 1104 and the application window 1106 have the appearance
shown in the screenshot 5514. As indicated by 5516, the user drags
across the mis-recognized words "snack shower." This causes the
correction window 5518 to be shown. In the example, the user taps
the re-utterance button 1216 and discretely re-utters the desired
words "much . . . slower." By operation of a slightly modified
version of the "get choices" function described above with regard
to FIG. 23, this causes the recognition scores from recognizing
the utterance 5520 to be combined with the recognition results from
recognizing the handwritten input pointed to by numerals 5504 and
5506 to select a best-scoring recognition candidate, which in the
case of the example is the desired words, as shown at numeral 5522.
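The combination of scores from the two recognizers might be sketched as follows; the weighted sum is only one plausible combination rule, and the names are hypothetical, since the text says only that the scores are combined:

```python
def combine_recognition(speech_scores, handwriting_scores, weight=0.5):
    """Combine per-candidate scores from the re-utterance and the prior
    handwriting recognition, returning the best-scoring candidate.
    Candidates missing from one recognizer score zero from it."""
    candidates = set(speech_scores) | set(handwriting_scores)

    def combined(c):
        return (weight * speech_scores.get(c, 0.0)
                + (1.0 - weight) * handwriting_scores.get(c, 0.0))

    return max(candidates, key=combined)
```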
[0299] It should also be appreciated that the user could have
pressed the New button in the correction window 5518 instead of the
re-utterance button, in which case the utterance 5520 would have
caused the output of speech recognition to replace the handwriting
outputs that had been selected as shown at 5516.
[0300] As indicated in FIG. 56, if the user had pressed the filter
button 1218 instead of the re-utterance button in the correction
window 5518, the user could have used the speech recognition of a
known word, such as the utterance 5600 shown in FIG. 56, to
alphabetically filter the handwriting recognition of the two words
selected at 5516 in FIG. 55.
[0301] FIG. 57 illustrates an alternate embodiment 5700 of the SIP
speech recognition interface in which there are two separate
top-level buttons 5702 and 5704 to select between discrete and
continuous speech recognition, respectively. It will be appreciated
that it is a matter of design choice which buttons are provided at
the top level of a speech recognizer's user interface. However, the
ability to rapidly switch between the more rapid and more natural
continuous speech recognition versus the more reliable although
more halting and slow discrete speech recognition is something that
can be very desirable, and in some embodiments justifies the
allocation of a separate top-level key for the selection of
discrete and the selection of continuous recognition.
[0302] FIG. 58 displays an alternate embodiment of the display
choice list routine shown in FIG. 22, except that it creates a
single scrollable score-ordered choice list rather than the two
alphabetically ordered choice lists created by the routine of FIG.
22. The only portions of its language that differ from the
language contained in FIG. 22 are underlined, with the exception of
the fact that functions 2226 and 2228 have also been deleted in the
version of the routine shown in FIG. 58.
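The difference between the two routines can be sketched as two orderings of the same scored candidates; this is an illustration, not the pseudocode of FIG. 22 or FIG. 58, and the names are hypothetical:

```python
def score_ordered_choices(scored):
    """FIG. 58 variant: one scrollable list, best score first."""
    return sorted(scored, key=lambda w: -scored[w])

def alphabetic_choice_lists(scored, first_list_size=6):
    """FIG. 22 variant: an alphabetized first list of the best scorers
    followed by an alphabetized list of the remaining candidates."""
    by_score = sorted(scored, key=lambda w: -scored[w])
    first, rest = by_score[:first_list_size], by_score[first_list_size:]
    return sorted(first), sorted(rest)
```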
[0303] FIGS. 67 through 74 display various mappings of a basic
phone number keypad to functions that are used in various modes or
menus of the disclosed cell phone speech recognition editor. The
main numbered phone key mapping in the editor mode is shown in FIG.
67, and FIG. 68 shows the phone key portion of the entry mode menu,
which is selected if the user presses the "1" key when in the
editor mode. The entry mode menu is used to select among various
text and alphabetic entry modes available on the system. FIG. 69
displays the functions that are available on the numerical phone
key pad when the user has a correction window displayed, which can
be caused from the editor mode by pressing the "2" key. FIG. 70
displays the numerical phone key commands available from an edit
menu selected by pressing the "3" key when in the editor mode
illustrated in FIG. 67. This menu is used to change the
navigational functions performed by pressing the navigation keys of
the phone keypad. FIG. 71 illustrates a somewhat similar correction
navigation menu that displays navigational options available in the
correction window by pressing the "3" key. In addition to changing
navigational modes while in a correction window, it also allows the
user to vary the function that is performed when a choice is
selected.
[0304] FIG. 72 illustrates the numerical phone key mapping during a
key Alpha mode, in which the pressing of a phone key having letters
associated with it will cause a prompt to be shown on the cell
phone display asking the user to say the ICA word associated with
the desired one of the sets of letters associated with the pressed
key. This mode is selected by double-clicking the "3" phone key
when in the entry mode menu shown in FIG. 68.
[0305] FIG. 73 shows a basic keys menu, which allows the user to
rapidly select from among a set of the most common punctuation and
function keys used in text editing, or by pressing the "1" key to
see a menu that allows a selection of less commonly used
punctuation marks. The basic keys menu is selected by pressing a
"9" in the editor mode illustrated in FIG. 67. FIG. 74 illustrates
the edit option menu that is selected by pressing "0" in the editor
shown in FIG. 67. This contains a menu which allows a user to
perform basic tasks associated with use of the editor that are not
available in the other modes or menus.
[0306] At the top of each of the numerical phone key mappings shown
in FIGS. 67 through 74 is a title bar that is shown at the top of
the cell phone display when that menu or command list is shown. As
can be seen from these figures, the title bars illustrated in FIGS.
67, 69 and 72 start with the letters "Cmds" to indicate that the
displayed options are part of a command list, whereas FIGS. 68, 70,
71, 73 and 74 have title bars that start with "MENU." This is used
to indicate a distinction between the command lists shown in FIGS.
67, 69 and 72 and the menus shown in the others of these figures. A
command list displays commands that are available in a mode even
when that command list is not displayed. When in the editor mode
associated with the command list of FIG. 67 or the key Alpha mode
associated with FIG. 72, normally the text editor window will be
displayed even though the phone keys have the functional mappings
shown in those figures. Normally, when in the correction window
mode associated with the command list shown in FIG. 69, a
correction window is shown on the cell phone's display. In all these modes, the
user can access the command list to see the current phone key
mapping, as is illustrated in FIG. 75, by merely pressing the Menu
key, as is pointed to by the numeral 7500 in that figure. In the
example shown in FIG. 75, a display screen 7502 shows a window of
the editor mode before the pressing of the Menu button. When the
user presses the Menu button, the first page of the editor command
list is shown, as indicated by 7504. The user then has the option
of scrolling up or down in the command list to see not only the
commands that are mapped to the numerical phone keys but also the
commands that are mapped to the Menu, Talk, and End keys, as shown
in screen 7506, as well as to the navigational buttons "OK" and
"Menu," as shown in screens 7508 and 7510. If there are additional
options associated with the current mode at the time the command
list is entered, they can also be selected from the command list by
scrolling the highlight 7512 and using the "OK" key. In the example
shown in FIG. 75, a phone call indicator 7514 having the general
shape of a telephone handset is shown at the left of each title bar
to indicate to the user that the cell phone is currently in a
telephone call. In this case extra functions are available in the
editor that allow the user to quickly select to mute the microphone
of the cell phone, to record only audio from the user's side of the
phone conversation, and to play audio playback only to the user's
side of the phone conversation.
[0307] FIGS. 76 through 78 provide a more detailed pseudocode
description of the functions of the editor mode than is shown by
the mere command listings shown in FIGS. 67 and 75. This pseudocode
is represented as one input loop 7602 in which the editor responds
to various user inputs.
[0308] If the user inputs one of the navigational commands
indicated by numeral 7603, by either pressing one of the
navigational keys or speaking a corresponding navigational command,
the functions indented under it in FIG. 76 are performed.
[0309] These include a function 7604 that tests to see if the
editor is currently in word/line navigational mode. This is the
most common mode of navigation in the editor, and it can be quickly
selected by pressing the "3" key twice from the editor. The first
press selects the navigational mode menu shown in FIG. 70 and the
second press selects the word/line navigational mode from that
menu. If the editor is in word/line mode, functions 7606 through 7624
are performed.
[0310] If the navigational input is a word-left or word-right
command, function 7606 causes functions 7608 through 7617 to be
performed. Functions 7608 and 7610 test to see if extended
selection is on, and if so, they move the cursor one word to the
left or right, respectively, and extend the previous selection to
that word. If extended selection is not on, function 7612 causes
functions 7614 through 7617 to be performed. Functions 7614 and
7615 test to see if either the prior input was a word-left/right
command of a different direction than the current command or the
current command would put the cursor before the start or after the
end of text. If either of these conditions is true, the cursor is
placed to the left or right of the previously selected word, and
that previously selected word is unselected. If the conditions of
the test of function 7614 are not met, function 7617 moves the
cursor one word to the left or the right of its current position
and makes the word to which it has been moved the current
selection.
[0311] The operation of functions 7612 through 7617 enables
word-left and word-right navigation to allow a user not only to
move the cursor by a word but also to select the current word at
each move if so desired. It also enables the user to rapidly switch
between a cursor that corresponds to a selected word and a cursor
that represents an insertion point before or after a previously
selected word.
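The deselect-versus-move logic of functions 7606 through 7617 can be sketched as a small state machine. This ignores extended selection, and the class and field names are illustrative:

```python
class WordCursor:
    """Sketch of word-left/word-right navigation (extended selection
    off). Reversing direction, or running past either end of the text,
    collapses the selected word to a bare insertion point; otherwise
    the cursor moves one word and selects the word it lands on."""

    def __init__(self, word_count):
        self.word_count = word_count
        self.word = 0            # index of the current word
        self.selected = True     # word selected vs. bare insertion point
        self.last_dir = 0        # -1 = left, +1 = right, 0 = none yet

    def move(self, direction):
        target = self.word + direction
        past_end = not 0 <= target < self.word_count
        if self.selected and (direction == -self.last_dir or past_end):
            self.selected = False         # collapse to an insertion point
        elif not past_end:
            self.word = target            # move one word and select it
            self.selected = True
        self.last_dir = direction
```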
[0312] If the user input has been a line up or a line down command,
function 7620 moves the cursor to the nearest word on the line up
or down from the current cursor position, and if extended selection
is on, function 7624 extends the current selection through that new
current word.
[0313] As indicated by numeral 7626 the editor also includes
programming for responding to navigational inputs when the editor
is in other navigation modes that can be selected from the edit
navigation menu shown in FIG. 70.
[0314] If the user selects "OK," either by pressing the button or
by using a voice command, function 7630 tests to see if the editor
has been called to enter text into another program, such as to
enter text into a field of a Web document or a dialog box. If so,
function 7632 enters the current content of the editor into that
other program at the current text entry location in that program
and returns. If the test of function 7630 is not met, function 7634
exits the editor, saving its current content and state for possible
later use.
[0315] If the user presses the Menu key when in the editor,
function 7638 calls the display menu routine for the editor
commands which causes a command list to be displayed for the editor
as has been described above with regard to FIG. 75. As has been
described above, this allows the user to scroll through all the
current command mappings for the editor mode within a second or
two. If the user double-clicks on the Menu key when in the editor,
functions 7642 through 7646 call the display menu routine to show
the command list for the editor, set the recognition vocabulary to
the editor's command vocabulary, and commence speech recognition,
using the last press of the double-click to determine the duration
of that recognition.
[0316] If the user makes a sustained press of the menu key,
function 7650 enters help mode for the editor. This will provide a
quick explanation of the function of the editor mode and allow the
user to explore the editor's hierarchical command structure by
pressing its keys and having a brief explanation produced for the
portion of that hierarchical command structure reached as a result
of each such key pressed.
[0317] If the user presses the Talk button when in the editor,
function 7654 turns on recognition according to current recognition
settings, including vocabulary and recognition duration mode. The
Talk button will often be the major button used for initiating
speech recognition in the cellphone embodiment.
[0318] If the user selects the End button, function 7658 goes to
the phone mode, such as to quickly make or answer a phone call. It
saves the current state of the editor so that the user can return
to it when such a phone call is over.
[0319] As shown in FIG. 77, if the user selects the entry mode menu
illustrated in FIG. 68, function 7702 causes that menu to be
displayed. As will be described below in greater detail, this menu
allows the user to quickly select between dictation modes somewhat
as buttons 1122 through 1134 shown in FIG. 11 did in the PDA
embodiment. In the embodiment shown, the entry mode menu has been
associated with the "1" key because of the "1" key's proximity to
the talk key. This allows the user to quickly switch dictation
modes and then continue dictation using the talk button.
[0320] If the user selects "choice list," functions 7706 and 7708
set the correction window navigational mode to the page/item
navigational mode, which is best for scrolling through and
selecting recognition candidate choices. They then call the
correction window routine for the current selection, which causes a
correction window somewhat similar to the correction window 1200
shown in FIG. 12 to be displayed on the screen of the cellphone. If
there currently is no cursor, the correction window will be called
with an empty selection. If this is the case, it can be used to
select one or more words using alphabetic input, word completion,
and/or the addition of one or more utterances. The correction
window routine will be described in greater detail below.
[0321] If the user selects "filter choices," such as by
double-clicking on the "2" key, functions 7712 through 7716 set the
correction window navigational mode to the word/character mode used
for navigating in a first choice or filter string. They then call
the correction window routine for the current selection and treat
the second press of the double-click, if one has been entered, as
the speech key for recognition duration purposes.
[0322] In most cellphones, the "2" key is located directly below
the navigational key. This enables the user to navigate in the
editor to a desired word or words that need correction and then
single-press the nearby "2" key to see a correction window with
alternate choices for the selection, or to double-click on the "2"
key and immediately start entering filtering information to help
the recognizer select a correct choice.
[0323] If the user selects the navigational mode menu shown in FIG.
70, function 7720 causes it to be displayed. As will be described
in more detail below, this menu enables the user to change the
navigation that is accomplished by pressing the left-and-right and
up-and-down navigational buttons. In order to make such switches
easier to make, the navigational mode menu key has been placed in
the top row of the numbered phone keys.
[0324] If the user selects the discrete recognition input, function
7724 turns on discrete recognition with the current vocabulary,
using the press-and-click-to-utterance-end duration mode as the
current recognition duration setting. This button is provided to
enable the user to quickly shift to discrete utterance recognition
whenever desired by pressing the "4" button. As has been stated
before, discrete recognition tends to be substantially more
accurate than continuous recognition, although it is more halting.
The location of this command's key has been selected to be close to
the talk button and the entry mode menu button. Because of the
availability of the discrete recognition key, the recognition mode
normally mapped to the Talk button will be continuous. Such a
setting allows the user to switch between continuous and discrete
recognition by alternating between pressing the Talk button and the
"4" key.
[0325] If the user selects selection start or selection stop, as by
toggling the "5" key, function 7728 toggles extended selection on
or off, depending on whether that mode was previously off or on.
Then function 7730 tests to see whether extended selection has just
been turned off, and if so, function 7732 de-selects any prior
selection other than one, if any, at the current cursor. In the
embodiment described, the "5" key was selected for the extended
selection command because of its proximity to the navigational
controls and to the "2" key, which is used for bringing up
correction windows.
[0326] If the user chooses the select all command, such as by
double-clicking on the "5" key, function 7736 selects all the text
in the current document.
[0327] If the user selects the "6" key or any of the associated
commands which are currently active, which can include play start,
play stop or records stop, function 7740 tests to see if the system
is currently not playing audio. If so, function 7742 toggles
between an audio play mode and a mode in which audio play is off.
If not, function 7742 toggles between an audio play mode and a mode
in which audio play is off. If the cellphone is currently on a
phone call and the play only to me option 7513 shown in FIG. 75 has
been set to the off mode, function 7746 sends audio from the play
over the phone line to the other side of the phone conversation as
well as to the speaker or headphone of the cellphone itself.
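The play-key behavior just described, with its "play only to me" routing, can be sketched as follows. This is a minimal illustration of functions 7740 through 7750, not the patent's actual implementation; the state dictionary, function name, and returned routing strings are all assumptions made for the example.

```python
# Hedged sketch of the "6"-key handling described above (functions
# 7740-7750). All names and return strings are illustrative.

def handle_play_key(state):
    """Toggle audio playback, or stop recording if recording is on."""
    if state["recording"]:
        # On the other hand, if recording, the press turns recording off
        # (function 7750 in the description above).
        state["recording"] = False
        return "recording off"
    # Toggle between audio-play mode and audio-play-off mode (7742).
    state["playing"] = not state["playing"]
    if not state["playing"]:
        return "play off"
    # On a call with "play only to me" off, also send the played audio
    # over the phone line to the other party (function 7746).
    if state["on_call"] and not state["play_only_to_me"]:
        return "play on: speaker + phone line"
    return "play on: speaker only"
```

Successive presses alternate between the "play on" and "play off" results, mirroring the toggle behavior described for function 7742.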
[0328] If, on the other hand, the system is recording audio when
the "6" button is pressed, function 7750 turns recording off.
[0329] If the user double-clicks on the "6" key or enters a record
command, function 7754 turns audio recording on. Then function 7756
tests to see if the system is currently on a phone call and if the
record only me setting 7511 shown in FIG. 75 is in the off state.
If so, function 7758 records audio from the other side of the phone
line as well as from the phone's microphone or microphone input
jack.
[0330] If the user presses the "7" key or otherwise selects the
capitalized menu command, function 7762 displays a capitalized menu
that offers the user the choice to select between modes that cause
all subsequently entered text to be either in all lowercase, all
initial caps, or all capitalized. It also allows the user to select
to change one or more words currently selected, if any, to all
lowercase, all initial caps, or all capitalized form.
[0331] If the user double-clicks on the "7" key or otherwise
selects the capitalized cycle key, the capitalized cycle routine
which can be called one or more times to change the current
selection, if any, to all initial caps, all capitalized, or all
lowercase form.
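A capitalization-cycle routine of the kind just described can be sketched briefly. The cycle order (initial caps, then all caps, then lowercase) follows the order listed above; the function name and the step-counter interface are assumptions for illustration only.

```python
# Illustrative capitalization cycle: successive calls (steps) move a
# selection through initial caps, all capitalized, and all lowercase.

CYCLE = ["initial", "upper", "lower"]

def capitalization_cycle(text, step):
    """Return `text` in the form selected by cycle position `step`."""
    form = CYCLE[step % len(CYCLE)]
    if form == "initial":
        # Capitalize the first letter of each space-separated word.
        return " ".join(w.capitalize() for w in text.split(" "))
    if form == "upper":
        return text.upper()
    return text.lower()
```

Calling it with step values 0, 1, and 2 on the same selection reproduces the three forms in order, and step 3 wraps back to initial caps.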
[0332] It the user presses the "8" key or otherwise selects the
word form list, function 7770 calls the word form list routine
described above with regard to FIG. 27.
[0333] If the user double-clicks on the "8" key or selects the word
type command, function 7774 displays the word type menu. The word
type menu allows the user to place a word type limitation, as
described above with regard to the filter match routine of FIG. 26,
upon a selected word. In the embodiment shown, this menu is a
hierarchical menu having the general form shown in FIG. 91, which
allows the user to specify word ending types, word start types,
word tense types, word part-of-speech types, and other word types
such as possessive or non-possessive forms, singular or plural
nominative forms, singular or plural verb forms, spelled or
not-spelled forms, and homonyms, if any exist.
[0334] As shown in FIG. 78, if the user presses the "9" key or
selects the basic keys menu command, function 7802 displays the
basic keys menu shown in FIG. 73, which allows the user to select
the entry of one of the punctuation marks or input characters that
can be selected from that menu as text input.
[0335] If the user double-clicks on the "9" key or selects the New
Paragraph Command, function 7806 enters a New Paragraph Character
into the editor's text.
[0336] If the user selects the "*" key or the escape command,
functions 7810 to 7824 are performed. Function 7810 tests to see if
the editor has been called to input or edit text in another
program, in which case function 7812 returns from the call to the
editor with the edited text for insertion to that program. If the
editor has not been called for such purpose, function 7820 prompts
the user with the choice of exiting the editor, saving its contents
and/or canceling escape. If the user selects to escape, functions
7822 and 7824 escape to the top level of the phone mode described
above with regard to FIG. 63. If the user double-clicks on the "*"
key or selects the task list function, function 7828 goes to the
task list, as such a double-click does in most of the cellphones,
operating modes and menus.
[0337] It the user presses the "0" key or selects the edit options
menu command, function 7832 is the edited options menu described
above briefly with regard to FIG. 74. If the user double-clicks on
the "0" key or selects the undo command, function 7836 undoes the
last command in the editor, if any.
[0338] It the user presses the "#" key or selects the backspace
command, function 7840 tests to see if there's a current selection.
If so, function 7842 deletes it. If there is no current selection
and if the current smallest navigational unit is a character, word,
or outline item, functions 7846 and 7848 delete backward by that
smallest current navigational unit.
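The unit-dependent backspace behavior above can be sketched for the character and word cases. This is only an illustration of the idea of deleting backward by the smallest current navigational unit; the simple space-based word splitting is an assumption, as the patent does not specify how word boundaries are found.

```python
# Hedged sketch of backspace-by-navigational-unit (functions 7846-7848).
# Word boundaries are approximated by single spaces for illustration.

def backspace(text, unit):
    """Delete backward from the end of `text` by one `unit`."""
    if not text:
        return text
    if unit == "character":
        return text[:-1]
    if unit == "word":
        stripped = text.rstrip()
        # Drop the last word, or everything if only one word remains.
        return stripped.rsplit(" ", 1)[0] if " " in stripped else ""
    raise ValueError("unsupported navigational unit: " + unit)
```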
[0339] FIGS. 79 and 80 illustrate the options provided by the
entry mode menu discussed above with regard to FIG. 68.
[0340] When in this menu, if the user presses the "1" key or
otherwise selects large vocabulary recognition, functions 7906
through 7914 are performed. These set the recognition vocabulary to
the large vocabulary. They treat the press of the "1" key as a
speech key for recognition duration purposes. They also test to see
if a correction window is displayed. If so, they set the
recognition mode to discrete recognition, based on the assumption
that in a correction window the user desires the more accurate
discrete recognition. They add any new utterance or utterances
received in this mode to the utterance list of the type described
above, and they call the display choice list routine of FIG. 22 to
display a new correction window for any re-utterance received.
[0341] In the cellphone embodiment shown, the "1" key has been
selected for large vocabulary in the entry mode menu because it is
the most common recognition vocabulary and thus the user can easily
select it by clicking the "1" key twice from the editor. The first
click selects the entry mode menu and the second click selects the
large vocabulary recognition.
[0342] If the user presses the "2" key when in entry mode, the
system will be set to a letter-name recognition of the type
described above. If the user double-clicks on that key when the
entry mode menu is displayed at a time when the user is in a
correction window, function 7926 sets the recognition vocabulary to
the letter-name vocabulary and indicates that the output of that
recognition is to be treated as an ambiguous filter. In the
preferred embodiment, the user has the capability to indicate under
the entry preference option associated with the "9" key of the menu
whether or not such filters are to be treated as ambiguous length
filters or not. The default setting is to let such recognition be
treated as an ambiguous length filter in continuous letter-name
recognition, and a fixed length ambiguous filter in response to the
discrete letter-name recognition.
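The fixed-length ambiguous filter mentioned above can be illustrated with a small sketch: each filter position holds the set of letters the recognizer considered plausible for that position, and a candidate word passes the filter if its prefix can be drawn from those sets. This is an interpretation offered for illustration; the function name and set-of-letters representation are assumptions, and ambiguous-length filtering (where the number of positions itself is uncertain) is not shown.

```python
# Hedged sketch of a fixed-length ambiguous filter: filter_sets[i] is
# the set of alternative letters recognized for position i.

def matches_ambiguous_filter(word, filter_sets):
    """True if `word` could begin with one letter from each filter set."""
    if len(word) < len(filter_sets):
        return False
    return all(ch in alts for ch, alts in zip(word, filter_sets))
```

For example, if discrete letter-name recognition heard either "b" or "p" followed by "a", the filter would admit "bat" and "path" but reject "cat".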
[0343] At the user presses the "3" key, recognition is set to the
AlphaBravo mode. If the user double-clicks on the "3" key,
recognition is set to the key "Alpha" mode as described above with
regard to FIG. 72. This mode is similar to AlphaBravo mode except
that pressing one of the number keys "2" through "9" will cause the
user to be prompted to one of the ICA words associated with the
letters on the pressed key and the recognition will favor
recognition of one from that limited set of ICA words, so as to
provide very reliable alphabetic entry even under relatively
extreme noise conditions.
[0344] It the user presses the "4" key, the vocabulary is changed
to the digit vocabulary. If the user double-click on the "4" key,
the system will respond to the pressing of numbered phone keys by
entering the corresponding numbers into the editors text.
[0345] If the user presses the "5" key, the recognition vocabulary
is limited to a punctuation vocabulary.
[0346] If the user presses the "6" key, the recognition vocabulary
is limited to the contact name vocabulary described above.
[0347] FIG. 86 illustrates the key Alpha mode, which has been
described above to some extent with regard to FIG. 72. As indicated
in FIG. 86, when this mode is entered the navigation mode is set to
the word/character navigation mode normally associated with
alphabetic entry. Then function 8604 overlays the keys listed below
it with the functions indicated for each such key. In this mode,
pressing the talk key turns on recognition with the AlphaBravo
vocabulary according to current recognition settings, responding to
the key press according to the current recognition duration
setting. The "1" key continues to operate as the entry mode menu
key so that the user can press it to exit the key Alpha mode. A
pressing of one of the numbered phone keys "2" through "9" causes
functions 8618 through 8624 to be performed. During such a press
these display a prompt of the ICA words corresponding to the phone
key's letters, cause recognition to substantially favor the
recognition of one of those three or four ICA words, turn on
recognition for the duration of the press, and output the letter
corresponding to the recognized ICA word either into the text of
the editor, if in editor mode, or into the filter string, if in
filter edit mode.
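The key-to-ICA-word mapping that drives this mode can be sketched directly. The standard phone keypad letter groups and the ICA (International Communication Alphabet) words are well known; the two function names are illustrative assumptions.

```python
# Sketch of key Alpha mode's prompt and output logic: a pressed phone
# key yields the ICA words for its letters, recognition is largely
# limited to those words, and the matched word's letter is output.

KEY_LETTERS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
               "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
ICA = {"a": "alpha", "b": "bravo", "c": "charlie", "d": "delta",
       "e": "echo", "f": "foxtrot", "g": "golf", "h": "hotel",
       "i": "india", "j": "juliet", "k": "kilo", "l": "lima",
       "m": "mike", "n": "november", "o": "oscar", "p": "papa",
       "q": "quebec", "r": "romeo", "s": "sierra", "t": "tango",
       "u": "uniform", "v": "victor", "w": "whiskey", "x": "xray",
       "y": "yankee", "z": "zulu"}

def prompt_words(key):
    """ICA words to prompt and favor while `key` is pressed."""
    return [ICA[ch] for ch in KEY_LETTERS[key]]

def letter_for(key, recognized_word):
    """Letter to output once one of the favored ICA words is heard."""
    for ch in KEY_LETTERS[key]:
        if ICA[ch] == recognized_word:
            return ch
    raise ValueError("word not in this key's set: " + recognized_word)
```

Pressing "7", for instance, prompts papa, quebec, romeo, and sierra, and a recognition of "sierra" outputs the letter "s".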
[0348] If the user presses the zero button, function 8628 enters a
key punctuation mode that responds to the pressing of any phone key
having letters associated with it by displaying a scrollable list
of all punctuation marks that start with one of the set of letters
associated with that key, and which favors the recognition of one
of those punctuation words.
[0349] FIG. 87 represents an alternate embodiment of the key Alpha
mode which is identical to that of FIG. 86 except for the portions
of the pseudocode which are underlined in FIG. 87. In this mode, if
the user presses the talk button, large vocabulary recognition will
be turned on but only the initial letter of each recognized word
will be output, as indicated in function 8608A. As functions 8618A
and 8620A indicate, when the user presses a phone key having a set
of three or four letters associated with it, the user is prompted
to say a word starting with the desired letter, the recognition
vocabulary is substantially limited to words that start with one of
the key's associated letters, and function 8624A outputs the
initial letter corresponding to the recognized word.
[0350] In some embodiments of the invention, a third alternative
key Alpha mode can be used in which a limited set of words is
associated with each letter of the alphabet and during the pressing
of the key, recognition is substantially limited to recognition of
one of the sets of words associated with the key's associated
letters. In some such embodiments, a set of five or fewer words
would be associated with each such letter.
[0351] FIGS. 89 and 90 represent some of the options available in
the edit options menu, which is accessed by pressing the "0" button
in the editor and correction window modes. In this menu, if the
user presses the "1" key, he gets a menu of file options, as
indicated at function 8902. If the user presses the "2" key, he
gets a menu of edit options such as those that are common in most
editing programs, as indicated by function 8904. If the user
presses the "3" button, function 8906 displays the same entry
preference menu that is accessed by pressing "9" in the entry mode
menu described above with regard to FIGS. 68 and 79.
[0352] If the user presses the "4" key when in the edit options
menu, a text-to-speech or TTS menu will be displayed. In this menu,
the "4" key toggles TTS play on or off. If this key toggles
text-to-speech on if there's a current selection, functions 8916
and 8918 cause the TTS to say the selection, preferably preceding
it by a TTS or pre-recorded saying of the word "selection." If
there is no selection when TTS is toggled on, TTS saying the
current text at the current cursor location until the end of the
current document or until the user provides input other than cursor
movement within the document. As will be explained below with
regard to FIG. 99, when TTS mode is on, the user will be provided
with audio prompts and text-to-speech playback of text so as to
enable a substantial portion of the systems functionality to be
used without requiring being able to see the cell phones
screen.
[0353] The TTS submenu also includes a choice that allows the user
to play the current selection whenever he or she desires to do so,
as indicated by functions 8924 and 8926, and functions 8928 and
8930 allow the user to toggle continuous play on or off whether the
machine is in TTS on or TTS off mode. As indicated by the top-level
choices in the edit options menu at 8932, a double-click of the "4"
key toggles text-to-speech on or off as if the user had pressed the
"4" key, waited for the text-to-speech menu to be displayed, and
then again pressed the "4" key.
[0354] The "5" key in the edit options menu selects the outline
menu that includes a plurality of functions that let a user
navigate in an expand and contract headings and an outline mode. If
the user double-clicks on the "5" key, the system toggles between
totally expanding and totally contracting the current outline
element in which the editor's cursor is located.
[0355] If the user selects the "6" key and audio menu is displayed
as a submenu, some of the options of which are displayed indented
under the audio menu item 8938 in the combination of FIGS. 89 and
90. This audio menu includes an item selected by the "1" key which
gives the user finer control over audio navigation speed that is
provided by use of the "6" button in the edit now menu described
above with regard to FIGS. 84 and 70. If the user selects the "2"
key, he or she will see a submenu that allows the user to call
audio playback settings such as volume and speed and whether audio
associated with recognized words is to be played and/or audio
recorded without associated recognized words.
[0356] FIG. 90 starts with items selected by the "3", "4", "5", "6"
and "7" keys under the audio menu described above, starting with
numeral 8938 in FIG. 89. If the user presses the "3" key, a
recognized audio options dialog box 9000 will be displayed that, as
is described by numerals 9002 through 9014, gives the user the
option to select to perform speech recognition on any audio
contained in the current selection in the editor, to recognize all
audio in the current document, to decide whether or not previously
recognized audio is to be read recognized, and to set parameters to
determine the quality of, and time required by, such recognition.
As indicated at function 9012, this dialog box provides an estimate
of recognizing the current selection with the current quality
settings and, if a task of recognizing a selection is currently
underway, status on the current job. This dialog box allows the
user to perform recognitions on relatively large amounts of audio
as a background task or at times when a phone is not being used for
other purposes, including times when it is plugged into an
auxiliary power supply.
[0357] If the user selects the "4" key in the audio menu, the user
is provided with a submenu that allows him to select to delete
certain information from the current selection. This includes
allowing the user to select to delete all audio that is not
associated with recognized words, to delete all audio that is
selected with recognized words, to delete all audio, or to delete
text from the desired selection. Deleting recognition audio from
recognized text greatly reduces the memory associated with the
storage of such text and is often a useful thing to do once the
user has decided that he does not need the text-associated audio to
help him her determine its intended meaning. Deleting text but not
audio from a portion of media is often useful where the text has
been produced by speech recognition from the audio but is
sufficiently inaccurate to be of little use.
[0358] In the audio menu, the "5" key allows the users to select
whether or not text that has associated recognition audio is
marked, such as by underlining to allow the user to know if such
text has playback that can be used to help understand it or, in
some embodiments, will have an acoustic representation from which
alternate recognition choices can be generated. The "6" key allows
the user to choose whether or not recognition audio is to be capped
for recognized text. In many embodiments, even if the recording of
recognition audio is turned off, such audio will be capped for some
number of the most recently recognized words so that it will be
available for correction playback purposes.
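Keeping audio for only the most recently recognized words, as described above, is naturally modeled as a bounded buffer that discards its oldest entries. The class below is a minimal sketch under that interpretation; the class and method names are assumptions, not the patent's terminology.

```python
from collections import deque

# Sketch of retaining recognition audio for only the N most recently
# recognized words, so recent words stay available for correction
# playback even when recording of such audio is otherwise off.

class RecentAudioCache:
    def __init__(self, max_words=20):
        # deque with maxlen silently drops the oldest entry when full.
        self.cache = deque(maxlen=max_words)

    def add(self, word, audio):
        self.cache.append((word, audio))

    def audio_for(self, word):
        """Most recent audio kept for `word`, or None if discarded."""
        for w, a in reversed(self.cache):
            if w == word:
                return a
        return None
```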
[0359] In the audio menu, the "7" key selects a transcription mode
dialog box. This causes the dialog box to be displayed, that allows
the user to select settings to be used in a transcription mode that
is described below with regard to FIG. 94. This is a mode that is
designed to make it easy for user to transcribe prerecorded audio
by speech recognition.
[0360] If the user presses the "8" key, function 9036 will be
performed, call a search dialog box with the current selection, if
any, as the search string. As will be illustrated below, the speech
recognition text editor can be used to enter a different search
string, if so desired. If the user double-clicks on the "8" key,
this will be interpreted as a find again command which will search
again for the previously entered search string.
[0361] If the user selects the "9" key in the edit options menu, a
vocabulary menu is displayed that allows the user to determine
which words are in the current vocabulary, to select between
different vocabularies, and to add words to a given vocabulary. If
the user either presses or double-clicks the "0" button when in the
edit options menu, an undo function will be performed. A double
click accesses the undo function from within the edit options menu
so as to provide similarity with the fact that a double-click on
"0" accesses the undo function from the editor or the correction
window. In the edit options menu, the pound key operates as a redo
button.
[0362] FIG. 94 illustrates the TTS play rules. These are the rules
that govern the operation of TTS generation when TTS operation has
been selected through the text-to-speech options described above
with regard to functions 8908 to 8932 of FIG. 89.
[0363] If a TTS keys mode has been turned on by operation of the
"1" key when in the TTS menu, as indicated by function 1909 above,
function 9404 causes functions 9406 to 9414 to be performed. These
functions enable a user to safely select phone keys without being
able to see them, such as when the user is driving a car or is
otherwise occupied. Preferably this mode is not limited to
operation in the speech recognition editor but can be used in any
mode of the cellphone's operation. When any phone key is pressed,
function 9408 tests to see if the same key has been pressed within
a TTS Key Time, which is a short period of time such as a quarter
or a third of a second. For purposes of this test, the time is
measured from the release of the last key press of the same key. If
the same key has not been pressed within that short period of time,
functions 9410 and 9412 will cause a text-to-speech, or in some
embodiments recorded audio playback, saying of the number of the
key and its current command name. This audio feedback continues
only as long as the user continues to press the key. If the key has
a double-click command associated with it, that also will be said
if the user continues to press the key long enough. If the test of
function 9408 finds that the time since the release of the last key
press of the same key is less than the TTS Key Time, function 9414
causes the cellphone's software to respond to the key press,
including any double-clicks, the same as it would if the TTS keys
mode were not on.
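The timing test at the heart of the TTS keys mode can be sketched as follows. This is a minimal illustration of the function 9408 decision; the function name, the dictionary of last release times, and the specific TTS Key Time value are assumptions for the example.

```python
# Hedged sketch of the TTS-keys-mode test (functions 9406-9414): a key
# pressed again within TTS_KEY_TIME of its last release is executed
# normally; otherwise its number and command name are only announced.

TTS_KEY_TIME = 0.3   # seconds; illustrative value (a "third of a second")

def handle_key(key, now, last_release):
    """Return 'execute' or 'announce' for a press of `key` at time `now`.

    `last_release` maps each key to the time its last press was released.
    """
    last = last_release.get(key)
    if last is not None and now - last < TTS_KEY_TIME:
        return "execute"      # quick second press: act on the key normally
    return "announce"         # exploratory press: just say key and command
```

A first press of any key is announced; a repeat press within the window is executed, which matches the touch-then-confirm usage described in the next paragraph.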
[0364] Thus it can be seen that the TTS keys mode allows the user
to find a cell phone key by touch, to press it to determine if it
is the desired key and, if so, to quickly press it again one or
more times to achieve the key's desired function. Since the press
of the key that is responded to by functions 9410 and 9412 does not
cause any response other than the saying of its associated
function, this mode allows the user to search for the desired key
without causing any undesired consequences.
[0365] In some cellphone embodiments, the cellphone keys are
designed so that when they are merely touched, rather than pushed,
audio feedback as to which key they are and their current function,
similar to that provided by function 9412, will be provided. This
can be provided, for example, by having the material of the phone
keys made of a conductive material, or by having other portions of
the phone that are separated from those keys generate a voltage
that, if conducted through a user's body to a key, can be detected
by circuitry associated with the key. Such a system would provide
an even faster way for a user to find a desired key by touch, since
with it a user could receive feedback as to which keys he was
touching merely by scanning a finger over the keypad in the
vicinity of the desired key. It would also allow a user to rapidly
scan for a desired command name by likewise scanning his fingers
over successive keys until the desired command was found.
[0366] When TTS is on, if the system recognizes or otherwise
receives a command input, functions 9416 and 9418 cause TTS or
recorded audio playback to say the name of the recognized command.
Preferably such audio confirmation of commands has an associated
sound quality, such as a different tone of voice or different
associated sounds, that distinguishes the saying of command words
from the saying of recognized text.
[0367] When TTS is on, when a text utterance is recognized,
functions 9420 through 9424 detect the end of the utterance and the
completion of its recognition, and then use TTS to say the words
that have been recognized as the first choice for the
utterance.
[0368] As indicated in functions 9426 through 9430, TTS responds to
the recognition of a filtering utterance in a similar manner.
[0369] When in TTS mode, if the user moves the cursor to select a
new word or character, functions 9432 to 9438 use TTS to say that
newly selected word or character. If such a movement of the cursor
to a new word or character position extends an already started
selection, after the saying of the new cursor position, functions
9436 and 9438 will say the word "selection," in a manner that
indicates that it is not part of recognized text, and then proceed
to say the words of the current selection. If the user moves the
cursor to a non-selection cursor position, such as is described
above with regard to functions 7614 and 7615 of FIG. 76, functions
9440 and 9442 of FIG. 94 use TTS to say the two words that the
cursor is located between.
[0370] When in TTS mode, if a new correction window is displayed,
functions 9444 and 9446 use TTS to say the first choice in the
correction window, spell the current filter, if any, indicating
which parts of it are unambiguous and which parts are ambiguous,
and then use TTS to say each candidate in the currently displayed
portion of the choice list. For purposes of speed, it is best that
differences in tone or sound be used to indicate which portions of
the filter are absolute or ambiguous.
[0371] If the user scrolls an item in the correction window,
functions 9448 and 9450 use TTS to say the currently highlighted
choice and its selection number in response to each such scroll. If
the user scrolls a page in a correction window, functions 9452 and
9454 use TTS to say the newly displayed choices as well as indicate
the currently highlighted choice.
[0372] When in correction mode, if the user enters a menu,
functions 9456 and 9458 use TTS or pre-recorded audio to say the
name of the current menu and all of the choices in the menu and
their associated numbers, indicating the current selection
position. Preferably this is done with audio cues that indicate to
a user that the words being said are menu options.
[0373] If the user scrolls up or down an item in a menu, functions
9460 and 9462 use TTS or pre-recorded audio to say the highlighted
choice and then, after a brief pause, any following selections on
the currently displayed page of the menu.
[0374] FIG. 95 illustrates some aspects of the programming used in
TTS generation. If a word to be generated by text-to-speech is in
the speech recognition programming's vocabulary of phonetically
spelled words, function 9502 causes functions 9504 through 9512 to
be performed. Function 9504 tests to see if the word has multiple
phonetic spellings associated with different parts of speech, and
if the word to be said using TTS has a current linguistic context
indicating its current part of speech. If both these conditions are
met, function 9506 uses the speech recognition programming's
part-of-speech-indicating code to select, as the phonetic spelling
used in the TTS generation of the current word, the phonetic
spelling associated with the part of speech found most probable by
that code. If, on the other hand, there is only one phonetic
spelling associated with the word or there is no context sufficient
to identify the most probable part of speech for the word, function
9510 selects the single phonetic spelling for the word or its most
common phonetic spelling. Once a phonetic spelling has been
selected for the word, either by function 9506 or function 9510,
function 9512 uses that spelling in the TTS generation. If, as is
indicated at 9514, the word to be generated by text-to-speech does
not have a phonetic spelling, functions 9514 and 9516 use the
pronunciation-guessing software that is used by the speech
recognizer to assign a phonetic spelling to names and newly entered
words for the text-to-speech generation of the word.
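The spelling-selection logic of FIG. 95 can be sketched as a short function. The lexicon representation (each word mapping to part-of-speech-keyed spellings, with the most common spelling listed first) and all names here are illustrative assumptions; the patent does not specify data structures.

```python
# Hedged sketch of FIG. 95's flow: use the part-of-speech-linked
# spelling when context identifies a part of speech (9504-9506), else
# the single or most common spelling (9510), else fall back to a
# pronunciation guesser for names and new words (9514-9516).

def select_spelling(word, lexicon, pos_from_context, guess_pronunciation):
    spellings = lexicon.get(word)
    if spellings is None:
        # Word not in the phonetic vocabulary: guess its pronunciation.
        return guess_pronunciation(word)
    if len(spellings) > 1 and pos_from_context is not None:
        if pos_from_context in spellings:
            # Spelling tied to the most probable part of speech.
            return spellings[pos_from_context]
    # Single spelling, or no usable context: most common spelling first.
    return next(iter(spellings.values()))
```

With a word like "record," a verb context selects the verb pronunciation, while no context falls back to the noun form listed first.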
[0375] FIG. 96 describes the operation of the transcription mode
that can be selected by operation of the transcription mode dialog
box, which is activated, in association with the number "7" key in
FIG. 90, under the audio menu of the edit options menu shown in
FIGS. 89 and 90.
[0376] When the transcription mode is entered, function 9602
normally changes navigation mode to an audio navigation mode that
navigates forward or backward five seconds and an audio recording
in response to left and right navigational key input and forward
and backward one second in response to up and down navigational
input. These are default values which can be changed in the
transcription mode dialog box. During this mode, if the user clicks
the play key, which is the "6" key in the editor, functions 9606
through 9614 are performed. Functions 9607 and 9608 toggle play
between on and off. Function 9610 causes function 9612 to be
performed if the toggle is turning play on. If so, and if there has
been no sound navigation since the last time sound was played,
function 9614 starts playback a set period of time before the
last end of play. This is performed so that if the user is
performing transcription, each successive playback will start
slightly before the last one ended so the user will be able to
recognize words that were only partially said in the prior playback
and so that the user will better be able to interpret speech sounds
as words by being able to perceive a little bit of the preceding
language context. If the user presses the play key for more than a
specified period of time, such as a third of a second, function
9616 causes functions 9618 through 9622 to be performed. These
functions test to see if play is on, and if so they turn it off.
They also turn on large vocabulary recognition during the press, in
either continuous or discrete mode, according to present settings.
They then insert the recognized text into the editor at the location
in the audio being transcribed at which the last end of play took
place. If the user double-clicks the play button, functions 9624
and 9626 prompt the user that audio recording is not available in
transcription mode and that transcription mode can be turned off in
the audio menu under the added options menu.
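The playback-start rule of function 9614 can be illustrated with a short sketch. The names and the particular overlap value are assumptions for illustration; the application leaves the overlap period as a user-settable default.

```python
OVERLAP_SECONDS = 2.0  # assumed default; settable in the transcription mode dialog box

def playback_start(last_end_of_play, navigated_since_last_play, current_position):
    """Return the audio position (in seconds) at which to resume playback."""
    if navigated_since_last_play:
        # The user has navigated in the audio, so honor the new position.
        return current_position
    # Otherwise rewind a fixed overlap so words that were cut off at the
    # previous stop, and a bit of preceding context, are heard again.
    return max(0.0, last_end_of_play - OVERLAP_SECONDS)
```

Each click of the play key thus resumes slightly before the prior segment ended, which is what lets the transcriber recover partially played words.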
[0377] It can be seen that this transcription mode enables the user
to alternate between playing a portion of previously recorded audio
and then transcribing it by use of speech recognition by merely
alternating between clicking and making sustained presses of the
play key, which is the number "6" phone key. The user is free to
use the other functionality of the editor to correct any mistakes
that have been made in the recognition during the transcription
process, and then merely return to it by again pressing the "6" key
to play the next segment of audio to be transcribed. Of course, it
should be understood that the user will often not desire to perform
a literal transcription of the audio. For example, the user may
play back a portion of a phone call and merely transcribe a summary
of the more noteworthy portions.
[0378] FIG. 97 illustrates the operation of the dialog box editing
programming that uses many features of the editor mode described
above to enable users to enter text and other information into a
dialog box displayed in the cell phone's screen.
[0379] When a dialogue box is first entered, function 9702 displays
an editor window showing the first portion of the dialog box. If
the dialog box is too large to fit on one screen at one time, it
will be displayed in a scrollable window. As indicated by function
9704, the dialog box responds to all inputs in the same way that
the editor mode described above with regard to FIGS. 76 through 78
does, except as is indicated by the functions 9704 through 9726. As
indicated at 9707 and 9708, if the user supplies navigational input
when in a dialog box, the cursor movement responds in a manner
similar to that in which it would in the editor except that it can
normally only move to a control into which the user can supply
input. Thus, if the user moved left or right by a word, the cursor
would move left or right to the next dialog box control, moving up
or down lines if necessary to find such a control. If the user
moves up or down a line, the cursor would move to the nearest
control in the lines above or below the current cursor position. In order to
enable the user to read extended portions of text that might not
contain any controls, normally a cursor will not move more than a
page even if there are no controls within that distance.
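This control-seeking, page-capped cursor movement can be sketched as below. The representation of lines and controls, the page size, and the function names are assumptions, not the patent's implementation.

```python
PAGE_LINES = 10  # assumed number of lines per display page

def next_control_line(control_lines, current_line, direction, total_lines):
    """Return the line the cursor moves to; direction is +1 (forward) or -1.

    control_lines is the set of line indices containing input controls.
    """
    line = current_line + direction
    while 0 <= line < total_lines:
        if line in control_lines:
            return line  # stop at the nearest control in that direction
        if abs(line - current_line) >= PAGE_LINES:
            # No control within a page: stop anyway, so the user can
            # still read extended control-free text a page at a time.
            return line
        line += direction
    # Ran off the ends of the dialog box; clamp to a valid line.
    return max(0, min(total_lines - 1, line))
```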
[0380] As indicated by functions 9710 through 9716, if the
cursor has been moved to a field and the user provides any input of
a type that would input text into the editor, function 9712
displays a separate editor window for the field, which displays the
text currently in that field, if any. If the field has any
vocabulary limitations associated with it, functions 9714 and 9716
limit the recognition in the editor to that vocabulary. For
example, if the field were limited to state names, recognition in
that field would be so limited. As long as this field-editing
window is displayed, function 9718 will direct all editor commands
to perform editing within it. The user can exit this field-editing
window by selecting OK, which will cause the text currently in the
window at that time to be entered into the corresponding field in
the dialog box window.
[0381] If the cursor in the dialog box is moved to a choice list
and the user selects a text input command, function 9722 displays a
correction window showing the current value in the list box as the
first choice and other options provided in the list box as other
available choices shown in a scrollable choice list. In this
particular choice list, the scrollable options are not only
accessible by selecting an associated number but also are available
by speech recognition using a vocabulary limited to those
options.
[0382] If the cursor is in a check box or a radio button and the
user selects any editor text input command, functions 9724 and 9726
change the state of the check box or radio button, by toggling
whether the check box or radio button is selected.
[0383] FIG. 98 illustrates a help routine 9800 which is the cell
phone embodiment analog of the help mode described above with
regard to FIG. 19 in the PDA embodiments. When this help mode is
called when the cell phone is in a given state or mode of
operation, function 9802 displays a scrollable help menu for the
state that includes a description of the state along with a
selectable list of help options and of all of the state's commands.
FIG. 99 displays such a help menu for the editor mode described
above with regard to FIGS. 67 and 76 through 78. FIG. 100
illustrates such a help menu for the entry mode menu described
above with regard to FIG. 68 and FIGS. 79 and 80. As is shown in
FIGS. 99 and 100, each of these help menus includes a help options
selection, which can be selected by means of a scrollable highlight
and operation of the help key, which will allow the user to quickly
jump to the various portions of the help menu as well as the other
help related functions. Each help menu also includes a brief
statement, 9904, of the current command state the cell phone is in.
Each help menu also includes a scrollable, selectable menu 9906
listing all the options accessible by phone key. It also includes a
function 9908 which allows the user to access other help functions,
including a description of how to use the help function and in some
cases help about the function of different portions of the screen
that is available in the current mode.
[0384] As shown in FIG. 101, if the user in the editor mode makes a
sustained press on the menu key as indicated at 10100, the help
mode will be entered for the editor mode, causing the cell phone to
display the screen 10102. This displays the selectable help
options, 9902, and the beginning of the brief description of the
operation of the editor mode, 9900, as shown in
FIG. 99. If the user presses the right arrow key of the cell phone,
which functions as a page right button, since, in help mode, the
navigational mode is a page/line navigational mode as indicated by
the characters "<P{circumflex over ( )}L" shown in screen 10102,
the display will scroll down a page as indicated by screen 10104.
If the user presses the page right key again, the screen will again
scroll down a page, causing the screen to have the appearance shown
at 10106. In this example, the user has been able to read the
summary of the function of the editor mode 9904 shown in FIG. 99
with just two clicks of the page right key.
[0385] If the user clicks the page right key again causing the
screen to scroll down a page as is shown in the screen shot 10108,
the beginning of the command list associated with the editor mode
can be seen. The user can use the navigational keys to scroll the
entire length of the help menu if so desired. In the example shown,
when the user finds the key number associated with the entry mode
menu, he presses that key as shown at 10110 to cause the help mode
to display the help menu associated with the entry mode menu as
shown at screen 10112.
[0386] It should be appreciated that whenever the user is in a help
menu, he can immediately select the commands listed under the
"select by key" line 9910 shown in FIG. 99. Thus, there is no need
for a user to scroll down to the portion of the help menu in which
commands are listed in order to press the key associated with a
command in order to see its function. In fact, a user who thinks he
understands the function associated with the key can merely make a
sustained press of the menu key and then type the desired key to
see a brief explanation of its function and a list of the commands
that are available under it.
[0387] The commands listed under the "select by OK" line 9912 shown
in FIGS. 99 and 100 have to be selected by scrolling the highlight
to the commands line in the menu and selecting by use of the OK
command. This is because the commands listed below the line 9912
are associated with keys that are used in the operation of the help
menu itself. This is similar to the commands listed in screen 7506
of the editor mode command list shown in FIG. 75, which are also
only selectable by selection with the OK command in that command
list.
[0388] In the example of FIG. 101, it is assumed that the user
knows that the entry preference menu can be selected by pressing
the "9" key in the entry mode menu, and presses that key as soon as
he enters help for the entry mode menu as indicated by 10114. This
causes the help menu for the entry preference menu to be shown as
illustrated at 10116.
[0389] In the example, the user presses the "1" key followed by the
escape key. The "1" key briefly calls the help menu for the
dictation defaults option and the escape key returns to the entry
preference menu at the location and menu associated with the
dictation defaults option, as shown by screen 10118. Such a
selection of a key option followed by an escape allows the user to
rapidly navigate to a desired portion of the help menu's command
list merely by pressing the number of the key in that portion of
the command list followed by an escape.
[0390] In the example, the user presses the page right key as shown
at 10120 to scroll down a page in the command list as indicated by
screen 10122. In the example, it is assumed the user selects the
option associated with the "5" key, by pressing that key as
indicated at 10124 to obtain a description of the press continuous,
click discrete to utterance option. This causes a help menu for
that option to be displayed as shown in screen 10126. In the
example, the user scrolls down two more screens to read the brief
description of the function of this option and then presses the
escape key as shown at 10128 to return back to the help menu for
the entry preference menu as shown at screen 10130.
[0391] As shown in FIG. 102, in the example, when the user returns
to help for the entry preference menu, he or she selects the "4"
key as indicated by numeral 10200, which causes the help menu for
the During Press And Click To Utterance option to be displayed, as
shown at screen 10202. The user then scrolls down two more screens to read
enough of the description of this mode to understand its function
and then, as shown at 10204, escapes back up to help for the entry
preference menu as shown at screen 10206. The user then presses
escape again to return to the help menu from which the entry
preference menu had been called, which is help for the entry mode
menu as shown at screen 10210. The user presses escape again to
return to the help menu from which help for entry mode had been
called, which is the help menu for the editor mode as shown in
screen 10214.
[0392] In the example, it is assumed that the user presses the page
right key six times to scroll down to the bottom portion, 9908,
shown in FIG. 99 of the help menu for the editor mode. If the user
desires he can use a place command to access options in this
portion of the help menu more rapidly. Once in the "other help"
portion of the help menu, the user presses the down line button as
shown at 10220 to select the editor screen option 10224 shown in
the screen 10222. At this point, the user selects the OK button
causing the help for the editor screen itself to be displayed as is
shown in screen 10228. In the mode in which this screen is shown,
phone key number indicators 10230 are used to label portions of the
editor screen. If the user presses one of these associated phone
numbers, a description of the corresponding portion of the screen
will be displayed. In the example of FIG. 102, the user presses the
"4" key, which causes an editor screen help screen 10234 to be
displayed, which describes the function of the navigation mode
indicator "<W{circumflex over ( )}L" shown at the top of the
editor screen help screen 10228.
[0393] In the example, the user presses the escape key three times
as is shown at numeral 10236. The first of these escapes from the
screen 10234 back to the screen 10228, giving the user the option
to select explanations of other of the numbered portions of the
screen being described. In the example, the user has no interest in
making such other selections, and thus has followed the first press
of the escape key with two other rapid presses, the first of which
escapes back to the help menu for the editor mode and the second of
which escapes back to the editor mode itself.
[0394] As can be seen in FIGS. 101 and 102, the hierarchical
operation of help menus enables the user to rapidly explore the
command structure on the cell phone. This can be used either to
search for a command that performs a desired function, or to merely
learn the command structure in a linear order.
[0395] FIGS. 103 and 104 describe an example of a user continuously
dictating some speech in the editor mode and then using the
editor's interface to correct the resulting text output.
[0396] The sequence starts in FIG. 103 with the user making a
sustained press of the talk button as indicated at 10300 during
which he says the utterance 10302. This results in the recognition
of this utterance, which in the example causes the text shown in
screen 10304 to be displayed in the editor's text window 10305. The
numeral 10306 points to the position of the cursor at the end of
this recognized text, which is a non-selection cursor at the end of
the continuous dictation.
[0397] It is assumed that the system has been set in a mode that
will cause the utterance to be recognized using continuous large
vocabulary speech recognition. This is indicated by the characters
"_LV" 10306 in the title bar of the editor window shown in screen
10304.
[0398] In the example, the user presses the "3" key to access the
added navigation menu illustrated in FIGS. 70 and 84 and then
presses the "1" button to select the utterance option shown in
those figures. This makes the cursor correspond to the first word of
the text recognized for the most recent utterance as indicated at
10308 in screen 10310. Next, the user double-clicks the "7" key to
select the capitalized cycle function described in FIG. 77. This
causes the selected word to be capitalized as shown at 10312.
[0399] Next, the user presses the right button, which in the
current word/line navigational mode, indicated by the navigational
mode indicator 10314, functions as a word right button. This causes
the cursor to move to the next word to the right, 10316. Next the
user presses the "5" key to set the editor to an extended selection
mode as described above with regard to functions 7728 through 7732
of FIG. 77. Then the user presses the word right again, which
causes the cursor to move to the word 10318 and the extended
selection 10320 to include the text "got it".
[0400] Next, the user presses the "2" key to select the choice list
command of FIG. 77, which causes a correction window 10322 to be
displayed for the selection 10320 as the first choice and with a
first alphabetically ordered choice list displayed at 10324. In
this choice list, each choice is shown with an associated
phone key number that can be used to select it.
[0401] In the example, it is assumed that the desired choice is not
shown in the first choice list, so the user presses the right key
three times to scroll down to the third screen of the second
alphabetically ordered choice list, shown in 10328, in which the
desired word "product" is located.
[0402] As indicated by function 7706 in FIG. 77, when the user
enters the correction window by a single press of the choice list
button, the correction window's navigation is set to the
page/item navigational mode, as is indicated by the navigational
mode indicator 10326 shown in screen 10332.
[0403] In the example, the user presses the "6" key to select the
desired choice, which causes it to be inserted into the editor's
text window at the location of the cursor selection, causing the
editor text window to appear as shown at 10330.
[0404] Next, the user presses the word right key three times to
place the cursor at the location 10332. In this case, the
recognized word is "results" and a desired word is the singular
form of that word "result." For this reason, the user presses the
word form list button, which causes a word form list correction
window, 10334, to be displayed, which has the desired alternate
form as one of its displayed choices. The user then selects the
desired choice by pressing its associated phone key, causing the
editor's text window to have the appearance shown at 10336.
[0405] As shown in FIG. 104, the user next presses the line down
button to move the cursor down to the location 10400. The user then
presses the "5" key to start an extended selection and presses the
word right key to move the cursor right one word to the location 10402,
causing the current selection 10404 to be extended rightward by one
word.
[0406] Next, the user double-clicks the "2" key to select a filter
choices option described above with regard to function 7712 through
7716, in FIG. 77. The second click of the "2" key is an extended
click, as indicated by the down arrow 10406. During this extended
press, the user continuously utters the letter string, "p, a, i, n,
s, t," which are the initial letters of the desired word,
"painstaking."
[0407] In the example, it is assumed that the correction window is
in the continuous letter name recognition mode as indicated by the
characters "_abc" in the title bar of the correction window 10412.
[0408] In the example, the recognition of the utterance 10408 as
filter input causes the correction window 10412 to show a set of
choices that have been filtered against an ambiguous length filter
corresponding to the recognition results from the recognition of
that continuously spoken string of letter names. The correction
window has a first choice, 10414, that starts with one of the
character sequences associated with the ambiguous filter element.
The portion of the first choice that corresponds to a sequence of
characters associated with the ambiguous filter is indicated by the
ambiguous filter indicator 10416. The filter cursor, 10418, is
located after the end of this portion of the first choice.
[0409] At this point, the user presses the word right key which,
due to the operation of functions 8124 and 8126 at FIG. 81, causes
a filter cursor to be moved to and to select the first character,
10420, of the current word. Functions 8151 and 8162 of FIG. 81
cause a filter character choice window, 10422, to be displayed.
Since the desired character is a "p," the user presses the "7" key
to choose it, which causes that character to be made an unambiguous
character of the filter string, and causes a new correction window,
10424, to be displayed as a result of that change in the
filter.
[0410] Next, the user presses the character down button four times,
which due to the operation of function 8150 in FIG. 81, causes the
filter cursor's selection to be moved four characters to the right
in the first choice, which in the example is the letter "f," 10426.
Since this is a portion of the first choice that still corresponds
to the ambiguous portion of the filter string, as indicated by the
ambiguous filter marker 10428, the call to filter character choice
in line 8152 of FIG. 81 will cause another character choice window
to be displayed, as shown.
[0411] In the example, the desired character, the letter "s," is
associated with the "5" phone key in the choice list, and the user
presses that key to cause the correct character, 10430, to be
inserted into the current filter string and all the characters
before it to be unambiguously confirmed, as indicated by numeral
10432.
[0412] At this time, the correct choice is shown associated with
the phone key "6," and the user presses that phone key to cause the
desired word to be inserted into the editor's text window as shown
at 10434.
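The ambiguous filtering described in the preceding paragraphs can be illustrated with a small sketch. Here each filter position holds the set of letters the recognizer considered plausible at that position, and a confirmed (unambiguous) position holds a single letter. All names and data structures are assumptions for illustration; the application does not give an implementation.

```python
def matches_filter(word, filter_sets):
    """True if the word's leading characters are consistent with the filter."""
    if len(word) < len(filter_sets):
        return False
    return all(ch in allowed for ch, allowed in zip(word, filter_sets))

def filter_choices(vocabulary, filter_sets):
    """Candidates surviving the (possibly ambiguous) filter string."""
    return [w for w in vocabulary if matches_filter(w, filter_sets)]

def confirm(filter_sets, index, letter):
    """Make one position unambiguous, as choosing from a filter
    character choice window does; earlier code could likewise confirm
    all positions before it."""
    new = list(filter_sets)
    new[index] = {letter}
    return new
```

For example, with an ambiguous first position of {"p", "f"} for the spoken letters "p, a, i, n", both "painstaking" and "faint" survive until the user confirms the "p" from a character choice window.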
[0413] Next, in the example, the user presses the line down and
word right keys to move the cursor selection down a line and to the
right so as to select the text "period" shown at 10436. The user
then presses the "8," or word form list key, which causes a word
form list correction window 10438, to be displayed. The desired
output, a period mark, is associated with the "4" phone key. The
user presses that key and causes the desired output to be inserted
into the text of the editor window as shown at 10440.
[0414] FIG. 105 illustrates how the user can scroll a choice list
horizontally by operation of functions 8132 and 8135 described
above with regard to FIG. 81.
[0415] FIG. 106 illustrates how the Key Alpha recognition mode can
be used to enter alphabetic input into the editor's text window.
Screen 10600 shows an editor text window in which the cursor 10602
is shown. In this example, the user presses the "1" key to open the
entry mode menu described above with regard to FIGS. 79 and 68,
resulting in the screen 10604. Once in this mode, the user
double-clicks the "3" key to select the Key Alpha recognition mode
described above with regard to function 7938 of FIG. 79. This
causes the system to be set to the Key Alpha mode described above
with regard to FIG. 86, and the editor window to display the prompt
10606 shown in FIG. 106.
[0416] In the example, the user makes an extended press of the
phone key as indicated at 10608, which causes a prompt window,
10610 to display the ICA words associated with each of the letters
on the phone key that has been pressed. In response, the user makes
the utterance "charley," 10612. This causes the corresponding
letter "c" to be entered into the text window at the former
position of the cursor and causes the text window to have the
appearance shown in screen 10614.
[0417] In the example, it is next assumed that the user presses the
talk key while continuously uttering two ICA words, "alpha" and
"bravo" as indicated at 10616. This causes the letters "a" and "b"
associated with these two ICA words to be entered into the text
window at the cursor as indicated by screen 10618. Next in the
example, the user presses the 8 key, is prompted to say one of the
three ICA words associated with that key, and utters the word
"uniform" to cause the letter "u" to be inserted into the editor's
text window as shown at 10620.
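The Key Alpha behavior just described can be sketched with a key-to-ICA-word table. The table below follows the standard telephone keypad letter layout and is inferred from the examples in the text ("charley" on the "2" key, "uniform" on the "8" key); the exact word list and spellings are assumptions, not quoted from the application.

```python
# Assumed ICA (International Communication Alphabet) words per phone key.
ICA_WORDS = {
    "2": ["alpha", "bravo", "charley"],
    "3": ["delta", "echo", "foxtrot"],
    "4": ["golf", "hotel", "india"],
    "5": ["juliet", "kilo", "lima"],
    "6": ["mike", "november", "oscar"],
    "7": ["papa", "quebec", "romeo", "sierra"],
    "8": ["tango", "uniform", "victor"],
    "9": ["whiskey", "x-ray", "yankee", "zulu"],
}

def prompt_for_key(key):
    """Words shown in the prompt window during an extended key press."""
    return ICA_WORDS[key]

def key_alpha_letter(spoken_word):
    """Return the letter entered when an ICA word is recognized."""
    for words in ICA_WORDS.values():
        if spoken_word in words:
            return spoken_word[0]  # each ICA word begins with its letter
    raise ValueError(f"not an ICA word: {spoken_word}")
```

Pressing a key narrows recognition to a handful of acoustically distinct words, which is what makes this mode reliable in noisy conditions.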
[0418] FIG. 107 provides an illustration of the same Key Alpha
recognition mode being used to enter alphabetic filtering input. It
shows that the Key Alpha mode can be entered when in the correction
window by pressing the "1" key followed by a double-click on the
"3" key in the same way it can be from the text editor, as shown in
FIG. 106.
[0419] FIGS. 108 and 109 show how a user can use the interface of
the voice recognition text editor described above to address and to
enter and correct text and e-mails in the cell phone
embodiment.
[0420] In FIG. 108, screen 10800 shows the e-mail option screen
which a user accesses if he selects the e-mail option by
double-clicking on the "4" key when in the main menu, as
illustrated in FIG. 66.
[0421] In the example shown, it is assumed that the user wants to
create a new e-mail message and thus selects the "1" option. This
causes a new e-mail message window, 10802, to be displayed with the
cursor located at the first editable location in that window. This
is the first character in the portion of the e-mail message
associated with the addressee of the message. In the example, the
user makes an extended press of the talk button and utters the name
"Dan Roth" as indicated by the numeral 10804.
[0422] In the example, this causes the slightly incorrect name,
"Stan Roth," to be inserted into the message's addressee line as
shown at 10806. The user responds by pressing the "2" key to select
a choice list, 10806, for the selection. In the example, the
desired name is shown on the choice list and the user presses the
"5" key to select it, causing the desired name to be inserted into
the addressee line as shown at 10808.
[0423] Next, the user presses the down line button twice to move
the cursor down to the start of the subject line, as shown in
screen 10810. The user then presses the talk button while saying
the utterance "cell phone speech interface," 10812. In the example,
this is slightly mis-recognized as "sell phone speech interface,"
and this text is inserted at the cursor location on the subject
line to cause the e-mail edit window to have the appearance shown
at 10814. In response, the user presses the line up button and the
word left button to position the cursor selection at the position
10816. The user then presses the "8" key to cause a word form list
correction window, 10818, to be displayed. In the example, the
desired output is associated with the "4" key. The user presses
that key, causing the desired output to be placed in the cursor's
position as indicated in screen 10820.
[0424] Next, the user presses the line down button twice to place
the cursor at the beginning of the body portion of the e-mail
message as shown in screen 10822. Once this is done, the user
presses the talk button while continuously saying the utterance
"the new Elvis interface is working really well". This causes the
somewhat mis-recognized string, "he knew elfish interface is
working really well", to be inserted at the cursor position as
indicated by screen 10824.
[0425] In response, the user presses the line up key once and the
word left key twice to place the cursor in the position shown by
screen 10900 of FIG. 109. The user then presses the "5" key to
start an extended selection and presses the word left key twice to
place the cursor at the position 10902 and to cause the selection
to be extended as is shown by 10904. At this point, the user
double-clicks on the "2" key to enter the correction window, 10906,
for the current selection and, during that press, continuously says
the characters "t, h, e, space, n". This causes a new correction
window, 10908, to be displayed with unambiguous filter 10910
corresponding to the continuously entered letter-name character
sequence.
[0426] Next, the user presses the word right key, which moves the
filter cursor to the first character of the next word to the right,
as indicated by numeral 10912. The user then presses the "1" key to
enter the entry mode menu and presses the "3" key to select the
AlphaBravo, or ICA word, input vocabulary. During the
continuation of the press of the "3" key, the user says the
continuous utterance, "echo, lima, victor, india, sierra" 10914.
This is recognized as the letter sequence "ELVIS," which is inserted,
starting with the prior filter cursor position, into the first
choice window of the correction window, 10916. In the example
shown, it is assumed that AlphaBravo recognition is treated as
unambiguous because of its reliability, causing the entered
characters and all the characters before it in the first choice
window to be treated as unambiguously confirmed, as is indicated by
the unambiguous confirmation indication 10918 shown in screen
10916.
[0427] In the example, the user presses the "OK" key to select the
current first choice because it is the desired output.
[0428] FIG. 110 illustrates how re-utterance can be used to help
obtain the desired recognition output. It starts with the
correction window in the same state as was indicated by screen
10906 of FIG. 109. But in the example of FIG. 110, the user
responds to the screen by pressing the "1" key twice, once to enter
the entry menu mode, and a second time to select a large vocabulary
recognition. As indicated by function 7908 through 7914 in FIG. 79,
if large vocabulary recognition is selected in the entry mode menu
when a correction window is displayed, the system interprets this
as an indication that the user wants to perform a re-utterance,
that is, to add a new utterance for the desired output into the
utterance list for use in helping to select the desired output. In
the example, the user continues the second press of the "1" key
while using discrete speech to say the three words "the," "new,"
"Elvis" corresponding to the desired output. In the example above,
it is assumed that the additional discrete utterance information
provided by this new utterance list entry causes the system to
correctly recognize the first two of the three words. In the
example it is assumed that the third of the three words is not in
the current vocabulary, which will require the user to spell that
third word with filtering input, such as was done by the utterance
10914 in FIG. 109.
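One way such re-utterance recognition could combine evidence is to merge the candidate scores from the original and the new utterance, so a candidate that scores well in both rises to the top. This is a hedged sketch of that idea only; the application does not specify the combination rule, and multiplying per-utterance probabilities is an assumption.

```python
def combine_utterance_scores(score_lists):
    """Pick the best candidate across several utterances of the same text.

    score_lists holds one {candidate: probability} dict per utterance.
    """
    combined = {}
    for scores in score_lists:
        for cand, p in scores.items():
            combined[cand] = combined.get(cand, 1.0) * p
    # Note: a candidate absent from one utterance keeps only the scores it
    # has; a fuller model would assign missing candidates a small floor
    # probability rather than leave them unpenalized.
    return max(combined, key=combined.get)
```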
[0429] FIG. 111 illustrates how the editor functionality can be
used to enter a URL text string for purposes of accessing a desired
web page on a Web browser that is part of the cell phone's
software.
[0430] The browser option screen, 11100, shows the screen that is
displayed if the user selects the Web browser option associated
with the "7" key in the main menu, as indicated on FIG. 66. In the
example, it is assumed that the user desires to enter the URL of a
desired web site and selects the URL window option associated with
the "1" key by pressing that key. This causes the screen 11102 to
display a brief prompt instructing the user. The user responds by
using continuous letter-name spelling to spell the name of a
desired web site during a continuous press of the talk button. In
the embodiment shown, the URL editor is always in correction mode
so that the recognition of the utterance, 11103, causes a
correction window, 11104, to be displayed. The user then uses
filter string editing techniques of the type which have been
described above to correct the originally mis-recognized URL to the
desired spelling as indicated at screen 11106, at which time he
selects the first choice, causing the system to access the desired
web site.
[0431] FIGS. 112 through 114 illustrate how the editor interface
can be used to navigate and enter text into the fields of Web
pages.
[0432] Screen 11200 illustrates the appearance of the cell phone's
Web browser when it first accesses a new web site. A URL field,
11201, is shown before the top of the web page, 11204, to help the
user identify the current web page. This position can be scrolled
back to at any time if the user wants to see the URL of the
currently displayed web page. When web pages are first entered,
they are in a document/page navigational mode in which pressing the
left and right keys will act like the page back and page forward
controls on most Web browsers. In this case, the word "document" is
substituted for "page" because the word "page" is used in other
navigational modes to refer to a screen full of media on the cell
phone display. If the user presses the up or down keys, the web
page's display will be scrolled by a full display page (or
screen).
[0433] FIG. 116 illustrates how the cell phone embodiment shown
allows a special form of correction window to be used as a list box
when editing a dialog box of the type described above with regard
to FIG. 115.
[0434] The example of FIG. 116 starts from the find dialog box
being in the state shown at screen 11504 in FIG. 115. From this
state, the user presses the down line key twice to place the cursor
in the "In:" list box, which defines in which portions of the cell
phone's data the search conducted in response to the find dialog
box is to take place. When the user presses the talk button with
the cursor in this window, a list box correction window, 11512, is
displayed that shows the current selection in the list box as the
current first choice and provides a scrollable list of the other
list box choices, with each such other choice being shown with an
associated phone key number. The user could scroll through this
list and choose the desired choice by phone key number or by using
a highlighted selection. In the example, the user continues the
press of the talk key and says the desired list box value with the
utterance, 11514. In list box correction windows, the active
vocabulary is substantially limited to list values. With such a
limited vocabulary, correct recognition is fairly likely, as is
indicated in the example, where the desired list value is the first
choice. The user responds by pressing the OK key, which causes the
desired list value to be placed in the list box of the dialog box
as is indicated, 11518.
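The effect of limiting the active vocabulary to list values can be sketched as below. The scoring function here (a simple string-similarity ratio) is a stand-in for a real acoustic match, and all names are illustrative assumptions, not taken from the application.

```python
# Illustrative sketch of restricting recognition to a list box's
# values: only the list values are ranked, so correct recognition is
# likely. difflib's similarity ratio stands in for an acoustic score.

from difflib import SequenceMatcher

def recognize_from_list(utterance_text, list_values):
    """Rank list values best-first against the (simulated) utterance,
    as a list box correction window would display them."""
    score = lambda v: SequenceMatcher(None, utterance_text.lower(),
                                      v.lower()).ratio()
    return sorted(list_values, key=score, reverse=True)

in_choices = ["All data", "Contacts", "Notes outline", "Recorded audio"]
print(recognize_from_list("notes outline", in_choices)[0])  # Notes outline
```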
[0435] FIG. 117 illustrates a series of interactions between a user
and the cell phone interface, which display some of the functions
that the interface allows the user to perform when making phone
calls.
[0436] The screen 6400 at FIG. 117 is the same top-level phone mode
screen described above with regard to FIG. 64. If, when it is
displayed, the user selects the last navigation button, which is
mapped to the name dial command, the system will enter the name dial
mode, the basic functions of which are illustrated in the
pseudocode of FIG. 119. As can be seen from that figure, this mode
allows a user to select names from a contact list by saying them,
and, if there is a mis-recognition, to correct it by alphabetic
filtering and by selecting choices from a potentially scrollable
choice list in a correction window similar to those described
above.
[0437] When the cell phone enters the name dial mode, an initial
prompt screen, 11700, is shown as indicated in FIG. 117. In the
example, the user utters a name, 11702, during the pressing of the
talk key. In name dial, such utterances are recognized with the
vocabulary automatically limited to the name vocabulary, and the
resulting recognition causes a correction window, 11704, to be
displayed. In the example, the first choice is correct, so the user
selects the "OK" key, causing the phone to initiate a call to the
phone number associated with the named party in the user's contact
list.
[0438] When the phone call is connected, a screen, 11706, is
displayed having the same ongoing call indicator, 7414, described
above with regard to FIG. 75. At the bottom of the screen, as
indicated by the numeral 11708, an indication is given of the
functions associated with each of the navigation keys during the
ongoing call. In the example, the user selects the down button,
which is associated with the same Notes function described above
with regard to FIG. 64. In response, an editor window, 11710, is
displayed for the Notes outline, with an automatically created
heading item, 11712, added to the Notes outline for the current
call, labeling the party to whom the call is made and its start
and, ultimately, its end time. A cursor, 11714, is then placed at a
new item indented under the call's heading.
[0439] In the example, the user says a continuous utterance, 11714,
during the pressing of the talk button, causing recognized text
corresponding to that utterance to be inserted into the notes
outline at the cursor, as indicated in screen 11716. Then the user
double-clicks the "6" key to start recording, which causes an audio
graphic representation of the sound to be placed in the notes
editor window at the current location of the cursor. As indicated
at 11718, audio from portions of the phone call in which the cell
phone operator is speaking is underlined in the audio graphics to
make it easier for the user to keep track of who has been talking,
and for how long, in the call and, if desired, to better search for
portions of the recorded audio in which one or the other of the
phone call's two parties was speaking.
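The underlining of operator speech in the audio graphics might be represented as below. This is a hypothetical rendering sketch; the segment representation and the characters used for the bar are assumptions made purely for illustration.

```python
# Hypothetical model of the audio-graphics bar of screen 11718:
# recorded audio as a sequence of (seconds, is_operator) segments,
# with operator speech rendered distinctly ('=', standing in for the
# underlined portions of the display) from the other party ('-').

def render_audio_graphic(segments, chars_per_sec=1):
    """Return a one-line text bar for a list of (seconds, is_operator)
    segments, marking who was speaking during each stretch."""
    return "".join(("=" if is_operator else "-") * int(sec * chars_per_sec)
                   for sec, is_operator in segments)

call = [(3, False), (2, True), (4, False), (1, True)]
print(render_audio_graphic(call))  # ---==----=
```

Such a representation would also support searching for portions of the recording in which a particular party was speaking, as the text describes.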
[0440] In the example of FIG. 117, the user next double-clicks on
the star key to select the task list. This shows a screen, 11720,
that lists the currently open tasks on the cell phone. In the
example, the user selects the task associated with the "4" phone
key, which is another notes editor window displaying a different
location in the notes outline. In response, the phone's display
shows a screen, 11722, of that portion of the notes outline.
[0441] In the example, the user presses the up key three times to
move the cursor to location 11724 and then presses the "6" key to
start playing the sound associated with the audio graphics
representation at the cursor, as indicated by the motion between
the cursors of screens 11726 and 11728.
[0442] Unless the play only to me option, 7513, described above
with regard to FIG. 75 is on, the playback of the audio in screen
11728 will be played to both sides of the current phone call,
enabling the user of the cell phone to share audio recording with
the other party during the cell phone call.
[0443] FIG. 118 illustrates that when an edit window is recording
audio, such as is shown in screen 11717 near the bottom middle of
FIG. 117, the user can turn on speech recognition during the
recording of such an audio to cause the audio recorded during that
portion to also have speech recognition performed upon it. In the
example shown during the recording shown in screen 11717, the user
presses the talk button and speaks the utterance, 11800. This
causes the text associated with that utterance, 11802, to be
inserted in the editor window, 11806. Audio recorded after the
duration of the recognition is recorded merely with audio graphics.
Normally this would be used in a manner in which the user tries
to speak clearly during an utterance, such as the utterance 11800,
which is to be recognized, and then feels free to talk more
casually during portions of conversation or dictation that are
being recorded only with audio. Normally audio is recorded in
association with speech recognition so that the user can later go
back, listen to, and correct any dictation, such as the dictation
11802, that was incorrectly recognized during a recording.
[0444] FIG. 119 illustrates how the system enables the user to
select a portion of audio, such as the portion 11900 shown in that
figure by a combination of the extended selection key and play or
navigation keys, and then to select the recognized audio dialog box
discussed above with regard to functions 9000 through 9014 of FIG.
90 to have the selected audio recognized as indicated at 11902. In
the example of FIG. 119, the user has selected the show recognized
audio option, 9026, shown in FIG. 90, which causes the recognized
text, 11902, to be underlined, indicating that it has a playable
audio associated with it.
[0445] FIG. 120 illustrates how a user can select a portion,
12,000, of recognized text that has associated recorded audio, and
then select to have that text stripped from its associated
recognized audio by selecting the option 9024, shown in FIG. 90, in
a submenu under the editor options menu. This leaves just the
audio, 12002, and its corresponding audio graphic representation,
remaining in the portion of media where the recognized text
previously stood.
[0446] FIG. 121 illustrates how the function 9020 of FIG. 90,
under the audio menu of the edit options menu, allows the user to
strip the recognition audio that has been associated with a
portion, 12100, of recognized text from that text, as indicated at
12102 in FIG. 121.
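The pairing of recognized text with its recorded audio, and the two strip operations just described, can be sketched with a simple data structure. The class and field names below are illustrative assumptions; the application does not specify an implementation.

```python
# Illustrative sketch of a media item that pairs recognized text with
# its recorded audio, supporting the two strip operations described
# above (options 9024 and 9020 of FIG. 90).

from dataclasses import dataclass
from typing import Optional

@dataclass
class MediaItem:
    text: Optional[str]       # recognized text, if any
    audio_ref: Optional[str]  # handle to the recorded audio, if any

    def strip_text(self):
        """Keep only the audio and its audio-graphic representation."""
        self.text = None

    def strip_audio(self):
        """Keep only the recognized text."""
        self.audio_ref = None

item = MediaItem(text="call Fred tomorrow", audio_ref="clip_12000")
item.strip_audio()
print(item.text, item.audio_ref)  # call Fred tomorrow None
```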
[0447] FIGS. 122 through 125 provide illustrations of the operation
of the digit dial mode described in pseudocode in FIG. 126. If
the user selects the digit dial mode, such as by pressing the "2"
phone key when in the main menu, as illustrated at function 6552 of
FIG. 65, or by selecting the left navigational button when the
system is in the top-level phone mode shown in screen 6400 of FIG.
64, the system will enter the digit dial mode shown in FIG. 126
and will display a prompt screen, 12202, which prompts the user to
say a phone number. When the user says an utterance of a phone
number, as indicated at 12204, that utterance will be recognized.
If the system is quite confident that the recognition of the phone
number is correct, it will automatically dial the recognized phone
number as indicated at 12206. If the system is not that confident
of the phone number's recognition, it will display a correction
window 12208. If the correction window has the desired number as
the first choice as is indicated 12210, the user can merely select
it by pressing the OK key, which causes the system to dial the
number as indicated at 12212. If the correct choice is on the first
choice list, as is indicated at 12214, the user can merely press the
phone key number associated with that choice, causing the system to
dial the number as is indicated at 12216.
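The confidence-based decision just described can be sketched as follows. The threshold value and all names here are assumptions for illustration; the application does not specify how confidence is computed or what level counts as "quite confident."

```python
# Illustrative sketch of the digit dial decision: dial automatically
# when recognition confidence is high, otherwise show a correction
# window. The threshold is an assumed value, not taken from the text.

CONFIDENCE_THRESHOLD = 0.90  # illustrative value

def handle_digit_dial(choices):
    """choices: list of (phone_number, confidence) pairs, best first."""
    number, confidence = choices[0]
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("dial", number)                          # as at 12206
    return ("correction_window", [n for n, _ in choices])  # as at 12208

print(handle_digit_dial([("6175551234", 0.97), ("6175551284", 0.02)]))
```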
[0448] If the correct number is neither the first choice nor in the
first choice list as indicated in the screen 12300, shown at the
top of FIG. 123, the user can check to see if the desired number is
on one of the screens of the second choice list by either
repeatedly pressing the page down key as indicated by the number
12302, or repeatedly pressing the item down key as is indicated at
12304. If by scrolling through the choice list in either of these
methods the user sees the desired number, the user can select it
either by pressing its associated phone key or by moving the choice
highlight to it and then pressing the OK key. This will cause the
system to dial the number as indicated at screen 12308. It should
be appreciated that because the phone numbers in the choice list
are numerically ordered, the user is able to find the desired
number rapidly by scrolling through the list. In the embodiment
shown in these figures, digit change indicators, 12310, are
provided to indicate the digit column of the most significant digit
by which any choice differs from the choice ahead of it on the
list. This makes it easier for the eye to scan for the desired
phone number.
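The digit change indicators, 12310, can be computed as sketched below: for each choice after the first in the numerically ordered list, find the column of the most significant digit by which it differs from the choice directly above it. The function name is an illustrative assumption.

```python
# Illustrative sketch of computing the digit change indicators (12310)
# for a numerically ordered choice list of phone numbers.

def digit_change_columns(numbers):
    """For each number after the first, return the 0-based column of
    the leftmost digit differing from the previous number (None if
    the two strings are identical)."""
    cols = []
    for prev, cur in zip(numbers, numbers[1:]):
        col = next((i for i, (a, b) in enumerate(zip(prev, cur))
                    if a != b), None)
        cols.append(col)
    return cols

choices = ["6175551234", "6175551334", "6175559334"]
print(digit_change_columns(choices))  # [7, 6]
```

Marking these columns on the display lets the eye skip the shared leading digits and scan only the portions that change from one choice to the next.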
[0449] FIG. 124 illustrates how the digit dial mode allows the user
to navigate to a digit position in the first choice and correct any
error that exists within it. In FIG. 124, this is done by speaking
the desired number, but the user is also allowed to correct the
desired number by pressing the appropriate phone key.
[0450] As illustrated in FIG. 125, the user is also able to edit a
misperceived phone number by inserting a missing digit as well as
by replacing a mis-recognized one.
[0451] The invention described above has many aspects that can be
used for the entering and correcting of speech recognition as well
as other forms of recognition on many different types of computing
platforms, including all those shown in FIGS. 3 through 8. Many of
the features of the invention described with regard to FIG. 94 can
be used in situations where a user desires to enter and/or edit
text without having to pay close visual attention to those tasks.
For example, this could allow a user to listen to e-mail and
dictate responses while walking in a park, without the need to look
closely at his cell phone or other dictation device. One particular
environment in which such audio feedback is useful for speech
recognition and other control functions, such as phone dialing and
phone control, is in an automotive arena, such as is illustrated in
FIG. 126.
[0452] In the embodiment shown in FIG. 126, the car has a
computer, 12600, which is connected to a cellular wireless
communication system, 12602, and to the car's audio system, 12604. In
many embodiments, the car's electronic system will have a short
range wireless transceiver, such as a Bluetooth or other short
range transceiver, 12606. This can be used to communicate with a
wireless headphone, 12608, or the user's cell phone, 12610, so that
the user can have the advantage of accessing information stored on
his normal cell phone while using his car.
[0453] Preferably, the cell phone/wireless transceiver, 12602, can
be used not only to send and receive cell phone calls but also to
send and receive e-mail, digital files, such as text files that can
be listened to and edited with the functionality described above,
and audio Web pages.
[0454] Many of the functions described above with regard to the
shown cell phone embodiment can be controlled by a phone keypad,
12612, which is preferably located in a position, such as on the
steering wheel of the automobile, that will enable a user to
access its keys without unduly distracting him from the driving
function. In fact, with a keypad having a location similar to that
shown in FIG. 126, a user can keep the fingers of one hand around
the rim of the steering wheel while selecting keypad buttons with
the thumb of the same hand. In such an embodiment, the system would
preferably have the TTS keys function described above with regard
to 9404 through 9414 of FIG. 94 to enable the user to determine
which key he is pressing, and the function of that key, without
having to look at the keypad. In other embodiments, a touch
sensitive keypad that responds to a mere touching of its phone keys
by providing such information could also be provided, which would
be even easier and more rapid to use.
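The TTS keys behavior described above might be sketched as follows. The key-to-function mapping and all names here are illustrative assumptions, and the speak() function is a placeholder for a real text-to-speech call.

```python
# Illustrative sketch of TTS key feedback: when the feature is on,
# pressing a key first announces the key's name and current function,
# so the driver need not look at the keypad.

def speak(text):
    print("TTS:", text)  # placeholder for an actual text-to-speech call

# Hypothetical mapping of phone keys to their current functions.
KEY_FUNCTIONS = {"1": "URL window", "2": "digit dial", "6": "record audio"}

def on_key_press(key, tts_keys_on=True):
    """Announce the key and its function, then return that function."""
    function = KEY_FUNCTIONS.get(key)
    if tts_keys_on:
        speak(f"key {key}: {function if function else 'unassigned'}")
    return function

on_key_press("6")  # announces "key 6: record audio"
```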
[0455] FIGS. 127 and 128 illustrate that most of the capabilities
described above with regard to the cell phone embodiment can be
used on other types of phones, such as on the cordless phone shown
in FIG. 127 or on the landline phone indicated in FIG. 128.
[0456] It should be understood that the foregoing description and
drawings are given merely to explain and illustrate, and that the
invention is not limited thereto except insofar as the
interpretation of the appended claims is so limited. Those skilled
in the art who have the disclosure before them will be able to make
modifications and variations therein without departing from the
scope of the invention.
[0457] The invention of the present application, as broadly
claimed, is not limited to use with any one type of operating
system, computer hardware, or computer network, and, thus, other
embodiments of the invention could use differing software and
hardware systems.
[0458] Furthermore, it should be understood that the program
behaviors described in the claims below, like virtually all program
behaviors, can be performed by many different programming and data
structures, using substantially different organization and
sequencing. This is because programming is an extremely flexible
art in which a given idea of any complexity, once understood by
those skilled in the art, can be manifested in a virtually
unlimited number of ways. Thus, the claims are not meant to be
limited to the exact functions and/or sequence of functions
described in the figures. This is particularly true since the
pseudo-code described in the text above has been highly simplified
to let it more efficiently communicate that which one skilled in
the art needs to know to implement the invention without burdening
him or her with unnecessary details. In the interest of such
simplification, the structure of the pseudo-code described above
often differs significantly from the structure of the actual code
that a skilled programmer would use when implementing the
invention. Furthermore, many of the programmed behaviors that are
shown being performed in software in the specification could be
performed in hardware in other embodiments.
[0459] In the many embodiments of the invention discussed above,
various aspects of the invention are shown occurring together which
could occur separately in other embodiments of those aspects of the
invention.
[0460] It should be appreciated that the present invention extends
to methods, apparatus, systems, and programming recorded in
machine-readable form, for all the features and aspects of the
invention which have been described in this application as filed,
including its specification, its drawings, and its original
claims.
* * * * *