U.S. patent application number 12/794,896 was filed with the patent office on 2010-06-07 and published on 2011-07-28 as publication number 20110184740, titled "Integration of Embedded and Network Speech Recognizers". This patent application is currently assigned to Google Inc. Invention is credited to William J. Byrne and Alexander Gruenstein.

United States Patent Application 20110184740
Kind Code: A1
GRUENSTEIN, Alexander; et al.
July 28, 2011

Integration of Embedded and Network Speech Recognizers
Abstract
A method, computer program product, and system are provided for
performing a voice command on a client device. The method can
include translating, using a first speech recognizer located on the
client device, an audio stream of a voice command to a first
machine-readable voice command and generating a first query result
using the first machine-readable voice command to query a client
database. In addition, the audio stream can be transmitted to a
remote server device that translates the audio stream to a second
machine-readable voice command using a second speech recognizer.
Further, the method can include receiving a second query result
from the remote server device, where the second query result is
generated by the remote server device using the second
machine-readable voice command and displaying the first query
result and the second query result on the client device.
Inventors: GRUENSTEIN, Alexander (Mountain View, CA); Byrne, William J. (Davis, CA)
Assignee: Google Inc. (Mountain View, CA)
Family ID: 44309629
Appl. No.: 12/794,896
Filed: June 7, 2010

Related U.S. Patent Documents: Provisional Application No. 61/298,251, filed Jan. 26, 2010

Current U.S. Class: 704/275; 704/E21.019
Current CPC Class: G10L 15/32 (20130101); G10L 15/30 (20130101)
Class at Publication: 704/275; 704/E21.019
International Class: G10L 21/06 (20060101)
Claims
1. A method for performing a voice command on a client device,
comprising: translating, using a first speech recognizer located on
the client device, an audio stream of a voice command to a first
machine-readable voice command; generating a first query result
using the first machine-readable voice command to query a client
database; transmitting the audio stream to a remote server device
that translates the audio stream to a second machine-readable voice
command using a second speech recognizer; receiving a second query
result from the remote server device, wherein the second query
result is generated by the remote server device using the second
machine-readable voice command to query a remote server database;
and displaying the first query result and the second query result
on the client device.
2. The method of claim 1, further comprising: storing at least a
portion of the first and second query results on the client
device.
3. The method of claim 2, further comprising retrieving the stored
first and second query results when translation of a subsequent
voice command is determined to be substantially similar to the
translated voice command that generated the first and second query
results.
4. The method of claim 3, further comprising: transmitting to the
remote server device a second audio stream associated with the
subsequent voice command; translating the second audio stream to a
third machine-readable voice command using the second speech
recognizer; receiving a third query result from the remote server
device, wherein the third query result is generated from a
subsequent query made to the server database based on the third
machine-readable voice command; and displaying the first, second,
and third query results on the client device.
5. The method of claim 2, further comprising identifying which
portion of the first and second query results to store, the
identification comprising: receiving a user selection of an item of
interest from a list of items returned as part of the second query
result.
6. The method of claim 1, wherein generating the first query result
comprises transmitting the audio stream to the second speech
recognizer such that the query made to the remote server database
based on the second machine-readable voice command occurs during a
time period that overlaps when the query is made to the client
database based on the first machine-readable voice command.
7. The method of claim 1, wherein transmitting the audio stream
comprises transmitting a compressed audio stream of the voice
command from the client device to the server device.
8. The method of claim 1, wherein displaying the first and second
query results comprises displaying the first result and a first
subset of the second query result at a first time instance and the
first result, the first subset of the second query result, and a
second subset of the second query result at a second time
instance.
9. A computer program product comprising a computer-usable medium
having computer program logic recorded thereon for enabling a
processor to perform a voice command on a client device, the
computer program logic comprising: first computer readable program
code that enables a processor to translate, using a first speech
recognizer located on the client device, an audio stream of a voice
command to a first machine-readable voice command; second computer
readable program code that enables a processor to generate a first
query result using the first machine-readable voice command to
query a client database; third computer readable program code that
enables a processor to transmit the audio stream to a remote server
device that translates the audio stream to a second
machine-readable voice command using a second speech recognizer;
fourth computer readable program code that enables a processor to
receive a second query result from the remote server device,
wherein the second query result is generated by the remote server
device using the second machine-readable voice command to query a
remote server database; and fifth computer readable program code
that enables a processor to display the first query result and the
second query result on the client device.
10. The computer program product of claim 9, further comprising:
sixth computer readable program code that enables a processor to
store at least a portion of the first and second query results on
the client device.
11. The computer program product of claim 10, further comprising:
seventh computer readable program code that enables a processor to
retrieve the stored first and second query results when translation
of a subsequent voice command is determined to be substantially
similar to the translated voice command that generated the first
and second query results.
12. The computer program product of claim 11, further comprising:
eighth computer readable program code that enables a processor to
transmit to the remote server device a second audio stream
associated with the subsequent voice command; ninth computer
readable program code that enables a processor to translate the
second audio stream to a third machine-readable voice command using
the second speech recognizer; tenth computer readable program code
that enables a processor to receive a third query result from the
remote server device, wherein the third query result is generated
from a subsequent query made to the server database based on the
third machine-readable voice command; and eleventh computer
readable program code that enables a processor to display the
first, second, and third query results on the client device.
13. The computer program product of claim 10, wherein the sixth
computer readable program code comprises: seventh computer readable
program code that enables a processor to identify which portion of
the first and second query results to store, the identification
comprising receiving a user selection of an item of interest from a
list of items returned as a part of the second query result.
14. The computer program product of claim 9, wherein the second
computer readable program code comprises: sixth computer readable
program code that enables a processor to transmit the audio stream
to the second speech recognizer such that the query made to the
remote server database based on the second machine-readable voice
command occurs during a time period that overlaps when the query is
made to the client database based on the first machine-readable
voice command.
15. A system for performing a voice command on a client device,
comprising: a first speech recognizer device configured to
translate an audio stream of a voice command to a first
machine-readable voice command; a client query manager configured
to: generate a first query result using the first machine-readable
voice command to query a client database; transmit the audio stream
to a remote server device that translates the audio stream to a
second machine-readable voice command using a second speech
recognizer device; and receive a second query result from the
remote server device, wherein the second query result is generated
by the remote server device using the second machine-readable voice
command to query a remote server database; and a display device
configured to display the first query result and the second query
result on the client device.
16. The system of claim 15, further comprising: a microphone
configured to receive the audio stream of the voice command and to
provide the audio stream to the first speech recognizer device; and
a storage device configured to store at least a portion of the
first and second query results on the client device.
17. The system of claim 16, wherein the client query manager is
configured to retrieve the stored first and second query results
from the storage device when translation of a subsequent voice
command is determined to be substantially similar to the translated
voice command that generated the first and second query
results.
18. The system of claim 17, wherein the client query manager is
configured to: transmit to the remote server device a second audio
stream associated with the subsequent voice command; translate the
second audio stream to a third machine-readable voice command using
the second speech recognizer device; and receive a third query
result from the remote server device, wherein the third query
result is generated from a subsequent query made to the server
database based on the third machine-readable voice command.
19. The system of claim 15, wherein the client query manager is
configured to transmit the audio stream to the second speech
recognizer device such that the query made to the remote server
database based on the second machine-readable voice command occurs
during a time period that overlaps when the query is made to the
client database based on the first machine-readable voice
command.
20. The system of claim 15, wherein the display device is
configured to display the first result and a first subset of the
second query result at a first time instance and the first result,
the first subset of the second query result, and a second subset of
the second query result at a second time instance.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional
Application No. 61/298,251 (SKGF Ref. No. 2525.2310000), filed Jan.
26, 2010, titled "Integration of Embedded and Network Speech
Recognizers," which is incorporated herein by reference in its
entirety.
BACKGROUND
[0002] 1. Field
[0003] This description generally relates to the field of speech
recognition.
[0004] 2. Background
[0005] Speech recognition systems in mobile devices allow users to
communicate and provide commands to the mobile device with minimal
usage of input controls such as, for example, keypads, buttons, and
dials. Some speech recognition tasks can be a complex process for
mobile devices, requiring an extensive analysis of speech signals
and search of word and language statistical models. This is because
mobile devices typically have limited computational, memory, and
battery resources. As such, more complex speech recognition tasks
are oftentimes offloaded to speech recognition systems located
externally to the mobile device such as, for example, speech
recognition systems in network servers.
[0006] Since more complex speech recognition tasks are performed on
network servers and not on the mobile device, the results of the
voice command may be limited to data stored in the network server.
For these speech recognition tasks, the mobile device user does not
have the benefit of viewing query results that may correspond to
the voice command based on data stored in the mobile device. In
addition, the delay time in transferring the voice command to the
network server, performing the speech recognition operation at the
network server, and transferring the query result from the network
server to the mobile device can be significant. Significant delay
time in the execution of applications on mobile devices, such as
speech recognition tasks, can lead to a poor user experience.
[0007] Methods and systems are needed for performing speech
recognition tasks on a client device, such as a mobile device, to
overcome the above-noted limitations of speech recognition systems
in mobile applications.
SUMMARY
[0008] Embodiments include a method for performing a voice command
on a client device. The method includes translating, using a first
speech recognizer located on the client device, an audio stream of
a voice command to a first machine-readable voice command and
generating a first query result using the first machine-readable
voice command to query a client database. In addition, the audio
stream can be transmitted to a remote server device that translates
the audio stream to a second machine-readable voice command using a
second speech recognizer. Further, the method includes receiving a
second query result from the remote server device, where the second
query result is generated by the remote server device using the
second machine-readable voice command to query a remote server
database, and displaying the first query result and the second
query result on the client device. The transmission of the audio
stream to the remote server device and the transmission of the
second query result from the remote server device to the client
device can occur simultaneously with, substantially at the same time
as, or during a time period that overlaps with the generation of the
first query result by the client device.
[0009] Embodiments additionally include a computer program product
that includes a computer-usable medium with computer program logic
recorded thereon for enabling a processor to perform a voice
command on a client device. The computer program logic includes the
following: first computer readable program code that enables a
processor to translate, using a first speech recognizer located on
the client device, an audio stream of a voice command to a first
machine-readable voice command; second computer readable program
code that enables a processor to generate a first query result
using the first machine-readable voice command to query a client
database; third computer readable program code that enables a
processor to transmit the audio stream to a remote server device
that translates the audio stream to a second machine-readable voice
command using a second speech recognizer; fourth computer readable
program code that enables a processor to process a second query
result from the remote server device, wherein the second query
result is generated by the remote server device using the second
machine-readable voice command; and, fifth computer readable
program code that enables a processor to control a display of the
first query result and the second query result on the client
device.
[0010] Embodiments further include a system for performing a voice
command on a client device. The system includes a first speech
recognizer device, a client query manager, and a display device.
The first speech recognizer device is configured to translate an
audio stream of a voice command to a first machine-readable voice
command. The client query manager is configured to perform the
following functions: generate a first query result using the first
machine-readable voice command to query a client database; transmit
the audio stream to a remote server device that translates the
audio stream to a second machine-readable voice command using a
second speech recognizer; and, receive a second query result from
the remote server device, where the second query result is
generated by the remote server device using the second
machine-readable voice command to query a remote server database.
Further, the display device is configured to display the first
query result and the second query result on the client device.
[0011] Further features and advantages of embodiments described
herein, as well as the structure and operation of various
embodiments, are described in detail below with reference to the
accompanying drawings. It is noted that the embodiments described
below are not limited to the specific embodiments described herein.
Such embodiments are presented herein for illustrative purposes
only. Additional embodiments will be apparent to persons skilled in
the relevant art based on the teachings contained herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The accompanying drawings, which are incorporated herein and
form a part of the specification, illustrate embodiments and,
together with the description, further serve to explain the
principles of the embodiments and to enable a person skilled in the
relevant art to make and use the embodiments.
[0013] FIG. 1 is an illustration of an exemplary communication
system in which embodiments can be implemented.
[0014] FIG. 2 is an illustration of an embodiment of a client
device.
[0015] FIG. 3 is an illustration of an embodiment of a server
device.
[0016] FIG. 4 is an illustration of an embodiment of a method for
performing a voice command on a client device.
[0017] FIGS. 5(a)-5(c) are illustrations of an exemplary user
interface on a mobile phone in accordance with embodiments.
[0018] FIG. 6 is an illustration of an embodiment of a method for
performing a voice command on a client device.
[0019] FIGS. 7(a) and 7(b) are illustrations of another exemplary
user interface on a mobile phone in accordance with
embodiments.
[0020] FIG. 8 is an illustration of an embodiment of another method
for performing a voice command on a client device.
[0021] FIG. 9 is an illustration of an embodiment of another method
for performing a voice command on a client device.
[0022] FIGS. 10(a)-10(e) are illustrations of yet another
exemplary user interface on a mobile phone in accordance with
embodiments.
[0023] FIG. 11 is an illustration of an example computer system in
which embodiments can be implemented.
DETAILED DESCRIPTION
[0024] The following detailed description refers to the
accompanying drawings that illustrate exemplary embodiments. Other
embodiments are possible, and modifications can be made to the
embodiments within the spirit and scope of this description.
Therefore, the detailed description is not meant to limit the
embodiments described below.
[0025] It would be apparent to one of skill in the relevant art
that the embodiments described below can be implemented in many
different embodiments of software, hardware, firmware, and/or the
entities illustrated in the figures. Any actual software code with
the specialized control of hardware to implement embodiments is not
limiting of this description. Thus, the operational behavior of
embodiments will be described with the understanding that
modifications and variations of the embodiments are possible, given
the level of detail presented herein.
[0026] FIG. 1 is an illustration of an exemplary communication
system 100 in which embodiments can be implemented. Communication
system 100 includes a client device 110 that is communicatively
coupled to a server device 130 via a network 120. Client device 110
can be, for example and without limitation, a mobile phone, a
personal digital assistant (PDA), a laptop, or other similar types
of mobile devices. Server device 130 can be, for example and
without limitation, a telecommunications server, a web server, or
other similar types of database servers. In an embodiment, server
device 130 can have multiple processors and multiple shared or
separate memory components such as, for example and without
limitation, one or more computing devices incorporated in a
clustered computing environment or server farm. The computing
process performed by the clustered computing environment, or server
farm, may be carried out across multiple processors located at the
same or different locations. In an embodiment, server device 130
can be implemented on a single computing device. Examples of
computing devices include, but are not limited to, a central
processing unit, an application-specific integrated circuit, or
other type of computing device having at least one processor and
memory. Further, network 120 can be, for example and without
limitation, a wired (e.g., Ethernet) or a wireless (e.g., Wi-Fi and
3G) network that communicatively couples client device 110 to
server device 130.
[0027] FIG. 2 is an illustration of an embodiment of client device
110. Client device 110 includes a speech recognizer 210, a client
query manager 220, a microphone 230, a client database 240, and a
display device 250. In an embodiment, microphone 230 is coupled to
speech recognizer 210, which is coupled to client query manager
220. Client query manager 220 is also coupled to client database 240
and display device 250, according to an embodiment.
[0028] In an embodiment, speech recognizer 210 and client query
manager 220 can be implemented in software, firmware, hardware, or
a combination thereof. Embodiments of speech recognizer 210 and
client query manager 220, or portions thereof, can also be
implemented as computer-readable code executed on one or more
computing devices capable of carrying out the functionality
described herein. Examples of computing devices include, but are
not limited to, a central processing unit, an application-specific
integrated circuit, or other type of computing device having at
least one processor and memory.
[0029] In an embodiment, microphone 230 is configured to receive an
audio stream corresponding to a voice command and to provide the
voice command to speech recognizer 210. The voice command can be
generated from an audio source such as, for example and without
limitation, a mobile phone user, according to an embodiment. In
turn, speech recognizer 210 is configured to translate the audio
stream to a machine-readable voice command, according to an
embodiment. Methods and techniques to translate the audio stream to
the machine-readable voice command are known to a person of
ordinary skill in the relevant art. Examples of these methods and
techniques can be found in commercial speech recognition software
such as Dragon Naturally Speaking Software and MacSpeech Software,
both by Nuance Communications, Inc.
[0030] Based on the machine-readable voice command, in an
embodiment, client query manager 220 queries client database 240 to
generate a query result. In an embodiment, client database 240
contains information that is locally stored in client device 110
such as, for example and without limitation, telephone numbers,
address information, and results from previous voice commands
(described in further detail below). Based on the description
herein, a person of ordinary skill in the relevant art will
recognize that other data stored in client database 240 can provide
query results to embodiments described herein.
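To make the client-side query concrete, the following is a minimal sketch of how client query manager 220 might query a locally stored contact list once a voice command has been translated to text. The SQLite storage, the contacts table, and its columns are illustrative assumptions; the patent does not specify a storage format.

    import sqlite3

    def query_client_database(db_path, command_text):
        """Query a local contacts table for entries matching the
        translated voice command. The schema is an assumption."""
        connection = sqlite3.connect(db_path)
        try:
            cursor = connection.execute(
                "SELECT name, phone FROM contacts WHERE name LIKE ?",
                ("%" + command_text + "%",),
            )
            # First query result, e.g., for the command "Barry Cage".
            return cursor.fetchall()
        finally:
            connection.close()

    # Example usage (assumes client.db contains a contacts table):
    # results = query_client_database("client.db", "Barry Cage")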
[0031] In an embodiment, client query manager 220 also coordinates
a transmission of the audio stream corresponding to the voice
command to server device 130 via network 120 of FIG. 1. The audio
stream can be transmitted to server device 130 in multiple types of
audio file formats such as, for example and without limitation, a
WAVE audio format. After server device 130 processes the audio
stream, which will be described in further detail below, client
query manager 220 coordinates a reception of a query result from
server device 130 via network 120. The transmission of data to and
reception of data from server device 130 can be performed using a
transceiver (not shown in FIG. 2), which is known by a person of
ordinary skill in the relevant art.
[0032] Client query manager 220 of FIG. 2 coordinates the
transmission of the audio stream to server device 130
simultaneously with, at substantially the same time as, or in
parallel with its query of client database 240, according to an
embodiment. As a result, in an embodiment, the query result from
server device 130 can be received by client query manager 220 and
displayed on display device 250 at substantially the same time as,
in parallel with, or soon after the query result from client device
110. In the alternative, depending on the computation time for
client query manager 220 to query client database 240 or the
complexity of the voice command, the query result from server
device 130 can be received by client query manager 220 and
displayed on display device 250 prior to the display of a query
result from client database 240, according to an embodiment.
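As a rough illustration of this overlapping behavior, the sketch below runs the local recognition-and-query path and the server round trip on separate threads, displaying each query result as it arrives. All function bodies are placeholder stubs standing in for speech recognizer 210, client database 240, and the network round trip to server device 130; none of these names come from the patent itself.

    import threading
    import time

    # Placeholder stubs for the embedded recognizer, client database,
    # and the transmission of the audio stream to the server.
    def recognize_locally(audio):
        return "barry cage"

    def query_client_database(command):
        return ["contact: Barry Cage"]

    def query_remote_server(audio):
        time.sleep(0.1)  # simulated network latency
        return ["web results for 'barry cage'"]

    def perform_voice_command(audio, display=print):
        """Overlap the local query with the server query."""
        def local_path():
            display(query_client_database(recognize_locally(audio)))

        def remote_path():
            display(query_remote_server(audio))

        threads = [threading.Thread(target=local_path),
                   threading.Thread(target=remote_path)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    perform_voice_command(b"raw-audio-bytes")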
[0033] In reference to FIG. 2, in an embodiment, display device 250
is configured to display the query results from client database 240
and from server device 130. These query results are stored in
client database 240 and may be retrieved at a later time based on a
future voice command that is substantially the same as or
substantially similar to the voice command used to generate the
query results, according to an embodiment.
[0034] FIG. 3 is an illustration of an embodiment of server device
130. Server device 130 includes a speech recognizer 310, a server
query manager 320, and a server database 330. In an embodiment,
speech recognizer 310 is coupled to server query manager 320, which
is coupled to server database 330.
[0035] In an embodiment, speech recognizer 310 and server query
manager 320 can be implemented in software, firmware, hardware, or
a combination thereof. Embodiments of speech recognizer 310 and
server query manager 320, or portions thereof, can also be
implemented as computer-readable code executed on one or more
computing devices capable of carrying out the functionality
described herein. Examples of computing devices include, but are
not limited to, a central processing unit, an application-specific
integrated circuit, or other type of computing device having at
least one processor and memory.
[0036] As described above, with respect to FIG. 2, server device
130 receives an audio stream corresponding to a voice command from
client device 110. In an embodiment, server query manager 320
coordinates the reception of the audio stream from client device
110 via a transceiver (not shown in FIG. 3) and transfer of the
audio stream to speech recognizer 310. In turn, speech recognizer
310 is configured to translate the audio stream to a
machine-readable voice command, according to an embodiment.
[0037] In an embodiment, speech recognizer 310 is configured to
translate both simple speech recognition tasks, as well as more
complex speech recognition tasks than those tasks translated by
speech recognizer 210 in client device 110. This is because speech
recognizer 310 has more computational and memory resources than
speech recognizer 210 to translate more complex voice commands to
corresponding machine-readable voice commands, according to an
embodiment. Methods and techniques to process complex speech
recognition tasks are known to a person of ordinary skill in the
relevant art.
[0038] Based on the machine-readable voice command translated by
speech recognizer 310, in an embodiment, server query manager 320
queries server database 330 to generate a query result. In an
embodiment, server database 330 contains a wide array of
information such as, for example and without limitation, text data,
image data, and video. Based on the description herein, a person of
ordinary skill in the relevant art will recognize that other data
stored in server database 330 can provide query results to
embodiments described herein.
[0039] After a query result is retrieved from server database 330,
server query manager 320 coordinates a transmission of the query
result to client device 110 via network 120 of FIG. 1. The
transmission of data to and the reception of data from client
device 110 can be performed using a transceiver (not shown in FIG.
3), which is known by a person of ordinary skill in the relevant
art.
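A minimal sketch of this server-side flow, using Python's standard HTTP server, is shown below. The endpoint, the JSON response shape, and the recognizer and database stubs are assumptions for illustration; the patent does not define a wire protocol.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def recognize(audio):
        """Stand-in for speech recognizer 310."""
        return "barry cage"

    def query_server_database(command):
        """Stand-in for a query against server database 330."""
        return ["web result for '" + command + "'"]

    class VoiceCommandHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Receive the audio stream (e.g., WAVE bytes) from the client.
            length = int(self.headers.get("Content-Length", 0))
            audio = self.rfile.read(length)
            # Translate the audio, query the database, return the result.
            results = query_server_database(recognize(audio))
            body = json.dumps({"results": results}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), VoiceCommandHandler).serve_forever()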
[0040] FIG. 4 is an illustration of an embodiment of a method 400
for performing a voice command on a client device. Method 400 can
occur using, for example, client device 110 in communication system
100 of FIG. 1. Unlike speech recognition systems that offload more
complex voice commands to a server device for processing and
return a corresponding query result to the client device, a speech
recognition system performing in accordance with method 400
processes both simple and complex voice commands on the client
device as well as the server device. The query results generated by
both the client device and the server device provide information
from a client database and a server database, respectively. As a
result, the user of the client device receives the benefit of
viewing query results that may correspond to the voice command
based on data stored on the client device as well as data stored on
the server device.
[0041] For ease of explanation, communication system 100 will be
used to facilitate the description of method 400. However, based
on the description herein, a person of ordinary skill in the relevant
art will recognize that method 400 can be executed on other
communication systems. These other communication systems are within
the scope and spirit of the embodiments described herein.
[0042] Further, for ease of explanation, method 400 will be
described in the context of a mobile phone (e.g., client device 110
of FIG. 1) with a mobile phone user as the audio source of the
voice command. The mobile phone is communicatively coupled to a
network server (e.g., server device 130 of FIG. 1) via a
communications network (e.g., network 120 of FIG. 1). Based on the
description herein, a person of ordinary skill in the relevant art
will recognize that method 400 can be executed on other types of
client devices such as, for example and without limitation, a PDA
and a laptop and with other audio sources such as, for example and
without limitation, a radio and a computer. These other types of
client devices and audio sources are within the scope and spirit of
the embodiments described herein.
[0043] In step 410, an audio stream of a voice command is
translated into a machine-readable voice command with a speech
recognizer located on the mobile phone. As described above, with
respect to FIG. 2, speech recognizer 210 translates the audio
stream received by microphone 230.
[0044] In step 420, a query is made to a database of the mobile
phone to generate a query result based on the machine-readable
voice command generated from step 410. In reference to FIG. 2,
based on the machine-readable voice command translated by speech
recognizer 210, client query manager 220 queries client database
240 to generate the query result.
[0045] FIGS. 5(a)-(c) are illustrations of an exemplary user
interface (UI) 510 on a mobile phone in accordance with embodiments
described herein. These illustrations are used to help facilitate
the explanation of steps 410 and 420 of FIG. 4.
[0046] With respect to FIG. 5(a), mobile phone UI 510 prompts the
mobile phone user for a voice command. In this example, the mobile
phone user provides "Barry Cage" as the voice command. In turn, in
accordance with step 410, the mobile phone translates the audio
stream of the voice command into a machine-readable voice command
using its embedded speech recognizer (e.g., speech recognizer 210
of FIG. 2). A query manager on the mobile phone (e.g., client query
manager 220 of FIG. 2) queries the mobile phone's database for
"Barry Cage."
[0047] With respect to FIG. 5(b), the mobile phone's query manager
queries a contact list database for the name "Barry Cage" and finds
a query result 520. Based on the description herein, a person of
ordinary skill in the relevant art will recognize that other
databases on the mobile phone can be queried to generate the query
result such as, for example and without limitation, call log
information, music libraries, and calendar listings.
[0048] With respect to FIG. 5(c), the mobile phone user can select
query result 520 to view contact information 530 corresponding to
the voice command.
[0049] In reference to FIG. 4, in step 430, the audio stream of the
voice command is transmitted to a network server, where the voice
command is translated to a machine-readable voice command with a
speech recognizer located on the network server. As described
above, with respect to FIG. 2, client query manager 220 coordinates
a transmission of the audio stream to server device 130.
[0050] In step 440, a query result is received from the network
server, where the query result is generated from a query made to a
server database based on the machine-readable voice command from
step 430. With respect to FIG. 3, speech recognizer 310 translates
the voice command to the machine-readable voice command. Based on
the machine-readable voice command, server query manager 320
queries server database 330 to generate the query result. This
query result is then transmitted from server device 130 to client
device 110 via network 120.
[0051] In an embodiment, as illustrated in method 600 of FIG. 6,
the transmission of the audio stream to the network server (step
430) and the reception of the query result from the network server
(step 440) can be performed simultaneously with, substantially at
the same time as, or to overlap with the translation of the audio
stream of the voice command by the mobile phone (step 410) and
query of the database on the mobile phone (step 420). As a result,
in an embodiment, the query result from the network server can be
received by and displayed on the mobile phone at substantially the
same time as, in parallel with, or soon after a display of the
query result from the database of the mobile phone. In the
alternative, depending on the computation time to query the mobile
phone's database or the complexity of the voice command, the query
result from the network server can be received by and displayed on
the mobile phone prior to the display of the query result from the
mobile phone's database, according to an embodiment.
[0052] In step 450 of FIG. 4, the query result from step 420 and
the query result from step 440 are displayed on the mobile phone.
In an embodiment, the query results from steps 420 and 440 are
stored in the database of the mobile phone and may be displayed
based on a future voice command by the mobile phone user.
[0053] FIGS. 7(a) and 7(b) are illustrations of an exemplary UI 710
on a mobile phone in accordance with embodiments described herein.
These illustrations are used to help facilitate the explanation of
steps 430-450 of FIG. 4.
[0054] Similar to FIG. 5, the illustrations of FIGS. 7(a) and 7(b)
assume that the mobile phone user provides "Barry Cage" as the
voice command. With respect to FIG. 7(a), in accordance with steps
410 and 420 of FIG. 4, field 720 displays a query result from a
query made to the mobile phone's database (e.g., client database
240 of FIG. 2). In addition, in accordance with steps 430-450,
field 730 displays a query result from a query made to the network
server (e.g., server database 330 of FIG. 3).
[0055] In the example of FIG. 7(a), field 730 is a list of three
entries that the network server returns as possible matches for
the voice command: "barry cage"; "mary paige"; and, "mary peach."
If the mobile phone user does not decide to select an entry from
field 720 (i.e., "Barry Cage"), then the mobile phone user can
select an entry from field 730. In addition, a partial portion of
the list in field 730 can be received by and displayed on the
mobile phone at a first time instance and the remainder of the list
in field 730 can be received by and displayed on the mobile phone
at a second time instance (e.g., later in time than the first time
instance). In this way, the mobile phone user can view a portion of
the query results as the remainder of the query results is being
processed by the network server and received by the mobile
phone.
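One way to realize this two-stage display is to consume the server's result list in batches and refresh the display after each batch, as in the sketch below; the batching helper and the batch size are illustrative assumptions.

    def receive_in_batches(server_results, batch_size=2):
        """Yield the server's result list a batch at a time, so a first
        subset can be shown before the remainder arrives."""
        batch = []
        for item in server_results:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    displayed = []
    for subset in receive_in_batches(
            ["barry cage", "mary paige", "mary peach"]):
        displayed.extend(subset)  # first time instance, then second
        print("display:", displayed)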
[0056] With respect to FIG. 7(b), if the mobile phone user selects
"barry cage" from field 730 of FIG. 7(a), then results 740 from a
web search are displayed on the mobile phone. The mobile phone user
can then scroll through search results 740 to locate a hyperlink of
interest. In accordance with step 450 of FIG. 4, search results 740
and the query result from step 420 of FIG. 4 (e.g., field 720 of
FIG. 7(a)) are stored in the mobile phone for a future voice
command by the mobile phone user, according to an embodiment. For
instance, if the mobile phone user provides "Barry Cage" as a voice
command at a later point in time, "Barry Cage" in field 720 and
"barry cage" in field 730 of FIG. 7(a) would be retrieved from the
mobile phone's memory and displayed to the mobile phone user. In
storing the web search result for "Barry Cage," the mobile phone
user receives the benefit of viewing a previously-selected web
search result. In turn, the mobile phone user's experience is
enhanced since the mobile phone is able to quickly recall a
selected entry from a previous voice command. An exemplary method
and system to store and retrieve data in fields 720 and 730 of FIG.
7(a) can be found in U.S. patent application Ser. No. 12/783,470
(Atty. Docket No. 2525.2360000), which is entitled "Personalization
and Latency Reduction for Voice-Activated Commands" and
incorporated herein by reference in its entirety.
[0057] FIG. 8 is an illustration of another method 800 for
performing a voice command on a client device. Method 800 can occur
using, for example, client device 110 in communication system 100
of FIG. 1. Similar to method 400 of FIG. 4, for ease of
explanation, communication system 100 will be used to facilitate
the description of method 800. Further, for ease of explanation,
method 800 will be described in the context of a mobile phone (e.g.,
client device 110 of FIG. 1) with a mobile phone user as the audio
source of the voice command.
[0058] In step 810, an audio stream of a voice command is received
by the mobile phone. As described above, with respect to FIG. 2,
microphone 230 is configured to receive the audio stream of the
voice command.
[0059] In step 820, a speech recognizer located on the mobile phone
determines whether the audio stream (from step 810) can be
translated into a machine-readable voice command with an
appropriate confidence score. In an embodiment, due to the limited
computational and memory resources of the mobile phone, the speech
recognizer located on the mobile phone (e.g., speech recognizer 210
of FIG. 2) may not be able to translate more complex voice commands
into corresponding machine-readable voice commands with relatively
high confidence scores. In particular, if a speech recognition
confidence score for the voice command is below a predetermined
threshold, then a query is not made to a database of the mobile
phone based on the voice command, according to an embodiment.
Instead, in an embodiment, the mobile phone stores the
machine-readable voice command with the relatively low confidence
score for future recall by the mobile phone. This future recall
feature will be described in further detail below. Methods and
techniques to determine speech recognition confidence scores are
known to a person of ordinary skill in the relevant art.
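The gating logic of steps 820-850 might look like the sketch below. The threshold value and the stub functions are illustrative assumptions; in particular, the patent only requires some predetermined threshold, not this number.

    CONFIDENCE_THRESHOLD = 0.6  # illustrative; patent says "predetermined"
    low_confidence_commands = {}

    def recognize_with_confidence(audio):
        """Stand-in for the embedded recognizer; returns (command, score)."""
        return ("pizza my heart", 0.3)

    def query_client_database(command):
        return ["local match for '" + command + "'"]

    def send_audio_to_server(audio):
        pass  # stand-in for step 850, the transmission to the server

    def handle_audio(audio, display=print):
        command, confidence = recognize_with_confidence(audio)  # step 820
        if confidence >= CONFIDENCE_THRESHOLD:
            display(query_client_database(command))  # steps 830-840
        else:
            # Below threshold: no local query; store the low-confidence
            # translation for recall on a similar future command.
            low_confidence_commands.setdefault(command, [])
        send_audio_to_server(audio)  # step 850 occurs in either case

    handle_audio(b"raw-audio-bytes")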
[0060] In step 830, if the speech recognizer located on the mobile
phone is able to provide a machine-readable voice command
translation for the audio stream of the voice command, then the
voice command is translated into the machine-readable voice command
with the speech recognizer located on the mobile phone. Step 830
performs a similar function as step 410 of FIG. 4.
[0061] In step 840, a query is made on a database of the mobile
phone to generate a query result based on the machine-readable
voice command generated from step 830. Step 840 performs a similar
function as step 420 of FIG. 4.
[0062] In step 850, regardless of whether the speech recognizer
located on the mobile phone is able to provide the machine-readable
voice command translation for the audio stream of the voice command
with the appropriate confidence score, the audio stream of the
voice command is transmitted to a network server, where the voice
command is translated to a machine-readable voice command with a
speech recognizer located on the network server. Step 850 performs
a similar function as step 430 of FIG. 4.
[0063] In step 860, a query result is received from the network
server, where the query result is generated from a query made to a
server database based on the machine-readable voice command from
step 850. Step 860 performs a similar function as step 440 of FIG.
4.
[0064] FIG. 9 is an illustration of another method 900 for
performing a voice command on a client device. Similar to steps 430
and 440 of FIG. 6, steps 850 and 860 of FIG. 8 can be performed
simultaneously with, substantially at the same time as, or to
overlap with the translation of the audio stream of the voice
command by the mobile phone (step 830) and the query of the database
on the mobile phone (step 840), according to an embodiment.
embodiment, the query result from the network server can be
received by and displayed on the mobile phone at substantially the
same time as, in parallel with, or soon after a display of the
query result from the database of the mobile phone. In the
alternative, depending on the computation time to query the mobile
phone's database or the complexity of the voice command, the query
result from the network server can be received by and displayed on
the mobile phone prior to the display of a query result from the
mobile phone's database, according to an embodiment.
[0065] In reference to step 880 of FIG. 8, if the speech recognizer
located on the mobile phone is able to provide a machine-readable
voice command translation for the audio stream of the voice command
(see step 870), the query result from step 840 and the query result
from step 860 are displayed on the mobile phone (see step 880). In
an embodiment, the query results from steps 840 and 860 are stored
in the database of the mobile phone for a future voice command by
the mobile phone user.
[0066] In the alternative, if the speech recognizer located on the
mobile device is not able to provide a machine-readable voice
command translation for the audio stream of the voice command (see
step 870), then only the query result from step 860 is displayed on
the mobile phone (see step 890). In an embodiment, the query result
from step 860 is stored in the database of the mobile phone for a
future voice command by the mobile phone user.
[0067] In an embodiment, a future voice command can be translated
into a machine-readable voice command, which can then be compared to
the machine-readable voice command with the relatively low
confidence score (from step 820 of FIG. 8). If the two
machine-readable voice commands substantially match one another or
are substantially similar to one another, then the mobile phone
displays the query result from step 840 and/or the query result from
step 860,
according to an embodiment. An exemplary method and system to store
and retrieve data in fields 720 and 730 of FIG. 7(a) can be found
in U.S. patent application Ser. No. 12/783,470 (Atty. Docket No.
2525.2360000), which is entitled "Personalization and Latency
Reduction for Voice-Activated Commands" and incorporated herein by
reference in its entirety.
[0068] In addition, according to an embodiment, the audio stream
corresponding to the future voice command is transmitted to the
network server, where the voice command is translated to a
machine-readable voice command with the speech recognizer located
on the network server. Based on the machine-readable voice command
corresponding to the future voice command, in an embodiment, a
query is made to a database on the network server to generate a
query result. This query result is received by, displayed on, and
stored in the mobile phone, according to an embodiment.
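A small sketch of this recall step appears below: stored results are keyed by the translated command, and a new translation is matched against the stored keys. Using difflib for the "substantially similar" comparison is purely an illustrative assumption, as is the 0.85 cutoff.

    import difflib

    stored_results = {
        "pizza my heart": ["www.pizzamyheart.com"],  # prior command
    }

    def recall_previous_results(new_command, cutoff=0.85):
        """Return stored results whose originating command substantially
        matches the new translation, or None if nothing matches."""
        matches = difflib.get_close_matches(
            new_command, stored_results.keys(), n=1, cutoff=cutoff)
        return stored_results[matches[0]] if matches else None

    print(recall_previous_results("pizza my heart"))
    # ['www.pizzamyheart.com']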
[0069] A benefit, among others, in displaying the stored query
result corresponding to the prior voice command and another query
result corresponding to the future voice command is that the mobile
phone user receives the benefit of viewing an updated query result
(if any) from the network server, according to an embodiment. In
addition, in an embodiment, the speech recognizer on the mobile
phone may mischaracterize the future voice command as corresponding
to a previously-stored voice command. In this case, the speech
recognizer located on the network server may be able to resolve the
mischaracterization by providing a more accurate translation of the
future voice command than the translation provided by speech
recognizer located on the mobile phone, according to an
embodiment.
[0070] FIGS. 10(a)-(e) are illustrations of an exemplary UI 1010 on
a mobile phone in accordance with embodiments described herein.
These illustrations are used to help facilitate the explanation of
method 800.
[0071] With respect to FIG. 10(a), mobile phone UI 1010 prompts the
mobile phone user for a voice command. In this example, the mobile
phone user provides "pizza my heart" as the voice command. In turn,
in accordance with steps 810 and 820, the mobile phone receives the
voice command and determines whether the audio stream of the voice
command can be translated into a machine-readable voice command
with an appropriate confidence score.
[0072] In the example illustrated in FIG. 10, the voice command
"pizza my heart" does not return a speech recognition confidence
score above the predetermined threshold value. In other words, the
voice command "pizza my heart" does not return a high-confidence
match from the speech recognizer located on the mobile phone. The
audio stream of the voice command is transmitted to a network
server for further speech recognition processing, in accordance
with step 850.
[0073] FIG. 10(b) is an illustration of an exemplary list of query
results 1020 from the voice command made to the network server.
Exemplary list of query results 1020 is transmitted from the
network server to the mobile phone, in accordance with step 860. In
an embodiment, as the mobile phone user views exemplary list of
query results 1020, information relating to each of the query
results (e.g., web pages, images, text data) is stored in cache
memory of the mobile phone. This allows the mobile phone user to select a
query result of interest from exemplary list of query results 1020
and instantly view information relating to the query result, thus
improving the mobile phone user's experience. For instance, with
respect to FIG. 10(c), the mobile phone user selects the top entry
"pizza my heart" from exemplary list of query results 1020 and a
list of web search results 1030 is displayed on the mobile phone.
From the web search results, the mobile phone user can select a
hyperlink of interest (e.g., www.pizzamyheart.com) and view the
contents of the web page on the mobile phone, as illustrated in a
web page 1040 of FIG. 10(d).
[0074] Further, in an embodiment of step 860, a partial portion of
the exemplary list of query results can be received by and
displayed on the mobile phone at a first time instance and the
remainder of the exemplary list of query results can be received by
and displayed on the mobile phone at a second time instance (e.g.,
later in time than the first time instance). In this way, the
mobile phone user can view a portion of the query results as the
remainder of the query results is being processed by the network
server and received by the mobile phone.
[0075] In an embodiment, the query result selected by the mobile
phone user (e.g., www.pizzamyheart.com) is stored in the database
of the mobile phone for a future voice command by the mobile phone
user. For instance, the hyperlink "www.pizzamyheart.com" appears as
a query result from a query made to the database of the mobile
phone when, at a later time, the mobile phone user provides "pizza
my heart" as a voice command to the mobile phone. This is
illustrated in field 1050 of FIG. 10(e). The mobile phone user can
select the query result in field 1050 and view the web page at
"www.pizzamyheart.com," as illustrated in FIG. 10(d). In storing
the query result and associated web page, the mobile phone user
receives the benefit of viewing a previously-selected web search
result. In turn, the mobile phone user's experience is enhanced
since the mobile phone is able to quickly recall a selected entry
from a previous voice command. An exemplary method and system to
store and retrieve data in field 1050 of FIG. 10(e) can be found in
U.S. patent application Ser. No. 12/783,470 (Atty. Docket No.
2525.2360000), which is entitled "Personalization and Latency
Reduction for Voice-Activated Commands" and incorporated herein by
reference in its entirety.
[0076] Various aspects of the embodiments described herein may be
implemented in software, firmware, hardware, or a combination
thereof. FIG. 11 is an illustration of an example computer system
1100 in which embodiments, or portions thereof, can be implemented
as computer-readable code. For example, the methods illustrated by
flowchart 400 of FIG. 4, flowchart 600 of FIG. 6, flowchart 800 of
FIG. 8, or flowchart 900 of FIG. 9 can be implemented in computer
system 1100. Various embodiments are described in terms of this
example computer system 1100. After reading this description, it
will become apparent to a person skilled in the relevant art how to
implement embodiments described herein using other computer systems
and/or computer architectures.
[0077] Computer system 1100 is an example computing device and
includes one or more processors, such as processor 1104. Processor
1104 may be a special purpose or a general-purpose processor.
Processor 1104 is connected to a communication infrastructure 1106
(e.g., a bus or network).
[0078] Computer system 1100 also includes a main memory 1108,
preferably random access memory (RAM), and may also include a
secondary memory 1110. Secondary memory 1110 can include, for
example, a hard disk drive 1112, a removable storage drive 1114,
and/or a memory stick. Removable storage drive 1114 can comprise a
floppy disk drive, a magnetic tape drive, an optical disk drive, a
flash memory, or the like. The removable storage drive 1114 reads
from and/or writes to a removable storage unit 1118 in a well-known
manner. Removable storage unit 1118 can include a floppy disk,
magnetic tape, optical disk, etc. which is read by and written to
by removable storage drive 1114. As will be appreciated by persons
skilled in the relevant art, removable storage unit 1118 includes a
computer-usable storage medium having stored therein computer
software and/or data.
[0079] In alternative implementations, secondary memory 1110 can
include other similar devices for allowing computer programs or
other instructions to be loaded into computer system 1100. Such
devices can include, for example, a removable storage unit 1122 and
an interface 1120. Examples of such devices can include a program
cartridge and cartridge interface (such as those found in video
game devices), a removable memory chip (e.g., EPROM or PROM) and
associated socket, and other removable storage units 1122 and
interfaces 1120 which allow software and data to be transferred
from the removable storage unit 1122 to computer system 1100.
[0080] Computer system 1100 can also include a communications
interface 1124. Communications interface 1124 allows software and
data to be transferred between computer system 1100 and external
devices. Communications interface 1124 can include a modem, a
network interface (such as an Ethernet card), a communications
port, a PCMCIA slot and card, or the like. Software and data
transferred via communications interface 1124 are in the form of
signals, which may be electronic, electromagnetic, optical, or
other signals capable of being received by communications interface
1124. These signals are provided to communications interface 1124
via a communications path 1126. Communications path 1126 carries
signals and may be implemented using wire or cable, fiber optics, a
phone line, a cellular phone link, a RF link or other
communications channels.
[0081] In this document, the terms "computer program medium" and
"computer-usable medium" are used to generally refer to media such
as removable storage unit 1118, removable storage unit 1122, and a
hard disk installed in hard disk drive 1112. Computer program
medium and computer-usable medium can also refer to memories, such
as main memory 1108 and secondary memory 1110, which can be memory
semiconductors (e.g., DRAMs, etc.). These computer program products
provide software to computer system 1100.
[0082] Computer programs (also called computer control logic) are
stored in main memory 1108 and/or secondary memory 1110. Computer
programs may also be received via communications interface 1124.
Such computer programs, when executed, enable computer system 1100
to implement embodiments discussed herein. In particular, the
computer programs, when executed, enable processor 1104 to implement
processes described above, such as the steps in the methods
illustrated by flowchart 400 of FIG. 4, flowchart 600 of FIG. 6,
flowchart 800 of FIG. 8, and flowchart 900 of FIG. 9, discussed
above. Accordingly, such computer programs represent controllers of
the computer system 1100. Where embodiments described herein are
implemented using software, the software can be stored in a
computer program product and loaded into computer system 1100 using
removable storage drive 1114, interface 1120, hard drive 1112 or
communications interface 1124.
[0083] Based on the description herein, a person of ordinary skill
in the relevant art will recognize that the computer programs, when
executed, can enable one or more processors to implement processes
described above, such as the steps in the methods illustrated by
flowchart 400 of FIG. 4, flowchart 600 of FIG. 6, flowchart 800 of
FIG. 8, and flowchart 900 of FIG. 9. In an embodiment, the one or
more processors can be part of a computing device incorporated in a
clustered computing environment or server farm. Further, in an
embodiment, the computing process performed by the clustered
computing environment such as, for example, the steps in the
methods illustrated by flowcharts 400, 600, 800, and 900 may be
carried out across multiple processors located at the same or
different locations.
[0084] Embodiments are also directed to computer program products
including software stored on any computer-usable medium. Such
software, when executed in one or more data processing devices,
causes the data processing device(s) to operate as described herein.
Embodiments employ any computer-usable or -readable medium, known
now or in the future. Examples of computer-usable mediums include,
but are not limited to, primary storage devices (e.g., any type of
random access memory), secondary storage devices (e.g., hard
drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage
devices, optical storage devices, MEMS, nanotechnological storage
devices, etc.), and communication mediums (e.g., wired and wireless
communications networks, local area networks, wide area networks,
intranets, etc.).
[0085] While various embodiments have been described above, it
should be understood that they have been presented by way of
example only, and not limitation. It will be understood by those
skilled in the relevant art that various changes in form and
details can be made therein without departing from the spirit and
scope of the embodiments described herein. It should be understood
that this description is not limited to these examples. This
description is applicable to any elements operating as described
herein. Accordingly, the breadth and scope of this description
should not be limited by any of the above-described exemplary
embodiments, but should be defined only in accordance with the
following claims and their equivalents.
* * * * *