U.S. patent application number 13/504264 was published by the patent office on 2012-08-23 for speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium.
This patent application is currently assigned to NEC CORPORATION. Invention is credited to Kentaro Nagatomo.
United States Patent Application 20120215528
Kind Code: A1
Nagatomo; Kentaro
August 23, 2012
SPEECH RECOGNITION SYSTEM, SPEECH RECOGNITION REQUEST DEVICE,
SPEECH RECOGNITION METHOD, SPEECH RECOGNITION PROGRAM, AND
RECORDING MEDIUM
Abstract
Provided is a speech recognition system, including: a first
information processing device including a speech recognition
processing unit for receiving data to be used for speech
recognition transmitted via a network, carrying out speech
recognition processing, and returning resultant data; and a second
information processing device connected to the first information
processing device via the network. The second information
processing device performs conversion of the data into data having
a format that disables a content thereof from being perceived and
also enables the speech recognition processing unit to perform the
speech recognition processing. Thereafter, the second information
processing device transmits the data to be used for the speech
recognition by the speech recognition processing unit and
constructs resultant data returned from the first information
processing device into a content of a valid and perceivable
recognition result.
Inventors: Nagatomo; Kentaro (Tokyo, JP)
Assignee: NEC CORPORATION (Tokyo, JP)
Family ID: 43921838
Appl. No.: 13/504264
Filed: October 12, 2010
PCT Filed: October 12, 2010
PCT No.: PCT/JP2010/068230
371 Date: April 26, 2012
Current U.S. Class: 704/211; 704/201; 704/E19.001
Current CPC Class: G10L 15/22 20130101; G10L 15/30 20130101; G06F 21/32 20130101; G10L 17/00 20130101; G10L 15/26 20130101; G10L 15/02 20130101; G10L 15/187 20130101; G10L 2015/025 20130101
Class at Publication: 704/211; 704/201; 704/E19.001
International Class: G10L 19/00 20060101 G10L019/00

Foreign Application Data

Date: Oct 28, 2009; Code: JP; Application Number: 2009-247874
Claims
1. A speech recognition system, comprising: a first information
processing device comprising a speech recognition processing unit
for receiving data which are used for speech recognition and which
are transmitted via a network, carrying out speech recognition
processing, and returning resultant data; and a second information
processing device connected to the first information processing
device via the network, for performing conversion of data used for
the speech recognition by the speech recognition processing unit
into data having a format that disables a content thereof from
being perceived and that also enables the speech recognition
processing unit to perform the speech recognition processing to
obtain converted data in the format and to thereafter transmit the
converted data to the speech recognition processing unit, and
constructing the resultant data returned from the first information
processing device into a content of a valid and perceivable
recognition result.
2. A speech recognition system, comprising: a first information
processing device comprising a speech recognition processing unit
for receiving data to be used for speech recognition transmitted
via a network, carrying out speech recognition processing, and
returning resultant data; and a second information processing
device which is connected to the first information processing
device via the network, which transmits the data to be used for the
speech recognition by the speech recognition processing unit after
performing mapping thereof by using a mapping function unknown to
the first information processing device, and which constructs a speech
recognition result by modifying the resultant data returned from
the first information processing device into the same result as a
result of performing the speech recognition without using the
mapping function.
3. A speech recognition system, comprising a plurality of
information processing devices that are connected to one another
via a network and comprise a speech recognition processing unit in
at least one information processing device, wherein: the
information processing device with the speech recognition
processing unit receives at least one data structure of data to be
used for speech recognition processing by the speech recognition
processing unit; wherein: the at least one data structure of the
data is converted by using a mapping function and transmitted to
the information processing device with the speech recognition
processing unit; the information processing device with the speech
recognition processing unit carries out the speech recognition
processing based on the converted data structure and transmits a
result thereof; and the result of carrying out the speech
recognition processing which is affected by the mapping function is
constructed into a result of carrying out the speech recognition
processing which is not affected by the mapping function.
4. A speech recognition system according to claim 2, wherein the mapping function that is used comprises a mapping function Φ in which, when the mapping function Φ = {φ} maps a data structure X and a data structure Y to φ_x(X) and φ_y(Y), respectively, with regard to a function F(X,Y) used by the speech recognition processing unit, the values of F(X,Y) and F(φ_x(X), φ_y(Y)) are constantly the same, a difference therebetween is constantly less than a given threshold value, or a ratio therebetween is constantly fixed.
5. A speech recognition system according to claim 2, wherein a data
structure used by the speech recognition processing unit indicates
a reference relationship between a given index and a reference
destination in relation to an index that refers to specific data
included in the data structure.
6. A speech recognition system according to claim 2, wherein the
mapping function comprises a function in which: with regard to a
reference relationship between an index that refers to specific
data included in a given data structure and a reference
destination, a destination to which a given arbitrary index refers
before mapping does not necessarily match a destination to which
the same index refers after the mapping; and data at the reference
destination to which any one of indices refers before the mapping
is always referred to by any one of the indices after the
mapping.
7. A speech recognition system according to claim 6, wherein the
mapping function comprises shuffling of indices that refer to the
specific data included in the given data structure.
8. A speech recognition system according to claim 6, wherein the
mapping function adds an arbitrary number of indices to the
specific data included in the given data structure.
9. A speech recognition system according to claim 2, wherein at
least one item of data to be used for speech recognition which is
subjected to mapping by using the mapping function is retained
before the mapping only on an information processing device for
inputting a sound to be subjected to the speech recognition.
10. A speech recognition system according to claim 2, wherein the
data to be used by the speech recognition processing unit has a
structure to which at least one selected from the group consisting
of a structure of an acoustic model, a structure of a language
model, and a structure of a feature vector is mapped.
11. A speech recognition system according to claim 10, wherein:
indices indicating respective features included in the feature
vector are mapped by using the mapping function given by a device
for inputting a sound to be subjected to speech recognition; and
indices to models associated with respective features within the
acoustic model are mapped by using the mapping function given by
the device for inputting the sound to be subjected to the speech
recognition.
12. A speech recognition system according to claim 11, wherein:
phoneme IDs being indices to phonemes included in the acoustic
model are mapped by using the mapping function given by the device
for inputting the sound; phoneme ID strings indicating
pronunciations of respective words included in the language model
are mapped by using the mapping function given by the device for
inputting the sound; and at least information on representation
character strings of the respective words included in the language
model is deleted.
13. A speech recognition system according to claim 12, wherein word
IDs being indices to the respective words included in the language
model are mapped by using the mapping function given by the device
for inputting the sound.
14. A speech recognition system according to claim 2, comprising
the information processing device which is operable in response to
the speech data and which comprises at least an acoustic likelihood
computation unit and is configured to: map phoneme ID strings
indicating pronunciations of respective words included in the
language model by using the mapping function given by the
information processing device, and delete at least information on
representation character strings of the respective words included
in the language model; compute acoustic likelihoods of all known
phonemes or necessary phonemes for each frame of the speech data to
generate a sequence of a group of the phoneme IDs and acoustic
likelihoods that are mapped by using the mapping function given by
the information processing device; and transmit the sequence of the
group of the mapped phoneme IDs and acoustic likelihoods and the
language model after the mapping to the information processing
device comprising a hypothesis search unit.
15. A speech recognition system according to claim 2, comprising
the information processing device which is operable in response to
speech data and which is configured to: divide the speech data into
blocks; map a time sequence among the divided blocks by using the
mapping function given by the information processing device for
inputting speech data; transmit the blocks of speech to an
information processing device for performing speech recognition
based on the time sequence after the mapping; receive any one of a
feature vector or a sequence of a group of phoneme IDs and acoustic
likelihoods from the information processing device for performing
the speech recognition; and restore the time sequence by using an
inverse function to the mapping function given by the information
processing device for inputting speech data.
16. A speech recognition request device, comprising: a
communication unit connected via a network to a speech recognition
device comprising a speech recognition processing unit for
receiving data to be used for speech recognition transmitted via
the network, carrying out speech recognition processing, and
returning resultant data; an information conversion unit for
converting the data to be used for the speech recognition by the
speech recognition processing unit into data having a format that
disables a content thereof from being perceived and also enables
the speech recognition processing unit to perform the speech
recognition processing; and a recognition result construction unit for reconstructing the resultant data returned from the speech recognition device after performing the speech recognition on the converted data into a speech recognition result that is perceivable as a content of a valid recognition result, based on the converted content.
17. A speech recognition request device, comprising: a
communication unit connected via a network to a speech recognition
device comprising a speech recognition processing unit for
receiving data to be used for speech recognition transmitted via
the network, carrying out speech recognition processing, and
returning resultant data; an information conversion unit for
mapping the data to be used for the speech recognition by the
speech recognition processing unit by using a mapping function
unknown to the speech recognition device; and a recognition result
construction unit which is operable on the basis of the mapping
function and which constructs the resultant data returned from the
speech recognition device to obtain, from the resultant data, the
same result as a result of performing the speech recognition
without using the mapping function.
18. A speech recognition request device according to claim 17,
wherein the information conversion unit maps a data structure of
the data to be used for the speech recognition which is transmitted
to the speech recognition processing unit so as to indicate a
reference relationship between a predetermined index and a
reference destination in relation to an index that refers to
specific data included in the data structure.
19. A speech recognition request device according to claim 17,
wherein the mapping function comprises a function in which: with
regard to a reference relationship between an index that refers to
specific data included in a given data structure and a reference
destination, a destination to which a given arbitrary index refers
before mapping does not necessarily match a destination to which
the same index refers after the mapping; and data at the reference
destination to which any one of indices refers before the mapping
are always referred to by any one of the indices after the
mapping.
20. A speech recognition request device according to claim 17,
wherein: indices indicating respective features included in the
feature vector are mapped by using the mapping function; and
indices to models associated with respective features within the
acoustic model are mapped by using the mapping function.
21. A speech recognition request device according to claim 17,
wherein: phoneme IDs being indices to phonemes included in the
acoustic model are mapped by using the mapping function; phoneme ID
strings indicating pronunciations of respective words included in
the language model are mapped by using the mapping function; and at
least information on representation character strings of the
respective words included in the language model is deleted.
22. A speech recognition request device according to claim 17,
further comprising an acoustic likelihood computation unit and
being configured to: map phoneme ID strings indicating
pronunciations of respective words included in the language model
by using the mapping function, and delete at least information on
representation character strings of the respective words included
in the language model; compute acoustic likelihoods of all known
phonemes or necessary phonemes for each frame of the speech data to
generate a sequence of a group of the phoneme IDs and acoustic
likelihoods that are mapped by using the mapping function given by
the speech recognition device; and transmit the sequence of the
group of the mapped phoneme IDs and acoustic likelihoods and the
language model after the mapping to the speech recognition device
comprising a hypothesis search unit.
23. A speech recognition request device according to claim 17,
further configured to: divide speech data of a sound to be
subjected to the speech recognition into a plurality of blocks; map
a time sequence among the divided blocks by using the mapping
function; transmit the blocks of speech to the speech recognition
device based on the time sequence after the mapping; and receive
result data on the speech recognition transmitted from the speech
recognition device, and restore the time sequence by using an
inverse function to the mapping function.
24. An information processing device, comprising: means for storing
an acoustic model, a language model, and conversion/reconstruction
data used for conversion that achieves secrecy and restoration; a
first conversion means for acquiring the acoustic model, the
language model, and the conversion/reconstruction data, and
converting a data structure of each model used for speech
recognition into a data structure having the secrecy; a second
conversion means for converting a sound to be subjected to
identification into data, and converting a data structure of the
data into a data structure having the secrecy; means for
transmitting the converted data to an acoustic recognition device
via a network; and means for constructing a recognition result
equivalent to a result of performing the speech recognition without
using the first conversion means and the second conversion means,
based on a result of the speech recognition received from the
acoustic recognition device via the network, the acoustic model,
the language model, and the conversion/reconstruction data.
25. A speech recognition method, comprising: connecting a speech
recognition device comprising a speech recognition processing unit
and a speech recognition request device for requesting the speech
recognition device for speech recognition to each other via a
network; converting, by the speech recognition request device, at
least one data structure of data to be used for speech recognition
processing by the speech recognition processing unit by using a
mapping function, and transmitting the resultant data to the speech
recognition device; carrying out, by the speech recognition device,
the speech recognition processing based on the converted data
structure, and transmitting a result thereof to the speech
recognition request device; and constructing, by the speech
recognition request device, the result of carrying out the speech
recognition processing which is affected by the mapping function
into a result of carrying out the speech recognition processing
which is not affected by the mapping function.
26. A speech recognition method according to claim 25, wherein the
data to be used by the speech recognition processing unit, which is
converted and transmitted from the speech recognition request
device to the speech recognition device, has a structure to which
at least one selected from the group consisting of a structure of
an acoustic model, a structure of a language model, and a structure
of a feature vector is mapped.
27. A speech recognition method according to claim 25, wherein the
mapping function comprises a function of shuffling indices that
refer to specific data included in a given data structure or adding
an arbitrary number of indices to the indices that refer to the
specific data included in the given data structure.
28. A speech recognition method according to claim 25, wherein the mapping function that is used comprises a mapping function Φ in which, when the mapping function Φ = {φ} maps a data structure X and a data structure Y to φ_x(X) and φ_y(Y), respectively, with regard to a function F(X,Y) used by the speech recognition processing unit, the values of F(X,Y) and F(φ_x(X), φ_y(Y)) are constantly the same, a difference therebetween is constantly less than a given threshold value, or a ratio therebetween is constantly fixed.
29. A non-transitory recording medium having recorded thereon a
speech recognition program which is used in a control unit of an
information processing device coupled through a network to a speech
recognition processing device comprising a speech recognition unit
which receives data to be used through the network, which carries
out speech recognition processing, and which returns resultant data
via the network; the speech recognition program making the control
unit operate as: a communication unit connected via a network to
the speech recognition device; an information conversion unit which
converts the data used for the speech recognition by the speech
recognition processing unit into data of a format that disables a
content thereof from being perceived and also enables the speech
recognition processing unit to perform the speech recognition
processing; and a recognition result construction unit which reconstructs the resultant data returned from the speech recognition device after performing the speech recognition on the converted data into a speech recognition result whose content is perceivable as a valid recognition result, based on the converted content.
30. A non-transitory recording medium having recorded thereon a
speech recognition program used in a control unit of an information
processing device which is coupled through a network to an acoustic
recognition device and which comprises: means for storing an
acoustic model, a language model, and conversion/reconstruction
data used for conversion that achieves secrecy and restoration; and
means for transmitting the converted data to the acoustic
recognition device via a network, the speech recognition program
making the control unit operate as: a first conversion means for
acquiring the acoustic model, the language model, and the
conversion/reconstruction data, to convert a data structure of each
model used for speech recognition into a data structure having the
secrecy; a second conversion means for converting a sound to be
subjected to identification into data, to convert a data structure
of the data into a data structure having the secrecy; and means for
constructing a recognition result equivalent to a result of
performing the speech recognition without using the first
conversion means and the second conversion means, based on a result
of the speech recognition received from the acoustic recognition
device via the network, the acoustic model, the language model, and
the conversion/reconstruction data.
31. (canceled)
32. (canceled)
Description
TECHNICAL FIELD
[0001] This invention relates to a speech recognition system, a
speech recognition method, and a speech recognition program.
Specifically, this invention relates to a speech recognition
system, a speech recognition method, and a speech recognition
program, which disable the third party from restoring details of a
recognition result regarding a content of speech to be subjected to
speech recognition, details of a speech recognition dictionary, or
the like.
BACKGROUND ART
[0002] A speech recognition technology using an information
processing system is a technology for taking out language
information included in input speech data. A system using the
speech recognition technology can be used as a speech word
processor if all the speech data are converted into text, and can
be used as a speech command input device if a keyword included in
the speech data is extracted.
[0003] FIG. 7 illustrates an example of a related speech
recognition system. The speech recognition system illustrated in
FIG. 7 includes an utterance segment extraction unit, a feature
vector extraction unit, an acoustic likelihood computation unit, a
hypothesis search unit, and a database for speech recognition.
[0004] The speech recognition system including such components
operates as follows.
[0005] A segment that involves actual utterance (speech segment)
and a segment that does not (silent segment) coexist in a sound
(speech) input to the speech recognition system, and hence the
utterance segment extraction unit is used to take out only the
speech segment therefrom.
[0006] Subsequently, the speech data within the extracted segment
is input to the feature vector extraction unit, and a feature
vector is extracted by taking out various features included in the
speech at regular time intervals (frames). The features that are
often used may be, for example, cepstrum, power, and .DELTA. power.
A combination of a plurality of features is handled as a sequence
(vector) and may be referred to as "feature vector".
[0007] The extracted feature vector of the speech is sent to the
acoustic likelihood computation unit to obtain likelihood (acoustic
likelihood) thereof with respect to each of a plurality of phonemes
that are given in advance. Often used as the acoustic likelihood is
a similarity to a model of each of phonemes recorded in an acoustic
model of the database. The similarity is generally expressed as a
"distance" (magnitude of deviation) from the model, and hence
"acoustic likelihood computation" is referred to also as "distance
calculation". The phonemes are obtained intuitively by dividing a
phonetic unit into a consonant and a vowel, but even the same
phoneme exhibits different acoustic features when the preceding
phoneme or the following phoneme differs. It is therefore known
that such cases are separately modeled to increase precision in the
recognition. The phonemes obtained by thus taking the phonemes
before and after a phoneme into consideration are referred to as
"triphone (trio of phonemes)". The acoustic model widely used today
expresses state transitions among the phonemes by a Hidden Markov
Model (HMM). Accordingly, the acoustic model represents a set of
HMMs on a triphone-to-triphone basis. In most implementations, each
triphone is assigned an ID (hereinafter, referred to as "phoneme
ID"), and is handled wholly by the phoneme ID in processing in the
subsequent stages.
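As a rough illustration of the "distance calculation" described above, the following sketch scores one feature-vector frame against a toy acoustic model keyed by phoneme ID. The diagonal-Gaussian form and every name and number here are illustrative assumptions, not the model format of this application (real acoustic models are sets of HMMs on a triphone-to-triphone basis).

```python
import math

# Toy acoustic model: phoneme ID -> (per-dimension means, variances).
# A single diagonal Gaussian per phoneme ID is assumed here only to make
# the acoustic likelihood computation concrete.
ACOUSTIC_MODEL = {
    0: ([0.0, 1.0], [1.0, 1.0]),
    1: ([2.0, -1.0], [0.5, 2.0]),
}

def log_likelihood(frame, means, variances):
    """Log-probability of one feature-vector frame under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, means, variances))

def acoustic_likelihoods(frame):
    """Acoustic likelihood of the frame for every phoneme ID in the model."""
    return {pid: log_likelihood(frame, means, variances)
            for pid, (means, variances) in ACOUSTIC_MODEL.items()}

print(acoustic_likelihoods([0.1, 0.9]))  # scores keyed by phoneme ID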
[0008] The hypothesis search unit references a language model
regarding the acoustic likelihood obtained by the acoustic
likelihood computation unit to make a search for a word string
having the highest likelihood. The language model may be considered
by being classified into a dictionary and a strict language model.
In this case, the dictionary is given a list of vocabulary that can
be handled by the (broad-sense) language model. In general, each
word entry within the dictionary is assigned a phoneme string
(phoneme ID string) of a corresponding word and a representation
character string thereof. Meanwhile, the strict language model
includes information obtained by modeling the likelihood (language
likelihood) that a given word group within the vocabulary
continuously appears in a given order. Grammar and N-gram are most
often used as the strict language model today. The grammar
represents direct descriptions of adequacy of given word
concatenations that are made by using words, attributes of words,
categories to which words belong, and the like. Meanwhile, the
N-gram is obtained by statistically computing an appearance
likelihood of each word concatenation formed of N words based on an
actual appearance frequency within a large volume of corpus (text
data for learning). In general, each entry of the dictionary is
assigned an ID (hereinafter, referred to as "word ID"), and the
(strict) language model serves as a function that returns a
language likelihood when a word ID string is input thereto. In
summary, search processing performed by the hypothesis search unit
is processing for obtaining the likelihood (acoustic likelihood) of
phonemes from a feature vector string, obtaining whether or not to
allow conversion into the word ID from the phoneme ID string,
obtaining the appearance likelihood (language likelihood) of the
word string from the word ID string, and finally finding the word
string having the highest likelihood.
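A minimal sketch of the data structures this paragraph describes, under the assumption of a bigram (N = 2) strict language model; every ID, string, and probability below is invented for illustration.

```python
# Dictionary: word ID -> (phoneme ID string, representation character string).
DICTIONARY = {
    10: ((0, 1, 0), "hello"),
    11: ((1, 1), "world"),
}

# Strict language model: log-probabilities of word-ID bigrams (an N-gram
# with N = 2); None stands for the sentence-initial context.
BIGRAM = {
    (None, 10): -0.7,
    (10, 11): -0.3,
}

def language_likelihood(word_ids):
    """Language likelihood returned for an input word ID string."""
    score, prev = 0.0, None
    for w in word_ids:
        score += BIGRAM.get((prev, w), -10.0)  # floor for unseen bigrams
        prev = w
    return score

print(language_likelihood([10, 11]))  # -1.0
```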
[0009] A typical example of those kinds of speech recognition
systems includes that described by T. Kawahara, A. Lee, T.
Kobayashi, K. Takeda, N. Minematsu, S. Sagayama, K. Itou, A. Ito,
M. Yamamoto, A. Yamada, T. Utsuro, and K. Shikano in "Free software
toolkit for Japanese large vocabulary continuous speech
recognition." In Proc. Int'l Conf. on Spoken Language Processing
(ICSLP), Vol. 4, pp. 476-479, 2000 (Non Patent Literature 1).
[0010] Note that, there are limitations on the vocabulary and
phrases that can be modeled by a single language model. If a larger
volume of vocabulary and diverse phrases are to be modeled beyond
the limitations, ambiguity in hypothesis search increases, which
results in a decrease in recognition speed and a deterioration in
recognition precision. Further, it is impossible to collect an enormous vocabulary in its entirety in the first place. Accordingly, it is general practice to customize the language model depending on the task or domain for which the speech recognition
technology is to be made use of. For example, to use the speech
recognition technology for speech command, a language model formed
of only commands that can be received is created. Alternatively, if
the speech recognition technology is used to support dictation from
minutes of recorded speech in a meeting, a language model is
constructed by modeling only words and phrases that appeared in the
written minutes of the meeting held in the past and the speech in
the meeting along with their related words and phrases. This
enables the vocabulary specific to a particular task or domain to
be collected and enables appearance patterns thereof to be
modeled.
[0011] Further, the acoustic model is generally obtained by putting
a machine learning technology to full use by use of a large amount
of labeled speech data (set of speech data to which information as
to which segment of the speech data corresponds to which phoneme is
given). In general, such speech data, collection of which requires
high cost, is not customized for each user and is prepared
individually so as to suit general properties of expected use
scenes. For example, the acoustic model learned from labeled data
of telephone speech is used for telephone speech recognition. There
is sometimes provided an optimization processing function (referred
to generally as "speaker learning" function or "enrollment"
function) suitable for the speech of the individual users, which is
a function of learning difference information between the acoustic
model shared by users and the speech of the user, but a basic
acoustic model itself is rarely constructed for each user.
[0012] The speech recognition is widely applicable to various
purposes, but poses a problem of requiring a correspondingly large calculation amount, particularly in the above-mentioned hypothesis search processing. The speech recognition technology has been developed by balancing the mutually contradictory objectives of increasing the recognition precision and reducing the calculation amount, but
even today, there still remains a problem, for example, that there
are limitations on a vocabulary number that can be handled by a
cellular telephone terminal and the like. In order to realize the
speech recognition which is high in the degree of freedom with a
high precision, it is more effective to execute speech recognition
processing on a remote server that can process an abundant amount
of calculation. For such reasons, in recent years, such an
implementation form (client-server speech recognition form) as to
execute the speech recognition processing on the remote server and
receive only a recognition result (or some action based on the
result) on a local terminal is under active development.
[0013] Japanese Unexamined Patent Application Publication (JP-A)
No. 2003-5949 (Patent Document 1) discloses an example of the
speech recognition system having the implementation form described
above. As illustrated in FIG. 8, a speech recognition system
disclosed in Patent Document 1 includes a client terminal and a
server that communicate with each other via a network. The client
terminal includes a speech detection unit (utterance extraction
unit) for detecting a speech segment from an input speech, a
waveform compression unit for compressing the speech data of the
detected segment, and a waveform transmission unit for transmitting
compressed waveform data to the server. Further, the server
includes a waveform reception unit for receiving the compressed
waveform data transmitted from the client terminal, a waveform
decompression unit for decompressing the received compressed
speech, and an analysis unit and a recognition unit for analyzing
the decompressed waveform and subjecting the waveform to the speech
recognition processing.
[0014] The speech recognition system of Patent Document 1 including
such components operates as follows. That is, a sound (speech)
taken in the client terminal is divided into a speech segment and a
non-speech segment by the speech detection unit. Of those, the
speech segment is compressed by the waveform compression unit and
then transmitted to the server by the waveform transmission unit.
The waveform reception unit of the server, which has received this,
sends the received data to the waveform decompression unit. The
server causes the analysis unit to extract a feature from the
waveform data decompressed by the waveform decompression unit, and
finally causes the recognition unit to execute speech recognition
processing.
[0015] Also in a client-server speech recognition technology, the operation of a speech recognition unit is essentially the same as that of one operating on a single host. In the invention disclosed in Patent Document 1, the processing of FIG. 7 up to the step performed by the utterance extraction unit is executed by the client terminal, and the subsequent steps are executed by the server. In addition thereto, there exists a
client-server speech recognition technology in which the processing
up to the step corresponding to the feature vector extraction unit
is performed on the client terminal.
[0016] The client-server speech recognition technology has been
developed mainly by assuming the use on mobile terminals (such as
cellular telephones, PDAs, PHSs, and netbooks). As described above,
an original object thereof is to overcome the problem that the
speech recognition is difficult to perform on the mobile terminals
having poor processing performance because the calculation amount
involved in the speech recognition processing is severe. In recent
years, the processing performance of the mobile terminals has
improved, while the speech recognition technology has been
sophisticated, and hence a client-server speech recognition system
is not always necessary. On the other hand, the client-server
speech recognition system is drawing much more attention. This is
based on a trend (so-called software as a service (SaaS)) wherein
various functions heretofore realized in the local terminal are now
provided over the network in consideration of an increase of a
network bandwidth, management costs, and the like. In a case where
the speech recognition technology is provided as a network service,
a system therefor is constructed on the basis of the client-server
speech recognition technology.
DISCLOSURE OF THE INVENTION
Problems to be Solved by the Invention
[0017] Next, description is made of the problems concerning a speech recognition system.
[0018] A first problem is that a risk that a content of a user's
utterance (speech signal) may be leaked to the third party
increases in a case where a speech recognition function is realized
as a service provided via a network. This is because even if
secrets of communications are protected by encrypting speech data
on a communication channel, the speech data may be decoded at least
on a speech recognition server that provides a speech recognition
service.
[0019] A second problem is that a risk that a content expected to
be uttered by the user or special information related to a task or
domain to be used for a speech recognition technology by the user
may be leaked to the third party increases in the case where the
speech recognition function is realized as the service provided via
the network. This is because more or less customization is
necessary for a language model in order to perform the speech
recognition with a practical accuracy. Specifically, such
customization may need to add, to the language model, a vocabulary
that expresses the special information related to the task or
domain. This is also because the language model is essential in a
hypothesis search stage within speech recognition processing, and
hence the language model is put into a readable state at least on
the recognition server in a system that performs hypothesis search
processing on a recognition server.
[0020] Note that, the third party referred to herein includes one
(a natural person, an artificial person, or another system) that provides the speech recognition service. If leakage only to the speech recognition service provider poses no problem, the communication channel and the language model file may simply be encrypted. However,
in a case of wishing to make information secret even from the
speech recognition service provider, the above-mentioned technology
cannot handle the case. Further, other examples of the third party
include a hacker or cracker that illegally breaks into a server,
and a system (program) that performs such an act. This means that
in the case where the server that provides the speech recognition
service has been broken into, the speech data, analysis results,
the special information related to the task or domain, and the like
may be acquired with ease by the third party and no countermeasures
can be taken by service users at all.
[0021] This invention provides a speech recognition system capable
of secret speech recognition which suppresses a risk that a content
of a user's utterance may be leaked to the third party to a minimum
level in a case where a speech recognition function is realized as
a service provided via a network.
[0022] Further, this invention provides a speech recognition system
capable of secret speech recognition which suppresses a risk that a
content expected to be uttered by the user or special information
related to a task or domain to be used for a speech recognition
technology by the user may be leaked to the third party to a
minimum level in a case where the speech recognition function is
realized as the service provided via the network.
Means to Solve the Problems
[0023] A speech recognition system according to this invention
includes: a first information processing device including a speech
recognition processing unit for receiving data to be used for
speech recognition transmitted via a network, carrying out speech
recognition processing, and returning resultant data; and a second
information processing device connected to the first information
processing device via the network, for transmitting the data to be
used for the speech recognition by the speech recognition
processing unit after performing conversion thereof into data
having a format that disables a content thereof from being captured
and also enables the speech recognition processing unit to perform
the speech recognition processing, and constructing the resultant
data returned from the first information processing device into a
content being a valid recognition result.
[0024] A speech recognition request device according to this
invention includes: a communication unit connected via a network to
a speech recognition device including a speech recognition
processing unit for receiving data to be used for speech
recognition transmitted via the network, carrying out speech
recognition processing, and returning resultant data; an
information conversion unit for converting the data to be used for
the speech recognition by the speech recognition processing unit
into data having a format that disables a content thereof from
being captured and also enables the speech recognition processing
unit to perform the speech recognition processing; and an
recognition result construction unit for reconstructing the
resultant data returned from the speech recognition device after
performing the speech recognition on the converted data into a
speech recognition result that enables a content being a valid
recognition result to be captured, based on the converted
content.
Effect of the Invention
[0025] According to this invention, it is possible to provide a
speech recognition system capable of secret speech recognition
which suppresses a risk that a content of a user's utterance may be
leaked to the third party to a minimum level in a case where a
speech recognition function is realized as a service provided via a
network.
[0026] Further, according to this invention, it is possible to
provide a speech recognition system capable of secret speech
recognition which suppresses a risk that a content expected to be
uttered by the user or special information related to a task or
domain to be used for a speech recognition technology by the user
may be leaked to the third party to a minimum level in a case where
the speech recognition function is realized as the service provided
via the network.
BRIEF DESCRIPTION OF THE DRAWING
[0027] FIG. 1 is a block diagram illustrating a configuration of a
first embodiment.
[0028] FIG. 2 is a flowchart illustrating speech recognition
processing according to the first embodiment.
[0029] FIG. 3 is a block diagram illustrating a configuration of a
second embodiment.
[0030] FIG. 4 is a block diagram illustrating a configuration of a
third embodiment.
[0031] FIG. 5 is a block diagram illustrating a configuration of a
fourth embodiment.
[0032] FIG. 6 is a block diagram illustrating a configuration of a
fifth embodiment.
[0033] FIG. 7 is a block diagram illustrating an example of a
configuration of a speech recognition system.
[0034] FIG. 8 is a block diagram illustrating an example of a
configuration of the speech recognition system having a
client-server structure.
REFERENCE SIGNS LIST
[0035] 110 client (speech recognition request device) [0036] 111
utterance extraction unit (utterance extraction means) [0037] 112
feature vector extraction unit (feature vector extraction means)
[0038] 113 feature vector conversion unit (feature vector
conversion means) [0039] 114 phoneme ID conversion unit (phoneme ID
conversion means) [0040] 115 data transmission unit (data
transmission means) [0041] 116 search result reception unit (search
result reception means) [0042] 117 recognition result construction
unit (recognition result construction means) [0043] 118 database
(data recording means) [0044] 120 server (speech recognition
device) [0045] 121 data reception unit (data reception means)
[0046] 122 speech recognition unit (speech recognition means) [0047]
122a acoustic likelihood computation unit (acoustic likelihood
computation means) [0048] 122b hypothesis search unit (hypothesis
search means) [0049] 123 search result transmission unit (search
result transmission means)
BEST MODE FOR EMBODYING THE INVENTION
[0050] Next, a mode for embodying the invention is described in
detail by referring to the accompanying drawings. Note that, to clarify the description, descriptions related to inputs, control processing, display, communications, and the like, all of which have little to do with this invention, are simplified or omitted. Here, to facilitate an understanding of the invention, the premises of the first embodiment are summed up. [0051] A content (information) to be made secret
includes an uttered content (information converted into data)
itself and a content that can be uttered (information related to
utterance: information to be used for speech recognition). [0052]
The former is caused to leak by restoring speech, and the latter is
caused to leak by decrypting vocabulary information included in a
language model or other such operation. [0053] The speech can be
restored from an acoustic feature although incompletely. [0054]
Even if the speech itself cannot be restored, one that knows
details of the acoustic feature can restore an utterance content
although incompletely by performing corresponding speech
recognition processing. [0055] Normally, a speech recognition
server provider knows what kind of feature is used during an
operation of a recognition processing unit of a recognition server
provided by itself. [0056] Therefore, at least the speech
recognition server provider can restore the utterance content from
the acoustic feature. [0057] The vocabulary information included in
the language model normally includes at least pronunciation
information, and in most cases, further includes a representation
character string. [0058] Normally, the pronunciation information
represents data that can be converted into a phoneme ID string
corresponding to an acoustic model to be used by a given procedure
or the phoneme ID string itself. [0059] In the former case, the
recognition processing unit of the recognition server is supposed
to know the conversion procedure. [0060] Therefore, at least the
speech recognition server provider can decrypt the vocabulary
information included in the language model. [0061] In the latter
case, phoneme IDs cannot seemingly be decrypted by a human, but the
one that knows the details of the acoustic model can grasp phonemes
indicated by the respective phoneme IDs. [0062] Normally, the
acoustic model is difficult for a user to construct, and is
generally constructed and provided by the speech recognition server
provider or another provider. [0063] That is, the speech
recognition server provider or another acoustic model provider
knows details of the phoneme IDs. [0064] In other words, the speech
recognition server provider can know the details of the phoneme IDs
without being noticed by the user. [0065] Therefore, at least the
speech recognition server provider can decrypt the vocabulary
information included in the language model.
[0066] From the above-mentioned points, in order to carry out
secret speech recognition via the network, consideration should be
made about at least one of the following measures: [0067] using the acoustic
feature the details of which cannot be easily known by the speech
recognition server provider; and [0068] using the phoneme IDs the
details of which cannot be easily known by the speech recognition
server provider, in addition to general leak prevention of speech
data on a communication channel.
[0069] FIG. 1 illustrates a configuration of the first embodiment
of this invention. By referring to FIG. 1, the first embodiment of
this invention includes a client 110 and a server 120.
[0070] Each thereof includes components for performing the
following operations:
[0071] The client 110 includes an utterance extraction unit 111, a
feature vector extraction unit 112, a feature vector conversion
unit 113, a phoneme ID conversion unit 114, a data transmission
unit 115, a search result reception unit 116, and a recognition
result construction unit 117. Further included therein is a
database 118, which stores an acoustic model, a language model, and
conversion/reconstruction data. The conversion/reconstruction data
is used by the feature vector conversion unit 113, the phoneme ID
conversion unit 114, and the recognition result construction unit
117. Note that, the conversion/reconstruction data may be
previously set in the feature vector conversion unit 113, the
phoneme ID conversion unit 114, and the recognition result
construction unit 117.
[0072] The utterance extraction unit 111 extracts a speech from
acoustic sound and outputs the speech as the speech data. For
example, a segment that involves actual utterance (utterance
segment) is extracted from the acoustic data by discriminating it from
a segment that does not (silent segment). Further, noise is
separated from the speech and eliminated.
[0073] The feature vector extraction unit 112 extracts a set
(feature vector) of the acoustic features such as cepstrum, power,
and .DELTA. power from the speech data.
[0074] The feature vector conversion unit 113 converts the feature
vector into data having a format that disables the third party from
capturing or perceiving a content thereof. At this time, the
feature vector conversion unit 113 performs conversion processing
so as to guarantee that, in a case where an acoustic likelihood
computation unit 122a of the server 120 performs an acoustic
likelihood calculation on the data after conversion by using the
appropriately-converted acoustic model, an output result thereof
has the same value as or an approximate value to an output result
obtained from a combination of the acoustic model before the
conversion and the feature vector. Examples of the conversion
include shuffling the order of feature vector and adding a
dimension that is redundant and can be ignored in terms of the
calculation.
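The two conversions named as examples above can be sketched as follows: a secret permutation of the feature dimensions plus ignorable padding dimensions. The seed, the padding value, and the function names are assumptions for illustration; the acoustic model would be permuted with the same order so that the per-dimension terms still line up.

```python
import random

def make_permutation(n, seed):
    """Secret permutation of feature indices; the seed never leaves the client."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def convert_feature_vector(vector, order, n_pad=2):
    """Shuffle the feature order and append redundant, ignorable dimensions."""
    shuffled = [vector[i] for i in order]
    return shuffled + [0.0] * n_pad  # padding the converted model ignores

order = make_permutation(3, seed=42)
print(convert_feature_vector([0.5, 1.2, -0.3], order))
```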
[0075] The phoneme ID conversion unit 114 converts the acoustic
model and the phoneme IDs of the language model into the data
having a format that disables the third party from perceiving
contents thereof. Further, information unnecessary for the speech
recognition processing performed on the server 120 is deleted from
the acoustic model and the language model. In addition, depending
on the content of the conversion processing, information necessary
for restoration thereof is recorded in the database 118 as the
conversion/reconstruction data. Examples of the conversion and
deletion include shuffling the phoneme IDs and word IDs and
deleting the representation character string and the like from the
language model. The kind of conversion processing to be performed
may be supplied in advance or may be dynamically determined.
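A minimal sketch of the shuffling and deletion just described, assuming a toy language model layout; the inverse maps play the role of the conversion/reconstruction data recorded in the database 118.

```python
import random

def shuffle_ids(ids, seed):
    """Build a forward ID map by shuffling, and the inverse map for restoration."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    forward = dict(zip(ids, shuffled))
    return forward, {v: k for k, v in forward.items()}

# Toy language model: word ID -> (phoneme ID string, representation string).
lm = {10: ((0, 1, 0), "hello"), 11: ((1, 1), "world")}

pid_map, pid_inv = shuffle_ids([0, 1], seed=7)
wid_map, wid_inv = shuffle_ids([10, 11], seed=8)

# Converted language model: both ID kinds shuffled, representation strings
# deleted; pid_inv and wid_inv are kept as conversion/reconstruction data.
lm_converted = {wid_map[w]: tuple(pid_map[p] for p in phonemes)
                for w, (phonemes, _representation) in lm.items()}
print(lm_converted)
```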
[0076] Note that, processing operations of the feature vector
conversion unit 113 and the phoneme ID conversion unit 114 are
described later in detail.
[0077] The data transmission unit 115 transmits the converted data
such as the feature vector, the acoustic model, and the language
model to the server 120 as appropriate.
[0078] The search result reception unit 116 receives the output of
a speech recognition unit 122 such as a maximum-likelihood word ID
string via a search result transmission unit 123 of the server
120.
[0079] The recognition result construction unit 117 references the
conversion/reconstruction data recorded in the database 118
regarding the maximum-likelihood word ID string received from the
search result reception unit 116 to restore the data subjected to
the conversion by the phoneme ID conversion unit 114. For example,
in the case where the word IDs have been shuffled, conversion
reverse thereto is performed to reconstruct the word IDs within the
language model before the conversion. The recognition result
construction unit 117 references the language model before the
conversion by using the thus-restored word IDs to thereby construct
the recognition result being the same as a recognition result
obtained by an existing system. That is, almost without affecting a
speech recognition result, the server 120 that performs the speech
recognition can be disabled from capturing the content of the data
used for the speech recognition.
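Continuing the same toy setup, the restoration performed by the recognition result construction unit can be sketched as the inverse word ID mapping followed by a lookup in the pre-conversion language model; the concrete maps below are invented.

```python
# Pre-conversion language model and inverse word ID map (hypothetical values,
# stored on the client as conversion/reconstruction data).
lm = {10: ((0, 1, 0), "hello"), 11: ((1, 1), "world")}
wid_inv = {23: 10, 7: 11}  # shuffled word ID -> original word ID

def construct_result(server_word_ids):
    """Restore shuffled word IDs and rebuild a perceivable recognition result."""
    words = []
    for shuffled_id in server_word_ids:
        original_id = wid_inv[shuffled_id]           # reverse the shuffle
        _phonemes, representation = lm[original_id]  # pre-conversion lookup
        words.append(representation)
    return " ".join(words)

print(construct_result([23, 7]))  # -> "hello world"
```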
[0080] The server 120 includes a data reception unit 121, the
speech recognition unit 122, and the search result transmission
unit 123.
[0081] The data reception unit 121 receives the data used for the
speech recognition from the client 110. Note that, the data used
for the speech recognition which are received in this embodiment
are converted data which include the feature vector, the acoustic
model, and the language model.
[0082] The speech recognition unit 122 references the acoustic
model and the language model to make a search for a
maximum-likelihood word string regarding a feature vector sequence.
Note that, the speech recognition unit 122, which is to be
described in detail, is divided into the acoustic likelihood
computation unit 122a and a hypothesis search unit 122b.
[0083] The acoustic likelihood computation unit 122a obtains an
acoustic likelihood of the feature vector regarding the respective
phonemes within the acoustic model. The hypothesis search unit 122b
uses the acoustic likelihood and a language likelihood to obtain
the maximum-likelihood word ID string (=phoneme ID string). Note
that, an implementation for collectively evaluating those
processing steps may be employed.
[0084] The search result transmission unit 123 transmits the output
of the speech recognition unit 122 such as the maximum-likelihood
word ID string to the client 110.
[0085] Next, an overall operation example of this embodiment is
described in detail by referring to FIG. 2. In the following
description, (C) indicates a client device, and (S) indicates a
server device. Upon reception of an input of sound or a start
instruction for speech recognition, the client device and the
server device start the speech recognition and operate as follows.
[0086] 1. (C) The phoneme ID conversion unit 114 converts the
acoustic model and the phoneme IDs of the language model into the
data having a format that disables the third party from perceiving
or capturing contents thereof. The phoneme ID conversion unit 114
records information necessary for restoration corresponding to the
content of the conversion processing in the database 118 as the
conversion/reconstruction data. For example, the phoneme ID
conversion unit 114 generates the acoustic model obtained by
converting the phoneme IDs and the feature vector and the language
model obtained by similarly converting the phoneme IDs and deleting
the vocabulary information other than the phoneme ID string. In
addition, the information used for the restoration performed by the
recognition result construction unit 117 is recorded as the
conversion/reconstruction data in the database 118. Note that the
conversion processing is described later in detail. [0087] 2. (C)
The data transmission unit 115 transmits the acoustic model
(acoustic model after conversion) and the language model (language
model after conversion) that have been generated after the
conversion to the server 120 as information for speech recognition.
[0088] 3. (C) The utterance extraction unit 111 cuts out a speech
segment from the input sound (speech) in parallel with the
above-mentioned processing steps 1 and 2. [0089] 4. (C) The feature
vector extraction unit 112 computes a group (feature vector) of
acoustic features within respective minute segments (frames) of the
cut-out speech segment. [0090] 5. (C) The feature vector conversion
unit 113 converts the computed feature vector into a data structure having a format that disables the third party from capturing a content thereof and that also allows a normal or valid processing result to be constructed from a recognition processing result of the speech recognition unit 122. Note that, the
conversion is described later in detail. [0091] 6. (C) The data
transmission unit 115 transmits the converted feature vector
(feature vector after conversion) to the server 120 as the
information for speech recognition.
[0092] Note that, the above-mentioned processing steps 1 and 2 and
the above-mentioned processing steps 3 to 6 may be performed in
parallel with each other. [0093] 7. (S) The data reception unit 121
receives the information for speech recognition after conversion
such as the acoustic model after conversion, the language model
after conversion, and the feature vector after conversion from the
client 110. [0094] 8. (S) The speech recognition unit 122 searches
for the maximum-likelihood word ID string regarding the feature
vector sequence while referencing the acoustic model and the
language model that have been received. Note that, an example of
search processing is described later in detail. [0095] 9. (S) The
search result transmission unit 123 transmits the word ID string
and the like to the client 110 as speech recognition result data
obtained as a search result. As appropriate, the search result
transmission unit 123 also transmits N word ID strings (N-best)
that are top-ranked in the likelihood or score, likelihood
information on the word ID strings, a search space itself (lattice
or word graph), or the like together. [0096] 10. (C) The search
result reception unit 116 receives the word ID string of the search
result and the like (speech recognition result data) from the
server 120. [0097] 11. (C) The recognition result construction unit
117 acquires word information corresponding to the respective word
IDs of the word ID string from the language model before
conversion, and generates the final word string of the recognition
result. As necessary, the N-best, the word graph, and the like are
processed in the same manner.
[0098] Here, details of the search processing are described below.
[0099] 8-1. (S) The acoustic likelihood computation unit 122a
performs processing for obtaining the acoustic likelihoods
regarding the respective phonemes included in the acoustic model
(acoustic model after conversion) for each feature vector. [0100]
8-2. (S) Further, the acoustic likelihood computation unit 122a
references a word (word ID) regarding the phoneme ID string
corresponding to a pronunciation of any one of words included in the
language model (language model after conversion), and performs
computation processing for a likelihood (language likelihood)
obtained from information on adequacy of the word ID string
similarly included in the language model. [0101] 8-3. (S) The
hypothesis search unit 122b performs the search processing for the
word ID string that gives the greatest likelihood to a feature
vector string while referencing the above-mentioned the acoustic
likelihood and the language likelihood. [0102] 8-4. (S) Note that,
the hypothesis search unit 122b may perform arbitrary rescoring
processing as necessary and assume the word ID string having the
highest score as a result thereof as the search result.
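As a toy illustration of steps 8-1 to 8-4, the brute-force sketch
below scores each word by its summed acoustic log-likelihood plus its
language log-likelihood and picks the maximum; a real decoder uses
dynamic programming over frames, and all values and names here are
invented for illustration only.

```python
import math

acoustic_ll = {0: -1.0, 1: -3.0, 2: -0.5}   # phoneme ID -> log-likelihood
lexicon = {10: [0, 2], 11: [1, 2]}          # word ID -> phoneme ID string
language_ll = {10: math.log(0.7), 11: math.log(0.3)}

def score(word_id):
    """Acoustic likelihood of the pronunciation plus language likelihood."""
    return sum(acoustic_ll[p] for p in lexicon[word_id]) + language_ll[word_id]

best = max(lexicon, key=score)              # maximum-likelihood word ID
print(best)                                 # -> 10
```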
[0103] Next, an operation of one conversion processing (conversion
processing using a mapping function) for the feature vector and the
acoustic model is described in detail. Note that, information on
the mapping function and the like described below is described
within the conversion/reconstruction data. Further, a processing
method using the mapping function may be previously stored in the
respective units.
[0104] The conversion of the feature vector and the acoustic model
using the mapping function which is performed by the feature vector
conversion unit 113 and the phoneme ID conversion unit 114 relates
to an operation of the speech recognition unit 122, in particular,
the acoustic likelihood computation unit 122a included therein.
Described below as an example is a process for recovery to the
valid processing result in the case of using the mapping
function.
[0105] The processing performed by the acoustic likelihood
computation unit 122a is processing for obtaining the likelihood of
the feature vector with respect to the respective phonemes. This can
be expressed as processing that employs an acoustic likelihood
function D:

l_A(V) = D(V, A) = (D(V, A_1), D(V, A_2), ..., D(V, A_M)) = (l_{A_1}, ..., l_{A_M})

where V represents the feature vector, A represents the acoustic
model, and M kinds of phonemes are included therein.
[0106] When the conversion of the feature vector and the acoustic
model which is performed by the feature vector conversion unit 113
and the phoneme ID conversion unit 114 is expressed by a given
mapping function F=(f_v,f_a), a property required for f_v and f_a
is that D(f_v(V),f_a(A))=D(V,A) always holds true with regard to an
arbitrary feature vector V.
[0107] If the above-mentioned statement holds true,
l_A(V) = D(V, A) = D(f_v(V), f_a(A)) = l_{f_a(A)}(f_v(V))
is derived, and hence even if the feature vector and the acoustic
model that are converted by using the mapping function F are used,
completely the same recognition result as that before conversion
can be obtained.
[0108] A plurality of examples of the mapping function that
satisfies such a property are taken.
[0109] The feature vector, if being a vector of N features, can be
expressed by the following expression.
V = (v_1, ..., v_N)
[0110] Now, if the acoustic likelihood of the feature vector
regarding a given phoneme is given by a total sum of the
likelihoods regarding respective elements of the feature vector,
the following expression holds true.
l_{A_j}(V) = D(V, A_j) = D(v_1, A_{1,j}) + ... + D(v_N, A_{N,j}) = sum_i D(v_i, A_{i,j})
[0111] Here, it is assumed that f_v shifts the suffixes of the
respective elements of the feature vector one by one, moving the N-th
element to the first position. That is, the shift is caused as in the
following expression.

f_v((v_1, ..., v_N)) = (v_N, v_1, ..., v_{N-1})
[0112] Meanwhile, if f_a is a function that shifts the model
regarding the i-th feature within the acoustic model to the (i+1)-th
position,

f_a((A_{1,j}, ..., A_{N,j})) = (A_{N,j}, A_{1,j}, ..., A_{N-1,j})

is derived, and at this time,

D(f_v(V), f_a(A_j)) = D(v_N, A_{N,j}) + D(v_1, A_{1,j}) + ... + D(v_{N-1}, A_{N-1,j}) = sum_i D(v_i, A_{i,j}) = D(V, A_j)

is derived.
[0113] In general, if the acoustic likelihood is linear with
respect to the likelihoods regarding respective elements of the
feature vector, a mapping (k-shift function) that shifts the
elements of the feature vector by k satisfies the required
property. In addition, the order itself has no meaning, and hence a
mapping (shuffle function) that converts the order of the elements
of the feature vector into an arbitrary order satisfies the
required property as well.
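The following short Python check illustrates the shuffle case,
modeling the per-element likelihood D(v_i, A_{i,j}) as a simple
product (an assumption made only for this sketch): permuting the
feature vector and the model identically leaves D(V, A) unchanged.

```python
import random

def D(V, A_j):
    """Total acoustic likelihood: a sum of per-element terms (assumed)."""
    return sum(v * a for v, a in zip(V, A_j))

V   = [0.3, 1.2, -0.7, 2.0]          # feature vector
A_j = [1.0, 0.5, -2.0, 0.25]         # model for one phoneme j

perm = list(range(len(V)))
random.shuffle(perm)                 # the secret shuffle function

f_v = [V[i] for i in perm]           # shuffle the feature vector ...
f_a = [A_j[i] for i in perm]         # ... and the model the same way

assert abs(D(V, A_j) - D(f_v, f_a)) < 1e-12
```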
[0114] Next, an example of another function is taken. It is assumed
that the acoustic likelihood is defined as described above and
that

D(v_i, α·A_{i,j}) = α·D(v_i, A_{i,j})

[0115] and

sum_k D(c_k, c_k^{-1}) = 0

both hold true. Here, c_k and c_k^{-1} are a group of known values
that satisfy the above-mentioned expression.
[0116] If the mappings (f_v, f_a) are given as

f_v((v_1, ..., v_N)) = (v_1, ..., v_N, c_1, ..., c_L, v_1)

f_a((A_{1,j}, ..., A_{N,j})) = (A_{1,j}/2, ..., A_{N,j}, c_1^{-1}, ..., c_L^{-1}, A_{1,j}/2)

respectively,

D(f_v(V), f_a(A_j))
= D(v_1, A_{1,j}/2) + ... + D(v_N, A_{N,j}) + D(c_1, c_1^{-1}) + ... + D(c_L, c_L^{-1}) + D(v_1, A_{1,j}/2)
= D(v_1, A_{1,j})/2 + ... + D(v_N, A_{N,j}) + 0 + D(v_1, A_{1,j})/2
= sum_i D(v_i, A_{i,j}) = D(V, A_j)

is derived.
[0117] In general, if the acoustic likelihood is linear with
respect to the likelihoods regarding respective elements of the
feature vector, and if a combination of the value of the feature
for which the total sum of the acoustic likelihoods becomes zero
and the model regarding the feature is known, it is possible to
increase the number of apparent dimensions of the feature vector by
using the combination.
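The dimension extension can be checked numerically. In the sketch
below, the bilinear D is again an assumed stand-in, the padding
constants c_k and their counterparts c_k^{-1} are chosen so that
their contributions cancel, and the first model term is split in half
exactly as in paragraph [0116].

```python
def D(V, A_j):
    """Assumed bilinear per-element likelihood, summed over elements."""
    return sum(v * a for v, a in zip(V, A_j))

V   = [0.3, 1.2, -0.7]
A_j = [1.0, 0.5, -2.0]

c     = [2.0, -1.0]                  # padding features c_1, ..., c_L
c_inv = [1.0, 2.0]                   # chosen so sum_k D(c_k, c_inv_k) == 0
assert sum(x * y for x, y in zip(c, c_inv)) == 0

f_v = V + c + [V[0]]                                  # v_1 appended again
f_a = [A_j[0] / 2] + A_j[1:] + c_inv + [A_j[0] / 2]   # A_{1,j} split in half

assert abs(D(V, A_j) - D(f_v, f_a)) < 1e-12
```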
[0118] Further, in general, if the acoustic likelihood is linear
with respect to the likelihoods regarding respective elements of
the feature vector, and if an acoustic likelihood function
D(v_i,A_{i,j}) regarding the respective features is also linear, it
is possible to increase the number of apparent dimensions of the
feature vector by dividing a given feature into a plurality of
elements.
[0119] If the acoustic likelihood computation unit 122a is
established on the basis of an acoustic likelihood function
exhibiting such a property, arbitrarily many of the mapping functions
required by the embodiment of this invention can be given by
combining "shuffling of the feature vector" and "extension of the
number of apparent dimensions" as described above.
[0120] Naturally, even the acoustic likelihood function having a
different property from the one exemplified herein can be used as
the system described in the embodiment of this invention as long as
the mapping F=(f_v,f_a) that satisfies D(f_v(V),f_a(A))=D(V,A) can
be defined.
[0121] Further, even when D(V,A) and D(f_v(V),f_a(A)) do not
completely match each other, if an error therebetween is
sufficiently small, the embodiment of this invention can be
realized by using such a mapping F'=(f_v,f_a).
[0122] As described above, even if the feature vector conversion
unit 113 and the phoneme ID conversion unit 114 convert the feature
vector and the acoustic model by using the mapping function, the
speech recognition unit 122 of the server 120 can obtain the
recognition result the same as or approximate to the case where
such conversion is not performed.
[0123] Next, the conversion processing for the acoustic model and
the language model is described in detail.
[0124] The conversion for the acoustic model and the language model
performed by the phoneme ID conversion unit 114 relates to the
inside of the speech recognition unit 122, in particular, relates
to the operation of the hypothesis search unit 122b.
[0125] In the processing of the hypothesis search unit 122b, it is
necessary to determine whether or not a given phoneme string
a_1, ..., a_N forms a given word w.
[0126] In other words, with regard to the language model L having M
words, a lookup function that returns any one of 0 and 1 in relation
to all the words w included in L can be expressed as the following
expression.

S_L(a_1, ..., a_N) = T(L, a_1, ..., a_N) = {e_1, ..., e_M}, where e_j ∈ {0, 1}

[0127] Here, e_j indicates whether the word w_j is formed by the
phoneme string (e_j = 1) or not (e_j = 0).
[0128] At first glance, this function seems to have an extremely high
calculation load, but it can be computed speedily by using a TRIE
structure or the like, as in the sketch below.
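A minimal TRIE over phoneme ID strings might look as follows; the
dictionary-of-dictionaries layout and the "word" marker key are
implementation choices made only for this illustration.

```python
trie = {}

def insert(phonemes, word_id):
    """Add one pronunciation (a phoneme ID string) to the TRIE."""
    node = trie
    for p in phonemes:
        node = node.setdefault(p, {})
    node["word"] = word_id           # mark a complete pronunciation

def lookup(phonemes):
    """Walk the TRIE once; return the word ID formed, or None."""
    node = trie
    for p in phonemes:
        if p not in node:
            return None
        node = node[p]
    return node.get("word")

insert([3, 1, 4], word_id=7)
insert([3, 1], word_id=2)
print(lookup([3, 1, 4]))             # -> 7
print(lookup([3, 2]))                # -> None
```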
[0129] In actuality, the phoneme ID string and the word ID are often
used instead of the phoneme string itself and the word itself,
respectively, but both correspond to the phoneme and the word on a
one-to-one basis, and hence only the phoneme and the word are
described below.
[0130] If the conversion for the acoustic model and the language
model performed by the phoneme ID conversion unit 114 is expressed by
a given mapping function G = (g_l, g_a), the property required for
g_l and g_a is that the following expression always holds true with
respect to an arbitrary phoneme string a_1, ..., a_N, where Λ
represents the acoustic model.

T(L, Λ, a_1, ..., a_N) = T(g_l(L), g_a(Λ), g_a(a_1), ..., g_a(a_N))
[0131] If the above-mentioned expression holds true, the following
expression holds true, and hence completely the same recognition
result as in the case of using the acoustic model and the language
model before conversion can be obtained even by using the acoustic
model and the language model converted by the mapping function G.

S_{L,Λ}(a_1, ..., a_N) = T(L, Λ, a_1, ..., a_N) = T(g_l(L), g_a(Λ), g_a(a_1), ..., g_a(a_N)) = S_{g_l(L), g_a(Λ)}(g_a(a_1), ..., g_a(a_N))
[0132] In the same manner as the mapping regarding the
above-mentioned feature vector, such a mapping as to shuffle the
phoneme IDs or the word IDs satisfies this property.
[0133] Further, when there is a phoneme ID p_i corresponding to a
given phoneme a_i, such a mapping as to add a new phoneme ID p_i'
corresponding to the phoneme a_i also satisfies this property.
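The following sketch checks the shuffle case for the lookup function
T, assuming a toy lexicon keyed by word ID; the word-ID mapping g_l
is omitted for brevity, and all names are illustrative.

```python
g_a = {0: 2, 1: 0, 2: 3, 3: 1}                   # secret phoneme-ID mapping

lexicon = {10: (0, 2), 11: (1, 3)}               # word ID -> phoneme IDs

def T(lex, phonemes):
    """Return the set of word IDs formed by the phoneme string."""
    return {w for w, pron in lex.items() if pron == tuple(phonemes)}

# Apply the same mapping to the lexicon and to the query string.
g_lexicon = {w: tuple(g_a[p] for p in pron) for w, pron in lexicon.items()}

query = (0, 2)
mapped_query = tuple(g_a[p] for p in query)
assert T(lexicon, query) == T(g_lexicon, mapped_query)   # both -> {10}
```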
[0134] The above-mentioned two conversion processing steps can, after
all, be regarded as conversion processing steps that satisfy the
following requirements.
Requirements:
[0135] When a mapping function Φ = {φ} used for the conversion maps a
data structure X and a data structure Y to φ_x{X} and φ_y{Y},
respectively, with regard to a function F(X,Y) used by the
recognition processing unit, F(X,Y) and F(φ_x{X}, φ_y{Y}) constantly
have the same values.
[0136] Specific examples of F include:
[0137] (feature vector) + (acoustic model) → (acoustic likelihood),
[0138] where X represents the feature vector and Y represents the
acoustic model; and
[0139] (phoneme ID string) + (acoustic model) + (language model) →
(word establishment vector), where X represents the acoustic model
and Y represents the language model.
[0140] Note that, if the implementation of the speech recognition
unit 122, in particular, the hypothesis search unit 122b, is
expressed as a search problem that regards the likelihood as a score
and obtains a path exhibiting the highest score, only the magnitude
relationship between the likelihoods needs to be preserved. Hence
what actually matters in the conversion performed on the feature
vector and the acoustic model is such a property that:
[0141] not the equivalence of F(X,Y) and F(φ_x{X}, φ_y{Y}),
[0142] but the ratio between F(X,Y) and F(φ_x{X}, φ_y{Y}) is
constantly fixed. Therefore, in the case of using such a speech
recognition unit 122, the above-mentioned requirements are relaxed.
Further, no matter what kind of speech recognition unit is used, a
sufficiently small error between F(X,Y) and F(φ_x{X}, φ_y{Y}) can be
permitted because such an error hardly affects the recognition
precision.
[0143] On the other hand, in the conversion performed on the
phoneme ID, the acoustic model, and the language model, the ratio
equality or error is not enough to satisfy the requirements, and
the equivalence is strictly required. Otherwise, an adverse
influence is exerted on the recognition precision.
[0144] Next, the conversion processing for the language model is
described in detail.
[0145] In the conversion for the language model performed by the
phoneme ID conversion unit 114, the information related to the
respective words included in the language model is basically deleted,
other than the information on the phoneme ID string (with the phoneme
IDs also converted, as described above, by the mapping function).
This not only achieves secrecy but is also effective in reducing the
communication amount.
[0146] However, if there is other data to be referenced by the speech
recognition unit 122 (information that affects the speech recognition
processing result), it is desirable that such data not be deleted.
Examples thereof include data such as part-of-speech information of
the word and class information to which the word belongs. Note that,
a speech recognition unit 122 that requests data that may be involved
in a leak of the word information should not be used for the speech
recognition processing. For example, it is assumed that a speech
recognition unit 122 that requests a display character string of the
word is not used in this embodiment. In a case of wishing to use a
speech recognition processing unit that requests such data at any
cost, the leak can be avoided by a method of, for example, performing
the mapping in the same manner as for the phoneme IDs and the word
IDs.
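A minimal sketch of this stripping step follows; the entry layout and
the set of kept fields are assumptions made for illustration, not a
format defined by this application.

```python
full_entry = {
    "surface": "confidential",       # display string: must not leave client
    "pronunciation": [5, 1, 9, 2],   # phoneme IDs (already mapped)
    "pos": "ADJ",                    # referenced by the recognizer: keep
    "definition": "secret; private", # not referenced: delete
}

KEEP = {"pronunciation", "pos"}

def strip(entry):
    """Keep only fields the server-side recognizer genuinely needs."""
    return {k: v for k, v in entry.items() if k in KEEP}

uploaded = strip(full_entry)
print(uploaded)   # {'pronunciation': [5, 1, 9, 2], 'pos': 'ADJ'}
```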
[0147] Next described are a timing for the feature vector
conversion and the phoneme ID conversion and a timing to switch the
conversion operation.
[0148] The feature vector conversion is executed each time when a
new feature vector is obtained.
[0149] The conversion of the acoustic model and the phoneme IDs of
the language model may be performed once prior to the speech
recognition as described above.
[0150] However, continuous use of the model converted by the same
mapping function increases a risk that the mapping function may be
conjectured by using a statistical method or the like.
[0151] Therefore, the secrecy against the third party is enhanced
by periodically switching a behavior of the conversion operation
such as changing the mapping function to another one.
[0152] Specifically, the switching may be performed at the timing
of once every several utterances or once every several minutes. On
the other hand, if a calculation amount necessary for the
conversion operation and the communication amount for transmitting
the model after conversion to the server are taken into
consideration, it is not appropriate to perform the switching very
frequently.
[0153] The timing and frequency of the switching may be set in
consideration of the overhead (the calculation amount necessary for
the conversion operation and the communication amount for
transmitting the model after conversion to the server) that occurs
due to frequent switching. Further, the switching may be performed as
appropriate at a timing at which the processing amount or the
communication amount is low, for example, during a silent segment.
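One possible shape of such a policy is sketched below: the mapping is
re-derived from a fresh seed every few utterances, trading secrecy
against the cost of re-sending the converted models. The threshold
and the counter-based trigger are illustrative assumptions.

```python
import random

SWITCH_EVERY = 5                     # utterances per mapping; a tunable cost
utterance_count = 0
seed = 0

def current_mapping(num_ids):
    """Derive the phoneme-ID permutation from the current seed."""
    rng = random.Random(seed)
    perm = list(range(num_ids))
    rng.shuffle(perm)
    return perm

def on_utterance():
    """Advance the counter; switch the mapping every SWITCH_EVERY turns."""
    global utterance_count, seed
    utterance_count += 1
    if utterance_count % SWITCH_EVERY == 0:
        seed += 1                    # new mapping: converted models re-sent

for _ in range(12):
    on_utterance()
print(seed, current_mapping(4))      # the mapping has been switched twice
```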
[0154] Next described are effects of the embodiment for performing
the conversion using the mapping function described above.
[0155] The embodiment for performing the conversion using the
mapping function is configured to convert the feature vector by the
mapping function and then transmit the feature vector to the
server, and hence even if the third party obtains the feature
vector on the communication channel or the server, it can be made
difficult for the third party to immediately restore the speech
therefrom.
[0156] On the other hand, the acoustic model is also converted by a
mapping function selected so as to return, for the converted feature
vector, the same acoustic likelihood as before the conversion, which
guarantees that the same acoustic likelihood is computed, in other
words, that the same recognition result is obtained, as in the case
where the feature vector is not converted.
[0157] Further, the above-mentioned mode is configured to avoid
transmitting the information on the representation character string
within the information on the respective word entries included in
the language model to the server and to also convert the phoneme ID
string indicating the pronunciation of the word entry by the
mapping function and then transmit the phoneme ID string to the
server. Hence, even if the third party that knows the structure of
the language model obtains the phoneme ID string, it can be made
difficult for the third party to immediately know the information
such as the pronunciation and surface form of the word included
therein.
[0158] On the other hand, the acoustic model is also converted by a
mapping function selected so as to return, with regard to the same
phoneme string, the same word outcome as the language model before
conversion, which guarantees that the same outcome regarding the
word, in other words, the same recognition result, is obtained as in
the case where the language model is not converted.
[0159] Next, a second embodiment is described by referring to FIG.
3. Note that, to clarify the description, descriptions of the same
parts as those of the first embodiment are simplified or
omitted.
[0160] FIG. 3 is a block diagram illustrating a configuration of
the second embodiment. A speech recognition system according to the
second embodiment includes a plurality of speech recognition
servers. Further, an information processing device that requests
for the speech recognition is also a server.
[0161] The plurality of speech recognition servers correspond to
mutually different items of converted acoustic recognition
information data (in the figure, types A, B, and C). The server
that requests for the speech recognition previously stores
specifications of respective acoustic recognition servers, and
stores the converted acoustic recognition information data to be
transmitted to the respective acoustic recognition servers. Note
that, such specifications of the acoustic recognition server and
the like may be managed integrally with the
conversion/reconstruction data or may be managed by another
method.
[0162] Even such a configuration enables the speech recognition to
be performed on the speech acquired by the server that requests for
the speech recognition while achieving the secrecy against the
third party. An operation example thereof is described below.
[0163] The server that requests for the speech recognition uses the
respective units to carry out utterance extraction processing and
feature vector extraction processing, then selects the acoustic
recognition server to be used, converts the information for speech
recognition into data having such a format that enables the
recovery to the valid processing result corresponding to the
selected acoustic recognition server, and transmits the data to the
selected acoustic recognition server.
[0164] The server that requests for the speech recognition uses the
respective units to construct result data returned from the
acoustic recognition server into the speech recognition result
being a valid recognition result and output the resultant.
[0165] At this time, a shuffling method and the acoustic
recognition server to be a transmission destination are switched as
necessary or with the lapse of time.
[0166] Next, a third embodiment is described by referring to FIG.
4. Note that, to clarify the description, descriptions of the same
parts as those of the first and second embodiments are simplified
or omitted.
[0167] FIG. 4 is a block diagram illustrating a configuration of
the third embodiment. A plurality of speech recognition servers of
a speech recognition system according to the third embodiment
provide only the service of hypothesis search processing.
Alternatively, the speech recognition servers may be capable of
performing both acoustic likelihood detection processing and
hypothesis search processing while providing only the hypothesis
search processing as a service.
[0168] The information processing device that requests for the
speech recognition includes an acoustic likelihood detection unit,
and is enabled to perform a distance calculation.
[0169] The plurality of speech recognition servers perform
requested speech recognition processing (acoustic likelihood
detection processing and hypothesis search processing)
respectively, and return the result thereof. A requesting terminal
that requests for the speech recognition previously stores
specifications of respective acoustic recognition servers, and
stores the converted acoustic recognition information data to be
transmitted to the respective acoustic recognition servers. Note
that, such specifications of the acoustic recognition server and
the like may be managed integrally with the
conversion/reconstruction data or may be managed by another
method.
[0170] Even such a configuration enables the speech recognition to
be performed on the speech acquired by the requesting terminal that
requests for the speech recognition while achieving the secrecy
against the third party. An operation example thereof is described
below.
[0171] The requesting terminal that requests for the speech
recognition uses the respective units to carry out utterance
extraction processing, feature vector extraction processing, and
acoustic likelihood detection processing, then selects the acoustic
recognition server to be used, converts information on detected
acoustic likelihood and the information for speech recognition into
data having such a format that enables the recovery to the valid
processing result corresponding to the selected acoustic
recognition server, and transmits the data to the selected acoustic
recognition server.
[0172] Subsequently, the requesting terminal uses the respective
units to construct result data returned from the acoustic
recognition server into the speech recognition result being a valid
recognition result and output the resultant.
[0173] At this time, a shuffling method and the acoustic
recognition server to be a transmission destination are switched as
necessary or with the lapse of time.
[0174] Such a configuration can omit shuffling processing for the
acoustic model or the transmission of the acoustic model. That is,
if the terminal has such a calculation ability as to perform
acoustic likelihood computation processing, the communication
amount can be compressed.
[0175] Next, a fourth embodiment is described by referring to FIG.
5. Note that, to clarify the description, descriptions of the same
parts as those of other embodiments are simplified or omitted.
[0176] FIG. 5 is a block diagram illustrating a configuration of
the fourth embodiment. A plurality of speech recognition servers of
a speech recognition system according to the fourth embodiment each
provide a speech recognition service.
[0177] The information processing device that requests for the
speech recognition includes an utterance dividing unit for
extracting the feature vector by performing time division on the
sound (speech) input thereto. Note that, instead of the time
division for the feature vector, division may be performed in units
of clauses or words of the speech.
[0178] The information processing device that requests the speech
recognition (requesting server) performs shuffling or the like on the
sequence relationship between the divided items of speech data, and
then subjects the resultant data to the conversion as the information
for speech recognition. The converted data are transmitted separately
to the plurality of speech recognition servers, and the results
returned from the respective speech recognition servers are
collectively reconstructed.
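A minimal sketch of this divide-shuffle-dispatch flow is given below;
the servers are faked as local functions, and every name is
illustrative.

```python
import random

segments = ["seg0", "seg1", "seg2", "seg3"]      # time-divided speech data
order = list(range(len(segments)))
rng = random.Random(7)
rng.shuffle(order)                                # secret sequence mapping

servers = [lambda s: f"result({s})" for _ in segments]

# Transmit segment order[i] to server i; no server sees the whole utterance.
shuffled_results = [servers[i](segments[order[i]]) for i in range(len(order))]

# Reconstruct: put the result for original position j back at index j.
restored = [None] * len(order)
for i, j in enumerate(order):
    restored[j] = shuffled_results[i]
print(restored)   # results in the original time order
```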
[0179] Even such a configuration enables the speech recognition to
be performed on the speech acquired by the terminal that requests
for the speech recognition while achieving the secrecy against the
third party.
[0180] At this time, a time-division interval, the shuffling
method, and the acoustic recognition server to be the transmission
destination are switched as necessary.
[0181] With such a configuration, only partial speech is transmitted
to each individual speech recognition server, and hence the
restoration becomes more difficult as the number of speech
recognition servers operated in parallel increases.
[0182] Next, a fifth embodiment is described by referring to FIG.
6. Note that, to clarify the description, descriptions of the same
parts as those of other embodiments are simplified or omitted.
[0183] FIG. 6 is a block diagram illustrating a configuration of a
fifth embodiment. A speech recognition system according to the
fifth embodiment has a mode in which the speech recognition server
including the acoustic likelihood detection unit is used to
generate result data on the acoustic likelihood and transfer the
result data to another speech recognition server including the
hypothesis search unit. Further, the speech recognition system may
be configured such that a secret speech identification device
instructs the speech recognition server including the acoustic
likelihood detection unit to perform the transfer itself. Further,
the speech recognition system may be configured such that the
result data on the acoustic likelihood to be transferred is divided
and transferred to the plurality of speech recognition servers each
including the hypothesis search unit.
[0184] Even the above-mentioned configuration enables the speech
recognition to be performed on the speech acquired by the device
that requests for the speech recognition while achieving the
secrecy against the third party.
[0185] Next, a sixth embodiment is described. Note that, to clarify
the description, descriptions of the same parts as those of the
other embodiments are simplified or omitted.
[0186] In the sixth embodiment, the speech data or the feature
extracted on the secret speech identification device serving as a
client is divided, the sequence relationship therebetween is
shuffled, and the respective servers are requested for the speech
recognition. The secret speech identification device subjects the
speech recognition results sent from the respective servers to
inverse processing to the shuffling performed before transmission,
and reconstructs the content being the valid recognition result.
That is, the secret speech identification device carries out the
processing up to feature vector extraction and reconstruction
processing, while the server carries out the others.
[0187] Such an operation can reduce communication load and load on
the secret speech identification device.
[0188] Next described is an embodiment that does not use the mapping
function. This embodiment has a feature of deleting, from the
dictionary, the word or the concatenation information on words for
which a leak of information is feared. That is, unlike the other
embodiments, the entry including the pronunciation information
(=phoneme ID string information) is completely deleted.
Alternatively, the entry may not be included in the language model in
the first place. As a result, the server that performs the speech
recognition cannot detect even a trace of the existence of the
word.
[0189] A client terminal caused to perform the speech recognition
receives the speech recognition result from the server, and in
response to the result, executes second recognition processing for
inserting the word and the concatenation information on words deleted
from the dictionary. That is, information the leak of which is
feared, and which is therefore not included in the recognition result
sent from the server, is regained through second speech recognition
processing (search processing).
[0190] A second speech recognition unit is provided within a
recognition result construction unit, and uses the recognition
result output by the speech recognition unit (first speech
recognition unit) on the server as an input. This means that the
input may be the word ID string having the maximum likelihood
(=maximum-likelihood word ID string), the word ID strings
exhibiting the top-N likelihoods (N-best), or the word graph. In
the word graph, the word and its likelihood (one or both of
language likelihood and acoustic likelihood or other standard score
such as reliability) are assigned to each arc appearing in a graph
structure generated halfway through the search processing, and the
search processing is processing for finding a path exhibiting the
highest total sum of the likelihoods.
[0191] The recognition result construction unit converts those into
the word string, and further converts the word string into the
phoneme string by using the pronunciation information. By
performing the processing in this manner, only one phoneme string
is obtained in the case where the maximum-likelihood word ID string
is used as an input, and otherwise a plurality of phoneme strings
are obtained.
[0192] Meanwhile, the word and a word concatenation deleted for
fear of the leak are also converted into the phoneme string. Then,
the second speech recognition unit takes out the phoneme strings
from the recognition result returned from the server, and searches
the phoneme strings for a segment that matches the phoneme string
of the deleted word and word concatenation.
[0193] In this search processing, not only a strict match but also an
ambiguous match can be performed if a confusion matrix, which is a
table of the discrimination difficulty between a given phoneme and
another phoneme, is separately provided. For example, if f is
difficult to tell from v, an occurrence of f at a position where the
phoneme string of the deleted segment has v (or vice versa) may be
regarded as the same and handled as a match, as the sketch below
illustrates.
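The sketch below scans the phoneme string returned by the server for
a segment matching a deleted word's pronunciation, treating
confusable pairs as equal; the confusion set and all names are
illustrative stand-ins for the confusion matrix described above.

```python
CONFUSABLE = {("f", "v"), ("v", "f")}

def phonemes_match(a, b):
    """Strict match, or an ambiguous match via the confusion set."""
    return a == b or (a, b) in CONFUSABLE

def find_deleted(result, deleted):
    """Return the start index where `deleted` matches inside `result`."""
    for start in range(len(result) - len(deleted) + 1):
        window = result[start:start + len(deleted)]
        if all(phonemes_match(x, y) for x, y in zip(window, deleted)):
            return start
    return None

result_phonemes = ["s", "i", "f", "t"]              # from server's hypothesis
deleted_word    = ["s", "i", "v", "t"]              # withheld from the server
print(find_deleted(result_phonemes, deleted_word))  # -> 0 (ambiguous match)
```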
[0194] If the processing in the above-mentioned manner is performed
to find the phoneme string that matches the word or the word
concatenation for which the leak is feared from the recognition
result sent from the server (first recognition unit), the
recognition result construction unit constructs the valid
recognition result by replacing (inserting) the word or the word
concatenation into the corresponding part.
[0195] As a merit of this method, the mapping for the word ID
becomes unnecessary with the result that uploading of only the
acoustic model and the dictionary suffices. In other words, by
performing the processing in the above-mentioned manner, even if a
strict language model prepared by the server is used, the secrecy
can be ensured. Note that, the strict language model occupies most
of the capacity of a broad-sense language model, which produces a
remarkable effect in the reduction of the communication bandwidth
between the server and the client.
[0196] Next, yet another embodiment is described. This embodiment is
configured so that the client terminal neither executes the acoustic
likelihood calculation nor uploads the acoustic model. That is, the
extraction of the feature and the acoustic likelihood calculation are
carried out on the server and the results are transmitted, while the
search processing is carried out on the client terminal. At this
time, the acoustic data transmitted from the client terminal to the
server are kept secret by an encryption operation that can be
decrypted by the server and by a mapping operation of mapping the
content into data which cannot be perceived or captured by the
server.
[0197] Such a configuration effectively operates as means for
performing client-server speech recognition that guarantees the
secrecy without particularly converting the language model.
[0198] As described above, according to this invention, the
following effects can be obtained.
[0199] The first effect is the ability to reduce the risk that the
utterance content of a speaker may be leaked to the third party. This
is because, even if the third party acquires the intermediate data
(feature vector, phoneme ID string, and word ID string) obtained by
converting the speech data, the third party needs to know the details
of how the phoneme IDs and the like have been converted in order to
restore the speech. Performing the conversion appropriately can
therefore make it difficult for the third party to restore the speech
data.
[0200] The second effect is the ability to reduce the risk that
special information related to a task or domain may be leaked from
the language model to the third party. This is because the language
model temporarily retained on the server includes only minimum word
information such as the phoneme IDs after conversion, and the details
of the conversion of the phoneme IDs are unknown to the server, which
can make it difficult for the third party to know the details of the
content of the language model.
[0201] Note that, as has already been described, the third party
referred to herein also includes the speech recognition service
provider. Therefore, the indirect effects of this invention include
the ability to perform the speech recognition in the form of a
network service even for speech whose secrecy is demanded extremely
strongly, for example, speech related to privacy or a trade
secret.
[0202] Note that, by using the technology described above, the
speech recognition system may be configured in the following
manner.
[0203] A speech recognition system, including: a first information
processing device including a speech recognition processing unit
for receiving data to be used for speech recognition transmitted
via a network, carrying out speech recognition processing, and
returning resultant data; and a second information processing
device connected to the first information processing device via the
network, for transmitting the data to be used for the speech
recognition by the speech recognition processing unit after
performing mapping thereof by using a mapping function unknown to
the first information processing device, and constructing a speech
recognition result by modifying, based on the mapping function
used, the resultant data returned from the first information
processing device into the same result as a result of performing
the speech recognition without using the mapping function.
[0204] A speech recognition system, including a plurality of
information processing devices that are connected to one another
via a network and include a speech recognition processing unit in
at least one information processing device. The requesting
information processing device converts at least one data structure
of data to be used for speech recognition processing by the speech
recognition processing unit by using a mapping function and
transmits the resultant to the information processing device
including the speech recognition processing unit. The information
processing device including the speech recognition processing unit
carries out the speech recognition processing based on the
converted data structure and transmits a result thereof. The
requesting information processing device constructs the result of
carrying out the speech recognition processing which is affected by
the mapping function into a result of carrying out the speech
recognition processing which is not affected by the mapping
function.
[0205] A speech recognition system, which is configured by using the
mapping function Φ in which, if Φ = {φ} is used as the mapping
function, and when a data structure X and a data structure Y are
mapped to φ_x{X} and φ_y{Y}, respectively, with regard to a function
F(X,Y) used by the speech recognition processing unit, values of
F(X,Y) and F(φ_x{X}, φ_y{Y}) are constantly the same, or a difference
therebetween is constantly less than a given threshold value.
[0206] A speech recognition system, which is configured by using the
mapping function Φ in which, if Φ = {φ} is used as the mapping
function, and when the data structure X and the data structure Y are
mapped to φ_x{X} and φ_y{Y}, respectively, with regard to the
function F(X,Y) used by the speech recognition processing unit, the
ratio between F(X,Y) and F(φ_x{X}, φ_y{Y}) is constantly fixed.
[0207] A speech recognition system, which is configured by using
the mapping function in which: with regard to a reference
relationship between an index that refers to specific data included
in a given data structure and a reference destination, a
destination to which a given arbitrary index refers before mapping
does not necessarily match a destination to which the same index
refers after the mapping; and it is guaranteed that data at the
reference destination to which any one of indices refers before the
mapping is always referred to by any one of the indices after the
mapping.
[0208] A speech recognition system, which is configured by using
the mapping function which indicates shuffling of indices that
refer to the specific data included in the given data
structure.
[0209] A speech recognition system, which is configured by using
the mapping function which adds an arbitrary number of indices to
the specific data included in the given data structure.
[0210] A speech recognition system, in which at least one item of
data to be used for speech recognition which is subjected to
mapping by using the mapping function is retained before the
mapping only on an information processing device for inputting a
sound to be subjected to the speech recognition.
[0211] A speech recognition system, in which the data to be used by
the speech recognition processing unit has a structure to which at
least one selected from the group consisting of a structure of an
acoustic model, a structure of a language model, and a structure of
a feature vector is mapped.
[0212] A speech recognition system, in which: indices indicating
respective features included in the feature vector are mapped by
using the mapping function given by a device for inputting a sound
to be subjected to speech recognition; and indices to models
associated with respective features within the acoustic model are
mapped by using the mapping function given by the device for
inputting the sound to be subjected to the speech recognition.
[0213] A speech recognition system, in which: phoneme IDs being
indices to phonemes included in the acoustic model are mapped by
using the mapping function given by the device for inputting the
sound; phoneme ID strings indicating pronunciations of respective
words included in the language model are mapped by using the
mapping function given by the device for inputting the sound; and
at least information on representation character strings of the
respective words included in the language model is deleted.
[0214] A speech recognition system, in which word IDs being indices
to the respective words included in the language model are mapped
by using the mapping function given by the device for inputting the
sound.
[0215] A speech recognition system, in which an information
processing device for inputting speech data includes at least an
acoustic likelihood computation unit and is configured to: map
phoneme ID strings indicating pronunciations of respective words
included in the language model by using the mapping function given
by the information processing device for inputting speech data, and
delete at least information on representation character strings of
the respective words included in the language model; compute
acoustic likelihoods of all known phonemes or necessary phonemes
for each frame of the speech data to generate a sequence of a group
of the phoneme IDs and acoustic likelihoods that are mapped by
using the mapping function given by the information processing
device for inputting speech data; and transmit the sequence of the
group of the mapped phoneme IDs and acoustic likelihoods and the
language model after the mapping to the information processing
device including a hypothesis search unit.
[0216] A speech recognition system, in which an information
processing device for inputting speech data is configured to:
divide the speech data into blocks; map a time sequence among the
divided blocks by using the mapping function given by the
information processing device for inputting speech data; transmit
the blocks of speech to an information processing device for
performing speech recognition based on the time sequence after the
mapping; receive any one of a feature vector or a sequence of a
group of phoneme IDs and acoustic likelihoods from the information
processing device for performing the speech recognition; and
restore the time sequence by using an inverse function to the
mapping function given by the information processing device for
inputting speech data.
[0217] Further, specific configurations of this invention are not
limited to the above-mentioned embodiments, and changes within the
scope that does not depart from the gist of the invention are also
included in this invention. For example, a combination of the
respective characteristics of the above-mentioned embodiments may
be included in this invention.
[0218] Further, the respective units of a speech recognition
request device may be realized by hardware or by using a
combination of hardware and software. In the mode that combines
hardware and software, the respective units and various means are
realized by causing a speech recognition program to be expanded in
a RAM and hardware such as a CPU to be operated according to the
program. Further, the program may be distributed by being recorded
on a recording medium. The program recorded on the recording medium
is read into a memory in a wired manner, in a wireless manner, or
via the recording medium itself to cause a control unit and the
like to operate. Note that, examples of the recording medium
include an optical disc, a magnetic disk, a semiconductor memory
device, and a hard disk.
[0219] This invention can be applied for the purpose of increasing
the secrecy in all the applications for performing the
client-server speech recognition.
[0220] For example, this invention can be applied for constructing
a SaaS-based speech recognition system for recognizing the speech
including a trade secret. Further, this invention can be applied
for constructing a SaaS-based speech recognition system for the
speech high in privacy such as a diary.
[0221] Further, for example, in a case of constructing a
speech-controlled online store website that allows a menu selection
and the like to be performed by speech, if the website is
constructed by using the SaaS-based speech recognition system using
this invention, the user can keep his/her purchase history and the
like from being known by at least a SaaS-based speech recognition
system provider. This serves as a merit for a webmaster of the
speech-controlled online store website in that a fear of the leak
of customer information decreases.
[0222] Further, from the viewpoint of the SaaS-based speech
recognition system provider, the use of this invention eliminates the
need to retain, even temporarily, the speech of users and a language
model including a vocabulary corresponding to personal information on
the users on the self-managed speech recognition server, which can
avoid unintended leaks of the personal information to a cracker or
the like.
[0223] This application claims priority from Japanese Patent
Application No. 2009-247874, filed on Oct. 28, 2009, the entire
disclosure of which is incorporated herein by reference.
* * * * *