U.S. patent application number 13/861020 was published by the patent office on 2015-10-01 as publication number 20150279351 for keyword detection based on acoustic alignment.
The applicant listed for this patent is Google Inc. The invention is credited to Patrick An Phu Nguyen, Maria Carolina Parada San Martin, Johan Schalkwyk.
Application Number | 13/861020 |
Publication Number | 20150279351 |
Family ID | 54191269 |
Publication Date | 2015-10-01 |
United States Patent Application | 20150279351 |
Kind Code | A1 |
Nguyen; Patrick An Phu; et al. | October 1, 2015 |
KEYWORD DETECTION BASED ON ACOUSTIC ALIGNMENT
Abstract
Embodiments pertain to automatic speech recognition in mobile
devices to establish the presence of a keyword. An audio waveform
is received at a mobile device. Front-end feature extraction is
performed on the audio waveform, followed by acoustic modeling,
high level feature extraction, and output classification to detect
the keyword. Acoustic modeling may use a neural network or Gaussian
mixture modeling, and high level feature extraction may be done by
aligning the results of the acoustic modeling with expected event
vectors that correspond to a keyword.
Inventors: | Nguyen; Patrick An Phu; (Palo Alto, CA) ; San Martin; Maria Carolina Parada; (Palo Alto, CA) ; Schalkwyk; Johan; (Scarsdale, NY) |
Applicant: |
Name | City | State | Country | Type |
Google Inc. | Mountain View | CA | US | |
Family ID: | 54191269 |
Appl. No.: | 13/861020 |
Filed: | April 11, 2013 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
61788749 | Mar 15, 2013 | |
61786251 | Mar 14, 2013 | |
61739206 | Dec 19, 2012 | |
Current U.S. Class: | 704/251 |
Current CPC Class: | G10L 15/08 20130101; G10L 2015/088 20130101; G10L 15/02 20130101 |
International Class: | G10L 15/02 20060101 G10L015/02 |
Claims
1. A computer-implemented method comprising: receiving a plurality
of audio frame vectors that each model an audio waveform during a
different period of time; selecting a non-empty subset of the audio
frame vectors; obtaining a corresponding non-empty subset of
detected acoustic event vectors that results from coding the subset
of the audio frame vectors; aligning the detected acoustic event
vectors and a set of expected event vectors that correspond to a
keyword to generate an output feature vector that characterizes an
acoustic match between the detected acoustic event vectors and the
expected event vectors; and inputting the output feature vector
into a keyword classifier.
2. The method of claim 1, further comprising: determining, using
the keyword classifier, that a keyword was present in the audio
waveform during an overall period of time modeled by the audio
frame vectors.
3. The method of claim 1, wherein the audio frame vectors are coded
using a neural network.
4. The method of claim 1, wherein the audio frame vectors are coded
using a Gaussian mixture model.
5. The method of claim 1, wherein aligning comprises extracting
features to characterize the acoustic match, the features
comprising one or more of: length of alignment, number of phones
aligned, frame distance across phone boundaries, probability of the
duration of each phone with respect to average duration of a phone
in training data, speaker speaking rate, average acoustic score,
worst acoustic score, best acoustic score, standard deviation of
acoustic scores, start frame of the alignment, stability of the
alignment, binary features representing changes related to the
difference between detected acoustic events and expected acoustic
events, and binary features representing changes related to the
difference between detected acoustic events and acoustic events in
an alignment window.
6. The method of claim 1, further comprising: producing a plurality
of audio frame vectors by performing front-end feature extraction
on an acoustic signal.
7. A system comprising: one or more computers and one or more
storage devices storing instructions that are operable, when
executed by the one or more computers, to cause the one or more
computers to perform operations comprising: receiving a plurality
of audio frame vectors that each model an audio waveform during a
different period of time; selecting a non-empty subset of the audio
frame vectors; obtaining a corresponding non-empty subset of
detected acoustic event vectors that results from coding the subset
of the audio frame vectors; aligning the detected acoustic event
vectors and a set of expected event vectors that correspond to a
keyword to generate an output feature vector that characterizes an
acoustic match between the detected acoustic event vectors and the
expected event vectors; and inputting the output feature vector
into a keyword classifier.
8. The system of claim 7, wherein the operations further comprise:
determining, using the keyword classifier, that a keyword was
present in the audio waveform during an overall period of time
modeled by the audio frame vectors.
9. The system of claim 7, wherein the audio frame vectors are coded
using a neural network.
10. The system of claim 7, wherein the audio frame vectors are
coded using a Gaussian mixture model.
11. The system of claim 7, wherein aligning comprises extracting
features to characterize the acoustic match, the features
comprising one or more of: length of alignment, number of phones
aligned, frame distance across phone boundaries, probability of the
duration of each phone with respect to average duration of a phone
in training data, speaker speaking rate, average acoustic score,
worst acoustic score, best acoustic score, standard deviation of
acoustic scores, start frame of the alignment, stability of the
alignment, binary features representing changes related to the
difference between detected acoustic events and expected acoustic
events, and binary features representing changes related to the
difference between detected acoustic events and acoustic events in
an alignment window.
12. The system of claim 7, the operations further comprising:
producing a plurality of audio frame vectors by performing
front-end feature extraction on an acoustic signal.
13. A non-transitory computer-readable medium storing software
comprising instructions executable by one or more computers which,
upon such execution, cause the one or more computers to perform
operations comprising: receiving a plurality of audio frame vectors
that each model an audio waveform during a different period of
time; selecting a non-empty subset of the audio frame vectors;
obtaining a corresponding non-empty subset of detected acoustic
event vectors that results from coding the subset of the audio
frame vectors; aligning the detected acoustic event vectors and a
set of expected event vectors that correspond to a keyword to
generate an output feature vector that characterizes an acoustic
match between the detected acoustic event vectors and the expected
event vectors; and inputting the output feature vector into a
keyword classifier.
14. The medium of claim 13, wherein the operations further
comprise: determining, using the keyword classifier, that a keyword
was present in the audio waveform during an overall period of time
modeled by the audio frame vectors.
15. The medium of claim 13, wherein the audio frame vectors are
coded using a neural network.
16. The medium of claim 13, wherein the audio frame vectors are
coded using a Gaussian mixture model.
17. The medium of claim 13, wherein aligning comprises extracting
features to characterize the acoustic match, the features
comprising one or more of: length of alignment, number of phones
aligned, frame distance across phone boundaries, probability of the
duration of each phone with respect to average duration of a phone
in training data, speaker speaking rate, average acoustic score,
worst acoustic score, best acoustic score, standard deviation of
acoustic scores, start frame of the alignment, stability of the
alignment, binary features representing changes related to the
difference between detected acoustic events and expected acoustic
events, and binary features representing changes related to the
difference between detected acoustic events and acoustic events in
an alignment window.
18. The medium of claim 13, the operations further comprising:
producing a plurality of audio frame vectors by performing
front-end feature extraction on an acoustic signal.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Application No. 61/788,749, filed Mar. 15, 2013, U.S. Provisional
Application No. 61/786,251, filed Mar. 14, 2013 and U.S.
Provisional Application No. 61/739,206, filed Dec. 19, 2012, which
are incorporated herein by reference.
FIELD
[0002] This specification describes technologies related to voice
recognition.
BACKGROUND
[0003] Automatic speech recognition is an important technology that
is used in mobile devices. One task that is a common goal for this
technology is to be able to use voice commands to wake up and have
basic spoken interactions with the device. For example, it may be
desirable to recognize a "hotword" that signals that the mobile
device should activate when the mobile device is in a sleep
state.
SUMMARY
[0004] The methods and systems described herein provide keyword
recognition that is fast and low latency, power efficient,
flexible, and optionally speaker adaptive. A designer or user can
choose the keywords. Embodiments include various systems directed
towards robust and efficient keyword detection.
[0005] In general, one innovative aspect of the subject matter
described in this specification can be embodied in a process that
is performed by a data processing apparatus. The process includes
receiving a plurality of audio frame vectors that each model an
audio waveform during a different period of time, selecting a
non-empty subset of the audio frame vectors, obtaining a
corresponding non-empty subset of detected acoustic event vectors
that results from coding the subset of the audio frame vectors,
aligning the detected acoustic event vectors and a set of expected
event vectors that correspond to a keyword to generate an output
feature vector that characterizes an acoustic match between the
detected acoustic event vectors and the expected event vectors, and
inputting the output feature vector into a keyword classifier.
[0006] Other embodiments include corresponding systems, apparatus,
and computer programs, configured to perform the actions of the
method, encoded on computer storage devices.
[0007] These and other embodiments may each optionally include one
or more of the following features. For instance, the process may
include determining, using the keyword classifier, that a keyword
was present in the audio waveform during an overall period of time
modeled by the audio frame vectors. Embodiments include those in which the audio frame vectors are coded using a neural network and those in which the audio frame vectors are coded using
a Gaussian mixture model.
[0008] After aligning, the system extracts features to characterize
the acoustic match, the features comprising one or more of: length
of alignment, number of phones aligned, frame distance across phone
boundaries, probability of the duration of each phone with respect
to average duration of a phone in training data, speaker speaking
rate, average acoustic score, worst acoustic score, best acoustic
score, standard deviation of acoustic scores, start frame of the
alignment, stability of the alignment, binary features representing
changes related to the difference between detected acoustic events
and expected acoustic events, and binary features representing
changes related to the difference between detected acoustic events
and acoustic events in an alignment window. The process may also
include producing a plurality of audio frame vectors by performing
front-end feature extraction on an acoustic signal.
[0009] Particular embodiments of the subject matter described in
this specification can be implemented so as to realize one or more
of the following advantages. Embodiments provide a way to recognize whether or not a keyword was uttered using a simple design that can obtain good results while minimizing the
need for processing and power resources.
[0010] The details of one or more embodiments of the subject matter
described in this specification are set forth in the accompanying
drawings and the description below. Other potential features,
aspects, and advantages of the subject matter will become apparent
from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 is a block diagram 100 that illustrates dataflow in
an embodiment.
[0012] FIG. 2 is a block diagram 200 that illustrates dataflow in a
front-end feature extraction process.
[0013] FIG. 3 is a block diagram 300 that illustrates dataflow in
an acoustic modeling process.
[0014] FIG. 4 is a block diagram 400 that illustrates dataflow in a
high-level feature extraction process.
[0015] FIG. 5 is a block diagram 500 that illustrates dataflow in
an output classification process.
[0016] FIG. 6 is a flowchart 600 of the stages involved in an
example process for detecting keyword utterances in an audio
waveform.
[0017] FIG. 7 is a block diagram 700 of an example system that can
detect keyword utterances in an audio waveform.
[0018] Like reference numbers and designations in the various
drawings indicate like elements.
DETAILED DESCRIPTION
[0019] When using a mobile device, it is desirable to provide a way
of turning the device on or performing another action based on the
utterance of a keyword. For example, if a user says "Google," it
may cause a smartphone to activate. However, it requires power to
constantly monitor and process the audio received by the mobile
device, and hence it is important to provide an approach for
recognizing whether or not the keyword has been uttered while
minimizing the power consumption needed to "listen" for the
keyword.
[0020] Embodiments may listen for keywords while minimizing
resource usage through a variety of approaches. For example, a
variety of acoustic modeling techniques may be used to obtain
feature vectors that represent audio received at the mobile device.
In addition, certain embodiments may use a high-level feature extraction module based on acoustic
match and alignment. The input features obtained from a front-end
feature extraction module are converted into detected acoustic
events in real-time. Embodiments operate by finding an alignment of
the detected acoustic events with expected acoustic events that
would signify the presence of the keyword. The expected acoustic
events represent a standard dictionary pronunciation for the
keyword of interest. After aligning the events, embodiments are
able to extract features to characterize the acoustic match, which
will be described in greater detail below. However, some
implementations only extract features when an initial alignment is
found, thereby reducing high-level feature computation.
[0021] While some implementations discussed elsewhere in this specification detect a single
keyword, implementations are not necessarily limited to detecting
one keyword. In fact, some implementations may be used to detect a
plurality of keywords. The keywords in these implementations may
also be short phrases. Such implementations allow a user to select
one of a certain number of actions, such as actions presented in a
menu, by saying one of the menu entries. For example,
implementations may use different keywords to trigger different
actions such as taking a photo, sending an email, recording a note,
and so on. Given a finite number of words and/or phrases to be
detected, which will ordinarily not exceed 20 or so, this
technology may be used. However, other implementations may be
adapted to handle more words and/or phrases if required.
[0022] At a high level, one system embodiment comprises four
modules. Module 1 is a front-end feature extraction module, which
performs: a) speech activity detection; b) windowing of the
acoustic signal; c) short-term Fourier transform; d) spectral
subtraction, optionally; e) filter bank extraction; and f)
log-energy transform of the filtered output. Module 2 is an
acoustic model, which can be one of: a) a neural network, which may be truncated of its last layers; or b) a Gaussian mixture model (GMM). In module 2, the input features may be converted
into acoustic events by forward-propagation through the neural
network (NN). If a Gaussian mixture model is used, it may provide a
probabilistic model for representing the presence of subpopulations
within an overall population in order to code the acoustic events.
Module 3 is a high level feature extraction module based on
acoustic match/alignment. As discussed above, Module 3 finds an
alignment of detected acoustic events with expected acoustic
events. Module 3 also extracts certain information once an
alignment has been found to characterize the match. Module 4 is an
output classifier, which takes as an input the output feature
vector from module 3 and possibly some side information to yield a
binary decision about the presence of the keyword. The output
classifier can be for example: a) a support vector machine or b) a
logistic regression.
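For illustration only, the following minimal Python sketch shows how four such modules might be chained. The stage implementations are trivial stand-ins (a fixed reshaping, a random projection, a distance summary against expected event vectors, and a threshold), not the front-end, acoustic model, aligner, or classifier described in this specification; the sketch shows only how data flows between the stages.

import numpy as np

def front_end_feature_extraction(waveform):
    # Module 1 stand-in: split the waveform into 10 equal-length frames.
    return waveform[: len(waveform) // 10 * 10].reshape(10, -1)

def acoustic_model(frames):
    # Module 2 stand-in: project each frame to a 5-dimensional "event" vector.
    projection = np.random.default_rng(0).normal(size=(frames.shape[1], 5))
    return frames @ projection

def high_level_feature_extraction(events, expected_events):
    # Module 3 stand-in: summarize each event's distance to the nearest
    # expected event as one output feature vector.
    dists = np.linalg.norm(events[:, None, :] - expected_events[None, :, :], axis=-1)
    best = dists.min(axis=1)
    return np.array([best.mean(), best.std(), best.min(), best.max()])

def output_classifier(feature_vector, threshold=1.0):
    # Module 4 stand-in: binary decision about keyword presence.
    return bool(feature_vector[0] < threshold)

waveform = np.random.default_rng(1).normal(size=1600)
expected = np.zeros((3, 5))  # hypothetical expected event vectors for a keyword
features = high_level_feature_extraction(acoustic_model(front_end_feature_extraction(waveform)), expected)
print(output_classifier(features))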
[0023] Various embodiments will now be discussed in connection with
the drawings to explain their operation.
[0024] FIG. 1 is a block diagram 100 that illustrates dataflow in
an embodiment. The data flow begins with an audio waveform 102.
Audio waveform 102 represents audio received by an embodiment. For
example, audio waveform 102 may be an analog or digital
representation of sound in the environment of an embodiment that is
captured by a microphone. Once audio waveform 102 is introduced
into the embodiment, it is sent to front-end feature extraction
module 104. Front-end feature extraction module 104 performs a
series of stages, detailed in FIG. 2, that take audio waveform 102
and transform it into a series of vectors for further processing.
Once front-end feature extraction module 104 has done the
processing of audio waveform 102, its output is sent to acoustic
modeling module 106. Acoustic modeling module 106 may use a variety
of techniques, detailed in FIG. 3, to perform coding on the inputs
to produce acoustic event vectors that are representative of
features of audio waveform 102 over a period of time. The acoustic
event vectors from acoustic modeling module 106 are sent to a high-level feature extraction module 108, which, as detailed in FIG. 4, finds an alignment for the acoustic event vectors and further analyzes characteristics of audio waveform 102 over a time interval to produce an output feature vector for detecting whether the keyword was uttered. After the
acoustic event vectors are aligned, the output feature vector is
sent to output classifier module 110 to make a determination about
whether the keyword is present, as is discussed in FIG. 5.
[0025] Various system embodiments are similar in their overall
structure. They include modules that use similar architectures to
accomplish similar goals: 1) front-end feature extraction, 2)
acoustic model, 3) higher level feature extraction module, and a 4)
classifier module. However, there are several embodiments that
differ in certain respects.
[0026] Embodiments approach the problem of keyword detection in
advantageous ways. For example, one embodiment has the advantage
that it only extracts features when a first level alignment is
found, reducing high level feature computation. The approaches used
in these systems are advantageous because they only require adapting a few parameters to change the keywords matched or to adapt to a given speaker's voice.
[0027] FIG. 2 is a block diagram 200 that illustrates dataflow in a
front-end feature extraction process. Audio waveform 102, as
illustrated in FIG. 2, includes analog and/or digital information
about incoming sound that an embodiment can analyze to detect the
presence of a keyword. One way to capture audio waveform 102 for
analysis is to divide it up into a plurality of analysis windows.
For example, FIG. 2 shows an analysis window 204 that uses a vector
to represent audio waveform 102 over a time period that is chosen
as the size of analysis window 204, for example a 25 ms time
period. Multiple analysis windows are obtained in succession by
performing an analysis window shift 206, for example a 10 ms time
period. Analysis windows may be chosen to overlap. For example, one
analysis window may represent audio waveform 102 from a start time
of 0 ms to an end time of 25 ms, and a subsequent analysis window
may represent audio waveform 102 from a start time of 10 ms to an
end time of 35 ms.
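For illustration only, a short Python sketch of this windowing arithmetic, assuming a 16 kHz sampling rate (the specification does not state a rate):

import numpy as np

def frame_signal(waveform, sample_rate=16000, window_ms=25, shift_ms=10):
    # Slice a 1-D waveform into overlapping analysis windows.
    win = int(sample_rate * window_ms / 1000)   # 400 samples per 25 ms window
    hop = int(sample_rate * shift_ms / 1000)    # 160 samples per 10 ms shift
    n_windows = 1 + max(0, (len(waveform) - win) // hop)
    return np.stack([waveform[i * hop : i * hop + win] for i in range(n_windows)])

# One second of audio yields 98 overlapping windows of 400 samples each.
print(frame_signal(np.zeros(16000)).shape)  # (98, 400)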
[0028] The analysis windows 204 are obtained as part of speech
activity detection 210, in which an embodiment obtains information
about available sound in its environment. Speech activity detection
210 may be designed to occur regardless of whether there is sound
in the surroundings of an embodiment, or it may, for example, occur
only when a volume of sound greater than a threshold volume is
received. Once speech activity detection 210 occurs, it is followed
by windowing of the acoustic signal 220. As discussed, each window
should be a fairly short time interval, such as 25 ms, that
represents characteristics of audio waveform 102 over that time
interval. After windowing, embodiments may perform a fast Fourier
transform 230 on the windowed data so as to analyze the constituent
frequencies present in the audio waveform. Additionally,
embodiments may optionally perform spectral subtraction 240 to
minimize the effects of noise on the information provided by the
other steps. Next, filter bank extraction 250 can allow the
decomposition of the information from the previous steps by using
filters to separate individual components of the audio data from
one another. Finally, performance of a log-energy transform 260 can
help normalize the data in order to make it more meaningful.
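A minimal Python sketch of these steps follows. It assumes a Hamming window and a mel-spaced triangular filter bank, which are common choices rather than requirements of this specification, and it omits the optional spectral subtraction:

import numpy as np

def mel_filterbank(num_filters=40, n_fft=512, sample_rate=16000):
    # Triangular filters spaced evenly on the mel scale (an assumed design).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(num_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_filterbank_features(windows, n_fft=512, sample_rate=16000):
    # Hamming window -> magnitude spectrum -> filter bank -> log energies.
    hamming = np.hamming(windows.shape[1])
    spectrum = np.abs(np.fft.rfft(windows * hamming, n=n_fft, axis=1))
    energies = spectrum @ mel_filterbank(n_fft=n_fft, sample_rate=sample_rate).T
    return np.log(energies + 1e-8)

# 98 windows of 400 samples -> 98 frames of 40 log filter-bank energies.
print(log_filterbank_features(np.zeros((98, 400))).shape)  # (98, 40)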
[0029] The result of the processing performed in FIG. 2 is a moving
window of a stack of frames 270. For example, stack of frames 270
may include 11 frames, each including information about 25 ms of
audio waveform 102, with a shift of 10 ms between frames. However,
it is not necessary to use a stack of 11 frames, and stack of
frames 270 may include as few as 2 frames or any larger number of
frames. The end output of front-end feature extraction 200 is thus
a stack of a plurality of frames 280 that represents features of
audio waveform 102 by performing the aforementioned analytical
techniques to obtain information about characteristics of the audio
waveform 102 for successive time intervals.
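A short Python sketch of building such a stack from per-window feature vectors, assuming 5 frames of left context and 5 of right context; repeating the edge frames at the boundaries is a choice made only for this sketch:

import numpy as np

def stack_frames(features, left=5, right=5):
    # Each output row concatenates a frame with its `left` preceding and
    # `right` following frames (11 frames total with the defaults).
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    width = left + 1 + right
    return np.stack([padded[i : i + width].reshape(-1) for i in range(features.shape[0])])

# 98 frames of 40 values -> 98 stacked vectors of 11 * 40 = 440 values.
print(stack_frames(np.zeros((98, 40))).shape)  # (98, 440)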
[0030] FIG. 3 is a block diagram 300 that illustrates dataflow in
an acoustic modeling process. FIG. 3 begins with stack of the
plurality of frames 280 produced by the process depicted in FIG. 2.
FIG. 3 includes coding 310 of the plurality of frames to produce
coded acoustic events for each stack 320. For example, two ways in which this process may occur are included in FIG. 3: a neural network 310A or a Gaussian mixture model 310B. If a neural
network 310A is used, the neural network may possibly be truncated
of its last layers. The goal of this coding is to produce coded
acoustic events 320, single vectors that each represent a plurality of the initial vectors in the stack of the plurality of frames 280 and carry salient information about features of
the audio waveform 102 over the interval that they model. For
example, the input features may be converted into acoustic events
by coding through the neural network 310A or a Gaussian mixture
model 310B.
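For illustration, a minimal Python sketch of coding stacked frames by forward-propagation through a small feed-forward network. The layer sizes, ReLU hidden activations, and softmax output over phone-like units are assumptions made for the sketch; the specification does not fix a topology:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_propagate(stacked_frames, weights, biases):
    # Each output row is a coded acoustic event vector, here interpreted as
    # posterior probabilities over phone-like units.
    h = stacked_frames
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ w + b)  # hidden layers with ReLU
    return softmax(h @ weights[-1] + biases[-1])

# Toy topology: 440 inputs -> 128 hidden units -> 40 phone-like outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=s) for s in [(440, 128), (128, 40)]]
biases = [np.zeros(128), np.zeros(40)]
events = forward_propagate(np.zeros((98, 440)), weights, biases)
print(events.shape)  # (98, 40); each row sums to 1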
[0031] FIG. 4 is a block diagram 400 that illustrates dataflow in a
high-level feature extraction process. FIG. 4 begins with coded
acoustic events for each stack 320, as produced in FIG. 3. Coded
acoustic events for each stack 320 are aligned with expected event
vectors 410. Expected event vectors 410 include phonemes 420A-420D,
each of which is associated with a standardized pronunciation for
that phoneme. The high-level feature extraction operates by
aligning 420 keyword 430 with coded acoustic events for each stack
320 based on detecting a preliminary alignment match with aligned
phonemes 422. The alignment produces output feature vector 440.
Output vector 440 includes information about the audio waveform 102
that has been distilled and processed so it is easy to draw
conclusions about the presence of the keyword in audio waveform 102
over time window 410 that aligning 420 represents.
[0032] Aligning 420 may be accomplished by decoding with a graph,
which automatically force aligns the audio to a keyword, such as
"computer" or "google." Back-epsilon arcs may allow such a graph to
restart at any point, avoiding misses when the keyword is spoken
while in the middle of the decoding graph. For example,
implementations may generate a confusion network of pronunciations
for the keyword by running a phone loop decoder on positive
examples for the keyword and extract the most frequent
pronunciations.
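The following Python sketch illustrates the idea of force-aligning detected acoustic events with an expected phone sequence. It uses a plain monotonic dynamic-programming alignment rather than the decoding graph with back-epsilon arcs or the confusion network of pronunciations described above, so it is a simplification for illustration only:

import numpy as np

def force_align(log_posteriors, expected_phones):
    # Find the monotonic segmentation of the frames into the expected phone
    # sequence that maximizes the total log posterior.
    num_frames = log_posteriors.shape[0]
    num_phones = len(expected_phones)
    score = np.full((num_frames, num_phones), -np.inf)
    back = np.zeros((num_frames, num_phones), dtype=int)
    score[0, 0] = log_posteriors[0, expected_phones[0]]
    for t in range(1, num_frames):
        for p in range(num_phones):
            stay = score[t - 1, p]
            move = score[t - 1, p - 1] if p > 0 else -np.inf
            back[t, p] = p if stay >= move else p - 1
            score[t, p] = max(stay, move) + log_posteriors[t, expected_phones[p]]
    path = [num_phones - 1]  # backtrace from the last frame in the last phone
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return score[-1, -1], list(reversed(path))  # total score, phone index per frame

# Align 98 frames of posteriors with a hypothetical 4-phone keyword.
rng = np.random.default_rng(0)
log_posteriors = np.log(rng.dirichlet(np.ones(40), size=98))
total, path = force_align(log_posteriors, expected_phones=[7, 12, 3, 25])
print(total, path[:10])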
[0033] Other ways to obtain an alignment are also possible. One way is to extract features in a fixed window: after reaching a stable partial result, force-align a phonetic sequence in that window and extract features from that alignment. An alternative is to use an HMM hotword/garbage model, which may use a high bias and may only extract features if a hotword path is successfully decoded. Yet another way is to force-align or manually align phonetic sequences for positive examples and, for negative examples, to find the alignment whose score satisfies a condition given the current model parameters.
[0034] As part of the aligning 420, high-level feature extraction
400 extracts features to characterize the quality of the acoustic
match. All of these features assume that there is an alignment for
both positive and negative examples with respect to the true
phonetic sequences p_k for keyword k. The extracted information may
include length of alignment, number of phones aligned, frame
distance across phone boundaries, probability of the duration of
each phone with respect to average duration of a phone in training
data, speaker speaking rate, average acoustic score, worst acoustic
score, best acoustic score, standard deviation of acoustic scores,
start frame of the alignment, stability of the alignment, binary
features representing changes related to the difference between
detected acoustic events and expected acoustic events, and/or
binary features representing changes related to the difference
between detected acoustic events and acoustic events in an
alignment window.
[0035] Binary features representing changes related to the
difference between detected acoustic events and expected acoustic
events may include identity/insertions/deletions of detected
acoustic events from a GMM coding process. Binary features
representing changes related to the difference between detected
acoustic events and acoustic events in an alignment window may
include identity/insertions/deletions of detected acoustic events
from a neural network coding process.
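As a simplified, order-insensitive illustration of such binary expectation features, the following Python sketch marks which expected phones were observed, which were deleted, and which observed phones were not expected. The phone identifiers are hypothetical, and a real implementation would derive these features from the alignment rather than from set membership:

def phone_expectation_features(expected_phones, detected_phones, phone_list):
    detected = set(detected_phones)
    expected = set(expected_phones)
    match = [p in detected for p in expected_phones]        # expected and observed
    delete = [p not in detected for p in expected_phones]   # expected but not observed
    insert = [p in detected and p not in expected for p in phone_list]  # observed but not expected
    return match, delete, insert

match, delete, insert = phone_expectation_features(
    expected_phones=[7, 12, 3, 25],
    detected_phones=[7, 12, 25, 30],
    phone_list=range(40),
)
print(match)   # [True, True, False, True]
print(delete)  # [False, False, True, False]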
[0036] Frame distance may be found, given an identified
segmentation or phoneme alignment, by computing the Euclidean
distance d between frames at sequential distances from each phoneme
boundary. The assumption is that if the hotword was uttered, then
the phoneme alignment will be correct, and hence the distance
between neighboring frames across phoneme boundaries will be large.
If the hotword was not uttered, the phoneme alignment will be
incorrect, and hence distance between neighboring frames at phoneme
boundaries will be small. Frame distance may be found using
Equation 1:
\phi_j(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=1}^{|\bar{p}|-1} d(x_{n_l - j}, x_{n_l + j}), \quad j \in \{1, 2, 3, 4\}   (Equation 1)
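A Python sketch of Equation 1, where boundary_frames holds the frame index n_l of each phone boundary in the alignment (the variable names are illustrative):

import numpy as np

def frame_distance(frames, boundary_frames, j):
    # Average Euclidean distance between the frames j steps before and after
    # each phone boundary, normalized by the number of phones.
    num_phones = len(boundary_frames) + 1
    total = 0.0
    for n in boundary_frames:
        total += np.linalg.norm(frames[n - j] - frames[n + j])
    return total / num_phones

# Distances at offsets 1..4 around hypothetical boundaries at frames 20, 45, 70.
frames = np.random.default_rng(0).normal(size=(98, 40))
print([frame_distance(frames, [20, 45, 70], j) for j in (1, 2, 3, 4)])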
[0037] Another feature is phoneme duration score, which computes
the probability of the current duration using a Gaussian
distribution with a mean and standard deviation equal to that of
the average phoneme duration for phonemes encountered in training.
Phoneme duration score may be found using Equation 2:
\phi_6(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=1}^{|\bar{p}|} \log \mathcal{N}(s_{l+1} - s_l;\; \hat{\mu}_{p_l}, \hat{\sigma}_{p_l})   (Equation 2)
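A Python sketch of Equation 2, assuming per-phone duration means and standard deviations estimated from training data and phone start frames s_l taken from the alignment (names are illustrative):

import numpy as np

def phone_duration_score(start_frames, phone_ids, mean_dur, std_dur):
    # Average log Gaussian likelihood of each aligned phone's duration.
    total = 0.0
    for l, phone in enumerate(phone_ids):
        dur = start_frames[l + 1] - start_frames[l]
        mu, sigma = mean_dur[phone], std_dur[phone]
        total += -0.5 * np.log(2 * np.pi * sigma ** 2) - (dur - mu) ** 2 / (2 * sigma ** 2)
    return total / len(phone_ids)

# Hypothetical per-phone duration statistics, in frames.
mean_dur, std_dur = np.full(40, 8.0), np.full(40, 3.0)
print(phone_duration_score([0, 7, 19, 26, 35], [7, 12, 3, 25], mean_dur, std_dur))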
[0038] Another feature is speaker rate change, which captures local changes in speaking rate, given the assumption that changes
should be smooth. It may be found using Equation 3.
\phi_7(\bar{x}, \bar{p}, \bar{s}) = \frac{1}{|\bar{p}|} \sum_{l=2}^{|\bar{p}|} (r_l - r_{l-1})^2   (Equation 3)
[0039] Speaking rate itself may be provided by Equation 4.
r_l = (s_{l+1} - s_l) / \hat{\mu}_{p_l}   (Equation 4)
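A Python sketch combining Equations 3 and 4; the start frames and per-phone mean durations are assumed inputs, and the names are illustrative:

import numpy as np

def speaker_rate_change(start_frames, phone_ids, mean_dur):
    # r_l: each phone's duration divided by its average training duration;
    # the feature is the mean squared change in rate between consecutive phones.
    rates = [
        (start_frames[l + 1] - start_frames[l]) / mean_dur[phone_ids[l]]
        for l in range(len(phone_ids))
    ]
    return sum((rates[l] - rates[l - 1]) ** 2 for l in range(1, len(rates))) / len(rates)

print(speaker_rate_change([0, 7, 19, 26, 35], [7, 12, 3, 25], np.full(40, 8.0)))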
[0040] Sample Code 1, below, illustrates a data structure that might hold information about features of an alignment.
Sample Code 1

message HotwordConfidenceFeature {
  // Description of the first four features in
  // speech/greco3/hotword/feature_extractor.h.
  optional float phone_duration_score = 1 [default = 0.0];
  optional float speaker_rate = 2 [default = 0.0];
  repeated float frame_distance = 3;
  optional float word_duration_frames = 4 [default = 0.0];

  // Baseline system detected hotword.
  optional bool baseline = 6 [default = false];

  // Features inherited from WordConfFeature.
  optional float num_phones = 8 [default = 0.0];
  // From WordConfFeature: word_duration. It corresponds to word duration
  // in frames divided by the number of phones.
  optional float normalized_word_duration = 9 [default = 0.0];
  optional float ascore_mean = 10 [default = 0.0];
  optional float ascore_stddev = 11 [default = 0.0];
  optional float ascore_worst = 12 [default = 0.0];
  optional float ascore_meandiff = 13 [default = 0.0];
  optional float ascore_best = 14 [default = 0.0];
  optional float lm_score = 15 [default = 0.0];
  optional float dur_score = 16 [default = 0.0];
  optional float am_score = 17 [default = 0.0];

  // Start frame of the keyword.
  optional float start_frame = 18 [default = 0.0];

  // Phone expectation match features: u is expected and observed.
  // One feature for each phone in dictionary pronunciation of the hotword.
  repeated bool ph_expectation_align = 19;
  // Same as ph_expectation_align, but phones are detected from nn stream.
  repeated bool ph_expectation_nn = 20;
  // Phone expectation delete features: u is expected but *not* observed.
  // One feature per phone in dictionary pronunciation of the hotword.
  repeated bool ph_expectation_delete_align = 21;
  // Phone expectation insert features: u is *not* expected but observed.
  // One feature per phone in phone-list.
  repeated bool ph_expectation_insert_align = 22;
  // Same as ph_expectation_delete_align and ph_expectation_insert_align resp.
  repeated bool ph_expectation_delete_nn = 24;
  repeated bool ph_expectation_insert_nn = 25;

  // Stability of the partial result.
  optional float stability = 23;
}
[0041] FIG. 5 is a block diagram 500 that illustrates dataflow in
an output classification process. FIG. 5 begins with output vector
440 that has been produced by FIG. 4. Based on output vector 440,
FIG. 5 takes a step to classify output 510 using classification
module 520. For example, classification module 520 may use support
vector machine 520A or logistic regression 520B. The goal of
classification module 520 is to make a binary decision about
whether the keyword was uttered during time window 410 associated
with output vector 440. Classification module 520 produces
classification result 530. This may be an actual classification
decision 550, in terms of a Boolean decision confirming that the
keyword was present or not. Alternatively, the classification result
may also be a score, for example one that represents the likelihood
that the keyword is present. If classification result 530 is a
score, there may be a step to process the result 540 to yield
classification decision 550, for example comparing the score to a
threshold value.
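For illustration, a minimal Python sketch of the logistic-regression option followed by a threshold on the resulting score. The weights, bias, threshold, and feature dimensionality are hypothetical, and a support vector machine could be used instead, as described above:

import numpy as np

def keyword_score(output_vector, weights, bias):
    # Logistic regression: map the output feature vector to a probability-like
    # score that the keyword is present.
    return 1.0 / (1.0 + np.exp(-(output_vector @ weights + bias)))

def classify(output_vector, weights, bias, threshold=0.5):
    # Compare the score to a threshold to yield a binary decision.
    return keyword_score(output_vector, weights, bias) >= threshold

w, b = np.array([0.8, -0.4, 1.2, 0.1]), -0.3  # hypothetical trained parameters
print(classify(np.array([1.0, 0.2, 0.5, -0.1]), w, b))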
[0042] FIG. 6 is a flowchart 600 of the stages involved in an
example process for detecting keyword utterances in an audio
waveform.
[0043] In stage 610, audio frame vectors are received. For example,
stage 610 may be performed as in FIG. 2, such that front-end
feature extraction module 104 processes audio waveform 102 to yield
the vectors, which are represented in FIG. 2 as stack of the
plurality of frames 280.
[0044] In stage 620, subsets of vectors are selected. For example,
stage 620 may be performed as in FIG. 2, such that the processing
of audio waveform 102 yields a stack of the plurality of frames 280
that constitutes the subset of vectors.
[0045] In stage 630, event vectors are obtained by coding. For
example, this step is performed by acoustic modeling module 106 as
in FIG. 3.
[0046] In stage 640, the vectors are aligned. For example, this
step may occur as aligning 420 as in FIG. 4 by high-level feature
extraction module 108.
[0047] In stage 650, the output vector is input to the classifier.
For example, high-level feature extraction module 108 sends its
output, output vector 440, to output classifier module 110 to make
this determination as in FIG. 5.
[0048] FIG. 7 is a block diagram 700 of an example system that can
detect keyword utterances in an audio waveform. The system contains
a variety of constituent parts and modules that may be implemented
through appropriate combinations of hardware, firmware, and
software that allow computing device 700 to function as an
embodiment of appropriate features.
[0049] Computing device 700 contains one or more processors 712
that may include various hardware devices designed to process data.
Processors 712 are communicatively coupled to other parts of
computing device 700. For example, processors 712 may be coupled to
a speaker 702 and a microphone 704 that allow output and input of
audio signals to and from the surroundings of computing device 700.
Microphone 704 is of special import to the functioning of computing
device 700 in that microphone 704 provides the raw signals that
capture aspects of audio waveform 102 that are processed in other
portions of computing device 700. Additionally, computing device
700 may include persistent memory 706. Persistent memory may
include a variety of memory storage devices that allow permanent
retention and storage of information manipulated by processors 712.
Furthermore, input device 708 allows the receipt of commands from a
user, and interface 714 allows computing device 700 to interact
with other devices to allow information exchange. Additionally,
processors 712 may be communicatively coupled to a display 710 that
provides a graphical representation of information processed by
computing device 700 for the user to view.
[0050] Additionally, processors 712 may be communicatively coupled
to a series of modules that perform the functionalities necessary
to implement the method of embodiments that is presented in FIG. 6.
These modules include front-end feature extraction module 716,
which performs as illustrated in FIG. 2, acoustic modeling module
718, which performs as illustrated in FIG. 3, high-level feature
extraction module 720, which performs as illustrated in FIG. 4, and
output classifier module 722, which performs as illustrated in FIG.
5.
[0051] As discussed above, the task of hotword or keyword detection
is an important component in many speech recognition applications.
For example, when the vocabulary size is limited, or when the task
requires activating a device, for example, a phone, by saying a
word, keyword detection is applied to classify whether an utterance
contains a word or not.
[0052] For example, the task performed by some embodiments includes
detecting a single word, for example, "Google," that will activate
a device in standby to perform a task. The device thus needs to be listening for that word at all times. Portable devices, however, commonly have limited battery life and computation capabilities.
Because of this, it is important to design a keyword detection
system that is both accurate and computationally efficient.
[0053] This application begins by presenting embodiments, which
include approaches to recognizing when a mobile device should
activate or take other actions in response to receiving a keyword
as a voice input. The application describes how these approaches
operate and discusses the advantageous results provided by the
approaches. These approaches provide the potential to obtain good
results while using resources efficiently.
[0054] A number of implementations have been described.
Nevertheless, it will be understood that various modifications may
be made without departing from the spirit and scope of the
disclosure. For example, various forms of the flows shown above may
be used, with steps re-ordered, added, or removed.
[0055] Embodiments of the invention and all of the functional
operations described in this specification may be implemented in
digital electronic circuitry, or in computer software, firmware, or
hardware, including the structures disclosed in this specification
and their structural equivalents, or in combinations of one or more
of them. Embodiments of the invention may be implemented as one or
more computer program products, i.e., one or more modules of
computer program instructions encoded on a computer readable medium
for execution by, or to control the operation of, data processing
apparatus. The computer readable medium may be a machine-readable
storage device, a machine-readable storage substrate, a memory
device, a composition of matter affecting a machine-readable
propagated signal, or a combination of one or more of them. The
term "data processing apparatus" encompasses all apparatus,
devices, and machines for processing data, including by way of
example a programmable processor, a computer, or multiple
processors or computers. The apparatus may include, in addition to
hardware, code that creates an execution environment for the
computer program in question, e.g., code that constitutes processor
firmware, a protocol stack, a database management system, an
operating system, or a combination of one or more of them. A
propagated signal is an artificially generated signal, e.g., a
machine-generated electrical, optical, or electromagnetic signal
that is generated to encode information for transmission to
suitable receiver apparatus.
[0056] A computer program (also known as a program, software,
software application, script, or code) may be written in any form
of programming language, including compiled or interpreted
languages, and it may be deployed in any form, including as a
standalone program or as a module, component, subroutine, or other
unit suitable for use in a computing environment. A computer
program does not necessarily correspond to a file in a file system.
A program may be stored in a portion of a file that holds other
programs or data (e.g., one or more scripts stored in a markup
language document), in a single file dedicated to the program in
question, or in multiple coordinated files (e.g., files that store
one or more modules, sub programs, or portions of code). A computer
program may be deployed to be executed on one computer or on
multiple computers that are located at one site or distributed
across multiple sites and interconnected by a communication
network.
[0057] The processes and logic flows described in this
specification may be performed by one or more programmable
processors executing one or more computer programs to perform
functions by operating on input data and generating output. The
processes and logic flows may also be performed by, and apparatus
may also be implemented as, special purpose logic circuitry, e.g.,
an FPGA (field programmable gate array) or an ASIC (application
specific integrated circuit).
[0058] Processors suitable for the execution of a computer program
include, by way of example, both general and special purpose
microprocessors, and any one or more processors of any kind of
digital computer. Generally, a processor will receive instructions
and data from a read only memory or a random access memory or both.
The essential elements of a computer are a processor for performing
instructions and one or more memory devices for storing
instructions and data. Generally, a computer will also include, or
be operatively coupled to receive data from or transfer data to, or
both, one or more mass storage devices for storing data, e.g.,
magnetic, magneto optical disks, or optical disks. However, a
computer need not have such devices. Moreover, a computer may be
embedded in another device, e.g., a tablet computer, a mobile
telephone, a personal digital assistant (PDA), a mobile audio
player, a Global Positioning System (GPS) receiver, to name just a
few. Computer readable media suitable for storing computer program
instructions and data include all forms of non-volatile memory,
media and memory devices, including by way of example semiconductor
memory devices, e.g., EPROM, EEPROM, and flash memory devices;
magnetic disks, e.g., internal hard disks or removable disks;
magneto optical disks; and CD ROM and DVD-ROM disks. The processor
and the memory may be supplemented by, or incorporated in, special
purpose logic circuitry.
[0059] To provide for interaction with a user, embodiments of the
invention may be implemented on a computer having a display device,
e.g., a CRT (cathode ray tube) or LCD (liquid crystal display)
monitor, for displaying information to the user and a keyboard and
a pointing device, e.g., a mouse or a trackball, by which the user
may provide input to the computer. Other kinds of devices may be
used to provide for interaction with a user as well; for example,
feedback provided to the user may be any form of sensory feedback,
e.g., visual feedback, auditory feedback, or tactile feedback; and
input from the user may be received in any form, including
acoustic, speech, or tactile input.
[0060] Embodiments of the invention may be implemented in a
computing system that includes a back end component, e.g., as a
data server, or that includes a middleware component, e.g., an
application server, or that includes a front end component, e.g., a
client computer having a graphical user interface or a Web browser
through which a user may interact with an implementation of the
invention, or any combination of one or more such back end,
middleware, or front end components. The components of the system
may be interconnected by any form or medium of digital data
communication, e.g., a communication network. Examples of
communication networks include a local area network ("LAN") and a
wide area network ("WAN"), e.g., the Internet.
[0061] The computing system may include clients and servers. A
client and server are generally remote from each other and
typically interact through a communication network. The
relationship of client and server arises by virtue of computer
programs running on the respective computers and having a
client-server relationship to each other.
[0062] While this specification contains many specifics, these
should not be construed as limitations on the scope of the
invention or of what may be claimed, but rather as descriptions of
features specific to particular embodiments of the invention.
Certain features that are described in this specification in the
context of separate embodiments may also be implemented in
combination in a single embodiment. Conversely, various features
that are described in the context of a single embodiment may also
be implemented in multiple embodiments separately or in any
suitable subcombination. Moreover, although features may be
described above as acting in certain combinations and even
initially claimed as such, one or more features from a claimed
combination may in some cases be excised from the combination, and
the claimed combination may be directed to a subcombination or
variation of a subcombination.
[0063] Similarly, while operations are depicted in the drawings in
a particular order, this should not be understood as requiring that
such operations be performed in the particular order shown or in
sequential order, or that all illustrated operations be performed,
to achieve desirable results. In certain circumstances,
multitasking and parallel processing may be advantageous. Moreover,
the separation of various system components in the embodiments
described above should not be understood as requiring such
separation in all embodiments, and it should be understood that the
described program components and systems may generally be
integrated together in a single software product or packaged into
multiple software products.
[0064] Thus, particular embodiments of the invention have been
described. Other embodiments are within the scope of the following
claims. For example, the actions recited in the claims may be
performed in a different order and still achieve desirable
results.
* * * * *