U.S. patent application number 15/388147 was filed with the patent office on 2016-12-22 and published on 2017-12-21 for far field automatic speech recognition pre-processing.
The applicant listed for this patent is Adam Kupryjanow, Lukasz Kurylo, Przemyslaw Maziewski. Invention is credited to Adam Kupryjanow, Lukasz Kurylo, Przemyslaw Maziewski.
Application Number | 20170365255 15/388147 |
Document ID | / |
Family ID | 60659998 |
Filed Date | 2016-12-22 |
United States Patent Application | 20170365255 |
Kind Code | A1 |
Kupryjanow; Adam ; et al. | December 21, 2017 |
FAR FIELD AUTOMATIC SPEECH RECOGNITION PRE-PROCESSING
Abstract
System and techniques for automatic speech recognition
pre-processing are described herein. First, a plurality of audio
channels may be obtained. Then, reverberations mat be removed from
the audio channels. The plurality of audio channels may be
partitioned into beams after reverberations are removed. A
partition corresponding to a beam in the beams may be selected
based on a noise level. An audio signal may be filtered from the
selected partition. The filtered audio signal may be provided to an
external entity via an output interface of the pre-processing
pipeline.
Inventors: | Kupryjanow; Adam; (Gdansk, PL) ; Maziewski; Przemyslaw; (Gdansk, PL) ; Kurylo; Lukasz; (Gdansk, PL) |
Applicant: |
Name | City | State | Country | Type |
Kupryjanow; Adam | Gdansk | | PL | |
Maziewski; Przemyslaw | Gdansk | | PL | |
Kurylo; Lukasz | Gdansk | | PL | |
Family ID: | 60659998 |
Appl. No.: | 15/388147 |
Filed: | December 22, 2016 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
62350507 | Jun 15, 2016 | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G10L 15/20 20130101;
G10L 21/0232 20130101; H04R 1/04 20130101; G10L 15/30 20130101;
G10L 15/22 20130101; G10L 21/034 20130101; H04R 2420/07 20130101;
H04R 31/006 20130101; H04R 2201/401 20130101; H04R 2201/403
20130101; H04R 1/406 20130101; H04R 3/005 20130101; H04R 1/2876
20130101; G10L 21/0216 20130101; G10L 2021/02166 20130101; G10L
25/21 20130101; G10L 21/0364 20130101; G10L 2021/02082 20130101;
G10L 25/51 20130101 |
International
Class: |
G10L 15/20 20060101
G10L015/20; G10L 21/0232 20130101 G10L021/0232; G10L 21/034
20130101 G10L021/034; G10L 25/51 20130101 G10L025/51; G10L 15/22
20060101 G10L015/22 |
Claims
1. A system for automatic speech recognition pre-processing, the
system comprising: a sampler to obtain a plurality of audio
channels; a de-reverberator to remove reverberations from the
plurality of audio channels; a beam-former processor to partition
the plurality of audio channels into beams after reverberations are
removed; a stream selector to select a partition corresponding to a
beam in the beams based on a noise level; a filter to reduce a
noise level in a speech signal from the selected partition; and a
controller to provide the audio signal to an external entity via an
output interface of the pre-processing pipeline.
2. The system of claim 1, comprising an echo cancelation block
disposed between the de-reverberator and the beam-former processor
to cancel echoes from the plurality of audio channels after the
reverberations are removed and before the plurality of audio
channels are partitioned into beams.
3. The system of claim 1, wherein, to partition the plurality of
audio channels into beams, the beam-former processor is to: receive
the plurality of audio channels; partition the plurality of audio
channels into partitions of two audio channels based on a
relationship between microphones producing the plurality of audio
channels; and provide each partition to a phase-based
beam-former.
4. The system of claim 1, wherein, to select the partition
corresponding to the beam based on the noise level, the stream
selector is to: compare noise levels between the beams; and select
the beam based on having the lowest noise levels determined from
the comparison.
5. The system of claim 1, wherein, to reduce the noise level in the
speech signal from the selected partition, the filter applies noise
reduction to the audio signal.
6. The system of claim 1, wherein, to reduce the noise level in the
speech signal from the selected partition, the filter applies a
spectral profile matching (SPM) to the audio signal.
7. The system of claim 6, wherein the spectral profile matching is
applied after noise reduction is applied to the audio signal.
8. The system of claim 1, wherein, to reduce the noise level in the
speech signal from the selected partition, the filter applies an
automated gain control to the audio signal.
9. The system of claim 8, wherein the automated gain control is
applied after a spectral profile matching is applied to the audio
signal.
10. At least one machine readable medium including instructions for a
pre-processing pipeline, the instructions, when executed by a
machine, causing the machine to perform operations comprising:
obtaining a plurality of audio channels; removing reverberations
from the audio channels; partitioning the plurality of audio
channels into beams after reverberations are removed; selecting a
partition corresponding to a beam in the beams based on a noise
level; filtering an audio signal from the selected partition; and
providing the filtered audio signal to an external entity via an
output interface of the pre-processing pipeline.
11. The at least one machine readable medium of claim 10, wherein the
operations include canceling echoes from the plurality of audio
channels after the reverberations are removed and before the
plurality of audio channels are partitioned into beams.
12. The at least one machine readable medium of claim 10, wherein the
partitioning the plurality of audio channels into beams includes:
receiving the plurality of audio channels at a beam-former
processor; partitioning the plurality of audio channels into
partitions of two audio channels based on a relationship between
microphones producing the plurality of audio channels; and
providing each partition to a phase-based beam-former.
13. The at least one machine readable medium of claim 10, wherein the
selecting the partition corresponding to the beam based on the
noise level includes comparing noise levels between the beams and
selecting the beam based on having the lowest noise levels
determined from the comparison.
14. The at least one machine readable medium of claim 10, wherein the
filtering includes applying noise reduction to the audio
signal.
15. The at least one machine readable medium of claim 10, wherein the
filtering includes applying a spectral profile matching (SPM) to
the audio signal.
16. The at least one machine readable medium of claim 15, wherein the
spectral profile matching is applied after noise reduction is
applied to the audio signal.
17. The at least one machine readable medium of claim 10, wherein the
filtering includes applying an automated gain control to the audio
signal.
18. The at least one machine readable medium of claim 17, wherein the
automated gain control is applied after a spectral profile matching
is applied to the audio signal.
19. A method for automatic speech recognition pre-processing, the
method comprising: obtaining a plurality of audio channels;
removing reverberations from the audio channels; partitioning the
plurality of audio channels into beams after the reverberations are
removed; selecting a partition corresponding to a beam in the beams
based on a noise level; filtering an audio signal from the selected
partition; and providing the filtered audio signal to an external
entity via an output interface of the pre-processing pipeline.
20. The method of claim 19, comprising canceling echoes from the
plurality of audio channels after the reverberations are removed
and before the plurality of audio channels are partitioned into
beams.
21. The method of claim 19, wherein partitioning the plurality of
audio channels into beams includes: receiving the plurality of
audio channels at a beam-former processor; partitioning the
plurality of audio channels into partitions of two audio channels
based on a relationship between microphones producing the plurality
of audio channels; and providing each partition to a phase-based
beam-former.
22. The method of claim 19, wherein the filtering includes applying
a spectral profile matching (SPM) to the audio signal.
23. The method of claim 22, wherein the spectral profile matching
is applied after noise reduction is applied to the audio
signal.
24. The method of claim 19, wherein the filtering includes applying
an automated gain control to the audio signal.
25. The method of claim 24, wherein the automated gain control is
applied after a spectral profile matching is applied to the audio
signal.
Description
CLAIM OF PRIORITY
[0001] This patent application claims the benefit of priority,
under 35 U.S.C. § 119, to U.S. Provisional Application Ser. No.
62/350,507, titled "FAR FIELD AUTOMATIC SPEECH RECOGNITION" and
filed on Jun. 15, 2016, the entirety of which is hereby
incorporated by reference herein.
TECHNICAL FIELD
[0002] Embodiments described herein generally relate to automatic
speech recognition (ASR) and more specifically to improving ASR
pre-processing.
BACKGROUND
[0003] ASR involves a machine-based collection of techniques to
understand human languages. ASR is interdisciplinary, often
involving microphone, analog to digital conversion, frequency
processing, database, and artificial intelligence technologies to
convert the spoken word into textual or machine readable
representations of not only what was said (e.g., a transcript) but
also what was meant (e.g., semantic understanding) by a human
speaker. Far field ASR involves techniques to decrease a word error
rate (WER) in utterances made at a greater distance from a
microphone, or microphone array, than traditionally accounted for
in ASR processing pipelines. Such distance often decreases the
signal-to-noise ratio (SNR) and thus increases WER in traditional
ASR systems. As used herein, far field ASR involves distances of
more than half a meter from the microphone.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] In the drawings, which are not necessarily drawn to scale,
like numerals may describe similar components in different views.
Like numerals having different letter suffixes may represent
different instances of similar components. The drawings illustrate
generally, by way of example, but not by way of limitation, various
embodiments discussed in the present document.
[0005] FIG. 1 is an example of a smart home gateway housing,
according to an embodiment.
[0006] FIG. 2 is a block diagram of an example of a system for far
field automatic speech recognition pre-processing, according to an
embodiment.
[0007] FIG. 3 illustrates phase-based beam forming (PBF)
directivity patterns, according to an embodiment.
[0008] FIG. 4 is a plot of far field ASR WER improvements for
different types of noise, according to an embodiment.
[0009] FIG. 5 illustrates an example of a method for automatic
speech recognition pre-processing, according to an embodiment.
[0010] FIG. 6 is a block diagram illustrating an example of a
machine upon which one or more embodiments may be implemented.
DETAILED DESCRIPTION
[0011] Embodiments and examples herein generally describe a number
of systems, devices, and techniques for automatic speech
recognition pre-processing. It is understood, however, that the
systems, devices, and techniques are examples illustrating the
underlying concepts.
[0012] FIG. 1 is an example of a smart home gateway 105, according
to an embodiment. As illustrated, the circles atop the housing are
lumens 110 behind which are housed microphones (as illustrated
there are eight microphones). The dashed lines illustrate
microphones in a linear arrangement 115 as well as in a circular
arrangement 120. Many of the examples described herein operate with
these dual arrangements (e.g., linear 115 and circular 120) with
respect to a device 105. Although the device 105 here takes the
form of the smart home gateway, other configurations are
contemplated, such as in a desktop or laptop computer
configuration, a refrigerator or other appliance, etc.
[0013] A factor contributing to the far field performance drop for
ASR may include speech signal quality degradation due to some or
all of reverberations, echo, noise, or amplitude loss. For example,
from several experiments, four issues related to far field ASR
were found: reverberation; echo; noise; and amplitude losses. The
influence of one or all of these factors may be mitigated by
intelligently ordering a variety of processing techniques. For
example, reverberation (e.g., reverb) reduction enables use of
beam-formers and noise reduction (NR) techniques that were not
designed to work in reverberant conditions. In another example,
acoustic echo cancelation (AEC) reduces echo generated by internal
loudspeakers. Also, for example, beam-formers and additional
post-filtering modules reduce noise level. An automatic gain
control (AGC) device counteracts amplitude losses. Overall the
unique combination and order of the processing used in the
described far field pre-processing pipeline enables accurate far
field ASR.
[0014] An example of just such a pipeline in the device 105 may
include a sampler 125, a de-reverberator 127, a beam-former
processor 130, a stream selector 135, a filter 140, and a
controller 145. Each of these components is implemented in
electronic hardware, such as that described below (e.g.,
circuits).
[0015] The sampler 125 is arranged to obtain a plurality of audio
channels. Thus, the sampler 125 may be a part of a microphone
array, have a tap on microphone output, or have the plurality of
audio channels delivered via another component of the device 105.
In an example, an audio channel is audio from a single microphone.
In an example, an audio channel is audio from a plurality of
microphones wherein the signal from these microphones is correlated
based on a physical arrangement of the microphones, such as a
spacing, linear or circular relationship, etc. In an example, after
obtaining the plurality of audio channels by the sampler 125, the
de-reverberator 127 removes reverberation prior to the beam-former
processor partitioning the plurality of audio channels into beams.
Removing the reverberation may be accomplished using a variety of
techniques, such as short-time Fourier transform (STFT) domain
inverse filtering methods, non-negative room impulse response (RIR)
modeling, statistical RIR modeling, or nonlinear mapping (e.g.,
denoising auto-encoder using a deep neural network or bidirectional
long short-term memory (BLSTM) recurrent neural network). After
obtaining the plurality of audio channels by the sampler 125, and
after applying de-reverberation to the audio channels by the
de-reverberator 127, an output may be directed to, or retrieved by,
the beam-former processor 130.
[0016] The beam-former processor 130 is arranged to partition the
plurality of audio channels into beams. Here, beams refer to energy
received from a specific direction. Generally, given a single
stationary microphone, the frequency and amplitude of sound energy
may be determined but there is not enough information to also
determine a direction. The addition of a second microphone (e.g.,
analogous to two human ears) provides two signals that may be
correlated in frequency and amplitude but which may vary in time.
With a known and fixed relationship between these microphones, the
variations of the audio signal in time may provide a relative
direction of the energy. This may then be considered the beam.
Thus, in an example, to partition the plurality of audio channels
into beams, the beam-former processor 130 is arranged to obtain
(e.g., receive or retrieve) the plurality of audio channels,
partition the plurality of audio channels into partitions of two
audio channels based on a relationship between microphones
producing the plurality of audio channels, and provide each
partition to a phase-based beam-former. In this example, the
audio channel partitioning allows the beam-former processor 130 or
the phase-based beam-former to ascertain the time variance (e.g.,
a measure of how in-phase the signals are) with a known physical
arrangement of microphones. As explained earlier, this provides the
information to ascertain the direction from which the energy (e.g.,
sound) came. Beamforming provides another level of control in
finding a clean signal from which to process ASR.
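As a rough illustration of how the time variation between the two channels of a partition may indicate a direction, the following sketch estimates the inter-channel lag by brute-force cross-correlation and converts it to an arrival angle. The function names, the 0.1 m microphone spacing, the 16 kHz sample rate, and the 343 m/s speed of sound are assumptions made for illustration only. This is a simple time-delay estimate, not the phase-based beam-former of the embodiments.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def estimate_lag(ch_a, ch_b, max_lag):
    """Return the lag (in samples) of ch_b relative to ch_a that maximizes
    the cross-correlation. A positive lag means ch_b hears the sound later."""
    best_lag, best_score = 0, float("-inf")
    n = min(len(ch_a), len(ch_b))
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ch_a[i] * ch_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(lag_samples, sample_rate_hz, mic_spacing_m):
    """Convert a lag to an angle measured from broadside, i.e., from the
    direction perpendicular to the line joining the two microphones."""
    delay_s = lag_samples / sample_rate_hz
    ratio = delay_s * SPEED_OF_SOUND_M_S / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))


# Hypothetical usage: the same click arrives three samples later on ch_b.
ch_a = [0.0] * 32
ch_b = [0.0] * 32
ch_a[10] = 1.0
ch_b[13] = 1.0
lag = estimate_lag(ch_a, ch_b, max_lag=5)
angle = arrival_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
```

A lag of zero corresponds to sound arriving from broadside, which is consistent with the front and back symmetry of a two-microphone pair.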
[0017] The stream selector 135 is arranged to select a partition
corresponding to a beam in the beams based on a noise level. In an
example, to select the partition corresponding to the beam based on
the noise level, the stream selector 135 is arranged to compare
noise levels between the beams and select the beam based on having
the lowest noise levels determined from the comparison. In an
example, the stream selector 135 uses a phrase quality scorer of
the stream selector to compare the noise levels across the beams.
In an example, an SNR meter of the stream selector provides a noise
level for each beam. The stream selector 135 thus discriminates
amongst a variety of possible input sources to provide (e.g., to
send or make available) a better signal to downstream
processors.
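The comparison of noise levels between beams can be sketched as follows. The noise estimate used here (the mean energy of the quietest frames of each beam) is an assumed heuristic chosen for brevity, and the function names, frame length, and quiet fraction are hypothetical; the embodiments do not specify how the phrase quality scorer or SNR meter computes its value.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def noise_level(samples, frame_len=4, quiet_fraction=0.25):
    """Assumed heuristic: average the energy of the quietest frames,
    on the premise that they contain background noise rather than speech."""
    energies = sorted(frame_energies(samples, frame_len))
    k = max(1, int(len(energies) * quiet_fraction))
    return sum(energies[:k]) / k


def select_quietest_beam(beams, frame_len=4):
    """beams maps a beam identifier to its samples; return the identifier
    of the beam with the lowest estimated noise level."""
    return min(beams, key=lambda name: noise_level(beams[name], frame_len))


# Hypothetical usage: beam1 has a quiet background and a loud speech burst.
beams = {
    "beam0": [0.5, -0.5] * 8,
    "beam1": [0.01, -0.01] * 4 + [0.9, -0.9] * 4,
}
chosen = select_quietest_beam(beams)
```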
[0018] The filter 140 is arranged to reduce the level of noise in
an audio signal from the selected partition. In an example, to
reduce the level of noise in the audio signal from the selected
partition, the filter 140 applies noise reduction to the audio
signal. In an example, to enhance the speech signal from the
selected partition, the filter applies a spectral profile matching
(SPM) to the audio signal. In an example, the spectral profile
matching is applied after noise reduction is applied to the audio
signal.
[0019] In an example, to boost the speech signal in the selected
partition, the filter 140 applies an automated gain control to the
audio signal. In an example, the automated gain control is applied
after a spectral profile matching is applied to the audio
signal.
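A minimal sketch of an automated gain control that counteracts amplitude loss is shown below. It scales each block toward a target RMS level with a capped, smoothed gain. The target level, block length, gain cap, and smoothing factor are illustrative assumptions, and the embodiments do not prescribe this particular scheme.

```python
import math


def rms(block):
    """Root mean square amplitude of a block of samples."""
    return math.sqrt(sum(x * x for x in block) / len(block)) if block else 0.0


def apply_agc(samples, target_rms=0.1, block_len=160, max_gain=20.0, smoothing=0.5):
    """Block-wise gain control: move the gain toward target_rms / block level,
    capped at max_gain and smoothed to avoid abrupt jumps between blocks."""
    out = []
    gain = 1.0
    for start in range(0, len(samples), block_len):
        block = samples[start:start + block_len]
        level = rms(block)
        # Hold the previous gain on silent blocks instead of dividing by zero.
        desired = min(max_gain, target_rms / level) if level > 0 else gain
        gain = smoothing * gain + (1.0 - smoothing) * desired
        out.extend(x * gain for x in block)
    return out


# Hypothetical usage: a far field signal that arrives ten times too quiet.
quiet = [0.01 if i % 2 == 0 else -0.01 for i in range(1600)]
boosted = apply_agc(quiet)
```

Smoothing the gain trades a slower reaction for fewer audible level jumps; a deployed gain control would typically tune this per use case.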
[0020] In an example, the pipeline may optionally include a second
filter (not illustrated) to perform acoustic echo cancellation on
the plurality of audio channels. In an example, the acoustic echo
cancellation is performed prior to partitioning the plurality of
audio channels into beams. In an example, the second filter is part
of the de-reverberator 127.
[0021] The controller 145 is arranged to provide the audio signal
to an external entity via an output interface of the pre-processing
pipeline. Thus, the controller 145 interfaces with downstream
components to further process the semantic content in an ASR
system.
[0022] FIG. 2 is a block diagram of an example of a system 200 for
far field automatic speech recognition pre-processing, according to
an embodiment. The system 200 includes additional examples of the
components discussed above. The components of the system 200 are
implemented in electronic hardware, such as that described above or
below (e.g., circuits).
[0023] The system 200 includes a pipeline 205 for real-time far
field ASR. By ordering the components of the system 200 as
illustrated, ASR techniques that previously have been discarded in
far field ASR due to reverberations may be reintroduced, such as:
[0024] the phase-based beam-former (PBF); and
[0025] the Spectral Profile Matching (SPM).
[0026] The far field pre-processing pipeline 205 may be composed of
six processing blocks: a de-reverberator 210; an optional AEC 215;
a beam-former 220; a stream selector 230; a post-filtering block
245; and a content analysis block 265. In an example, the order of
the far field pre-processing blocks is important (i.e., they must
be in the order present in FIG. 2). The far field pre-processing
pipeline 205 may operate on a multichannel input. The multichannel
input may be obtained from a microphone array containing at least
two microphones. In an example, there is no upper limit for the
number of microphones that may be used. In an example, there are no
limitations for the microphone array geometry (e.g., linear,
circular, etc.). In an example, the number of microphones is an
even number (e.g., the modulus of the number of microphones and two
is zero).
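The even-microphone example fits naturally with partitions of two audio channels. The sketch below checks the channel count and forms pairs. Pairing channel i with channel i + n/2 is a purely hypothetical rule standing in for the "relationship between microphones"; the actual pairing depends on the array geometry and is not specified here.

```python
def partition_into_pairs(channels):
    """Split an even number of channels into partitions of two.
    Assumed pairing rule: channel i is paired with channel i + n/2."""
    n = len(channels)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of channels, at least two")
    half = n // 2
    return [(channels[i], channels[i + half]) for i in range(half)]


# Hypothetical usage with eight microphone channels labeled a through h.
pairs = partition_into_pairs(list("abcdefgh"))
```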
[0027] In the de-reverb block 210, reverberations are removed from
the multichannel input. Parameters of the de-reverberation block
210 may be adjusted to balance computational complexity and
performance. Techniques to remove reverberation may include
pre-configured room impulse models, or others, as described
above.
[0028] In an example, the far field pre-processing pipeline 205 may
be used with the device containing internal loudspeakers. In this
example, acoustical leakage from the loudspeakers to the
microphones may be reduced by the optional multichannel AEC block
215. In an example, the AEC block 215 includes one or more of the
following properties:
[0029] it is located after the de-reverb block 210, thus the AEC
block 215 analyses signals that are not affected by the room reverb;
[0030] it creates a cancelling filter using the multichannel
reference signal, which improves AEC performance due to additional
information that can be extracted from the different channels; or
[0031] it is positioned before the beam-former block 220, not after
the beam-former block 220.
[0032] After the AEC block 215, the multichannel stream has had the
room reverb and loudspeaker echo removed (to the extent practical).
Thus the beam-former block 220 may use phase-based beam formers
(PBFs) 225, or other beam forming techniques such as the Minimum
Variance Distortionless Response beam formers, to process the
multichannel stream. Generally, for far field ASR, PBFs 225 cannot
be used without removing the echo and reverb because the PBF 225
generally requires direct sound in the microphone signals. In
reverberant conditions this requirement is not met because
reflections (e.g., non-direct signals) would also be captured.
Consequently, the precise detection of user position--an important
feature in PBF 225 processing--will not be possible. This issue
worsens for distances between the user and the device greater than
two meters. However, in the illustrated arrangement, nearly all
reflections (e.g., most of their energy) are removed before the PBF
225 stage. Thus, it is possible to use PBFs 225 effectively.
[0033] The PBFs 225 use two signals coming from a microphone pair.
Therefore, for microphone arrays with more than two microphones,
multiple instances of PBFs 225 may be used (e.g., one PBF 225 for
each exclusive pair). Each PBF 225 instance may be steered toward
different directions (e.g., relative to the device). FIG. 3
illustrates directivity patterns of four PBF 225 instances when
used together with the microphone board described herein. In FIG. 3
signals from eight microphones, two blank, two diagonally striped,
two diagonally cross-hatched, and two vertically cross-hatched,
(grouped pairwise in the center with the center most microphones in
a group) are grouped in four steering pairs of covered area [i.e.,
the groups of 1) dashed with two dots, 2) dashed with one dot, 3)
dashed, and 4) dotted]. As illustrated, sounds from each area pair
are fed into the separate PBF 225 instances. As a result, the
PBF-processed signals point towards four different directions with
a 45-degree beam width each. Since the PBF 225 processing is
bi-directional--e.g., the same beam pattern for front and back
facing directions relative to a microphone pair, these directions
being perpendicular to a line drawn between the two
microphones--the combined solution provides 360 degrees coverage
(e.g., the circular long and short dashed lines in FIG. 3).
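The coverage arithmetic above (four pairs, two facing directions each, 45 degrees per beam, so 4 x 2 x 45 = 360 degrees) can be checked with a short sketch. It assumes, for illustration only, that the pairs are steered at evenly spaced angles of 0, 45, 90, and 135 degrees; the function names are hypothetical.

```python
def coverage_sectors(num_pairs=4, beam_width_deg=45.0):
    """Return (pair index, start angle, end angle) for every beam lobe.
    Assumes pair k is steered at k * beam_width_deg and that each pair
    has a mirrored lobe 180 degrees away (front and back)."""
    sectors = []
    for k in range(num_pairs):
        center = k * beam_width_deg
        for facing in (0.0, 180.0):
            start = (center + facing - beam_width_deg / 2) % 360.0
            end = (start + beam_width_deg) % 360.0
            sectors.append((k, start, end))
    return sectors


def total_coverage_deg(num_pairs=4, beam_width_deg=45.0):
    """Total angle covered by non-overlapping lobes, capped at a full circle."""
    return min(360.0, num_pairs * 2 * beam_width_deg)


sectors = coverage_sectors()
```

Under these assumptions the eight lobes tile the circle without gaps: each lobe ends where the next one starts.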
[0034] In an example, owing to four directional streams, user
localization is possible. Thus, the stream selector 230 may assess
each directional stream against selected localization criteria,
such as highest Signal-to-Noise Ratio (SNR)--e.g., calculated using
the Signal Level Measurement (SLM) 270 or highest score of the
Voice Activity Detector (VAD) 275 in the content analysis block
265--and select a stream more conducive to ASR. The stream selector
230 may include one or more of a phrase quality scorer 235 or SNR
meter 240 to provide localization criteria scores on the streams.
Based on the localization criteria, only one of the PBF-processed
streams may be selected for further processing (e.g., the stream
with the highest SNR), by the stream selector 230. Because the
selected stream (e.g., for further processing) is beam-formed, the
influence of noise coming from all directions (e.g., areas not
covered by the formed beam) is reduced and the user's speech is
better exposed (e.g., more clear or less obstructed by that noise).
This improves SNR leading to better far field ASR performance.
[0035] In an example, one or more post-filtering operations may be
applied to the streams by the post filtering block 245. Example
post-filtering operations may include:
[0036] NR 250--used to reduce remaining noise;
[0037] Spectral Profile Matching (SPM) 255--used to equalize the
speech signal to match the frequency response of the ASR's training
corpora; or
[0038] AGC 260--used to normalize signal level.
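One way to picture spectral profile matching is as a per-band equalizer: compare the long-term band power of the incoming speech with a target profile and derive a gain for each band. The sketch below is an assumption-laden illustration; the band layout, the 12 dB gain limit, and the function names are hypothetical, and the target profile would in practice come from the ASR training data.

```python
import math


def spm_gains(observed_band_power, target_band_power, max_gain_db=12.0):
    """Amplitude gain per band that moves the observed long-term power
    toward the target profile, limited to +/- max_gain_db."""
    limit = 10.0 ** (max_gain_db / 20.0)
    gains = []
    for obs, tgt in zip(observed_band_power, target_band_power):
        # Power ratio -> amplitude gain; leave empty bands untouched.
        gain = math.sqrt(tgt / obs) if obs > 0 else 1.0
        gains.append(max(1.0 / limit, min(limit, gain)))
    return gains


def apply_band_gains(band_amplitudes, gains):
    """Scale each band amplitude by its equalization gain."""
    return [a * g for a, g in zip(band_amplitudes, gains)]


# Hypothetical usage with three bands.
gains = spm_gains([1.0, 4.0, 0.0], [4.0, 1.0, 1.0])
equalized = apply_band_gains([0.2, 0.2, 0.2], gains)
```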
[0039] In an example, the NR 250 may accept a reference stream
containing PBF-processed signals that were classified by the stream
selector block 230 as noisy, at least compared to the other
available streams (e.g., beams pointing in a direction that is
different than that of the user). In an example, noisy streams may
be used to calculate a robust estimation of the noise floor that
the NR 250 will remove.
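The idea of estimating a noise floor from the streams classified as noisy can be sketched with power spectral subtraction. Spectral subtraction is a swapped-in, textbook technique used here only for illustration, since the noise reduction algorithm is not named. The input format (lists of per-band power values per frame), the over-subtraction factor, and the spectral floor are assumptions.

```python
def estimate_noise_floor(reference_frames):
    """Average the per-band power over all frames taken from the
    reference (noisy) streams."""
    num_bands = len(reference_frames[0])
    totals = [0.0] * num_bands
    for frame in reference_frames:
        for band in range(num_bands):
            totals[band] += frame[band]
    return [total / len(reference_frames) for total in totals]


def subtract_noise(frame_power, noise_floor, over_subtraction=1.0, spectral_floor=0.01):
    """Subtract the noise floor from each band of the selected stream,
    keeping a small fraction of the original power so no band goes negative."""
    return [
        max(power - over_subtraction * noise, spectral_floor * power)
        for power, noise in zip(frame_power, noise_floor)
    ]


# Hypothetical usage: two reference frames with two bands each.
floor = estimate_noise_floor([[1.0, 2.0], [3.0, 2.0]])
cleaned = subtract_noise([10.0, 1.0], floor)
```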
[0040] In an example, the AGC block 260 uses a reference signal. In
an example, the reference signal may be a typical loopback signal
from the playback path.
[0041] Some experiments have shown that the SPM block 255 helps
some ASR engines and the NR 250 helps some other (e.g.,
different) ASR engines. Thus, in an example, the inclusion of one
or more of these components is optional, providing further
customization for performance, effectiveness, power use, design
complexity, etc.
[0042] Output of the far field pre-processing pipeline may be
provided to a client 280 that may implement an ASR engine 285. In
an example, however, the client 280 may implement a wake on voice
(WoV) engine 290 or may be used in a VoIP communication channel
295. FIG. 4 illustrates far field ASR WER improvements obtained
using the far field pre-processing pipeline 205 for different noise
types--LiRo: living room; SiSp: side speaker; Public: public place;
and Work: work place; unprocessed signals are the dashed line (on
top) and processed signals are the short dash-double dotted line
(on bottom).
[0043] All of the blocks illustrated in FIG. 2 were implemented and
evaluated to find their influence on far field ASR performance. It
was shown that every element of the pipeline introduces
improvement. The improvement was illustrated by the lower WERs
obtained from multiple ASR engines in far field scenarios. Further,
blocks were combined offline to simulate the far field
pre-processing pipeline. The simulation demonstrated better ASR
performance compared to using the blocks individually. The far
field pre-processing pipeline 205 was then ported to a real-time
audio stack and used in the mock-up of a smart home gateway device
(e.g., intelligent loudspeaker) illustrated in FIG. 1. Real-time
demonstrations with the mock-up exhibited the simulated far field
ASR improvements. Although the techniques discussed above are
useful in far field applications, they may be applied in near field
ASR, or other ASR applications (e.g., distances) as well.
[0044] FIG. 5 illustrates an example of a method 500 for automatic
speech recognition pre-processing, according to an embodiment. The
operations of the method 500 are implemented in electronic
hardware, such as that described above or below (e.g.,
circuits).
[0045] At operation 505, a plurality of audio channels is obtained.
In an example, obtaining the plurality of audio channels includes
removing reverberation prior to a beam-former processor
partitioning the plurality of audio channels into beams.
[0046] At operation 510, the plurality of audio channels are
partitioned into beams. In an example, partitioning the plurality
of audio channels into beams includes receiving the plurality of
audio channels at a beam-former processor, partitioning the
plurality of audio channels into partitions of two audio channels
based on a relationship between microphones producing the plurality
of audio channels, and providing each partition to a phase-based
beam-former.
[0047] At operation 515, a partition corresponding to a beam in the
beams is selected based on a noise level. In an example, selecting
the partition corresponding to the beam based on the noise level
includes comparing noise levels between the beams and selecting the
beam based on having the lowest noise levels determined from the
comparison. In an example, a phrase quality scorer of a stream
selector performing the partition selection compares the noise
levels between the beams. In an example, a signal-to-noise (SNR)
meter of the stream selector provides a noise level for each
beam.
[0048] At operation 520, a speech signal is filtered from the
selected partition. In an example, the filtering includes applying
noise reduction to the audio signal. In an example, the filtering
includes applying a spectral profile matching (SPM) to the audio
signal. In an example, the SPM is applied after noise reduction is
applied to the audio signal.
[0049] In an example, the filtering includes applying an automated
gain control to the audio signal. In an example, the automated gain
control is applied after a spectral profile matching is applied to
the audio signal.
[0050] In an example, the method 500 may be extended by optionally
performing acoustic echo cancellation on the plurality of audio
channels. In an example, the acoustic echo cancellation is
performed prior to partitioning the plurality of audio channels
into beams.
[0051] At operation 525, the filtered audio signal is provided to
an external entity via an output interface of the pre-processing
pipeline.
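The ordering of the operations above can be sketched as a small driver that runs whichever stages are supplied in a fixed sequence: de-reverberation, optional echo cancellation, beam forming, stream selection, then noise reduction, spectral profile matching, and gain control. The stage names, the dictionary-of-callables interface, and the treatment of gain control as optional are assumptions for illustration; the stage bodies are left to the caller.

```python
STAGE_ORDER = ["dereverb", "aec", "beamform", "select", "nr", "spm", "agc"]
OPTIONAL_STAGES = {"aec", "nr", "spm", "agc"}  # assumed optional for this sketch


def run_pipeline(channels, stages):
    """Run the supplied stage callables in STAGE_ORDER.
    stages maps a stage name to a function taking and returning audio data.
    Returns the final data and the list of stage names that were run."""
    missing = [
        name for name in STAGE_ORDER
        if name not in stages and name not in OPTIONAL_STAGES
    ]
    if missing:
        raise ValueError("missing required stages: " + ", ".join(missing))
    data = channels
    trace = []
    for name in STAGE_ORDER:
        if name in stages:
            data = stages[name](data)
            trace.append(name)
    return data, trace


# Hypothetical usage with pass-through stages standing in for real processing.
passthrough = {name: (lambda data: data) for name in ["dereverb", "beamform", "select", "agc"]}
output, trace = run_pipeline([[0.0, 0.1], [0.0, 0.2]], passthrough)
```

Keeping the order in one list makes it harder to accidentally place echo cancellation after beam forming, or gain control before spectral profile matching.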
[0052] FIG. 6 illustrates a block diagram of an example machine 600
upon which any one or more of the techniques (e.g., methodologies)
discussed herein may be performed. In alternative embodiments, the
machine 600 may operate as a standalone device or may be connected
(e.g., networked) to other machines. In a networked deployment, the
machine 600 may operate in the capacity of a server machine, a
client machine, or both in server-client network environments. In
an example, the machine 600 may act as a peer machine in
peer-to-peer (P2P) (or other distributed) network environment. The
machine 600 may be a personal computer (PC), a tablet PC, a set-top
box (STB), a personal digital assistant (PDA), a mobile telephone,
a web appliance, a network router, switch or bridge, or any machine
capable of executing instructions (sequential or otherwise) that
specify actions to be taken by that machine. Further, while only a
single machine is illustrated, the term "machine" shall also be
taken to include any collection of machines that individually or
jointly execute a set (or multiple sets) of instructions to perform
any one or more of the methodologies discussed herein, such as
cloud computing, software as a service (SaaS), or other computer
cluster configurations.
[0053] Examples, as described herein, may include, or may operate
by, logic or a number of components, or mechanisms. Circuitry is a
collection of circuits implemented in tangible entities that
include hardware (e.g., simple circuits, gates, logic, etc.).
Circuitry membership may be flexible over time and underlying
hardware variability. Circuitries include members that may, alone
or in combination, perform specified operations when operating. In
an example, hardware of the circuitry may be immutably designed to
carry out a specific operation (e.g., hardwired). In an example,
the hardware of the circuitry may include variably connected
physical components (e.g., execution units, transistors, simple
circuits, etc.) including a computer readable medium physically
modified (e.g., magnetically, electrically, moveable placement of
invariant massed particles, etc.) to encode instructions of the
specific operation. In connecting the physical components, the
underlying electrical properties of a hardware constituent are
changed, for example, from an insulator to a conductor or vice
versa. The instructions enable embedded hardware (e.g., the
execution units or a loading mechanism) to create members of the
circuitry in hardware via the variable connections to carry out
portions of the specific operation when in operation. Accordingly,
the computer readable medium is communicatively coupled to the
other components of the circuitry when the device is operating. In
an example, any of the physical components may be used in more than
one member of more than one circuitry. For example, under
operation, execution units may be used in a first circuit of a
first circuitry at one point in time and reused by a second circuit
in the first circuitry, or by a third circuit in a second circuitry
at a different time.
[0054] Machine (e.g., computer system) 600 may include a hardware
processor 602 (e.g., a central processing unit (CPU), a graphics
processing unit (GPU), a hardware processor core, or any
combination thereof), a main memory 604 and a static memory 606,
some or all of which may communicate with each other via an
interlink (e.g., bus) 608. The machine 600 may further include a
display unit 610, an alphanumeric input device 612 (e.g., a
keyboard), and a user interface (UI) navigation device 614 (e.g., a
mouse). In an example, the display unit 610, input device 612 and
UI navigation device 614 may be a touch screen display. The machine
600 may additionally include a storage device (e.g., drive unit)
616, a signal generation device 618 (e.g., a speaker), a network
interface device 620, and one or more sensors 621, such as a global
positioning system (GPS) sensor, compass, accelerometer, or other
sensor. The machine 600 may include an output controller 628, such
as a serial (e.g., universal serial bus (USB)), parallel, or other
wired or wireless (e.g., infrared (IR), near field communication
(NFC), etc.) connection to communicate or control one or more
peripheral devices (e.g., a printer, card reader, etc.).
[0055] The storage device 616 may include a machine readable medium
622 on which is stored one or more sets of data structures or
instructions 624 (e.g., software) embodying or utilized by any one
or more of the techniques or functions described herein. The
instructions 624 may also reside, completely or at least partially,
within the main memory 604, within static memory 606, or within the
hardware processor 602 during execution thereof by the machine 600.
In an example, one or any combination of the hardware processor
602, the main memory 604, the static memory 606, or the storage
device 616 may constitute machine readable media.
[0056] While the machine readable medium 622 is illustrated as a
single medium, the term "machine readable medium" may include a
single medium or multiple media (e.g., a centralized or distributed
database, and/or associated caches and servers) configured to store
the one or more instructions 624.
[0057] The term "machine readable medium" may include any medium
that is capable of storing, encoding, or carrying instructions for
execution by the machine 600 and that cause the machine 600 to
perform any one or more of the techniques of the present
disclosure, or that is capable of storing, encoding or carrying
data structures used by or associated with such instructions.
Non-limiting machine readable medium examples may include
solid-state memories, and optical and magnetic media. In an
example, a massed machine readable medium comprises a machine
readable medium with a plurality of particles having invariant
(e.g., rest) mass. Accordingly, massed machine-readable media are
not transitory propagating signals. Specific examples of massed
machine readable media may include: non-volatile memory, such as
semiconductor memory devices (e.g., Electrically Programmable
Read-Only Memory (EPROM), Electrically Erasable Programmable
Read-Only Memory (EEPROM)) and flash memory devices; magnetic
disks, such as internal hard disks and removable disks;
magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0058] The instructions 624 may further be transmitted or received
over a communications network 626 using a transmission medium via
the network interface device 620 utilizing any one of a number of
transfer protocols (e.g., frame relay, internet protocol (IP),
transmission control protocol (TCP), user datagram protocol (UDP),
hypertext transfer protocol (HTTP), etc.). Example communication
networks may include a local area network (LAN), a wide area
network (WAN), a packet data network (e.g., the Internet), mobile
telephone networks (e.g., cellular networks), Plain Old Telephone
Service (POTS) networks, and wireless data networks (e.g., Institute of
Electrical and Electronics Engineers (IEEE) 802.11 family of
standards known as Wi-Fi®, IEEE 802.16 family of standards
known as WiMax®), IEEE 802.15.4 family of standards,
peer-to-peer (P2P) networks, among others. In an example, the
network interface device 620 may include one or more physical jacks
(e.g., Ethernet, coaxial, or phone jacks) or one or more antennas
to connect to the communications network 626. In an example, the
network interface device 620 may include a plurality of antennas to
wirelessly communicate using at least one of single-input
multiple-output (SIMO), multiple-input multiple-output (MIMO), or
multiple-input single-output (MISO) techniques. The term
"transmission medium" shall be taken to include any intangible
medium that is capable of storing, encoding or carrying
instructions for execution by the machine 600, and includes digital
or analog communications signals or other intangible medium to
facilitate communication of such software.
ADDITIONAL NOTES & EXAMPLES
[0059] Example 1 is a system for automatic speech recognition
pre-processing, the system comprising: a sampler to obtain a
plurality of audio channels; a de-reverberator to remove
reverberations from the plurality of audio channels; a beam-former
processor to partition the plurality of audio channels into beams
after reverberations are removed; a stream selector to select a
partition corresponding to a beam in the beams based on a noise
level; a filter to reduce a noise level in a speech signal from the
selected partition; and a controller to provide the audio signal to
an external entity via an output interface of the pre-processing
pipeline.
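As an illustration only, the ordering of the components recited in Example 1 may be sketched in Python as follows. The function name run_pipeline, the Channels alias, and the callable parameters are hypothetical and are not part of the disclosure; each stage is supplied by the caller as a placeholder so that only the order of operations is shown.

```python
from typing import Callable, List

Channels = List[List[float]]  # one list of samples per channel or beam (assumed layout)
Signal = List[float]


def run_pipeline(channels: Channels,
                 dereverb: Callable[[Channels], Channels],
                 beamform: Callable[[Channels], Channels],
                 select: Callable[[Channels], Signal],
                 noise_filter: Callable[[Signal], Signal],
                 output: Callable[[Signal], None]) -> None:
    """Apply the stages in the order recited in Example 1."""
    clean = dereverb(channels)       # de-reverberator: remove reverberations
    beams = beamform(clean)          # beam-former processor: partition into beams
    chosen = select(beams)           # stream selector: pick one beam's signal
    filtered = noise_filter(chosen)  # filter: reduce noise in the selected signal
    output(filtered)                 # controller: hand off via the output interface
```

In such a sketch, the optional echo cancelation block of Example 2 could be composed into the beamform callable ahead of the beam-former itself, so that it runs after de-reverberation and before partitioning.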
[0060] In Example 2, the subject matter of Example 1 optionally
includes an echo cancelation block disposed between the
de-reverberator and the beam-former processor to cancel echoes from
the plurality of audio channels after the reverberations are
removed and before the plurality of audio channels are partitioned
into beams.
[0061] In Example 3, the subject matter of any one or more of
Examples 1-2 optionally include wherein, to partition the plurality
of audio channels into beams, the beam-former processor is to:
receive the plurality of audio channels; partition the plurality of
audio channels into partitions of two audio channels based on a
relationship between microphones producing the plurality of audio
channels; and provide each partition to a phase-based
beam-former.
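The pairing step of Example 3 may be sketched as follows. The particular relationship used here, pairing each microphone with the one diametrically opposite it on an evenly spaced circular array, is an assumption made for illustration; this passage does not specify which relationship is used. The function name pair_opposite_microphones is hypothetical.

```python
from typing import List, Tuple

Signal = List[float]


def pair_opposite_microphones(channels: List[Signal]) -> List[Tuple[Signal, Signal]]:
    """Partition channels into two-channel partitions.

    Assumes (hypothetically) that channels are ordered around a circular
    array with an even number of microphones, so channel i and channel
    i + n/2 come from opposite microphones.
    """
    n = len(channels)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of channels (at least 2)")
    half = n // 2
    return [(channels[i], channels[i + half]) for i in range(half)]
```

Each returned pair would then be provided to its own phase-based beam-former, yielding one beam per pair.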
[0062] In Example 4, the subject matter of any one or more of
Examples 1-3 optionally include wherein, to select the partition
corresponding to the beam based on the noise level, the stream
selector is to: compare speech levels between the beams; and select
the beam based on having the highest speech levels determined from
the comparison.
[0063] In Example 5, the subject matter of any one or more of
Examples 1-4 optionally include wherein, to select the partition
corresponding to the beam based on the noise level, the stream
selector is to: compare noise levels between the beams; and select
the beam based on having the lowest noise levels determined from
the comparison.
[0064] In Example 6, the subject matter of Example 5 optionally
includes wherein the stream selector uses a phrase quality scorer
of the stream selector to compare the noise levels between the
beams.
[0065] In Example 7, the subject matter of Example 6 optionally
includes wherein a signal-to-noise (SNR) meter of the stream
selector provides a noise level for each beam.
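A minimal sketch of a stream selector driven by a per-beam SNR figure, in the spirit of Examples 5-7, is given below. The estimator is a deliberately crude assumption: it treats the quietest tenth of the frames as the noise floor and the loudest tenth as speech plus noise. The names frame_energies, estimate_snr_db, and select_beam, and the default frame length, are hypothetical and do not come from the disclosure.

```python
import math
from typing import List, Sequence


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean power of each full, non-overlapping frame."""
    return [sum(s * s for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def estimate_snr_db(signal: Sequence[float], frame_len: int = 160) -> float:
    """Crude SNR estimate from the spread of frame energies (an assumption)."""
    energies = sorted(frame_energies(signal, frame_len))
    if not energies:
        raise ValueError("signal is shorter than one frame")
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k    # quietest frames approximate the noise floor
    speech = sum(energies[-k:]) / k  # loudest frames approximate speech plus noise
    eps = 1e-12                      # avoids division by zero and log of zero
    return 10.0 * math.log10((speech + eps) / (noise + eps))


def select_beam(beams: Sequence[Sequence[float]], frame_len: int = 160) -> int:
    """Return the index of the beam with the highest estimated SNR."""
    if not beams:
        raise ValueError("no beams to select from")
    snrs = [estimate_snr_db(b, frame_len) for b in beams]
    return max(range(len(beams)), key=snrs.__getitem__)
```

When the beams carry comparable speech energy, choosing the highest estimated SNR is equivalent to choosing the lowest noise level, which is the comparison recited in Example 5.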
[0066] In Example 8, the subject matter of any one or more of
Examples 1-7 optionally include wherein, to reduce the noise level
in the speech signal from the selected partition, the filter
applies noise reduction to the audio signal.
[0067] In Example 9, the subject matter of any one or more of
Examples 1-8 optionally include wherein, to reduce the noise level
in the speech signal from the selected partition, the filter
applies a spectral profile matching (SPM) to the audio signal.
[0068] In Example 10, the subject matter of Example 9 optionally
includes wherein the spectral profile matching is applied after
noise reduction is applied to the audio signal.
[0069] In Example 11, the subject matter of any one or more of
Examples 1-10 optionally include wherein, to reduce the noise level
in the speech signal from the selected partition, the filter
applies an automated gain control to the audio signal.
[0070] In Example 12, the subject matter of Example 11 optionally
includes wherein the automated gain control is applied after a
spectral profile matching is applied to the audio signal.
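The ordering recited in Examples 10 and 12 (noise reduction, then spectral profile matching, then automated gain control) may be sketched as follows. The noise reduction and spectral profile matching stages are left as caller-supplied placeholders because this passage does not define them. The gain control shown is a simple block-level RMS normalization with a gain cap, which is an assumption and not the disclosed method; the names automatic_gain_control, filter_chain, target_rms, and max_gain are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Signal = List[float]


def automatic_gain_control(signal: Sequence[float],
                           target_rms: float = 0.1,
                           max_gain: float = 10.0) -> Signal:
    """Scale the whole block toward target_rms, never amplifying past max_gain."""
    if not signal:
        return []
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    if rms == 0.0:
        return list(signal)  # pure silence: nothing to scale
    gain = min(target_rms / rms, max_gain)
    return [s * gain for s in signal]


def filter_chain(signal: Sequence[float],
                 noise_reduction: Callable[[Sequence[float]], Signal],
                 spectral_profile_matching: Callable[[Sequence[float]], Signal],
                 agc: Callable[[Sequence[float]], Signal] = automatic_gain_control
                 ) -> Signal:
    """Apply noise reduction, then SPM, then AGC, in that order."""
    return agc(spectral_profile_matching(noise_reduction(signal)))
```

Capping the gain keeps a near-silent block from being amplified until its residual noise dominates; a practical gain control would typically also adapt the gain over time rather than per block.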
[0071] In Example 13, the subject matter of any one or more of
Examples 1-12 optionally include a second filter to perform
acoustic echo cancellation to the plurality of audio channels.
[0072] In Example 14, the subject matter of Example 13 optionally
includes wherein the acoustic echo cancellation is performed prior
to partitioning the plurality of audio channels into beams.
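This passage does not state which algorithm performs the acoustic echo cancellation of Examples 13-14. One commonly used technique is a normalized least-mean-squares (NLMS) adaptive filter, sketched below under that assumption. The function name nlms_echo_cancel and the parameters taps, mu, and eps are hypothetical; reference denotes the loudspeaker signal whose echo is to be removed from a single microphone channel.

```python
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float],
                     reference: Sequence[float],
                     taps: int = 8,
                     mu: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Return mic with the adaptively estimated echo of reference subtracted."""
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                     # echo-cancelled output sample
        norm = sum(h * h for h in history) + eps  # input power normalization
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

To perform the cancellation prior to beam-forming as in Example 14, such a function would be applied to each audio channel individually before the channels are partitioned into beams.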
[0073] Example 15 is at least one machine readable medium including
instructions for a pre-processing pipeline, the instructions, when
executed by a machine, causing the machine to perform operations
comprising: obtaining a plurality of audio channels; removing
reverberations from the audio channels; partitioning the plurality
of audio channels into beams after reverberations are removed;
selecting a partition corresponding to a beam in the beams based on
a noise level; filtering an audio signal from the selected
partition; and providing the filtered audio signal to an external
entity via an output interface of the pre-processing pipeline.
[0074] In Example 16, the subject matter of Example 15 optionally
includes wherein the operations include canceling echoes from the
plurality of audio channels after the reverberations are removed
and before the plurality of audio channels are partitioned into
beams.
[0075] In Example 17, the subject matter of any one or more of
Examples 15-16 optionally include wherein the partitioning the
plurality of audio channels into beams includes: receiving the
plurality of audio channels at a beam-former processor;
partitioning the plurality of audio channels into partitions of two
audio channels based on a relationship between microphones
producing the plurality of audio channels; and providing each
partition to a phase-based beam-former.
[0076] In Example 18, the subject matter of any one or more of
Examples 15-17 optionally include wherein the selecting the
partition corresponding to the beam based on the noise level
includes comparing speech levels between the beams and selecting
the beam based on having the highest speech levels determined from
the comparison.
[0077] In Example 19, the subject matter of any one or more of
Examples 15-18 optionally include wherein the selecting the
partition corresponding to the beam based on the noise level
includes comparing noise levels between the beams and selecting the
beam based on having the lowest noise levels determined from the
comparison.
[0078] In Example 20, the subject matter of Example 19 optionally
includes wherein a phrase quality scorer of a stream selector
performing the partition selection compares the noise levels
between the beams.
[0079] In Example 21, the subject matter of Example 20 optionally
includes wherein a signal-to-noise (SNR) meter of the stream
selector provides a noise level for each beam.
[0080] In Example 22, the subject matter of any one or more of
Examples 15-21 optionally include wherein the filtering includes
applying noise reduction to the audio signal.
[0081] In Example 23, the subject matter of any one or more of
Examples 15-22 optionally include wherein the filtering includes
applying a spectral profile matching (SPM) to the audio signal.
[0082] In Example 24, the subject matter of Example 23 optionally
includes wherein the spectral profile matching is applied after
noise reduction is applied to the audio signal.
[0083] In Example 25, the subject matter of any one or more of
Examples 15-24 optionally include wherein the filtering includes
applying an automated gain control to the audio signal.
[0084] In Example 26, the subject matter of Example 25 optionally
includes wherein the automated gain control is applied after a
spectral profile matching is applied to the audio signal.
[0085] In Example 27, the subject matter of any one or more of
Examples 15-26 optionally include wherein the operations comprise
performing acoustic echo cancellation to the plurality of audio
channels.
[0086] In Example 28, the subject matter of Example 27 optionally
includes wherein the acoustic echo cancellation is performed prior
to partitioning the plurality of audio channels into beams.
[0087] Example 29 is a method for automatic speech recognition
pre-processing, the method comprising: obtaining a plurality of
audio channels; removing reverberations from the audio channels;
partitioning the plurality of audio channels into beams after the
reverberations are removed; selecting a partition corresponding to
a beam in the beams based on a noise level; filtering an audio
signal from the selected partition; and providing the filtered
audio signal to an external entity via an output interface of the
pre-processing pipeline.
[0088] In Example 30, the subject matter of Example 29 optionally
includes canceling echoes from the plurality of audio channels
after the reverberations are removed and before the plurality of
audio channels are partitioned into beams.
[0089] In Example 31, the subject matter of any one or more of
Examples 29-30 optionally include wherein partitioning the
plurality of audio channels into beams includes: receiving the
plurality of audio channels at a beam-former processor;
partitioning the plurality of audio channels into partitions of two
audio channels based on a relationship between microphones
producing the plurality of audio channels; and providing each
partition to a phase-based beam-former.
[0090] In Example 32, the subject matter of any one or more of
Examples 29-31 optionally include wherein selecting the partition
corresponding to the beam based on the noise level includes
comparing speech levels between the beams and selecting the beam
based on having the highest speech levels determined from the
comparison.
[0091] In Example 33, the subject matter of any one or more of
Examples 29-32 optionally include wherein selecting the partition
corresponding to the beam based on the noise level includes
comparing noise levels between the beams and selecting the beam
based on having the lowest noise levels determined from the
comparison.
[0092] In Example 34, the subject matter of Example 33 optionally
includes wherein a phrase quality scorer of a stream selector
performing the partition selection compares the noise levels
between the beams.
[0093] In Example 35, the subject matter of Example 34 optionally
includes wherein a signal-to-noise (SNR) meter of the stream
selector provides a noise level for each beam.
[0094] In Example 36, the subject matter of any one or more of
Examples 29-35 optionally include wherein the filtering includes
applying noise reduction to the audio signal.
[0095] In Example 37, the subject matter of any one or more of
Examples 29-36 optionally include wherein the filtering includes
applying a spectral profile matching (SPM) to the audio signal.
[0096] In Example 38, the subject matter of Example 37 optionally
includes wherein the spectral profile matching is applied after
noise reduction is applied to the audio signal.
[0097] In Example 39, the subject matter of any one or more of
Examples 29-38 optionally include wherein the filtering includes
applying an automated gain control to the audio signal.
[0098] In Example 40, the subject matter of Example 39 optionally
includes wherein the automated gain control is applied after a
spectral profile matching is applied to the audio signal.
[0099] In Example 41, the subject matter of any one or more of
Examples 29-40 optionally include performing acoustic echo
cancellation to the plurality of audio channels.
[0100] In Example 42, the subject matter of Example 41 optionally
includes wherein the acoustic echo cancellation is performed prior
to partitioning the plurality of audio channels into beams.
[0101] Example 43 is a system comprising means to perform any of
the methods of Examples 29-42.
[0102] Example 44 is at least one machine readable medium including
instructions that, when executed by a machine, cause the machine to
perform any of the methods of Examples 29-42.
[0103] Example 45 is a system for automatic speech recognition
pre-processing, the system comprising: means for obtaining a
plurality of audio channels; means for removing reverberations from
the plurality of audio channels; means for partitioning the
plurality of audio channels into beams after the reverberations are
removed; means for selecting a partition corresponding to a beam in
the beams based on a noise level; means for filtering an audio
signal from the selected partition; and means for providing the
filtered audio signal to an external entity via an output interface
of the pre-processing pipeline.
[0104] In Example 46, the subject matter of Example 45 optionally
includes means for canceling echoes from the plurality of audio
channels after the reverberations are removed and before the
plurality of audio channels are partitioned into beams.
[0105] In Example 47, the subject matter of any one or more of
Examples 45-46 optionally include wherein the means for
partitioning the plurality of audio channels into beams includes:
means for receiving the plurality of audio channels at a
beam-former processor; means for partitioning the plurality of
audio channels into partitions of two audio channels based on a
relationship between microphones producing the plurality of audio
channels; and means for providing each partition to a phase-based
beam-former.
[0106] In Example 48, the subject matter of any one or more of
Examples 45-47 optionally include wherein the means for selecting
the partition corresponding to the beam based on the noise level
includes means for comparing speech levels between the beams and
selecting the beam based on having the highest speech levels
determined from the comparison.
[0107] In Example 49, the subject matter of any one or more of
Examples 45-48 optionally include wherein the means for selecting
the partition corresponding to the beam based on the noise level
includes means for comparing noise levels between the beams and
selecting the beam based on having the lowest noise levels
determined from the comparison.
[0108] In Example 50, the subject matter of Example 49 optionally
includes wherein a phrase quality scorer of a stream selector
performing the partition selection compares the noise levels
between the beams.
[0109] In Example 51, the subject matter of Example 50 optionally
includes wherein a signal-to-noise (SNR) meter of the stream
selector provides a noise level for each beam.
[0110] In Example 52, the subject matter of any one or more of
Examples 45-51 optionally include wherein the means for filtering
includes means for applying noise reduction to the audio
signal.
[0111] In Example 53, the subject matter of any one or more of
Examples 45-52 optionally include wherein the means for filtering
includes means for applying a spectral profile matching (SPM) to
the audio signal.
[0112] In Example 54, the subject matter of Example 53 optionally
includes wherein the spectral profile matching is applied after
noise reduction is applied to the audio signal.
[0113] In Example 55, the subject matter of any one or more of
Examples 45-54 optionally include wherein the means for filtering
includes means for applying an automated gain control to the audio
signal.
[0114] In Example 56, the subject matter of Example 55 optionally
includes wherein the automated gain control is applied after a
spectral profile matching is applied to the audio signal.
[0115] In Example 57, the subject matter of any one or more of
Examples 45-56 optionally include means for performing acoustic
echo cancellation to the plurality of audio channels.
[0116] In Example 58, the subject matter of Example 57 optionally
includes wherein the acoustic echo cancellation is performed prior
to partitioning the plurality of audio channels into beams.
[0117] The above detailed description includes references to the
accompanying drawings, which form a part of the detailed
description. The drawings show, by way of illustration, specific
embodiments that may be practiced. These embodiments are also
referred to herein as "examples." Such examples may include
elements in addition to those shown or described. However, the
present inventors also contemplate examples in which only those
elements shown or described are provided. Moreover, the present
inventors also contemplate examples using any combination or
permutation of those elements shown or described (or one or more
aspects thereof), either with respect to a particular example (or
one or more aspects thereof), or with respect to other examples (or
one or more aspects thereof) shown or described herein.
[0118] All publications, patents, and patent documents referred to
in this document are incorporated by reference herein in their
entirety, as though individually incorporated by reference. In the
event of inconsistent usages between this document and those
documents so incorporated by reference, the usage in the
incorporated reference(s) should be considered supplementary to
that of this document; for irreconcilable inconsistencies, the
usage in this document controls.
[0119] In this document, the terms "a" or "an" are used, as is
common in patent documents, to include one or more than one,
independent of any other instances or usages of "at least one" or
"one or more." In this document, the term "or" is used to refer to
a nonexclusive or, such that "A or B" includes "A but not B," "B
but not A," and "A and B," unless otherwise indicated. In the
appended claims, the terms "including" and "in which" are used as
the plain-English equivalents of the respective terms "comprising"
and "wherein." Also, in the following claims, the terms "including"
and "comprising" are open-ended, that is, a system, device,
article, or process that includes elements in addition to those
listed after such a term in a claim is still deemed to fall within
the scope of that claim. Moreover, in the following claims, the
terms "first," "second," and "third," etc. are used merely as
labels, and are not intended to impose numerical requirements on
their objects.
[0120] The above description is intended to be illustrative, and
not restrictive. For example, the above-described examples (or one
or more aspects thereof) may be used in combination with each
other. Other embodiments may be used, such as by one of ordinary
skill in the art upon reviewing the above description. The Abstract
is to allow the reader to quickly ascertain the nature of the
technical disclosure and is submitted with the understanding that
it will not be used to interpret or limit the scope or meaning of
the claims. Also, in the above Detailed Description, various
features may be grouped together to streamline the disclosure. This
should not be interpreted as intending that an unclaimed disclosed
feature is essential to any claim. Rather, inventive subject matter
may lie in less than all features of a particular disclosed
embodiment. Thus, the following claims are hereby incorporated into
the Detailed Description, with each claim standing on its own as a
separate embodiment. The scope of the embodiments should be
determined with reference to the appended claims, along with the
full scope of equivalents to which such claims are entitled.
* * * * *