U.S. patent application number 14/566579 was published by the patent office on 2015-06-18 for apparatus and a method for audio signal processing.
The applicant listed for this patent is GN Netcom A/S. Invention is credited to Rasmus Kongsgaard Olsson.
Application Number | 20150172807 14/566579 |
Family ID | 49765885 |
Filed Date | 2014-12-10 |
United States Patent Application | 20150172807 |
Kind Code | A1 |
Olsson; Rasmus Kongsgaard | June 18, 2015 |
Apparatus And A Method For Audio Signal Processing
Abstract
An apparatus, such as a headset, configured to process audio
signals from multiple microphones, comprising: a first pair of
microphones (101, 102) outputting a first pair of microphone
signals and a second pair of microphones (103, 104) outputting a
second pair of microphone signals; a first beamformer (105) and a
second beamformer (106) each configured to receive a pair of
microphone signals and adapt the spatial sensitivity of a
respective pair of microphones as measured in a respective
beamformed signal (X.sub.L; X.sub.R) output from a respective
beamformer (105; 106); wherein the spatial sensitivity is adapted
to suppress noise relative to a desired signal; a third beamformer
(107) configured to dynamically combine the signals (X.sub.L;
X.sub.R) output from the first beamformer (105) and the second
beamformer (106) into a combined signal (X.sub.C); wherein the
signals are combined such that signal energy in the combined signal
is minimized while a desired signal is preserved; and a noise
reduction unit (109) configured to process the combined signal
(X.sub.C) from the third beamformer (107) and output the combined
signal such that noise is reduced.
Inventors: | Olsson; Rasmus Kongsgaard; (Roskilde, DK) |
Applicant: |
Name | City | State | Country | Type |
GN Netcom A/S | Ballerup | | DK | |
Family ID: | 49765885 |
Appl. No.: | 14/566579 |
Filed: | December 10, 2014 |
Current U.S. Class: | 381/74; 381/92 |
Current CPC Class: | H04R 3/005 20130101; G10K 11/175 20130101; H04R 1/406 20130101; H04R 2201/107 20130101; G10L 2021/02166 20130101; H04R 2201/10 20130101; H04R 1/1091 20130101; G10L 21/0208 20130101 |
International Class: | H04R 1/10 20060101 H04R001/10; H04R 3/00 20060101 H04R003/00 |
Foreign Application Data
Date | Code | Application Number |
Dec 13, 2013 | EP | 13197139 |
Claims
1. An apparatus, such as a headset, configured to process audio
signals from multiple microphones, comprising: a first pair of
microphones outputting a first pair of microphone signals and a
second pair of microphones outputting a second pair of microphone
signals; wherein the first pair of microphones are arranged with a
first mutual distance and the second pair of microphones are
arranged with a second mutual distance, and wherein the first pair
of microphones are arranged at a distance from the second pair of
microphones that is greater than the first mutual distance and the
second mutual distance at least when the apparatus is in normal
operation; a first beamformer and a second beamformer each
configured to receive a pair of microphone signals and adapt the
spatial sensitivity of a respective pair of microphones as measured
in a respective beamformed signal (X.sub.L; X.sub.R) output from a
respective beamformer; wherein the spatial sensitivity is adapted
to suppress noise relative to a desired signal; a third beamformer
configured to dynamically combine the signals (X.sub.L; X.sub.R)
output from the first beamformer and the second beamformer into a
combined signal (X.sub.C); wherein the signals are combined such
that noise energy in the combined signal is minimized while a
desired signal is preserved; a noise reduction unit configured to
process the combined signal (X.sub.C) from the third beamformer and
output the combined signal such that noise is reduced.
2. An apparatus according to claim 1, wherein the noise reduction
unit is configured to perform noise suppression on the combined
signal (X.sub.C) from the third beamformer in response to a noise
suppression coefficient (A.sub.L; A.sub.R); and wherein the noise
suppression coefficient (A.sub.L; A.sub.R) is estimated from the
microphone signals and/or a beamformed signal (X.sub.L;
X.sub.R).
3. An apparatus according to claim 1, wherein the apparatus
comprises: a first control branch synthesizing a first noise
suppression gain, A.sub.L, from the first pair of microphone
signals and/or the first beamformer; a second control branch
synthesizing a second noise suppression gain, A.sub.R, from the
second pair of microphone signals and/or the second beamformer; a
selector configured to dynamically select and/or output the first
noise suppression gain, A.sub.L, or the second noise suppression
gain, A.sub.R; wherein the noise reduction unit is configured to
process the combined signal from the third beamformer in response
to the selected and/or output noise suppression gain, A.sub.S, from
the selector.
4. An apparatus according to claim 3, wherein the selector is
configured to operate in response to a first signal quality
indicator (P.sub.L) and a second signal quality indicator
(P.sub.R); and wherein the signal quality indicators (P.sub.L;
P.sub.R) are synthesized from a respective beamformed signal
(X.sub.L; X.sub.R) processed to reduce noise in response to
respective noise reduction gains (A.sub.L; A.sub.R).
5. An apparatus according to claim 3, wherein a beamformed signal
(X.sub.L; X.sub.R), processed to reduce noise in response to
respective noise reduction gains (A.sub.L; A.sub.R), is input to an
evaluator that is configured to output a control signal (P.sub.L;
P.sub.R) to the selector and thereby control selection; and wherein
the evaluator evaluates the beamformed signal (X.sub.L; X.sub.R),
processed to reduce noise in response to respective noise reduction
gains (A.sub.L; A.sub.R), according to a criterion of least power
during a time interval when voice activity is detected as not
present.
6. An apparatus according to claim 2, wherein the noise suppression
coefficient is computed to reduce noise by a predetermined, fixed
factor.
7. An apparatus according to claim 1, wherein at least one of the
first beamformer or second beamformer is configured to comprise: a
first stage that generates a summation signal and a difference
signal from the input signals, subject to at least one of the input
signals being phase and/or amplitude aligned with another of the
input signals with respect to a desired signal; and a second stage
that filters the difference signal and generates a filtered
signal; wherein the beamformed output signal is generated from the
difference between the summation signal and the filtered signal;
and wherein the filter is adapted using a least mean square
technique to minimize the power of the beamformed output
signal.
8. An apparatus according to claim 1, wherein the third beamformer
is configured with a fixed sensitivity with respect to a predefined
spatial position relative to the spatial position of the
microphones.
9. An apparatus according to claim 1, wherein the microphones
output digital signals; wherein the apparatus performs a
transformation of the digital signals to a time-frequency
representation, in multiple frequency bands; and wherein the
apparatus performs an inverse transformation of at least the
combined signal to a time-domain representation.
10. An apparatus according to claim 1, wherein the microphones
output analogue signals; wherein the apparatus performs
analogue-to-digital conversion of the analogue signals to provide
digital signals; wherein the apparatus performs a transformation of
the digital signals to a time-frequency representation, in multiple
frequency bands; and wherein the apparatus performs an inverse
transformation of at least the combined signal to a time-domain
representation.
11. An apparatus according to claim 1, wherein the microphones of
at least one pair of the set of microphones are arranged in an
end-fire configuration oriented towards a position where a person's
mouth is expected to be when the apparatus is used by the
person.
12. A method for processing audio signals from multiple
microphones, comprising: receiving a first pair and a second pair
of microphone signals from a first pair of microphones and a second
pair of microphones, respectively; wherein the first pair of
microphones are arranged with a first mutual distance and the
second pair of microphones are arranged with a second mutual
distance, and wherein the first pair of microphones are arranged at
a distance from the second pair of microphones that is greater than
the first mutual distance and the second mutual distance at least
when the apparatus is in normal operation; performing first
beamforming and second beamforming on the first pair of microphone
signals and the second pair of microphone signals to output
respective beamformed signals (X.sub.L; X.sub.R); adapting the
spatial sensitivity by a respective pair of microphones as measured
in a respective beamformed signal (X.sub.L; X.sub.R) such that
spatial sensitivity is adapted to suppress noise relative to a
desired signal; performing third beamforming to dynamically combine
the signals (X.sub.L; X.sub.R) output from the first beamforming
and the second beamforming into a combined signal (X.sub.C);
wherein the signals are combined such that noise energy in the
combined signal is minimized while a desired signal is preserved;
performing noise reduction to process the combined signal (X.sub.C)
from the third beamformer and output the combined signal such that
noise is reduced.
13. A computer program product comprising program code means
adapted to cause a data processing system to perform the steps of
the method according to claim 12, when said program code means are
executed on the data processing system.
14. A computer program product according to claim 13, comprising a
computer-readable medium having stored thereon the program code
means.
15. A computer data signal embodied in a carrier wave and
representing sequences of instructions which, when executed by a
processor, cause the processor to perform the steps of the method
according to claim 12.
Description
[0001] It has been discovered that use of multiple microphones and
the use of beamforming techniques provide audio signal reproduction
that is superior to single microphone or non-beamforming systems.
The multiple microphones are located at different positions and
allow so-called spatial sampling which in turn enables cancelling
of noise interfering with a desired signal such as a person's
voice; this is also known as beamforming, spatial filtering or
noise-cancelling. Subsequent time varying post-filters are often
applied as a means to further discriminate the person's voice from
(background) noise signals.
[0002] Multiple microphones and the use of beamforming techniques
are frequently embodied in headsets, hearing aids, laptop computers
and other electronic consumer devices.
[0003] The technical field of beamformers has been extensively
researched; however their qualities and configurations have not
been fully exploited.
RELATED PRIOR ART
[0004] US 2012/0020485 discloses an audio signal processing method
which estimates a first indication of a direction of arrival,
relative to a first pair of microphones, of a first sound component
received by the first pair of microphones; and estimates a second
indication of a direction of arrival, relative to a second pair of
microphones, of a second sound component received by the second
pair of microphones. The first and the second pair of microphones
are arranged at respective sides of a person's head during normal
operation of a device using the method. The method also involves
controlling gain of an audio signal to produce an output signal,
based on the first and second direction indications.
SUMMARY
[0005] There is provided an apparatus, such as a headset,
configured to process audio signals from multiple microphones,
comprising: a first pair of microphones outputting a first pair of
microphone signals and a second pair of microphones outputting a
second pair of microphone signals; wherein the first pair of
microphones are arranged with a first mutual distance and the
second pair of microphones are arranged with a second mutual
distance, and wherein the first pair of microphones are arranged at
a distance from the second pair of microphones that is greater than
the first mutual distance and the second mutual distance at least
when the apparatus is in normal operation; a first beamformer and a
second beamformer each configured to receive a pair of microphone
signals and adapt the spatial sensitivity of a respective pair of
microphones as measured in a respective beamformed signal output
from a respective beamformer; wherein the spatial sensitivity is
adapted to suppress noise relative to a desired signal; a third
beamformer configured to dynamically combine the signals output
from the first beamformer and the second beamformer into a combined
signal; wherein the signals are combined such that noise energy in
the combined signal is minimized while a desired signal is
preserved; and a noise reduction unit configured to process the
combined signal from the third beamformer and output the combined
signal such that noise is reduced.
[0006] Thus, beamforming is provided in a first beamforming stage
with the first beamformer and the second beamformer processing the
microphone signals and in a second stage with a third beamformer
processing signals output from the first stage. The first
beamforming stage serves to enhance or emphasize the desired signal
locally with respect to the microphone pairs by adapting the
spatial sensitivity of a respective microphone pair. The spatial
sensitivity is adapted, e.g., by adjusting beamformer coefficients
to control the spatial configuration of the beamformer nulls which
may comprise adjusting beamformer coefficients such that the
beamformer obtains an omni-directional characteristic, which is
useful to avoid amplification of uncorrelated (between microphones)
noise such as wind noise. The effectiveness of the first
beamforming stage depends on the assumption that the microphones of
each microphone pair are situated closely to one another (for
reasons explained below).
[0007] In addition to such local optimization in capturing a
desired signal, the level of the noise component may vary
considerably between the first and second beamformed signals. This
may be due to different levels at the microphones, e.g., wind
turbulence is a highly local phenomenon, and acoustic shadowing
effects from the user's head in a head worn device. Furthermore,
the first and the second beamformers may not be able to cancel the
noise equally well, depending on the relative position of the
microphone pair, the signal of interest and interfering noises.
[0008] The third beamformer is thus configured to receive signals
that have already been subject to local optimization by the first
stage beamformers whereby the desired signal is isolated as far as
possible. By dynamically combining signals from the left-hand side
and the right-hand side, it is possible to select or emphasize a
spatially controlled signal from the most favourably positioned
microphone pair.
[0009] Processing microphone signals in this way improves the
effect of noise suppression by the noise reduction unit when, as
claimed, it is configured to process the combined signal from the
third beamformer. This is partly ascribed to the observation that
desired signals stand out more clearly after such two-stage
beamforming, which makes noise suppression more effective.
Furthermore, the two-stage beamformer approach achieves the
combined benefit of beamforming on microphones that are closely
spaced and microphones that are not closely spaced using well known
dual-microphone beamformers. The third beamformer may combine its
input signals by linear or non-linear weighting of the input
signals.
[0010] The apparatus, such as a headset, a hearing aid or another
apparatus picking up audio signals by means of microphones may be
configured to be worn by a person with the first pair of
microphones arranged on a left-hand side of a person's head and the
second pair of microphones arranged on the right-hand side of the
person's head. Typically, the two pairs of microphones are sitting
on an ear-cup of a headphone, a spectacle frame or booms or other
protrusions at respective sides of a person's head. The microphones
are arranged, at least approximately, in a so-called end-fire
configuration. The microphones may alternatively or additionally be
arranged in a broadside configuration.
[0011] By arranging the microphones, such that intra-pair
microphones sit closer than inter-pair microphones at least when
the apparatus is in normal operation and intra-pairs in end-fire
configurations, the first and the second beamformer can take
advantage of any near-field effect to cancel or suppress more noise
at low frequencies and in addition make it possible to cancel more
noise at higher frequencies, avoiding spatial aliasing.
Additionally, the third beamformer can take advantage of the
different local noise levels that the different pairs of
microphones are exposed to. When the microphone pairs sit on
different sides of a person's head, the head may form a wind and/or
sound shadow reducing noise level on one side of the person's head.
It is a major advantage of the invention that the highly complex
problem of designing a single adaptive beamformer operating on all
microphone inputs is decomposed into three simple, robust,
well-understood dual-microphone beamformers.
[0012] In general, different types of microphones with different
characteristics may be selected.
[0013] A desired signal is a signal that typically represents voice
from a speaker within proximity of the microphones or voice
appearing from a certain direction relative to the orientation of
the microphones. A desired signal may be characterised by being
emitted from one or more sound sources having predefined spatial
locations with respect to the spatial location of the microphones.
Since multiple microphones are used to pick up the desired signal,
the desired signal may be characterised by a predefined phase
and/or amplitude difference among the microphone signals and/or
among beamformed signals. A desired signal may also be
characterised by a predefined temporal characteristic and/or a
predefined phase-/amplitude-frequency characteristic.
[0014] A noise signal or simply noise may include turbulence sounds
induced by wind occurring at sufficiently high wind speeds and
acting on the microphone membranes. Noise may also include
background sounds such as tones from machines, sounds from items
rattling or chinking, sounds from people talking amongst each
other, etc. In some definitions, noise is characterised by being
emitted from one or more sound sources that are located at other
locations than the desired signal.
[0015] The first beamformer and the second beamformer adapt the
directional sensitivity gradually or in steps e.g. comprising
sensitivities that are at least approximated from the group of the
following characteristics: Omni-directional, bi-directional,
cardioid, subcardioid, hypercardioid, supercardioid or shotgun. The
directional sensitivity may be changed gradually between an
omni-directional, a bi-directional and a cardioid characteristic.
The first beamformer may be configured as disclosed in WO
2009/132646 which is hereby incorporated by reference for
everything disclosed in connection with especially FIG. 1
thereof.
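By way of a non-limiting sketch, the gradual change between characteristics can be pictured with the idealised first-order pattern alpha + (1 - alpha) cos(theta). This parametrisation, the names `first_order_sensitivity`, `PATTERNS` and `blend`, and the alpha values are illustrative assumptions and are not taken from the disclosure or from WO 2009/132646.

```python
import math


def first_order_sensitivity(alpha: float, theta_rad: float) -> float:
    """Idealised first-order sensitivity alpha + (1 - alpha) * cos(theta).

    theta_rad is the angle of arrival relative to the microphone axis.
    """
    return alpha + (1.0 - alpha) * math.cos(theta_rad)


# Assumed alpha values for three of the characteristics named in the text.
PATTERNS = {
    "omni-directional": 1.0,
    "cardioid": 0.5,
    "bi-directional": 0.0,
}


def blend(alpha_from: float, alpha_to: float, step: float) -> float:
    """Move gradually between two characteristics; step runs from 0 to 1."""
    return (1.0 - step) * alpha_from + step * alpha_to
```

In this sketch a cardioid has its null directly behind the pair (theta = pi), a bi-directional pattern has its null at the side (theta = pi/2), and `blend` steps the parameter from one characteristic towards another.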
[0016] The third beamformer may combine the signals from the first
and the second beamformer in accordance with coefficients estimated
from noise powers. In case the noise power of the signal from the
first beamformer is higher than the noise power of the signal from
the second beamformer, the signal from the second beamformer is
weighted higher than the signal from the first beamformer and vice
versa. The noise level of a signal may be estimated when voice is
detected as not present.
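One possible way to derive such coefficients is sketched below, purely as an illustration. It assumes inverse-noise-power weights that sum to one, so that a desired component that is identical in X.sub.L and X.sub.R passes unchanged; if the two noise components are additionally assumed to be uncorrelated, weights of this kind minimise the noise power of the weighted sum. The function names are hypothetical.

```python
def combine_weights(noise_power_l: float, noise_power_r: float):
    """Return (w_l, w_r): weights summing to one that favour the quieter side.

    The noise powers are assumed to be estimated while voice is absent.
    """
    total = noise_power_l + noise_power_r
    if total <= 0.0:
        return 0.5, 0.5  # no noise estimate available: weight both sides equally
    w_l = noise_power_r / total  # more noise on the right gives more weight to the left
    return w_l, 1.0 - w_l


def combine(x_l, x_r, w_l: float, w_r: float):
    """Linear combination of two beamformed signals, sample by sample."""
    return [w_l * a + w_r * b for a, b in zip(x_l, x_r)]
```

For example, with a noise power of 4.0 on the left and 1.0 on the right, the right-hand signal receives the higher weight.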
[0017] The first mutual distance between the microphones of the
first pair and the second mutual distance between the microphones
of the second pair are shorter than the minimum wavelength of
interest in the case of end-fire pairs, depending on the desired
directional sensitivity. At and above frequencies with a shorter
wavelength than the wavelength of interest, the ability to suppress
or cancel noise will diminish due to the effect of spatial
aliasing. The distance between the microphone pairs may correspond
to the straight-line distance between a person's two ears, which
may be about 18-22 cm. The first mutual distance and the second
mutual distance may be about 10, 20, or 40 mm for a bandwidth of
interest up to 4 kHz.
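The relation between microphone spacing and wavelength can be checked with a short calculation. The sketch below assumes a speed of sound of 343 m/s (air at roughly 20 degrees C), a value that is not stated in the disclosure; the function names are hypothetical.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: air at roughly 20 degrees C


def wavelength_mm(frequency_hz: float) -> float:
    """Acoustic wavelength in millimetres at the given frequency."""
    return SPEED_OF_SOUND_M_PER_S / frequency_hz * 1000.0


def spacing_is_shorter_than_wavelength(spacing_mm: float, max_frequency_hz: float) -> bool:
    """True if the spacing is below the shortest wavelength of interest."""
    return spacing_mm < wavelength_mm(max_frequency_hz)
```

Under that assumption the wavelength at 4 kHz is about 86 mm, so each of the example spacings of 10, 20 and 40 mm is shorter than the minimum wavelength of interest.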
[0018] In general, the apparatus may perform signal processing in a
time-domain or in a time-frequency-domain. In the latter case,
time-to-frequency transformations are performed on signal blocks of
a predefined duration on a running basis. In the
time-frequency-domain signals are represented as time-domain
samples in a number of frequency bins. Correspondingly,
frequency-to-time reconstruction is performed on signals processed
in the time-frequency-domain.
[0019] In some embodiments the noise reduction unit is configured
to perform noise suppression on the combined signal from the third
beamformer in response to a noise suppression coefficient; and the
noise suppression coefficient is estimated from the microphone
signals and/or a beamformed signal.
[0020] The noise suppression coefficient may comprise a first
coefficient estimated from the first set of microphone signals and
from a/the beamformed signal. The noise suppression coefficient may
alternatively or additionally comprise a second coefficient
estimated from the second set of microphone signals and from a/the
beamformed signal. The noise suppression coefficient may be
combined from the first and the second coefficient.
[0021] The noise suppression coefficient may be a gain factor of a
multiplier in a time-frequency domain or a filter coefficient of a
time-domain filter.
[0022] In some embodiments the apparatus comprises: a first control
branch synthesizing a first noise suppression gain from the first
pair of microphone signals and/or the first beamformer; a second
control branch synthesizing a second noise suppression gain from
the second pair of microphone signals and/or the second beamformer;
and a selector configured to dynamically select and/or output the
first noise suppression gain or the second noise suppression gain;
wherein the noise reduction unit is configured to process the
combined signal from the third beamformer in response to the
selected and/or output noise suppression gain from the
selector.
[0023] Thereby it is possible to dynamically select the first or
the second noise suppression gain such that it is in accordance
with signal quality measures estimated from a respective beamformed
signal output from a respective beamformer and respective noise
suppression gains. This is expedient since the first and the second
noise reduction gains may be computed under conditions which are
not equally favourable. As a consequence, the noise may not be
suppressed equally well and/or the desired signal may not be
preserved equally well. For example, the mechanism for computing
the first noise suppression gain may have access to signals which
lend themselves to easier discrimination of the noise and the
desired signal. This condition may arise from the situation where
noise is less powerful at the input to the first beamformer due to
a user's head shadow causing less wind noise or background noise.
The condition may also arise from the situation where the spatial
cues employed by the first noise suppression computation are more
discriminative.
[0024] A hysteresis or threshold may be applied and used as a
criterion on whether to enable the selector or not. Thereby it is
possible to disable switching when an estimated noise level is
below a predefined hysteresis or threshold. The hysteresis or
threshold may be in the range of about 1 dB to about 3 dB. Thereby,
it is possible to strike a trade-off between (1) achieving lowest
output noise level and (2) minimizing distortion of a desired signal
such as a voice signal.
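A selector with hysteresis might be sketched as follows. The sketch interprets the hysteresis as a minimum level difference, in dB, between the two sides before the selection is switched; that interpretation, the default of 2 dB (chosen from the stated range of about 1 dB to about 3 dB) and the function names are assumptions made for illustration.

```python
import math


def to_db(power: float, floor: float = 1e-12) -> float:
    """Convert a power value to dB, guarding against log of zero."""
    return 10.0 * math.log10(max(power, floor))


def select_side(current: str, p_l: float, p_r: float, hysteresis_db: float = 2.0) -> str:
    """Return 'L' or 'R', switching only on a sufficiently large level difference.

    p_l and p_r are noise powers measured on the left and right branches.
    """
    diff_db = to_db(p_l) - to_db(p_r)  # positive when the left side is noisier
    if current == "L" and diff_db > hysteresis_db:
        return "R"
    if current == "R" and diff_db < -hysteresis_db:
        return "L"
    return current  # difference within the hysteresis: keep the current selection
```

With these defaults a difference of about 0.5 dB leaves the selection unchanged, whereas a difference of about 3 dB causes a switch to the quieter side.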
[0025] In some embodiments the selector is configured to operate in
response to a first signal quality indicator and a second signal
quality indicator; the signal quality indicators are synthesized
from a respective beamformed signal processed to reduce noise in
response to respective noise reduction gains.
[0026] In terms of noise suppression, an important aspect of signal
quality is signal-to-noise ratio. As an example, with reference to
FIG. 2, when using the beamformed, noise reduced signals as input
to Signal Quality Evaluation, signal-to-noise ratio is influenced
through X.sub.L and X.sub.R. For example, if the signal-to-noise
ratio of X.sub.L is greater than that of X.sub.R, in cases where
A.sub.L and A.sub.R reduce the noise component by the same factor,
the signal-to-noise ratio of A.sub.L X.sub.L will be higher than
that of A.sub.R X.sub.R.
[0027] Furthermore, the Signal Quality Evaluation is influenced by
the qualities of A.sub.L and A.sub.R. In some cases, speech is more
easily distinguishable from noise at one side of the head. A reason
is that a user's head may shield the microphones from wind on a lee
side of the user's head. Another reason is that the spatial cues
employed by the noise suppression computation may be discriminated
more clearly on the lee side of the user's head.
[0028] The signal quality indicators P.sub.L; P.sub.R, may be
computed from the mean-squared product of the respective noise
reduction gains, A.sub.L; A.sub.R, and the respective beam-formed
signals X.sub.L; X.sub.R. The signal quality indicators may be
computed per frequency band or accumulated across all frequency
bands.
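As an illustration only, a signal quality indicator of this kind could be computed as the mean of the squared magnitude of the gain-weighted time-frequency values. The function name and the representation of the signals as lists of complex values are assumptions.

```python
def quality_indicator(gains, values):
    """Mean of |A * X|^2 over the supplied time-frequency values.

    gains holds noise reduction gains (A.sub.L or A.sub.R) and values holds the
    matching beamformed values (X.sub.L or X.sub.R), real or complex. Passing
    the values of one band gives a per-band indicator; passing the values of
    all bands gives an indicator accumulated across bands.
    """
    products = [abs(a * x) ** 2 for a, x in zip(gains, values)]
    return sum(products) / len(products)
```

For instance, gains of 0.5 applied to the values 2 and 2j give an indicator of 1.0.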
[0029] In some embodiments a beamformed signal, processed to reduce
noise in response to respective noise reduction gains, is input to
an evaluator that is configured to output a control signal to the
selector and thereby control selection; and the evaluator evaluates
the beamformed signal, processed to reduce noise in response to
respective noise reduction gains, according to a criterion of least
power during a time interval when voice activity is detected as not
present.
[0030] Thereby, the selection of respective noise suppression gains
can be performed from an evaluation of the noise conditions (e.g.
noise power) at respective sides of a person's head.
[0031] Least noise power of the left and the right beamformed,
noise reduced signals used as a selection criterion combines a
number of quality parameters into a simple computation. As
previously mentioned, noise power is a measure similar to
signal-to-noise ratio when the microphone inputs are aligned
through alignment filters, but it is simpler to compute.
[0032] When noise reduction is performed, there is a risk of
introducing voice processing artifacts that degrade voice quality.
The noise power measure, used in the least noise power criterion,
selects for higher voice quality in many cases. When the criterion
is based on least power, preference is associated with signals
where it is easier to detect all parts of the voice component,
especially the low-level parts, which in turn leads to fewer
audible instances of voice processing artifacts. A voice activity
detector may output a signal indicative of whether voice activity
is detected or not. Voice activity may be detected when an
amplitude or peak magnitude or power level of one or more
microphone signals and/or a beamformed signal exceeds a predefined
or time-varying threshold. The level of the threshold may be
adapted to an estimated noise level.
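A power-based voice activity detector with a threshold that follows the estimated noise level might look like the sketch below. The margin, the smoothing constant, the initial noise estimate and the name `make_vad` are illustrative assumptions; in particular, the initial noise estimate is assumed to be roughly correct, since the estimate is only updated on frames judged to contain no voice.

```python
def make_vad(margin: float = 4.0, smoothing: float = 0.9, initial_noise: float = 1e-6):
    """Create a detector that flags frames whose power exceeds margin * noise."""
    noise = initial_noise

    def detect(frame) -> bool:
        nonlocal noise
        power = sum(s * s for s in frame) / len(frame)
        voice = power > margin * noise  # threshold tracks the noise estimate
        if not voice:
            # Update the noise estimate only while voice is absent.
            noise = smoothing * noise + (1.0 - smoothing) * power
        return voice

    return detect
```

A detector created with `make_vad(initial_noise=0.01)` reports no voice for a frame of power 0.01 and reports voice for a frame of power 1.0.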
[0033] In some embodiments the noise suppression coefficient is
computed to reduce noise by a predetermined, fixed factor.
[0034] The predetermined factor may be e.g. 13 dB, 6 dB, 10 dB, 15
dB or another factor. This may be achieved by limiting the noise
suppression gain to the predetermined factor.
[0035] As an example, an estimated noise level at the output of the
first beamformer and the second beamformer may be, say, -30 dB and
-20 dB, respectively; the fixed factor may be say 10 dB; and
consequently, the estimated noise level after noise suppression is
then -40 dB and -30 dB, respectively.
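The arithmetic of this example, together with one possible way of limiting the noise suppression gain, can be sketched as follows. The sketch assumes that the gain is an amplitude gain, so that a suppression of S dB corresponds to a floor of 10^(-S/20); the function names are hypothetical.

```python
def limit_gain(gain: float, max_suppression_db: float = 10.0) -> float:
    """Limit an amplitude gain so that it never suppresses by more than the given factor."""
    floor = 10.0 ** (-max_suppression_db / 20.0)
    return max(gain, floor)


def noise_level_after_suppression_db(noise_level_db: float, suppression_db: float = 10.0) -> float:
    """Noise level in dB after noise has been reduced by a fixed factor."""
    return noise_level_db - suppression_db
```

With a fixed factor of 10 dB, noise levels of -30 dB and -20 dB become -40 dB and -30 dB, matching the example above.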
[0036] The left and right beamformed signals may be matched
in level towards the signal of interest, e.g. using alignment
filters/gains on the microphones at any point in the signal chain
preceding the noise suppression gain selection module. As a
beneficial consequence of using fixed noise suppression factors and
level-matched left and right channels, noise power computations are
conditioned to serve as left and right signal quality measures
which reflect the signal-to-noise ratios of the left and right
beamformer outputs to a higher degree.
[0037] In some embodiments at least one of the first beamformer or
the second beamformer is configured to comprise: a first stage that
generates a summation signal and a difference signal from the input
signals, subject to at least one of the input signals being phase
and/or amplitude aligned with another of the input signals with
respect to a desired signal; and a second stage that filters the
difference signal and generates a filtered signal; wherein the
beamformed output signal is generated from the difference between
the summation signal and the filtered signal; and wherein the
filter is adapted using a least mean square technique to minimize
the power of the beamformed output signal.
[0038] Thereby the first and/or the second beamformer selectively
and adaptively cancel out sound from certain directions.
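The two-stage structure can be illustrated with a time-domain sketch. It assumes that the two input signals are already aligned with respect to the desired signal, and it uses a normalised variant of the least mean square update, chosen here only for step-size robustness. The filter length, the step size and all names are assumptions and do not reproduce the beamformer of WO 2009/132646.

```python
import random


def adaptive_sum_difference_beamformer(x1, x2, taps=2, mu=0.5, eps=1e-12):
    """Return (output, filter_coefficients) for two aligned microphone signals."""
    h = [0.0] * taps        # adaptive filter acting on the difference signal
    history = [0.0] * taps  # most recent difference samples, newest first
    output = []
    for a, b in zip(x1, x2):
        s = 0.5 * (a + b)   # summation signal: keeps the aligned desired signal
        d = 0.5 * (a - b)   # difference signal: the aligned desired signal cancels
        history = [d] + history[:-1]
        # Beamformed output: summation signal minus filtered difference signal.
        y = s - sum(hk * dk for hk, dk in zip(h, history))
        # Normalised LMS step that reduces the power of the output.
        norm = sum(dk * dk for dk in history) + eps
        h = [hk + mu * y * dk / norm for hk, dk in zip(h, history)]
        output.append(y)
    return output, h


# Usage example with noise only: the noise reaches the second microphone
# inverted and at half amplitude, so the summation signal is 0.25 * noise and
# the difference signal is 0.75 * noise.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic1 = noise
mic2 = [-0.5 * v for v in noise]
output, coefficients = adaptive_sum_difference_beamformer(mic1, mic2)
```

In this example the first coefficient is expected to settle near 1/3, the value at which the filtered difference signal cancels the noise in the summation signal. Because an aligned desired signal does not appear in the difference signal, minimising the output power in this way does not cancel it.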
[0039] The filter may have a low-pass characteristic to enhance
lower frequency components relative to higher frequency components.
The filter may be a bass-boost filter.
[0040] Such a beamformer may be configured as disclosed in WO
2009/132646 which is hereby incorporated by reference for
everything it discloses.
[0041] In some embodiments the third beamformer is configured with
a fixed sensitivity with respect to a predefined spatial position
relative to the spatial position of the microphones.
[0042] A fixed sensitivity means that the third beamformer applies
a fixed frequency response with respect to sound emanating from an
acoustic source at the predefined spatial position.
[0043] The predefined position is located in a predefined way with
respect to the spatial position and orientation of the first set of
microphones and the second set of microphones. The predefined space
is preferably centred about a person's mouth when the apparatus is
worn by the person in a normal way.
[0044] Beamforming coefficients of the third beamformer may be
constrained to sum to a fixed gain e.g. unity gain towards the
spatial position. The gain is fixed in the sense that it is not
adaptive. However, the gain may be adjusted in connection with
calibration or as a preference setting.
[0045] The third beamformer may combine the input signals by a
linear combination. Alternatively, the signals may be combined by a
non-linear combination.
[0046] In some embodiments the microphones output digital signals;
the apparatus performs a transformation of the digital signals to a
time-frequency representation, in multiple frequency bands; and the
apparatus performs an inverse transformation of at least the
combined signal to a time-domain representation.
[0047] The transformation may be performed by means of a Fast
Fourier Transformation, FFT, applied to a signal block of a
predefined duration. The transformation may involve applying a Hann
window or another type of window. A time-domain signal may be
reconstructed from the time-frequency representation via an Inverse
Fast Fourier Transformation, IFFT.
[0048] The signal block of a predefined duration may have a duration
of 8 ms with 50% overlap, which means that transformations,
adaptation updates, noise reduction updates and time-domain signal
reconstruction are computed every 4 ms. However, other durations
and/or update intervals are possible. The digital signals may be
one-bit signals at a many-times oversampled rate, two-bit or
three-bit signals or 8 bit, 10 bit, 12 bit, 16 bit or 24 bit
signals.
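The block timing described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the 16 kHz sample rate, the periodic form of the Hann window and all function names are hypothetical and are not taken from the disclosure. The sketch shows that an 8 ms block with 50% overlap gives an update every 4 ms, and that Hann-windowed blocks at 50% overlap add back to the original signal.

```python
import math

def hann_periodic(n):
    """Periodic Hann window of length n (assumed window variant)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]

def frame_params(sample_rate_hz, block_ms=8.0, overlap=0.5):
    """Return (block_len, hop_len) in samples for the given block duration."""
    block_len = int(round(sample_rate_hz * block_ms / 1000.0))
    hop_len = int(round(block_len * (1.0 - overlap)))
    return block_len, hop_len

def split_frames(signal, block_len, hop_len):
    """Cut a signal into Hann-windowed, overlapping blocks."""
    window = hann_periodic(block_len)
    return [
        [signal[start + k] * window[k] for k in range(block_len)]
        for start in range(0, len(signal) - block_len + 1, hop_len)
    ]

def overlap_add(frames, hop_len):
    """Rebuild a time-domain signal by adding the overlapping blocks."""
    block_len = len(frames[0])
    out = [0.0] * (hop_len * (len(frames) - 1) + block_len)
    for i, frame in enumerate(frames):
        for k, value in enumerate(frame):
            out[i * hop_len + k] += value
    return out

SAMPLE_RATE_HZ = 16000  # assumption, not stated in the text
BLOCK_LEN, HOP_LEN = frame_params(SAMPLE_RATE_HZ)
UPDATE_INTERVAL_MS = 1000.0 * HOP_LEN / SAMPLE_RATE_HZ
```

In a complete system each windowed block would be passed through an FFT before beamforming and through an IFFT before the overlap-add step; those transforms are omitted here to keep the sketch short.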
[0049] In alternative implementations/embodiments, all or parts of
the system operate directly in the time-domain. For example, noise
suppression may be applied to a time domain signal by means of FIR
or IIR filtering, with the noise suppression filter coefficients
being computed in the frequency domain.
[0050] In some embodiments the microphones output analogue signals;
the apparatus performs analogue-to-digital conversion of the
analogue signals to provide digital signals; the apparatus performs
a transformation of the digital signals to a time-frequency
representation, in multiple frequency bands; and the apparatus
performs an inverse transformation of at least the combined signal
to a time-domain representation.
[0051] In some embodiments the microphones of at least one pair of
the set of microphones are arranged in an end-fire configuration
oriented towards a position where a person's mouth is expected to
be when the apparatus is used by the person. Such a configuration
has been shown to give good noise cancelling and suppression, e.g.,
for headsets or hearing aids.
[0052] There is also provided a method for processing audio signals
from multiple microphones, comprising: receiving a first pair and a
second pair of microphone signals from a first pair of microphones
and a second pair of microphones, respectively; wherein the first
pair of microphones are arranged with a first mutual distance and
the second pair of microphones are arranged with a second mutual
distance, and wherein the first pair of microphones are arranged at
a distance from the second pair of microphones that is greater than
the first mutual distance and the second mutual distance at least
when the apparatus is in normal operation; performing first
beamforming and second beamforming on the first pair of microphone
signals and the second pair of microphone signals to output
respective beamformed signals; adapting the spatial sensitivity of
a respective pair of microphones as measured in a respective
beamformed signal such that spatial sensitivity is adapted to
suppress noise relative to a desired signal; performing third
beamforming to dynamically combine the signals output from the
first beamforming and the second beamforming into a combined
signal; wherein the signals are combined such that noise energy in
the combined signal is minimized while a desired signal is
preserved; and performing noise reduction to process the combined
signal from the third beamformer and output the combined signal
such that noise is reduced.
[0053] There is also provided a computer program product, e.g.
stored on a computer-readable medium such as a DVD, comprising
program code means adapted to cause a data processing system to
perform the steps of the method, when said program code means are
executed on the data processing system.
[0054] There is also provided a computer data signal, e.g. a
download signal, embodied in a carrier wave and representing
sequences of instructions which, when executed by a processor,
cause the processor to perform the steps of the method.
[0055] Here and in the following, the terms `processing means` and
`processing unit` are intended to comprise any circuit and/or
device suitably adapted to perform the functions described herein.
In particular, the above term comprises general purpose or
proprietary programmable microprocessors, Digital Signal Processors
(DSP), Application Specific Integrated Circuits (ASIC),
Programmable Logic Arrays (PLA), Field Programmable Gate Arrays
(FPGA), special purpose electronic circuits, etc., or a combination
thereof.
BRIEF DESCRIPTION OF THE FIGURES
[0056] The above and/or additional objects, features and advantages
of the present invention will be further elucidated by the
following illustrative and non-limiting detailed description of
embodiments of the present invention, with reference to the
appended drawings, wherein:
[0057] FIG. 1 shows a block diagram of a signal processor;
[0058] FIG. 2 shows a more detailed block diagram of the signal
processor; and
[0059] FIG. 3 shows different configurations of an apparatus with
multiple microphones.
DETAILED DESCRIPTION
[0060] In the following description, reference is made to the
accompanying figures, which show, by way of illustration, how the
invention may be practiced.
[0061] FIG. 1 shows a block diagram of a signal processor and a
first and second pair of microphones. The first set of microphones,
101 and 102, and the second set of microphones, 103 and 104, are
arranged such that the intra-pair distance between the microphones
of a pair is relatively short compared to the inter-pair distance
between the pairs of microphones. The signal processor is
designated by reference numeral 100.
[0062] The first pair of microphones 101 and 102 outputs a first
microphone signal pair input to a first beamformer 105 and the
second pair of microphones 103 and 104 outputs a second microphone
signal pair, which is input to a second beamformer 106. The first
beamformer 105 and the second beamformer 106 output respective
output signals X.sub.L and X.sub.R.
[0063] The first beamformer 105 and the second beamformer 106 are
each configured to adapt their spatial sensitivity. The spatial
sensitivity is adapted to cancel or suppress noise relative to a
desired signal. The first beamformer and the second beamformer may
be configured as disclosed in WO 2009/132646.
[0064] The third beamformer 107 is configured to dynamically
combine the signals, X.sub.L; X.sub.R, output from the first
beamformer 105 and the second beamformer 106 into a combined signal
X.sub.C. The combined signal X.sub.C can be expressed by the
following expression:
X.sub.C = G.sub.L X.sub.L + G.sub.R X.sub.R
[0065] Where G.sub.L and G.sub.R represent transfer functions from
a first input at which X.sub.L is received and from a second input
at which X.sub.R is received, respectively. The above expression
relies on a frequency domain representation; X.sub.L and X.sub.R
are complex numbers. An equivalent representation exists for a
time-domain representation. The third beamformer is configured to
adjust real or complex G.sub.L and G.sub.R dynamically to output
X.sub.C with a lowest noise level while preserving a desired
signal.
[0066] The following expression is an example of how real G.sub.L,
G.sub.R may be computed:
$$\hat{G}_L = \frac{\langle |X_R|^2 \rangle - \mathrm{Re}\,\langle X_L X_R^{*} \rangle}{\langle |X_L - X_R|^2 \rangle}$$
$$\hat{G}_R = 1 - \hat{G}_L$$
where Re denotes the real part of a complex number, and $(\cdot)^{*}$,
$\langle\cdot\rangle$ and $|\cdot|$ denote complex conjugation,
averaging across a time interval and absolute value, respectively.
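As an illustration only, the expression for the real gains can be sketched in Python for a single frequency band. The function names, the plain arithmetic mean used for averaging and the fallback to equal weights when the two inputs are identical are assumptions made for this sketch, not features of the disclosure.

```python
def combine_gains(xl_frames, xr_frames, eps=1e-12):
    """Estimate real (G_L, G_R) for one frequency band.

    xl_frames, xr_frames: complex band values of X_L and X_R over the
    frames of the averaging interval.
    """
    n = len(xl_frames)
    # <|X_R|^2>
    power_r = sum(abs(xr) ** 2 for xr in xr_frames) / n
    # Re<X_L X_R^*>
    cross = sum((xl * xr.conjugate()).real
                for xl, xr in zip(xl_frames, xr_frames)) / n
    # <|X_L - X_R|^2>
    diff = sum(abs(xl - xr) ** 2
               for xl, xr in zip(xl_frames, xr_frames)) / n
    if diff < eps:
        # Assumption: identical inputs carry no preference, weigh equally.
        return 0.5, 0.5
    g_l = (power_r - cross) / diff
    return g_l, 1.0 - g_l

def combine(xl, xr, g_l, g_r):
    """Combined signal X_C = G_L X_L + G_R X_R for one band value."""
    return g_l * xl + g_r * xr
```

For example, when X.sub.L carries only the desired signal and X.sub.R carries the same desired signal plus noise that is uncorrelated with it over the averaging interval, the sketch returns G.sub.L = 1 and G.sub.R = 0, so that the noise-free input is passed through unchanged.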
[0067] The above expressions for real G.sub.L and G.sub.R are
solutions to a mean squares cost function subject to a
constraint:
$$\hat{G}_L = \arg\min_{G_L} \langle |X_C|^2 \rangle$$
subject to:
$$\hat{G}_L + \hat{G}_R = 1$$
[0068] That is, the mean-squares of X.sub.C are minimized as a
function of real G.sub.L, subject to a constraint. The constraint
ensures that the desired signal is favoured over signals from at
least some other locations.
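For the reader's convenience, the step from the constrained minimization to the closed-form expression for real G.sub.L can be worked through as follows. This is a sketch of a standard least-squares argument; the cost symbol J and the shorthand D are introduced here for illustration only.

```latex
% Insert the constraint G_R = 1 - G_L into X_C = G_L X_L + G_R X_R:
%   X_C = X_R + G_L D,  with  D = X_L - X_R.
\begin{align*}
J(G_L) &= \langle |X_R + G_L D|^2 \rangle \\
       &= \langle |X_R|^2 \rangle
          + 2 G_L \,\mathrm{Re}\,\langle D X_R^{*} \rangle
          + G_L^2 \,\langle |D|^2 \rangle \\
\frac{dJ}{dG_L} &= 2\,\mathrm{Re}\,\langle D X_R^{*} \rangle
          + 2 G_L \,\langle |D|^2 \rangle = 0 \\
\hat{G}_L &= -\frac{\mathrm{Re}\,\langle (X_L - X_R) X_R^{*} \rangle}
                  {\langle |X_L - X_R|^2 \rangle}
           = \frac{\langle |X_R|^2 \rangle
                   - \mathrm{Re}\,\langle X_L X_R^{*} \rangle}
                  {\langle |X_L - X_R|^2 \rangle}
\end{align*}
```

Because the coefficient of the quadratic term is non-negative, the stationary point is a minimum whenever X.sub.L and X.sub.R differ over the averaging interval.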
[0069] In some embodiments matching filters are inserted between
the microphones and the inputs to the beamformers of the first
stage, i.e. in the shown embodiment the first and the second
beamformer. The matching filters filter the input signals to the
first and the second beamformers so that the desired signal
component is sufficiently identical in all the inputs, i.e., with
respect to phase and amplitude. The filters compensate for
variations in acoustic path of the desired signal to the
microphones as well as variations in microphone sensitivities or
other variations. Such matching filters may also be denoted
alignment filters and matching may be denoted alignment. As a
result of the input alignment with respect to the desired source,
the desired signal components output by the first and second
beamformers are likewise sufficiently identical due to the inbuilt
constraints (e.g. as described in WO 2009/132646). That
is, the inputs to the third beamformer are sufficiently identical
with respect to the desired signal component. As a consequence, the
G.sub.L+G.sub.R=1 constraint leads to the output and inputs of the
third beamformer being sufficiently identical with respect to the
desired signal.
[0070] One of the inputs may be chosen as a reference for
microphone alignment. For example, one of the alignment filters may
be configured to produce an all-pass characteristic; the other
alignment filters are configured accordingly. As a result, the
outputs of each of the first stage beamformers with respect to the
desired signal are sufficiently similar and also similar to the
reference input.
[0071] The microphone alignment filters may be pre-configured by
assuming and compensating for a known acoustical relation between
the origin of the desired signal and the microphones and using
microphones with very small variations in sensitivities. The
microphone sensitivities may be estimated in a calibration step at
the time of production. The microphone alignment filters may be
estimated while the device is in operation: when activated by a
voice or noise activity detector, the alignment filters are
estimated by, e.g., a least squares technique.
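One possible reading of the least squares estimation is sketched below in Python. It assumes, purely for illustration, that the alignment filter reduces to a single complex gain per frequency band and that only frames selected by the activity detector are passed in; the function name and the unity fallback are hypothetical.

```python
def estimate_alignment_gain(ref_frames, mic_frames, eps=1e-12):
    """Least-squares complex gain H minimizing sum |ref - H * mic|^2.

    ref_frames: band values of the reference input (all-pass aligned).
    mic_frames: band values of the microphone to be aligned, taken
                from the same frames.
    """
    num = sum(r * m.conjugate() for r, m in zip(ref_frames, mic_frames))
    den = sum(abs(m) ** 2 for m in mic_frames)
    if den < eps:
        # Assumption: with no signal energy, leave the input unchanged.
        return 1.0 + 0.0j
    return num / den
```

If a microphone observes the reference scaled by a complex factor c, the estimate equals 1/c, so that multiplying the microphone signal by the estimated gain restores the reference in both phase and amplitude.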
[0072] Constraining the beamformer with respect to the desired
signal may be equivalently achieved by integrating the microphone
alignment filters directly into one or more of the beamformers'
calculations, or, alternatively at the outputs of the first and
second beamformers.
[0073] When the input signals (X.sub.L; X.sub.R) are combined in
this way, the input signal that exhibits the lowest noise level is
emphasized over the other one.
[0074] The above expression for computing G.sub.L and G.sub.R is at
least to some extent resistant to the influence of the desired
signal and may work sufficiently well without any voice-activity
detector, VAD.
[0075] The below expression is an alternative and is somewhat less
resource demanding to compute, but is advantageously used in
combination with a voice-activity detector, VAD:
$$\tilde{G}_L = \frac{\langle |X_R|^2 \rangle}{\langle |X_R|^2 \rangle + \langle |X_L|^2 \rangle}$$
$$\tilde{G}_R = 1 - \tilde{G}_L$$
[0076] Where X.sub.R and X.sub.L are complex representations of the
respective signals. This expression is subject to similar
minimization and constraint as mentioned above but assumes that
noise components in X.sub.R and X.sub.L are uncorrelated. In this
case the voice-activity detector is applied to discard signal
portions of X.sub.R and X.sub.L wherein voice is present for the
purpose of estimating G.sub.L and G.sub.R. Such a weighting rule
was disclosed in U.S. Pat. No. 7,206,421 B1 for a multi-microphone
input.
[0077] For more robust performance, G.sub.L and G.sub.R may be
constrained further to an interval, say, between 0 and 1.
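The alternative weighting rule, together with the optional restriction of the gains to an interval, may be sketched as follows. The per-frame voice flags, the function names and the equal-weight fallback are assumptions made for this illustration.

```python
def noise_weighted_gains(xl_frames, xr_frames, voice_flags, eps=1e-12):
    """Estimate (G_L, G_R) from noise-only frames of one frequency band.

    voice_flags[i] is True when the voice-activity detector reports
    voice in frame i; such frames are discarded from the estimate.
    """
    power_l = 0.0
    power_r = 0.0
    for xl, xr, voiced in zip(xl_frames, xr_frames, voice_flags):
        if voiced:
            continue
        power_l += abs(xl) ** 2
        power_r += abs(xr) ** 2
    if power_l + power_r < eps:
        # Assumption: no usable noise estimate, weigh both inputs equally.
        return 0.5, 0.5
    # The ratio of sums equals the ratio of means over the same frames.
    g_l = power_r / (power_r + power_l)
    return g_l, 1.0 - g_l

def clamp_gains(g_l, low=0.0, high=1.0):
    """Restrict G_L to [low, high] and keep G_L + G_R = 1."""
    g_l = min(high, max(low, g_l))
    return g_l, 1.0 - g_l
```

The power ratio always lies between 0 and 1 by construction, so the clamping step mainly matters for an estimator that uses the cross term between X.sub.L and X.sub.R, which can produce values outside that interval.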
[0078] In general, it should be noted that the estimated position
of the source emitting the desired signal may be pre-configured and
locked to an expected position relative to the positions of the
microphones. This could be the case for a headset, wherein the
position of a person's mouth may be sufficiently well-defined when
the headset is worn in a normal position. In other cases, the
apparatus may comprise a tracker that estimates the position of the
source of the desired signal from, e.g., phase and/or amplitude
differences in the signals from one, two or more microphone pairs
or sets of more than two microphones. This could be the case for a
speakerphone or a hands-free set for a communications device in,
e.g., a car.
[0079] The combined signal, X.sub.C, is input to a noise
suppression unit 109 that computes a noise suppression gain,
A.sub.S, from the beamformed signals X.sub.L and X.sub.R.
Additionally, the noise suppression unit 109 may include the
microphone signals from one or more of the microphones 101, 102,
103, 104 in computing the noise suppression gain, A.sub.S. The signals
from M3 and M4 and the signal X.sub.R output from the beamformer
106 are labelled `a`, `b` and `c` and are input to the noise
suppression unit 109 as indicated by respective labels.
[0080] Computation of the noise suppression gain, A.sub.S, is
described further below.
[0081] In the shown embodiment, the noise suppression gain,
A.sub.S, is applied to the combined signal, X.sub.C, by a
multiplier 108. A signal output from the multiplier is a reproduced
audio signal comprising beamformed and noise suppressed signal
components picked up by the microphones. Label `O` designates
output from the signal processor. The output may be subject to
further signal processing, amplification and/or transmission.
[0082] FIG. 2 shows a more detailed block diagram of the signal
processor. It is shown that the noise suppression gain, A.sub.S, is
selected as either a first or left noise suppression gain, A.sub.L,
or a second or right noise suppression gain, A.sub.R. The left
noise suppression gain, A.sub.L, is computed from the beamformed
signal X.sub.L and/or the microphone signals xm.sub.1 and/or
xm.sub.2. Correspondingly, the right noise suppression gain,
A.sub.R, is computed from the beamformed signal X.sub.R and/or the
microphone signals xm.sub.3 and/or xm.sub.4.
[0083] A.sub.L is applied to X.sub.L via multiplier 205 and A.sub.R
is applied to X.sub.R via multiplier 209. Respective outputs of the
multipliers 205 and 209 are input to respective signal quality
evaluators 203 and 208. The inputs may be interpreted as left and
right noise-reduced, beamformed signals.
[0084] The signal quality evaluators 203 and 208 may evaluate the
signal quality of the signals output from the multipliers 205 and
209 according to a criterion of signal-to-noise ratio.
Alternatively, signal quality may be evaluated according to a
criterion of noise signal power during a time interval when voice
activity is detected as not present. This may be facilitated by
applying the microphone alignment filters to render the desired
signal component sufficiently identical at all beamformer inputs
and outputs. In this case, signal-to-noise ratio and noise power
are similar measures of signal quality. The signal quality
evaluators output signals P.sub.L and P.sub.R that select either
A.sub.L or A.sub.R via a selector 204. A.sub.S, which is output
from the selector, represents the selected noise suppression gain
and is applied to X.sub.C via a multiplier 108.
[0085] Signals P.sub.L and P.sub.R and hence the signal quality
evaluators 203 and 208 may be defined as power computations on the
noise component of the signals received as inputs. For example,
P.sub.L may be defined as the mean square of the beamformed,
noise-reduced input during noise-only intervals. Averaging may be
performed across a suitable time interval, e.g., 100 ms or 1 s, and
across a suitable frequency interval, e.g. 0-8000 Hz.
[0086] The selector 204 may be configured to select A.sub.L when
P.sub.L is less than P.sub.R and conversely select A.sub.R when
P.sub.L is larger than P.sub.R. Voice activity detectors 202 and
207 output signals to the signal quality evaluators 203 and 208,
respectively, indicative of whether voice is detected.
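The evaluation and selection described above can be sketched in Python as follows. The function names are hypothetical; returning infinity when no noise-only frame is available and selecting A.sub.R when the two powers are equal are assumptions, since the text does not specify those cases.

```python
def noise_power(frames, voice_flags):
    """Mean square of a noise-reduced, beamformed signal over the
    frames in which no voice is detected (P_L or P_R)."""
    values = [abs(x) ** 2
              for x, voiced in zip(frames, voice_flags) if not voiced]
    if not values:
        # Assumption: without noise-only frames the side is never preferred.
        return float("inf")
    return sum(values) / len(values)

def select_gain(a_l, a_r, p_l, p_r):
    """Select A_L when P_L < P_R, otherwise A_R (ties go to A_R here)."""
    return a_l if p_l < p_r else a_r
```

A typical use would compute P.sub.L and P.sub.R over an averaging interval of, e.g., 100 ms or 1 s and then apply the selected gain to X.sub.C.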
[0087] A voice activity detector, VAD, of a single-input type, may
be configured to estimate a noise floor level, N, by receiving an
input signal and computing a slowly varying average of the
magnitude of the input signal. A comparator may output a signal
indicative of the presence of a voice signal when the magnitude of
the signal temporarily exceeds the estimated noise floor by a
predefined factor of, say, 10 dB. The VAD may disable noise floor
estimation when the presence of voice is detected. Such a voice
detector works when the noise is quasi-stationary and when the
magnitude of voice exceeds the estimated noise floor sufficiently.
Such a voice activity detector may operate on a band-limited signal
or on multiple frequency bands to generate a voice activity signal
aggregated from multiple frequency bands. When the voice activity
detector operates on multiple frequency bands, it may output multiple
voice activity signals for respective multiple frequency bands.
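A single-input voice activity detector of the kind described above might be sketched as follows. The class name, the exponential smoothing constant and the initial noise floor are assumptions, as is the interpretation of the 10 dB margin as an amplitude ratio of 10^(10/20).

```python
class NoiseFloorVad:
    """Single-input VAD based on a slowly varying magnitude average."""

    def __init__(self, threshold_db=10.0, smoothing=0.95, initial_floor=1e-3):
        # Assumption: the dB margin is applied to magnitudes (20*log10).
        self.factor = 10.0 ** (threshold_db / 20.0)
        self.smoothing = smoothing
        self.floor = initial_floor

    def update(self, magnitude):
        """Process one magnitude value; return True if voice is detected."""
        voiced = magnitude > self.factor * self.floor
        if not voiced:
            # Noise floor estimation is disabled while voice is present.
            self.floor = (self.smoothing * self.floor
                          + (1.0 - self.smoothing) * magnitude)
        return voiced
```

For a multi-band detector, one such object per frequency band could be used, with the per-band decisions either output separately or aggregated into one voice activity signal.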
[0088] A voice activity detector, VAD, of a multiple-input type,
may be configured to compute a signal indicative of coherence
between multiple signals. For example, the voice signal may exhibit
a higher level of coherence between the microphones due to the
mouth being closer to the microphones than the noise sources. Other
types of voice activity detectors are based on computing spatial
features or cues such as directionality and proximity, or on
dictionary approaches that decompose the signal into codebook
time/frequency profiles.
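By way of illustration, one common coherence measure, the magnitude-squared coherence, can be sketched in Python for a single frequency band. The choice of this particular measure, the function names and the threshold value are assumptions; the text only requires a signal indicative of coherence.

```python
def magnitude_squared_coherence(x1_frames, x2_frames, eps=1e-12):
    """Magnitude-squared coherence of two microphone signals in one
    frequency band, averaged over the given frames (0 to 1)."""
    n = len(x1_frames)
    cross = sum(a * b.conjugate()
                for a, b in zip(x1_frames, x2_frames)) / n
    power_1 = sum(abs(a) ** 2 for a in x1_frames) / n
    power_2 = sum(abs(b) ** 2 for b in x2_frames) / n
    if power_1 * power_2 < eps:
        return 0.0
    return abs(cross) ** 2 / (power_1 * power_2)

def coherence_vad(x1_frames, x2_frames, threshold=0.7):
    """Report voice when the coherence exceeds a threshold (hypothetical)."""
    return magnitude_squared_coherence(x1_frames, x2_frames) > threshold
```

Two signals that differ only by a fixed complex gain yield a coherence of 1, whereas signals that are uncorrelated over the averaging interval yield a value near 0.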
[0089] A noise suppression gain designated G.sub.NS or A.sub.L or
A.sub.R may be computed from the following expression:
$$G_{NS} = \frac{|X|^2}{|X|^2 + P_N F}$$
[0090] Wherein P.sub.N is the square of the estimated noise floor
level at a time instance t; |X|.sup.2 is the square of the input
signal at the time instance t; and F is a factor, e.g., a factor of
10. When applied in the frequency domain, the noise suppression gain
is applied to an input signal via a multiplier.
[0091] Thus, on the one hand, if the noise floor level is very low,
G.sub.NS approaches 1 when voice is significantly present. On the
other hand, if voice is absent or the noise level rises, G.sub.NS
moves to values less than 1 and consequently suppresses the input
signal. The factor F is selected to set how aggressively the input
signal should be suppressed.
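The behaviour of the noise suppression gain can be checked numerically with a short Python sketch. The function name and the handling of an all-zero input are assumptions made for illustration.

```python
def suppression_gain(x, noise_floor, factor=10.0):
    """G_NS = |X|^2 / (|X|^2 + P_N * F) for one band value.

    x:           complex (or real) input value at time instance t.
    noise_floor: estimated noise floor level N, so that P_N = N^2.
    factor:      F, controls how aggressively noise is suppressed.
    """
    power = abs(x) ** 2
    denom = power + factor * noise_floor ** 2
    if denom <= 0.0:
        # Assumption: a silent input with zero noise floor is left as is.
        return 1.0
    return power / denom
```

With F = 10, an input that lies at the noise floor receives a gain of 1/11, i.e. roughly 21 dB of attenuation, while an input that is 60 dB above the noise floor receives a gain within 0.001% of 1.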
[0092] In respect of the above description of a voice-activity
detector and noise suppression gain, its input signal(s) may be any
of the microphone signals and/or output from the first beamformer
and/or second beamformer and/or third beamformer.
[0093] In general, a way to estimate the signal and noise relation
is based on tracking the noise floor, wherein voice or noisy voice
is identified by signal parts significantly exceeding the noise
floor level. Noise levels may, e.g., be estimated by minimum
statistics as in [R. Martin, "Noise Power Spectral Density
Estimation Based on Optimal Smoothing and Minimum Statistics,"
Trans. on Speech and Audio Processing, Vol. 9, No. 5, July 2001],
where the minimum signal level is adaptively estimated.
[0094] Other ways to identify signal and noise parts are based on
computing multi-microphone/spatial features such as directionality
and proximity [O. Yilmaz and S. Rickard, "Blind Separation of
Speech Mixtures via Time-Frequency Masking", IEEE Transactions on
Signal Processing, Vol. 52, No. 7, pages 1830-1847, July 2004] or
coherence [K. Simmer et al., "Post-filtering techniques."
Microphone Arrays. Springer Berlin Heidelberg, 2001. 39-60].
Dictionary approaches decomposing the signal into codebook
time/frequency profiles may also be applied [M. Schmidt and R.
Olsson: "Single-channel speech separation using sparse non-negative
matrix factorization," Interspeech, 2006].
[0095] In general, noise suppression may be implemented as
described in [Y. Ephraim and D. Malah, "Speech enhancement using
optimal non-linear spectral amplitude estimation," in Proc. IEEE
Int. Conf. Acoust. Speech Signal Processing, 1983, pp. 1118-1121]
or as described elsewhere in the literature on noise suppression
techniques. Typically, a time-varying filter is applied to the
signal. Analysis and/or filtering are often implemented in a
frequency transformed domain/filter bank, representing the signal
in a number of frequency bands. At each represented frequency, a
time-varying gain is computed depending on the relation of
estimated desired signal and noise components. For example, when the
estimated signal-to-noise ratio exceeds a pre-determined, adaptive
or fixed threshold, the gain is steered toward 1. Conversely, when
the estimated signal-to-noise ratio does not exceed the threshold,
the gain is set to a value smaller than 1. The labels designated
`x` and `y` connect the respective signals: x-to-x and y-to-y.
[0096] FIG. 3 shows different configurations of an apparatus with
multiple microphones. On the left-hand side, a spectacle frame 303
with bows 306 is configured with two sets of microphones 304 and
305. On the right-hand side, a flexible neckband 307 is configured
with two sets of microphones 308 and 309. Reference numeral 301
designates the head of a person wearing the spectacle frame 303 and
reference numeral 302 designates the head of a person wearing the
neckband 307.
[0097] The microphones may be arranged in a so-called end-fire
configuration wherein the microphones of a respective pair or set
of microphones sit on a line that intersects with or passes close
to a position of a source of a desired signal. The position may be
a position of the person's mouth opening or a position in proximity
of the person's mouth opening. In an end-fire configuration the
microphones of a microphone pair sit on a straight line
intersecting the position of the source of the desired signal. Such
a configuration is found to be suitable for effectively suppressing
or cancelling noise from sources located elsewhere when the
apparatus is a headset, hearing aid or the like.
[0098] In alternative configurations, a so-called broadside
configuration for the microphone positions is used. In a broadside
configuration the microphones of a microphone pair sit on a
straight line at an equal distance to the position of the source of
the desired signal.
[0099] In still alternative configurations, the microphones of a
microphone pair sit on a line inclined, e.g., at 5°, 10° or 45°
relative to a direction from the microphone pair to the position of
the source of the desired signal, thereby providing a configuration
that may be more practically suitable.
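The effect of the microphone orientation can be illustrated by the arrival-time difference between the two microphones of a pair. The sketch below assumes a far-field (plane wave) model, a speed of sound of 343 m/s and a hypothetical spacing of 10 mm; none of these values are taken from the disclosure, and for a source as close as the wearer's mouth the plane wave model is only an approximation.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at about 20 degrees Celsius

def inter_mic_delay_s(spacing_m, angle_deg):
    """Arrival-time difference between two microphones for a plane wave.

    angle_deg is the inclination of the microphone axis relative to the
    direction towards the source: 0 is end-fire, 90 is broadside.
    """
    return spacing_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S

SPACING_M = 0.010  # hypothetical 10 mm intra-pair distance
DELAYS_S = {angle: inter_mic_delay_s(SPACING_M, angle)
            for angle in (0.0, 5.0, 10.0, 45.0, 90.0)}
```

Under these assumptions the end-fire delay is about 29 microseconds, the broadside delay is essentially zero, and inclinations of 5°, 10° and 45° retain about 99.6%, 98.5% and 70.7% of the end-fire delay, respectively.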
[0100] Generally, in the above it is assumed that so-called digital
microphones outputting digital signals are used. However, analogue
microphones in conjunction with an analogue-to-digital converter or
any other transduction from the sound field to a sampled domain
could be used. The microphones are typically embodied in so-called
capsules with a diameter in the range of typically 3 mm to 5 mm or
6 mm.
[0101] In general, a beamformer may receive signals from more than
a pair of microphones. A beamformer, e.g., a first stage
beamformer, may receive microphone signals from 3, 4 or more
microphones. The first stage may comprise more than the first and
the second beamformer; the first stage may comprise, e.g., 3, 4 or
more beamformers.