U.S. patent application number 10/965795 was filed with the patent office on 2004-10-18 and published on 2005-05-19 for a perceptual weighting device and method for efficient coding of wideband signals.
This patent application is currently assigned to Voiceage Corporation. Invention is credited to Bruno Bessette, Roch Lefebvre, and Redwan Salami.
Application Number | 10/965795 |
Publication Number | 20050108007 |
Family ID | 4162966 |
Filed Date | 2004-10-18 |
United States Patent Application | 20050108007 |
Kind Code | A1 |
Bessette, Bruno; et al. | May 19, 2005 |
Perceptual weighting device and method for efficient coding of
wideband signals
Abstract
A perceptual weighting device for producing a perceptually
weighted signal in response to a wideband signal comprises a signal
pre-emphasis filter, a synthesis filter calculator, and a
perceptual weighting filter. The signal pre-emphasis filter
enhances the high frequency content of the wideband signal to
thereby produce a pre-emphasized signal. The signal pre-emphasis
filter has a transfer function of the form: $P(z) = 1 - \mu z^{-1}$,
wherein $\mu$ is a pre-emphasis factor having a value located
between 0 and 1. The synthesis filter calculator is responsive to
the pre-emphasized signal for producing synthesis filter
coefficients. Finally, the perceptual weighting filter processes
the pre-emphasized signal in relation to the synthesis filter
coefficients to produce the perceptually weighted signal. The
perceptual weighting filter has a transfer function, with fixed
denominator, of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$.
Inventors: | Bessette, Bruno (Rock Forest, CA); Salami, Redwan (Sherbrooke, CA); Lefebvre, Roch (Canton de Magog, CA) |
Correspondence Address: | BIRCH STEWART KOLASCH & BIRCH, PO BOX 747, FALLS CHURCH, VA 22040-0747, US |
Assignee: | Voiceage Corporation, Ville Mont-Royal, CA |
Family ID: | 4162966 |
Appl. No.: | 10/965795 |
Filed: | October 18, 2004 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10965795 | Oct 18, 2004 | |
09830276 | Jun 20, 2001 | 6807524 |
09830276 | Jun 20, 2001 | |
PCT/CA99/01010 | Oct 27, 1999 | |
Current U.S. Class: | 704/223 |
Current CPC Class: | G10L 19/26 20130101; G10L 2019/0011 20130101 |
Class at Publication: | 704/223 |
International Class: | G10L 019/00 |
Foreign Application Data
Date | Code | Application Number |
Oct 27, 1998 | CA | 2,252,170 |
Claims
What is claimed is:
1. A perceptual weighting device for producing a perceptually
weighted signal in response to a wideband signal in order to reduce
a difference between a weighted wideband signal and a subsequently
synthesized weighted wideband signal, said perceptual weighting
device comprising: a) a signal preemphasis filter responsive to the
wideband signal for enhancing a high frequency content of the
wideband signal to thereby produce a preemphasised signal; b) a
synthesis filter calculator responsive to said preemphasised signal
for producing synthesis filter coefficients; and c) a perceptual
weighting filter, responsive to said preemphasised signal and said
synthesis filter coefficients, for filtering said preemphasised
signal in relation to said synthesis filter coefficients to thereby
produce said perceptually weighted signal, said perceptual
weighting filter having a transfer function with fixed denominator
whereby weighting of said wideband signal in a formant region is
substantially decoupled from a spectral tilt of said wideband
signal.
2. A perceptual weighting device as defined in claim 1, wherein
said signal preemphasis filter has a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis factor having a
value located between 0 and 1.
3. A perceptual weighting device as defined in claim 2, wherein
said preemphasis factor $\mu$ is 0.7.
4. A perceptual weighting device as defined in claim 2, wherein
said perceptual weighting filter has a transfer function of the
form: $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 \le \gamma_2 \le \gamma_1 \le 1$ and
$\gamma_2$ and $\gamma_1$ are weighting control values.
5. A perceptual weighting device as defined in claim 4, wherein
$\gamma_2$ is set equal to $\mu$.
6. A perceptual weighting device as defined in claim 1, wherein
said perceptual weighting filter has a transfer function of the
form: $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 \le \gamma_2 \le \gamma_1 \le 1$ and
$\gamma_2$ and $\gamma_1$ are weighting control values.
7. A perceptual weighting device as defined in claim 6, wherein
$\gamma_2$ is set equal to $\mu$.
8. A method for producing a perceptually weighted signal in
response to a wideband signal in order to reduce a difference
between a weighted wideband signal and a subsequently synthesized
weighted wideband signal, said method comprising: a) filtering the
wideband signal to produce a preemphasised signal with enhanced
high frequency content; b) calculating, from said preemphasised
signal, synthesis filter coefficients; and c) filtering said
preemphasised signal in relation to said synthesis filter
coefficients to thereby produce a perceptually weighted speech
signal, wherein said filtering comprises processing the preemphasised
signal through a perceptual weighting filter having a transfer
function with fixed denominator whereby weighting of said wideband
signal in a formant region is substantially decoupled from a
spectral tilt of said wideband signal.
9. A method for producing a perceptually weighted signal as defined
in claim 8, wherein filtering the wideband signal comprises
filtering through a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis factor having a
value located between 0 and 1.
10. A method for producing a perceptually weighted signal as
defined in claim 9, wherein said preemphasis factor $\mu$ is
0.7.
11. A method for producing a perceptually weighted signal as
defined in claim 9, wherein said perceptual weighting filter has a
transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 \le \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$
and $\gamma_1$ are weighting control values.
12. A method for producing a perceptually weighted signal as
defined in claim 11, wherein $\gamma_2$ is set equal to
$\mu$.
13. A method for producing a perceptually weighted signal as
defined in claim 8, wherein said perceptual weighting filter has a
transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
14. A method for producing a perceptually weighted signal as
defined in claim 13, wherein $\gamma_2$ is set equal to
$\mu$.
15. An encoder for encoding a wideband signal, comprising: a) a
perceptual weighting device as recited in claim 1; b) a pitch
codebook search device responsive to said perceptually weighted
signal for producing pitch codebook parameters and an innovative
search target vector; c) an innovative codebook search device,
responsive to said synthesis filter coefficients and to said
innovative search target vector, for producing innovative codebook
parameters; and d) a signal forming device for producing an encoded
wideband signal comprising said pitch codebook parameters, said
innovative codebook parameters, and said synthesis filter
coefficients.
16. An encoder as defined in claim 15, wherein said signal
preemphasis filter has a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis factor having a
value located between 0 and 1.
17. An encoder as defined in claim 16, wherein said preemphasis
factor $\mu$ is 0.7.
18. An encoder as defined in claim 16, wherein said perceptual
weighting filter has a transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
19. An encoder as defined in claim 18, wherein $\gamma_2$ is set
equal to $\mu$.
20. An encoder as defined in claim 15, wherein said perceptual
weighting filter has a transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
21. An encoder as defined in claim 20, wherein $\mu$ is set equal to
$\gamma_2$.
22. A cellular communication system for servicing a large
geographical area divided into a plurality of cells, comprising: a)
mobile transmitter/receiver units; b) cellular base stations
respectively situated in said cells; c) a control terminal for
controlling communication between the cellular base stations; d) a
bidirectional wireless communication sub-system between each mobile
unit situated in one cell and the cellular base station of said one
cell, said bidirectional wireless communication sub-system
comprising, in both the mobile unit and the cellular base station:
i) a transmitter including an encoder for encoding a wideband
signal as recited in claim 15 and a transmission circuit for
transmitting the encoded wideband signal; and ii) a receiver
including a receiving circuit for receiving a transmitted encoded
wideband signal and a decoder for decoding the received encoded
wideband signal.
23. A cellular communication system as defined in claim 22, wherein
said signal preemphasis filter has a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis factor having a
value located between 0 and 1.
24. A cellular communication system as defined in claim 23, wherein
said preemphasis factor $\mu$ is 0.7.
25. A cellular communication system as defined in claim 23, wherein
said perceptual weighting filter has a transfer function of the
form: $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
26. A cellular communication system as defined in claim 25, wherein
$\mu$ is set equal to $\gamma_2$.
27. A cellular communication system as defined in claim 22, wherein
said perceptual weighting filter has a transfer function of the
form: $W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
28. A cellular communication system as defined in claim 27, wherein
$\gamma_2$ is set equal to $\mu$.
29. A cellular mobile transmitter/receiver unit comprising: a) a
transmitter including an encoder for encoding a wideband signal as
recited in claim 15 and a transmission circuit for transmitting the
encoded wideband signal; and b) a receiver including a receiving
circuit for receiving a transmitted encoded wideband signal and a
decoder for decoding the received encoded wideband signal.
30. A cellular mobile transmitter/receiver unit as defined in claim
29, wherein said signal preemphasis filter has a transfer function
of the form: $P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis
factor having a value located between 0 and 1.
31. A cellular mobile transmitter/receiver unit as defined in claim
30, wherein said preemphasis factor $\mu$ is 0.7.
32. A cellular mobile transmitter/receiver unit as defined in claim
30, wherein said perceptual weighting filter has a transfer
function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
33. A cellular mobile transmitter/receiver unit as defined in claim
32, wherein $\gamma_2$ is set equal to $\mu$.
34. A cellular mobile transmitter/receiver unit as defined in claim
29, wherein said perceptual weighting filter has a transfer
function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
35. A cellular mobile transmitter/receiver unit as defined in claim
34, wherein $\gamma_2$ is set equal to $\mu$.
36. A cellular network element comprising: a) a transmitter
including an encoder for encoding a wideband signal as defined in
claim 15 and a transmission circuit for transmitting the encoded
wideband signal; and b) a receiver including a receiving circuit
for receiving a transmitted encoded wideband signal and a decoder
for decoding the received encoded wideband signal.
37. A cellular network element as defined in claim 36, wherein said
signal preemphasis filter has a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a preemphasis factor having a
value located between 0 and 1.
38. A cellular network element as defined in claim 37, wherein said
preemphasis factor $\mu$ is 0.7.
39. A cellular network element as defined in claim 37, wherein said
perceptual weighting filter has a transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
40. A cellular network element as defined in claim 39, wherein
$\gamma_2$ is set equal to $\mu$.
41. A cellular network element as defined in claim 36, wherein said
perceptual weighting filter has a transfer function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
42. A cellular network element as defined in claim 41, wherein $\mu$
is set equal to $\gamma_2$.
43. In a cellular communication system for servicing a large
geographical area divided into a plurality of cells, comprising:
mobile transmitter/receiver units; cellular base stations,
respectively situated in said cells; and control terminal for
controlling communication between the cellular base stations: a
bidirectional wireless communication sub-system between each mobile
unit situated in one cell and the cellular base station of said one
cell, said bidirectional wireless communication sub-system
comprising, in both the mobile unit and the cellular base station:
a) a transmitter including an encoder for encoding a wideband
signal as recited in claim 15 and a transmission circuit for
transmitting the encoded wideband signal; and b) a receiver
including a receiving circuit for receiving a transmitted encoded
wideband signal and a decoder for decoding the received encoded
wideband signal.
44. A bidirectional wireless communication sub-system as defined in
claim 43, wherein said signal preemphasis filter has a transfer
function of the form: $P(z) = 1 - \mu z^{-1}$ wherein $\mu$ is a
preemphasis factor having a value located between 0 and 1.
45. A bidirectional wireless communication sub-system as defined in
claim 44, wherein said preemphasis factor $\mu$ is 0.7.
46. A bidirectional wireless communication sub-system as defined in
claim 44, wherein said perceptual weighting filter has a transfer
function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
47. A bidirectional wireless communication sub-system as defined in
claim 46, wherein $\mu$ is set equal to $\gamma_2$.
48. A bidirectional wireless communication sub-system as defined in
claim 43, wherein said perceptual weighting filter has a transfer
function of the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$ where
$0 < \gamma_2 < \gamma_1 \le 1$ and $\gamma_2$ and
$\gamma_1$ are weighting control values.
49. A bidirectional wireless communication sub-system as defined in
claim 48, wherein $\gamma_2$ is set equal to $\mu$.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention relates to a perceptual weighting
device and method for producing a perceptually weighted signal in
response to a wideband signal (0-7000 Hz) in order to reduce a
difference between a weighted wideband signal and a subsequently
synthesized weighted wideband signal.
[0003] 2. Brief Description of the Prior Art
[0004] The demand for efficient digital wideband speech/audio
encoding techniques with a good subjective quality/bit rate
trade-off is increasing for numerous applications such as
audio/video teleconferencing, multimedia, and wireless
applications, as well as Internet and packet network applications.
Until recently, telephone bandwidths filtered in the range 200-3400
Hz were mainly used in speech coding applications. However, there
is an increasing demand for wideband speech applications in order
to increase the intelligibility and naturalness of the speech
signals. A bandwidth in the range 50-7000 Hz was found sufficient
for delivering a face-to-face speech quality. For audio signals,
this range gives an acceptable audio quality, but is still lower
than the CD quality which operates on the range 20-20000 Hz.
[0005] A speech encoder converts a speech signal into a digital
bitstream which is transmitted over a communication channel (or
stored in a storage medium). The speech signal is digitized
(sampled and quantized with usually 16-bits per sample) and the
speech encoder has the role of representing these digital samples
with a smaller number of bits while maintaining a good subjective
speech quality. The speech decoder or synthesizer operates on the
transmitted or stored bit stream and converts it back to a sound
signal.
[0006] One of the best prior art techniques capable of achieving a
good quality/bit rate trade-off is the so-called Code Excited
Linear Prediction (CELP) technique. According to this technique,
the sampled speech signal is processed in successive blocks of L
samples usually called frames where L is some predetermined number
(corresponding to 10-30 ms of speech). In CELP, a linear prediction
(LP) synthesis filter is computed and transmitted every frame. The
L-sample frame is then divided into smaller blocks called subframes
of size N samples, where L=kN and k is the number of subframes in a
frame (N usually corresponds to 4-10 ms of speech). An excitation
signal is determined in each subframe, which usually consists of
two components: one from the past excitation (also called pitch
contribution or adaptive codebook) and the other from an innovative
codebook (also called fixed codebook). This excitation signal is
transmitted and used at the decoder as the input of the LP
synthesis filter in order to obtain the synthesized speech.
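The framing arithmetic described in this paragraph can be illustrated with a short Python sketch. The 16 kHz sampling rate, the 20 ms frame and the 5 ms subframe are assumed values picked from within the quoted ranges, and the function names are hypothetical.

```python
# Illustrative CELP framing arithmetic. All three constants are assumptions:
# the text only gives ranges (10-30 ms frames, 4-10 ms subframes).
SAMPLE_RATE_HZ = 16000   # assumed wideband sampling rate
FRAME_MS = 20            # assumed frame duration
SUBFRAME_MS = 5          # assumed subframe duration


def samples(duration_ms, rate_hz=SAMPLE_RATE_HZ):
    """Number of samples spanned by duration_ms at rate_hz."""
    return rate_hz * duration_ms // 1000


L = samples(FRAME_MS)      # frame length in samples
N = samples(SUBFRAME_MS)   # subframe length in samples
k = L // N                 # subframes per frame, so that L = k * N


def split_into_subframes(frame, n):
    """Split one frame (a list of samples) into consecutive n-sample subframes."""
    if len(frame) % n != 0:
        raise ValueError("frame length must be a multiple of the subframe length")
    return [frame[i:i + n] for i in range(0, len(frame), n)]
```

With these assumed values, a frame holds 320 samples and is split into 4 subframes of 80 samples each.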
[0007] An innovative codebook, in the CELP context, is an indexed
set of N-sample-long sequences which will be referred to as
N-dimensional codevectors. Each codebook sequence is indexed by an
integer k ranging from 1 to M, where M represents the size of the
codebook, often expressed as a number of bits b, where
$M = 2^b$.
[0008] To synthesize speech according to the CELP technique, each
block of N samples is synthesized by filtering an appropriate
codevector from a codebook through time varying filters modelling
the spectral characteristics of the speech signal. At the encoder
end, the synthesis output is computed for all, or a subset, of the
codevectors from the codebook (codebook search). The retained
codevector is the one producing the synthesis output closest to the
original speech signal according to a perceptually weighted
distortion measure. This perceptual weighting is performed using a
so-called perceptual weighting filter, which is usually derived
from the LP synthesis filter.
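The codebook search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the search of any particular coder: the synthesis filter and the perceptual weighting are assumed to be combined into one impulse response `h`, each codevector is assumed to be scaled by its own optimal gain, and codevectors are indexed from 0 rather than from 1. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def convolve_truncated(c, h):
    """Filter codevector c with impulse response h, keeping len(c) output samples."""
    out = []
    for i in range(len(c)):
        out.append(sum(c[j] * h[i - j] for j in range(i + 1) if i - j < len(h)))
    return out


def search_codebook(target, codebook, h):
    """Exhaustive search for the codevector whose filtered, gain-scaled
    version is closest to target in the squared-error sense.

    Returns (index, gain, error); index is 0-based (an assumption).
    """
    best = None
    target_energy = dot(target, target)
    for index, codevector in enumerate(codebook):
        y = convolve_truncated(codevector, h)
        energy = dot(y, y)
        if energy == 0.0:
            continue  # an all-zero output cannot match a non-zero target
        corr = dot(target, y)
        gain = corr / energy                       # optimal gain for this codevector
        error = target_energy - corr * corr / energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Searching only a subset of the codevectors, as the text allows, would amount to iterating over a reduced list instead of the full codebook.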
[0009] The CELP model has been very successful in encoding
telephone band sound signals, and several CELP-based standards
exist in a wide range of applications, especially in digital
cellular applications. In the telephone band, the sound signal is
band-limited to 200-3400 Hz and sampled at 8000 samples/sec. In
wideband speech/audio applications, the sound signal is
band-limited to 50-7000 Hz and sampled at 16000 samples/sec.
[0010] Some difficulties arise when applying the telephone-band
optimized CELP model to wideband signals, and additional features
need to be added to the model in order to obtain high quality
wideband signals. Wideband signals exhibit a much wider dynamic
range compared to telephone-band signals, which results in
precision problems when a fixed-point implementation of the
algorithm is required (which is essential in wireless
applications). Furthermore, the CELP model will often spend most of
its encoding bits on the low-frequency region, which usually has
higher energy contents, resulting in a low-pass output signal. To
overcome this problem, the perceptual weighting filter has to be
modified in order to suit wideband signals, and pre-emphasis
techniques which boost the high frequency regions become important
to reduce the dynamic range, yielding a simpler fixed-point
implementation, and to ensure a better encoding of the higher
frequency contents of the signal.
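To make the high-frequency boost concrete, the Python sketch below evaluates the magnitude response of a first-order pre-emphasis filter. It assumes the form P(z) = 1 - mu z^-1 with mu = 0.7 given in the abstract and claims, and a 16 kHz sampling rate; the function names are hypothetical.

```python
import cmath
import math


def preemphasis_gain(freq_hz, mu=0.7, sample_rate_hz=16000):
    """Magnitude of P(z) = 1 - mu * z^-1 evaluated at z = exp(j * omega)."""
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    return abs(1.0 - mu * cmath.exp(-1j * omega))


def gain_db(gain):
    """Convert a linear magnitude to decibels."""
    return 20.0 * math.log10(gain)
```

Under these assumptions the gain is 1 - mu = 0.3 (about -10.5 dB) at 0 Hz and 1 + mu = 1.7 (about +4.6 dB) at 8000 Hz, so low frequencies are attenuated relative to high frequencies by roughly 15 dB, which narrows the dynamic range of a signal whose energy is concentrated at low frequencies.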
[0011] In CELP-type encoders, the optimum pitch and innovative
parameters are searched by minimizing the mean squared error
between the input speech and synthesized speech in a perceptually
weighted domain. This is equivalent to minimizing the error between
the weighted input speech and weighted synthesis speech, where the
weighting is performed using a filter having a transfer function
W(z) of the form:
$W(z) = A(z/\Gamma'_1)/A(z/\Gamma'_2)$ where
$0 < \Gamma'_2 < \Gamma'_1 \le 1$.
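A weighting filter of this conventional form can be sketched in Python as below. The sketch assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, under which A(z/g) is obtained by scaling the i-th coefficient by g to the power i; the convention and all function names are assumptions made for illustration.

```python
def bandwidth_expand(a, g):
    """Coefficients of A(z/g): the i-th coefficient of A(z) scaled by g**i."""
    return [coeff * g ** i for i, coeff in enumerate(a)]


def pole_zero_filter(b, a, x):
    """Direct-form filter with numerator b and denominator a (a[0] must be 1):
    y[n] = sum_i b[i] * x[n-i] - sum_{i>=1} a[i] * y[n-i]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y.append(acc)
    return y


def conventional_weighting(a, x, g1, g2):
    """Filter x through W(z) = A(z/g1) / A(z/g2), with 0 < g2 < g1 <= 1."""
    return pole_zero_filter(bandwidth_expand(a, g1), bandwidth_expand(a, g2), x)
```

Because numerator and denominator are both derived from the same A(z), the two factors jointly determine the formant weighting and the tilt of the filter.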
[0012] In analysis-by-synthesis (AbS) coders, analysis shows that
the quantization error is weighted by the inverse of the weighting
filter, $W^{-1}(z)$, which exhibits some of the formant structure
of the input signal. Thus, the masking property of the human ear is
exploited by shaping the error so that it has more energy in the
formant regions, where it will be masked by the strong signal
energy present in those regions. The amount of weighting is
controlled by the factors $\Gamma'_1$ and $\Gamma'_2$.
[0013] This filter works well with telephone band signals. However,
it was found to be unsuitable for efficient perceptual weighting
of wideband signals. It has inherent limitations in modelling
the formant structure and the required spectral tilt concurrently.
The spectral tilt is more pronounced in wideband signals due to the
wide dynamic range between low and high frequencies. It was
suggested to add a tilt filter into filter W(z) in order to control
the tilt and formant weighting separately.
OBJECT OF THE INVENTION
[0014] An object of the present invention is therefore to provide a
perceptual weighting device and method adapted to wideband signals,
using a modified perceptual weighting filter to obtain a high
quality reconstructed signal, this device and method enabling
fixed-point algorithmic implementation.
SUMMARY OF THE INVENTION
[0015] More specifically, in accordance with the present invention,
there is provided a perceptual weighting device for producing a
perceptually weighted signal in response to a wideband signal in
order to reduce a difference between a weighted wideband signal and
a subsequently synthesized weighted wideband signal. This
perceptual weighting device comprises:
[0016] a) a signal preemphasis filter responsive to the wideband
signal for enhancing the high frequency content of the wideband
signal to thereby produce a preemphasised signal;
[0017] b) a synthesis filter calculator responsive to the
preemphasised signal for producing synthesis filter coefficients;
and
[0018] c) a perceptual weighting filter, responsive to the
preemphasised signal and the synthesis filter coefficients, for
filtering the preemphasised signal in relation to the synthesis
filter coefficients to thereby produce the perceptually weighted
signal. The perceptual weighting filter has a transfer function
with fixed denominator whereby weighting of the wideband signal in
a formant region is substantially decoupled from a spectral tilt of
that wideband signal.
[0019] The present invention also relates to a method for producing
a perceptually weighted signal in response to a wideband signal in
order to reduce a difference between a weighted wideband signal and
a subsequently synthesized weighted wideband signal. This method
comprises: filtering the wideband signal to produce a preemphasised
signal with enhanced high frequency content; calculating, from the
preemphasised signal, synthesis filter coefficients; and filtering
the preemphasised signal in relation to the synthesis filter
coefficients to thereby produce a perceptually weighted speech
signal. The filtering comprises processing the preemphasised signal
through a perceptual weighting filter having a transfer function
with fixed denominator whereby weighting of the wideband signal in
a formant region is substantially decoupled from a spectral tilt of
the wideband signal.
[0020] In accordance with preferred embodiments of the subject
invention:
[0021] reduction of the dynamic range comprises filtering the
wideband signal through a transfer function of the form:
$P(z) = 1 - \mu z^{-1}$
[0022] wherein $\mu$ is a preemphasis factor having a value located
between 0 and 1;
[0023] the preemphasis factor $\mu$ is 0.7;
[0024] the perceptual weighting filter has a transfer function of
the form:
$W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1})$
[0025] where $0 < \gamma_2 < \gamma_1 \le 1$ and
$\gamma_2$ and $\gamma_1$ are weighting control values;
and
[0026] the variable $\gamma_2$ is set equal to $\mu$.
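The preferred embodiment listed above can be sketched end to end in Python. Several details are assumptions rather than statements from the text: the synthesis filter coefficients are computed here with the autocorrelation method and a Levinson-Durbin recursion, the prediction order of 16 and the value 0.92 for gamma_1 are illustrative only, A(z) is taken as 1 + a_1 z^-1 + ... + a_p z^-p, and all function names are hypothetical.

```python
MU = 0.7         # preemphasis factor of the preferred embodiment
GAMMA_1 = 0.92   # illustrative value only; must satisfy 0 < gamma_2 < gamma_1 <= 1
GAMMA_2 = MU     # preferred embodiment: gamma_2 is set equal to mu


def preemphasize(x, mu=MU):
    """Apply P(z) = 1 - mu * z^-1, i.e. s[n] = x[n] - mu * x[n-1]."""
    return [x[n] - (mu * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def autocorrelation(s, order):
    """Autocorrelation of s for lags 0..order."""
    return [sum(s[n] * s[n - lag] for n in range(lag, len(s)))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """LP coefficients [1, a_1, ..., a_order] of A(z) from autocorrelations r
    (assumed analysis method; the text does not prescribe one here)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input: stop and keep the coefficients found so far
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def perceptual_weighting(s, a, gamma1=GAMMA_1, gamma2=GAMMA_2):
    """Filter s through W(z) = A(z/gamma1) / (1 - gamma2 * z^-1)."""
    b = [coeff * gamma1 ** i for i, coeff in enumerate(a)]
    y = []
    for n in range(len(s)):
        acc = sum(b[i] * s[n - i] for i in range(len(b)) if n - i >= 0)
        if n > 0:
            acc += gamma2 * y[n - 1]   # fixed first-order denominator
        y.append(acc)
    return y


def weight_wideband_frame(x, order=16):
    """Preemphasize x, derive A(z) from the preemphasized signal, then weight it."""
    s = preemphasize(x)
    a = levinson_durbin(autocorrelation(s, order), order)
    return perceptual_weighting(s, a), a
```

Note that the denominator of the weighting filter does not depend on the computed coefficients, so the tilt contributed by the denominator is set by gamma_2 alone while A(z/gamma_1) handles the formant weighting.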
[0027] Therefore, the overall perceptual weighting of the
quantization error is obtained by a combination of a preemphasis
filter and a modified weighting filter to enable high subjective
quality of the decoded wideband sound signal.
[0028] The solution to the problem set out in the brief description
of the prior art is accordingly to introduce a preemphasis filter
at the input, compute the synthesis filter coefficients based on
the preemphasised signal, and use a perceptual weighting filter
modified by fixing its denominator. By reducing the dynamic range of
the wideband signal, the preemphasis filter renders the wideband
signal more suitable for fixed-point implementation, and improves
the encoding of the high frequency contents of the spectrum.
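The combined effect of the two filters can be written out as a short derivation. It rests on an assumption not stated in the passage above, namely that the decoder undoes the preemphasis with the inverse filter 1/P(z), so that the quantization error in the output is shaped by the product of the inverse weighting filter and the inverse preemphasis filter.

```latex
% Assumption: the decoder applies the inverse preemphasis filter 1/P(z),
% with P(z) = 1 - \mu z^{-1} and W(z) = A(z/\gamma_1)/(1 - \gamma_2 z^{-1}).
\begin{align*}
W^{-1}(z)\,P^{-1}(z)
  &= \frac{1 - \gamma_2 z^{-1}}{A(z/\gamma_1)} \cdot \frac{1}{1 - \mu z^{-1}} \\
  &= \frac{1}{A(z/\gamma_1)} \qquad \text{when } \gamma_2 = \mu .
\end{align*}
```

Under that assumption, setting the weighting control value equal to the preemphasis factor cancels the two first-order terms, so the error shaping is governed by A(z/gamma_1) alone, with A(z) computed from the preemphasised signal.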
[0029] The present invention further relates to an encoder for
encoding a wideband signal, comprising: a) a perceptual weighting
device as described hereinabove; b) a pitch codebook search
device responsive to the perceptually weighted signal for producing
pitch codebook parameters and an innovative search target vector;
c) an innovative codebook search device, responsive to the
synthesis filter coefficients and to the innovative search target
vector, for producing innovative codebook parameters; and d) a
signal forming device for producing an encoded wideband signal
comprising the pitch codebook parameters, the innovative codebook
parameters, and the synthesis filter coefficients.
[0030] Still further in accordance with the present invention,
there is provided:
[0031] a cellular communication system for servicing a large
geographical area divided into a plurality of cells, comprising: a)
mobile transmitter/receiver units; b) cellular base stations
respectively situated in the cells; c) a control terminal for
controlling communication between the cellular base stations; d) a
bidirectional wireless communication sub-system between each mobile
unit situated in one cell and the cellular base station of this
cell, this bidirectional wireless communication sub-system
comprising, in both the mobile unit and the cellular base
station:
[0032] i) a transmitter including an encoder as described
hereinabove for encoding a wideband signal and a transmission
circuit for transmitting the encoded wideband signal; and
[0033] ii) a receiver including a receiving circuit for receiving a
transmitted encoded wideband signal and a decoder for decoding the
received encoded wideband signal;
[0034] a cellular mobile transmitter/receiver unit comprising:
[0035] a) a transmitter including an encoder as described
hereinabove for encoding a wideband signal and a transmission
circuit for transmitting the encoded wideband signal; and
[0036] b) a receiver including a receiving circuit for receiving a
transmitted encoded wideband signal and a decoder for decoding the
received encoded wideband signal;
[0037] a cellular network element comprising:
[0038] a) a transmitter including an encoder as described
hereinabove for encoding a wideband signal and a transmission
circuit for transmitting the encoded wideband signal; and
[0039] b) a receiver including a receiving circuit for receiving a
transmitted encoded wideband signal and a decoder for decoding the
received encoded wideband signal; and
[0040] a bidirectional wireless communication sub-system between
each mobile unit situated in one cell and the cellular base station
of this cell, this bidirectional wireless communication sub-system
comprising, in both the mobile unit and the cellular base
station:
[0041] a) a transmitter including an encoder as described
hereinabove for encoding a wideband signal and a transmission
circuit for transmitting the encoded wideband signal; and
[0042] b) a receiver including a receiving circuit for receiving a
transmitted encoded wideband signal and a decoder for decoding the
received encoded wideband signal.
[0043] The objects, advantages and other features of the present
invention will become more apparent upon reading of the following
non restrictive description of preferred embodiments thereof, given
by way of example only with reference to the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0044] In the appended drawings:
[0045] FIG. 1 is a schematic block diagram of a preferred
embodiment of a wideband encoding device;
[0046] FIG. 2 is a schematic block diagram of a preferred
embodiment of a wideband decoding device;
[0047] FIG. 3 is a schematic block diagram of a preferred
embodiment of a pitch analysis device; and
[0048] FIG. 4 is a simplified, schematic block diagram of a
cellular communication system in which the wideband encoding device
of FIG. 1 and the wideband decoding device of FIG. 2 can be
used.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0049] As is well known to those of ordinary skill in the art, a
cellular communication system such as 401 (see FIG. 4) provides a
telecommunication service over a large geographic area by dividing
that large geographic area into a number C of smaller cells. The C
smaller cells are serviced by respective cellular base stations
402.sub.1, 402.sub.2 . . . 402.sub.C to provide each cell with
radio signalling, audio and data channels.
[0050] Radio signalling channels are used to page mobile
radiotelephones (mobile transmitter/receiver units) such as 403
within the limits of the coverage area (cell) of the cellular base
station 402, and to place calls to other radiotelephones 403
located either inside or outside the base station's cell or to
another network such as the Public Switched Telephone Network
(PSTN) 404.
[0051] Once a radiotelephone 403 has successfully placed or
received a call, an audio or data channel is established between
this radiotelephone 403 and the cellular base station 402
corresponding to the cell in which the radiotelephone 403 is
situated, and communication between the base station 402 and
radiotelephone 403 is conducted over that audio or data channel.
The radiotelephone 403 may also receive control or timing
information over a signalling channel while a call is in
progress.
[0052] If a radiotelephone 403 leaves a cell and enters another
adjacent cell while a call is in progress, the radiotelephone 403
hands over the call to an available audio or data channel of the
new cell base station 402. If a radiotelephone 403 leaves a cell
and enters another adjacent cell while no call is in progress, the
radiotelephone 403 sends a control message over the signalling
channel to log into the base station 402 of the new cell. In this
manner mobile communication over a wide geographical area is
possible.
[0053] The cellular communication system 401 further comprises a
control terminal 405 to control communication between the cellular
base stations 402 and the PSTN 404, for example during a
communication between a radiotelephone 403 and the PSTN 404, or
between a radiotelephone 403 located in a first cell and a
radiotelephone 403 situated in a second cell.
[0054] Of course, a bidirectional wireless radio communication
subsystem is required to establish an audio or data channel between
a base station 402 of one cell and a radiotelephone 403 located in
that cell. As illustrated in very simplified form in FIG. 4, such a
bidirectional wireless radio communication subsystem typically
comprises in the radiotelephone 403:
[0055] a transmitter 406 including:
[0056] an encoder 407 for encoding the voice signal; and
[0057] a transmission circuit 408 for transmitting the encoded
voice signal from the encoder 407 through an antenna such as 409;
and
[0058] a receiver 410 including:
[0059] a receiving circuit 411 for receiving a transmitted encoded
voice signal usually through the same antenna 409; and
[0060] a decoder 412 for decoding the received encoded voice signal
from the receiving circuit 411.
[0061] The radiotelephone further comprises other conventional
radiotelephone circuits 413 to which the encoder 407 and decoder
412 are connected and for processing signals therefrom, which
circuits 413 are well known to those of ordinary skill in the art
and, accordingly, will not be further described in the present
specification.
[0062] Also, such a bidirectional wireless radio communication
subsystem typically comprises in the base station 402:
[0063] a transmitter 414 including:
[0064] an encoder 415 for encoding the voice signal; and
[0065] a transmission circuit 416 for transmitting the encoded
voice signal from the encoder 415 through an antenna such as 417;
and
[0066] a receiver 418 including:
[0067] a receiving circuit 419 for receiving a transmitted encoded
voice signal through the same antenna 417 or through another
antenna (not shown); and
[0068] a decoder 420 for decoding the received encoded voice signal
from the receiving circuit 419.
[0069] The base station 402 further comprises, typically, a base
station controller 421, along with its associated database 422, for
controlling communication between the control terminal 405 and the
transmitter 414 and receiver 418.
[0070] As is well known to those of ordinary skill in the art, voice
encoding is required in order to reduce the bandwidth necessary to
transmit a sound signal, for example a voice signal such as speech,
across the bidirectional wireless radio communication subsystem,
i.e., between a radiotelephone 403 and a base station 402.
[0071] LP voice encoders (such as 415 and 407), for example
Code-Excited Linear Prediction (CELP) encoders typically operating
at 13 kbits/second and below, use an LP synthesis filter to model
the short-term spectral envelope of the voice signal. The LP
information is transmitted, typically, every 10 or 20 ms to the
decoder (such as 420 and 412) and is extracted at the decoder end.
[0072] The novel techniques disclosed in the present specification
may apply to different LP-based coding systems. However, a
CELP-type coding system is used in the preferred embodiment for the
purpose of presenting a non-limitative illustration of these
techniques. In the same manner, such techniques can be used with
sound signals other than voice and speech as well as with other
types of wideband signals.
[0073] FIG. 1 shows a general block diagram of a CELP-type speech
encoding device 100 modified to better accommodate wideband
signals.
[0074] The sampled input speech signal 114 is divided into
successive L-sample blocks called "frames". In each frame,
different parameters representing the speech signal in the frame
are computed, encoded, and transmitted. LP parameters representing
the LP synthesis filter are usually computed once every frame. The
frame is further divided into smaller blocks of N samples (blocks
of length N), in which excitation parameters (pitch and innovation)
are determined. In the CELP literature, these blocks of length N
are called "subframes" and the N-sample signals in the subframes
are referred to as N-dimensional vectors. In this preferred
embodiment, the length N corresponds to 5 ms while the length L
corresponds to 20 ms, which means that a frame contains four
subframes (N=80 at the sampling rate of 16 kHz and 64 after
down-sampling to 12.8 kHz). Various N-dimensional vectors occur in
the encoding procedure. A list of the vectors which appear in FIGS.
1 and 2 as well as a list of transmitted parameters are given
herein below:
[0075] List of the Main N-Dimensional Vectors
[0076] s Wideband signal input speech vector (after down-sampling,
pre-processing, and preemphasis);
[0077] s.sub.w Weighted speech vector;
[0078] s.sub.0 Zero-input response of weighted synthesis
filter;
[0079] s.sub.p Down-sampled pre-processed signal;
[0080] Oversampled synthesized speech signal;
[0081] s' Synthesis signal before deemphasis;
[0082] s.sub.d Deemphasized synthesis signal;
[0083] s.sub.h Synthesis signal after deemphasis and
postprocessing;
[0084] x Target vector for pitch search;
[0085] x' Target vector for innovation search;
[0086] h Weighted synthesis filter impulse response;
[0087] v.sub.T Adaptive (pitch) codebook vector at delay T;
[0088] y.sub.T Filtered pitch codebook vector (v.sub.T convolved
with h);
[0089] c.sub.k Innovative codevector at index k (k-th entry from
the innovation codebook);
[0090] c.sub.f Enhanced scaled innovation codevector;
[0091] u Excitation signal (scaled innovation and pitch
codevectors);
[0092] u' Enhanced excitation;
[0093] z Band-pass noise sequence;
[0094] w' White noise sequence; and
[0095] w Scaled noise sequence.
[0096] List of Transmitted Parameters
[0097] STP Short term prediction parameters (defining A(z));
[0098] T Pitch lag (or pitch codebook index);
[0099] b Pitch gain (or pitch codebook gain);
[0100] j Index of the low-pass filter used on the pitch
codevector;
[0101] k Codevector index (innovation codebook entry); and
[0102] g Innovation codebook gain.
[0103] In this preferred embodiment, the STP parameters are
transmitted once per frame and the rest of the parameters are
transmitted four times per frame (every subframe).
[0104] Encoder Side
[0105] The sampled speech signal is encoded on a block by block
basis by the encoding device 100 of FIG. 1 which is broken down
into eleven modules numbered from 101 to 111.
[0106] The input speech is processed into the above mentioned
L-sample blocks called frames.
[0107] Referring to FIG. 1, the sampled input speech signal 114 is
down-sampled in a down-sampling module 101. For example, the signal
is down-sampled from 16 kHz down to 12.8 kHz, using techniques well
known to those of ordinary skill in the art. Down-sampling down to
another frequency can of course be envisaged. Down-sampling
increases the coding efficiency, since a smaller frequency
bandwidth is encoded. This also reduces the algorithmic complexity
since the number of samples in a frame is decreased. The use of
down-sampling becomes significant when the bit rate is reduced
below 16 kbit/s, although down-sampling is not essential above 16
kbit/s.
[0108] After down-sampling, the 320-sample frame of 20 ms is
reduced to a 256-sample frame (down-sampling ratio of 4/5).
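As a quick check of the figures above, the following Python sketch recomputes the frame and subframe lengths from the 20 ms and 5 ms durations and the 16 kHz and 12.8 kHz sampling rates given in this description. The helper name `samples_in` is hypothetical and is introduced only for illustration.

```python
from fractions import Fraction


def samples_in(duration_ms: int, rate_hz: int) -> int:
    """Return the number of samples in duration_ms at rate_hz.

    Raises ValueError if the duration does not hold a whole number of samples.
    """
    n = Fraction(duration_ms * rate_hz, 1000)
    if n.denominator != 1:
        raise ValueError("duration does not contain an integer number of samples")
    return int(n)


FRAME_MS = 20     # frame length L, from the description
SUBFRAME_MS = 5   # subframe length N, from the description

for rate in (16000, 12800):
    # Print the sampling rate, the frame length L and the subframe length N.
    print(rate, samples_in(FRAME_MS, rate), samples_in(SUBFRAME_MS, rate))
```

The ratio of the two frame lengths, 256/320, reduces to the 4/5 down-sampling ratio stated above.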
[0109] The input frame is then supplied to the optional
pre-processing block 102. Pre-processing block 102 may consist of a
high-pass filter with a 50 Hz cut-off frequency. High-pass filter
102 removes the unwanted sound components below 50 Hz.
[0110] The down-sampled pre-processed signal is denoted by
s.sub.p(n), n=0, 1, 2, . . . , L-1, where L is the length of the
frame (256 at a sampling frequency of 12.8 kHz). In a preferred
embodiment of the preemphasis filter 103, the signal s.sub.p(n) is
preemphasized using a filter having the following transfer
function:
P(z)=1-.mu.z.sup.-1
[0111] where .mu. is a preemphasis factor with a value located
between 0 and 1 (a typical value is .mu.=0.7). A higher-order
filter could also be used. It should be pointed out that high-pass
filter 102 and preemphasis filter 103 can be interchanged to obtain
more efficient fixed-point implementations.
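The first-order preemphasis filter P(z)=1-.mu.z.sup.-1 corresponds to the difference equation s(n)=s.sub.p(n)-.mu.s.sub.p(n-1). A minimal floating-point sketch in Python follows; the function name and the `prev` argument (the last input sample of the preceding frame, used to carry the filter state across frames) are assumptions made for illustration and do not come from the specification.

```python
def preemphasize(s_p, mu=0.7, prev=0.0):
    """Apply P(z) = 1 - mu*z^-1 to the sequence s_p.

    prev is the input sample that precedes s_p[0] (hypothetical state
    argument; 0.0 means the filter starts from rest).
    """
    out = []
    for x in s_p:
        out.append(x - mu * prev)  # s(n) = s_p(n) - mu * s_p(n-1)
        prev = x
    return out
```

With .mu. between 0 and 1, a constant (zero-frequency) input is scaled by 1-.mu. while an input alternating in sign at every sample is scaled by 1+.mu., which is the high-frequency emphasis described above.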
[0112] The function of the preemphasis filter 103 is to enhance the
high frequency contents of the input signal. It also reduces the
dynamic range of the input speech signal, which renders it more
suitable for fixed-point implementation. Without preemphasis, LP
analysis in fixed-point using single-precision arithmetic is
difficult to implement.
[0113] Preemphasis also plays an important role in achieving a
proper overall perceptual weighting of the quantization error,
which contributes to improved sound quality. This will be explained
in more detail herein below.
[0114] The output of the preemphasis filter 103 is denoted s(n).
This signal is used for performing LP analysis in calculator module
104. LP analysis is a technique well known to those of ordinary
skill in the art. In this preferred embodiment, the autocorrelation
approach is used. In the autocorrelation approach, the signal s(n)
is first windowed using a Hamming window (having usually a length
of the order of 30-40 ms). The autocorrelations are computed from
the windowed signal, and Levinson-Durbin recursion is used to
compute LP filter coefficients, a.sub.i, where i=1, . . . , p, and
where p is the LP order, which is typically 16 in wideband coding.
The parameters a.sub.i are the coefficients of the transfer
function of the LP filter, which is given by the following
relation:
$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$
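The autocorrelation approach described above can be sketched in Python as follows. This is an illustrative floating-point version only: the window is applied over the whole input block, and details such as window placement, lag windowing and fixed-point scaling are not covered. All function names are hypothetical. The coefficients follow the sign convention A(z)=1+.SIGMA.a.sub.iz.sup.-i used in this description.

```python
import math


def hamming(length):
    """Symmetric Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def autocorr(x, p):
    """Autocorrelations r[0..p] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(p + 1)]


def levinson_durbin(r, p):
    """Solve the normal equations for A(z) = 1 + sum_i a_i z^-i.

    Returns ([1, a_1, ..., a_p], residual prediction error energy).
    """
    if r[0] <= 0.0:
        raise ValueError("r[0] must be positive")
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def lp_analysis(s, p=16):
    """Window s, compute autocorrelations, and run Levinson-Durbin."""
    w = hamming(len(s))
    r = autocorr([x * wn for x, wn in zip(s, w)], p)
    return levinson_durbin(r, p)
```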
[0115] LP analysis is performed in calculator module 104, which
also performs the quantization and interpolation of the LP filter
coefficients. The LP filter coefficients are first transformed into
another equivalent domain more suitable for quantization and
interpolation purposes. The line spectral pair (LSP) and immittance
spectral pair (ISP) domains are two domains in which quantization
and interpolation can be efficiently performed. The 16 LP filter
coefficients, a.sub.i, can be quantized in the order of 30 to 50
bits using split or multi-stage quantization, or a combination
thereof. The purpose of the interpolation is to enable updating the
LP filter coefficients every subframe while transmitting them once
every frame, which improves the encoder performance without
increasing the bit rate. Quantization and interpolation of the LP
filter coefficients is believed to be otherwise well known to those
of ordinary skill in the art and, accordingly, will not be further
described in the present specification.
[0116] The following paragraphs will describe the rest of the
coding operations performed on a subframe basis. In the following
description, the filter A(z) denotes the unquantized interpolated
LP filter of the subframe, and the filter Â(z) denotes the quantized
interpolated LP filter of the subframe.
[0117] Perceptual Weighting:
[0118] In analysis-by-synthesis encoders, the optimum pitch and
innovation parameters are searched by minimizing the mean squared
error between the input speech and synthesized speech in a
perceptually weighted domain. This is equivalent to minimizing the
error between the weighted input speech and weighted synthesis
speech.
[0119] The weighted signal s.sub.w(n) is computed in a perceptual
weighting filter 105. Traditionally, the weighted signal s.sub.w(n)
is computed by a weighting filter having a transfer function W(z)
in the form:
W(z)=A(z/.gamma..sub.1)/A(z/.gamma..sub.2) where
0<.gamma..sub.2<.gamma..sub.1.ltoreq.1
[0120] As well known to those of ordinary skill in the art, in
prior art analysis-by-synthesis (AbS) encoders, analysis shows that
the quantization error is weighted by a transfer function
W.sup.-1(z), which is the inverse of the transfer function of the
perceptual weighting filter 105. This result is well described by
B. S. Atal and M. R. Schroeder in "Predictive coding of speech and
subjective error criteria", IEEE Transaction ASSP, vol. 27, no. 3,
pp. 247-254, June 1979. Transfer function W.sup.-1(z) exhibits some
of the formant structure of the input speech signal. Thus, the
masking property of the human ear is exploited by shaping the
quantization error so that it has more energy in the formant
regions where it will be masked by the strong signal energy present
in these regions. The amount of weighting is controlled by the
factors .gamma..sub.1 and .gamma..sub.2.
[0121] The above traditional perceptual weighting filter 105 works
well with telephone band signals. However, it was found that this
traditional perceptual weighting filter 105 is not suitable for
efficient perceptual weighting of wideband signals. It was also
found that the traditional perceptual weighting filter 105 has
inherent limitations in modelling the formant structure and the
required spectral tilt concurrently. The spectral tilt is more
pronounced in wideband signals due to the wide dynamic range
between low and high frequencies. The prior art has suggested
adding a tilt filter into W(z) in order to control the tilt and
formant weighting of the wideband input signal separately.
[0122] A novel solution to this problem is, in accordance with the
present invention, to introduce the preemphasis filter 103 at the
input, compute the LP filter A(z) based on the preemphasized speech
s(n), and use a modified filter W(z) by fixing its denominator.
[0123] LP analysis is performed in module 104 on the preemphasized
signal s(n) to obtain the LP filter A(z). Also, a new perceptual
weighting filter 105 with fixed denominator is used. An example of
transfer function for the perceptual weighting filter 105 is given
by the following relation:
W(z)=A(z/.gamma..sub.1)/(1-.gamma..sub.2z.sup.-1) where
0<.gamma..sub.2<.gamma..sub.1.ltoreq.1
[0124] A higher order can be used at the denominator. This
structure substantially decouples the formant weighting from the
tilt.
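The fixed-denominator weighting filter can be sketched in Python as follows. The numerator A(z/.gamma..sub.1) is obtained by scaling each coefficient a.sub.i by .gamma..sub.1.sup.i, and the denominator 1-.gamma..sub.2z.sup.-1 is a one-pole recursion. The function names are hypothetical, the filter is assumed to start from rest (zero states), and no specific values of .gamma..sub.1 and .gamma..sub.2 are implied beyond the constraint 0<.gamma..sub.2<.gamma..sub.1.ltoreq.1.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i is scaled by gamma**i (a[0] is 1)."""
    return [c * gamma ** i for i, c in enumerate(a)]


def perceptual_weight(s, a, gamma1, gamma2):
    """Filter s through W(z) = A(z/gamma1) / (1 - gamma2 * z^-1).

    a is [1, a_1, ..., a_p]; filter memories are assumed to be zero.
    """
    num = bandwidth_expand(a, gamma1)
    s_w = []
    prev_out = 0.0
    for n in range(len(s)):
        # FIR part: A(z/gamma1) applied to the preemphasized speech.
        fir = sum(num[i] * s[n - i] for i in range(len(num)) if n - i >= 0)
        # Fixed first-order denominator: s_w(n) = fir(n) + gamma2 * s_w(n-1).
        prev_out = fir + gamma2 * prev_out
        s_w.append(prev_out)
    return s_w
```

Because the denominator no longer depends on A(z), only .gamma..sub.1 controls the formant weighting, while the tilt is set by the fixed first-order pole.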
[0125] Note that because A(z) is computed based on the
preemphasized speech signal s(n), the tilt of the filter
1/A(z/.gamma..sub.1) is less pronounced compared to the case when
A(z) is computed based on the original speech. Since deemphasis is
performed at the decoder end using a filter having the transfer
function:
P.sup.-1(z)=1/(1-.mu.z.sup.-1),
[0126] the quantization error spectrum is shaped by a filter having
a transfer function W.sup.-1(z)P.sup.-1(z). When .gamma..sub.2 is
set equal to .mu., which is typically the case, the spectrum of the
quantization error is shaped by a filter whose transfer function is
1/A(z/.gamma..sub.1), with A(z) computed based on the preemphasized
speech signal. Subjective listening showed that this structure for
achieving the error shaping by a combination of preemphasis and
modified weighting filtering is very efficient for encoding
wideband signals, in addition to the advantages of ease of
fixed-point algorithmic implementation.
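The cancellation relied upon in the preceding paragraph can be written out explicitly from the transfer functions W(z) and P.sup.-1(z) given above; the following LaTeX rendering is provided only as a worked restatement of that step.

```latex
W^{-1}(z)\,P^{-1}(z)
  = \frac{1-\gamma_2 z^{-1}}{A(z/\gamma_1)} \cdot \frac{1}{1-\mu z^{-1}}
  = \frac{1}{A(z/\gamma_1)}
  \qquad \text{when } \gamma_2 = \mu .
```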
[0127] Pitch Analysis:
[0128] In order to simplify the pitch analysis, an open-loop pitch
lag T.sub.OL is first estimated in the open-loop pitch search
module 106 using the weighted speech signal s.sub.w(n). Then the
closed-loop pitch analysis, which is performed in closed-loop pitch
search module 107 on a subframe basis, is restricted around the
open-loop pitch lag T.sub.OL which significantly reduces the search
complexity of the LTP parameters T and b (pitch lag and pitch
gain). Open-loop pitch analysis is usually performed in module 106
once every 10 ms (two subframes) using techniques well known to
those of ordinary skill in the art.
[0129] The target vector x for LTP (Long Term Prediction) analysis
is first computed. This is usually done by subtracting the
zero-input response s.sub.0 of weighted synthesis filter W(z)/Â(z)
from the weighted speech signal s.sub.w(n). This zero-input
response s.sub.0 is calculated by a zero-input response calculator
108. More specifically, the target vector x is calculated using the
following relation:
x=s.sub.w-s.sub.0
[0130] where x is the N-dimensional target vector, s.sub.w is the
weighted speech vector in the subframe, and s.sub.0 is the
zero-input response of filter W(z)/Â(z) which is the output of the
combined filter W(z)/Â(z) due to its initial states. The zero-input
response calculator 108 is responsive to the quantized interpolated
LP filter Â(z) from the LP analysis, quantization and interpolation
calculator 104 and to the initial states of the weighted synthesis
filter W(z)/Â(z) stored in memory module 111 to calculate the
zero-input response s.sub.0 (that part of the response due to the
initial states as determined by setting the inputs equal to zero)
of filter W(z)/Â(z). This operation is well known to those of
ordinary skill in the art and, accordingly, will not be further
described.
[0131] Of course, alternative but mathematically equivalent
approaches can be used to compute the target vector x.
[0132] An N-dimensional impulse response vector h of the weighted
synthesis filter W(z)/Â(z) is computed in the impulse response
generator 109 using the LP filter coefficients A(z) and Â(z) from
module 104. Again, this operation is well known to those of
ordinary skill in the art and, accordingly, will not be further
described in the present specification.
[0133] The closed-loop pitch (or pitch codebook) parameters b, T
and j are computed in the closed-loop pitch search module 107,
which uses the target vector x, the impulse response vector h and
the open-loop pitch lag T.sub.OL as inputs. Traditionally, the
pitch prediction has been represented by a pitch filter having the
following transfer function:
1/(1-bz.sup.-T)
[0134] where b is the pitch gain and T is the pitch delay or lag.
In this case, the pitch contribution to the excitation signal u(n)
is given by bu(n-T), where the total excitation is given by
u(n)=bu(n-T)+gc.sub.k(n)
[0135] with g being the innovative codebook gain and c.sub.k(n) the
innovative codevector at index k.
[0136] This representation has limitations if the pitch lag T is
shorter than the subframe length N. In another representation, the
pitch contribution can be seen as a pitch codebook containing the
past excitation signal. Generally, each vector in the pitch
codebook is a shift-by-one version of the previous vector
(discarding one sample and adding a new sample). For pitch lags
T>N, the pitch codebook is equivalent to the filter structure
1/(1-bz.sup.-T), and a pitch codebook vector v.sub.T(n) at pitch
lag T is given by
v.sub.T(n)=u(n-T), n=0, . . . N-1.
[0137] For pitch lags T shorter than N, a vector v.sub.T(n) is
built by repeating the available samples from the past excitation
until the vector is completed (this is not equivalent to the filter
structure).
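The construction of the pitch codebook vector for integer lags, including the repetition used when T is shorter than N, can be sketched in Python as follows. The function name and the buffer layout (the last element of `past_u` is u(-1)) are assumptions made for illustration; fractional lags are not handled here.

```python
def pitch_codebook_vector(past_u, T, N):
    """Build v_T(n), n = 0..N-1, from the past excitation.

    past_u holds u(n) for n < 0, with past_u[-1] == u(-1).
    For n < T the sample u(n - T) is taken from the past excitation;
    for n >= T the already built samples are repeated with period T.
    """
    if T < 1 or T > len(past_u):
        raise ValueError("lag T must lie within the stored past excitation")
    v = []
    for n in range(N):
        if n < T:
            v.append(past_u[len(past_u) - T + n])  # u(n - T), n - T < 0
        else:
            v.append(v[n - T])                     # repeat available samples
    return v
```

For T at least equal to N the second branch is never taken and the result reduces to v.sub.T(n)=u(n-T).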
[0138] In recent encoders, a higher pitch resolution is used which
significantly improves the quality of voiced sound segments. This
is achieved by oversampling the past excitation signal using
polyphase interpolation filters. In this case, the vector
v.sub.T(n) usually corresponds to an interpolated version of the
past excitation, with pitch lag T being a non-integer delay (e.g.
50.25).
[0139] The pitch search consists of finding the best pitch lag T
and gain b that minimize the mean squared weighted error E between
the target vector x and the scaled filtered past excitation. The
error E is expressed as:
E=.parallel.x-by.sub.T.parallel..sup.2
[0140] where y.sub.T is the filtered pitch codebook vector at pitch
lag T:
$$y_T(n) = v_T(n) * h(n) = \sum_{i=0}^{n} v_T(i)\, h(n-i), \quad n = 0, \ldots, N-1.$$
[0141] It can be shown that the error E is minimized by maximizing
the search criterion
$$C = \frac{x^t y_T}{\sqrt{y_T^t y_T}}$$
[0142] where t denotes vector transpose.
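The quantities used in the pitch search can be sketched in Python as follows, taking the criterion C as the correlation x.sup.ty.sub.T normalized by the square root of the energy y.sub.T.sup.ty.sub.T. The function names are hypothetical, the impulse response h is assumed to hold at least N samples, and the sketch performs a direct convolution rather than the recursive update of y.sub.T mentioned in this description.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def filtered_vector(v, h):
    """y(n) = sum_{i=0..n} v(i) * h(n - i), n = 0..N-1 (truncated convolution)."""
    return [sum(v[i] * h[n - i] for i in range(n + 1)) for n in range(len(v))]


def criterion(x, y):
    """Search criterion C = x^t y / sqrt(y^t y); 0.0 if y has no energy."""
    energy = dot(y, y)
    return dot(x, y) / math.sqrt(energy) if energy > 0.0 else 0.0


def optimal_gain(x, y):
    """Gain b minimizing ||x - b*y||^2; 0.0 if y has no energy."""
    energy = dot(y, y)
    return dot(x, y) / energy if energy > 0.0 else 0.0


def weighted_error(x, y, b):
    """E = ||x - b*y||^2."""
    return sum((xi - b * yi) ** 2 for xi, yi in zip(x, y))
```

At the optimal gain the error equals x.sup.tx minus the square of C, which is why maximizing C (for positively correlated candidates) minimizes E.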
[0143] In the preferred embodiment of the present invention, a 1/3
subsample pitch resolution is used, and the pitch (pitch codebook)
search is composed of three stages.
[0144] In the first stage, an open-loop pitch lag T.sub.OL is
estimated in open-loop pitch search module 106 in response to the
weighted speech signal s.sub.w(n). As indicated in the foregoing
description, this open-loop pitch analysis is usually performed
once every 10 ms (two subframes) using techniques well known to
those of ordinary skill in the art.
[0145] In the second stage, the search criterion C is searched in
the closed-loop pitch search module 107 for integer pitch lags
around the estimated open-loop pitch lag T.sub.OL (usually .+-.5),
which significantly simplifies the search procedure. A simple
procedure is used for updating the filtered codevector y.sub.T
without the need to compute the convolution for every pitch
lag.
[0146] Once an optimum integer pitch lag is found in the second
stage, a third stage of the search (module 107) tests the fractions
around that optimum integer pitch lag.
[0147] When the pitch predictor is represented by a filter of the
form 1/(1-bz.sup.-T), which is a valid assumption for pitch lags
T>N, the spectrum of the pitch filter exhibits a harmonic
structure over the entire frequency range, with a harmonic
frequency related to 1/T. In case of wideband signals, this
structure is not very efficient since the harmonic structure in
wideband signals does not cover the entire extended spectrum. The
harmonic structure exists only up to a certain frequency, depending
on the speech segment. Thus, in order to achieve efficient
representation of the pitch contribution in voiced segments of
wideband speech, the pitch prediction filter needs to have the
flexibility of varying the amount of periodicity over the wideband
spectrum.
[0148] A new method which achieves efficient modeling of the
harmonic structure of the speech spectrum of wideband signals is
disclosed in the present specification, whereby several forms of
low pass filters are applied to the past excitation and the low
pass filter with higher prediction gain is selected.
[0149] When subsample pitch resolution is used, the low pass
filters can be incorporated into the interpolation filters used to
obtain the higher pitch resolution. In this case, the third stage
of the pitch search, in which the fractions around the chosen
integer pitch lag are tested, is repeated for the several
interpolation filters having different low-pass characteristics and
the fraction and filter index which maximize the search criterion C
are selected.
[0150] A simpler approach is to complete the search in the three
stages described above to determine the optimum fractional pitch
lag using only one interpolation filter with a certain frequency
response, and select the optimum low-pass filter shape at the end
by applying the different predetermined low-pass filters to the
chosen pitch codebook vector v.sub.T and select the low-pass filter
which minimizes the pitch prediction error. This approach is
discussed in detail below.
[0151] FIG. 3 illustrates a schematic block diagram of a preferred
embodiment of the proposed approach.
[0152] In memory module 303, the past excitation signal u(n),
n<0, is stored. The pitch codebook search module 301 is
responsive to the target vector x, to the open-loop pitch lag
T.sub.OL and to the past excitation signal u(n), n<0, from
memory module 303 to conduct a pitch codebook search maximizing
the above-defined search criterion C. From the
result of the search conducted in module 301, module 302 generates
the optimum pitch codebook vector v.sub.T. Note that since a
sub-sample pitch resolution is used (fractional pitch), the past
excitation signal u(n), n<0, is interpolated and the pitch
codebook vector v.sub.T corresponds to the interpolated past
excitation signal. In this preferred embodiment, the interpolation
filter (in module 301, but not shown) has a low-pass filter
characteristic removing the frequency contents above 7000 Hz.
[0153] In a preferred embodiment, K filter characteristics are
used; these filter characteristics could be low-pass or band-pass
filter characteristics. Once the optimum codevector v.sub.T is
determined and supplied by the pitch codevector generator 302, K
filtered versions of v.sub.T are computed respectively using K
different frequency shaping filters such as 305.sup.(j), where j=1,
2, . . . , K. These filtered versions are denoted v.sub.f.sup.(j),
where j=1, 2, . . . , K. The different vectors v.sub.f.sup.(j) are
convolved in respective modules 304.sup.(j), where j=0, 1, 2, . . .
K, with the impulse response h to obtain the vectors y.sup.(j),
where j=0, 1, 2, . . . , K. To calculate the mean squared pitch
prediction error for each vector y.sup.(j), the value y.sup.(j) is
multiplied by the gain b by means of a corresponding amplifier
307.sup.(j) and the value by.sup.(j) is subtracted from the target
vector x by means of a corresponding subtractor 308.sup.(j).
Selector 309 selects the frequency shaping filter 305.sup.(j) which
minimizes the mean squared pitch prediction error
e.sup.(j)=.parallel.x-b.sup.(j)y.sup.(j).parallel..sup.2, j=1, 2, .
. . , K
[0154] To calculate the mean squared pitch prediction error
e.sup.(j) for each value of y.sup.(j), the value y.sup.(j) is
multiplied by the gain b by means of a corresponding amplifier
307.sup.(j) and the value b.sup.(j)y.sup.(j) is subtracted from the
target vector x by means of subtractors 308.sup.(j). Each gain
b.sup.(j) is calculated in a corresponding gain calculator
306.sup.(j) in association with the frequency shaping filter at
index j, using the following relationship:
b.sup.(j)=x.sup.ty.sup.(j)/.parallel.y.sup.(j).parallel..sup.2.
[0155] In selector 309, the parameters b, T, and j are chosen based
on v.sub.T or v.sub.f.sup.(j) which minimizes the mean squared
pitch prediction error e.
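The selection performed by modules 304 to 309 can be sketched in Python as follows. Each candidate shaping filter is modeled here as a short causal FIR filter given by its tap list, which is an assumption made for illustration; the actual filter shapes are not specified in this passage. A tap list of [1.0] stands for the unfiltered vector v.sub.T. All function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fir(v, taps):
    """Causal FIR filtering of v with zero initial state."""
    return [sum(taps[i] * v[n - i] for i in range(len(taps)) if n - i >= 0)
            for n in range(len(v))]


def convolve(v, h):
    """Truncated convolution of v with the impulse response h (len(h) >= len(v))."""
    return [sum(v[i] * h[n - i] for i in range(n + 1)) for n in range(len(v))]


def select_shaping_filter(x, v_T, h, filter_taps):
    """Return (j, b_j, e_j) for the candidate with the smallest error e_j.

    For each candidate j: y_j = (filter_j applied to v_T) convolved with h,
    b_j = x^t y_j / ||y_j||^2 and e_j = ||x - b_j * y_j||^2.
    """
    best = None
    for j, taps in enumerate(filter_taps):
        y = convolve(fir(v_T, taps), h)
        energy = dot(y, y)
        b = dot(x, y) / energy if energy > 0.0 else 0.0
        e = sum((xi - b * yi) ** 2 for xi, yi in zip(x, y))
        if best is None or e < best[2]:
            best = (j, b, e)
    return best
```

The returned index j and gain b correspond to the parameters chosen by the selector, while the lag T is the one already determined by the pitch codebook search.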
[0156] Referring back to FIG. 1, the pitch codebook index T is
encoded and transmitted to multiplexer 112. The pitch gain b is
quantized and transmitted to multiplexer 112. With this new
approach, extra information is needed to encode the index j of the
selected frequency shaping filter in multiplexer 112. For example,
if three filters are used (j=0, 1, 2, 3, where j=0 corresponds to
the unfiltered vector v.sub.T), then two bits are needed
to represent this information. The filter index information j can
also be encoded jointly with the pitch gain b.
[0157] Innovative Codebook Search:
[0158] Once the pitch, or LTP (Long Term Prediction) parameters b,
T, and j are determined, the next step is to search for the optimum
innovative excitation by means of search module 110 of FIG. 1.
First, the target vector x is updated by subtracting the LTP
contribution:
x'=x-by.sub.T
[0159] where b is the pitch gain and y.sub.T is the filtered pitch
codebook vector (the past excitation at delay T filtered with the
selected low pass filter and convolved with the impulse response h
as described with reference to FIG. 3).
[0160] The search procedure in CELP is performed by finding the
optimum excitation codevector c.sub.k and gain g which minimize the
mean-squared error between the target vector and the scaled
filtered codevector
E=.parallel.x'-gHc.sub.k.parallel..sup.2
[0161] where H is a lower triangular convolution matrix derived
from the impulse response vector h.
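The target update and the error criterion of the innovation search can be sketched in Python as follows. This sketch scans a small explicit list of codevectors exhaustively, which is not the algebraic codebook search of the cited patents; it only illustrates the criterion E and the lower triangular structure of H. All function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def convolution_matrix(h):
    """Lower triangular Toeplitz matrix H with H[r][c] = h[r - c] for r >= c."""
    n = len(h)
    return [[h[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]


def matvec(m, v):
    """Matrix-vector product."""
    return [dot(row, v) for row in m]


def search_innovation(x, b, y_T, h, codebook):
    """Return (k, g, E) minimizing E = ||x' - g*H*c_k||^2 with x' = x - b*y_T.

    Returns None if every codevector has zero filtered energy.
    """
    H = convolution_matrix(h)
    x2 = [xi - b * yi for xi, yi in zip(x, y_T)]   # updated target x'
    best = None
    for k, c in enumerate(codebook):
        z = matvec(H, c)                            # filtered codevector H*c_k
        energy = dot(z, z)
        if energy == 0.0:
            continue
        corr = dot(x2, z)
        g = corr / energy                           # optimal gain for this k
        err = dot(x2, x2) - corr * corr / energy    # E at the optimal gain
        if best is None or err < best[2]:
            best = (k, g, err)
    return best
```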
[0162] In the preferred embodiment of the present invention, the
innovative codebook search is performed in module 110 by means of
an algebraic codebook as described in U.S. Pat. No. 5,444,816
(Adoul et al.) issued on Aug. 22, 1995; U.S. Pat. No. 5,699,482
granted to Adoul et al., on Dec. 17, 1997; U.S. Pat. No. 5,754,976
granted to Adoul et al., on May 19, 1998; and U.S. Pat. No.
5,701,392 (Adoul et al.) dated Dec. 23, 1997.
[0163] Once the optimum excitation codevector c.sub.k and its gain
g are chosen by module 110, the codebook index k and gain g are
encoded and transmitted to multiplexer 112.
[0164] Referring to FIG. 1, the parameters b, T, j, Â(z), k and g
are multiplexed through the multiplexer 112 before being
transmitted through a communication channel.
[0165] Memory Update:
[0166] In memory module 111 (FIG. 1), the states of the weighted
synthesis filter W(z)/Â(z) are updated by filtering the excitation
signal u=gc.sub.k+bv.sub.T through the weighted synthesis filter.
After this filtering, the states of the filter are memorized and
used in the next subframe as initial states for computing the
zero-input response in calculator module 108.
[0167] As in the case of the target vector x, other alternative but
mathematically equivalent approaches well known to those of
ordinary skill in the art can be used to update the filter
states.
[0168] Decoder Side
[0169] The speech decoding device 200 of FIG. 2 illustrates the
various steps carried out between the digital input 222 (input
stream to the demultiplexer 217) and the output sampled speech 223
(output of the adder 221).
[0170] Demultiplexer 217 extracts the synthesis model parameters
from the binary information received from a digital input channel.
From each received binary frame, the extracted parameters are:
[0171] the short-term prediction parameters (STP) Â(z) (once per
frame);
[0172] the long-term prediction (LTP) parameters T, b, and j (for
each subframe); and
[0173] the innovation codebook index k and gain g (for each
subframe).
[0174] The current speech signal is synthesized based on these
parameters as will be explained hereinbelow.
[0175] The innovative codebook 218 is responsive to the index k to
produce the innovation codevector c.sub.k, which is scaled by the
decoded gain factor g through an amplifier 224. In the preferred
embodiment, an innovative codebook 218 as described in the above
mentioned U.S. Pat. Nos. 5,444,816; 5,699,482; 5,754,976; and
5,701,392 is used to represent the innovative codevector
c.sub.k.
[0176] The generated scaled codevector gc.sub.k at the output of
the amplifier 224 is processed through an innovation filter 205.
[0177] Periodicity Enhancement:
[0178] The generated scaled codevector at the output of the
amplifier 224 is processed through a frequency-dependent pitch
enhancer 205.
[0179] Enhancing the periodicity of the excitation signal u
improves the quality in case of voiced segments. This was done in
the past by filtering the innovation vector from the innovative
codebook (fixed codebook) 218 through a filter in the form
1/(1-.epsilon.bz.sup.-T) where .epsilon. is a factor below 0.5
which controls the amount of introduced periodicity. This approach
is less efficient in case of wideband signals since it introduces
periodicity over the entire spectrum. A new alternative approach,
which is part of the present invention, is disclosed whereby
periodicity enhancement is achieved by filtering the innovative
codevector c.sub.k from the innovative (fixed) codebook through an
innovation filter 205 (F(z)) whose frequency response emphasizes
the higher frequencies more than lower frequencies. The
coefficients of F(z) are related to the amount of periodicity in
the excitation signal u.
[0180] Many methods known to those skilled in the art are available
for obtaining valid periodicity coefficients. For example, the
value of gain b provides an indication of periodicity. That is, if
gain b is close to 1, the periodicity of the excitation signal u is
high, and if gain b is less than 0.5, then periodicity is low.
[0181] Another efficient way to derive the filter F(z) coefficients
used in a preferred embodiment, is to relate them to the amount of
pitch contribution in the total excitation signal u. This results
in a frequency response depending on the subframe periodicity,
where higher frequencies are more strongly emphasized (stronger
overall slope) for higher pitch gains. Innovation filter 205 has
the effect of lowering the energy of the innovative codevector
c.sub.k at low frequencies when the excitation signal u is more
periodic, which enhances the periodicity of the excitation signal u
at lower frequencies more than higher frequencies. Suggested forms
for innovation filter 205 are
F(z)=1-.sigma.z.sup.-1, (1)
or
F(z)=-.alpha.z+1-.alpha.z.sup.-1 (2)
[0182] where .sigma. or .alpha. are periodicity factors derived
from the level of periodicity of the excitation signal u.
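As an illustrative check (not part of the original disclosure), both suggested forms can be evaluated on the unit circle to see how they tilt the spectrum. The first line below is the three-term form, which is real-valued (zero phase); the second line is the squared magnitude of the two-term form.

```latex
F(e^{j\omega}) = -\alpha e^{j\omega} + 1 - \alpha e^{-j\omega} = 1 - 2\alpha\cos\omega,
\qquad F(e^{j0}) = 1 - 2\alpha, \quad F(e^{j\pi}) = 1 + 2\alpha .

|F(e^{j\omega})|^2 = 1 + \sigma^2 - 2\sigma\cos\omega \quad \text{for } F(z) = 1 - \sigma z^{-1},
\qquad |F(e^{j0})| = 1 - \sigma, \quad |F(e^{j\pi})| = 1 + \sigma \quad (0 \le \sigma \le 1).
```

For positive factors, both gains rise monotonically from the lowest to the highest frequency, so a larger periodicity factor gives a steeper high-frequency emphasis.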
[0183] The second three-term form of F(z) is used in a preferred
embodiment. The periodicity factor .alpha. is computed in the
voicing factor generator 204. Several methods can be used to derive
the periodicity factor .alpha. based on the periodicity of the
excitation signal u. Two methods are presented below.
[0184] Method 1:
[0185] The ratio of pitch contribution to the total excitation
signal u is first computed in voicing factor generator 204 by
$$R_p = \frac{b^2 v_T^t v_T}{u^t u} = \frac{b^2 \sum_{n=0}^{N-1} v_T^2(n)}{\sum_{n=0}^{N-1} u^2(n)}$$
[0186] where v.sub.T is the pitch codebook vector, b is the pitch
gain, and u is the excitation signal given at the output of the
adder 219 by
u=gc.sub.k+bv.sub.T
[0187] Note that the term bv.sub.T has its source in the pitch
codebook 201 in response to the pitch lag T and
the past value of u stored in memory 203. The pitch codevector
v.sub.T from the pitch codebook 201 is then processed through a
low-pass filter 202 whose cut-off frequency is adjusted by means of
the index j from the demultiplexer 217. The resulting codevector
v.sub.T is then multiplied by the gain b from the demultiplexer 217
through an amplifier 226 to obtain the signal bv.sub.T.
[0188] The factor .alpha. is calculated in voicing factor generator
204 by
.alpha.=qR.sub.p bounded by .alpha.<q
[0189] where q is a factor which controls the amount of enhancement
(q is set to 0.25 in this preferred embodiment).
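Method 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function names are hypothetical, vectors are plain lists of floats, and the bound on .alpha. is applied with `min`, so the factor can reach q itself.

```python
def pitch_ratio(v_t, u, b):
    """R_p = b^2 * sum(v_T^2) / sum(u^2); returns 0 for an all-zero excitation."""
    num = b * b * sum(x * x for x in v_t)
    den = sum(x * x for x in u)
    return num / den if den > 0.0 else 0.0


def alpha_method1(v_t, c_k, b, g, q=0.25):
    """Periodicity factor alpha = q * R_p, bounded above by q (Method 1)."""
    # Total excitation u = g*c_k + b*v_T, as formed at the adder.
    u = [g * c + b * v for c, v in zip(c_k, v_t)]
    r_p = pitch_ratio(v_t, u, b)
    # R_p can exceed 1 when the two contributions partly cancel,
    # which is why the upper bound is needed.
    return min(q * r_p, q)
```

With orthogonal contributions of equal energy, R.sub.p is 0.5 and .alpha. is 0.125; when the pitch contribution dominates or the two terms cancel, .alpha. saturates at q.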
[0190] Method 2:
[0191] Another method used in a preferred embodiment of the
invention for calculating periodicity factor .alpha. is discussed
below.
[0192] First, a voicing factor r.sub.v is computed in voicing
factor generator 204 by
r.sub.v=(E.sub.v-E.sub.c)/(E.sub.v+E.sub.c)
[0193] where E.sub.v is the energy of the scaled pitch codevector
bv.sub.T and E.sub.c is the energy of the scaled innovative
codevector gc.sub.k. That is,
$$E_v = b^2 v_T^t v_T = b^2 \sum_{n=0}^{N-1} v_T^2(n)$$
and
$$E_c = g^2 c_k^t c_k = g^2 \sum_{n=0}^{N-1} c_k^2(n).$$
[0194] Note that the value of r.sub.v lies between -1 and 1 (1
corresponds to purely voiced signals and -1 corresponds to purely
unvoiced signals).
[0195] In this preferred embodiment, the factor .alpha. is then
computed in voicing factor generator 204 by
.alpha.=0.125(1+r.sub.v)
[0196] which corresponds to a value of 0 for purely unvoiced
signals and 0.25 for purely voiced signals.
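The two steps of Method 2 can be sketched as follows. This is an illustrative Python sketch; the function names are hypothetical, and returning 0 for an all-zero subframe is an assumption that the text does not address.

```python
def voicing_factor(v_t, c_k, b, g):
    """r_v = (E_v - E_c) / (E_v + E_c), in the range [-1, 1]."""
    e_v = b * b * sum(v * v for v in v_t)   # energy of scaled pitch codevector
    e_c = g * g * sum(c * c for c in c_k)   # energy of scaled innovative codevector
    total = e_v + e_c
    # Assumption: treat a silent subframe as neutral (r_v = 0).
    return (e_v - e_c) / total if total > 0.0 else 0.0


def alpha_method2(r_v):
    """Periodicity factor alpha = 0.125 * (1 + r_v), in the range [0, 0.25]."""
    return 0.125 * (1.0 + r_v)
```

A purely voiced subframe (g = 0) gives r.sub.v = 1 and .alpha. = 0.25, while a purely unvoiced one (b = 0) gives r.sub.v = -1 and .alpha. = 0.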
[0197] In the first, two-term form of F(z), the periodicity factor
.sigma. can be approximated by using .sigma.=2.alpha. in methods 1
and 2 above. In such a case, the periodicity factor .sigma. is
calculated as follows in method 1 above:
.sigma.=2qR.sub.p bounded by .sigma.<2q.
[0198] In method 2, the periodicity factor .sigma. is calculated as
follows:
.sigma.=0.25(1+r.sub.v).
[0199] The enhanced signal c.sub.f is therefore computed by
filtering the scaled innovative codevector gc.sub.k through the
innovation filter 205 (F(z)).
[0200] The enhanced excitation signal u' is computed by the adder
220 as:
u'=c.sub.f+bv.sub.T
[0201] Note that this process is not performed at the encoder 100.
Thus, it is essential to update the content of the pitch codebook
201 using the excitation signal u without enhancement to keep
synchronism between the encoder 100 and decoder 200. Therefore, the
excitation signal u is used to update the memory 203 of the pitch
codebook 201 and the enhanced excitation signal u' is used at the
input of the LP synthesis filter 206.
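The split between the two excitation signals can be illustrated with a short Python sketch of the three-term filter. The function names are hypothetical, and treating samples outside the subframe as zero is an assumption made here for simplicity; the text does not specify the boundary handling.

```python
def enhance_innovation(c_scaled, alpha):
    """Apply F(z) = -alpha*z + 1 - alpha*z^-1 to the scaled innovation.

    Output: c_f(n) = c(n) - alpha * (c(n-1) + c(n+1)).
    Assumption: samples outside the subframe are taken as zero.
    """
    n = len(c_scaled)
    out = []
    for i in range(n):
        prev = c_scaled[i - 1] if i > 0 else 0.0
        nxt = c_scaled[i + 1] if i + 1 < n else 0.0
        out.append(c_scaled[i] - alpha * (prev + nxt))
    return out


def decode_excitation(c_k, v_t, g, b, alpha):
    """Return (u, u_enh): u updates the pitch codebook memory,
    u_enh feeds the LP synthesis filter."""
    scaled_c = [g * c for c in c_k]
    scaled_v = [b * v for v in v_t]
    u = [c + v for c, v in zip(scaled_c, scaled_v)]        # not enhanced
    c_f = enhance_innovation(scaled_c, alpha)
    u_enh = [c + v for c, v in zip(c_f, scaled_v)]         # enhanced
    return u, u_enh
```

Keeping `u` and `u_enh` as separate return values mirrors the requirement that only the non-enhanced signal is written back to the pitch codebook memory.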
[0202] Synthesis and Deemphasis
[0203] The synthesized signal s' is computed by filtering the
enhanced excitation signal u' through the LP synthesis filter 206,
which has the form 1/Â(z), where Â(z) is the interpolated LP filter
in the current subframe. As can be seen in FIG. 2, the quantized LP
coefficients Â(z) on line 225 from demultiplexer 217 are supplied to
the LP synthesis filter 206 to adjust the parameters of the LP
synthesis filter 206 accordingly. The deemphasis filter 207 is the
inverse of the preemphasis filter 103 of FIG. 1. The transfer
function of the deemphasis filter 207 is given by
D(z)=1/(1-.mu.z.sup.-1)
[0204] where .mu. is a preemphasis factor with a value located
between 0 and 1 (a typical value is .mu.=0.7). A higher-order
filter could also be used.
[0205] The vector s' is filtered through the deemphasis filter D(z)
(module 207) to obtain the vector s.sub.d, which is passed through
the high-pass filter 208 to remove the unwanted frequencies below
50 Hz and further obtain s.sub.h.
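The first-order deemphasis filter corresponds to the recursion y(n) = x(n) + .mu.y(n-1). The Python sketch below shows it next to the matching preemphasis so that the inverse relationship can be checked; the function names and the explicit memory argument carried between subframes are illustrative assumptions.

```python
def deemphasis(s, mu=0.7, mem=0.0):
    """D(z) = 1 / (1 - mu*z^-1): y(n) = x(n) + mu * y(n-1).

    `mem` is the last output of the previous subframe (assumed 0 at start).
    Returns the filtered samples and the updated memory.
    """
    out = []
    prev = mem
    for x in s:
        prev = x + mu * prev
        out.append(prev)
    return out, prev


def preemphasis(s, mu=0.7, mem=0.0):
    """P(z) = 1 - mu*z^-1: y(n) = x(n) - mu * x(n-1).

    `mem` is the last input of the previous subframe (assumed 0 at start).
    """
    out = []
    prev = mem
    for x in s:
        out.append(x - mu * prev)
        prev = x
    return out, prev
```

Passing a signal through `preemphasis` and then `deemphasis` with the same .mu. returns the original samples up to rounding error.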
[0206] Oversampling and High-Frequency Regeneration
[0207] The over-sampling module 209 conducts the inverse process of
the down-sampling module 101 of FIG. 1. In this preferred
embodiment, oversampling converts from the 12.8 kHz sampling rate
to the original 16 kHz sampling rate, using techniques well known
to those of ordinary skill in the art. The oversampled synthesis
signal is also referred to as the synthesized wideband intermediate
signal.
[0208] The oversampled synthesis signal does not contain the higher
frequency components which were lost by the downsampling process
(module 101 of FIG. 1) at the encoder 100. This gives a low-pass
perception to the synthesized speech signal. To restore the full
band of the original signal, a high frequency generation procedure
is disclosed. This procedure is performed in modules 210 to 216,
and adder 221, and requires input from voicing factor generator 204
(FIG. 2).
[0209] In this new approach, the high frequency contents are
generated by filling the upper part of the spectrum with a white
noise properly scaled in the excitation domain, then converted to
the speech domain, preferably by shaping it with the same LP
synthesis filter used for synthesizing the down-sampled signal.
[0210] The high frequency generation procedure in accordance with
the present invention is described hereinbelow.
[0211] The random noise generator 213 generates a white noise
sequence w' with a flat spectrum over the entire frequency
bandwidth, using techniques well known to those of ordinary skill
in the art. The generated sequence is of length N' which is the
subframe length in the original domain. Note that N is the subframe
length in the down-sampled domain. In this preferred embodiment,
N=64 and N'=80 which correspond to 5 ms.
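As a quick arithmetic check, using the 12.8 kHz down-sampled rate and the 16 kHz original rate of this embodiment, both subframe lengths span the same duration:

```latex
\frac{N}{f_s} = \frac{64}{12\,800\ \text{Hz}} = 5\ \text{ms}, \qquad
\frac{N'}{f_s'} = \frac{80}{16\,000\ \text{Hz}} = 5\ \text{ms}, \qquad
\frac{N'}{N} = \frac{16}{12.8} = \frac{5}{4}.
```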
[0212] The white noise sequence is properly scaled in the gain
adjusting module 214. Gain adjustment comprises the following
steps. First, the energy of the generated noise sequence w' is set
equal to the energy of the enhanced excitation signal u' computed
by an energy computing module 210, and the resulting scaled noise
sequence is given by
$$w(n) = w'(n) \sqrt{\frac{\sum_{n=0}^{N-1} u'^2(n)}{\sum_{n=0}^{N'-1} w'^2(n)}}, \quad n = 0, \ldots, N'-1.$$
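The first gain-adjustment step, which matches the noise energy to that of the enhanced excitation, can be sketched in Python as follows. The function name is hypothetical, and returning silence when the noise has zero energy is an assumption added to keep the sketch safe.

```python
import math


def scale_noise_to_excitation(noise, u_enh):
    """Scale w' (length N') so its energy equals that of u' (length N).

    Both energies are plain sums of squares over their own lengths, so the
    amplitude gain is the square root of the energy ratio.
    """
    e_u = sum(x * x for x in u_enh)
    e_w = sum(x * x for x in noise)
    if e_w == 0.0:
        # Assumption: an all-zero noise vector stays all-zero.
        return [0.0] * len(noise)
    gain = math.sqrt(e_u / e_w)
    return [gain * x for x in noise]
```

After this step the N'-sample noise sequence carries the same total energy as the N-sample enhanced excitation, whatever the amplitude of the generated noise.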
[0213] The second step in the gain scaling is to take into account
the high frequency contents of the synthesized signal at the output
of the voicing factor generator 204 so as to reduce the energy of
the generated noise in case of voiced segments (where less energy
is present at high frequencies compared to unvoiced segments). In
this preferred embodiment, measuring the high frequency contents is
implemented by measuring the tilt of the synthesis signal through a
spectral tilt calculator 212 and reducing the energy accordingly.
Other measurements such as zero crossing measurements can equally
be used. When the tilt is very strong, which corresponds to voiced
segments, the noise energy is further reduced. The tilt factor is
computed in module 212 as the first correlation coefficient of the
synthesis signal s.sub.h and it is given by:
$$\text{tilt} = \frac{\sum_{n=1}^{N-1} s_h(n)\, s_h(n-1)}{\sum_{n=0}^{N-1} s_h^2(n)}, \quad \text{conditioned by } \text{tilt} \geq 0 \text{ and } \text{tilt} \geq r_v,$$
[0214] where voicing factor r.sub.v is given by
r.sub.v=(E.sub.v-E.sub.c)/(E.sub.v+E.sub.c)
[0215] where E.sub.v is the energy of the scaled pitch codevector
bv.sub.T and E.sub.c is the energy of the scaled innovative
codevector gc.sub.k, as described earlier. Voicing factor r.sub.v
is most often less than tilt but this condition was introduced as a
precaution against high frequency tones where the tilt value is
negative and the value of r.sub.v is high. Therefore, this
condition reduces the noise energy for such tonal signals.
[0216] The tilt value is 0 in case of flat spectrum and 1 in case
of strongly voiced signals, and it is negative in case of unvoiced
signals where more energy is present at high frequencies.
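The tilt measurement can be illustrated with a short Python sketch. It reads the two conditions as lower bounds, meaning that the result is never below 0 and never below r.sub.v, which is how the precaution against high frequency tones is interpreted here. The function name and the zero-energy fallback are assumptions.

```python
def spectral_tilt(s_h, r_v):
    """First normalized correlation coefficient of s_h, floored at 0 and at r_v."""
    num = sum(s_h[n] * s_h[n - 1] for n in range(1, len(s_h)))
    den = sum(x * x for x in s_h)
    raw = num / den if den > 0.0 else 0.0   # assumption: silent frame -> 0
    # A high-frequency tone gives a negative raw value; a high voicing
    # factor r_v then takes over, so the noise energy is still reduced.
    return max(raw, 0.0, r_v)
```

A slowly varying signal yields a value near 1, while a signal alternating in sign at every sample yields a negative raw correlation, in which case the larger of 0 and r.sub.v is returned.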
[0217] Different methods can be used to derive the scaling factor
g.sub.t from the amount of high frequency contents. In this
invention, two methods are given based on the tilt of signal
described above.
[0218] Method 1:
[0219] The scaling factor g.sub.t is derived from the tilt by
g.sub.t=1-tilt bounded by 0.2.ltoreq.g.sub.t.ltoreq.1.0
[0220] For strongly voiced signals, where the tilt approaches 1,
g.sub.t is 0.2, and for strongly unvoiced signals g.sub.t becomes
1.0.
[0221] Method 2:
[0222] The tilt is first restricted to be greater than or
equal to zero, then the scaling factor g.sub.t is derived from the
tilt by
g.sub.t=10.sup.-0.6tilt
[0223] The scaled noise sequence w.sub.g produced in gain adjusting
module 214 is therefore given by:
w.sub.g=g.sub.tw.
[0224] When the tilt is close to zero, the scaling factor g.sub.t
is close to 1, which does not result in energy reduction. When the
tilt value is 1, the scaling factor g.sub.t results in a reduction
of 12 dB in the energy of the generated noise.
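Both scaling rules, and the 12 dB figure quoted for Method 2, can be checked with a small Python sketch. The function names are hypothetical; the decibel conversion treats g.sub.t as an amplitude gain applied to the noise samples.

```python
import math


def gain_method1(tilt):
    """g_t = 1 - tilt, bounded to the range [0.2, 1.0]."""
    return min(max(1.0 - tilt, 0.2), 1.0)


def gain_method2(tilt):
    """g_t = 10^(-0.6 * tilt), with tilt first restricted to be >= 0."""
    return 10.0 ** (-0.6 * max(tilt, 0.0))


def gain_to_db(g):
    """Express an amplitude gain in decibels (20 * log10)."""
    return 20.0 * math.log10(g)
```

With tilt = 1, `gain_method2` returns about 0.251, which `gain_to_db` converts to -12 dB; with tilt = 0 both methods return 1.0, so the noise is left unchanged.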
[0225] Once the noise is properly scaled (w.sub.g), it is brought
into the speech domain using the spectral shaper 215. In the
preferred embodiment, this is achieved by filtering the noise
w.sub.g through a bandwidth expanded version of the same LP
synthesis filter used in the down-sampled domain (1/Â(z/0.8)). The
corresponding bandwidth expanded LP filter coefficients are
calculated in spectral shaper 215.
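Bandwidth expansion by a factor of 0.8 amounts to multiplying the i-th LP coefficient by 0.8 raised to the power i. The Python sketch below illustrates this together with a direct-form all-pole filter. It assumes the coefficient convention A(z) = 1 + a.sub.1z.sup.-1 + ... with a[0] = 1, zero initial filter state, and hypothetical function names; none of these details are fixed by the text.

```python
def bandwidth_expand(a, gamma=0.8):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma**i (a[0] stays 1)."""
    return [coef * gamma ** i for i, coef in enumerate(a)]


def synthesis_filter(x, a):
    """All-pole filter 1/A(z): y(n) = x(n) - sum_{i>=1} a[i] * y(n-i).

    Assumption: a[0] == 1 and the filter starts from zero state.
    """
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * y[n - i]
        y.append(acc)
    return y


def shape_noise(w_g, a, gamma=0.8):
    """Filter the scaled noise through the bandwidth-expanded synthesis filter."""
    return synthesis_filter(w_g, bandwidth_expand(a, gamma))
```

Because every pole radius is scaled by 0.8, the expanded filter has broader, flatter resonances than the filter used for the down-sampled synthesis.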
[0226] The filtered scaled noise sequence is then band-pass
filtered to the required frequency range to be restored using the
band-pass filter 216. In the preferred embodiment, the band-pass
filter 216 restricts the noise sequence to the frequency range
5.6-7.2 kHz. The resulting band-pass filtered noise sequence z is
added in adder 221 to the oversampled synthesized speech signal to
obtain the final reconstructed sound signal s.sub.out on the output
223.
[0227] Although the present invention has been described
hereinabove by way of a preferred embodiment thereof, this
embodiment can be modified at will, within the scope of the
appended claims, without departing from the spirit and nature of
the subject invention. Even though the preferred embodiment
discusses the use of wideband speech signals, it will be obvious to
those skilled in the art that the subject invention is also
directed to other embodiments using wideband signals in general and
that it is not necessarily limited to speech applications.