U.S. patent application number 09/805634, for a multiple impulse excitation speech encoder and decoder, was filed with the patent office on 2001-03-14 and published on 2001-08-23.
Invention is credited to Daniel Lin and Brian M. McCarthy.
Application Number | 20010016812 09/805634 |
Document ID | / |
Family ID | 27379669 |
Filed Date | 2001-03-14 |
United States Patent
Application |
20010016812 |
Kind Code |
A1 |
Lin, Daniel ; et
al. |
August 23, 2001 |
Multiple impulse excitation speech encoder and decoder
Abstract
A version of a speech signal and an output of a pitch synthesis
filter and a linear predictive all-pole (LPC) filter is received. A
system impulse response is produced based on in part the received
pitch synthesis filter and LPC output. An excitation pulse location
is determined so that the determined location minimizes an error
between the speech signal version and the system impulse response.
The speech signal is encoded with a representation of the
determined location.
Inventors: |
Lin, Daniel; (Montville,
NJ) ; McCarthy, Brian M.; (Lafayette Hill,
PA) |
Correspondence
Address: |
VOLPE AND KOENIG, PC
DEPT ICC
SUITE 400, ONE PENN CENTER
1617 JOHN F. KENNEDY BOULEVARD
PHILADELPHIA
PA
19103
US
|
Family ID: |
27379669 |
Appl. No.: |
09/805634 |
Filed: |
March 14, 2001 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
09805634 | Mar 14, 2001 | |
09441743 | Nov 16, 1999 | 6223152 |
09441743 | Nov 16, 1999 | |
08950658 | Oct 15, 1997 | 6006174 |
08950658 | Oct 15, 1997 | |
08670986 | Jun 28, 1996 | |
08670986 | Jun 28, 1996 | |
08104174 | Aug 9, 1993 | |
08104174 | Aug 9, 1993 | |
07592330 | Oct 3, 1990 | 5235670 |
Current U.S.
Class: |
704/219 ;
704/E19.032 |
Current CPC
Class: |
G10L 25/90 20130101;
G10L 19/06 20130101; G10L 19/09 20130101; G10L 19/10 20130101; G10L
19/20 20130101 |
Class at
Publication: |
704/219 |
International
Class: |
G10L 019/04; G10L
019/08 |
Claims
What is claimed is:
1. A method for determining an excitation pulse location in a
speech signal for use in encoding the speech signal, the method
comprising: receiving a version of the speech signal and an output
of a pitch synthesis filter and a linear predictive all-pole (LPC)
filter; producing a system impulse response based on in part the
received pitch synthesis filter and LPC filter output; determining
an excitation pulse location so that the determined location
minimizes an error between the speech signal version and the system
impulse response; and encoding the speech signal with a
representation of the determined location.
2. The method of claim 1 further comprising determining an
excitation pulse weight associated with the determined location so
that the determined location weighted by the determined weight
minimizes the error.
3. The method of claim 2 further comprising determining a plurality
of additional excitation pulse locations and weights by minimizing
a remaining error between the speech signal subtracted by any
previously determined location weighted by its associated
excitation pulse weight and the system impulse response.
4. The method of claim 3 wherein the plurality of additional
locations numbers four and the encoding the speech signal further
comprises encoding the speech signal with a representation of the
four additional locations.
5. The method of claim 1 wherein the error minimizing is performed
by determining a minimum mean-squared error.
6. The method of claim 1 wherein the producing the system impulse
response is based on in part a concatenation of the pitch synthesis
filter and the LPC filter output.
7. The method of claim 1 wherein the pitch synthesis filter output
is a 3-tap pitch synthesis filter output.
8. A speech encoding system for use in determining an excitation
pulse location in a speech signal for use in encoding the speech
signal, the system comprising: a generate impulse response block
for receiving an output of a pitch synthesis filter and a linear
predictive all-pole (LPC) filter and producing a system impulse
response; a multipulse analysis block for receiving a version of
the speech signal and the system impulse response and determining
an excitation pulse location so that the determined location
minimizes an error between the speech signal version and the system
impulse response; and a bit packing block for encoding the speech
signal with a representation of the determined location.
9. The system of claim 8 wherein the multipulse analysis block for
determining an excitation pulse weight associated with the
determined location so that the determined location weighted by the
determined excitation pulse weight minimizes the error.
10. The system of claim 9 wherein the multipulse analysis block for
determining a plurality of additional excitation locations and
associated weights by minimizing a remaining error between the
speech signal subtracted by any previously determined location
weighted by its associated weight and the system impulse
response.
11. The system of claim 10 wherein the plurality of additional
locations numbers four and the encoding the speech signal further
comprises encoding the speech signal with a representation of the
four additional locations.
12. The system of claim 8 wherein the error minimizing is performed
by determining a minimum mean-squared error.
13. The system of claim 8 wherein the producing the system impulse
response is based on in part a concatenation of the pitch synthesis
filter and the LPC filter output.
14. The system of claim 8 wherein the pitch synthesis filter output
is an output of a 3-tap pitch synthesis filter.
15. The system of claim 8 further comprising a perceptually weight
speech block for perceptually weighting a sampled speech signal as
the version of the speech signal.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 09/441,743, filed Nov. 16, 1999, which is a
continuation of U.S. patent application Ser. No. 08/950,658, filed
Oct. 15, 1997, now U.S. Pat. No. 6,006,174, which is a file wrapper
continuation of U.S. patent application Ser. No. 08/670,986, filed
Jun. 28, 1996, which is a file wrapper continuation of U.S. patent
application Ser. No. 08/104,174, filed Aug. 9, 1993, which is a
continuation of U.S. patent application Ser. No. 07/592,330, filed
Oct. 3, 1990, now U.S. Pat. No. 5,235,670.
BACKGROUND
[0002] This invention relates to digital voice coders performing at
relatively low voice rates but maintaining high voice quality. In
particular, it relates to improved multipulse linear predictive
voice coders.
[0003] The multipulse coder incorporates the linear predictive
all-pole filter (LPC filter). The basic function of a multipulse
coder is finding a suitable excitation pattern for the LPC all-pole
filter which produces an output that closely matches the original
speech waveform. The excitation signal is a series of weighted
impulses. The weight values and impulse locations are found in a
systematic manner. The selection of a weight and location of an
excitation impulse is obtained by minimizing an error criterion
between the all-pole filter output and the original speech signal.
Some multipulse coders incorporate a perceptual weighting filter in
the error criterion function. This filter serves to frequency
weight the error, which in essence allows more error in the formant
regions of the speech signal and less in low energy portions of the
spectrum. Incorporation of pitch filters improves the performance
of multipulse speech coders. This is done by modeling the long term
redundancy of the speech signal, thereby allowing the excitation
signal to account for the pitch related properties of the
signal.
SUMMARY
[0004] A version of a speech signal and an output of a pitch
synthesis filter and a linear predictive all-pole (LPC) filter is
received. A system impulse response is produced based on in part
the received pitch synthesis filter and LPC output. An excitation
pulse location is determined so that the determined location
minimizes an error between the speech signal version and the system
impulse response. The speech signal is encoded with a
representation of the determined location.
BRIEF DESCRIPTION OF THE DRAWING(S)
[0005] FIG. 1 is a block diagram of an 8 kbps multipulse LPC speech
coder.
[0006] FIG. 2 is a block diagram of a sample/hold and A/D circuit
used in the system of FIG. 1.
[0007] FIG. 3 is a block diagram of the spectral whitening circuit
of FIG. 1.
[0008] FIG. 4 is a block diagram of the perceptual speech weighting
circuit of FIG. 1.
[0009] FIG. 5 is a block diagram of the reflection coefficient
quantization circuit of FIG. 1.
[0010] FIG. 6 is a block diagram of the LPC interpolation/weighting
circuit of FIG. 1.
[0011] FIG. 7 is a flow chart diagram of the pitch analysis block
of FIG. 1.
[0012] FIG. 8 is a flow chart diagram of the multipulse analysis
block of FIG. 1.
[0013] FIG. 9 is a block diagram of the impulse response generator
of FIG. 1.
[0014] FIG. 10 is a block diagram of the perceptual synthesizer
circuit of FIG. 1.
[0015] FIG. 11 is a block diagram of the ringdown generator circuit
of FIG. 1.
[0016] FIG. 12 is a diagrammatic view of the factorial tables
address storage used in the system of FIG. 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)
[0017] This invention incorporates improvements to the prior art of
multipulse coders, specifically, a new type LPC spectral
quantization, pitch filter implementation, incorporation of pitch
synthesis filter in the multipulse analysis, and excitation
encoding/decoding.
[0018] Shown in FIG. 1 is a block diagram of an 8 kbps multipulse
LPC speech coder, generally designated 10.
[0019] It comprises a pre-emphasis block 12 to receive the speech
signals s(n). The pre-emphasized signals are applied to an LPC
analysis block 14 as well as to a spectral whitening block 16 and
to a perceptually weighted speech block 18.
[0020] The output of the block 14 is applied to a reflection
coefficient quantization and LPC conversion block 20, whose output
is applied both to the bit packing block 22 and to an LPC
interpolation/weighting block 24.
[0021] The output from block 20 to block 24 is indicated at $\alpha$
and the outputs from block 24 are indicated at $\alpha$,
$\alpha^1$ and at $\alpha\rho$, $\alpha^1\rho$.
[0022] The signal $\alpha$, $\alpha^1$ is applied to the spectral
whitening block 16 and the signal $\alpha\rho$, $\alpha^1\rho$
is applied to the impulse generation block 26.
[0023] The output of spectral whitening block 16 is applied to the
pitch analysis block 28, whose output is applied to quantizer block
30. The quantized output $\hat{p}$ from the quantizer is
applied to the bit packer 22 and also as a second input to the
impulse response generation block 26. The output of block 26,
indicated at h(n), is applied to the multipulse analysis block
32.
[0024] The perceptual weighting block 18 receives both outputs from
block 24 and its output, indicated at Sp(n), is applied to an adder
34 which also receives the output r(n) from a ringdown generator
36. The ringdown component r(n) is a fixed signal due to the
contributions of the previous frames. The output x(n) of the adder
34 is applied as a second input to the multipulse analysis block
32. The two outputs of the multipulse analysis block 32 are fed
to the bit packing block 22.
[0025] The signals $\alpha$, $\alpha^1$, p and the multipulse
analysis outputs are fed to the
perceptual synthesizer block 38 whose output y(n), comprising the
combined weighted reflection coefficients, quantized spectral
coefficients and multipulse analysis signals of previous frames, is
applied to the block delay N/2 40. The output of block 40 is
applied to the ringdown generator 36.
[0026] The output of the block 22 is fed to the
synthesizer/postfilter 42.
[0027] The operation of the aforesaid system is described as
follows: The original speech is digitized using sample/hold and A/D
circuitry 44 comprising a sample and hold block 46 and an analog to
digital block 48. (FIG. 2). The sampling rate is 8 kHz. The
digitized speech signal, s(n), is analyzed on a block basis,
meaning that before analysis can begin, N samples of s(n) must be
acquired. Once a block of speech samples s(n) is acquired, it is
passed to the preemphasis filter 12 which has a z-transform
function
$$P(z) = 1 - \alpha z^{-1} \tag{1}$$
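As an illustration of eq. (1), the pre-emphasis filter can be sketched as a first-order difference applied to one block of samples. This is a minimal sketch, not the patent's implementation: the function name `pre_emphasize` and the default value of `alpha` are assumptions, since the text does not give a numeric value for the coefficient.

```python
def pre_emphasize(samples, alpha=0.9):
    """Apply P(z) = 1 - alpha * z^-1 to a block of samples.

    alpha=0.9 is a hypothetical default; the source gives no value.
    The sample before the block is assumed to be zero.
    """
    out = []
    prev = 0.0
    for s in samples:
        out.append(s - alpha * prev)  # y(n) = s(n) - alpha * s(n-1)
        prev = s
    return out
```

For a constant input the output settles to `1 - alpha` times the input level, which shows the low-frequency attenuation the filter introduces.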
[0028] The pre-emphasized block is then passed to the LPC analysis block 14, from which
the signal K is fed to the reflection coefficient quantizer and LPC
converter block 20 (shown in detail in FIG. 5). The LPC
analysis block 14 produces LPC reflection coefficients which are
related to the all-pole filter coefficients. The reflection
coefficients are then quantized in block 20 in the manner shown in
detail in FIG. 5 wherein two sets of quantizer tables are
previously stored. One set has been designed using training
databases based on voiced speech, while the other has been designed
using unvoiced speech. The reflection coefficients are quantized
twice; once using the voiced quantizer 48 and once using the
unvoiced quantizer 50. Each quantized set of reflection
coefficients is converted to its respective spectral coefficients,
as at 52 and 54, which, in turn, enables the computation of the
log-spectral distance between the unquantized spectrum and the
quantized spectrum. The set of quantized reflection coefficients
which produces the smaller log-spectral distance shown at 56, is
then retained. The retained reflection coefficient parameters are
encoded for transmission and also converted to the corresponding
all-pole LPC filter coefficients in block 58.
[0029] Following the reflection quantization and LPC coefficient
conversion, the LPC filter parameters are interpolated using the
scheme described herein. As previously discussed, LPC analysis is
performed on speech of block length N which corresponds to N/8000
seconds (sampling rate=8000 Hz). Therefore, a set of filter
coefficients is generated for every N samples of speech or every
N/8000 sec.
[0030] In order to enhance spectral trajectory tracking, the LPC
filter parameters are interpolated on a sub-frame basis at block 24
where the sub-frame rate is twice the frame rate. The interpolation
scheme is implemented (as shown in detail in FIG. 6) as follows:
let the LPC filter coefficients for frame k-1 be $\underline{\alpha}^0$ and
for frame k be $\underline{\alpha}^1$. The filter coefficients for the first
sub-frame of frame k are then
$$\underline{\alpha} = (\underline{\alpha}^0 + \underline{\alpha}^1)/2 \tag{2}$$
[0031] and the $\underline{\alpha}^1$ parameters are applied to the second
sub-frame. Therefore a different set of LPC filter parameters is
available every 0.5*(N/8000) sec.
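The sub-frame interpolation described above can be sketched as follows. The function name `subframe_coefficients` and the list-based coefficient representation are assumptions made for illustration only.

```python
def subframe_coefficients(prev_frame, cur_frame):
    """Return the LPC coefficient sets for the two sub-frames of frame k.

    prev_frame: coefficients of frame k-1 (alpha^0 in the text)
    cur_frame:  coefficients of frame k   (alpha^1 in the text)
    """
    # First sub-frame: element-wise average of the two frames, as in eq. (2).
    first = [(a0 + a1) / 2.0 for a0, a1 in zip(prev_frame, cur_frame)]
    # Second sub-frame: the current frame's coefficients, unchanged.
    second = list(cur_frame)
    return first, second
```

With a frame of N samples at 8 kHz, each returned set would apply to N/2 samples.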
[0032] Pitch Analysis
[0033] Prior methods of pitch filter implementation for multipulse
LPC coders have focused on closed loop pitch analysis methods (U.S.
Pat. No. 4,701,954). However, such closed loop methods are
computationally expensive. In the present invention the pitch
analysis procedure indicated by block 28, is performed in an open
loop manner on the speech spectral residual signal. Open loop
methods have reduced computational requirements. The spectral
residual signal is generated using the inverse LPC filter which can
be represented in the z-transform domain as A(z); A(z)=1/H(z) where
H(z) is the LPC all-pole filter. This is known as spectral
whitening and is represented by block 16. This block 16 is shown in
detail in FIG. 3. The spectral whitening process removes the
short-time sample correlation which in turn enhances pitch
analysis.
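A minimal sketch of spectral whitening by inverse LPC filtering follows. It assumes the all-pole filter has the form H(z) = 1/(1 - sum_i a_i z^-i), so that the inverse filter is A(z) = 1 - sum_i a_i z^-i; this sign convention, the function name `whiten`, and the zero initial state are assumptions not stated in the text.

```python
def whiten(signal, lpc):
    """Compute the spectral residual e(n) = s(n) - sum_i lpc[i-1] * s(n-i).

    Samples before the start of the block are assumed to be zero.
    """
    residual = []
    for n, s in enumerate(signal):
        # Short-term prediction from the preceding samples.
        prediction = sum(
            a * signal[n - i]
            for i, a in enumerate(lpc, start=1)
            if n - i >= 0
        )
        residual.append(s - prediction)
    return residual
```

Feeding in the impulse response of a first-order all-pole filter returns a single impulse, which is the removal of short-time correlation that the paragraph describes.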
[0034] A flow chart diagram of the pitch analysis block 28 of FIG.
1 is shown in FIG. 7. The first step in the pitch analysis process
is the collection of N samples of the spectral residual signal.
This spectral residual signal is obtained from the pre-emphasized
speech signal by the method illustrated in FIG. 3. These residual
samples are appended to the prior K retained residual samples to
form a segment, r(n), where $-K < n < N$.
[0035] The autocorrelation Q(i) is performed for
$\pi_l \le i \le \pi_h$, or
$$Q(i) = \sum_{n=-K}^{N} r(n)\, r(n-i), \qquad \pi_l \le i \le \pi_h \tag{3}$$
[0036] The limits of i are arbitrary, but for speech sounds a
typical range is between 20 and 147 (assuming 8 kHz sampling). The
next step is to search Q(i) for the maximum value, $M_1$, where
$$M_1 = \max(Q(i)) = Q(k_1) \tag{4}$$
[0037] The value $k_1$ is stored and $Q(k_1-1)$, $Q(k_1)$ and
$Q(k_1+1)$ are set to a large negative value.
[0038] We next find a second value $M_2$, where
$$M_2 = \max(Q(i)) = Q(k_2) \tag{5}$$
[0039] The values $k_1$ and $k_2$ correspond to delay values
that produce the two largest correlation values. The values $k_1$
and $k_2$ are used to check for pitch period doubling. The
following algorithm is employed: if $|k_2 - 2k_1| < C$, where C can
be chosen to be equal to the
number of taps (3 in this invention), then the delay value, D, is
equal to $k_2$; otherwise $D = k_1$. Once the frame delay value,
D, is chosen, the 3-tap gain terms are solved by first computing the
matrix and vector values in eq. (6), where each sum runs over the
samples n of the segment, i is the chosen delay, and $b_1$, $b_2$,
$b_3$ are the three tap gains:
$$\begin{bmatrix} \sum r(n)\,r(n-i-1) \\ \sum r(n)\,r(n-i) \\ \sum r(n)\,r(n-i+1) \end{bmatrix} = \begin{bmatrix} \sum r(n-i-1)\,r(n-i-1) & \sum r(n-i)\,r(n-i-1) & \sum r(n-i+1)\,r(n-i-1) \\ \sum r(n-i-1)\,r(n-i) & \sum r(n-i)\,r(n-i) & \sum r(n-i+1)\,r(n-i) \\ \sum r(n-i-1)\,r(n-i+1) & \sum r(n-i)\,r(n-i+1) & \sum r(n-i+1)\,r(n-i+1) \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \tag{6}$$
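The open-loop delay selection described above (autocorrelation, two-peak search, and the doubling check) can be sketched as follows. The function name `select_delay`, the flat list holding the K retained samples followed by the N new ones, and the skipping of products that would reach before the start of that list are assumptions made for illustration.

```python
def select_delay(residual, min_lag=20, max_lag=147, taps=3):
    """Pick the frame delay D from the residual segment.

    residual holds the retained samples followed by the new block.
    """
    # Autocorrelation Q(i) over the candidate lag range.
    q = {
        lag: sum(residual[n] * residual[n - lag]
                 for n in range(lag, len(residual)))
        for lag in range(min_lag, max_lag + 1)
    }
    k1 = max(q, key=q.get)  # lag of the largest correlation
    # Mask the first peak and its two neighbours.
    for lag in (k1 - 1, k1, k1 + 1):
        if lag in q:
            q[lag] = float("-inf")
    k2 = max(q, key=q.get)  # lag of the second largest correlation
    # Rule as stated in the text: D = k2 if |k2 - 2*k1| < C, else D = k1.
    return k2 if abs(k2 - 2 * k1) < taps else k1
```

For an impulse train with a period of 40 samples, the two largest peaks fall at lags 40 and 80, so the rule as written returns 80.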
[0040] The matrix is solved using the Cholesky matrix
decomposition. Once the gain values are calculated, they are
quantized using a 32 word vector codebook. The codebook index along
with the frame delay parameter are transmitted. The symbol $\hat{p}$
signifies the quantized delay value and the index of the gain
codebook.
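A minimal sketch of solving for the three tap gains with a Cholesky decomposition is shown below. It assumes the correlation terms are sums over the analysis samples and that the matrix is positive definite; the names `cholesky_solve` and `pitch_gains` and the `start` parameter are hypothetical, and gain quantization with the 32 word codebook is not shown.

```python
import math


def cholesky_solve(a, y):
    """Solve a @ x = y for a symmetric positive definite matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # Forward substitution: low @ z = y.
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    # Back substitution: low^T @ x = z.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def pitch_gains(r, delay, start):
    """Return gains for r(n-D-1), r(n-D), r(n-D+1); needs start >= delay + 1."""
    offsets = (delay + 1, delay, delay - 1)
    idx = range(start, len(r))
    phi = [[sum(r[n - a] * r[n - b] for n in idx) for b in offsets]
           for a in offsets]
    rhs = [sum(r[n] * r[n - a] for n in idx) for a in offsets]
    return cholesky_solve(phi, rhs)
```

If the residual is exactly predictable from three delayed taps, the solver recovers those tap gains up to rounding error.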
[0041] Excitation Analysis
[0042] Multipulse's name stems from the operation of exciting a
vocal tract model with multiple impulses. A location and amplitude
of an excitation pulse is chosen by minimizing the mean-squared
error between the real and synthetic speech signals. This system
incorporates the perceptual weighting filter 18. A detailed flow
chart of the multipulse analysis is shown in FIG. 8. The method of
determining a pulse location and amplitude is accomplished in a
systematic manner. The basic algorithm can be described as follows:
let h(n) be the system impulse response of the pitch analysis
filter and the LPC analysis filter in cascade; the synthetic speech
is the system's response to the multipulse excitation. This is
indicated as the excitation convolved with the system response, or
$$\hat{s}(n) = \sum_{k=1}^{n} ex(k)\, h(n-k) \tag{7}$$
[0043] where ex(n) is a set of weighted impulses located at
positions $n_1, n_2, \ldots, n_j$, or
$$ex(n) = \beta_1\delta(n-n_1) + \beta_2\delta(n-n_2) + \cdots + \beta_j\delta(n-n_j) \tag{8}$$
[0044] The synthetic speech can be re-written as
$$\hat{s}(n) = \sum_{i=1}^{j} \beta_i\, h(n-n_i) \tag{9}$$
[0045] In the present invention, the excitation pulse search is
performed one pulse at a time, therefore j=1. The error between the
real and synthetic speech is
$$e(n) = s_p(n) - \hat{s}(n) - r(n) \tag{10}$$
[0046] The squared error is
$$E = \sum_{n=1}^{N} e^2(n) \tag{11}$$
or
$$E = \sum_{n=1}^{N} \left(s_p(n) - \hat{s}(n) - r(n)\right)^2 \tag{12}$$
[0047] where $s_p(n)$ is the original speech after pre-emphasis
and perceptual weighting (FIG. 4) and r(n) is a fixed signal
component due to the previous frames' contributions and is referred
to as the ringdown component.
[0048] FIGS. 10 and 11 show the manner in which this signal is
generated, FIG. 10 illustrating the perceptual synthesizer 38 and
FIG. 11 illustrating the ringdown generator 36. The squared error
is now written as
$$E = \sum_{n=1}^{N} \left(x(n) - \beta_1 h(n-n_1)\right)^2 \tag{13}$$
[0049] where x(n) is the speech signal $s_p(n) - r(n)$ as shown in
FIG. 1. Expanding the square gives
$$E = S - 2\beta_1 C + \beta_1^2 H \tag{14}$$
[0050] where
$$C = \sum_{n=1}^{N} x(n)\, h(n-n_1) \tag{15}$$
and
$$S = \sum_{n=1}^{N} x^2(n) \tag{16}$$
and
$$H = \sum_{n=1}^{N} h(n-n_1)\, h(n-n_1) \tag{17}$$
[0051] The error, E, is minimized by setting $dE/d\beta_1 = 0$, or
$$dE/d\beta_1 = -2C + 2H\beta_1 = 0 \tag{18}$$
[0052] or
$$\beta_1 = C/H \tag{19}$$
[0053] The error, E, can then be written as
$$E = S - C^2/H \tag{20}$$
[0054] From the above equations it is evident that two signals are
required for multipulse analysis, namely h(n) and x(n). These two
signals are input to the multipulse analysis block 32.
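The single-pulse search implied by these equations can be sketched as below: for each candidate location compute C and H, take the weight C/H, and keep the location with the smallest error S - C^2/H. The function name `best_single_pulse`, the 0-based locations, and the truncation of h(n) at the end of the frame are assumptions made for illustration.

```python
def best_single_pulse(x, h):
    """Return (location, weight, error) of the best single excitation pulse.

    x: target signal for the frame, h: system impulse response.
    Returns None if no location has nonzero impulse response energy.
    """
    s = sum(v * v for v in x)  # S, the energy of the target
    best = None
    for loc in range(len(x)):
        span = range(loc, min(len(x), loc + len(h)))
        c = sum(x[n] * h[n - loc] for n in span)       # C, cross-correlation
        energy = sum(h[n - loc] ** 2 for n in span)    # H, response energy
        if energy == 0.0:
            continue
        err = s - c * c / energy                       # E = S - C^2 / H
        if best is None or err < best[2]:
            best = (loc, c / energy, err)              # weight = C / H
    return best
```

When the target is an exact scaled and shifted copy of h(n), the search returns that shift and scale with zero error.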
[0055] The first step in excitation analysis is to generate the
system impulse response. The system impulse response is the
concatenation of the 3-tap pitch synthesis filter and the LPC
weighted filter. The impulse response filter has the z-transform:
$$H_p(z) = \frac{1}{1 - \sum_{i=1}^{3} b_i z^{-\tau-i}} \cdot \frac{1}{1 - \sum_{i=1}^{P} \alpha_i \mu^i z^{-i}} \tag{20}$$
[0056] The b values are the pitch gain coefficients, the $\alpha$
values are the spectral filter coefficients, and $\mu$ is a filter
weighting coefficient; $\tau$ denotes the pitch delay term and P the
order of the spectral filter. The error signal, e(n), can be written in
the z-transform domain as
$$E(z) = X(z) - \beta H_p(z) z^{-n_1} \tag{21}$$
[0057] where X(z) is the z-transform of x(n) previously
defined.
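One way to generate h(n) is to drive the two recursive filters in cascade with a unit impulse. The sketch below assumes a pitch synthesis filter of the form 1/(1 - sum_i b_i z^-(tau+i)) followed by a weighted LPC filter of the form 1/(1 - sum_i a_i mu^i z^-i); the exact delay offset and the sign conventions are assumptions, as are the function and parameter names.

```python
def impulse_response(pitch_gains, tau, lpc, mu, length):
    """Return the first `length` samples of the cascaded impulse response."""
    # Stage 1: pitch synthesis, u(n) = d(n) + sum_i b_i * u(n - tau - i).
    u = []
    for n in range(length):
        v = 1.0 if n == 0 else 0.0
        for i, b in enumerate(pitch_gains, start=1):
            if n - tau - i >= 0:
                v += b * u[n - tau - i]
        u.append(v)
    # Stage 2: weighted LPC, h(n) = u(n) + sum_i a_i * mu^i * h(n - i).
    h = []
    for n in range(length):
        v = u[n]
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                v += a * (mu ** i) * h[n - i]
        h.append(v)
    return h
```

With the pitch gains set to zero the result reduces to the impulse response of the weighted LPC filter alone, which is a convenient sanity check.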
[0058] The impulse response weight $\beta$ and impulse response time
shift location $n_l$ are computed by minimizing the energy of the
error signal, e(n). The time shift variable $n_l$ (l=1 for the first
pulse) is now varied from 1 to N. The value of $n_1$ is chosen
such that it produces the smallest energy error E. Once $n_1$ is
found, $\beta_1$ can be calculated. Once the first location,
$n_1$, and impulse weight, $\beta_1$, are determined, the
synthetic signal is written as
$$\hat{s}(n) = \beta_1 h(n-n_1) \tag{22}$$
[0059] When two weighted impulses are considered in the excitation
sequence, the error energy can be written as
$$E = \sum \left(x(n) - \beta_1 h(n-n_1) - \beta_2 h(n-n_2)\right)^2$$
[0060] Since the first pulse weight and location are known, the
equation is rewritten as
$$E = \sum \left(x'(n) - \beta_2 h(n-n_2)\right)^2 \tag{23}$$
[0061] where
$$x'(n) = x(n) - \beta_1 h(n-n_1) \tag{24}$$
[0062] The procedure for determining $\beta_2$ and $n_2$ is
identical to that of determining $\beta_1$ and $n_1$. This
procedure can be repeated p times. In the present invention p=5.
The excitation pulse locations are encoded using an enumerative
encoding scheme.
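The sequential search described above can be sketched as a loop that finds one pulse, subtracts its contribution from the target, and repeats. The function name `multipulse_analysis`, the 0-based locations, and the truncation of h(n) at the frame boundary are assumptions; minimizing E = S - C^2/H is implemented as maximizing C^2/H, which is equivalent because S does not depend on the location.

```python
def multipulse_analysis(x, h, num_pulses=5):
    """Return ([(location, weight), ...], residual) for one frame."""
    target = list(x)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for loc in range(len(target)):
            span = range(loc, min(len(target), loc + len(h)))
            c = sum(target[n] * h[n - loc] for n in span)
            energy = sum(h[n - loc] ** 2 for n in span)
            # Largest C^2 / H gives the smallest remaining error.
            if energy > 0.0 and (best is None or c * c / energy > best[0]):
                best = (c * c / energy, loc, c / energy)
        if best is None:
            break
        _, loc, beta = best
        pulses.append((loc, beta))
        # Remove this pulse's contribution: x'(n) = x(n) - beta * h(n - loc).
        for n in range(loc, min(len(target), loc + len(h))):
            target[n] -= beta * h[n - loc]
    return pulses, target
```

For a target built from two non-overlapping weighted copies of h(n), two iterations recover both pulses and leave a zero residual.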
[0063] Excitation Encoding
[0064] A normal encoding scheme for 5 pulse locations would take
$5 \cdot \mathrm{Int}(\log_2 N + 0.5)$ bits, where N is the number of possible
locations. For p=5 and N=80, 35 bits are required. The approach
taken here is to employ an enumerative encoding scheme. For the
same conditions, the number of bits required is 25 bits. The first
step is to order the pulse locations (i.e.
$0 \le L1 \le L2 \le L3 \le L4 \le L5 \le N-1$ where
$L1 = \min(n_1, n_2, n_3, n_4, n_5)$ etc.). The 25
bit number, B, is:
$$B = \binom{L1}{1} + \binom{L2}{2} + \binom{L3}{3} + \binom{L4}{4} + \binom{L5}{5}$$
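The enumerative code can be illustrated with a short sketch that computes the binomial coefficients directly rather than from a table. It assumes five distinct 0-based locations in the range 0 to 79, which is what makes the code unique; the function name `encode_locations` is hypothetical.

```python
from math import comb


def encode_locations(locations):
    """Map a set of distinct pulse locations to a single integer B."""
    ordered = sorted(locations)
    # B = C(L1, 1) + C(L2, 2) + ... + C(Lp, p), with L1 the smallest location.
    return sum(comb(loc, k) for k, loc in enumerate(ordered, start=1))
```

With 80 possible locations the largest code is C(80,5) - 1 = 24,040,015, which fits in 25 bits, compared with 35 bits when each of the five locations is sent as a separate 7-bit field.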
[0065] Computing the 5 sets of factorials is prohibitive on a DSP
device, therefore the approach taken here is to pre-compute the
values and store them on a DSP ROM. This is shown in FIG. 12. Many
of the numbers require double precision (32 bits). A quick
calculation yields a required storage (for N=80) of 790 words
((N-1)*2*5). This amount of storage can be reduced by first
realizing that
$$\binom{L1}{1}$$
[0066] is simply L1; therefore no storage is required. Secondly,
$$\binom{L2}{2}$$
[0067] contains only single precision numbers; therefore storage
can be reduced to 553 words. The code is written such that the five
addresses are computed from the pulse locations starting with the
5th location (assuming pulse locations range from 1 to 80). The
address of the 5th pulse is 2*L5+393. The factor of 2 is due to
double precision storage of L5's elements. The address of L4 is
2*L4+235, for L3, 2*L3+77, for L2, L2-1. The numbers stored at
these locations are added and a 25-bit number representing the
unique set of locations is produced. A block diagram of the
enumerative encoding schemes is listed.
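The table layout implied by these addresses can be sketched as follows. The sketch assumes 79 entries per pulse position (locations 1 to 79), 16-bit words, and high-word-first storage for the double precision entries; the word order and the names `build_table` and `lookup` are assumptions, not details given in the text.

```python
from math import comb


def build_table(n_locations=80):
    """Build the ROM image: C(L,2) as single words, C(L,3..5) as word pairs."""
    table = []
    for loc in range(1, n_locations):        # address L - 1
        table.append(comb(loc, 2))
    for k in (3, 4, 5):                      # bases 77, 235, 393
        for loc in range(1, n_locations):
            value = comb(loc, k)
            table.append(value >> 16)        # high word (assumed order)
            table.append(value & 0xFFFF)     # low word
    return table


def lookup(table, k, loc):
    """Return C(loc, k) using the address rules quoted in the text."""
    if k == 1:
        return loc                           # C(L1, 1) needs no storage
    if k == 2:
        return table[loc - 1]
    base = {3: 77, 4: 235, 5: 393}[k]
    addr = 2 * loc + base
    return (table[addr] << 16) | table[addr + 1]
```

Under these assumptions the table occupies 79 + 3*158 = 553 words, which matches the figure quoted above.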
[0068] Excitation Decoding
[0069] Decoding the 25-bit word at the receiver involves repeated
subtractions. For example, given B is the 25-bit word, the 5th
location is found by finding the value X such that
$$B - \binom{79}{5} < 0, \qquad B - \binom{X}{5} < 0, \qquad B - \binom{X-1}{5} \ge 0$$
[0070] then L5=X-1. Next let
$$B = B - \binom{L5}{5}$$
[0071] The fourth pulse location is found by finding a value X such
that
$$B - \binom{L5-1}{4} < 0, \qquad B - \binom{X}{4} < 0, \qquad B - \binom{X-1}{4} \ge 0$$
[0072] then L4=X-1. This is repeated for L3 and L2. The remaining
number is L1.
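The repeated-subtraction decoding can be sketched as below, again computing binomial coefficients directly instead of reading a table. The sketch assumes distinct 0-based locations in the range 0 to 79; `decode_locations` is a hypothetical name, and `encode_locations` is included only so the round trip can be checked.

```python
from math import comb


def encode_locations(locations):
    """Enumerative code B for a set of distinct locations (for round trips)."""
    return sum(comb(loc, k) for k, loc in enumerate(sorted(locations), start=1))


def decode_locations(code, num_pulses=5, n_locations=80):
    """Recover the ordered pulse locations from the enumerative code B."""
    locations = []
    upper = n_locations
    for k in range(num_pulses, 1, -1):
        # Step down from the highest candidate until C(loc, k) <= B.
        loc = upper - 1
        while comb(loc, k) > code:
            loc -= 1
        locations.append(loc)
        code -= comb(loc, k)   # B = B - C(Lk, k)
        upper = loc            # the next location must be smaller
    locations.append(code)     # the remaining number is L1
    return locations[::-1]
```

Each search starts just below the previously decoded location, which mirrors the way the fourth pulse search above begins at L5-1.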
* * * * *