U.S. patent application number 09/725345 was filed with the patent office on 2000-11-29 and published on 2001-05-17 for method and device for speech coding.
Invention is credited to Stefan Heinen and Wen Xu.
Publication Number | 20010001320 |
Application Number | 09/725345 |
Family ID | 8232031 |
Filed Date | 2000-11-29 |
United States Patent Application | 20010001320 |
Kind Code | A1 |
Heinen, Stefan; et al. | May 17, 2001 |
Method and device for speech coding
Abstract
In the novel speech coding method and device, speech signals are
coded by a combination of speech parameters and excitation signals.
The speech parameters or excitation signals are described with
vectors. The vectors are formed by superposing at least two tracks,
wherein at least one track has at least two vector elements
different from zero. The algebraic signs of the vector elements
that differ from zero are coded independently of one another and
independently of the positions of the vector elements that differ
from zero.
Inventors: | Heinen, Stefan (Duren, DE); Xu, Wen (Unterhaching, DE) |
Correspondence Address: | Lerner and Greenberg, P.A., P.O. Box 2480, Hollywood, FL 33022-2480, US |
Family ID: | 8232031 |
Appl. No.: | 09/725345 |
Filed: | November 29, 2000 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
09725345 | Nov 29, 2000 | |
PCT/EP99/03766 | May 31, 1999 | |
Current U.S. Class: | 704/222; 704/E19.003; 704/E19.032; 704/E19.041 |
Current CPC Class: | G10L 19/10 20130101; G10L 19/002 20130101; G10L 19/005 20130101; H04L 1/0057 20130101; G10L 2019/0008 20130101; G10L 19/18 20130101 |
Class at Publication: | 704/222 |
International Class: | G10L 019/12 |
Foreign Application Data
Date | Code | Application Number |
May 29, 1998 | EP | 98 109 868.4 |
Claims
We claim:
1. A speech coding method, which comprises: coding a speech signal
by a combination of speech parameters and excitation signals;
describing the speech parameters or excitation signals by vectors;
forming the vectors by superposing at least two tracks, wherein at
least one track has at least two vector elements different from
zero; and coding algebraic signs of the vector elements that differ
from zero independently of one another and independently of the
positions of the vector elements that differ from zero.
2. The method according to claim 1, which comprises forming one of
the speech parameters and the excitation signals from the speech
signals on the CELP principle.
3. The method according to claim 1, which comprises coding the
positions of the vector elements that differ from zero together to
form an index value.
4. A device for speech coding, comprising an input for receiving a
speech signal, and a processor unit connected to said input and
configured to: code a speech signal by a combination of speech
parameters and excitation signals; describe the speech parameters
or excitation signals by vectors; form the vectors by superposing
at least two tracks; define at least one track with at least two
vector elements that differ from zero; and code an algebraic sign
of the vector elements that differ from zero independently of one
another and independently of positions of the vector elements that
differ from zero.
5. The device according to claim 4, wherein said processor unit is
programmed to obtain the speech parameters and excitation signals
from the speech signals on the CELP principle.
6. The device according to claim 4, wherein said processor unit is
programmed to code the positions of the vector elements that differ
from zero together to form an index value.
Description
CROSS-REFERENCE TO RELATED APPLICATION
1. This is a continuation of copending International Application
PCT/EP99/03766, filed May 31, 1999, which designated the United
States.
BACKGROUND OF THE INVENTION
2. 1. Field of the Invention
3. The invention lies in the communications field. More
specifically, the invention relates to methods and devices for
speech coding in which the speech signal is encoded by a
combination of speech parameters and excitation signals, in
particular on the CELP (code excited linear prediction)
principle.
4. Source signals or source information such as voice, sound,
picture, and video signals almost always contain statistical
redundancy, that is redundant information. This redundancy can be
greatly reduced by source encoding, so that efficient transmission
or storage of the source signal is made possible. This reduction in
redundancy eliminates, prior to transmission, redundant signal
contents that are based on the prior knowledge of, for example,
statistical parameters of the signal variation. The bit rate of the
source-encoded information is also referred to as the encoding rate
or source bit rate. Following transmission, in the source decoding,
these component parts are once more added to the signal, so that
objectively and/or subjectively there is no ascertainable loss in
quality.
5. On the other hand, it is customary in signal transmission for
redundancy to be specifically added again by channel encoding, in
order to largely eliminate the effect of channel interference on
the transmission. Additional redundant bits enable the receiver
or decoder to detect errors and possibly also correct them. The bit
rate of the channel-encoded information is also referred to as the
gross bit rate.
6. To allow information, in particular speech data, picture data or
other useful data, to be transmitted as efficiently as possible by
means of the limited transmission capacities of a transmission
medium, in particular an air interface, this information to be
transmitted is consequently compressed prior to the transmission by
a source encoding and protected against channel errors by a channel
encoding. Different methods are known for these procedures. For
example, in the GSM (Global System for Mobile Communication)
system, speech can be encoded by means of a full rate speech codec,
a half rate speech codec or an enhanced full rate speech codec
(EFR).
7. Within the scope of this description, a method of encoding
and/or corresponding decoding, which may also comprise source
encoding and/or channel encoding, is also referred to as a speech
codec.
8. As part of the further development of the European GSM mobile
radio standard and the development of new mobile radio systems
which are based on a CDMA (code division multiple access) method,
such as the UMTS (Universal Mobile Telecommunications System) in
the process of being standardized, new methods are being developed
for encoded speech transmission, making it possible for the entire
data rate, and also the dividing of the data rate between the
source encoding and channel encoding, to be set adaptively
according to the channel state and network conditions (system
load). Instead of the speech codecs described above, having a fixed
source bit rate, new speech codecs, able to be operated in
different codec modes, are to be used here, the codec modes
differing with regard to their source bit rate (encoding rate).
9. The main objects of such AMR (adaptive multirate) speech codecs
with variable source bit rate or variable encoding rate are to
achieve speech quality comparable to that of a fixed network under
different channel conditions and to ensure optimum distribution of
the channel capacity with certain network parameters taken into account.
10. 2. Summary of the Invention
11. The object of the invention is to provide a method and a device
for speech coding which overcome the above-noted deficiencies and
disadvantages of the prior art devices and methods of this kind,
and which make it possible to encode speech signals robustly to
resist transmission errors and with relatively little
expenditure.
12. With the above and other objects in view there is provided, in
accordance with the invention, a speech coding method, which
comprises:
13. coding a speech signal by a combination of speech parameters
and excitation signals;
14. describing the speech parameters or excitation signals by
vectors;
15. forming the vectors by superposing at least two tracks, wherein
at least one track has at least two vector elements different from
zero; and
16. coding algebraic signs of the vector elements that differ from
zero independently of one another and independently of the
positions of the vector elements that differ from zero.
17. In accordance with an added feature of the invention, the
speech parameters or the excitation signals are formed from the
speech signals on the CELP principle.
18. In accordance with an additional feature of the invention, the
positions of the vector elements that differ from zero are encoded
together to form an index value.
19. With the above and other objects in view there is also
provided, in accordance with the invention, a device for speech
coding. The device includes a processor unit receiving speech
signals and being configured to:
20. encode a speech signal by a combination of speech parameters
and excitation signals;
21. describe the speech parameters or excitation signals by
vectors;
22. form the vectors by superposing at least two tracks;
23. define at least one track with at least two vector elements
that differ from zero; and
24. encode an algebraic sign of the vector elements that differ
from zero independently of one another and independently of
positions of the vector elements that differ from zero.
25. In accordance with again an added feature of the invention, the
processor unit is programmed to obtain the speech parameters and
excitation signals from the speech signals on the CELP
principle.
26. In accordance with a concomitant feature of the invention, the
processor unit is programmed to encode the positions of the vector
elements that differ from zero together to form an index value.
27. In other words, the invention is premised on the idea of
encoding positions of certain predetermined vector elements which
differ from zero and the algebraic signs of these vector elements
separately from one another for the encoding of vectors for
describing speech parameters or excitation signals.
28. The problem is also solved by devices for speech coding in
which a digital signal processor is in each case set up in such a
way that positions of certain predetermined vector elements which
differ from zero and the algebraic signs of these vector elements
are encoded separately from one another for the encoding of vectors
for describing speech parameters or excitation signals.
29. The invention relates quite generally to methods for speech
coding in which the speech signal is encoded by a combination of
speech parameters and excitation signals. The speech parameters
include parameters or characteristic variables of a statistical
model on which the speech production is based, for example LPC or
LTP filter coefficients, and the excitation signals are signals of
the exciting processes of this model. These processes may be
modeled either statistically or deterministically. Examples of
statistical modeling are vocoder methods in which the excitation
signals are generated by noise sources. In deterministic modeling,
the excitation signals are obtained with the aid of the underlying
model from the speech signal and are quantized. Examples of this
are RPE/LTP (GSM Full Rate Codec), VSELP (GSM Half Rate Codec) and
ACELP (GSM Enhanced Full Rate Codec).
30. Other features which are considered as characteristic for the
invention are set forth in the appended claims.
31. Although the invention is illustrated and described herein as
embodied in a method and device for speech coding, it is
nevertheless not intended to be limited to the details shown, since
various modifications and structural changes may be made therein
without departing from the spirit of the invention and within the
scope and range of equivalents of the claims.
32. The construction and method of operation of the invention,
however, together with additional objects and advantages thereof
will be best understood from the following description of specific
embodiments when read in connection with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
33. FIG. 1 is a block diagram of essential elements in a
telecommunications transmission chain;
34. FIG. 2 is a block diagram of an AMR encoder based on the CELP
principle; and
35. FIG. 3 is a schematic block diagram of a processor unit.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
36. Referring now to the figures of the drawing in detail and
first, particularly, to FIG. 1 thereof, there is seen a source Q,
which generates source signals qs, which are compressed by a source
encoder QE, such as the GSM full rate speech coder, to form symbol
sequences comprising symbols. In parametric source encoding
methods, the source signals qs generated by the source Q (for
example speech) are divided into blocks (for example time frames)
and these are processed separately. The source encoder QE generates
quantized parameters (for example speech parameters or speech
coefficients), which are also referred to hereafter as symbols of a
symbol sequence, and which reflect the properties of the source in
the current block in a certain way (for example spectrum of the
speech in the form of filter coefficients, amplitude factors,
excitation vectors). After the quantization, these symbols have a
certain symbol value.
37. The symbols of the symbol sequence or the corresponding symbol
values are mapped by a binary mapping (allocation specification),
which is frequently described as part of the source encoding QE,
onto a sequence of binary code words, which in each case have a
plurality of bit positions. If these binary code words are, for
example, further processed one after the other as a sequence of
binary code words, a sequence of source-encoded bit positions which
may be embedded in a frame structure is produced. After source
encoding carried out in this way, source bits or data bits db with
a source bit rate (encoding rate) dependent on the type of source
encoding are thus obtained in a structured form in the frame.
38. The term codebook is to be understood here as describing a
table with all the quantization representatives. The entries of the
table may be both scalar and vectorial quantities.
39. Scalar codebooks can be used for example for the quantization
of amplitude factors, since these are generally scalar quantities.
Examples of the use of vectorial codebooks are the quantization of
LSF (line spectrum frequencies) and the quantization of the
stochastic excitation.
40. Referring now to FIG. 2, there is shown a basic representation
of a special variant of a source encoder. The exemplary embodiment
is a speech coder, namely an AMR encoder based on the CELP (code
excited linear prediction) principle.
41. The CELP principle concerns a method of analysis by synthesis.
In this case, a filter structure obtained from the current portion
of speech is excited by excitation vectors (code vectors) taken one
after the other from a codebook. The output signal of the filter is
compared by means of a suitable error criterion with the current
portion of speech and the error-minimizing excitation vector is
selected. The representation of the filter structure and the
position number of the selected excitation vector are transmitted
to the receiver.
42. A specific variant of a CELP method uses an algebraic codebook,
which is often also referred to as a sparse algebraic code. It is a
multipulse codebook which is filled with binary (+/-1) or ternary
pulses (0, +/-1). Within the excitation vectors, only a few
positions are respectively occupied by pulses. After the selection
of the positions, the entire vector is weighted with an amplitude
factor. A codebook of this type has several advantages. On the one
hand, it does not take up any storage space, since the positions
allowed for the pulses are determined by an algebraic computing
rule, on the other hand it can be searched through very efficiently
for the best pulse positions on account of the way it is
structured.
43. A configurational variant of a conventional CELP encoder is
first described below with reference to FIG. 2. A target signal to
be approximated is reproduced by searching through two codebooks. A
distinction is drawn here between an adaptive codebook (a2), the
task of which is the reproduction of the harmonic speech
components, and a stochastic codebook (a4), which serves for the
synthesis of the speech components which cannot be obtained by
prediction. The adaptive codebook (a2) changes according to the
speech signal, while the stochastic codebook (a4) is invariant over
time. The search for the best excitation code vectors takes place
not by searching jointly, i.e. simultaneously, in the codebooks, as
would be necessary for an optimum selection of the excitation code
vectors, but, for expenditure-related reasons, by initially
searching through the adaptive codebook (a2). Once the excitation
code vector that is best according to the error criterion has been
found, its contribution to the reconstructed target signal is
subtracted from the target vector (target signal) and the part of
the target signal still to be reconstructed by means of a vector
from the stochastic codebook (a4) is obtained. The search in the
individual codebooks takes place on the same principle. In both
cases, the quotient of the square of the correlation of the
filtered excitation code vector with the target vector and the
energy of the filtered excitation code vector is computed for all
the excitation code vectors. The excitation code vector which
maximizes this quotient is regarded as the best excitation code
vector, since it minimizes the error criterion (a5). The preceding
error weighting (a6) weights the error according to the
characteristics of human hearing. The position of the found
excitation code vector within the excitation codebook is
transmitted to the decoder.
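The search criterion of paragraph 43 can be sketched as follows. This is a minimal illustration, not the encoder of FIG. 2: the function names, the representation of the synthesis filter by a truncated impulse response, and the omission of the perceptual error weighting (a6) are all assumptions made for the example. The denominator is the energy of the filtered code vector, which is the usual form of the CELP criterion.

```python
def filter_vector(code, impulse_response):
    """Convolve a code vector with a filter impulse response,
    truncated to the length of the code vector (hypothetical helper)."""
    n = len(code)
    out = [0.0] * n
    for i, c in enumerate(code):
        if c == 0:
            continue
        for j, h in enumerate(impulse_response):
            if i + j >= n:
                break
            out[i + j] += c * h
    return out


def search_codebook(codebook, target, impulse_response):
    """Return (index, gain) of the code vector that maximizes
    correlation**2 / energy of the filtered code vector."""
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for index, code in enumerate(codebook):
        y = filter_vector(code, impulse_response)
        corr = sum(t * v for t, v in zip(target, y))
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero vector cannot approximate the target
        score = corr * corr / energy
        if score > best_score:
            # corr / energy is the amplitude factor implied by the quotient
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


# Toy data: three single-pulse code vectors and a two-tap filter.
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, 0]]
impulse_response = [1.0, 0.5]
target = [0.0, 2.0, 1.0, 0.0]  # equals 2 * filtered codebook[1]
best = search_codebook(codebook, target, impulse_response)
```

With this toy data the search selects index 1 with gain 2.0, since the target was constructed as twice the filtered second code vector.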
44. The computation of the afore-mentioned quotient has the effect
of implicitly determining the correct (codebook) amplitude factor
(amplification 1, amplification 2) for each excitation code vector.
Once the best candidate has been determined from the two codebooks,
the quality-reducing influence of the sequentially performed
codebook search can be reduced by a joint optimization of the
amplification. This involves re-specifying the original target
vector and computing the best amplifications matching the
excitation code vectors now selected. These amplifications usually
differ slightly from those which were determined during the
codebook search.
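The joint optimization of the two amplifications can be illustrated as a two-variable least-squares problem. The sketch below is an assumption about how such a step might look, not the method of the application: `y1` and `y2` stand for the already filtered adaptive and stochastic code vectors, and all names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sequential_gains(target, y1, y2):
    """Gains as found by the sequential search: first y1, then y2 on the remainder."""
    g1 = dot(target, y1) / dot(y1, y1)
    remainder = [t - g1 * v for t, v in zip(target, y1)]
    g2 = dot(remainder, y2) / dot(y2, y2)
    return g1, g2


def joint_gains(target, y1, y2):
    """Gains minimizing |target - g1*y1 - g2*y2|^2 (2x2 normal equations)."""
    a11, a12, a22 = dot(y1, y1), dot(y1, y2), dot(y2, y2)
    b1, b2 = dot(target, y1), dot(target, y2)
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("filtered vectors are linearly dependent")
    g1 = (b1 * a22 - b2 * a12) / det
    g2 = (b2 * a11 - b1 * a12) / det
    return g1, g2


# Toy data: the target is exactly 2.0 * y1 + 0.5 * y2.
y1 = [1, 1, 0, 0]
y2 = [0, 1, 1, 0]
target = [2.0, 2.5, 0.5, 0.0]
seq = sequential_gains(target, y1, y2)
joint = joint_gains(target, y1, y2)
```

For this toy input the sequential search yields gains of 2.25 and 0.375, while the joint solution recovers 2.0 and 0.5, which shows how the re-optimized amplifications can differ from those found during the search.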
45. In the case of the CELP principle, each candidate vector can be
individually filtered (a3) and compared with the target signal for
finding the best excitation code vector.
46. Finally, filter parameters, amplitude factors, and excitation
code vectors are converted into binary signals and, embedded in a
fixed structure, are transmitted in frames. The filter parameters
may be LPC (linear predictive coding) coefficients, LTP (long term
prediction) indices, or LTP (long term prediction) amplitude
factors.
47. The LPC residual signal or excitation signal is vectorially
quantized. To keep the ROM requirement small, an algebraic codebook
is used. In other words, the group of quantization representatives
is not explicitly present in a table, but instead the various
representatives can be determined from an index value with the aid
of an algebraic computing rule. This additionally has
complexity-related advantages in the codebook search.
48. The algebraic computing rule for the determination of the
excitation code vectors from an index value uses a division of the
vector space into so-called tracks. Within one track (vector), only
components (vector elements) which lie on a track-specific grid can
assume values which differ from zero.
49. Positions whose vector elements differ from zero are referred to
as pulse positions. Depending on the codebook, the pulses may have
either a binary (-1/+1) or ternary (-1/0/+1) value range. The
superposing of individual tracks finally supplies the excitation
code vector.
50. This is explained on the basis of the following simple example:
an algebraic codebook of dimension 20 formed by 2 tracks.
51. The symbols are as follows:
52. 0: allowed position in the track, no pulse
53. #: unallowed position in the track
54. +: positive pulse
55. -: negative pulse
Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Track1:    0  #  0  #  0  #  0  #  0  #  0  #  +  #  0  #  0  #  0  #
Track2:    #  0  #  -  #  0  #  0  #  0  #  0  #  0  #  0  #  0  #  0
Excit:     0  0  0  -  0  0  0  0  0  0  0  0  +  0  0  0  0  0  0  0
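The two-track example can be reproduced with a short sketch. The helper names and the representation of a track grid as a Python `range` are assumptions chosen for illustration; only the dimension 20, the even/odd grids and the pulse positions 12 (positive) and 3 (negative) come from the example.

```python
def build_track(length, grid, pulses):
    """Build one track: zeros everywhere, signed unit pulses at the given
    positions. `pulses` maps position -> sign (+1 or -1)."""
    track = [0] * length
    for position, sign in pulses.items():
        if position not in grid:
            raise ValueError(f"position {position} is not allowed in this track")
        track[position] += sign
    return track


def superpose(*tracks):
    """Element-wise sum of the tracks gives the excitation code vector."""
    return [sum(column) for column in zip(*tracks)]


track1 = build_track(20, range(0, 20, 2), {12: +1})  # even positions allowed
track2 = build_track(20, range(1, 20, 2), {3: -1})   # odd positions allowed
excitation = superpose(track1, track2)
```

The resulting `excitation` is zero everywhere except for -1 at position 3 and +1 at position 12, matching the last row of the table.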
56. For encoding this codebook, one of 10 positions and the
algebraic sign of the set pulse accordingly have to be encoded per track. A more
accurate quantization of the excitation signal is achieved with
codebooks which have more than one pulse per track, since then the
superposing of two pulses is also possible.
57. A track (vector) is consequently described by pulses (vector
elements which differ from zero). It is possible for the pulses to
be described by their position (vector element index) and their
algebraic sign. The total set of possible pulse position
combinations is described with the aid of an overall index.
58. The code word length required for encoding the excitation
vector found is made up of the number of bits required for encoding
the pulse positions and algebraic signs. If only one pulse is set
per track, not only the number of bits for the encoding of its
position but also a further bit for the encoding of its algebraic
sign are required.
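As a worked check of this bit count, the following sketch computes the code word length for the case of one pulse per track. It assumes a fixed-length binary code for the position within each track; the function names are hypothetical.

```python
def bits_for(n_values):
    """Number of bits of a fixed-length binary code for n_values alternatives."""
    return (n_values - 1).bit_length()


def codeword_bits_single_pulse(n_tracks, positions_per_track):
    """Position bits plus one sign bit for each track carrying one pulse."""
    return n_tracks * (bits_for(positions_per_track) + 1)


# Two tracks with 10 allowed positions each, as in the dimension-20 example.
total_bits = codeword_bits_single_pulse(2, 10)
```

For two tracks with 10 allowed positions each, this gives 4 position bits plus 1 sign bit per track, that is 10 bits in total.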
59. Efficient encoding of the algebraic signs with likewise only
one bit for the case in which two pulses are set per track is
already used in the GSM-EFR codec. Here, the information of the
position sequence is utilized. If two pulses are located in the
position grid of the same track, in the case in which both pulses
have like algebraic signs, that pulse which assumes the lower
position within the grid is encoded first. In the case of different
algebraic signs, the pulse which occupies the higher position is
encoded first. Only the sign bit of the pulse encoded first is
transmitted. The algebraic sign of the second pulse is determined
on the decoder side by analysis of the encoding sequence of the
pulse positions.
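The ordering rule of paragraph 59 can be sketched for two pulses of one track. This is an illustration of the described principle only, not the bit-exact GSM-EFR procedure: the sign-bit convention (0 for positive) and the function names are assumptions, and two pulses of opposite sign at the same position are rejected because they could not be ordered.

```python
def encode_pair(pos_a, sign_a, pos_b, sign_b):
    """Return (first_pos, second_pos, sign_bit) for two pulses of one track."""
    if sign_a != sign_b and pos_a == pos_b:
        raise ValueError("opposite-sign pulses at one position cannot be ordered")
    (lo_pos, lo_sign), (hi_pos, hi_sign) = sorted([(pos_a, sign_a), (pos_b, sign_b)])
    if lo_sign == hi_sign:
        first, second = (lo_pos, lo_sign), (hi_pos, hi_sign)  # like signs: lower first
    else:
        first, second = (hi_pos, hi_sign), (lo_pos, lo_sign)  # unlike signs: higher first
    sign_bit = 0 if first[1] > 0 else 1  # only the first pulse's sign is sent
    return first[0], second[0], sign_bit


def decode_pair(first_pos, second_pos, sign_bit):
    """Recover both signs from one sign bit and the order of the positions."""
    first_sign = +1 if sign_bit == 0 else -1
    second_sign = first_sign if first_pos <= second_pos else -first_sign
    return (first_pos, first_sign), (second_pos, second_sign)


sent = encode_pair(2, +1, 9, -1)
received = decode_pair(*sent)
corrupted = decode_pair(sent[0], sent[1], sent[2] ^ 1)  # sign bit flipped
```

Here `received` restores the pulse at position 9 with negative sign and the pulse at position 2 with positive sign. In `corrupted`, a single flipped sign bit inverts the signs of both pulses, because the second sign is derived from the first.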
60. This principle is to be illustrated on the basis of a codebook
with three pulses per track. In this case, the sign information of
the pulses of a track can also be encoded with a bit, in that the
principle of the algebraic sign encoding is extended by deliberate
changing of the encoding sequence. This is shown by the following
estimate: in the transmission of one sign bit, with $P_T = 3$ pulses
per track, $N_{VZ}$ sign combinations remain to be distinguished:
$$N_{VZ} = \frac{2^{P_T}}{2} = \frac{2^3}{2} = 4.$$
61. The number $N_{perm}$ of possible permutations of the pulse
positions is
$$N_{perm} = P_T! = 3! = 6 > N_{VZ}.$$
62. As long as the number of possible permutations $N_{perm}$ is
greater than the number of possible sign combinations $N_{VZ}$, the
sign information of a track can be transmitted by
deliberate changing of the encoding sequence of the pulse
positions. If, for instance, a codebook which provides four pulses
per track were drawn up, the permutation of the encoding sequence
alone would be sufficient for the sign encoding
($4! = 24 > 2^4 = 16$); an additional sign bit would not be required.
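The counting argument can be checked numerically. The sketch below is only an illustration; the function names are hypothetical, and the comparison uses "greater than or equal", on the assumption that equality is already sufficient (as in the two-pulse case, where 2 orderings cover 2 remaining sign combinations).

```python
from math import factorial


def remaining_sign_combinations(pulses_per_track, sign_bits_sent):
    """Sign combinations left to distinguish after sending some sign bits."""
    return 2 ** pulses_per_track // 2 ** sign_bits_sent


def orderings(pulses_per_track):
    """Number of possible encoding sequences of the pulse positions."""
    return factorial(pulses_per_track)


def order_can_carry_signs(pulses_per_track, sign_bits_sent):
    """True if the orderings suffice to signal the remaining sign combinations."""
    return orderings(pulses_per_track) >= remaining_sign_combinations(
        pulses_per_track, sign_bits_sent
    )


for pulses, sign_bits in [(2, 1), (3, 1), (3, 0), (4, 0)]:
    print(
        pulses,
        sign_bits,
        orderings(pulses),
        remaining_sign_combinations(pulses, sign_bits),
        order_can_carry_signs(pulses, sign_bits),
    )
```

Three pulses need one explicit sign bit (6 orderings against 4 remaining combinations, but only 6 against 8 without it), whereas four pulses need none (24 orderings against 16 combinations).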
63. This efficient encoding of the sign information of an
excitation vector is accomplished, however, at the expense of an
increased susceptibility to interference on the transmission
channel. In the worst case, the interference of a sign bit of a
track causes the reversal of all the algebraic signs in this track.
Similarly, the interference of the parameter of a pulse position
may affect the algebraic signs of all the pulses of the same
track.
64. For this reason, the invention describes a significantly more
robust sign encoding. In this case, the overall set of possible
pulse positions is addressed by a suitable algebraic method with
the aid of a single index. Independently of this, the algebraic
signs of the pulses are respectively encoded with a bit.
65. The improved method may be explained by the example of the
algebraic codebook for the rate 9.5 kbits/s. In the case of this
codebook, two pulses are set to 14 possible positions. For the
encoding scheme with permutation encoding, one bit is required for
the first pulse sign and 4 bits are required for each of the two
pulse positions, that is a total of 9 bits.
66. The robust encoding method encodes the possible pulse positions
independently of the algebraic signs. Since both pulses may also
lie at the same position, there are combinations with repetition
here. It is known from the theory of combinations that in this case
there exists the following number of possiblities: 2 ( 14 + 2 - 1 2
) = ( 15 2 ) = 105 < 2 7
67. Since this number is less than 2.sup.7=128, seven bits are
sufficient for the encoding of the positions. The two algebraic
signs are respectively encoded with one bit. In this way, a
detachment of algebraic signs and pulse positions is achieved
without increasing the bit rate required for the encoding of the
excitation vectors.
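A possible realization of this robust scheme is sketched below. The application only speaks of "a suitable algebraic method" for the position index, so the triangular-number mapping, the bit layout of the 9-bit word and all names used here are assumptions chosen for illustration.

```python
from math import comb

N_POSITIONS = 14
N_COMBINATIONS = comb(N_POSITIONS + 2 - 1, 2)  # combinations with repetition


def encode_positions(pos_a, pos_b):
    """Map an unordered pair of positions (repetition allowed) to one index."""
    lo, hi = sorted((pos_a, pos_b))
    if lo < 0 or hi >= N_POSITIONS:
        raise ValueError("position out of range")
    return hi * (hi + 1) // 2 + lo


def decode_positions(index):
    """Inverse of encode_positions; returns (lower, higher) position."""
    if not 0 <= index < N_COMBINATIONS:
        raise ValueError("index out of range")
    hi = 0
    while (hi + 1) * (hi + 2) // 2 <= index:
        hi += 1
    return index - hi * (hi + 1) // 2, hi


def pack(pos_a, sign_a, pos_b, sign_b):
    """9-bit word: 7 position-index bits, then one sign bit per pulse."""
    (lo, s_lo), (hi, s_hi) = sorted([(pos_a, sign_a), (pos_b, sign_b)])
    index = encode_positions(lo, hi)
    return (index << 2) | (int(s_lo < 0) << 1) | int(s_hi < 0)


def unpack(word):
    lo, hi = decode_positions(word >> 2)
    s_lo = -1 if word & 0b10 else +1
    s_hi = -1 if word & 0b01 else +1
    return (lo, s_lo), (hi, s_hi)


word = pack(9, -1, 2, +1)
decoded = unpack(word)
one_sign_bit_flipped = unpack(word ^ 0b01)
```

In this sketch a flipped sign bit changes the sign of exactly one pulse and leaves the positions and the other sign untouched, and an error in the position index does not alter any sign. The word length stays at 7 + 2 = 9 bits.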
68. In simulations, individual bit positions of the codebook
indices were in each case subjected to interference with a 100%
error rate and the resulting speech SNR measured after resynthesis.
In these simulations, the more robust encoding improved the error
sensitivity of the algebraic signs by about 3 dB.
69. In a configurational variant of the invention, the different
encoding rates mean that different codec modes generally have
different frame sizes, and therefore also different structures in
which the bit positions serving for describing the filter
parameters, amplitude factors or excitation code vectors are
embedded.
70. To realize a variable encoding rate, the changing of the
encoding rate may be realized by a corresponding changing of the
number of bit positions for describing an excitation code vector
taken from a stochastic codebook. Switching over of the codec modes
consequently leads, as a result of switching over to a stochastic
codebook corresponding to the new codec mode, to the selection of
excitation code vectors contained in this codebook, for the
description of which a correspondingly changed number of bit
positions is required. Consequently, there are different stochastic
codebooks available according to the number of different codec
modes.
71. Referring now to FIG. 3, there is shown a processor unit PE,
which may be contained in particular in a communication device,
such as a base station BS or mobile station MS. The unit PE
contains a control device STE, which essentially comprises a
program-controlled microcontroller, and a processing device VE,
which comprises a processor, in particular a digital signal
processor, both of which have write and read access to memory
chips SPE.
72. The microcontroller controls and monitors all the major
elements and functions of a functional unit which includes the
processor unit PE. The digital signal processor, part of the
digital signal processor or a specific processor is responsible for
carrying out the speech coding or speech decoding. The selection of
a speech codec may also be performed by the microcontroller or the
digital signal processor itself.
73. An input/output interface I/O serves for the input/output of
useful or control data, for example to a man-machine interface MMI,
which may include a keyboard and/or a display. The individual
elements of the processor unit may be connected to one another by a
digital bus system BUS.
74. It will be understood by those of skill in the pertinent art
that, on the basis of the foregoing description, the invention can
also apply to encoding methods other than the CELP encoding method
explained in the application.
* * * * *