U.S. patent number 3,740,476 [Application Number 05/161,173] was granted by the patent office on 1973-06-19 for speech signal pitch detector using prediction error data.
This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to Bishnu Saroop Atal.
United States Patent 3,740,476
Atal
June 19, 1973
SPEECH SIGNAL PITCH DETECTOR USING PREDICTION ERROR DATA
Abstract
Pitch periods in a complex speech signal are determined by
evaluating the error in predicting the value of a sample of the
signal on the basis of past sample values, and by locating samples
for which the prediction error is large. Advantageously, the
prediction error signal is devoid of all formant structure, so that
there is no chance of confusing pitch signal peaks with formant
peaks. A voiced-unvoiced decision is obtained from the ratio of the
mean-squared value of the speech signal to the mean-squared value
of the prediction error signal.
Inventors: Atal; Bishnu Saroop (Murray Hill, NJ)
Assignee: Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID: 22580131
Appl. No.: 05/161,173
Filed: July 9, 1971
Current U.S. Class: 704/207; 704/219
Current CPC Class: G10L 25/90 (20130101); G10L 25/93 (20130101)
Current International Class: G10L 11/04 (20060101); G10L 11/00 (20060101); G10L 11/06 (20060101); G10l 001/04 ()
Field of Search: 179/1SA, 15.55R; 325/38A
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Leaheey; Jon Bradford
Claims
What is claimed is:
1. A signal analyzer for determining the fundamental period of a
speech signal, which comprises,
adaptive predictor means supplied with samples of said speech
signal for predicting the present value of each sample on the basis
of a weighted summation of a number of prior sample values of said
speech signal,
means for subtracting said predicted speech value from the actual
speech value to develop a difference signal, and
means for determining the fundamental frequency of said difference
signal as an indication of the fundamental period of said speech
signal.
2. A signal analyzer as defined in claim 1, wherein said means for
determining the fundamental frequency of said difference signal
comprises,
means for determining the frequency of occurrence of difference
signal maxima above a prescribed threshold.
3. A signal analyzer as defined in claim 1, wherein said means for
determining the fundamental frequency of said difference signal
comprises,
means for autocorrelating said difference signal for developing an
autocorrelation signal representative of the periodic character of
said difference signal, and
means for detecting the location of the peak value of said
autocorrelation signal.
4. Apparatus for determining the fundamental period of a speech
signal, which comprises,
means for developing an estimate of the present value of a speech
signal on the basis of past values of said speech signal,
means for developing a signal representative of the difference
between said signal estimate and the true present value of said
speech signal, and
means for determining the fundamental frequency of said difference
signal to develop a signal representative of the fundamental period
of said speech signal.
5. Apparatus for determining the fundamental period of a speech
signal, which comprises,
adaptive predictor means supplied with samples of said speech
signal for developing an estimate of the momentary value of said
speech signal from previously supplied samples,
means for developing a prediction error signal from the difference
between said predicted signal estimate and the corresponding
momentary value of samples of said speech signal,
means for identifying prediction error samples whose magnitudes are
above a prescribed threshold, and
means for utilizing the frequency of occurrence of said identified
error samples as a measure of the fundamental period of said speech
signal.
6. Apparatus for analyzing the character of a speech signal, which
comprises, in combination,
predictor means supplied with samples of a speech signal for
developing an estimate of the momentary value of said signal from
previously supplied samples,
means for developing prediction error signal samples from the
difference between samples of said signal estimate and the
corresponding momentary value of samples of said speech signal,
means for identifying prediction error samples whose magnitudes are
above a prescribed threshold,
means for developing a first signal proportional to the
mean-squared value of said speech samples,
means for developing a second signal proportional to the
mean-squared value of corresponding ones of said error samples,
means for developing a signal proportional to the ratio of said
first to said second mean-squared signals,
means for utilizing the frequency of occurrence of said identified
threshold error samples as a measure of the fundamental period of
said speech signal, and
means for utilizing said ratio of first and second mean-squared
signals as a measure of the voicing characteristic of said speech
signal.
7. Apparatus for analyzing the character of a speech signal as
defined in claim 6, wherein,
values of said ratio of mean-squared signals equal to or greater
than a prescribed threshold are used to classify said speech signal
as voiced, and
wherein values of said ratio of mean-squared signals less than said
threshold are used to classify said speech signal as unvoiced.
8. In a pitch analysis arrangement for speech signals, the
combination of,
means for developing a signal representative of the formant
structure of an applied speech signal,
means for removing said formant representative signal from said
speech signal to produce a signal essentially devoid of all formant
information,
means for measuring the period of said formant devoid signal,
and
means for determining the voicing character of said speech signal
on the basis of the power in said speech signal and the power in
said formant devoid signal.
Description
This invention is concerned with the analysis of complex signals,
and particularly with the determination of the fundamental
frequency, or period, of a complex periodic signal, such as a
voiced speech signal. Its principal objectives are to simplify the
measurement of pitch frequency and to improve the reliability of
the measure.
BACKGROUND OF THE INVENTION
A number of arrangements for reducing the channel capacity required
for the transmission of complex signals, such as speech signals,
have been proposed. One of the best known of these is the vocoder.
More recently, techniques for removing inherent signal redundancy
through the use of linear prediction techniques have been
described. In all of these arrangements, a speech wave is analyzed
to determine its significant characteristics, and coded information
concerning these characteristics is transmitted instead of the
speech signal itself. At a receiver station a synthetic speech
signal is developed from the coded information.
In general, a different set of coded signal information is employed
in each type of bandwidth compression system. However, virtually
all employ one characteristic of the speech signal, namely, its
pitch frequency. This characteristic denotes the fundamental
frequency at which the vocal cords vibrate during the production of
different voiced speech sounds. Most speech bandwidth compression
systems also employ coded information to identify a speech signal
as voiced or unvoiced. Some combine the two forms of information so
that the pitch signal inherently specifies the voicing
condition.
FIELD OF THE INVENTION
A number of different proposals for automatically measuring and
encoding the pitch characteristic of a speech signal are known and
used in the art. Some rely on simple filtering, some on signal
correlation, some on formant detection and tracking, and others on
a transformation of the logarithm of the spectrum of a speech
signal, the so-called cepstrum of the signal. All of these
arrangements, however, operate on the speech signal itself and in
one way or another strive to find peak values in the signal, or in
a modification of it, which identify the pitch characteristic.
Unfortunately, peaks due to formants, particularly the first
formant of a speech signal, are often stronger than a peak
developed to indicate pitch. If the two peaks are close together,
it is difficult to determine which is which. Consequently, even the
most sophisticated pitch detectors are subject to error and do not
always correctly characterize the pitch frequency of a signal.
It is thus another object of this invention to capitalize on a
unique property of a voiced speech signal to develop a measure of
the pitch frequency of the signal that is unambiguous and which is
entirely independent of the formant character of the speech
signal.
SUMMARY OF THE INVENTION
Analysis of a complex speech signal to determine its pitch
frequency is, in accordance with the invention, based on an
analysis of the error between a predicted value of the speech
signal based on its past sample values and its actual value at that
moment. The time interval represented by the number of samples used
to obtain the predicted value is typically 1 msec. Due to the short
memory used in the prediction process, the predicted signal values
represent, in large measure, the formant structure of the speech
signal. The pitch analysis arrangement of the invention is
particularly effective because, in developing a difference signal,
i.e., the prediction error signal, the formant structure of the
signal is removed from the input signal. Yet, since the pitch
period in speech signals ranges typically from 3 msec to 20 msec,
the prediction of the pitch structure, based on 1 msec of past
speech, is completely negligible. Thus, pitch information is
retained in the prediction error signal. Consequently, there is
little or no interference from the formant structure and a peak
picking operation is effective in developing a measure of the pitch
character of the input signal.
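The time scales in this summary can be made concrete with a little arithmetic. The sketch below assumes the sampling rate of approximately 10 kHz that is given in the detailed description; the helper name ms_to_samples is hypothetical and is introduced only for illustration.

```python
# Assumption: sampling rate of about 10 kHz, as stated in the detailed description.
SAMPLE_RATE_HZ = 10_000

def ms_to_samples(milliseconds):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(milliseconds * SAMPLE_RATE_HZ / 1000)

predictor_span = ms_to_samples(1)    # memory of the predictor
shortest_pitch = ms_to_samples(3)    # shortest typical pitch period
longest_pitch = ms_to_samples(20)    # longest typical pitch period

print(predictor_span, shortest_pitch, longest_pitch)
```

Under that assumption the predictor looks back over about 10 samples, while successive pitch pulses are 30 to 200 samples apart, which is why the predictor cannot anticipate the next pitch pulse.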
A feature of the invention is the additional use of prediction
error samples to develop a voiced-unvoiced signal indication. In
accordance with the invention, a voicing decision is based on the
ratio of the mean-squared value of input signal samples to the
mean-squared value of corresponding prediction error samples.
This invention will be more fully understood from the following
detailed description of an illustrative embodiment of it taken
together with the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block schematic diagram of a speech signal analysis
system which illustrates the principles of the invention, and
FIG. 2 is an illustration of the waveform of a segment of a voiced
speech signal, the positions of detected pitch pulses in the voiced
speech signal, as shown by vertical lines, and a segment of
unvoiced speech.
DETAILED DESCRIPTION
A signal analysis arrangement which illustrates the principles of
the invention is shown in FIG. 1. Speech signals supplied
from any desired source are delivered to the analyzer and passed
through low-pass filter 10. Filter 10 typically has a cutoff
frequency in the neighborhood of 5 kHz. The resultant signal is
then sampled at a frequency of approximately 10 kHz in sampler 11
under control of signals from clock 12. Speech samples, s_n,
thus derived are supplied to storage unit 13 which maintains them
in order, typically in blocks of 200 samples, i.e., s_1,
s_2, ..., s_200. Blocks or frames of samples are
periodically keyed out of storage unit 13, for example, under
control of a signal from clock 12, and delivered to adaptive
predictor 14, prediction parameter computer 15, and to subtractor
network 16.
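The blocking performed by storage unit 13 might be sketched in software as follows. This is an illustrative sketch only: the function name frame_samples is hypothetical, and dropping a trailing partial block is an assumption, since the text does not say how an incomplete block is handled.

```python
# Values taken from the text: about 10 kHz sampling, blocks of 200 samples.
SAMPLE_RATE_HZ = 10_000
FRAME_LEN = 200

def frame_samples(samples, frame_len=FRAME_LEN):
    """Split a sample sequence into consecutive full blocks of frame_len samples.

    Assumption: a trailing partial block is discarded.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]

# Example: 450 dummy samples yield two full blocks; the last 50 samples are dropped.
frames = frame_samples(list(range(450)))
frame_duration_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ

print(len(frames), frame_duration_ms)
```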
Adaptive predictor 14 operates on supplied signal samples to
predict the present value of each sample on the basis of a weighted
summation of a number of prior sample values. The prediction
operation is carried out on a sample-by-sample basis and predictor
14 is periodically supplied with a new frame of samples from
storage unit 13. An adaptive predictor suitable for use in the
system of this invention is described in detail in a copending
application of B. S. Atal, Ser. No. 753,408, filed Aug. 19, 1968,
now U.S. Pat. No. 3,631,520.
To accommodate the constantly changing character of the input
speech signal, predictor 14 is controlled to adapt it to the
current signal condition. It has been found sufficient to readjust
the values of the parameters used to control the predictor at
intervals comparable to those of a pitch period of the signal.
Since the exact pitch interval is not available (although the pitch
output signal of the system may be used in a feedback arrangement
to approximate the interval of a later pitch period), readjustment
of the parameter values at intervals corresponding approximately to
the time of 200 samples is entirely satisfactory. This corresponds
to a time interval of approximately 20 msec.
Prediction parameter computer 15 thus operates on applied speech
samples from unit 13 to develop a sequence of parameter signals a =
a_1, a_2, ..., a_n, which are used periodically to
adjust predictor 14. Parameter values a are selected to minimize
the mean-squared prediction error of the system. The relation of
parameter signals a to the input signal, their development, and
the manner in which they are used to control the predictor are
explained in detail in the above-mentioned
copending patent application. Parameter signals from computer 15
are developed well in advance of the time that a block of signals
is processed in predictor 14 because of the delay inherent in the
prediction operation. Typically, parameter control signals are
developed within an interval corresponding to the time of
approximately 60 samples.
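As an illustration of how parameter values that minimize the mean-squared prediction error can be obtained, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion. This is one widely used technique and is an assumption here; it is not necessarily the procedure of the above-mentioned copending application. The function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], where r[lag] = sum of frame[i] * frame[i + lag]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def predictor_coefficients(frame, order):
    """Return [a_1, ..., a_order] so that s[n] is approximated by sum_k a_k * s[n-k].

    Uses the Levinson-Durbin recursion on the frame autocorrelation.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                        # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:                # silent or perfectly predicted frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                 # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

# Example: a decaying exponential s[n] = 0.5**n is predicted by a_1 = 0.5.
decay = [0.5 ** n for n in range(50)]
coeffs = predictor_coefficients(decay, order=2)
print(coeffs)
```

For the example signal, the first coefficient comes out very close to 0.5 and the second very close to zero, since each sample is half of the previous one.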
Sample values developed by predictor 14 are subtracted in network
16 from the actual value of corresponding signal samples delivered
from storage unit 13 to the subtractor. The resultant difference
signal represents the error in predicting the value of the signal.
It is accordingly called a "prediction error" signal. Evidently,
appropriate delay is provided, for example, in the readout of
samples from storage unit 13 or in their delivery to subtractor 16,
to allow time for all predictor operations to be completed. Suffice
it to say that all of the described operations are carried on in
synchronism in a conventional manner.
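The operation of predictor 14 and subtractor 16 can be sketched as below. The function name prediction_error is hypothetical, and treating samples before the start of the sequence as zero is an assumption made only to keep the example short.

```python
def prediction_error(samples, coeffs):
    """Return e[n] = s[n] - sum_k coeffs[k-1] * s[n-k] for every sample.

    Assumption: samples before the start of the sequence are taken as zero.
    """
    errors = []
    for n, actual in enumerate(samples):
        predicted = sum(a * samples[n - k]
                        for k, a in enumerate(coeffs, start=1)
                        if n - k >= 0)
        errors.append(actual - predicted)
    return errors

# Example: a single excitation followed by a decay that the predictor models exactly.
speech = [1.0, 0.5, 0.25, 0.125]
errors = prediction_error(speech, coeffs=[0.5])
print(errors)
```

In this toy example the error is large only at the sample where the excitation occurs and is zero where the signal merely continues its predictable decay, which is the behavior the pitch detector relies on.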
It is of importance to recognize that the values of signal samples
are predicted largely on the basis of their formant constituency.
Predicted signals, therefore, represent essentially the formant
structure of the input signal. Since the predicted signal values
are subtracted from actual signal values, the prediction error
signal at the output of subtractor network 16 is essentially devoid
of all formant information. Yet, the prediction error signal has
been found to preserve, and indeed to denote, the pitch character
of the applied signal.
Prediction error signals from subtractor 16 are passed through
low-pass filter 17. Filter 17 is constructed with a relatively low
cutoff frequency since the fundamental pitch of the applied signal
generally is in the lower portion of the band. Elimination of
higher frequency portions aids in isolating the pitch signal.
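The text does not specify the design or cutoff frequency of filter 17. Purely as an illustration of low-pass filtering of the prediction error samples, the sketch below uses a simple moving average; the filter type, the width parameter, and the function name are all assumptions.

```python
def moving_average_lowpass(samples, width=5):
    """Smooth samples with a running mean over the last `width` samples.

    Assumption: a moving average stands in for filter 17, whose actual
    design is not given in the text.
    """
    out = []
    running = 0.0
    for n, x in enumerate(samples):
        running += x
        if n >= width:
            running -= samples[n - width]   # drop the sample leaving the window
        out.append(running / width)
    return out

# A rapidly alternating (high-frequency) input is suppressed,
# while a constant (low-frequency) input passes unchanged once the window fills.
alternating = moving_average_lowpass([1.0, -1.0] * 8, width=4)
constant = moving_average_lowpass([1.0] * 10, width=4)
print(alternating[-1], constant[-1])
```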
In accordance with the invention, the positions of individual pitch
pulses in the applied signal are determined by locating the samples
for which the prediction error is large. Samples delivered from
filter 17 thus have amplitudes that are proportional to the
difference between the applied signal sample and the predicted
signal. It is necessary, therefore, only to seek the fundamental
frequency of the prediction error signal. This may be done using
a fundamental frequency detector 18 of any desired
construction. A suitable detector includes a half-wave rectifier
19, employed to retain positive peaks only of the signal in order
to simplify later operations. The rectified signal is delivered to
peak picking network 20, which seeks the largest sample in each
frame of signals. Such peak picking arrangements are well known to
those skilled in the art and are frequently used in pitch detection
arrangements, particularly those of the cepstrum type. Peak signals
thus developed are passed through threshold detector 21, adjusted
to a level selected to prevent minor peaks from reaching the output
of the analyzer. The threshold is adjusted to accommodate the true
fundamental frequency peaks determined, for example, from
experience. The resulting sequence of pitch pulses is indicative of
the fundamental frequency or period of the applied speech signal
and may be used in any desired fashion.
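The chain of half-wave rectifier 19, peak picking network 20, and threshold detector 21 can be sketched as follows. The function names, the frame length used for peak picking, and the threshold value are hypothetical; the text leaves the threshold to be set from experience.

```python
def half_wave_rectify(samples):
    """Keep positive values and replace negative values with zero."""
    return [x if x > 0 else 0.0 for x in samples]

def pick_frame_peak(frame):
    """Return (index, value) of the largest sample in a non-empty frame."""
    index = max(range(len(frame)), key=lambda i: frame[i])
    return index, frame[index]

def detect_pitch_pulses(error_samples, frame_len, threshold):
    """Return positions of per-frame peaks that reach the threshold."""
    rectified = half_wave_rectify(error_samples)
    pulses = []
    for start in range(0, len(rectified), frame_len):
        index, value = pick_frame_peak(rectified[start:start + frame_len])
        if value >= threshold:
            pulses.append(start + index)
    return pulses

# Example: small background error, three large positive peaks, one negative spike.
error = [0.1] * 150
for position in (5, 55, 105):
    error[position] = 1.0
error[30] = -2.0                      # removed by the rectifier
pulses = detect_pitch_pulses(error, frame_len=50, threshold=0.5)
print(pulses)
```

In the example the detected pulses are 50 samples apart, which at a sampling rate of 10 kHz would correspond to a pitch period of 5 msec.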
Alternatively, as previously described in the art, the fundamental
frequency detector may include an autocorrelator followed by a peak
picker and a threshold detector.
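A minimal sketch of this autocorrelation alternative is given below. The function name, the lag search range, the normalization by signal energy, and the threshold value are assumptions introduced for illustration.

```python
def autocorrelation_pitch_lag(error_samples, min_lag, max_lag, threshold=0.3):
    """Return the lag with the largest normalized autocorrelation, or None.

    None is returned when the signal is silent or when the peak value
    stays below the threshold.
    """
    energy = sum(x * x for x in error_samples)
    if energy == 0.0:
        return None
    best_lag, best_value = None, 0.0
    n = len(error_samples)
    for lag in range(min_lag, max_lag + 1):
        value = sum(error_samples[i] * error_samples[i + lag]
                    for i in range(n - lag)) / energy
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag if best_value >= threshold else None

# Example: unit pulses every 50 samples; lags of 3 to 20 msec at 10 kHz.
pulse_train = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
lag = autocorrelation_pitch_lag(pulse_train, min_lag=30, max_lag=200)
print(lag)
```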
FIG. 2 illustrates a typical interval of a speech signal. A voiced
speech segment is shown in line A. Line B illustrates the sequence
of pulses derived from fundamental frequency detector 18 as the
output signal of the analyzer system. Line C of the figure
illustrates a typical unvoiced segment of speech.
To assure that a clear distinction between voiced and unvoiced
signal segments is available, it is in accordance with the
invention to produce a voiced-unvoiced decision signal. In
accordance with the invention, the voiced-unvoiced decision is
based on the ratio of the mean-squared value of speech samples to
the mean-squared value of prediction error samples. It has been
found that this ratio is considerably smaller for unvoiced speech
sounds than for voiced speech sounds, typically by a factor of
approximately 10.
Accordingly, speech samples from sampler 11 are delivered to
mean-squared network 22 and prediction error samples from
subtractor 16 are delivered to mean-squared network 23. Networks
for deriving a signal proportional to the mean-squared value of a sequence of
samples are well known in the art and are frequently used in
acoustic signal processing apparatus. A typical network includes an
arrangement for developing a signal proportional to the square of
each signal sample, an adding network for summing a sequence of
squared signal values, and a divider network for developing a
signal proportional to the average, or mean value, of the summed
squared signals.
Two signals proportional, respectively, to the mean-squared value
of speech samples and the mean-squared value of prediction error
samples are delivered to divider network 24 which produces as its
output the quotient of the two signal values. The quotient signal
is thereupon delivered to threshold detector 25, which is arranged
to develop a first signal for quotient values greater than 10, as
an indication of a voiced signal interval, and a second signal for
quotients less than 10, as an indication of an unvoiced signal
interval. Output signals from detector 25 may be used in any
desired fashion to indicate the voicing character of the input
signal.
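The voicing decision formed by mean-squared networks 22 and 23, divider network 24, and threshold detector 25 can be sketched as below. The function names are hypothetical. Classifying a ratio exactly equal to the threshold as voiced follows claim 7, and the handling of a zero error power is an assumption, since the text does not address that case.

```python
def mean_square(samples):
    """Return the average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)

def is_voiced(speech_samples, error_samples, threshold=10.0):
    """Classify a frame as voiced when the power ratio reaches the threshold."""
    error_power = mean_square(error_samples)
    speech_power = mean_square(speech_samples)
    if error_power == 0.0:
        # Assumption: no prediction error means voiced unless the frame is silent.
        return speech_power > 0.0
    return speech_power / error_power >= threshold

# Example: ratio 16 is classified as voiced, ratio 4 as unvoiced.
voiced = is_voiced([4.0, -4.0, 4.0, -4.0], [1.0, -1.0, 1.0, -1.0])
unvoiced = is_voiced([2.0, -2.0], [1.0, -1.0])
print(voiced, unvoiced)
```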
It will be evident to those skilled in the art that the fundamental
frequency determination arrangement of the invention, together with
the voicing decision arrangement, greatly enhances the reliability
with which two important characteristics of a speech signal are
determined. This increased reliability is due primarily to the
virtual absence of formant structure in the signal at the time the
pitch measurement is made. Furthermore, it will be apparent that
the fundamental frequency detector of the invention is particularly
applicable to use in a speech transmission system or a speech
analysis system in which a linear prediction arrangement is used.
In such cases, it is evident that the predicted signal
delivered to subtractor 16 may be derived from the predictor used
in coding the speech signals.
Furthermore, it will be apparent that the voicing decision signal
may be used in conjunction with other criteria, such as the
spectral balance of low frequencies relative to high frequencies, to
make the voiced-unvoiced decision more reliable.