Speech Signal Pitch Detector Using Prediction Error Data Patent Grant Atal June 19, 1 [Bell Telephone Laboratories, Incorporated]

Patent Grant 3740476

U.S. patent number 3,740,476 [Application Number 05/161,173] was granted by the patent office on 1973-06-19 for speech signal pitch detector using prediction error data. This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to Bishnu Saroop Atal.

United States Patent	3,740,476
Atal	June 19, 1973

SPEECH SIGNAL PITCH DETECTOR USING PREDICTION ERROR DATA

Abstract

Pitch periods in a complex speech signal are determined by evaluating the error in predicting the value of a sample of the signal on the basis of past sample values, and by locating samples for which the prediction error is large. Advantageously, the prediction error signal is devoid of all formant structure, so that there is no chance of confusing pitch signal peaks with formant peaks. A voiced-unvoiced decision is obtained from the ratio of the mean-squared value of the speech signal to the mean-squared value of the prediction error signal.

Inventors:	Atal; Bishnu Saroop (Murray Hill, NJ)
Assignee:	Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID:	22580131
Appl. No.:	05/161,173
Filed:	July 9, 1971

Current U.S. Class:	704/207; 704/219
Current CPC Class:	G10L 25/90 (20130101); G10L 25/93 (20130101)
Current International Class:	G10L 11/04 (20060101); G10L 11/00 (20060101); G10L 11/06 (20060101); G10l 001/04 ()
Field of Search:	179/1SA, 15.55R; 325/38A

References Cited

U.S. Patent Documents


3437757	April 1969	Coker
3405237	October 1968	David
2732424	January 1956	Oliver
3026375	March 1962	Graham
3420955	January 1969	Noll

Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Leaheey; Jon Bradford

Claims

What is claimed is:

1. A signal analyzer for determining the fundamental period of a speech signal, which comprises,

adaptive predictor means supplied with samples of said speech signal for predicting the present value of each sample on the basis of a weighted summation of a number of prior sample values of said speech signal,

means for subtracting said predicted speech value from the actual speech value to develop a difference signal, and

means for determining the fundamental frequency of said difference signal as an indication of the fundamental period of said speech signal.

2. A signal analyzer as defined in claim 1, wherein said means for determining the fundamental frequency of said difference signal comprises,

means for determining the frequency of occurrence of difference signal maxima above a prescribed threshold.

3. A signal analyzer as defined in claim 1, wherein said means for determining the fundamental frequency of said difference signal comprises,

means for autocorrelating said difference signal for developing an autocorrelation signal representative of the periodic character of said difference signal, and

means for detecting the location of the peak value of said autocorrelation signal.

4. Apparatus for determining the fundamental period of a speech signal, which comprises,

means for developing an estimate of the present value of a speech signal on the basis of past values of said speech signal,

means for developing a signal representative of the difference between said signal estimate and the true present value of said speech signal, and

means for determining the fundamental frequency of said difference signal to develop a signal representative of the fundamental period of said speech signal.

5. Apparatus for determining the fundamental period of a speech signal, which comprises,

adaptive predictor means supplied with samples of said speech signal for developing an estimate of the momentary value of said speech signal from previously supplied samples,

means for developing a prediction error signal from the difference between said predicted signal estimate and the corresponding momentary value of samples of said speech signal,

means for identifying prediction error samples whose magnitudes are above a prescribed threshold, and

means for utilizing the frequency of occurrence of said identified error samples as a measure of the fundamental period of said speech signal.

6. Apparatus for analyzing the character of a speech signal, which comprises, in combination,

predictor means supplied with samples of a speech signal for developing an estimate of the momentary value of said signal from previously supplied samples,

means for developing prediction error signal samples from the difference between samples of said signal estimate and the corresponding momentary value of samples of said speech signal,

means for identifying prediction error samples whose magnitudes are above a prescribed threshold,

means for developing a first signal proportional to the mean-squared value of said speech samples,

means for developing a second signal proportional to the mean-squared value of corresponding ones of said error samples,

means for developing a signal proportional to the ratio of said first to said second mean-squared signals,

means for utilizing the frequency of occurrence of said identified threshold error samples as a measure of the fundamental period of said speech signal, and

means for utilizing said ratio of first and second mean-squared signals as a measure of the voicing characteristic of said speech signal.

7. Apparatus for analyzing the character of a speech signal as defined in claim 6, wherein,

values of said ratio of mean-squared signals equal to or greater than a prescribed threshold are used to classify said speech signal as voiced, and

wherein values of said ratio of mean-squared signals less than said threshold are used to classify said speech signal as unvoiced.

8. In a pitch analysis arrangement for speech signals, the combination of,

means for developing a signal representative of the formant structure of an applied speech signal,

means for removing said formant representative signal from said speech signal to produce a signal essentially devoid of all formant information,

means for measuring the period of said formant devoid signal, and

means for determining the voicing character of said speech signal on the basis of the power in said speech signal and the power in said formant devoid signal.

Description

This invention is concerned with the analysis of complex signals, and particularly with the determination of the fundamental frequency, or period, of a complex periodic signal, such as a voiced speech signal. Its principal objectives are to simplify the measurement of pitch frequency and to improve the reliability of the measure.

BACKGROUND OF THE INVENTION

A number of arrangements for reducing the channel capacity required for the transmission of complex signals, such as speech signals, have been proposed. One of the best known of these is the vocoder. More recently, techniques for removing inherent signal redundancy through the use of linear prediction techniques have been described. In all of these arrangements, a speech wave is analyzed to determine its significant characteristics, and coded information concerning these characteristics is transmitted instead of the speech signal itself. At a receiver station a synthetic speech signal is developed from the coded information.

In general, a different set of coded signal information is employed in each type of bandwidth compression system. However, virtually all employ one characteristic of the speech signal, namely, its pitch frequency. This characteristic denotes the fundamental frequency at which the vocal cords vibrate during the production of different voiced speech sounds. Most speech bandwidth compression systems also employ coded information to identify a speech signal as voiced or unvoiced. Some combine the two forms of information so that the pitch signal inherently specifies the voicing condition.

FIELD OF THE INVENTION

A number of different proposals for automatically measuring and encoding the pitch characteristic of a speech signal are known and used in the art. Some rely on simple filtering, some on signal correlation, some on formant detection and tracking, and others on a transformation of the logarithm of the spectrum of a speech signal, the so-called cepstrum of the signal. All of these arrangements, however, operate on the speech signal itself and in one way or another strive to find peak values in the signal, or in a modification of it, which identify the pitch characteristic. Unfortunately, peaks due to formants, particularly the first formant of a speech signal, are often stronger than a peak developed to indicate pitch. If the two peaks are close together, it is difficult to determine which is which. Consequently, even the most sophisticated pitch detectors are subject to error and do not always correctly characterize the pitch frequency of a signal.

It is thus another object of this invention to capitalize on a unique property of a voiced speech signal to develop a measure of the pitch frequency of the signal that is unambiguous and which is entirely independent of the formant character of the speech signal.

SUMMARY OF THE INVENTION

Analysis of a complex speech signal to determine its pitch frequency is, in accordance with the invention, based on an analysis of the error between a predicted value of the speech signal based on its past sample values and its actual value at that moment. The time interval represented by the number of samples used to obtain the predicted value is typically 1 msec. Due to the short memory used in the prediction process, the predicted signal values represent, in large measure, the formant structure of the speech signal. The pitch analysis arrangement of the invention is particularly effective because, in developing a difference signal, i.e., the prediction error signal, the formant structure of the signal is removed from the input signal. Yet, since the pitch period in speech signals ranges typically from 3 msec to 20 msec, the prediction of the pitch structure, based on 1 msec of past speech, is completely negligible. Thus, pitch information is retained in the prediction error signal. Consequently, there is little or no interference from the formant structure and a peak picking operation is effective in developing a measure of the pitch character of the input signal.

A feature of the invention is the additional use of prediction error samples to develop a voiced-unvoiced signal indication. In accordance with the invention, a voicing decision is based on the ratio of the mean-squared value of input signal samples to the mean-squared value of corresponding prediction error samples.

This invention will be more fully understood from the following detailed description of an illustrative embodiment of it taken together with the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block schematic diagram of a speech signal analysis system which illustrates the principles of the invention, and

FIG. 2 is an illustration of the waveform of a segment of a voiced speech signal, the positions of detected pitch pulses in the voiced speech signal, as shown by vertical lines, and a segment of unvoiced speech.

DETAILED DESCRIPTION

A signal analysis arrangement which illustrates the principles of the invention is illustrated in FIG. 1. Speech signals supplied from any desired source are delivered to the analyzer and passed through low-pass filter 10. Filter 10 typically has a cutoff frequency in the neighborhood of 5 kHz. The resultant signal is then sampled at a frequency of approximately 10 kHz in sampler 11 under control of signals from clock 12. Speech samples, s_n, thus derived are supplied to storage unit 13 which maintains them in order, typically in blocks of 200 samples, i.e., s_1, s_2, ..., s_200. Blocks or frames of samples are periodically keyed out of storage unit 13, for example, under control of a signal from clock 12, and delivered to adaptive predictor 14, prediction parameter computer 15, and to subtractor network 16.
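The framing performed by storage unit 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_into_frames` and `SAMPLE_RATE_HZ` are hypothetical, the input is assumed to be already low-pass filtered and sampled, and dropping an incomplete final frame is a choice made here, not something the text specifies.

```python
SAMPLE_RATE_HZ = 10_000   # approximate sampling rate given in the text
FRAME_LEN = 200           # typical block size given in the text


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Group samples into consecutive frames of frame_len samples.

    Assumption: an incomplete trailing frame is dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Duration of one frame in milliseconds: 200 samples at 10 kHz.
frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ

# Stand-in data: 450 dummy samples yield two complete frames.
frames = split_into_frames(list(range(450)))
```

With the stated figures, each frame spans 20 msec, which matches the readjustment interval mentioned later in the description.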

Adaptive predictor 14 operates on supplied signal samples to predict the present value of each sample on the basis of a weighted summation of a number of prior sample values. The prediction operation is carried out on a sample-by-sample basis and predictor 14 is periodically supplied with a new frame of samples from storage unit 13. An adaptive predictor suitable for use in the system of this invention is described in detail in a copending application of B. S. Atal, Ser. No. 753,408, filed Aug. 19, 1968, now U.S. Pat. No. 3,631,520.
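The weighted summation carried out by predictor 14 might be sketched as follows. The function name `predict_frame` is hypothetical, and treating samples before the start of the frame as zero is an assumption made for this sketch; the referenced patent, not this code, defines the actual predictor.

```python
def predict_frame(samples, weights):
    """Predict each sample as a weighted sum of the preceding samples.

    weights[0] applies to the immediately preceding sample, weights[1]
    to the one before that, and so on.
    Assumption: samples before the start of the frame are taken as zero.
    """
    order = len(weights)
    predicted = []
    for n in range(len(samples)):
        past = [samples[n - k] if n - k >= 0 else 0.0 for k in range(1, order + 1)]
        predicted.append(sum(w * s for w, s in zip(weights, past)))
    return predicted


# A signal that doubles every sample is predicted exactly by one weight of 2,
# apart from the first sample, which has no history.
example = predict_frame([1.0, 2.0, 4.0, 8.0], [2.0])
```

At a 10 kHz sampling rate, the 1 msec prediction memory mentioned in the summary would correspond to roughly 10 weights; that figure is inferred here and is not stated in the text.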

To accommodate the constantly changing character of the input speech signal, predictor 14 is controlled to adapt it to the current signal condition. It has been found sufficient to readjust the values of the parameters used to control the predictor at intervals comparable to those of a pitch period of the signal. Since the exact pitch interval is not available (although the pitch output signal of the system may be used in a feedback arrangement to approximate the interval of a later pitch period), readjustment of the parameter values at intervals corresponding approximately to the time of 200 samples is entirely satisfactory. This corresponds to a time interval of approximately 20 msec.

Prediction parameter computer 15 thus operates on applied speech samples from unit 13 to develop a sequence of parameter signals a = a_1, a_2, ..., a_n, which are used periodically to adjust predictor 14. Parameter values a are selected to minimize the mean-squared prediction error of the system. The relation of parameter signals a to the input signal, their development, and the manner in which they are used to control the predictor are explained in detail in the above-mentioned copending patent application. Parameter signals from computer 15 are developed well in advance of the time that a block of signals is processed in predictor 14 because of the delay inherent in the prediction operation. Typically, parameter control signals are developed within an interval corresponding to the time of approximately 60 samples.
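One common way to choose weights that minimize the mean-squared prediction error over a frame is to solve the autocorrelation normal equations with the Levinson-Durbin recursion. The sketch below uses that technique as a stand-in; the text defers the actual computation to the referenced patent, so this should not be read as the method used by computer 15. The names `autocorrelation` and `predictor_weights` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of the frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def predictor_weights(frame, order):
    """Levinson-Durbin solution of the autocorrelation normal equations.

    Returns [a_1, ..., a_order] for the predictor
    s_hat[n] = a_1*s[n-1] + ... + a_order*s[n-order].
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order          # silent frame: nothing to predict
    a = [0.0] * (order + 1)           # a[0] is unused
    err = r[0]                        # prediction error energy so far
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
        if err <= 0.0:                # frame is perfectly predictable; stop
            break
    return a[1:]


# A decaying exponential 0.5**n is predicted by a single weight close to 0.5.
weights = predictor_weights([0.5 ** n for n in range(50)], order=2)
```

For the exponential test frame, the first weight comes out near 0.5 and the second near zero, as expected for a signal that a single prior sample already predicts.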

Sample values developed by predictor 14 are subtracted in network 16 from the actual value of corresponding signal samples delivered from storage unit 13 to the subtractor. The resultant difference signal represents the error in predicting the value of the signal. It is accordingly called a "prediction error" signal. Evidently, appropriate delay is provided, for example, in the readout of samples from storage unit 13 or in their delivery to subtractor 16, to allow time for all predictor operations to be completed. Suffice it to say that all of the described operations are carried on in synchronism in a conventional manner.

It is of importance to recognize that the values of signal samples are predicted largely on the basis of their formant constituency. Predicted signals, therefore, represent essentially the formant structure of the input signal. Since the predicted signal values are subtracted from actual signal values, the prediction error signal at the output of subtractor network 16 is essentially devoid of all formant information. Yet, the prediction error signal has been found to preserve, and indeed to denote, the pitch character of the applied signal.
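A small synthetic experiment illustrates this property. In the sketch below, a pulse train with a period of 50 samples excites a one-pole resonance that stands in, very crudely, for formant structure; subtracting the one-weight prediction removes the resonance and leaves the pulses. The names `prediction_error` and `synthetic_voiced`, the decay value 0.9, and the 0.5 cutoff used to list the pulses are all assumptions made for the illustration.

```python
def prediction_error(samples, weights):
    """Actual sample minus the weighted sum of its predecessors.

    Assumption: samples before the start of the sequence count as zero.
    """
    errors = []
    for n, actual in enumerate(samples):
        predicted = sum(w * samples[n - k]
                        for k, w in enumerate(weights, start=1) if n - k >= 0)
        errors.append(actual - predicted)
    return errors


def synthetic_voiced(n_samples, period, decay):
    """Pulse train of the given period passed through a one-pole resonance."""
    out, prev = [], 0.0
    for n in range(n_samples):
        pulse = 1.0 if n % period == 0 else 0.0
        prev = decay * prev + pulse
        out.append(prev)
    return out


speech = synthetic_voiced(200, period=50, decay=0.9)
residual = prediction_error(speech, [0.9])

# Samples with a large prediction error mark the pitch pulses.
pulse_positions = [n for n, e in enumerate(residual) if abs(e) > 0.5]
```

Here the large-error samples fall 50 samples apart, which is the pitch period of the synthetic signal, while the error between pulses is zero.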

Prediction error signals from subtractor 16 are passed through low-pass filter 17. Filter 17 is constructed with a relatively low cutoff frequency since the fundamental pitch of the applied signal generally is in the lower portion of the band. Elimination of higher frequency portions aids in isolating the pitch signal.

In accordance with the invention, the positions of individual pitch pulses in the applied signal are determined by locating the samples for which the prediction error is large. Samples delivered from filter 17 thus have amplitudes that are proportional to the difference between the applied signal sample and the predicted signal. It is necessary, therefore, only to seek the fundamental frequency of the prediction error signal. This may be done using any desired fundamental frequency detector 18 of any desired construction. A suitable detector includes a half-wave rectifier 19, employed to retain positive peaks only of the signal in order to simplify later operations. The rectified signal is delivered to peak picking network 20, which seeks the largest sample in each frame of signals. Such peak picking arrangements are well known to those skilled in the art and are frequently used in pitch detection arrangements, particularly those of the cepstrum type. Peak signals thus developed are passed through threshold detector 21, adjusted to a level selected to prevent minor peaks from reaching the output of the analyzer. The threshold is adjusted to accommodate the true fundamental frequency peaks determined, for example, from experience. The resulting sequence of pitch pulses is indicative of the fundamental frequency or period of the applied speech signal and may be used in any desired fashion.
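The chain of rectifier 19, peak picker 20, and threshold detector 21 could be sketched as below. The function names are hypothetical, and the search window `frame_len` and the `threshold` value are parameters assumed for the sketch; the text leaves the threshold to experience and does not fix the window beyond "each frame."

```python
def half_wave_rectify(samples):
    """Keep positive values and replace negative values with zero."""
    return [s if s > 0.0 else 0.0 for s in samples]


def pick_frame_peak(frame):
    """Return (index, value) of the largest sample in the frame."""
    idx = max(range(len(frame)), key=lambda i: frame[i])
    return idx, frame[idx]


def detect_pitch_pulses(error_signal, frame_len, threshold):
    """Return positions of per-frame peaks that reach the threshold."""
    rectified = half_wave_rectify(error_signal)
    pulses = []
    for start in range(0, len(rectified), frame_len):
        idx, value = pick_frame_peak(rectified[start:start + frame_len])
        if value >= threshold:
            pulses.append(start + idx)
    return pulses


# Toy error signal: the negative peak and the minor peak are both rejected.
toy_error = [0.0, 0.0, 1.0, 0.0, -2.0,
             0.0, 0.0, 0.9, 0.0, 0.0,
             0.1, 0.0, 0.0, 0.0, 0.0]
pulses = detect_pitch_pulses(toy_error, frame_len=5, threshold=0.5)
```

The spacing between successive detected pulses, divided by the sampling rate, gives an estimate of the pitch period.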

Alternatively, as previously described in the art, the fundamental frequency detector may include an autocorrelator followed by a peak picker and a threshold detector.
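A minimal sketch of this autocorrelation alternative follows. The function name `autocorr_pitch_period`, the normalization by signal energy, and the default threshold of 0.3 are assumptions; the lag limits of 30 and 200 samples used in the example correspond to the 3 msec to 20 msec pitch range given in the summary, at a 10 kHz sampling rate.

```python
def autocorr_pitch_period(error_signal, min_lag, max_lag, threshold=0.3):
    """Return the lag of the largest normalized autocorrelation peak.

    Returns None when the signal is silent or when no peak reaches the
    threshold (an assumed way of rejecting non-periodic input).
    """
    n = len(error_signal)
    energy = sum(x * x for x in error_signal)
    if energy == 0.0:
        return None
    best_lag, best_val = None, 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        val = sum(error_signal[i] * error_signal[i + lag]
                  for i in range(n - lag)) / energy
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag if best_val >= threshold else None


# Pulse train with a 50-sample period (5 msec at 10 kHz).
pulse_train = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
period = autocorr_pitch_period(pulse_train, min_lag=30, max_lag=200)
```

Because the autocorrelation of a finite pulse train decreases at each multiple of the period, the first multiple gives the largest peak in this example.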

FIG. 2 illustrates a typical interval of a speech signal. A voiced speech segment is shown in line A. Line B illustrates the sequence of pulses derived from fundamental frequency detector 18 as the output signal of the analyzer system. Line C of the figure illustrates a typical unvoiced segment of speech.

To assure that a clear distinction between voiced and unvoiced signal segments is available, a voiced-unvoiced decision signal is also produced. In accordance with the invention, the voiced-unvoiced decision is based on the ratio of the mean-squared value of speech samples to the mean-squared value of prediction error samples. It has been found that this ratio is considerably smaller for unvoiced speech sounds than for voiced speech sounds, typically by a factor of approximately 10.

Accordingly, speech samples from sampler 11 are delivered to mean-squared network 22 and prediction error samples from subtractor 16 are delivered to mean-squared network 23. Networks for deriving a signal proportional to the mean-squared value of a sequence of samples are well known in the art and are frequently used in acoustic signal processing apparatus. A typical network includes an arrangement for developing a signal proportional to the square of each signal sample, an adding network for summing a sequence of squared signal values, and a divider network for developing a signal proportional to the average, or mean value, of the summed squared signals.

Two signals proportional, respectively, to the mean-squared value of speech samples and the mean-squared value of prediction error samples are delivered to divider network 24 which produces as its output the quotient of the two signal values. The quotient signal is thereupon delivered to threshold detector 25, which is arranged to develop a first signal for quotient values greater than 10, as an indication of a voiced signal interval, and a second signal for quotients less than 10, as an indication of an unvoiced signal interval. Output signals from detector 25 may be used in any desired fashion to indicate the voicing character of the input signal.
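The voicing decision made by networks 22 through 25 reduces to a ratio test, sketched below. The names `mean_square` and `is_voiced` are hypothetical. Two choices here are assumptions: a ratio exactly equal to the threshold is classed as voiced, following claim 7, and a frame with zero error energy is classed as voiced only if the speech itself is not silent.

```python
def mean_square(samples):
    """Square each sample, sum the squares, and divide by the count."""
    return sum(s * s for s in samples) / len(samples)


def is_voiced(speech, error, threshold=10.0):
    """Classify a frame from the speech-to-error mean-squared ratio."""
    speech_ms = mean_square(speech)
    error_ms = mean_square(error)
    if error_ms == 0.0:
        # Assumption: the text does not cover a zero-energy error signal.
        return speech_ms > 0.0
    return speech_ms / error_ms >= threshold


# Ratio 100: speech is far more energetic than its prediction error.
voiced_example = is_voiced([10.0, -10.0, 10.0, -10.0], [1.0, -1.0, 1.0, -1.0])
# Ratio 4: prediction removes little energy, as for noise-like sounds.
unvoiced_example = is_voiced([2.0, -2.0], [1.0, -1.0])
```

A large ratio means the predictor accounts for most of the signal energy, which is the behavior the text associates with voiced sounds.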

It will be evident to those skilled in the art that the fundamental frequency determination arrangement of the invention, together with the voicing decision arrangement, greatly enhances the reliability with which two important characteristics of a speech signal are determined. This increased reliability is due primarily to the virtual absence of formant structure in the signal at the time the pitch measurement is made. Furthermore, it will be apparent that the fundamental frequency detector of the invention is particularly applicable to use in a speech transmission system or a speech analysis system in which a linear prediction arrangement is used. In such cases, it is evident that the predicted signal delivered to subtractor 16 may be derived from the predictor used in coding the speech signals.

Furthermore, it will be apparent that the voicing decision signal may be used in conjunction with other criteria, such as the spectral balance of low frequencies relative to high frequencies, to make the voiced-unvoiced decision more reliable.

* * * * *