Speech Analyzer-synthesizer System Employing Improved Formant Extractor

Rabiner, et al. March 14, 1972

Patent Grant 3649765

U.S. patent number 3,649,765 [Application Number 04/872,050] was granted by the patent office on 1972-03-14 for speech analyzer-synthesizer system employing improved formant extractor. This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to Lawrence R. Rabiner, Ronald W. Schafer.



SPEECH ANALYZER-SYNTHESIZER SYSTEM EMPLOYING IMPROVED FORMANT EXTRACTOR

Abstract

An important step in speech signal analysis is the identification of formant frequencies of voiced speech. Formant data are necessary in synthesizers used, for example, in resonance vocoders. To derive these data, i.e., to obtain an estimate of the pitch period of the signal and its spectral envelope, a cepstrum of a speech signal is used. The lowest three formants of a voiced speech signal are then estimated from a smoothed spectral envelope using constraints on formant frequency ranges and on the relative levels of spectral peaks at the formant frequencies. These constraints permit detection even in cases where formants are too close together to be resolved in the initial spectral envelope.


Inventors: Rabiner; Lawrence R. (Chatham, NJ), Schafer; Ronald W. (New Providence, NJ)
Assignee: Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID: 25358730
Appl. No.: 04/872,050
Filed: October 29, 1969

Current U.S. Class: 704/209
Current CPC Class: G10L 19/02 (20130101)
Current International Class: G10L 19/00 (20060101); G10L 19/02 (20060101); G10l 001/02 ()
Field of Search: 179/1 SA, 15.55

References Cited [Referenced By]

U.S. Patent Documents
2938079 May 1960 Flanagan
3328525 June 1967 Kelly
3448216 June 1969 Kelly
3493684 February 1970 Kelly
3190963 June 1965 David
3268660 August 1966 Flanagan
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Leaheey; Jon Bradford

Claims



What is claimed is:

1. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:

means supplied with a speech signal for developing a signal representative of a smoothed spectral envelope thereof,

means supplied with said spectral envelope signal for developing signals representative of the location and amplitude of peaks within assigned frequency ranges in said speech signal, said ranges being selected to encompass a prescribed frequency range of said speech signal with predetermined segments of overlap, and

means responsive to said peak representative signals for selecting as formants of said speech signal the highest amplitude peaks according to location within said ranges.

2. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:

means for developing a signal representative of the cepstrum of an applied speech signal,

means for developing from said cepstrum signal a signal representative of the spectral envelope of said speech signal,

means for evaluating said spectral envelope signal along a contour close to the pole locations in the complex frequency plane thereby to produce a signal in which spectrum peaks are sharpened,

means responsive both to said spectral envelope signal and selectively to said cepstrum signal for developing signals representative of the location and amplitude of all peaks in said spectral envelope signal,

means responsive to said peak location signal for selecting and ordering in frequency the highest of said amplitude peaks, and

means for identifying said selected and ordered peak location signals as formants of said applied signal.

3. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:

means for developing a signal representative of the smoothed spectral envelope of an applied speech signal,

means for locating all peaks in said spectral envelope signal,

means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges, said ranges being selected to encompass a selected frequency range of said applied signal with prescribed segments of overlap,

means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,

means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and

means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.

4. Apparatus as defined in claim 3, in combination with,

spectral analysis means for enhancing said peaks in said spectral envelope signal.

5. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:

means for developing a signal representative of the pitch period of an applied speech signal,

means for selectively weighting said applied speech signals with a symmetric window function of said pitch period signal,

means supplied with said weighted speech signal for developing a signal representative of the smoothed spectral envelope of said applied speech signal,

means for locating all peaks in said spectral envelope signal,

means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges,

means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,

means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and

means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.

6. Apparatus as defined in claim 5 wherein said applied speech signals are weighted with a window function with a duration of approximately three times the pitch period of said applied speech signals.

7. Apparatus for analyzing speech frequency signals, which comprises:

means for counting the zero axis crossings of an applied speech signal,

means for developing a signal representative of the cepstrum of said speech signal, and

means responsive to said zero crossing count and to said cepstrum signal for determining therefrom the voiced-unvoiced character of said speech signal and, if voiced, the pitch period of said signal.

8. A speech signal analyzer system for producing coded signals from applied speech signals, which comprises:

means for developing a signal representative of the smoothed spectrum of an applied speech signal,

means for locating all peaks in said spectrum,

means responsive to said located peaks and selectively to said spectrum for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said spectrum peaks in a prescribed order as formants of said applied signal,

means responsive to said spectrum for developing control signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,

means for developing a signal representative of the cepstrum of said applied speech signal,

means responsive to a count of zero axis crossings in said applied signal and to said cepstrum for developing a signal representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,

means responsive to said peak signals for developing a signal representative of the pole and zero locations for unvoiced intervals of said applied signal, and

means for utilizing all of said developed signals as a coded representation of said applied speech signal.

9. A speech signal analyzer-synthesizer system with reduced channel bandwidth requirements, which comprises:

at an analyzer station,

means for developing a signal representative of the smoothed spectrum of an applied speech signal,

means for locating all peaks in said spectrum,

means responsive to an indication of said located peaks and selectively to said spectrum signal for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said amplitude peaks in a prescribed order as formants of said applied signal,

means responsive to said spectrum signal for developing signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,

means for developing a signal representative of the cepstrum of said applied signal,

means responsive to a count of zero axis crossings in said applied signal and to said cepstrum signal for developing signals representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,

means responsive to said peak signals for developing signals representative of the pole and zero locations for unvoiced intervals of said applied signal, and

means responsive to all of said developed signals for delivering them to a synthesizer station, and

at said synthesizer station,

means responsive to received unvoiced level control signals for adjusting the level of a source of noise signals,

a system of unvoiced resonant circuits energized by said adjusted noise signals,

means for adjusting said resonant system with said pole and zero location signals to produce an unvoiced signal,

generator means responsive to said pitch period control signal for developing pulses at pitch frequency,

means for adjusting the amplitude of said pulses according to said level control signal during voiced signals of said applied signal,

a system of resonant circuits energized by said control pulse signals and by said formant signals to produce a voiced signal,

means for combining said voiced and unvoiced signals,

means for shaping the spectrum of said combined signal, and

means for utilizing said shaped spectrum signal as a replica of said applied speech signal.
Description



This invention relates to the analysis and synthesis of speech in bandwidth compression systems. Subordinately, it relates to the identification and extraction of formants from continuous human speech.

BACKGROUND OF THE INVENTION

In order to make more economical use of the frequency bandwidth of speech transmission channels, a number of bandwidth compression arrangements have been devised for transmitting the information content of a speech wave over a channel whose bandwidth is substantially narrower than that required for analog transmission of the speech wave itself. Bandwidth compression systems typically include, at a transmitter terminal, an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals representative of selected information-bearing characteristics of the speech wave and, at a receiver terminal, a synthesizer for reconstructing from the control signals a replica of the original speech wave.

1. Field of the Invention

It has been demonstrated that a speech waveform can be constructed by means of an arrangement that corresponds generally to the structure of the human vocal tract. Speech is produced in such an arrangement by exciting a series or parallel connection of resonators either by random noise, to produce unvoiced sounds, by a quasi-periodic pulse train, to produce voiced sounds, or in some cases by a mixture of these sources, to produce voiced fricatives. To produce natural sounding speech, the mode of operation of the human vocal tract is simulated by continuously tuning the natural frequencies of the resonators. As tuned, resonances are established at selected frequencies to produce peaks or maxima in the amplitude spectrum of the reconstructed signal which correspond to the principal resonances, or formants, of the human vocal tract. Since the first three formants, in order of frequency, contribute most to the intelligibility of speech, it is common practice to transmit at least three formant control signals to shape an artificial spectrum at the synthesizer.

2. Discussion of the Prior Art

Since formants are effective parameters for the production of artificial human speech, they are used as control signals, for example, in such devices as the well-known resonance vocoder. A typical resonance vocoder is described in J. C. Steinberg, U.S. Pat. No. 2,635,146, issued Apr. 14, 1953. Further, since the quality of speech reconstructed by a resonance vocoder or the like is largely dependent on the proper identification of formant frequencies and locations, a number of techniques have been proposed for extracting formant information from a speech wave. One such proposal is described in J. L. Flanagan, U.S. Pat. No. 2,938,079, issued May 24, 1960. Further, electrical methods for speech synthesis, using formant data, are discussed in detail in Speech Analysis, Synthesis and Perception by J. L. Flanagan, Academic Press, Inc., 1965.

SUMMARY OF THE INVENTION

It is an object of this invention to improve the accuracy and efficiency with which formants are derived from a speech signal. It is another object to use these formants and other selected parameters to transmit, over a narrow band communication circuit, sufficient information with which to produce an accurate replica of an input speech signal.

These and other objects are achieved, in accordance with this invention, by determining, at a transmitter station, as a function of time, the pitch period, the amplitude of voiced and unvoiced excitation, the location of the lowest three formants for voiced speech, and the locations of a single pole and zero necessary for the synthesis of unvoiced speech. These data are suitable for transmission to a receiver station for use in the synthesis of speech. Since the system is not pitch-synchronous, an exact determination of pitch period is not required. Instead, several periods of speech may be examined at a time. Averaging of this sort has the advantage of eliminating the difficult problem of accurately determining pitch periods in the acoustic waveform.

The analysis of applied voiced speech thus involves two basic parts: first, an estimation of the pitch period and a computation of the spectral envelope of the applied signal, and, second, an estimation of formants from the spectral envelope. Estimation of the pitch period and the spectral envelope is accomplished through a computation of the cepstrum of a segment of the applied speech waveform. The cepstrum of a segment of sampled speech is defined as the inverse transform of the logarithm of the Fourier transform of that segment. Cepstral techniques for pitch period estimation have been described in "Cepstrum Pitch Determination" by A. M. Noll, Journal of the Acoustical Society of America, February 1967, at page 293. Previous investigations have shown that it is reasonable to assume that the logarithm of the Fourier transform (actually the logarithm of the z-transform in the case of sampled data) of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of the acoustic waveform. These two additive components can be separated by linear filtering of the logarithm of the transform. The assumption that the log magnitude is composed of two separate components is supported by investigation of models of the production of speech waveforms.
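As an illustration of this definition, the cepstrum computation can be sketched as follows. This is an illustrative Python sketch, not the patent's apparatus; the synthetic segment, the 512-sample length, the 50-sample pitch period, and the envelope decay constant are all assumptions chosen for demonstration:

```python
import numpy as np

def real_cepstrum(segment):
    """Cepstrum as defined in the text: the inverse transform of the
    logarithm of the (magnitude of the) Fourier transform."""
    log_mag = np.log(np.abs(np.fft.fft(segment)) + 1e-12)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real

# Synthetic "voiced" segment: a pulse train with a 50-sample period,
# convolved with a decaying envelope standing in for the combined
# glottal pulse / vocal tract impulse response h(nT).
n = np.arange(512)
period = 50
excitation = np.zeros(512)
excitation[::period] = 1.0
h = np.exp(-n / 20.0)
speech = np.convolve(excitation, h)[:512]

c = real_cepstrum(speech)
# Excitation contributes sharp cepstral peaks at multiples of the pitch
# period; the envelope term is concentrated near the origin, so a search
# away from the origin recovers the period.
peak_quefrency = 20 + int(np.argmax(c[20:256]))
```

The strongest cepstral peak away from the origin falls at the pitch period, which is the property exploited for pitch estimation.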

Accordingly, the pitch period is determined by searching the cepstrum for a strong peak in a region beyond the minimum expected pitch period. The spectral envelope is obtained by low-pass filtering of the log magnitude of the discrete Fourier transform. Formants are derived from the smoothed spectral envelope by locating all of the peaks (maxima) and identifying the location and amplitude level of each peak. This collection of peak locations and peak levels contains the spectral information necessary for a satisfactory estimation of formant values. The frequency region expected to contain the first three formants of a speech signal is then segmented into three regions. The lowest formant is searched for first, primarily in the lowest region; the second formant is then sought, primarily in the next highest region; and finally the third formant is sought in the highest of the three regions. Based on the amplitudes and frequencies of the peaks and their locations in the various regions or in regions of overlap, logical operations are performed by which spurious candidates are eliminated and the selected highest peaks are ordered and identified as speech formants. If the speech is unvoiced, only a single variable resonance peak and a single variable antiresonance are used to characterize the sound. They, too, are extracted from a cepstrally smoothed spectrum. In addition, a voiced-unvoiced decision is obtained from the presence or absence of a strong peak in the cepstrum together with a measure of a zero crossing count.
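The region-based peak selection can be sketched schematically as follows. The numeric range boundaries here are hypothetical (the patent's FIG. 8 defines the actual regions and overlaps), and the tie-breaking logic is greatly simplified relative to the flow charts of FIGS. 6 and 7:

```python
# Hypothetical formant search regions in Hz; the patent's FIG. 8 defines
# the actual region boundaries and their overlaps.
REGIONS = [(200.0, 900.0), (550.0, 2700.0), (1100.0, 2950.0)]

def pick_formants(peaks):
    """peaks: (frequency, level) maxima of the smoothed spectral
    envelope.  Each region takes its highest-level peak not already
    claimed by a lower formant; the patent's flow charts (FIGS. 6-7)
    add further logic for overlap regions and merged formants."""
    formants = []
    for lo, hi in REGIONS:
        candidates = [(f, a) for f, a in peaks
                      if lo <= f <= hi and f not in formants]
        if candidates:
            formants.append(max(candidates, key=lambda fa: fa[1])[0])
        else:
            formants.append(None)   # a real coder would resolve merged formants
    return formants

f1, f2, f3 = pick_formants([(300.0, 40.0), (750.0, 30.0),
                            (1400.0, 25.0), (2600.0, 18.0)])
```

With the illustrative peak list above, the three regions yield one formant each, ordered in frequency.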

In order to convert the control parameters of the analyzer to speech, a digital, serial, terminal analog speech synthesizer is employed. It models the transmission characteristic of the vocal tract from glottis to mouth. Synthesizers based on such models have been described previously in the art, for example, in Gerstman-Kelly, U.S. Pat. No. 3,158,685, issued Nov. 24, 1964, as well as elsewhere. The variable resonance circuits employed in the synthesis network and the manner of controlling them may be substantially identical to those described in the Gerstman-Kelly patent.

Certain other refinements to the generation of parameter signals are employed to improve the synthesis of speech, particularly in those cases in which formants in the applied speech are too close together in frequency to be resolved.

This invention will be more fully understood from the following detailed description taken together with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block schematic diagram of a speech analyzer-synthesizer which illustrates the principles of this invention;

FIG. 2 illustrates the structure of a spectral envelope estimator suitable for use in the system of FIG. 1;

FIG. 3 depicts a pitch detector which may be used in the practice of the invention;

FIG. 4 illustrates the functional operation of unvoiced spectrum coder 18 used in the apparatus of FIG. 1;

FIG. 5 illustrates the manner in which FIGS. 6 and 7 are interconnected;

FIGS. 6 and 7 illustrate by way of a functional flow chart the operation of voiced spectrum coder 19 used in the analyzer of FIG. 1;

FIG. 8 depicts typical regions in the spectrum of a speech signal likely to contain formants;

FIG. 9 illustrates the threshold level of signal F_2 relative to signal F_1, useful in explaining the operation of a voiced spectrum coder;

FIG. 10 illustrates a characteristic cepstrally smoothed log spectrum of a speech signal; and

FIG. 11 illustrates the manner in which formants in the log spectrum of the signal of FIG. 10 are emphasized by virtue of the operation of the apparatus of this invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates a band compression system including an analyzer at a transmitter station, and a synthesizer at a receiver station, which illustrates the principles of the invention. At the analyzer, an incoming speech wave from source 10, which may be a conventional transducer for converting speech sounds into a corresponding electrical wave, is applied both by way of modulator 11 to cepstrum analyzer 12, and to zero crossing counter 13. The purpose of the analyzer station is to develop control signals representative of the pitch period and formant locations for voiced speech, the resonance and antiresonance locations for unvoiced speech, and an indication of the magnitude of the buzz or hiss components during voiced and unvoiced speech intervals, respectively. A cepstrum analysis is particularly suitable for this purpose since it permits all of these parameter signals to be developed with a minimum of equipment complexity. Thus, estimation of the pitch period and the spectral envelope of the applied signal is accomplished from the computation of the cepstrum of a segment of the speech waveform. As discussed by Noll, the cepstrum of a signal is the spectrum of the logarithm of the power spectrum of a signal and exhibits a number of distinct peaks at pitch period intervals. Previous investigations have shown that the logarithm of the Fourier transform of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of the acoustic waveform. These two additive components, available in the cepstrum signal, may be separated by linear filtering.

Preparatory to developing the cepstrum of the applied signal, a segment of input speech, s(ξT+nT), is weighted, through the action of modulator 11, by a symmetric Hamming window function, w(nT), such that

x(nT) = s(ξT+nT) w(nT)

      = [p(ξT+nT) * h(nT)] · w(nT),  0 ≤ n < M,  (1)

where * denotes a discrete convolution, where ξT is the starting sample of the segment of the speech waveform, and where T is the sampling period in seconds. In equation (1), p(ξT+nT) represents a quasi-periodic impulse train appropriate for the particular segment being analyzed and h(nT) represents the triple convolution of the vocal tract impulse response with the glottal pulse and the radiation load impulse response. The window function w(nT) tapers to zero at each end to minimize the effects of a nonintegral number of pitch periods within the window. Since the window function varies slowly with respect to variations in the pitch of the applied signal, it is convenient to develop it, in function generator 23, from the indication of pitch period developed by pitch detector 14. Thus, the purpose of modulating the applied speech wave from transducer 10 by the window function in modulator 11 is to improve the approximation that a segment of voiced speech can be represented as a convolution of a periodic impulse train with a time-invariant vocal tract impulse response sequence. Preferably, the window function is specified by the equation:

w(nT) = 0.54 − 0.46 cos(2πnT/3τ),  0 ≤ nT ≤ 3τ

      = 0,  elsewhere.  (2)

The duration, 3τ, of the window is three times the previous estimate of pitch period. It is made dependent on the pitch period estimate, from detector 14, for two conflicting reasons. In order to obtain a strong peak in the cepstrum at the pitch period, it is necessary to have several periods of the waveform within the window. In contrast, in order to obtain strong peaks in the smoothed spectrum, only about two periods should be within the window, i.e., formants should not have changed appreciably within the time interval spanned by the window. Thus, an adaptive width window assures better estimates of pitch and formants since it presents a wider window for finding a strong peak at the pitch period, and a narrower window for finding strong, unambiguous indications of formants. The choice for window duration of three times the previous pitch period represents a compromise which has proven to be satisfactory.
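A minimal sketch of the window of equation (2), with the duration expressed in samples; the choice of 100 samples for the previous pitch estimate is illustrative only:

```python
import numpy as np

def pitch_adaptive_window(tau_samples):
    """Hamming-type window of equation (2): duration 3*tau, where tau is
    the previous pitch period estimate, both expressed here in samples."""
    length = 3 * tau_samples
    n = np.arange(length + 1)               # covers 0 <= nT <= 3*tau
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / length)

w = pitch_adaptive_window(100)              # previous pitch estimate: 100 samples
# A speech segment is weighted as x = w * segment (modulator 11), with
# the window regenerated each frame from the latest pitch estimate.
```

Note that this window tapers to 0.08 rather than exactly zero at its ends, the usual Hamming-window behavior.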

As noted earlier, the cepstrum developed at the output of analyzer 12 consists of two components. The component due primarily to the glottal wave and the vocal tract is concentrated in the region |nT| < τ, while the component due to the pitch occurs in the region |nT| ≥ τ, where τ is the pitch period during the segment being analyzed. The component due to excitation consists mainly of sharp peaks at multiples of the pitch period. Thus, pitch period can be determined by searching the cepstrum for a strong peak in the region nT > τ_min, where τ_min is the minimum expected pitch period. Signals from analyzer 12 are accordingly supplied as one input to pitch detector 14. Zero crossing count information developed by counter 13 is supplied as the other. This information is employed to provide an indication of the voiced or unvoiced character of the applied speech signal. Detector 14 produces a signal P, which is either equal to τ for voiced signals, in which case τ denotes the pitch period of the input signal, or zero for unvoiced signals. Details of a suitable pitch detector are described hereinafter with reference to FIG. 3.

Similarly, a suitable examination of the cepstrum from analyzer 12 is performed to develop an estimate of the spectral envelope of the applied signal. Although a variety of techniques for deriving such an envelope signal are known in the art, one suitable arrangement is described hereinafter in the discussion of the arrangement of FIG. 2.

Peaks in the spectral envelope are identified in peak picker network 16. Suitable peak picking networks have been described variously in the art. Peaks of the spectral envelope are delivered by way of gate 17 either to unvoiced spectrum coder 18 or to voiced spectrum coder 19. The choice is dependent upon whether the input speech signal is voiced or unvoiced. Accordingly, gate 17 is actuated by the voiced-unvoiced character of the pitch period signal developed by detector 14. If the input signal is voiced, values of τ, which appear as a "1" signal at the input of gate 17, open the gate so that peaks of the spectrum envelope are supplied to coder 19. If the input signal is unvoiced, a "0" pitch signal (absence of τ) is applied to the gate, the gate is switched to its other state, and peak signals are supplied to coder 18.

Unvoiced spectrum coder 18 and voiced spectrum coder 19 serve to analyze the peaks in the applied signal, to ascertain if they represent formants of the applied signal, to order those selected as formants, and to develop output control signals for use in synthesizing the applied wave. Two control signals, F_P and F_Z, are developed by coder 18, indicating for unvoiced speech the location of a single resonance and antiresonance in the speech signal, and three control signals, F_1, F_2 and F_3, are produced by coder 19, representative of the location of the first three formants of the applied signal. Coder 19, in addition to operating on the peaks of the spectrum envelope, also is supplied with cepstrum signals from analyzer 12.

Control signals A_V and A_N, representative of the level of buzz and hiss signals to be used in synthesis, are developed in control network 20 from the first spectrum signal produced by cepstrum analyzer 12. Apparatus for developing such level control signals is well known in the vocoder art; any form of buzz-hiss level analyzer may be employed.

Signals P, F_P, F_Z, F_1, F_2, F_3, A_V and A_N constitute all of the controls necessary for characterizing applied speech, both voiced and unvoiced. Together these signals require considerably less transmission bandwidth than would analog transmission of the applied speech signal. Accordingly, they may be delivered to multiplex unit 21, of any desired construction, wherein the group of control signals is prepared for transmission to a receiver station. At the receiver station, distributor unit 22, again of any desired construction, recovers the transmitted signals and makes them available for synthesis.

Received parameter signals may be used to control the production of artificial speech, using any well-known synthesis apparatus. For example, a formant vocoder synthesizer of the form described in the above-mentioned Gerstman and Kelly U.S. Pat. No. 3,158,685 is satisfactory. Typically, a formant vocoder synthesizer includes two systems of resonant circuits, one energized by a noise signal to produce unvoiced sounds, and the other energized by a periodic pulse signal to develop voiced sounds. In the illustrated apparatus, unvoiced resonant circuits 24 receive noise signals from generator 25 by way of modulator 26. The modulator is controlled by the hiss level control signal A_N and serves to control the amplitude of noise signals supplied to the input of the resonant circuits. Spectrum signals F_P and F_Z tune the resonant circuits 24 to shape the noise signals.

Voiced resonant circuits 27 are supplied, by way of modulator 28, with signals from pulse generator 29. Pulse generator 29, responsive to control signal P, develops a train of unit samples with the spacing between samples equal to τ, where τ is the value of P during voiced intervals. Such pulses are similar to the pulses of air passing through the vocal cords at the fundamental frequency of vibration, 1/τ, of the vocal cords. The amplitude of the resulting pulse train is controlled in modulator 28 by buzz level control signal A_V. Signal A_V represents the intensity of voicing. Resonant circuits 27 thus energized are controlled by formant control signals F_1, F_2 and F_3 to shape the train of pulse signals in a fashion not unlike the shaping of voiced excitation that takes place in the human vocal tract, and to produce voiced signals which correspond to those contained in the input signal. In the conventional manner, resonant system 27 includes additional fixed resonant circuits to provide high frequency shaping of the spectrum.
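The patent defers to the Gerstman-Kelly patent for the resonant circuits themselves. By way of background, a standard second-order digital resonator of the kind commonly used in terminal-analog synthesizers can be sketched as follows; the 500 Hz. formant frequency and 60 Hz. bandwidth are illustrative values, not taken from the patent:

```python
import numpy as np

def resonator_coeffs(f_hz, bw_hz, fs):
    """Coefficients of a standard two-pole digital resonance:
    y[n] = a0*x[n] + b1*y[n-1] + b2*y[n-2], tuned so the pole pair
    sits at frequency f_hz with approximate bandwidth bw_hz."""
    r = np.exp(-np.pi * bw_hz / fs)          # pole radius from bandwidth
    theta = 2.0 * np.pi * f_hz / fs          # pole angle from frequency
    b1 = 2.0 * r * np.cos(theta)
    b2 = -r * r
    a0 = 1.0 - b1 - b2                       # normalize for unity gain at DC
    return a0, b1, b2

def run_resonator(x, f_hz, bw_hz, fs):
    a0, b1, b2 = resonator_coeffs(f_hz, bw_hz, fs)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = a0 * x[n] + b1 * y[n - 1] + b2 * y[n - 2]  # y[-1], y[-2] read zeros
    return y

# Ring the resonator with a single impulse and locate its spectral peak,
# which should fall near the tuned formant frequency.
fs = 10000
impulse = np.zeros(1024)
impulse[0] = 1.0
ring = run_resonator(impulse, 500.0, 60.0, fs)
peak_hz = np.argmax(np.abs(np.fft.rfft(ring))) * fs / len(ring)
```

Retuning f_hz frame by frame from the F_1, F_2, F_3 control signals, and driving the cascade with the pitch-period pulse train, gives the general flavor of the voiced synthesis branch.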

Voiced and unvoiced replica signals from circuits 24 and 27 are combined in adder 30 and delivered for use, for example, to energize loud speaker 31. Additional spectral balance for the synthetic speech signals preferably is obtained by passing the signals from adder 30 through fixed spectral shaping network 32 before delivering them for use. This refinement aids in restoring realism to the reconstructed speech.

A form of spectral envelope estimator 15, suitable for use in the practice of the invention, is shown in FIG. 2. Low pass filtering of the cepstrum signal c(nT) is accomplished by first multiplying the supplied cepstrum by a function l(nT) of the form

where τ_1 + Δτ is less than the minimum pitch period that will be encountered. The sequence e(nT) is next added to the sequence c(nT)l(nT). The purpose of adding this component to the cepstrum is to equalize formant amplitudes. The sequence e(nT) consists of four nonzero values, as follows:

e(0)=0.551

e(T)=-0.570

e(2T)=-0.279

e(3T)=-0.0544 (4)

Functions l(nT) and e(nT) may be produced, respectively, by function generators 51 and 53, constructed to evaluate the above equations. Function generators suitable for making such evaluations are well known in the art. The signal from function generator 51 is applied to modulator 50 and the signal from function generator 53 is added to the resultant signal in adder 52. The sequence c(nT)l(nT) + e(nT) is then transformed, in discrete Fourier transformer 54, of any well-known construction, to produce an equalized spectral envelope.
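The estimator's sequence of operations, multiplying the cepstrum by a low-pass lifter, adding e(nT), and transforming back, can be sketched as follows. A plain rectangular lifter stands in for l(nT), whose exact tapered form (equation (3)) is not reproduced in this text, and the cutoff of 30 samples is an assumption:

```python
import numpy as np

E = np.array([0.551, -0.570, -0.279, -0.0544])   # e(0)...e(3T), equation (4)

def smoothed_log_envelope(c, cutoff):
    """Lifter the cepstrum c, add the equalizing sequence e(nT), and
    transform back to an equalized log spectral envelope."""
    l = np.zeros(len(c))
    l[:cutoff] = 1.0                      # keep low quefrencies...
    l[len(c) - cutoff + 1:] = 1.0         # ...including the symmetric half
    liftered = c * l
    liftered[:4] += E                     # equalize formant amplitudes
    return np.fft.fft(liftered).real

# Demonstrate on a noise segment: the liftered envelope varies far more
# slowly than the raw log spectrum it was derived from.
rng = np.random.default_rng(0)
signal = rng.standard_normal(512)
raw_log = np.log(np.abs(np.fft.fft(signal)) + 1e-12)
cep = np.fft.ifft(raw_log).real
env = smoothed_log_envelope(cep, 30)
```

Discarding the high-quefrency half of the cepstrum is exactly the low-pass filtering of the log magnitude described earlier; the e(nT) term tilts the envelope so that higher formant peaks are comparable in level to the first.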

Since the component of the cepstrum due to voiced excitation consists mainly of sharp peaks at multiples of the pitch period, the pitch period of the applied speech wave can be determined by searching the cepstrum for strong peaks in the region of the minimum expected pitch period. A suitable manner of doing this is shown in the detailed illustration of pitch detector 14 in FIG. 3. A zero crossing count from counter 13 (FIG. 1) is supplied to compare network 34, where the total count is matched against a threshold signal, typically with a value of 1,500 crossings per second. If the count is above the threshold, a signal Y=0 is delivered to logic OR gate 36. If the count is below the threshold, a signal Y=1 is delivered to gate 36. Cepstrum signals from analyzer 12 are delivered to peak picker network 37, which may be of the type described by Noll in U.S. Pat. No. 3,420,955, issued Jan. 7, 1969, or of any other desired form of construction. Cepstrum peaks are then compared in network 38 against a threshold established symbolically by potentiometer 39. If the amplitude of the detected peak is greater than the threshold, the comparator issues a signal X=1 to indicate that a voiced signal is present (because of the presence of a pitch period signal), but if the peak amplitude is below threshold, a signal X=0 is delivered to logic OR gate 36. Peak signals from peak picker 37 are also delivered to gate 40. Gate 40 is controlled by the output of OR gate 36 such that a cepstrum peak signal above threshold, or a zero crossing count signal below threshold, indicates a voiced signal. Gate 40 thereupon permits the peak location signal from picker 37 to be delivered as an output signal, designated P=τ. If neither of the threshold criteria is met, OR gate 36 issues a zero, gate 40 is not actuated, and no signal appears at the output of the gate. This constitutes the signal P=0 and indicates that the applied signal is unvoiced.
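The decision logic of FIG. 3 reduces to two comparisons and an OR. In sketch form, where the 1,500 crossings-per-second threshold is from the text but the cepstral peak threshold of 0.1 is an assumed placeholder for the setting of potentiometer 39:

```python
def pitch_decision(zero_crossings_per_sec, cep_peak_amp, cep_peak_quefrency,
                   zc_threshold=1500.0, peak_threshold=0.1):
    """Voicing decision of FIG. 3.  Returns P = tau (the quefrency of the
    strongest cepstral peak) for voiced frames and P = 0 for unvoiced.
    The 1,500 crossings/sec. figure is taken from the text; the cepstral
    peak threshold of 0.1 is a stand-in for the potentiometer 39 setting."""
    y = zero_crossings_per_sec < zc_threshold   # Y = 1: low zero-crossing rate
    x = cep_peak_amp > peak_threshold           # X = 1: strong cepstral peak
    if x or y:                                  # OR gate 36 opens gate 40
        return cep_peak_quefrency               # P = tau (voiced)
    return 0                                    # P = 0 (unvoiced)
```

Either condition alone suffices to declare the frame voiced, matching the OR-gate arrangement described above.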

From the derived peaks in the spectral envelope of the applied signal, it is in accordance with the invention to develop both signals for control of unvoiced resonant circuits at a synthesizer, and signals representative of the formant frequencies and locations for use in the control of voiced resonant circuits at the synthesizer. If the speech is unvoiced, as indicated by the P=0 signal from pitch detector 14 applied to gate 17, then only a single variable resonance peak is used to characterize the sound. It has not been found necessary to estimate a second unvoiced resonance in order to synthesize unvoiced sounds. The resonance peak for unvoiced sounds is extracted from peaks in the spectral envelope in coder 18. Since there is no pitch period for these sounds, a fixed number of data points is analyzed. The resonance peak used is the strongest spectral peak above 1,000 Hz. Although coder 18 may be implemented in any desired fashion to select and process the desired spectral peak, it has been found convenient to employ a special purpose computer programmed, for example, in accordance with the flow chart of steps shown in FIG. 4.

As indicated in FIG. 4, peaks of the spectral envelope signal delivered to coder 18 are processed by defining the frequency of the highest peak above 1,000 Hz. as FP. The difference between the spectral level at FP and the level at zero frequency is set equal, in z-transform notation (discussed hereinafter), to

ΔP = log|H(e^(j2π FP T))| - log|H(e^(j0))|. (5)

If ΔP is found to be greater than 13 db., FZ is assumed to be 500 Hz. and is thereby determined. If ΔP is not greater than 13 db. but is less than -20 db., FZ is assumed to be equal to FP and is thereby determined. If ΔP meets neither criterion, FZ is set equal to

FZ = (0.0065 FP + 4.5 - ΔP)(0.014 FP + 28), (6)

and is thereby determined. Thus, FP and FZ, representing a pole and a zero in the unvoiced spectrum, are available for use at the synthesizer in adjusting unvoiced resonant circuits 24. A suitable program listing for carrying out these operations on a computer is set forth in Appendix I, attached to this specification.
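The unvoiced coder of FIG. 4 is simple enough to state directly in code. The sketch below is not the Appendix I program; it assumes the envelope is supplied as a log-magnitude array in db. over a known frequency axis, and it approximates "highest peak above 1,000 Hz." by the envelope maximum in that band.

```python
import numpy as np

def unvoiced_pole_zero(freqs, log_env_db):
    """Sketch of unvoiced spectrum coder 18 (FIG. 4).
    freqs: frequency axis in Hz; log_env_db: smoothed log spectral
    envelope in db.  Returns (FP, FZ) per equations (5) and (6)."""
    band = freqs > 1000.0
    i = int(np.argmax(np.where(band, log_env_db, -np.inf)))
    f_p = freqs[i]                              # strongest level above 1 kHz
    delta_p = log_env_db[i] - log_env_db[0]     # eq. (5): level re zero freq.
    if delta_p > 13.0:
        f_z = 500.0
    elif delta_p < -20.0:
        f_z = f_p
    else:                                       # eq. (6)
        f_z = (0.0065 * f_p + 4.5 - delta_p) * (0.014 * f_p + 28)
    return f_p, f_z

# Toy usage: a flat envelope with a single strong peak at 2,000 Hz.
freqs = np.linspace(0.0, 5000.0, 501)
env = np.zeros(501)
env[200] = 5.0
fp, fz = unvoiced_pole_zero(freqs, env)
```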

Before proceeding to the details of the process for estimating the formant frequencies from peaks in the spectral envelope, in coder 19, it is helpful to present data relating to the properties of the speech spectrum. FIG. 8 shows the frequency ranges of the first three formants as determined from experimental data. Individual speakers may have formant ranges somewhat different from those shown in the figure and, if known, these ranges may be used for that speaker. It is apparent that there is a high degree of overlap between the ranges in which formants may be located. The first formant range is from 200 to 900 Hz. However, for approximately one-half of this range (500-900 Hz.) the second formant can overlap the first. Similarly, the second and third formant regions overlap from 1,100-2,700 Hz. Thus, the estimation of the formants is not simply a matter of locating peaks of the spectrum in nonoverlapping frequency bands.

Another property of speech pertinent to formant estimation is the relationship between formant frequencies and the relative amplitudes of formant peaks in the smooth spectrum. Considerable importance, therefore, is placed on a measurement of the level of the second formant peak (F2) relative to the level of the first formant peak (F1). The level measurement, Δ12, is defined, again in z-transform notation, as:

Δ12 = log|H(e^(j2π F2 T))| - log|H(e^(j2π F1 T))|, (7)

where F1 and F2 are the frequencies of the first and second formants, and |H(e^(j2π F T))| is the magnitude of the smoothed spectrum at F Hz. A careful analysis shows that Δ12 depends primarily upon F1 and F2 and is fairly insensitive to the bandwidths of all the formants and to the higher formant frequencies. FIG. 9 shows a curve of the minimum difference in formant level (in db.) between F2 and F1 as a function of the frequency F2. This curve takes into account equalization of the spectrum and serves as a threshold against which the difference between the level of a possible F2 peak and the level of an F1 peak is compared. The dependence of Δ12 on F1 is eliminated by assuming that F1 is fixed at its lower limit, F1MN. If the F1 dependence were to be accounted for, a family of curves similar in shape but displaced vertically from the one shown in FIG. 9 would be required. For a value of F1 greater than F1MN, the corresponding curve lies above the curve shown in FIG. 9. In FIG. 9, the curve is flat until 500 Hz. because F2 is assumed to be above this minimum value. The curve then decreases until about 1,500 Hz., reflecting the drop in F2 level as it gets farther away from F1. However, above 1,500 Hz. the curve rises again due to the increasing proximity of F2 and F3. The curve continues to rise until F2 reaches its maximum value, F2MX = 2,700 Hz., at which point F2 and F3 are maximally close (according to the simple model of fixed F3).

In order to estimate formants from the spectrum envelope, all peaks are located and the frequency and amplitude of each peak are recorded. The frequency region of the applied signal is segmented into three regions not unlike those depicted in FIG. 8. The lowest formant is searched for first, then F2, and finally F3. Based on the amplitudes and frequencies of the peaks, spurious candidates are eliminated, and ambiguities resulting, for example, from closely spaced formants are resolved by a logical examination of the detected peaks.
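The first step, locating every peak and tagging it with its frequency, level, and region, can be sketched as follows. Only the F1 range (200-900 Hz.) and F2MX = 2,700 Hz. are stated in the text; the other region limits below are assumptions for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

# Illustrative formant-region limits in Hz.  FIG. 8 fixes the F1 range
# at 200-900 Hz and F2MX at 2,700 Hz; the remaining limits are assumed.
F1_REGION = (200.0, 900.0)
F2_REGION = (500.0, 2700.0)
F3_REGION = (1100.0, 3500.0)

def spectral_peaks(freqs, log_env):
    """First step of the formant search: locate every peak of the
    smoothed envelope and record its frequency and level."""
    idx, _ = find_peaks(log_env)
    return [(freqs[i], log_env[i]) for i in idx]

def peaks_in_region(peaks, region):
    """Restrict candidate peaks to one of the formant regions."""
    lo, hi = region
    return [(f, lv) for f, lv in peaks if lo <= f <= hi]

# Toy usage: three isolated spikes at 300, 1,500, and 2,500 Hz.
freqs = np.arange(0.0, 5000.0, 10.0)
env = np.zeros(500)
env[[30, 150, 250]] = 1.0
peaks = spectral_peaks(freqs, env)
```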

In cases where F1, F2, and F3 are separated by more than about 300 Hz., there is no difficulty in resolving the corresponding peaks in the smoothed spectrum. However, when F1 and F2, or F2 and F3, get closer than about 300 Hz., the cepstral smoothing results in the peaks not being resolved. In these cases, a spectral analysis algorithm called the chirp z-transform (CZT) can be used to advantage. The CZT permits the computation of samples of the z-transform at equally spaced intervals along a circular or spiral contour in the z-plane. In particular, if F1 and F2 are close together, it is possible to compute the z-transform on a contour which passes closer to the pole locations than the unit circle, thereby enhancing the peaks in the spectrum and improving the resolution. For example, FIG. 10 shows a smoothed spectral envelope in which F1 and F2 are unresolved. In this case the parameters of the cepstral window function l(nT) were τ1 = 2 msec. and Δτ = 2 msec. FIG. 11 shows the results of a CZT analysis along a circular contour of radius e^(-0.0314) over the frequency range 0 to 900 Hz. with a resolution of about 10 Hz. The effect of the use of the contour which passes closer to the poles is evident in contrast to FIG. 10. A discussion of the CZT algorithm is given in "The Chirp z-Transform Algorithm and Its Application," by Rabiner, Schafer and Rader, Bell System Technical Journal, May-June 1969, at p. 1249.
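What the CZT computes can be shown by brute force: evaluating X(z) at equally spaced frequencies on a circle of radius less than one. The sketch below is a direct (O(NM)) evaluation, not the fast CZT algorithm of the cited paper; the 10 kHz sampling rate in the usage example is an assumption consistent with the patent's resolution figures.

```python
import numpy as np

def zt_on_contour(x, fs, f_lo, f_hi, n_points, radius):
    """Directly evaluate X(z) = sum_n x[n] z^-n at n_points equally
    spaced frequencies on a circle of the given radius in the z-plane
    (a brute-force stand-in for the chirp z-transform).  A radius less
    than 1 passes closer to the poles and sharpens spectral peaks."""
    n = np.arange(len(x))
    f = np.linspace(f_lo, f_hi, n_points)
    z = radius * np.exp(1j * 2 * np.pi * f / fs)   # contour samples
    # Outer product: rows are contour points, columns are time indices.
    return f, (z[:, None] ** (-n)) @ x

# Usage: 0-900 Hz at about 10 Hz resolution on radius e^-0.0314,
# assuming a 10 kHz sampling rate.
x = np.random.default_rng(0).standard_normal(200)
f, X = zt_on_contour(x, 10_000, 0.0, 900.0, 91, np.exp(-0.0314))
```

With radius 1.0 and a full frequency sweep, the evaluation reduces to the ordinary DFT, which is a convenient sanity check.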

Voiced spectrum coder 19, supplied with peaks of the spectral envelope during voiced speech intervals from gate 17 and with cepstrum signals c(nT) from analyzer 12, is accordingly programmed to take these characteristics of voiced speech into account. It serves to derive control signals F1, F2, F3 which specify formant frequencies and which are sufficient for controlling voiced resonant circuits 27 at a synthesizer. Again, the logical operations performed on the cepstrum and peak signals may be carried out using any desired form of apparatus. In practice, however, it has been found most convenient to employ a computer programmed in accordance with the steps set forth in the flow chart of FIGS. 6 and 7. Program listings for the steps of the flow chart appear in Appendix II of this specification.

Referring to FIGS. 6 and 7, the formants are picked in sequence beginning with F1. To start the process, the highest level peak of the spectrum from peak picker 16 in the frequency range 0 to F1MX is recorded as F0AMP. F1MX is the upper limit of the F1 region. Generally the value F0AMP will occur at a peak in the F1 region which will ultimately be chosen as the F1 peak. However, sometimes there is an especially strong peak below F1MN, the lower limit of the F1 region, which is due to the spectrum of the glottal source waveform. In such cases there may or may not be a clearly resolved F1 peak above F1MN. In order to avoid choosing a low level spurious peak, or possibly the F2 peak, for the F1 peak when in fact the F1 peak and the peak due to the source are not resolved, a peak in the F1 region is required to be less than 8.7 db. (1.0 on a natural log scale) below F0AMP to be considered as a possible F1 peak. The frequency of the highest level peak in the F1 region which exceeds this threshold is selected as the first formant, F1. The level of this peak is recorded as F1AMP. If no F1 can be selected this way, the spectral envelope in the region 0 to 900 Hz. is reevaluated. The spectral peaks are sharpened by weighting the cepstrum c(nT), supplied to coder 19 directly from analyzer 12, with a window w1(nT), where

w1(nT) = e^(100nT), (8)

and performing a spectral analysis on the resultant. This has the effect of evaluating the spectrum on a contour which passes closer to the poles. As previously discussed, the CZT algorithm is an efficient way of performing this evaluation. The enhanced section of the spectrum is then searched for the highest level peak in the F1 region. The location of this peak is accepted as F1. If the enhancement has failed to bring about a resolution of the source peak and the F1 peak, F1 is arbitrarily set equal to F1MN, the lower limit of the F1 region.
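The F1 decision of FIG. 6 can be summarized as a short function. This is a sketch of the stated logic only, not the Appendix II program: peaks are assumed given as (frequency, level) pairs with levels in natural-log units, so the 8.7 db. threshold becomes 1.0, and the CZT-enhancement step is represented by an optional second peak list.

```python
F1MN, F1MX = 200.0, 900.0     # F1 region limits per FIG. 8

def pick_f1(peaks, enhanced_peaks=None):
    """Sketch of the F1 logic of FIG. 6.  peaks: (freq_hz, level)
    pairs from the smoothed envelope, levels in natural-log units;
    enhanced_peaks: peaks of the CZT-enhanced 0-900 Hz spectrum,
    used only as the fallback.  Returns (F1, F1AMP)."""
    f0amp = max(lv for f, lv in peaks if f <= F1MX)       # F0AMP
    # A candidate must be less than 8.7 db (1.0 nat. log) below F0AMP.
    cands = [(f, lv) for f, lv in peaks
             if F1MN <= f <= F1MX and lv > f0amp - 1.0]
    if cands:
        f1, f1amp = max(cands, key=lambda p: p[1])
        return f1, f1amp
    if enhanced_peaks:                    # CZT-enhanced spectrum fallback
        f1 = max(enhanced_peaks, key=lambda p: p[1])[0]
        return f1, f0amp - 1.0            # F1AMP = F0AMP - 8.7 db
    return F1MN, f0amp - 1.0              # arbitrary assignment to F1MN
```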

The quantity F1AMP is used in the estimation of F2. If the F1 peak is very low in frequency and is not clearly resolved from the lower frequency peak due to the glottal waveform, F1AMP is set equal to (F0AMP - 8.7 db.). Because F1 is very low, this effectively lowers the threshold used in searching for F2. The first step in estimating F2 is to fix the frequency range to be searched. If F1 has been estimated to be less than F2MN, the lower limit of the F2 region, then only the region from F2MN to F2MX is searched. However, if F1 has been estimated to be greater than F2MN, it is possible that the F2 peak has in fact been chosen as the F1 peak. Therefore the combined F1-F2 region from F1MN to F2MX is searched to ensure that, if this is the case, the F1 peak will be found as the F2 peak. After F2 has been estimated, F1 and F2 are compared and their values are interchanged if F2 is less than F1.

In deciding whether a particular spectral peak under investigation is a possible candidate for an F2 peak, the threshold curve of FIG. 9 is used. The spectral peak is first checked to see if it is located in the proper frequency range. If so, the difference between the level of the peak under consideration and F1AMP is computed. If this difference exceeds the threshold of FIG. 9, that peak is a possible F2 peak; if not, it is eliminated from consideration. The value of F2 is chosen to be the frequency of the highest level peak to exceed the threshold. The level of this peak is recorded as F2AMP.
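The F2 test can be sketched as follows. The FIG. 9 curve itself is not reproduced in this text, so the stand-in below reproduces only its described shape (flat to 500 Hz., falling to a minimum near 1,500 Hz., then rising toward F2MX = 2,700 Hz.); the breakpoint levels, and the F2MN value, are invented for illustration.

```python
def f2_threshold_db(f2_hz):
    """Hypothetical stand-in for the FIG. 9 threshold curve.  Shape
    follows the text; the -5/-20 db breakpoint levels are assumed."""
    if f2_hz <= 500:
        return -5.0
    if f2_hz <= 1500:                              # falling segment
        return -5.0 - 15.0 * (f2_hz - 500) / 1000
    return -20.0 + 15.0 * (f2_hz - 1500) / 1200    # rising toward 2,700 Hz

def pick_f2(peaks, f1amp, f2mn=500.0, f2mx=2700.0):
    """Choose F2 as the highest-level in-range peak whose level,
    relative to F1AMP, exceeds the FIG. 9 threshold; None if no peak
    qualifies (the close-formant case needing CZT enhancement)."""
    cands = [(f, lv) for f, lv in peaks
             if f2mn <= f <= f2mx and (lv - f1amp) > f2_threshold_db(f)]
    return max(cands, key=lambda p: p[1]) if cands else None
```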

If no peaks are found which exceed the threshold, further analysis is called for. The absence of such peaks has been found to be a reliable indication that F1 and F2 are close together. Therefore the cepstrum is multiplied by the weighting function w1(nT) and a high resolution, narrow band spectrum is computed over the frequency range (F1-450) Hz. to (F1+450) Hz. (If F1 < 450 Hz., the range is 0 to 900 Hz.) This spectrum is evaluated along a circular arc of radius e^(-0.0314) in the z-plane. This analysis generally produces a spectrum such as that shown in FIG. 11, in which the two formants F1 and F2 are readily apparent.

The value of F1 is then reassigned as the frequency of the highest level peak in the F1 region, and F2 is taken as the frequency of the next highest peak. If only one peak is found, F1 is arbitrarily set equal to the frequency of that peak and F2 = (F1+200) Hz.

In searching for F3, a threshold on the difference in level between a possible F3 peak and the F2 peak is employed. In this case a fixed, frequency-independent threshold has been found satisfactory. If F2 is located without weighting the cepstrum with the w1(nT) function (i.e., F2 is not extremely low), the threshold on the difference is set at -17.38 db. (-2.0 on a natural log scale). Otherwise, the threshold is effectively removed by setting it at -1,000 db.

The estimation of F3 from the smoothed spectrum is then carried out. Because of equalization, there is a possibility of finding the F3 peak as F2. Thus, F2 is checked to see if it is greater than F3MN, the lower limit of the F3 region. If so, the search for F3 is extended to cover the combined F2-F3 region from F2MN to F3MX. Otherwise the frequency region F3MN to F3MX is searched. As before, a spectral peak is first checked to see if it is in the correct frequency range. Then the difference between the level of the peak being considered for an F3 peak and F2AMP is computed. The highest level peak which exceeds the threshold is chosen as the F3 peak. If no peak is found for F3, further analysis is again called for. It has been found that this situation is generally due to F2 and F3 being very close together. As before, an enhanced spectrum is computed by multiplying the cepstrum by window function w1(nT) and performing a spectrum analysis on the resultant, in this case over the frequency range (F2-450) Hz. to (F2+450) Hz. The result is normally a spectrum similar to that shown in FIG. 11, in which F2 and F3 are clearly resolved. F2 is chosen to be the frequency of the highest peak and F3 to be the frequency of the next highest peak. If only one peak is found, that peak is arbitrarily called the F2 peak and F3 is set to (F2+200) Hz. (This may sometimes result in estimates of both F2 and F3 which are slightly high.) The final step in the process is to compare F2 and F3 and interchange their values if F2 is greater than F3.
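The F3 search, threshold selection, and final interchange can be sketched together. As before, this is a reading of the FIG. 7 logic, not the Appendix II listing; levels are in db., the -17.38 db. and -1,000 db. thresholds come from the text, and the F2MN/F3MN/F3MX values are assumed region limits.

```python
def pick_f3(peaks, f2, f2amp, cepstrum_was_weighted,
            f2mn=500.0, f3mn=1100.0, f3mx=3500.0):
    """Sketch of the F3 logic of FIG. 7.  peaks: (freq_hz, level_db)
    pairs; f2, f2amp: the F2 estimate and its level.  Returns the
    ordered pair (F2, F3), or None when no peak qualifies (the case
    calling for the weighted-cepstrum enhanced analysis)."""
    # Fixed threshold on (level - F2AMP): -17.38 db normally,
    # effectively removed (-1,000 db) if F2 needed cepstrum weighting.
    thr = -1000.0 if cepstrum_was_weighted else -17.38
    # Extend the search into the combined F2-F3 region when F2 > F3MN.
    lo = f2mn if f2 > f3mn else f3mn
    cands = [(f, lv) for f, lv in peaks
             if lo <= f <= f3mx and (lv - f2amp) > thr]
    if not cands:
        return None
    f3 = max(cands, key=lambda p: p[1])[0]
    return (min(f2, f3), max(f2, f3))    # interchange if F2 > F3
```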

The arrangement for estimating the three lowest formant frequencies of voiced speech, i.e., F1, F2, F3, has been found to perform well on vowels, glides, and semivowels. Although no attempt is made to deal with voiced stop consonants or nasal consonants, experience has shown that extremely natural sounding synthetic speech may nevertheless be produced with the limited class of control signals employed in this invention. Advantageously, the control signals may be stored or transmitted with greatly limited channel capacity, thus achieving substantial economies.

Variations and modifications of the system described herein will occur to those skilled in the art.

* * * * *

