Pitch peak detection using linear prediction Patent Grant McCray October 28, 1 [International Business Machines Corporation]

Patent Grant 3916105

U.S. patent number 3,916,105 [Application Number 05/446,847] was granted by the patent office on 1975-10-28 for pitch peak detection using linear prediction. This patent grant is currently assigned to International Business Machines Corporation. Invention is credited to William R. McCray.

United States Patent	3,916,105
McCray	October 28, 1975

Pitch peak detection using linear prediction

Abstract

The application of linear prediction techniques to speech analysis is well covered by the papers referred to below. This case describes a technique to determine the presence or absence of voicing in a digitized speech signal and to locate the glottal impulse positions in that signal when voicing is present.

Inventors:	McCray; William R. (Lexington, KY)
Assignee:	International Business Machines Corporation (Armonk, NY)
Family ID:	26978210
Appl. No.:	05/446,847
Filed:	February 28, 1974

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number	Issue Date
312063	Dec 4, 1972

Current U.S. Class:	704/219; 704/E11.007; 704/214
Current CPC Class:	G10L 25/93 (20130101)
Current International Class:	G10L 11/00 (20060101); G10L 11/06 (20060101); G01L 001/04 ()
Field of Search:	179/1SA,1SD,1SC

References Cited [Referenced By]

U.S. Patent Documents


3624302	November 1971	Atal
3631520	December 1971	Atal

Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Kemeny; E. S.
Attorney, Agent or Firm: Cooper; D. Kendell

Parent Case Text

This is a continuation-in-part of application Ser. No. 312,063, filed Dec. 4, 1972, now abandoned.

Claims

What is claimed is:

1. A method for determining the presence or absence of consistent voicing in speech signals characterized by voice intervals of substantially equally spaced voice pitch periods and unvoiced intervals of irregular unequally spaced unvoiced periods, comprising:

1. predicting speech values based on a weighted sum of a number of preceding samples of said speech signals;

2. generating an error signal having error peaks for a predetermined selected time interval P.sub.L seconds where P.sub.L is the period of the lowest acceptable pitch, said error signal representing the difference between actual speech samples and the corresponding predicted values;

3. analyzing error peaks of said error signal to detect a pitch pattern comprising a predetermined minimum number of substantially equally spaced pitch periods indicative of consistent voicing.

2. The method of claim 1, further comprising:

4. when consistent voicing is detected, providing an output representation of the related interval.

3. The method of claim 1, further comprising:

4. when an unvoiced interval is detected, providing an output representation of said unvoiced interval.

4. The method of claim 1 wherein said predetermined time interval is four (4) P.sub.L and said minimum number of peaks is four, designated Pk.sub.1 - Pk.sub.4.

5. The method of claim 1, further comprising:

5. determining the continuation of consistently voiced speech by comparing the length of a next occurring pitch period in a voiced interval with the length of a previous pitch period.

6. The method of claim 5, further comprising:

6. storing an indication of the occurrence of a voiced interval;

7. analyzing prediction weights for a preceding speech interval in relation to a current speech interval to develop an error signal prediction for bp seconds where b is a constant representative of a partial pitch period to be examined beyond the next expected pitch period ending and where p is the length of the previous pitch period;

8. detecting occurrence of the next pitch period by extracting two local maxima Pk.sub.1 and Pk.sub.2 respectively representative of maximum peaks within and outside of a small region around p seconds; and

9. determining the status of voicing by comparing Pk.sub.1 with (c Pk.sub.2) where c is a constant greater than 1.0.

7. The method of claim 6, further comprising:

10. providing a signal indicative of the continuation of consistent voicing when Pk.sub.1 equals or exceeds c Pk.sub.2.

8. The method of claim 7, further comprising:

11. outputting the current voiced speech interval.

9. The method of claim 6, further comprising:

10. providing a signal indicative of the discontinuance of consistent voicing when Pk.sub.1 does not equal or exceed c Pk.sub.2.

10. The method of claim 9 further comprising:

11. proceeding with steps (1) - (3) to detect the next voiced interval.

11. The method of claim 1, further comprising the following steps between steps (2) and (3):

2a. determining if the first peak of said predetermined minimum number is prior to P.sub.L where P.sub.L is the lowest pitch of interest; and

2b. if not prior, storing an indication that the speech signal interval is unvoiced and is not consistent voicing; and

2c. if prior, proceeding with step (3).

12. The method of claim 11, further comprising the following steps after step (3):

3a. determining and discarding the smallest peak Pk.sub.s from among said predetermined minimum number of peaks;

3b. scanning said error signal from the most recently formed peak to the next error peak Pk.sub.n ;

3c. if end of predicted P.sub.L seconds occurs, prior to next error peak Pk.sub.n, outputting a record for P.sub.L /2 seconds;

3d. if next error peak Pk.sub.n is formed prior to P.sub.L seconds, comparing its value to the value of the peak Pk.sub.s discarded in step (3a);

3e. if Pk.sub.n is larger than Pk.sub.s, establish Pk.sub.n as new last peak of said minimum number and repeat steps 2a-2c; and

3f. if Pk.sub.n is smaller than Pk.sub.s, repeat steps 3b-3d.

13. Apparatus for determining the presence or absence of consistent voicing in speech signals characterized by voiced intervals of substantially equally spaced voice pitch periods and unvoiced intervals of irregular unequally spaced unvoiced periods, comprising:

1. means for predicting speech values based on a weighted sum of a number of preceding samples of said speech signals;

2. means for generating an error signal having error peaks for a predetermined selected time interval P.sub.L seconds where P.sub.L is the period of the lowest acceptable pitch, said error signal representing the difference between actual speech samples and the corresponding predicted values; and

3. means for analyzing error peaks of said error signal to detect a pitch pattern comprising a predetermined minimum number of substantially equally spaced pitch periods indicative of consistent voicing.

14. The apparatus of claim 13, further comprising:

4. means operable when consistent voicing is detected for providing an output representation of the related voiced interval.

15. The apparatus of claim 13, further comprising:

4. means operable when an unvoiced interval is detected for providing an output representation of said unvoiced interval.

16. The apparatus of claim 13, further comprising:

5. means for determining the continuation of consistently voiced speech by comparing the length of a next occurring pitch period in a voiced interval with the length of a previous pitch period.

17. The apparatus of claim 16, further comprising:

6. means for storing an indication of the occurrence of a voiced interval;

7. means for analyzing prediction weights for a preceding speech interval in relation to a current speech interval to develop an error signal prediction for bp seconds where b is a constant representative of a partial pitch period to be examined beyond the next expected pitch period ending and where p is the length of the previous pitch period;

8. means for detecting occurrence of the next pitch period by extracting two local maxima Pk.sub.1 and Pk.sub.2 respectively representative of maximum peaks within and outside of a small region around p seconds; and

9. means for determining the status of voicing by comparing Pk.sub.1 with (c Pk.sub.2) where c is a constant greater than 1.0.

18. The apparatus of claim 17, further comprising:

10. means for providing a signal indicative of the continuation of consistent voicing when Pk.sub.1 equals or exceeds c Pk.sub.2.

19. The apparatus of claim 18, further comprising:

11. gating means for providing prediction weights, voiced/unvoiced status, and interval lengths of speech intervals following calculations.

20. The apparatus of claim 18, further comprising:

10. means for providing a signal indicative of the discontinuance of consistent voicing when Pk.sub.1 does not equal or exceed c Pk.sub.2.

Description

REFERENCES OF INTEREST

B. S. Atal and M. R. Schroeder, "Adaptive Predictive Coding of Speech Signals," Bell System Technical Journal, 49, 1973-1986 (1970).

B. S. Atal, "Characterization of Speech Signals by Linear Prediction of the Speech Wave," Proc. IEEE Symposium on Feature Extraction and Selection in Pattern Recognition, Argonne, Ill. (Oct. 1970), pp. 202-209.

B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," Journal of the Acoustical Society of America, Vol. 50, Number 2 (Part 2), pp. 637-655 (1971).

U.S. Pat. No. 3,631,520, "Predictive Coding of Speech Signals," B. S. Atal.

U.S. Pat. No. 3,624,302, "Speech Analysis and Synthesis by the Use of the Linear Prediction of a Speech Wave," B. S. Atal.

BACKGROUND OF THE INVENTION AND PRIOR ART

In linear prediction, each sample of speech is predicted as the weighted sum of a number of preceding samples. The difference between the actual speech sample and the predicted sample is the prediction error. Atal shows in his papers that a distinct maximum of the error signal occurs at the beginning of a pitch period. The technique to be disclosed utilizes this fact.
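The prediction error described above can be sketched in a few lines of Python. This is only an illustration: the function name `prediction_error`, the sign convention (actual minus predicted) and the toy signal are assumptions, and the weights are simply given rather than calculated as Atal teaches.

```python
def prediction_error(samples, weights):
    """Return e[n] = s[n] - sum_k weights[k] * s[n-1-k] for n >= len(weights).

    weights[0] applies to the immediately preceding sample (assumed ordering).
    """
    order = len(weights)
    errors = []
    for n in range(order, len(samples)):
        predicted = sum(weights[k] * samples[n - 1 - k] for k in range(order))
        errors.append(samples[n] - predicted)
    return errors


# Toy signal that follows s[n] = 0.5 * s[n-1], except for an added
# impulse at n = 3 standing in for a glottal excitation.
signal = [1.0, 0.5, 0.25, 1.125, 0.5625]
print(prediction_error(signal, [0.5]))  # [0.0, 0.0, 1.0, 0.0]
```

The error is zero wherever the signal follows the predictor and jumps at the impulse, which is the kind of distinct maximum the technique looks for.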

The referenced papers and patents describe speech analysis by linear prediction.

SUMMARY OF THE INVENTION

A region of consistently-voiced speech is characterized by having pitch periods of approximately equal length. Thus, such a region may be discovered by locating a pattern of regularly spaced, large prediction errors, and within such a region it is only necessary to compare the length of the next pitch period with the length of the previous pitch period to determine if consistent voicing has ceased or continues.

OBJECTS

Accordingly, the prime object of the present invention is to provide a speech analysis system based on linear prediction and having improved efficiency.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of the preferred embodiment of the invention as illustrated in the accompanying drawings.

DRAWINGS

IN THE DRAWINGS

FIG. 1 associates the prediction weight generator, previously taught by Atal, with a prediction interval analyzer that is significant in practicing the present invention.

FIG. 2 is a block diagram of a system incorporating the speech analysis techniques of the present invention.

FIG. 3 is a flow chart related to the system of FIG. 2.

FIG. 4 is a detailed representation of the system.

DETAILED DESCRIPTION

FIG. 1 is a simplified diagram of a system incorporating the inventive techniques taught herein. Sampled speech is considered to be available on line 1 for input to a prediction interval analyzer 2 and a prediction weight generator 3. Data and control are symbolized by line 4, with analyzed speech output on line 5.

Block 2 of FIG. 1 is particularly expanded upon in FIG. 2. As indicated, consistently-voiced speech is characterized by having pitch periods of approximately equal length. In accordance with the present technique, the length of a succeeding pitch period is compared with the length of the previous pitch period to determine whether consistent voicing has ceased or continues.

For the sake of consistency, the blocks in the flow chart of FIG. 3 are designated with letters in parentheses (A) through (S), and where possible, corresponding letters in parentheses are incorporated in the blocks of FIG. 2. Thus, the hardware represented by block 7, FIG. 2, represents the decision block (A) in FIG. 3.

Other blocks shown in FIG. 2 include an error signal generator 8, a speech storage register 9, a next pitch detector 10, a prediction weight and error signal generator 11, a pitch pattern detector 12, an error storage register 13, and a voiced-unvoiced memory 14. The various blocks in the flow chart of FIG. 3 are designated 20-37.

Considering FIG. 2, first, the status of the voiced-unvoiced memory 14 is checked by block 7 to determine the character of the previous voice segment. If the segment is voiced then a decision is made to generate an error signal by generator circuit 8. Errors are stored in the error storage register 13. An input from generator 3 is provided at terminal 15 indicative of the weights calculated for the previous pitch period. An error signal is generated for a predetermined "bp" seconds. This is stored in register 13 and serves as an input by line 16 to block 10. Block 10, related to blocks 22 and 23 in FIG. 3, determines two maxima and makes a decision as will be discussed in connection with block 23, FIG. 3, as to whether voicing has ended or a change in pitch occurred. If this has not occurred, a set of predictor weights is calculated and an output record written as determined by a control signal on line 17.

If the end of voicing or a change in pitch has occurred, then the routine proceeds by control on line 18 to block 11. It is noted that if an unvoiced segment was determined by block 7 then a control signal so indicates on line 19 directly to block 11 for processing of the speech segment.

In any case, additional determinations are made by the pitch pattern detector block 12 corresponding to blocks 26-37, FIG. 3. This primarily has to do with the detection of a speech pattern and the control of memory 14 to a voiced or unvoiced state. Various output situations are represented by control on line 6 which indicates an interval of speech weighted in order to be written as an output record.

FLOW CHART OF FIG. 3

As indicated, FIG. 3 illustrates a flow chart for carrying out the present invention. A decision is made at block 20 as to whether the previous segment was voiced or unvoiced. If voiced, the routine proceeds to block 21, if not, it proceeds to block 25.

BLOCK 21

Using the predictor weights calculated for the previous pitch period and beginning at the end of that period, the speech waveform is predicted and the error signal generated for bp seconds, where p is the previous pitch period length and b determines the partial period to be examined beyond the next expected pitch period ending. Obviously, b must be between 1.0 and 2.0, so that the expected time of the next error signal peak is included in the interval, but so that the second succeeding peak is excluded.

BLOCK 22

The peaks (local maxima) of the error signal are scanned out to bp seconds and two maxima are obtained, the maximum peak (Pk.sub.1) within a small region around p seconds and the maximum peak (Pk.sub.2) outside this small region.

BLOCK 23

If Pk.sub.1 does not exceed Pk.sub.2 by a significant amount (Pk.sub.1 less than c Pk.sub.2, where c is a constant greater than 1.0), either voicing has ended or a significant change in pitch has occurred. In either case, the region of consistent voicing has ended, and the procedure must be abandoned. Block 25 is executed next.

Otherwise, the location of Pk.sub.1 is taken as the end of the pitch period. A set of predictor weights is calculated over the period between the two pitch period endings, and an output record written, block 24. The process is then repeated from block 20.
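The continuation test of blocks 21 through 24 might be sketched as follows. This is a minimal illustration under several assumptions: times are expressed in samples rather than seconds, `error` holds prediction-error magnitudes indexed by offset from the previous pitch period ending (offset 0, the previous peak itself, is skipped), and the values chosen for `b`, `c` and `half_width` (the "small region" around p) are hypothetical, since the text fixes only their ranges.

```python
def voicing_continues(error, p, b=1.5, c=1.5, half_width=2):
    """Return (still_voiced, location_of_Pk1).

    error      -- error magnitudes; error[i] is i samples past the last period ending
    p          -- previous pitch period length in samples
    b          -- scan extends to b*p samples (1.0 < b < 2.0)
    c          -- Pk1 must reach c*Pk2 for voicing to continue (c > 1.0)
    half_width -- half-size of the small region around p (assumed value)
    """
    limit = min(len(error), int(b * p) + 1)
    lo = max(1, p - half_width)
    hi = min(limit, p + half_width + 1)
    if lo >= hi:                       # not enough predicted signal to test
        return False, None
    # Pk1: largest error inside the small region around p (block 22).
    loc1 = max(range(lo, hi), key=lambda i: error[i])
    pk1 = error[loc1]
    # Pk2: largest error outside that region, up to b*p.
    pk2 = max((error[i] for i in range(1, limit) if not lo <= i < hi),
              default=0.0)
    # Block 23: voicing continues only if Pk1 equals or exceeds c*Pk2.
    return pk1 >= c * pk2, loc1


err = [0.0] * 16
err[10], err[4] = 1.0, 0.3
print(voicing_continues(err, p=10))  # (True, 10)
```

When the test succeeds, the returned location plays the role of the new pitch period ending over which the next set of predictor weights would be calculated.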

When a consistently-voiced region of speech occurs, it contains a significant number of pitch periods (more than three). This fact is utilized in discovering the beginning of such a region. The error signal is scanned for a sufficient time in an attempt to discover four error peaks with nearly constant spacing between them. The following steps are taken. In this discussion, P.sub.L and P.sub.H are the periods of the lowest and highest pitches of interest.

BLOCK 25

Predictor weights are calculated, the speech waveform is predicted, and the error signal is generated over 4P.sub.L.

BLOCK 26

The peaks of the error signal are scanned beginning P.sub.H into the region of the waveform being analyzed. The first four peaks encountered are collected.

BLOCKS 30 AND 33

If the first collected peak is found beyond P.sub.L, consistent voicing has not been found in this region of speech. A set of predictor weights is calculated over a period equal to P.sub.L /2 and an output record written, at block 28 after setting memory 14 to an "unvoiced" state at block 31. The process is then repeated from block 20.

Otherwise, the four collected peaks are analyzed to determine if a pitch pattern exists, block 33. If the periods between adjacent peaks are approximately equal and each is not less than P.sub.H, such a pattern has been found. The collected peaks are assumed to be pitch period endings, and a region of consistent voicing has been found beginning at the first peak. A set of predictor weights is calculated up to the location of the first peak, and an output record written, at block 37, after setting memory 14 to a "voiced" state at block 34. Block 20 is executed next.
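The tests of blocks 30 and 33 can be written down compactly. The sketch below assumes peak times measured in samples from the start of the analyzed region; the name `is_pitch_pattern` and the 15 percent `tolerance` used to decide "approximately equal" are illustrative choices, since the text does not give a numerical criterion.

```python
def is_pitch_pattern(peak_times, p_low, p_high, tolerance=0.15):
    """Check whether four collected peak times form a pitch pattern.

    p_low  -- period of the lowest pitch of interest (the longest period)
    p_high -- period of the highest pitch of interest (the shortest period)
    """
    # Block 30: the first peak must not lie beyond P_L.
    if peak_times[0] > p_low:
        return False
    intervals = [later - earlier
                 for earlier, later in zip(peak_times, peak_times[1:])]
    # Block 33: no interval may be shorter than P_H ...
    if min(intervals) < p_high:
        return False
    # ... and the intervals must be approximately equal (assumed tolerance).
    return max(intervals) - min(intervals) <= tolerance * min(intervals)


print(is_pitch_pattern([50, 130, 211, 290], p_low=160, p_high=20))  # True
print(is_pitch_pattern([50, 130, 170, 290], p_low=160, p_high=20))  # False
```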

BLOCK 36

If a pitch pattern is not found, the smallest of the four collected peaks is discarded, block 36.

BLOCK 35

The error signal is scanned from the location of the most recently found peak to find the next error peak.

BLOCK 32

If the end of predicted speech (4P.sub.L) is found prior to the next error peak, this region of speech does not contain consistent voicing. Blocks 31 and 28 are executed next.

BLOCK 29

If a peak is found prior to the end of predicted speech, it is compared to the value of the peak discarded in block 36. If the new one is not larger, it is rejected and block 35 is repeated.

BLOCK 27

If the new peak is larger than the one discarded in block 36, it is taken as the new fourth peak, block 27. The pitch pattern recognition process is then repeated from block 30.
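Taken together, blocks 25 through 36 describe a search loop, which might be sketched as below. The sketch makes several assumptions: `error` already holds prediction-error magnitudes over the 4P.sub.L region, times are in samples, a peak is a sample larger than its left neighbor and not smaller than its right neighbor, and the equality `tolerance` is a hypothetical value. The function name `find_voicing_onset` is likewise illustrative.

```python
def find_voicing_onset(error, p_low, p_high, tolerance=0.15):
    """Return the four peak times of a pitch pattern, or None if none is found."""
    def equally_spaced(times):
        gaps = [b - a for a, b in zip(times, times[1:])]
        return (min(gaps) >= p_high
                and max(gaps) - min(gaps) <= tolerance * min(gaps))

    end = min(len(error) - 1, 4 * p_low)
    peaks = []        # collected (time, value) pairs in chronological order
    rejected = 0.0    # value of the peak most recently discarded (block 36)
    # Block 26: scanning begins P_H into the region.
    for n in range(max(p_high, 1), end):
        is_peak = error[n - 1] < error[n] >= error[n + 1]
        # Block 29: a new peak must be larger than the discarded one.
        if not (is_peak and error[n] > rejected):
            continue
        peaks.append((n, error[n]))                  # block 27
        if len(peaks) < 4:
            continue
        times = [t for t, _ in peaks]
        if times[0] > p_low:                         # block 30
            return None
        if equally_spaced(times):                    # block 33
            return times
        smallest = min(peaks, key=lambda tv: tv[1])  # block 36
        rejected = smallest[1]
        peaks.remove(smallest)
    return None       # block 32: end of predicted speech reached


err = [0.0] * 161
for t in (10, 30, 50, 70):
    err[t] = 1.0
err[20] = 0.4         # a spurious small peak between two pitch peaks
print(find_voicing_onset(err, p_low=40, p_high=5))  # [10, 30, 50, 70]
```

In the example, the spurious peak at time 20 is collected first, discarded as the smallest of four, and replaced by the peak at time 70, after which the spacing test succeeds.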

DETAILED SYSTEM, FIG. 4

A detailed implementation of the system is illustrated in FIG. 4.

Control Network 50 controls the timing and sequence of operation of the other portions of the system. Its outputs to the other blocks are represented by cable 51 to avoid undue complication of the diagram.

Sampled speech is inputted on line 52 and is stored in Speech storage 65 until its processing by the system is complete.

Voiced-Unvoiced storage 53 indicates whether the previously analyzed segment of speech was voiced or unvoiced. To begin a cycle of operation, block 50 determines from block 53 by line 54 the status of the previous segment.

If the previous segment was "voiced", Control Network 50 obtains the length of that segment p from Segment Length Storage 56 on line 57. This is multiplied by a constant factor b (between 1.0 and 2.0) to determine the length of speech bp to be evaluated in deciding whether voicing continues.

The prediction weights of the previous segment are moved from storage block 60 to Adaptive Predictor 61 on line 62. This predictor, as described by Atal, uses these weights and the speech samples from Speech storage 65 on line 66 to predict subsequent speech samples.

Predictor 61 will operate on an interval of speech of length bp producing predicted speech samples which are conducted to Subtraction Network 68 by line 69. Here the original speech samples from line 66 are subtracted from the predicted samples to produce the prediction error samples on line 70.

Line 70 carries the error samples to two peak Pickers 72 and 73. Picker 72 is controlled to scan the error samples in a small interval around p. Picker 73 is turned on both before and after Picker 72, scanning the error samples throughout bp (the time of the speech predicted by block 61) except when Picker 72 is on. Thus, Picker 72 selects the largest error sample within a small interval around p and Picker 73 selects the largest sample outside this interval.

Consistent voicing is assumed to continue if the error peak found by Picker 72 is significantly greater than the one found by Picker 73. To determine this, the output of Picker 73 is transferred to Multiplication Network 76 by line 77 where it is multiplied by a constant greater than one (1). The result of this multiplication is presented to Comparison Network 80 on line 81 where it is compared to the peak found by Picker 72 on line 82.

Control Network 50 determines via line 84 the results of the comparison in block 80. If the output of Picker 72 is greater, then the exact location of the error peak in the speech interval is stored in Segment Length storage 56 via line 86. Prediction Weight Generator 88 (as described by Atal) uses the length of the speech segment from Segment Length storage 56 on line 89 and the speech samples from Speech storage 65 on line 90 to analyze the speech segment and transfers the results via line 91 to Storage block 60. The output gate 93 is opened to allow the contents of storage blocks 53, 56 and 60 to be outputted on line 95.

If, however, the result of the comparison showed that the output of Multiplication Network 76 was greater, then Control Network 50 would set the Voiced-Unvoiced storage 53 to unvoiced. Subsequent operation will then be identical to that which would have occurred had the previous segment of speech been unvoiced when the control cycle was initiated.

In the unvoiced case, Prediction Weight Generator 88 is controlled to calculate a set of weights over a portion of speech representing a time of at least 4 P.sub.L, where P.sub.L is the pitch period of the lowest pitch frequency of interest. The calculated weights are stored in block 60 and used in Adaptive Predictor 61 to predict speech. Predictor 61 and Subtraction Network 68 then operate to produce the prediction error signal.

Rejected Peak Value Storage 100 is set initially to zero. The error signal enters a three-stage shift register 101, which presents to a Comparison Network 103 the most recent three error values via lines 104, 105 and 106, while storage block 100 presents its present value via line 107. Each stage of register 101 is capable of storing enough bits to represent the full value of the error signal. When Comparison Network 103 detects that the value on line 105 is greater than the other three, then a local maximum of the error signal has been found. This maximum value is transferred to Peak Value Storage 110 via line 111 while the time location of the maximum is transferred to Peak Time Storage 112 by line 113.
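The behavior of shift register 101 and Comparison Network 103 can be imitated in software with a three-element window. This is a sketch only: the generator name `detect_peaks` is hypothetical, and sample indices stand in for the time locations that the hardware transfers to Peak Time Storage 112.

```python
from collections import deque


def detect_peaks(error_samples, rejected_value=0.0):
    """Yield (time, value) for each local maximum exceeding rejected_value.

    The middle of the three most recent samples is a maximum when it is
    greater than both neighbors and greater than the stored rejected value.
    """
    window = deque(maxlen=3)           # plays the role of shift register 101
    for n, sample in enumerate(error_samples):
        window.append(sample)
        if len(window) < 3:
            continue
        oldest, middle, newest = window
        if middle > oldest and middle > newest and middle > rejected_value:
            yield n - 1, middle        # the middle sample arrived one step ago


samples = [0, 1, 0, 3, 2, 5, 1]
print(list(detect_peaks(samples)))                    # [(1, 1), (3, 3), (5, 5)]
print(list(detect_peaks(samples, rejected_value=2)))  # [(3, 3), (5, 5)]
```

With the rejected value at its initial zero, every positive local maximum is reported; once a rejected peak value is stored, smaller maxima are passed over.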

When four such maxima have been placed in storage blocks 110 and 112, the stored times are gated onto lines 115, 116, 117 and 118. Comparison Network 120 compares the time of the first peak with P.sub.L. If the time on line 115 does not exceed P.sub.L then Control Network 50 is signalled on line 121 to continue the checks. Subtractor Networks 123, 124 and 125 produce the intervals between adjacent peaks which are presented to Comparison Network 127 by lines 128, 129, and 130. Network 127 checks these three values for approximate equality and verifies that each is less than P.sub.L and greater than P.sub.H, the pitch period of the highest pitch frequency of interest.

If all the requirements are met, control network 50 is signalled on line 131 that voicing has been found beginning at the time on line 115. The time on line 115 is placed in Segment Length Storage 56 (connection not shown) and generator 88 calculates a set of weights over this segment of unvoiced speech. The weights are stored in storage 60 and Output Gate 93 is opened to output a record. Subsequently, the Voiced-Unvoiced Storage 53 is changed to "voiced", the pitch interval on line 128 is placed in the Segment Length Storage 56 and a new cycle is initiated.

If, however, one or more of the requirements for voicing are absent, the peak values are gated to Comparison Network 132 via lines 135, 136, 137 and 138. Network 132 determines the least of the four, transfers this value by line 140 to Rejected Peak Value Storage 100, and signals Control Network 50 by line 141 which value is the least. Network 50 causes the least peak and its time to be removed from storage units 110 and 112, and the remaining peaks and times to be moved in storage to maintain chronological order and to leave position four vacant for a new error maximum. Shift Register 101 and Comparison Network 103 operate to locate error signal maxima as before, but now, since Storage block 100 has a non-zero value stored in it, the maximum selected by Network 103 must exceed the value of the rejected peak in block 100. A selected maximum and its time will be gated into blocks 110 and 112 and the aforementioned tests performed.

This process continues until voicing is found, or until one or more limits occur. These limits are:

1. that the location of the first peak in storage (110, 112) on line 115 exceeds P.sub.L,

2. that one or more of the peak intervals on lines 128, 129 and 130 exceeds P.sub.L, or

3. that Adaptive Predictor 61 has predicted a portion of speech of 4P.sub.L.

When one of these limits is exceeded, the process is discontinued. Control Network 50 places a fixed length (10 milliseconds) in Segment Length Storage 56, causes the Prediction Weight Generator 88 to generate weights over this period, which are stored in block 60, and opens Output Gate 93 to output a record. A new cycle of operation is then initiated.
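The three limits listed above can be expressed as a single check. In this sketch the function name `limit_reached`, its return codes, and the use of samples as the time unit are assumptions; only the conditions themselves and the 10 millisecond fallback length come from the text.

```python
FALLBACK_SEGMENT_MS = 10  # fixed unvoiced segment length stated in the text


def limit_reached(first_peak_time, peak_intervals, predicted_length, p_low):
    """Return the number (1, 2 or 3) of the first limit met, or 0 if none."""
    if first_peak_time > p_low:                  # limit 1
        return 1
    if any(interval > p_low for interval in peak_intervals):  # limit 2
        return 2
    if predicted_length >= 4 * p_low:            # limit 3
        return 3
    return 0


print(limit_reached(50, [80, 170, 80], 300, p_low=160))  # 2
```

A nonzero result corresponds to abandoning the search and writing an unvoiced record of the fallback length.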

While the invention has been particularly shown and described with respect to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.

* * * * *