U.S. patent number 3,916,105 [Application Number 05/446,847] was granted by the patent office on 1975-10-28 for pitch peak detection using linear prediction.
This patent grant is currently assigned to International Business Machines Corporation. Invention is credited to William R. McCray.
United States Patent 3,916,105
McCray
October 28, 1975
Pitch peak detection using linear prediction
Abstract
The application of linear prediction techniques to speech
analysis is well covered by the papers referred to below. This case
describes a technique to determine the presence or absence of
voicing in a digitized speech signal and to locate the glottal
impulse positions in that signal when voicing is present.
Inventors: McCray; William R. (Lexington, KY)
Assignee: International Business Machines Corporation (Armonk, NY)
Family ID: 26978210
Appl. No.: 05/446,847
Filed: February 28, 1974
Related U.S. Patent Documents
Application Number: 312063
Filing Date: Dec 4, 1972
Current U.S. Class: 704/219; 704/E11.007; 704/214
Current CPC Class: G10L 25/93 (20130101)
Current International Class: G10L 11/00 (20060101); G10L 11/06 (20060101); G01L 001/04 ()
Field of Search: 179/1SA,1SD,1SC
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Kemeny; E. S.
Attorney, Agent or Firm: Cooper; D. Kendell
Parent Case Text
This is a continuation-in-part of application Ser. No. 312,063,
filed Dec. 4, 1972, now abandoned.
Claims
What is claimed is:
1. A method for determining the presence or absence of consistent
voicing in speech signals characterized by voice intervals of
substantially equally spaced voice pitch periods and unvoiced
intervals of irregular unequally spaced unvoiced periods,
comprising:
1. predicting speech values based on a weighted sum of a number of
preceding samples of said speech signals;
2. generating an error signal having error peaks for a
predetermined selected time interval P.sub.L seconds where P.sub.L
is the period of the lowest acceptable pitch, said error signal
representing the difference between actual speech samples and the
corresponding predicted values;
3. analyzing error peaks of said error signal to detect a pitch
pattern comprising a predetermined minimum number of substantially
equally spaced pitch periods indicative of consistent voicing.
2. The method of claim 1, further comprising:
4. when consistent voicing is detected, providing an output
representation of the related interval.
3. The method of claim 1, further comprising:
4. when an unvoiced interval is detected, providing an output
representation of said unvoiced interval.
4. The method of claim 1 wherein said predetermined time interval
is four (4) P.sub.L and said minimum number of peaks is four,
designated Pk.sub.1 - Pk.sub.4.
5. The method of claim 1, further comprising:
5. determining the continuation of consistently voiced speech by
comparing the length of a next occurring pitch period in a voiced
interval with the length of a previous pitch period.
6. The method of claim 5, further comprising:
6. storing an indication of the occurrence of a voiced
interval;
7. analyzing prediction weights for a preceding speech interval in
relation to a current speech interval to develop an error signal
prediction for bp seconds where b is a constant representative of a
partial pitch period to be examined beyond the next expected pitch
period ending and where p is the length of the previous pitch
period;
8. detecting occurrence of the next pitch period by extracting two
local maxima Pk.sub.1 and Pk.sub.2 respectively representative of
maximum peaks within and outside of a small region around p
seconds; and
9. determining the status of voicing by comparing Pk.sub.1 with (c
Pk.sub.2) where c is a constant greater than 1.0.
7. The method of claim 6, further comprising:
10. providing a signal indicative of the continuation of consistent
voicing when Pk.sub.1 equals or exceeds c Pk.sub.2.
8. The method of claim 7, further comprising:
11. outputting the current voiced speech interval.
9. The method of claim 6, further comprising:
10. providing a signal indicative of the discontinuance of
consistent voicing when Pk.sub.1 does not equal or exceed c
Pk.sub.2.
10. The method of claim 9 further comprising:
11. proceeding with steps (1) - (3) to detect the next voiced
interval.
11. The method of claim 1, further comprising the following steps
between steps (2) and (3):
2a. determining if the first peak of said predetermined minimum
number is prior to P.sub.L where P.sub.L is the lowest pitch of
interest; and
2b. if not prior, storing an indication that the speech signal
interval is unvoiced and is not consistent voicing; and
2c. if prior, proceeding with step (3).
12. The method of claim 11, further comprising the following steps
after step (3):
3a. determining and discarding the smallest peak Pk.sub.s from
among said predetermined minimum number of peaks;
3b. scanning said error signal from the most recently formed peak
to the next error peak Pk.sub.n ;
3c. if end of predicted P.sub.L seconds occurs, prior to next error
peak Pk.sub.n, outputting a record for P.sub.L /2 seconds;
3d. if next error peak Pk.sub.n is formed prior to P.sub.L seconds,
comparing its value to the value of the peak Pk.sub.s discarded in
step (3a);
3e. if Pk.sub.n is larger than Pk.sub.s, establish Pk.sub.n as new
last peak of said minimum number and repeat steps 2a-2c; and
3f. if Pk.sub.n is smaller than Pk.sub.s, repeat steps 3b-3d.
13. Apparatus for determining the presence or absence of consistent
voicing in speech signals characterized by voiced intervals of
substantially equally spaced voice pitch periods and unvoiced
intervals of irregular unequally spaced unvoiced periods,
comprising:
1. means for predicting speech values based on a weighted sum of a
number of preceding samples of said speech signals;
2. means for generating an error signal having error peaks for a
predetermined selected time interval P.sub.L seconds where P.sub.L
is the period of the lowest acceptable pitch, said error signal
representing the difference between actual speech samples and the
corresponding predicted values; and
3. means for analyzing error peaks of said error signal to detect a
pitch pattern comprising a predetermined minimum number of
substantially equally spaced pitch periods indicative of consistent
voicing.
14. The apparatus of claim 13, further comprising:
4. means operable when consistent voicing is detected for providing
an output representation of the related voiced interval.
15. The apparatus of claim 13, further comprising:
4. means operable when an unvoiced interval is detected for
providing an output representation of said unvoiced interval.
16. The apparatus of claim 13, further comprising:
5. means for determining the continuation of consistently voiced
speech by comparing the length of a next occurring pitch period in
a voiced interval with the length of a previous pitch period.
17. The apparatus of claim 16, further comprising:
6. means for storing an indication of the occurrence of a voiced
interval;
7. means for analyzing prediction weights for a preceding speech
interval in relation to a current speech interval to develop an
error signal prediction for bp seconds where b is a constant
representative of a partial pitch period to be examined beyond the
next expected pitch period ending and where p is the length of the
previous pitch period;
8. means for detecting occurrence of the next pitch period by
extracting two local maxima Pk.sub.1 and Pk.sub.2 respectively
representative of maximum peaks within and outside of a small
region around p seconds; and
9. means for determining the status of voicing by comparing
Pk.sub.1 with (c Pk.sub.2) where c is a constant greater than
1.0.
18. The apparatus of claim 17, further comprising:
10. means for providing a signal indicative of the continuation of
consistent voicing when Pk.sub.1 equals or exceeds c Pk.sub.2.
19. The apparatus of claim 18, further comprising:
11. gating means for providing prediction weights, voiced/unvoiced
status, and interval lengths of speech intervals following
calculations.
20. The apparatus of claim 18, further comprising:
10. means for providing a signal indicative of the discontinuance
of consistent voicing when Pk.sub.1 does not equal or exceed c
Pk.sub.2.
Description
REFERENCES OF INTEREST
B. S. Atal and M. R. Schroeder, "Adaptive Predictive Coding of
Speech Signals," Bell System Technical Journal, 49, 1973-1986
(1970).
B. S. Atal, "Characterization of Speech Signals by Linear
Prediction of the Speech Wave," Proc. IEEE Symposium on Feature
Extraction and Selection in Pattern Recognition, Argonne, Ill.
(Oct. 1970), pp. 202-209.
B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis
by Linear Prediction of the Speech Wave," Journal of the Acoustical
Society of America, Vol. 50, Number 2 (Part 2), pp. 637-655
(1971).
U.S. Pat. No. 3,631,520, "Predictive Coding of Speech Signals," B.
S. Atal.
U.S. Pat. No. 3,624,302, "Speech Analysis and Synthesis by the Use
of the Linear Prediction of a Speech Wave," B. S. Atal.
BACKGROUND OF THE INVENTION AND PRIOR ART
In linear prediction, each sample of speech is predicted as the
weighted sum of a number of preceding samples. The difference
between the actual speech sample and the predicted sample is the
prediction error. Atal shows in his papers that a distinct maximum
of the error signal occurs at the beginning of a pitch period. The
technique to be disclosed utilizes this fact.
The referenced papers and patents describe speech analysis by
linear prediction.
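As a minimal sketch of the prediction error described above (not the
circuitry of this patent, nor Atal's exact formulation), the following
Python function predicts each sample as a weighted sum of the
preceding samples and returns the error. The function name, the
ordering of the weights (most recent sample first), and the sign
convention (actual minus predicted) are assumptions made for
illustration only.

```python
def prediction_error(samples, weights):
    """Return e[n] = s[n] - sum_k weights[k] * s[n-1-k].

    Hypothetical helper: weights[0] applies to the most recent
    preceding sample. Errors are produced only for n >= len(weights),
    where a full history of preceding samples is available.
    """
    order = len(weights)
    errors = []
    for n in range(order, len(samples)):
        # Weighted sum of the `order` samples that precede sample n.
        predicted = sum(weights[k] * samples[n - 1 - k] for k in range(order))
        errors.append(samples[n] - predicted)
    return errors
```

With a single weight of 2.0, a sequence that doubles at every step is
predicted exactly, and a sample that departs from the doubling shows
up as a nonzero error at that position.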
SUMMARY OF THE INVENTION
A region of consistently-voiced speech is characterized by having
pitch periods of approximately equal length. Thus, such a region
may be discovered by locating a pattern of regularly spaced, large
prediction errors, and within such a region it is only necessary to
compare the length of the next pitch period with the length of the
previous pitch period to determine if consistent voicing has ceased
or continues.
OBJECTS
Accordingly, the prime object of the present invention is to
provide a speech analysis system based on linear prediction and
having improved efficiency.
The foregoing and other objects, features and advantages of the
invention will be apparent from the following more particular
description of the preferred embodiment of the invention as
illustrated in the accompanying drawings.
DRAWINGS
IN THE DRAWINGS
FIG. 1 associates the prediction weight generator, previously
taught by Atal, and a prediction interval analyzer that is
significant in practicing the present invention.
FIG. 2 is a block diagram of a system incorporating the speech
analysis techniques of the present invention.
FIG. 3 is a flow chart related to the system of FIG. 2.
FIG. 4 is a detailed representation of the system.
DETAILED DESCRIPTION
FIG. 1 is a simplified diagram of a system incorporating the
inventive techniques taught herein. Sample speech is considered to
be available on line 1 for input to a prediction interval analyzer
2, and a prediction weight generator 3. Data and control are
symbolized by line 4, with analyzed speech output on line 5.
Block 2 of FIG. 1 is particularly expanded upon in FIG. 2. As
indicated, consistently-voiced speech is characterized by having
pitch periods of approximately equal length. In accordance with the
present technique, the length of a succeeding pitch period is
compared with the length of the previous pitch period to determine
if consistent voicing has ceased or continues.
For the sake of consistency, the blocks in the flow chart of FIG. 3
are designated with letters in parentheses (A) through (S), and
where possible, corresponding letters in parentheses are
incorporated in the blocks of FIG. 2. Thus, the hardware
represented by block 7, FIG. 2, represents the decision block (A)
in FIG. 3.
Other blocks shown in FIG. 2 include an error signal generator 8, a
speech storage register 9, a next pitch detector 10, a prediction
weight and error signal generator 11, a pitch pattern detector 12,
an error storage register 13, and a voiced-unvoiced memory 14. The
various blocks in the flow chart of FIG. 3 are designated
20-37.
Considering FIG. 2, first, the status of the voiced-unvoiced memory
14 is checked by block 7 to determine the character of the previous
voice segment. If the segment is voiced then a decision is made to
generate an error signal by generator circuit 8. Errors are stored
in the error storage register 13. An input from generator 3 is
provided at terminal 15 indicative of the weights calculated for
the previous pitch period. An error signal is generated for a
predetermined "bp" seconds. This is stored in register 13 and
serves as an input by line 16 to block 10. Block 10, related to
blocks 22 and 23 in FIG. 3, determines two maxima and makes a
decision as will be discussed in connection with block 23, FIG. 3,
as to whether voicing has ended or a change in pitch occurred. If
this has not occurred, a set of predictor weights is calculated and
an output record written as determined by a control signal on line
17.
If the end of voicing or a change in pitch has occurred, then the
routine proceeds by control on line 18 to block 11. It is noted
that if an unvoiced segment was determined by block 7 then a
control signal so indicates on line 19 directly to block 11 for
processing of the speech segment.
In any case, additional determinations are made by the pitch
pattern detector block 12 corresponding to blocks 26-37, FIG. 3.
This primarily has to do with the detection of a speech pattern and
the control of memory 14 to a voiced or unvoiced state. Various
output situations are represented by control on line 6 which
indicates an interval of speech weighted in order to be written as
an output record.
FLOW CHART OF FIG. 3
As indicated, FIG. 3 illustrates a flow chart for carrying out the
present invention. A decision is made at block 20 as to whether the
previous segment was voiced or unvoiced. If voiced, the routine
proceeds to block 21, if not, it proceeds to block 25.
BLOCK 21
Using the predictor weights calculated for the previous pitch
period and beginning at the end of that period, the speech waveform
is predicted and the error signal generated for bp seconds, where p
is the previous pitch period length and b determines the partial
period to be examined beyond the next expected pitch period ending.
Obviously, b must be between 1.0 and 2.0, so that the expected time
of the next error signal peak is included in the interval, but so
that the second succeeding peak is excluded.
BLOCK 22
The peaks (local maxima) of the error signal are scanned out to bp
seconds and two maxima are obtained, the maximum peak (Pk.sub.1)
within a small region around p seconds and the maximum peak
(Pk.sub.2) outside this small region.
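The scan of block 22 can be sketched as follows. This is an
illustrative Python function, not the patented apparatus: it takes
the largest error magnitude in each region rather than first
isolating local maxima, and the names, the default b of 1.5, and the
half-width of the "small region" are assumptions, since the text
gives no numeric values for them. Positions are sample indices
counted from the end of the previous pitch period.

```python
def pick_two_maxima(error, p, b=1.5, half_width=2):
    """Return (pk1, pk1_pos, pk2) over the first b*p error samples.

    error      -- non-negative error magnitudes, starting at the end
                  of the previous pitch period
    p          -- previous pitch period length, in samples
    b          -- assumed scan factor, between 1.0 and 2.0
    half_width -- assumed half-width of the small region around p
    """
    limit = min(len(error), int(b * p))
    pk1, pk1_pos, pk2 = 0.0, None, 0.0
    for i in range(limit):
        if abs(i - p) <= half_width:
            # Inside the small region around the expected peak.
            if error[i] > pk1:
                pk1, pk1_pos = error[i], i
        elif error[i] > pk2:
            # Everywhere else within the bp-second scan.
            pk2 = error[i]
    return pk1, pk1_pos, pk2
```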
BLOCK 23
If Pk.sub.1 does not exceed Pk.sub.2 by a significant amount
(Pk.sub.1 less than c Pk.sub.2, where c is a constant greater than
1.0), either voicing has ended or a significant change in pitch has
occurred. In either case, the region of consistent voicing has
ended, and the procedure must be abandoned. Block 25 is executed
next.
Otherwise, the location of Pk.sub.1 is taken as the end of the
pitch period. A set of predictor weights is calculated over the
period between the two pitch period endings, and an output record
written, block 24. The process is then repeated from block 20.
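The decision of block 23 reduces to a single comparison. The sketch
below is only an illustration: the function names and the default
value of c are assumptions (the text requires only that c be greater
than 1.0), and the equality case follows the claims, which treat
Pk.sub.1 equal to c Pk.sub.2 as continued voicing.

```python
def voicing_continues(pk1, pk2, c=2.0):
    """True when the expected peak dominates every other peak by factor c."""
    if c <= 1.0:
        raise ValueError("c must be greater than 1.0")
    return pk1 >= c * pk2


def next_block(pk1, pk2, c=2.0):
    """Return the flow-chart block to execute after block 23."""
    # Block 24 writes the output record for the pitch period just found;
    # block 25 restarts the search for a consistently-voiced region.
    return 24 if voicing_continues(pk1, pk2, c) else 25
```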
When a consistently-voiced region of speech occurs, it contains a
significant number of pitch periods (more than three). This fact is
utilized in discovering the beginning of such a region. The error
signal is scanned for a sufficient time in an attempt to discover
four error peaks with nearly constant spacing between them. The
following steps are taken. In this discussion, P.sub.L and P.sub.H
are the periods of the lowest and highest pitches of interest.
BLOCK 25
Predictor weights are calculated, the speech waveform is predicted,
and the error signal is generated over 4P.sub.L.
BLOCK 26
The peaks of the error signal are scanned beginning P.sub.H into
the region of the waveform being analyzed. The first four peaks
encountered are collected.
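A minimal sketch of the collection step of block 26 is given below.
It treats a peak as a sample strictly greater than both neighbors,
which is one plausible reading of "local maxima"; the function names
and the use of sample indices for P.sub.H are assumptions.

```python
def local_maxima(error, start=1):
    """Return (index, value) pairs for samples larger than both neighbors."""
    peaks = []
    for i in range(max(start, 1), len(error) - 1):
        if error[i] > error[i - 1] and error[i] > error[i + 1]:
            peaks.append((i, error[i]))
    return peaks


def first_four_peaks(error, p_h):
    """Collect the first four peaks found at or after index p_h."""
    return local_maxima(error, start=p_h)[:4]
```

Fewer than four pairs are returned when the scanned region does not
contain four peaks.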
BLOCKS 30 AND 33
If the first collected peak is found beyond P.sub.L, consistent
voicing has not been found in this region of speech. A set of
predictor weights is calculated over a period equal to P.sub.L /2
and an output record written, at block 28 after setting memory 14
to an "unvoiced" state at block 31. The process is then repeated
from block 20.
Otherwise, the four collected peaks are analyzed to determine if a
pitch pattern exists, block 33. If the periods between adjacent
peaks are approximately equal and each is not less than P.sub.H,
such a pattern has been found. The collected peaks are assumed to
be pitch period endings, and a region of consistent voicing has
been found beginning at the first peak. A set of predictor weights
is calculated up to the location of the first peak, and an output
record written, at block 37, after setting memory 14 to a "voiced"
state at block 34. Block 20 is executed next.
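The tests of blocks 30 and 33 can be sketched as below. The text does
not quantify "approximately equal", so the relative tolerance used
here is an assumption, as are the function name and the use of sample
indices for the peak times, P.sub.L and P.sub.H.

```python
def is_pitch_pattern(times, p_l, p_h, tolerance=0.1):
    """Return True when four peak times form a pitch pattern.

    times     -- chronological positions of the four collected peaks
    p_l, p_h  -- periods of the lowest and highest pitches of interest
    tolerance -- assumed allowed relative spread between intervals
    """
    if times[0] > p_l:
        # Block 30: first peak beyond P_L, so no consistent voicing here.
        return False
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    if any(interval < p_h for interval in intervals):
        # Block 33: a period shorter than P_H cannot be a pitch period.
        return False
    longest, shortest = max(intervals), min(intervals)
    return (longest - shortest) <= tolerance * longest
```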
BLOCK 36
If a pitch pattern is not found, the smallest of the four collected
peaks is discarded, block 36.
BLOCK 35
The error signal is scanned from the location of the most recently
found peak to find the next error peak.
BLOCK 32
If the end of predicted speech (4P.sub.L) is found prior to the
next error peak, this region of speech does not contain consistent
voicing. Blocks 31 and 28 are executed next.
BLOCK 29
If a peak is found prior to the end of predicted speech, it is
compared to the value of the peak discarded in block 36. If the new
one is not larger, it is rejected and block 35 is repeated.
BLOCK 27
If the new peak is larger than the one discarded in block 36, it is
taken as the new fourth peak, block 27. The pitch pattern
recognition process is then repeated from block 30.
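Blocks 25 through 36 together form a search loop, sketched below in
Python under several assumptions: the error peaks over the 4P.sub.L
window have already been extracted as chronological (time, value)
pairs, "approximately equal" is modeled by a relative tolerance, and
the function name and return convention are hypothetical. Block
numbers in the comments refer to FIG. 3.

```python
def find_voicing_onset(peaks, p_l, p_h, tolerance=0.1):
    """Return the time of the first peak of a pitch pattern, or None.

    peaks -- chronological (time, value) error peaks within 4 * P_L
    """
    if len(peaks) < 4:
        return None
    current = list(peaks[:4])            # block 26: first four peaks
    remaining = iter(peaks[4:])
    while True:
        times = [t for t, _ in current]
        if times[0] > p_l:               # block 30: first peak too late
            return None
        gaps = [b - a for a, b in zip(times, times[1:])]
        if min(gaps) >= p_h and max(gaps) - min(gaps) <= tolerance * max(gaps):
            return times[0]              # block 33: pattern found
        rejected = min(current, key=lambda pk: pk[1])
        current.remove(rejected)         # block 36: discard smallest peak
        for candidate in remaining:      # block 35: scan for next peak
            if candidate[1] > rejected[1]:   # block 29: compare values
                current.append(candidate)    # block 27: new fourth peak
                break
        else:
            return None                  # block 32: end of predicted speech
```

In the first test case below, the small peak at time 60 is discarded
and the later peak at time 330 completes a pattern with a spacing of
100 samples.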
DETAILED SYSTEM, FIG. 4
A detailed implementation of the system is illustrated in FIG.
4.
Control Network 50 controls the timing and sequence of operation of
the other portions of the system. Its outputs to the other blocks
are represented by cable 51 to avoid undue complication of the
diagram.
Sampled speech is inputted on line 52 and is stored in Speech
storage 65 until its processing by the system is complete.
Voiced-Unvoiced storage 53 indicates whether the previously analyzed
segment of speech was voiced or unvoiced. To begin a cycle of
operation, block 50 determines from block 53 by line 54 the status
of the previous segment.
If the previous segment was "voiced", Control Network 50 obtains
the length of that segment p from Segment Length Storage 56 on line
57. This is multiplied by a constant factor b (between 1.0 and
2.0) to determine the length of speech bp to be evaluated in
deciding whether voicing continues.
The prediction weights of the previous segment are moved from
storage block 60 to Adaptive Predictor 61 on line 62. This
predictor, as described by Atal, uses these weights and the speech
samples from Speech storage 65 on line 66 to predict subsequent
speech samples.
Predictor 61 will operate on an interval of speech of length bp
producing predicted speech samples which are conducted to
Subtraction Network 68 by line 69. Here the original speech samples
from line 66 are subtracted from the predicted samples to produce
the prediction error samples on line 70.
Line 70 carries the error samples to two Peak Pickers 72 and 73.
Picker 72 is controlled to scan the error samples in a small
interval around p. Picker 73 is turned on both before and after
Picker 72, scanning the error samples throughout bp (the time of
the speech predicted by block 61) except when Picker 72 is on.
Thus, Picker 72 selects the largest error sample within a small
interval around p and Picker 73 selects the largest sample outside
this interval.
Consistent voicing is assumed to continue if the error peak found
by Picker 72 is significantly greater than the one found by Picker
73. To determine this, the output of Picker 73 is transferred to
Multiplication Network 76 by line 77 where it is multiplied by a
constant greater than one (1). The result of this multiplication is
presented to Comparison Network 80 on line 81 where it is compared
to the peak found by Picker 72 on line 82.
Control Network 50 determines via line 84 the results of the
comparison in block 80. If the output of Picker 72 is greater, then
the exact location of the error peak in the speech interval is
stored in Segment Length storage 56 via line 86. Prediction Weight
Generator 88 (as described by Atal) uses the length of the speech
segment from Segment Length storage 56 on line 89 and the speech
samples from Speech storage 65 on line 90 to analyze the speech
segment and transfers the results via line 91 to Storage block 60.
The output gate 93 is opened to allow the contents of storage
blocks 53, 56 and 60 to be outputted on line 95.
If, however, the result of the comparison showed that the output of
Multiplication Network 76 was greater, then Control Network 50
would set the Voiced-Unvoiced storage 53 to unvoiced. Subsequent
operation will then be identical to that which would have occurred
had the previous segment of speech been unvoiced when the control
cycle was initiated.
In the unvoiced case, Prediction Weight Generator 88 is controlled
to calculate a set of weights over a portion of speech representing
a time of at least 4 P.sub.L, where P.sub.L is the pitch period of
the lowest pitch frequency of interest. The calculated weights are
stored in block 60 and used in Adaptive Predictor 61 to predict
speech. Predictor 61 and Subtraction Network 68 then operate to
produce the prediction error signal.
Rejected Peak Value Storage 100 is set initially to zero. The error
signal enters a three-stage shift register 101, which presents to a
Comparison Network 103 the most recent three error values via lines
104, 105 and 106, while storage block 100 presents its present
value via line 107. Each stage of register 101 is capable of
storing enough bits to represent the full value of the error
signal. When Comparison Network 103 detects that the value on line
105 is greater than the other three, then a local maximum of the
error signal has been found. This maximum value is transferred to
Peak Value Storage 110 via line 111 while the time location of the
maximum is transferred to Peak Time Storage 112 by line 113.
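The behavior of shift register 101 and Comparison Network 103 can be
modeled in software as below. This is a sketch of the described
logic, not of the hardware: the generator name is hypothetical, and
the rejected peak value is passed as a fixed argument, whereas
Storage 100 is updated during operation.

```python
from collections import deque


def detect_maxima(error, rejected_value=0.0):
    """Yield (time, value) for each local maximum above rejected_value.

    A maximum is reported when the middle stage of a three-stage
    register exceeds both of its neighbors and the rejected peak value.
    """
    register = deque(maxlen=3)           # models shift register 101
    for n, sample in enumerate(error):
        register.append(sample)
        if len(register) == 3:
            older, middle, newer = register
            if middle > older and middle > newer and middle > rejected_value:
                # The middle stage holds the sample from time n - 1.
                yield n - 1, middle
```

Because the middle stage is compared only after the following sample
has arrived, each maximum is reported one sample late, with its time
corrected accordingly.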
When four such maxima have been placed in storage blocks 110 and
112, the stored times are gated onto lines 115, 116, 117 and 118.
Comparison Network 120 compares the time of the first peak with
P.sub.L. If the time on line 115 does not exceed P.sub.L then
Control Network 50 is signalled on line 121 to continue the checks.
Subtractor Networks 123, 124 and 125 produce the intervals between
adjacent peaks which are presented to Comparison Network 127 by
lines 128, 129, and 130. Network 127 compares these three values
for approximate equality, and checks that each is less than P.sub.L and
greater than P.sub.H, the pitch period of the highest pitch
frequency of interest.
If all the requirements are met, control network 50 is signalled on
line 131 that voicing has been found beginning at the time on line
115. The time on line 115 is placed in Segment Length Storage 56
(connection not shown) and generator 88 calculates a set of weights
over this segment of unvoiced speech. The weights are stored in
storage 60 and Output Gate 93 is opened to output a record.
Subsequently, the Voiced-Unvoiced Storage 53 is changed to
"voiced", the pitch interval on line 128 is placed in the Segment
Length Storage 56 and a new cycle is initiated.
If, however, one or more of the requirements for voicing are
absent, the peak values are gated to Comparison Network 132 via
lines 135, 136, 137 and 138. Network 132 determines the least of
the four, transfers this value by line 140 to Rejected Peak Value
Storage 100, and signals Control Network 50 by line 141 which value
is the least. Network 50 causes the least peak and its time to be
removed from storage units 110 and 112, and the remaining peaks and
times to be moved in storage to maintain chronological order and to
leave position four vacant for a new error maximum. Shift Register
101 and Comparison Network 103 operate to locate error signal
maxima as before, but now, since Storage block 100 has a non-zero
value stored in it, the maximum selected by Network 103 must exceed
the value of the rejected peak in block 100. A selected maximum and
its time will be gated into blocks 110 and 112 and the
aforementioned tests performed.
This process continues until voicing is found, or until one or more
limits occur. These limits are:
1. that the location of the first peak in storage (110, 112) on
line 115 exceeds P.sub.L,
2. that one or more of the peak intervals on lines 128, 129 and 130
exceeds P.sub.L, or
3. that Adaptive Predictor 61 has predicted a portion of speech of
4P.sub.L.
When one of these limits is exceeded, the process is discontinued.
Control Network 50 places a fixed length (10 milliseconds) in
Segment Length Storage 56, causes the Prediction Weight Generator 88
to generate weights over this period, which are stored in block 60,
and opens Output Gate 93 to output a record. A new cycle of
operation is then initiated.
While the invention has been particularly shown and described with
respect to a preferred embodiment, it will be understood by those
skilled in the art that various changes in form and detail may be
made without departing from the spirit and scope of the
invention.
* * * * *