U.S. patent number 3,909,532 [Application Number 05/456,027] was granted by the patent office on 1975-09-30 for apparatus and method for determining the beginning and the end of a speech utterance.
This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to Lawrence Richard Rabiner, Lewis Hyman Rosenthal, Ronald William Schafer.
United States Patent 3,909,532
Rabiner, et al. September 30, 1975
Apparatus and method for determining the beginning and the end of a
speech utterance
Abstract
It has been discovered that the energy of the code words at the
output of an adaptive speech encoder may be utilized to accurately
determine the beginning and end of an encoded speech utterance. The
beginning of an utterance is detected when the code word energy
exceeds a predetermined threshold for a fixed duration of time.
Likewise, the end of an utterance is detected when the code word
energy falls below the threshold for another fixed duration of
time.
Inventors: Rabiner; Lawrence Richard (Berkeley Heights, NJ), Rosenthal; Lewis Hyman (Cambridge, MA), Schafer; Ronald William (New Providence, NJ)
Assignee: Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID: 23811146
Appl. No.: 05/456,027
Filed: March 29, 1974
Current U.S. Class: 704/215; 375/247; 52/DIG.13; 704/E11.005
Current CPC Class: G10L 25/87 (20130101); Y10S 52/13 (20130101)
Current International Class: G10L 11/00 (20060101); G10L 11/02 (20060101); G10L 001/04 ()
Field of Search: 179/1SA,1SC; 325/38B,62,326
References Cited
U.S. Patent Documents
Other References
Johnson, C. et al., "Adaptive Rate Delta Modulator," IBM Tech.
Disclosure, April 1973.
Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Kemeny; E. S.
Attorney, Agent or Firm: Murphy; G. E.
Claims
What is claimed is:
1. Apparatus for determining a boundary of an applied speech
utterance comprising:
means for adaptive encoding said applied speech utterance to
develop coded output signals;
means for developing a signal representative of the energy of said
coded output signals; and
means for comparing said representative signal with a predetermined
threshold signal.
2. The apparatus defined in claim 1 wherein said signal
representative of the energy of said coded output signals is
representative of the adaptation activity of said means for
adaptive encoding.
3. The apparatus defined in claim 1 wherein said threshold signal
is representative of an energy level intermediate the energy of
background silence and the average energy of said speech
utterance.
4. Apparatus for determining a boundary of an applied speech
utterance comprising:
means for adaptive differential pulse code modulating said applied
speech utterance to develop digitally coded output signals;
means for developing a signal representative of the energy of said
coded output signals; and
means for comparing said representative signal with a predetermined
digital threshold signal.
5. The apparatus defined in claim 4 wherein said signal
representative of the energy of said coded output signals is
defined as the sum of the squares of a predetermined number of said
digitally coded output signals.
6. The apparatus defined in claim 4 wherein said digital threshold
signal is representative of an energy level intermediate the energy
of background silence and the average energy of said speech
utterance.
7. Apparatus for detecting the beginning of a speech utterance,
including an adaptive differential pulse code modulation circuit
responsive to said speech utterance, comprising:
means responsive to the digitally coded output signals of said
modulation circuit for developing a digital signal representative
of the energy of said coded output signals; and
means responsive to said representative signal for developing an
output signal when said representative signal is greater than, for
a predetermined interval of time, an applied digital threshold
signal, said output signal indicative of the beginning of said
speech utterance.
8. The apparatus defined in claim 7 wherein said signal
representative of the energy of said coded output signals is
defined as the sum of the squares of a predetermined number of said
digitally coded output signals.
9. The apparatus defined in claim 7 wherein said digital threshold
signal is representative of an energy level intermediate the energy
of background silence and the average energy of said speech
utterance.
10. The apparatus defined in claim 7 wherein said means for
developing a signal representative of the energy of said digitally
coded output signals comprises:
first means for doubling each digitally coded output signal of said
modulation circuit;
second means for subtracting from each of said doubled coded output
signals a predetermined digital reference signal;
third means for squaring each output signal of said second
means;
fourth means for sequentially storing a predetermined number of
said squared signals;
fifth means for subtracting from the most recently stored squared
signal in said fourth means the oldest stored squared signal in
said fourth means;
sixth means for adding the output signal of said fifth means and an
applied signal to develop a signal representative of the energy of
said coded output signals; and
seventh means for applying said representative signal to said sixth
means after a predetermined interval of time has elapsed.
11. Apparatus for determining the beginning of a speech signal,
including an adaptive differential pulse code modulation circuit
responsive to said speech signal, comprising:
means responsive to the output signals of said modulation circuit
for developing a signal representative of the energy of said output
signals; and
means responsive to said representative signal for developing an
indicator signal when said representative signal is greater than,
for a predetermined interval of time, an applied threshold signal,
said indicator signal indicative of the beginning of said speech
signal.
12. Apparatus for detecting the end of a speech utterance,
including an adaptive differential pulse code modulation circuit
responsive to said speech utterance, comprising:
means responsive to the digitally coded output signals of said
modulation circuit for developing a digital signal representative
of the energy of said coded output signals; and
means responsive to said representative signal for developing an
output signal when said representative signal is less than, for a
predetermined interval of time, an applied digital threshold
signal, said output signal indicative of the end of said speech
utterance.
13. The apparatus defined in claim 12 wherein said signal
representative of the energy of said coded output signals is
defined as the sum of the squares of a predetermined number of said
digitally coded output signals.
14. The apparatus defined in claim 12 wherein said digital
threshold signal is representative of an energy level intermediate
the energy of background silence and the average energy of said
speech utterance.
15. The apparatus defined in claim 12 wherein said means for
developing a signal representative of the energy of said digitally
coded output signals comprises:
first means for doubling each digitally coded output signal of said
modulation circuit;
second means for subtracting from each of said doubled coded output
signals a predetermined digital reference signal;
third means for squaring each output signal of said second
means;
fourth means for sequentially storing a predetermined number of
said squared signals;
fifth means for subtracting from the most recently stored squared
signal in said fourth means the oldest stored squared signal in
said fourth means;
sixth means for adding the output signal of said fifth means and an
applied signal to develop a signal representative of the energy of
said coded output signals; and
seventh means for applying said representative signal to said sixth
means after a predetermined interval of time has elapsed.
16. Apparatus for determining the end of a speech signal, including
an adaptive differential pulse code modulation circuit responsive
to said speech signal, comprising:
means responsive to the output signals of said modulation circuit
for developing a signal representative of the energy of said output
signals; and
means responsive to said representative signal for developing an
indicator signal when said representative signal is less than, for
a predetermined interval of time, an applied threshold signal, said
indicator signal indicative of the end of said speech signal.
17. Apparatus for detecting the boundaries of a speech utterance,
including an adaptive differential pulse code modulation circuit
responsive to said speech utterance, comprising:
code word energy means responsive to the digitally coded output
signals of said modulation circuit for developing a digital signal
representative of the energy of said coded output signals;
comparator means for comparing said digital representative signal
with an applied digital threshold signal; and
means responsive to said comparator means for developing a signal
indicative of the beginning of said speech utterance when said
representative signal is greater than, for a first predetermined
interval of time, said threshold signal, and for developing a
signal indicative of the end of said speech utterance when said
representative signal is less than, for a second predetermined
interval of time, said threshold signal.
18. Apparatus for determining the boundaries of a speech signal,
including an adaptive differential pulse code modulation circuit
responsive to said speech signal, comprising:
code word energy means responsive to the output signals of said
modulation circuit for developing a signal representative of the
energy of said output signals;
comparator means for comparing said representative signal with an
applied threshold signal; and
means responsive to said comparator means for developing a signal
indicative of the beginning of said speech signal when said
representative signal is greater than, for a first predetermined
interval of time, said threshold signal, and for developing a
signal indicative of the end of said speech signal when said
representative signal is less than, for a second predetermined
interval of time, said threshold signal.
19. Apparatus for detecting the boundaries of a speech utterance,
including an adaptive differential pulse code modulation circuit
responsive to said speech utterance, comprising:
code word energy means responsive to the digitally coded output
signals of said modulation circuit for developing a digital signal
representative of the energy of said coded output signals; and
means for developing a signal indicative of the beginning of said
speech utterance when said representative signal is greater than an
applied digital threshold signal for a first predetermined interval
of time, and for developing a signal indicative of the end of said
speech utterance when said representative signal is less than said
applied threshold signal for a second predetermined interval of
time.
20. The apparatus defined in claim 19 wherein said signal
representative of the energy of said coded output signals is
defined as the sum of the squares of a predetermined number of said
digitally coded output signals.
21. The apparatus as defined in claim 19 wherein said digital
threshold signal is representative of an energy level intermediate
the energy of background silence and the average energy of said
speech utterance.
22. The apparatus defined in claim 19 wherein said means for
developing a signal representative of the energy of said coded
output signals comprises:
first means for doubling each digitally coded output signal of said
modulation circuit;
second means for subtracting from each of said doubled coded output
signals a predetermined digital reference signal;
third means for squaring each output signal of said second
means;
fourth means for sequentially storing a predetermined number of
said squared signals;
fifth means for subtracting from the most recently stored squared
signal in said fourth means the oldest stored squared signal in
said fourth means;
sixth means for adding the output signal of said fifth means and an
applied signal to develop a signal representative of the energy of
said coded output signals; and
seventh means for applying said representative signal to said sixth
means after a predetermined interval of time has elapsed.
23. The apparatus defined in claim 19 wherein said means for
developing said indicative signals comprises:
digital comparator means responsive to said signal representative
of the energy of said coded output signals and to said applied
digital threshold signal for developing a signal at a first output
terminal when said representative energy signal is greater than
said threshold signal and for developing a signal at a second
output terminal when said representative energy signal is less
than said threshold signal;
a bistable circuit having first and second output terminals, and
set and reset terminals;
a first logic circuit responsive to said comparator first output
terminal signal and to the signal at the first output terminal of
said bistable circuit;
a second logic circuit responsive to said comparator second output
terminal signal and to the signal at the second output terminal of
said bistable circuit;
a third logic circuit responsive to the output signals of said
first and second logic circuits;
a fourth logic circuit responsive to the output signal of said
third logic circuit and to an applied clock signal;
a counter circuit, having a plurality of output terminals,
responsive to the output signal of said fourth logic circuit;
a fifth logic circuit, for developing said signal indicative of the
end of said speech utterance, responsive to the signal at the
second output terminal of said bistable circuit and to the signal
at a preselected one of said plurality of counter circuit output
terminals;
a sixth logic circuit, for developing said signal indicative of the
beginning of said speech utterance, responsive to the signal at the
first output terminal of said bistable circuit and to the
signals at the other of said plurality of counter circuit output
terminals;
a seventh logic circuit responsive to the output signals of said
third, fifth, and sixth logic circuits for developing a control
signal for said counter circuit, said control signal returning
said counter to a predetermined initial state; and
means for connecting the output terminals of said fifth and sixth
logic circuits, respectively, to said set and reset terminals of
said bistable circuit.
24. The method of determining a boundary of an applied speech
utterance comprising the steps of:
adaptive differential pulse code modulating said applied speech
utterance to develop digitally coded output signals;
developing a signal representative of the energy of said coded
output signals; and
comparing said representative signal with a predetermined digital
threshold signal.
25. The method defined in claim 24 wherein said signal
representative of the energy of said coded output signals is
defined as the sum of the squares of a predetermined number of said
digitally coded output signals.
26. The method defined in claim 24 wherein said digital threshold
signal is representative of an energy level intermediate the energy
of background silence and the average energy of said speech
utterance.
27. The method of determining a boundary of an applied speech
utterance comprising the steps of:
adaptive encoding said applied speech utterance to develop coded
output signals;
developing a signal representative of the energy of said coded
output signals; and
comparing said representative signal with a predetermined threshold
signal.
Description
BACKGROUND OF THE INVENTION
This invention pertains to the processing of speech signals and,
more particularly, to apparatus for detecting the beginning and
end, i.e., the endpoints or boundaries, of a speech utterance.
It is well known that one of the goals of modern communications
research is to facilitate communication between man and machine,
preferably to an extent that the human voice may be utilized to
control and direct the operations of a machine, e.g., a computer.
Thus, the fields of speech recognition, speech verification, and
automatic voice response are currently the subject of extensive
research. Generally, for these applications, speech must be stored
in digital form. Typically, a file of speech is created and stored
in a suitable memory, e.g., a fixed head disk or drum. In order to
efficiently store speech, it is necessary that individual words and
phrases be stored in memory without intervening periods of silence
between entries. Thus, the need to automatically locate the
beginning and end of a speech utterance frequently arises in speech
processing for man-machine communication.
DESCRIPTION OF THE PRIOR ART
Conventionally, the task of determining the endpoints of a speech
utterance has been accomplished by manual editing, utilizing a
combination of auditory and visual examinations of the speech
waveform. However, manual editing is both time-consuming and
subject to the inaccuracies concomitant with human judgment.
Furthermore, repeatable results are not normally obtained. One
reason for this is that the wide dynamic range of speech renders
the combination of ear and eye a poor determinant of word
boundaries. This is especially true when an unvoiced segment of
speech, e.g., the fricative at the beginning of the word "three,"
appears at the beginning or end of a word. Consequently, manual
editing usually results in shortening the speech, both at the
beginning and at the end of the utterance. Thus, the words are
"chopped," and when they are concatenated to form a message, the
effects are quite discernible and also distracting.
It is thus an object of this invention to efficiently, accurately,
and automatically detect the beginning and end of a speech
utterance.
SUMMARY OF THE INVENTION
This and other objects of this invention are accomplished by
utilizing an adaptive speech encoder, e.g., an adaptive
differential pulse code modulator (ADPCM), an adaptive delta
modulator, etc. It has been discovered by us that because of the
step size adaptation used in developing adaptive encoded speech, an
adaptive speech encoder effectively exhibits a form of automatic
gain control useful in determining the endpoints of an utterance.
Coded output words of such a coder, it has been found, exhibit high
energy during both voiced and unvoiced speech, but not during
background silence. However, the code word energy is not simply and
directly related to the energy of the original speech signal. Thus,
in accordance with this invention, the beginning of a speech
utterance is detected when the code word energy exceeds a
predetermined threshold for a fixed interval of time. Likewise, the
end of an utterance is detected when the code word energy falls
below the threshold for another fixed interval of time.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts a prior art ADPCM coder which may be used in the
practice of this invention;
FIG. 2 displays the code word sequence for the utterance "oh";
FIG. 3 displays the decoded speech waveform corresponding to the
code word sequence of FIG. 2;
FIG. 4 is a block diagram of apparatus used in the practice of this
invention to determine code word energy;
FIG. 5 is a block diagram of apparatus used in the practice of this
invention to determine the beginning and end of a speech
utterance;
FIG. 5A is a block diagram depicting the system operation of this
invention;
FIG. 6 displays the code word sequence for the beginning of the
utterance "three";
FIG. 7 displays the code word energy corresponding to the code word
sequence of FIG. 6;
FIG. 8 displays the decoded speech waveform corresponding to the
code word sequence of FIG. 6;
FIG. 9 displays the energy of the speech waveform of FIG. 8;
FIG. 10 displays the code word sequence for the end of the
utterance "three"; and
FIG. 11 displays the speech waveform corresponding to the code word
sequence of FIG. 10.
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 depicts a prior art adaptive differential pulse code
modulation circuit which is described in detail in the article by
P. Cummiskey, N. S. Jayant, and J. L. Flanagan, entitled "Adaptive
Quantization in Differential PCM Coding of Speech," Bell System
Technical Journal, Vol. 52, pp. 1105-1118, September 1973. In the
ADPCM coder of FIG. 1, differential input amplifier or network 11
develops an output signal proportional to the difference between an
applied sampled speech signal and a signal which is an estimate of
the incoming speech signal. This difference signal is quantized in
adaptive quantizer 12 and applied to encoder 13 and to summing
amplifier or network 14. Summing amplifier 14, in conjunction with
first order prediction network 15, having a transfer function, for
example, of $az^{-1}$, is utilized to develop an estimate of
the incoming speech signal. If the estimate of the input speech
signal is fairly accurate, then the difference signal emanating
from network 11 will be small and thus more accurately represented
by a fixed number of bits than the input speech samples themselves.
The difference signal, although nowhere near as redundant as the
original speech signal, still exhibits a wide amplitude range. In
order to make efficient use of the available quantization levels of
quantizer 12, the peak excursion of the signal should be matched to
the range of the quantizer or vice versa. Thus, for low level
signals such as fricatives, the absolute amplitude value of the
quantizer step size should be small compared to that required for
high level voiced sounds. Accordingly, the need for adaptive
quantization is apparent; logic network 16 utilizes the coded
speech signals (code words) emanating from encoder 13 to determine
optimum quantization steps. That is, logic network 16 monitors the
coded output of encoder 13 and provides for adaptation of the step
size on the basis of the most recent encoded quantizer output. For
example, if the code word corresponds to one of the higher levels,
the quantizer is overloaded and the step size is increased. On the
other hand, if the code word corresponds to one of the lower
levels, the step size is decreased. Step size adaptation
effectively compensates for amplitude variations to the extent that
the quantizer treats low level unvoiced speech signals, e.g.,
fricatives, much the same as high level voiced speech signals. The
objective, of course, is that each of the quantizer levels be used
a significant portion of the time regardless of the absolute
amplitude level of the incoming speech samples. However, when the
amplitude of the input speech signal is of the order of the minimum
step size, the adaptation logic insures that the step size will
seek its minimum value and the difference signal will then fall
within the lowest quantization levels. We have discovered that when
no speech is present at the input, the code word energy will vary
only slightly. It is this feature of ADPCM speech encoding, and of
adaptive encoding in general, that is turned to account in the
practice of this invention. It is to be understood that the
principles of this invention are applicable to all forms of
adaptive encoders including ADPCM and adaptive delta
modulation.
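The coder of FIG. 1 can be illustrated with a short software sketch. It is a minimal model, not the circuit of the cited article: the predictor coefficient `a`, the step size limits, and the table of step size multipliers are hypothetical values chosen only to show the adaptation behavior described above. The code words run from 0 (largest negative level) to 15 (largest positive level).

```python
import math

def adpcm_encode(samples, a=0.9, step_min=1.0, step_max=512.0):
    """Encode samples into 4-bit code words (0..15) with an adaptive step size.

    All numeric parameters are illustrative assumptions, not patent values.
    """
    # Hypothetical multipliers indexed by level magnitude 0..7:
    # inner levels shrink the step, outer levels enlarge it (logic network 16).
    multipliers = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]
    step = step_min
    estimate = 0.0
    codes = []
    for x in samples:
        diff = x - estimate                        # differential network 11
        level = math.floor(diff / step)            # adaptive quantizer 12
        level = max(-8, min(7, level))             # 16 levels, clipped on overload
        codes.append(level + 8)                    # encoder 13: 0000 .. 1111
        reconstructed = estimate + (level + 0.5) * step   # summing network 14
        estimate = a * reconstructed               # first order predictor 15
        magnitude = level if level >= 0 else -level - 1   # 0..7
        step = min(step_max, max(step_min, step * multipliers[magnitude]))
    return codes

silence = adpcm_encode([0.0] * 200)
tone = adpcm_encode([1000.0 * math.sin(2 * math.pi * n / 32) for n in range(200)])
```

With a zero input the step size stays at its minimum and the code words alternate between the two innermost levels, whereas the tone drives the code words across the outer levels as well.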
FIG. 2 is an exemplary display of code word activity for the voiced
utterance "oh." Each line, A, B, C, D of FIG. 2, corresponds to
approximately 256 samples (6 kHz sampling rate) of the applied
speech utterance, i.e., approximately 40 milliseconds of the
signal. Line B is to be considered a continuation of line A, line C
a continuation of line B, etc. It is noted that for line A, and for
most of line B, the code words show little activity, remaining for
the most part within a limited range of quantization levels. This
first part of the code word sequence corresponds to background
silence. However, at almost the end of line B, and then for the
remainder of lines C and D, the code word sequence fluctuates much
more rapidly and with greater amplitude. FIG. 3 illustrates the
decoded speech waveform corresponding to the code word sequence of
FIG. 2. It is noted in FIG. 3 that voiced speech apparently
commences somewhere near the end of line B and continues for lines C
and D. This property of code words to indicate the presence of
speech activity is more accurately reflected in what we define as
adaptation activity or code word energy. The code word energy may
be defined as the number of code word adaptations per unit time. In
one embodiment of this invention, we used as a measurement of
energy the sum of the squares of the code words for one hundred and
one samples, or code words, corresponding to a 16 millisecond
window centered about a selected sample. That is, the code word
energy may be defined as
$$E(n) = \sum_{i=n-50}^{n+50} c^2(i) \qquad (1)$$
where c(i) corresponds to a code
word emanating from encoder 13 of FIG. 1. Of course, other
equivalent definitions of energy may be utilized.
In the prior art ADPCM implementation of FIG. 1, the largest
negative quantization level is represented by the binary code word
0000 while the largest positive quantization level is represented
by the binary code word 1111, corresponding to the decimal number
15. Thus, it is necessary, if one is using such a symmetrical
coding system, to subtract from the code words a number
corresponding to the d.c. level or average value of the code words
to make the average level of the code words equal to zero. Of
course, a different coding implementation may be utilized which
inherently has a zero average value. Since the number 7.5,
corresponding to the average value, may not be conveniently
represented in digital form, the following definition of energy may
be utilized:
$$E(n) = \sum_{i=n-50}^{n+50} a(i) \qquad (2)$$
where $a(i) = [2c(i) - 15]^2$. By using
this definition, the d.c. level is removed from consideration and
the energy content of the code words differs from the definition of
Eq. (1) by only a multiplicative constant. It may readily be shown
that the energy term defined by Eq. (2) is equivalent to
E(n) = E(n-1) + a(n+50) - a(n-51). (3)
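The recursion of Eq. (3) follows from differencing two adjacent windows. The worked steps below are a sketch that assumes Eq. (2) is the 101-term sum of a(i) over i = n-50, ..., n+50, as implied by the window described above. The last line shows the multiplicative constant that relates a(i) to the code word with its 7.5 average removed.

```latex
\begin{align*}
E(n) - E(n-1) &= \sum_{i=n-50}^{n+50} a(i) \;-\; \sum_{i=n-51}^{n+49} a(i) \\
              &= a(n+50) - a(n-51), \\
a(i) &= [2c(i) - 15]^2 = 4\,[c(i) - 7.5]^2 .
\end{align*}
```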
The code word energy, in accordance with this invention, is
computed at each sample of the speech signal and compared with a
threshold which is established at a level intermediate to the
measured energy of silence and the average measured energy of the
speech utterance. When the code word energy exceeds this threshold
for approximately 320 consecutive samples, corresponding to about
50 milliseconds of speech, the word c(n) at which the energy first
exceeded the threshold is defined as the beginning of an utterance.
The code word energy-threshold comparison is continued, and when
the code word energy falls below the threshold for approximately
1,024 consecutive samples, corresponding to about 160 milliseconds
of speech, the point at which the energy first fell below the
threshold is defined as the end of the utterance. The 160
millisecond criterion insures that a stop consonant within a word
or phrase will not be mistaken for the end of the utterance.
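The endpoint rule just described may be sketched in software as follows. This is an illustrative model only: the function name and the treatment of an energy exactly equal to the threshold (counted here as "not above") are assumptions, and the run lengths of 320 and 1,024 samples are the approximate values given in the text.

```python
def find_endpoints(energy, threshold, begin_run=320, end_run=1024):
    """Return (begin, end) sample indices, or None for an endpoint not found.

    begin: first sample of the first run of begin_run samples above threshold.
    end:   first sample of the later run of end_run samples not above threshold.
    """
    in_speech = False
    run_start = None
    run_len = 0
    begin = None
    for n, e in enumerate(energy):
        above = e > threshold
        wanted = above if not in_speech else not above
        if wanted:
            if run_len == 0:
                run_start = n          # remember where this run began
            run_len += 1
        else:
            run_len = 0                # run broken, start over
        if not in_speech and run_len == begin_run:
            begin, in_speech, run_len = run_start, True, 0
        elif in_speech and run_len == end_run:
            return begin, run_start
    return begin, None

# A 200-sample dip (e.g. a stop consonant) inside the word is not taken as the end.
energy = [0] * 100 + [10] * 400 + [0] * 200 + [10] * 300 + [0] * 1100
endpoints = find_endpoints(energy, threshold=5)
```

Because each endpoint is reported at the start of the qualifying run, the decision is necessarily made begin_run or end_run samples after the boundary it identifies.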
Apparatus for determining the energy of the code words in
accordance with Eq. (3) is illustrated in FIG. 4. A code word,
c(i), emanating from encoder 13 of FIG. 1, is applied to digital
doubler 17, wherein it is doubled in value to develop a signal 2
c(i), which is twice the digital value of the applied code word.
Digital doubler 17 may be of any well-known configuration, e.g., a
shift left by one bit register will double the value of an applied
binary signal. Digital subtractor 18 subtracts from signal 2 c(i),
a signal supplied by digital reference register 19. The signal
stored in register 19 is proportional to the d.c. level or average
of the code words. In a particular embodiment, the digital signal
stored in register 19 is equal to fifteen as required by Eq. (2).
Digital multiplier 21 multiplies the output signal of subtractor 18
by itself to achieve a squared signal which corresponds to the
function a(i) of Eq. (2). Both subtractor 18 and multiplier 21 may
be conventional digital arithmetic circuits. The signal output,
a(i), of multiplier 21 is applied to shift register 22. Register
22, which preferably has a digital capacity of one hundred and two
words, sequentially shifts digital signal a(i) through the register
at the system clock rate. It is to be understood that in the
circuitry of FIG. 4, and also in that of FIG. 5, all operations
are performed in synchronism with the master sampling clock of the
coder of FIG. 1, which has not been depicted in order not to
obfuscate the operation of the instant invention. At any
point of time, the last digital word stored in register 22, i.e.,
the oldest word in storage, corresponds to a(n-51) and the first
word stored in register 22, i.e., the most recently stored word,
corresponds to a(n+50). The first and last words of register 22 are
combined in conventional digital subtractor 23 to form a difference
signal, a(n+50) -a(n-51). This difference signal is applied to
conventional digital adder 24 which, in conjunction with delay
network 25, develops a signal representative of the code word
energy as defined in Eq. (3). Delay network 25 may be of
conventional design and is utilized to delay the output of adder 24
by one clock period.
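The data path of FIG. 4 can be mimicked one code word at a time. The class below is a minimal sketch in which the class and attribute names are hypothetical, and the register is assumed to start filled with zeros. Comments tie each step to the numbered element it stands in for.

```python
from collections import deque

class CodeWordEnergy:
    """Streaming model of FIG. 4: E(n) = E(n-1) + a(n+50) - a(n-51)."""

    def __init__(self, window=101, reference=15):
        # Shift register 22 holds window + 1 = 102 squared words (assumed zero at start).
        self.register = deque([0] * (window + 1), maxlen=window + 1)
        self.reference = reference      # digital reference register 19
        self.energy = 0                 # value held by delay network 25

    def push(self, code):
        doubled = code << 1                     # digital doubler 17 (shift left one bit)
        centered = doubled - self.reference     # digital subtractor 18
        squared = centered * centered           # digital multiplier 21, a(i)
        self.register.appendleft(squared)       # newest word in, word past 102 drops out
        difference = self.register[0] - self.register[-1]   # digital subtractor 23
        self.energy += difference               # adder 24 with one-clock delay 25
        return self.energy

codes = [(7 * i) % 16 for i in range(300)]
meter = CodeWordEnergy()
energies = [meter.push(c) for c in codes]
```

In this model each returned value equals the sum of the 101 most recent a(i) values, so it describes the window centered 50 samples before the newest code word.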
The output signal E(n), of adder 24, is applied to digital
comparator 26 of FIG. 5. Comparator 26 compares the energy of each
code word E(n) with a signal stored in register 27 to determine
whether or not the energy of the code word is above or below a
predetermined threshold. The threshold is generally empirically
determined and may be approximately equal to a point midway between
the measured energy of background silence and the average measured
energy of the speech signal, which is readily obtained by averaging
the output of the apparatus of FIG. 4. As discussed above, when the
code word energy exceeds this threshold for approximately 50
milliseconds or 300 consecutive samples, the point at which the
energy function first exceeded the threshold is defined as the
beginning of an utterance. The apparatus of FIG. 5 is utilized to
determine when this has occurred. Also, when an utterance has been
determined to have begun, the apparatus of FIG. 5 continues to make
a comparison of the energy of subsequent code words with the
threshold signal stored in register 27. When the code word energy
falls below this threshold for approximately 160 milliseconds or
1,000 consecutive samples, the point at which the energy function
first passed below the threshold is recorded as the end of the
utterance.
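One way to obtain the value loaded into register 27 is sketched below, assuming that separate stretches of measured background-silence energy and speech energy are available. The function name and the sample figures are illustrative only.

```python
from statistics import mean

def midpoint_threshold(silence_energy, speech_energy):
    """Threshold midway between mean silence energy and mean speech energy."""
    return (mean(silence_energy) + mean(speech_energy)) / 2

threshold = midpoint_threshold([100, 102], [1000, 3000])
```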
To understand the operation of the circuit of FIG. 5, it is
convenient to assume that speech is not present at the input to the
ADPCM coder and, in fact, has not been present long enough so that
the last indication encountered was an end of a speech utterance.
This is indicated by certain states or levels for particular
circuit components. Thus, it may be assumed that output lead 39 of
flip-flop 34 is at a logical 0 state and that output lead 41 of
flip-flop 34 is at a logical 1 state. It may also be assumed that
output lead 43 of digital comparator 26 is at a 0 state and that
output lead 45 of digital comparator 26 is at a 1 state.
Accordingly, input lead 42 to NAND gate 28 is at a logical 1 state
and input lead 44 to NAND gate 29 is at a logical 0 state. In
accordance with the well-known logical rules for NAND circuits,
input lead 46 to NAND gate 31 is at a logical 1 state and input
lead 47 to NAND gate 31 is at a logical 1 state. Thus, lead 48,
connecting the output of NAND gate 31 and one of the inputs to NAND
gate 32, is at a 0 state and lead 51, one of the inputs to NAND
gate 38, is also at a 0 state. Clock input 49 to NAND gate 32 is
presumed to enable NAND gate 32 upon the presence of a logical 1 on
lead 49. Accordingly, output lead 54 of NAND gate 32 is at a
logical 1 state; counter 33 is presumed to be incremented upon the
presence of a 0 level input on line 54. Thus, output leads 55, 56
and 57 of counter 33, which correspond to the 10th, 8th and 6th
powers, respectively, of the binary base "two," are at a logical 0
state. Output lead 58 of NAND gate 35 is thus at a logical 1 state
as is output lead 59 of NAND gate 36. Input leads 53 and 52 to NAND
gate 38 are also at a logical 1 state, thus establishing output
lead 61 of NAND gate 38 at a logical 1 state and output lead 62 of
inverter circuit 37 at a logical 0 state. Since this is the clear
input to counter 33, a logical 0 state is presumed to clear the
counter.
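The idle-state levels recited above can be checked with a small gate-level model. This Python sketch is illustrative only. The function names are hypothetical, and the sketch assumes, as the recited levels suggest, that lead 42 follows flip-flop output lead 41, that lead 44 follows flip-flop output lead 39, and that lead 51 follows lead 48. Leads 52 and 53 are simply supplied as inputs.

```python
def nand(*inputs):
    """NAND gate: the output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def gate_states(lead43, lead45, lead39, lead41,
                clock49=1, lead52=1, lead53=1):
    """Evaluate the NAND network of FIG. 5 for one set of input levels."""
    lead42 = lead41                        # assumed wiring from flip-flop 34
    lead44 = lead39                        # assumed wiring from flip-flop 34
    lead46 = nand(lead43, lead42)          # NAND gate 28
    lead47 = nand(lead45, lead44)          # NAND gate 29
    lead48 = nand(lead46, lead47)          # NAND gate 31
    lead51 = lead48                        # assumed to be the same node
    lead54 = nand(lead48, clock49)         # NAND gate 32; 0 increments counter 33
    lead61 = nand(lead51, lead52, lead53)  # NAND gate 38
    lead62 = 1 - lead61                    # inverter 37; 0 clears counter 33
    return {46: lead46, 47: lead47, 48: lead48, 51: lead51,
            54: lead54, 61: lead61, 62: lead62}


# Idle state: energy below threshold, flip-flop 34 indicating "no utterance".
idle = gate_states(lead43=0, lead45=1, lead39=0, lead41=1)
```

Under these assumptions the model gives lead 54 at 1 and lead 62 at 0 in the idle state, i.e., counter 33 is not incremented and is held cleared.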
If it is now presumed that the energy signal applied to digital
comparator 26 exceeds the output of digital threshold register 27,
output lead 43 of comparator 26 assumes a logical 1 state and
output lead 45 of comparator 26 assumes a 0 state. Output lead 46
of NAND gate 28 is then at a logical 0 state and output lead 47 of
NAND gate 29 is at a logical 1 state. Output lead 48 of NAND gate
31 assumes a logical 1 state, as does lead 51, which is one of the
inputs to NAND gate 38. Since input leads 52 and 53 are already at
a logical 1 state, the output lead 61 of NAND gate 38 assumes a
logical 0 state and therefore output lead 62 of inverter 37 assumes
a logical 1 state, thereby allowing counter 33 to be incremented.
Upon the presence of a logical 1 signal at clock input 49 to NAND
gate 32, output lead 54 of NAND gate 32 assumes a logical 0 state
and counter 33 is incremented. Assuming that the input energy
signal to comparator 26 remains above the predetermined threshold,
then with each energy word, counter 33 will be incremented. When
counter 33 reaches a level of 320, which corresponds to a 1 output
on leads 56 and 57, output lead 59 of NAND gate 36 assumes a
logical 0 state indicating the beginning of a speech utterance. The
presence of a 0 level signal on output lead 59 resets flip-flop 34
so that a logical 1 signal appears on output lead 39 and a logical
0 signal appears on output lead 41. Output lead 58 of NAND gate 35
remains at a logical 1 state. The resetting of flip-flop 34 causes
output lead 59 to return to a logical 1 state and in turn causes
input lead 44 to NAND gate 29 to assume a logical 1 state and input
lead 42 to NAND gate 28 to assume a logical 0 state. Assuming that
the energy signal remains above the threshold, output lead 43 is
still at a logical 1 state, but since input lead 42 to NAND gate 28
is now at a logical 0 state, output lead 46 of NAND gate 28 assumes
a logical 1 state. Output lead 45 of comparator 26 is still at a 0
state, but input lead 44 to NAND gate 29 is now at a logical 1
state. Thus, output lead 47 of NAND gate 29 is at a logical 1
state. Accordingly, output lead 48 of NAND gate 31 assumes a
logical 0 state as does input lead 51 to NAND gate 38. Input lead
54 to counter 33 assumes a logical 1 state and counter 33 is not
incremented. Since input lead 51 is at a 0 state and input leads 52
and 53 of NAND gate 38 are at a logical 1 state, output lead 61 of
NAND gate 38 is at a logical 1 state and the clear input to counter
33, lead 62, is at a logical 0 state. Thus, the counter is cleared
and output leads 58, 59 remain at a logical 1 state. When the
energy of the applied code words to digital comparator 26 decreases
to a level below the threshold level established by register 27,
output lead 45 of comparator 26 assumes a logical 1 state and
output lead 43 assumes a logical 0 state. Since input lead 42 to
NAND gate 28 is at a 0 level, output lead 46 of NAND gate 28
assumes a logical 1 state. Similarly, since input lead 44 to NAND
gate 29 is at a logical 1 state, output lead 47 of NAND gate 29
assumes a logical 0 state. Thus, output lead 48 of NAND gate 31 is at
a logical 1 state as is input lead 51 to NAND gate 38. Upon the
occurrence of a 1 level on clock input 49 to NAND gate 32, output
lead 54 assumes a logical 0 state and increments counter 33.
Assuming the input energy level of the code words remains below the
predetermined threshold, counter 33 will be successively
incremented but no change in the logic states of the circuit will
occur until leads 55, 56, and 57 of counter 33 all assume a logical
1 state. This state corresponds to a count of 1344. Upon the
occurrence of this condition, output lead 58 assumes a logical 0
state indicating the end of the speech utterance while output lead
59 remains at a logical 1 state. The occurrence of a 0 logic state
on output lead 58 sets flip-flop 34 back to its original state,
i.e., output lead 39 assumes a 0 state and output lead 41 assumes a
1 state. Output lead 58 accordingly returns to a logical 1 state
and the apparatus of FIG. 5 has returned to the conditions
initially assumed prior to the beginning of the speech utterance.
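The sequence of states traced above can also be modelled behaviorally. The following Python sketch is illustrative only and does not reproduce the gate network. The class name EndpointLogic and its method names are hypothetical. It abstracts the NAND gates into a single rule, namely that counter 33 runs only while the comparator output disagrees with the state of flip-flop 34 and is otherwise held cleared, and it decodes the counter taps from the stated powers of two. It further assumes that each decode is effective only in the corresponding flip-flop state, consistent with leads 58 and 59 being described as otherwise remaining at a logical 1 state.

```python
class EndpointLogic:
    """Behavioral model of flip-flop 34 and counter 33 (a sketch)."""

    BEGIN_TAPS = (1 << 8) | (1 << 6)            # leads 56 and 57
    END_TAPS = (1 << 10) | (1 << 8) | (1 << 6)  # leads 55, 56 and 57

    def __init__(self):
        self.in_speech = False  # True corresponds to lead 39 at logical 1
        self.count = 0          # contents of counter 33

    def clock(self, above):
        """Process one energy word.

        `above` is True when the energy exceeds the threshold.
        Returns "begin", "end", or None.
        """
        # The counter is held cleared while the comparator agrees with
        # the flip-flop (silence when idle, speech when in an utterance).
        if above == self.in_speech:
            self.count = 0
            return None

        self.count += 1
        taps = self.END_TAPS if self.in_speech else self.BEGIN_TAPS
        if self.count & taps == taps:
            # All decoded taps are at 1: toggle the flip-flop, clear the counter.
            self.in_speech = not self.in_speech
            self.count = 0
            return "begin" if self.in_speech else "end"
        return None
```

A model of this kind may be driven with one Boolean comparator result per energy word; an interruption of the run before the taps are reached clears the count, so only uninterrupted runs produce an indication.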
The waveforms appearing at output leads 59 and 58 of the apparatus
of FIG. 5 indicate the logic state transition, respectively, at the
beginning and end of a speech utterance. The output signals of the
apparatus of FIG. 5 may be used in a variety of ways. For example,
they may be used to gate a register which temporarily stores the
code words of the apparatus of FIG. 1 so that the code words of the
speech utterance, determined by the apparatus of FIG. 5, may be
conveyed to a permanent store. Or, if so desired, the signals
appearing on leads 58 and 59 may be utilized to activate an alarm
circuit to indicate to an operator that the beginning and end of a
speech utterance have occurred. Many other applications, of course,
will be apparent to those skilled in the art.
FIG. 5A is a block diagram depicting the overall operation of this
invention, as discussed above. Adaptive encoder 501 corresponds to
the encoder shown in FIG. 1, code word energy detector 502
corresponds to the apparatus depicted in FIG. 4, and threshold
detector 503 corresponds to the apparatus shown in FIG. 5.
The significant advantages of the instant invention, in determining
the beginning and end of a speech utterance, are illustrated by
FIGS. 6 through 11. FIG. 6 displays the sequence of code words
corresponding to the beginning of the word "three." The left-half
of line A shows very little code word variation and corresponds to
low level noise. The right-half of line A, and the next two lines,
B and C, correspond to the initial fricative "th" of the word
"three." These code words show markedly greater variation, as does the
last line, D, which corresponds to the beginning of voicing, i.e.,
"ree." The marker in the middle of line A denotes the beginning
point of the speech utterance, as determined by this invention.
FIG. 7 displays the energy of the code words of FIG. 6, as
determined by this invention. The marker on line A denotes the
point at which the energy of the code words exceeded the threshold
and remained above the threshold for approximately 50 milliseconds,
as discussed above. It is noted that the code word energy is
roughly the same for both the voiced and unvoiced segments of the
utterance while the energy is significantly lower when no speech is
present. FIG. 8 displays the actual speech waveform represented by
the code word sequence of FIG. 6. The beginning of the word "three"
is not nearly as evident as in the code word sequence; indeed, it
is hardly discernible. FIG. 9, which displays the energy of the
speech waveform of FIG. 8, emphasizes the fact that the beginning
of a speech utterance is not readily discernible from an
examination of the energy of the speech waveform itself. FIG. 10
displays the code word sequence at the end of the word "three." The
marker on line B indicates the end point of the utterance as
determined by the instant invention. FIG. 11 displays the speech
waveform corresponding to the code word sequence of FIG. 10. The
end point of the utterance is clearly not apparent from an
examination of the speech waveform itself.
The instant invention has been tested extensively in determining
the beginning and end of speech entries for a voice response system
vocabulary, and has proved to be very reliable. Two other aspects
of the coded speech signal, i.e., the energy of the difference
signal of the coder of FIG. 1 and the energy of the quantizer
output, were also studied as candidates for use in the
instant invention. However, the results based on the code word
samples themselves were found to be far more accurate.
* * * * *