Apparatus and method for determining the beginning and the end of a speech utterance

Rabiner, et al. September 30, 1975

Patent Grant 3909532

U.S. patent number 3,909,532 [Application Number 05/456,027] was granted by the patent office on 1975-09-30 for apparatus and method for determining the beginning and the end of a speech utterance. This patent grant is currently assigned to Bell Telephone Laboratories, Incorporated. Invention is credited to Lawrence Richard Rabiner, Lewis Hyman Rosenthal, Ronald William Schafer.


Abstract

It has been discovered that the energy of the code words at the output of an adaptive speech encoder may be utilized to accurately determine the beginning and end of an encoded speech utterance. The beginning of an utterance is detected when the code word energy exceeds a predetermined threshold for a fixed duration of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed duration of time.


Inventors: Rabiner; Lawrence Richard (Berkeley Heights, NJ), Rosenthal; Lewis Hyman (Cambridge, MA), Schafer; Ronald William (New Providence, NJ)
Assignee: Bell Telephone Laboratories, Incorporated (Murray Hill, NJ)
Family ID: 23811146
Appl. No.: 05/456,027
Filed: March 29, 1974

Current U.S. Class: 704/215; 375/247; 52/DIG.13; 704/E11.005
Current CPC Class: G10L 25/87 (20130101); Y10S 52/13 (20130101)
Current International Class: G10L 11/00 (20060101); G10L 11/02 (20060101); G10L 001/04 ()
Field of Search: ;179/1SA,1SC ;325/38B,62,326

References Cited [Referenced By]

U.S. Patent Documents
3750024 July 1973 Dunn et al.

Other References

Johnson, C. et al., "Adaptive Rate Delta Modulator," IBM Tech. Disclosure, April 1973.

Primary Examiner: Claffy; Kathleen H.
Assistant Examiner: Kemeny; E. S.
Attorney, Agent or Firm: Murphy; G. E.

Claims



What is claimed is:

1. Apparatus for determining a boundary of an applied speech utterance comprising:

means for adaptive encoding said applied speech utterance to develop coded output signals;

means for developing a signal representative of the energy of said coded output signals; and

means for comparing said representative signal with a predetermined threshold signal.

2. The apparatus defined in claim 1 wherein said signal representative of the energy of said coded output signals is representative of the adaptation activity of said means for adaptive encoding.

3. The apparatus defined in claim 1 wherein said threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

4. Apparatus for determining a boundary of an applied speech utterance comprising:

means for adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;

means for developing a signal representative of the energy of said coded output signals; and

means for comparing said representative signal with a predetermined digital threshold signal.

5. The apparatus defined in claim 4 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

6. The apparatus defined in claim 4 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

7. Apparatus for detecting the beginning of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means responsive to said representative signal for developing an output signal when said representative signal is greater than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the beginning of said speech utterance.

8. The apparatus defined in claim 7 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

9. The apparatus defined in claim 7 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

10. The apparatus defined in claim 7 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

11. Apparatus for determining the beginning of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and

means responsive to said representative signal for developing an indicator signal when said representative signal is greater than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the beginning of said speech signal.

12. Apparatus for detecting the end of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means responsive to said representative signal for developing an output signal when said representative signal is less than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the end of said speech utterance.

13. The apparatus defined in claim 12 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

14. The apparatus defined in claim 12 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

15. The apparatus defined in claim 12 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

16. Apparatus for determining the end of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and

means responsive to said representative signal for developing an indicator signal when said representative signal is less than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the end of said speech signal.

17. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals;

comparator means for comparing said digital representative signal with an applied digital threshold signal; and

means responsive to said comparator means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than, for a second predetermined interval of time, said threshold signal.

18. Apparatus for determining the boundaries of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

code word energy means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals;

comparator means for comparing said representative signal with an applied threshold signal; and

means responsive to said comparator means for developing a signal indicative of the beginning of said speech signal when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech signal when said representative signal is less than, for a second predetermined interval of time, said threshold signal.

19. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than an applied digital threshold signal for a first predetermined interval of time, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than said applied threshold signal for a second predetermined interval of time.

20. The apparatus defined in claim 19 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

21. The apparatus as defined in claim 19 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

22. The apparatus defined in claim 19 wherein said means for developing a signal representative of the energy of said coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

23. The apparatus defined in claim 19 wherein said means for developing said indicative signals comprises:

digital comparator means responsive to said signal representative of the energy of said coded output signals and to said applied digital threshold signal for developing a signal at a first output terminal when said representative energy signal is greater than said threshold signal and for developing a signal at a second output terminal when said representative energy signal is less than said threshold signal;

a bistable circuit having first and second output terminals, and set and reset terminals;

a first logic circuit responsive to said comparator first output terminal signal and to the signal at the first output terminal of said bistable circuit;

a second logic circuit responsive to said comparator second output terminal signal and to the signal at the second output terminal of said bistable circuit;

a third logic circuit responsive to the output signals of said first and second logic circuits;

a fourth logic circuit responsive to the output signal of said third logic circuit and to an applied clock signal;

a counter circuit, having a plurality of output terminals, responsive to the output signal of said fourth logic circuit;

a fifth logic circuit, for developing said signal indicative of the end of said speech utterance, responsive to the signal at the second output terminal of said bistable circuit and to the signal at a preselected one of said plurality of counter circuit output terminals;

a sixth logic circuit, for developing said signal indicative of the beginning of said speech utterance, responsive to the signal at the first output terminal of said bistable circuit and to the signals at the other of said plurality of counter circuit output terminals;

a seventh logic circuit responsive to the output signals of said third, fifth, and sixth logic circuits for developing a control signal for said counter circuit, said control signal returning said counter to a predetermined initial state; and

means for connecting the output terminals of said fifth and sixth logic circuits, respectively, to said set and reset terminals of said bistable circuit.

24. The method of determining a boundary of an applied speech utterance comprising the steps of:

adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;

developing a signal representative of the energy of said coded output signals; and

comparing said representative signal with a predetermined digital threshold signal.

25. The method defined in claim 24 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

26. The method defined in claim 24 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

27. The method of determining a boundary of an applied speech utterance comprising the steps of:

adaptive encoding said applied speech utterance to develop coded output signals;

developing a signal representative of the energy of said coded output signals; and

comparing said representative signal with a predetermined threshold signal.

Description



BACKGROUND OF THE INVENTION

This invention pertains to the processing of speech signals and, more particularly, to apparatus for detecting the beginning and end, i.e., the endpoints or boundaries, of a speech utterance.

It is well known that one of the goals of modern communications research is to facilitate communication between man and machine, preferably to an extent that the human voice may be utilized to control and direct the operations of a machine, e.g., a computer. Thus, the fields of speech recognition, speech verification, and automatic voice response are currently the subject of extensive research. Generally, for these applications, speech must be stored in digital form. Typically, a file of speech is created and stored in a suitable memory, e.g., a fixed head disk or drum. In order to efficiently store speech, it is necessary that individual words and phrases be stored in memory without intervening periods of silence between entries. Thus, the need to automatically locate the beginning and end of a speech utterance frequently arises in speech processing for man-machine communication.

DESCRIPTION OF THE PRIOR ART

Conventionally, the task of determining the endpoints of a speech utterance has been accomplished by manual editing, utilizing a combination of auditory and visual examinations of the speech waveform. However, manual editing is both time-consuming and subject to the inaccuracies concomitant with human judgment. Furthermore, repeatable results are not normally obtained. One reason for this is that the wide dynamic range of speech renders the combination of ear and eye a poor determinant of word boundaries. This is especially true when an unvoiced segment of speech, e.g., the fricative at the beginning of the word "three," appears at the beginning or end of a word. Consequently, manual editing usually results in shortening the speech, both at the beginning and at the end of the utterance. Thus, the words are "chopped," and when they are concatenated to form a message, the effects are quite discernible and also distracting.

It is thus an object of this invention to efficiently, accurately, and automatically detect the beginning and end of a speech utterance.

SUMMARY OF THE INVENTION

This and other objects of this invention are accomplished by utilizing an adaptive speech encoder, e.g., an adaptive differential pulse code modulator (ADPCM), an adaptive delta modulator, etc. It has been discovered by us that because of the step size adaptation used in developing adaptive encoded speech, an adaptive speech encoder effectively exhibits a form of automatic gain control useful in determining the endpoints of an utterance. Coded output words of such a coder, it has been found, exhibit high energy during both voiced and unvoiced speech, but not during background silence. However, the code word energy is not simply and directly related to the energy of the original speech signal. Thus, in accordance with this invention, the beginning of a speech utterance is detected when the code word energy exceeds a predetermined threshold for a fixed interval of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed interval of time.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a prior art ADPCM coder which may be used in the practice of this invention;

FIG. 2 displays the code word sequence for the utterance "oh";

FIG. 3 displays the decoded speech waveform corresponding to the code word sequence of FIG. 2;

FIG. 4 is a block diagram of apparatus used in the practice of this invention to determine code word energy;

FIG. 5 is a block diagram of apparatus used in the practice of this invention to determine the beginning and end of a speech utterance;

FIG. 5A is a block diagram depicting the system operation of this invention;

FIG. 6 displays the code word sequence for the beginning of the utterance "three";

FIG. 7 displays the code word energy corresponding to the code word sequence of FIG. 6;

FIG. 8 displays the decoded speech waveform corresponding to the code word sequence of FIG. 6;

FIG. 9 displays the energy of the speech waveform of FIG. 8;

FIG. 10 displays the code word sequence for the end of the utterance "three"; and

FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 depicts a prior art adaptive differential pulse code modulation circuit which is described in detail in the article by P. Cummiskey, N. S. Jayant, and J. L. Flanagan, entitled "Adaptive Quantization in Differential PCM Coding of Speech," Bell System Technical Journal, Vol. 52, pp. 1105-1118, September 1973. In the ADPCM coder of FIG. 1, differential input amplifier or network 11 develops an output signal proportional to the difference between an applied sampled speech signal and a signal which is an estimate of the incoming speech signal. This difference signal is quantized in adaptive quantizer 12 and applied to encoder 13 and to summing amplifier or network 14. Summing amplifier 14, in conjunction with first order prediction network 15, having a transfer function, for example, of $az^{-1}$, is utilized to develop an estimate of the incoming speech signal. If the estimate of the input speech signal is fairly accurate, then the difference signal emanating from network 11 will be small and thus more accurately represented by a fixed number of bits than the input speech samples themselves. The difference signal, although nowhere near as redundant as the original speech signal, still exhibits a wide amplitude range. In order to make efficient use of the available quantization levels of quantizer 12, the peak excursion of the signal should be matched to the range of the quantizer or vice versa. Thus, for low level signals such as fricatives, the absolute amplitude value of the quantizer step size should be small compared to that required for high level voiced sounds. Accordingly, the need for adaptive quantization is apparent; logic network 16 utilizes the coded speech signals (code words) emanating from encoder 13 to determine optimum quantization steps. That is, logic network 16 monitors the coded output of encoder 13 and provides for adaptation of the step size on the basis of the most recent encoded quantizer output.
For example, if the code word corresponds to one of the higher levels, the quantizer is overloaded and the step size is increased. On the other hand, if the code word corresponds to one of the lower levels, the step size is decreased. Step size adaptation effectively compensates for amplitude variations to the extent that the quantizer treats low level unvoiced speech signals, e.g., fricatives, much the same as high level voiced speech signals. The objective, of course, is that each of the quantizer levels be used a significant portion of the time regardless of the absolute amplitude level of the incoming speech samples. However, when the amplitude of the input speech signal is of the order of the minimum step size, the adaptation logic insures that the step size will seek its minimum value and the difference signal will then fall within the lowest quantization levels. We have discovered that when no speech is present at the input, the code word energy will vary only slightly. It is this feature of ADPCM speech encoding, and of adaptive encoding in general, that is turned to account in the practice of this invention. It is to be understood that the principles of this invention are applicable to all forms of adaptive encoders including ADPCM and adaptive delta modulation.
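By way of illustration only, the step size adaptation described above may be sketched in Python as follows. The multiplier table, the step size limits, the assumption of 4-bit code words (0 through 15), and the names `magnitude_level` and `adapt_step` are hypothetical; they are not taken from this specification or from the cited article.

```python
# Hypothetical step-size adaptation for a 4-bit adaptive quantizer.
# All numeric values below are illustrative assumptions.
MULTIPLIERS = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]  # indexed by level 0..7
STEP_MIN, STEP_MAX = 1.0, 128.0


def magnitude_level(code_word: int) -> int:
    """Map a code word 0..15 (0 = most negative, 15 = most positive)
    to a magnitude level 0..7, where 7 is an outermost quantizer level."""
    if not 0 <= code_word <= 15:
        raise ValueError("code word must lie in 0..15")
    return code_word - 8 if code_word >= 8 else 7 - code_word


def adapt_step(step: float, code_word: int) -> float:
    """Grow the step after a high-level code word, shrink it after a
    low-level one, and clamp the result to [STEP_MIN, STEP_MAX]."""
    new_step = step * MULTIPLIERS[magnitude_level(code_word)]
    return min(max(new_step, STEP_MIN), STEP_MAX)
```

With these assumed values, a run of low-level code words drives the step size down to `STEP_MIN` and holds it there, which corresponds to the behavior noted above for inputs of the order of the minimum step size.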

FIG. 2 is an exemplary display of code word activity for the voiced utterance "oh." Each line, A, B, C, D of FIG. 2, corresponds to approximately 256 samples (6 kHz sampling rate) of the applied speech utterance, i.e., approximately 40 milliseconds of the signal. Line B is to be considered a continuation of line A, line C a continuation of line B, etc. It is noted that for line A, and for most of line B, the code words show little activity, remaining for the most part within a limited range of quantization levels. This first part of the code word sequence corresponds to background silence. However, at almost the end of line B, and then for the remainder of lines C and D, the code word sequence fluctuates much more rapidly and with greater amplitude. FIG. 3 illustrates the decoded speech waveform corresponding to the code word sequence of FIG. 2. It is noted in FIG. 3 that voiced speech apparently commences somewhere near the end of line B and continues for lines C and D. This property of code words to indicate the presence of speech activity is more accurately reflected in what we define as adaptation activity or code word energy. The code word energy may be defined as the number of code word adaptations per unit time. In one embodiment of this invention, we used as a measurement of energy the sum of the squares of the code words for one hundred and one samples, or code words, corresponding to a 16 millisecond window centered about a selected sample. That is, the code word energy may be defined as

$$E(n) = \sum_{i=n-50}^{n+50} [c(i) - 7.5]^2 \qquad (1)$$

where c(i) corresponds to a code word emanating from encoder 13 of FIG. 1. Of course, other equivalent definitions of energy may be utilized.
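The windowed measurement described above may be sketched, purely for illustration, as a direct sum over one hundred and one code words centered on sample n. The function name `window_energy` and the optional `dc_level` parameter (an offset removed from each code word before squaring) are assumptions introduced for this sketch.

```python
def window_energy(code_words, n, half_width=50, dc_level=0.0):
    """Sum of squared (code word - dc_level) over samples
    n - half_width .. n + half_width, i.e. 2 * half_width + 1 samples."""
    lo, hi = n - half_width, n + half_width
    if lo < 0 or hi >= len(code_words):
        raise IndexError("window extends past the available code words")
    return sum((c - dc_level) ** 2 for c in code_words[lo:hi + 1])


# Code words hovering about mid-scale (little activity) give a small value;
# code words swinging between the extreme levels give a large one.
quiet = [7, 8] * 100
active = [0, 15] * 100
print(window_energy(quiet, 100, dc_level=7.5))
print(window_energy(active, 100, dc_level=7.5))
```

The two synthetic sequences are stand-ins for background silence and speech activity respectively; they are not measured data.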

In the prior art ADPCM implementation of FIG. 1, the largest negative quantization level is represented by the binary code word 0000 while the largest positive quantization level is represented by the binary code word 1111, corresponding to the decimal number 15. Thus, it is necessary, if one is using such a symmetrical coding system, to subtract from the code words a number corresponding to the d.c. level or average value of the code words to make the average level of the code words equal to zero. Of course, a different coding implementation may be utilized which inherently has a zero average value. Since the number 7.5, corresponding to the average value, may not be conveniently represented in digital form, the following definition of energy may be utilized:

$$E(n) = \sum_{i=n-50}^{n+50} a(i), \qquad (2)$$

where $a(i) = [2c(i) - 15]^2$. By using this definition, the d.c. level is removed from consideration and the energy content of the code words differs from the definition of Eq. (1) by only a multiplicative constant. It may readily be shown that the energy term defined by Eq. (2) is equivalent to

E(n) = E(n-1) + a(n+50) - a(n-51). (3)
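The step leading to Eq. (3) may be written out as follows, on the assumption that Eq. (2) denotes the sum of a(i) over the one hundred and one samples i = n-50 through n+50. The first line also shows the multiplicative constant, four, that relates a(i) to a code word offset by 7.5.

```latex
\begin{aligned}
a(i) &= [2c(i) - 15]^2 = 4\,[c(i) - 7.5]^2, \\
E(n) &= \sum_{i=n-50}^{n+50} a(i), \\
E(n-1) &= \sum_{i=n-51}^{n+49} a(i), \\
E(n) - E(n-1) &= a(n+50) - a(n-51).
\end{aligned}
```

The two sums share the terms a(n-50) through a(n+49), so only the newest term a(n+50) and the oldest term a(n-51) survive the subtraction.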

The code word energy, in accordance with this invention, is computed at each sample of the speech signal and compared with a threshold which is established at a level intermediate to the measured energy of silence and the average measured energy of the speech utterance. When the code word energy exceeds this threshold for approximately 320 consecutive samples, corresponding to about 50 milliseconds of speech, the word c(n) at which the energy first exceeded the threshold is defined as the beginning of an utterance. The code word energy-threshold comparison is continued, and when the code word energy falls below the threshold for approximately 1,024 consecutive samples, corresponding to about 160 milliseconds of speech, the point at which the energy first fell below the threshold is defined as the end of the utterance. The 160 millisecond criterion insures that a stop consonant within a word or phrase will not be mistaken for the end of the utterance.
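The decision rule just described may be sketched, under stated assumptions, as a single pass over a sequence of energy values. The function name `find_endpoints`, its parameter names, and the treatment of an energy value exactly equal to the threshold (it is taken to interrupt either kind of run) are assumptions of this sketch; the run lengths of 320 and 1,024 samples are those given above.

```python
def find_endpoints(energy, threshold, begin_run=320, end_run=1024):
    """Return (begin, end) sample indices; either may be None if not found.

    begin: first index of the first run of `begin_run` consecutive samples
           with energy above the threshold.
    end:   first index of the first later run of `end_run` consecutive
           samples with energy below the threshold.
    """
    begin = end = None
    run_start = None  # index at which the current qualifying run started
    for n, e in enumerate(energy):
        # Before the beginning is found, look for energy above threshold;
        # afterwards, look for energy below threshold.
        qualifies = e > threshold if begin is None else e < threshold
        if not qualifies:
            run_start = None
            continue
        if run_start is None:
            run_start = n
        run_length = n - run_start + 1
        if begin is None and run_length >= begin_run:
            begin = run_start
            run_start = None
        elif begin is not None and run_length >= end_run:
            end = run_start
            break
    return begin, end


# Synthetic example: a 500-sample dip inside the utterance (e.g. a stop
# consonant) is shorter than end_run and is not taken as the end.
energy = [0] * 10 + [10] * 500 + [0] * 500 + [10] * 500 + [0] * 1100
print(find_endpoints(energy, threshold=5))
```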

Apparatus for determining the energy of the code words in accordance with Eq. (3) is illustrated in FIG. 4. A code word, c(i), emanating from encoder 13 of FIG. 1, is applied to digital doubler 17, wherein it is doubled in value to develop a signal 2 c(i), which is twice the digital value of the applied code word. Digital doubler 17 may be of any well-known configuration, e.g., a register which shifts its contents left by one bit will double the value of an applied binary signal. Digital subtractor 18 subtracts from signal 2 c(i), a signal supplied by digital reference register 19. The signal stored in register 19 is proportional to the d.c. level or average of the code words. In a particular embodiment, the digital signal stored in register 19 is equal to fifteen as required by Eq. (2). Digital multiplier 21 multiplies the output signal of subtractor 18 by itself to achieve a squared signal which corresponds to the function a(i) of Eq. (2). Both subtractor 18 and multiplier 21 may be conventional digital arithmetic circuits. The signal output, a(i), of multiplier 21 is applied to shift register 22. Register 22, which preferably has a digital capacity of one hundred and two words, sequentially shifts digital signal a(i) through the register at the system clock rate. It is to be understood that in the circuitry of FIG. 4, and also in that of FIG. 5, all operations are performed in synchronism with the master sampling clock of the coder of FIG. 1, which has not been depicted in order not to obscure the operation of the instant invention. At any point of time, the last digital word stored in register 22, i.e., the oldest word in storage, corresponds to a(n-51) and the first word stored in register 22, i.e., the most recently stored word, corresponds to a(n+50). The first and last words of register 22 are combined in conventional digital subtractor 23 to form a difference signal, a(n+50) - a(n-51).
This difference signal is applied to conventional digital adder 24 which, in conjunction with delay network 25, develops a signal representative of the code word energy as defined in Eq. (3). Delay network 25 may be of conventional design and is utilized to delay the output of adder 24 by one clock period.
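The data path of FIG. 4 may be modeled in software along the following lines. This is a behavioral sketch only: the class name `CodeWordEnergy` is hypothetical, and the shift register and accumulator are assumed to start at zero, a start-up condition which the description does not specify.

```python
from collections import deque


class CodeWordEnergy:
    """Behavioral model of the FIG. 4 data path (reference numerals in comments)."""

    def __init__(self, length=102, reference=15):
        self.reference = reference                           # reference register 19
        self.register = deque([0] * length, maxlen=length)   # shift register 22
        self.energy = 0                                      # held by delay network 25

    def push(self, code_word):
        doubled = code_word << 1             # digital doubler 17 (shift left one bit)
        diff = doubled - self.reference      # digital subtractor 18
        squared = diff * diff                # digital multiplier 21, gives a(i)
        self.register.appendleft(squared)    # newest word in; oldest word drops out
        # subtractor 23: most recently stored word minus oldest stored word
        difference = self.register[0] - self.register[-1]
        # adder 24 combines the difference with the previous, delayed energy
        self.energy += difference
        return self.energy


# The running value equals the direct sum of the 101 most recent a(i) terms.
words = [(7 * i + 3) % 16 for i in range(400)]   # arbitrary synthetic code words
model = CodeWordEnergy()
outputs = [model.push(c) for c in words]
direct = sum((2 * c - 15) ** 2 for c in words[-101:])
print(outputs[-1] == direct)
```

Only one subtraction and one addition are needed per sample, which is the advantage of the recursive form of Eq. (3) over recomputing the full sum.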

The output signal E(n), of adder 24, is applied to digital comparator 26 of FIG. 5. Comparator 26 compares the energy of each code word E(n) with a signal stored in register 27 to determine whether or not the energy of the code word is above or below a predetermined threshold. The threshold is generally empirically determined and may be approximately equal to a point midway between the measured energy of background silence and the average measured energy of the speech signal, which is readily obtained by averaging the output of the apparatus of FIG. 4. As discussed above, when the code word energy exceeds this threshold for approximately 50 milliseconds or 300 consecutive samples, the point at which the energy function first exceeded the threshold is defined as the beginning of an utterance. The apparatus of FIG. 5 is utilized to determine when this has occurred. Also, when an utterance has been determined to have begun, the apparatus of FIG. 5 continues to make a comparison of the energy of subsequent code words with the threshold signal stored in register 27. When the code word energy falls below this threshold for approximately 160 milliseconds or 1,000 consecutive samples, the point at which the energy function first passed below the threshold is recorded as the end of the utterance.
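One way of arriving at such a threshold, offered only as a sketch, is to average the energy measured over a stretch known to be silence and over a stretch known to be speech, and to take the midpoint. The function name `midpoint_threshold` and the sample values are hypothetical.

```python
def midpoint_threshold(silence_energy, speech_energy):
    """Midpoint between the mean energy of silence and the mean energy of speech."""
    if not silence_energy or not speech_energy:
        raise ValueError("both energy sequences must be non-empty")
    mean_silence = sum(silence_energy) / len(silence_energy)
    mean_speech = sum(speech_energy) / len(speech_energy)
    return (mean_silence + mean_speech) / 2


# Illustrative numbers only, not measurements.
print(midpoint_threshold([100, 120], [2000, 2400]))
```

Since the threshold is described as empirically determined, the midpoint is merely a starting value that may be adjusted for a particular recording condition.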

To understand the operation of the circuit of FIG. 5, it is convenient to assume that speech is not present at the input to the ADPCM coder and, in fact, has been absent long enough that the last indication encountered was the end of a speech utterance. This is indicated by certain states or levels for particular circuit components. Thus, it may be assumed that output lead 39 of flip-flop 34 is at a logical 0 state and that output lead 41 of flip-flop 34 is at a logical 1 state. It may also be assumed that output lead 43 of digital comparator 26 is at a 0 state and that output lead 45 of digital comparator 26 is at a 1 state. Accordingly, input lead 42 to NAND gate 28 is at a logical 1 state and input lead 44 to NAND gate 29 is at a logical 0 state. In accordance with the well-known logical rules for NAND circuits, input lead 46 to NAND gate 31 is at a logical 1 state and input lead 47 to NAND gate 31 is at a logical 1 state. Thus, lead 48, connecting the output of NAND gate 31 and one of the inputs to NAND gate 32, is at a 0 state and lead 51, one of the inputs to NAND gate 38, is also at a 0 state. Clock input 49 to NAND gate 32 is presumed to enable NAND gate 32 upon the presence of a logical 1 on lead 49. Accordingly, output lead 54 of NAND gate 32 is at a logical 1 state; counter 33 is presumed to be incremented upon the presence of a 0 level input on line 54. Thus, output leads 55, 56 and 57 of counter 33, which correspond to the 10th, 8th and 6th powers, respectively, of the binary base "two," are at a logical 0 state. Output lead 58 of NAND gate 35 is thus at a logical 1 state as is output lead 59 of NAND gate 36. Input leads 53 and 52 to NAND gate 38 are also at a logical 1 state, thus establishing output lead 61 of NAND gate 38 at a logical 1 state and output lead 62 of inverter circuit 37 at a logical 0 state. Lead 62 is the clear input to counter 33, and a logical 0 state on this lead is presumed to clear the counter.
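The logic levels recited above for NAND gates 28, 29 and 31 may be checked with a short sketch. Lead numbers follow the text. The connection of lead 42 to flip-flop output lead 41, and of lead 44 to flip-flop output lead 39, is inferred from the stated logic levels and should be read as an assumption; the function names are hypothetical.

```python
def nand(*inputs):
    """NAND of any number of 0/1 inputs."""
    return 0 if all(inputs) else 1


def front_end_leads(lead43, lead45, lead39, lead41):
    """Return (lead46, lead47, lead48) for the given comparator outputs
    (leads 43, 45) and flip-flop outputs (leads 39, 41)."""
    lead42 = lead41                  # assumed wiring: lead 42 follows lead 41
    lead44 = lead39                  # assumed wiring: lead 44 follows lead 39
    lead46 = nand(lead43, lead42)    # NAND gate 28
    lead47 = nand(lead45, lead44)    # NAND gate 29
    lead48 = nand(lead46, lead47)    # NAND gate 31
    return lead46, lead47, lead48


# Quiescent state: energy below threshold, last indication was an end.
print(front_end_leads(lead43=0, lead45=1, lead39=0, lead41=1))
```

For the quiescent state the sketch yields leads 46 and 47 at 1 and lead 48 at 0, in agreement with the levels stated in the text; other combinations of comparator and flip-flop states may be evaluated in the same way.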

If it is now presumed that the energy signal applied to digital comparator 26 exceeds the output of digital threshold register 27, output lead 43 of comparator 26 assumes a logical 1 state and output lead 45 of comparator 26 assumes a logical 0 state. Output lead 46 of NAND gate 28 is then at a logical 0 state and output lead 47 of NAND gate 29 is at a logical 1 state. Output lead 48 of NAND gate 31 assumes a logical 1 state, as does lead 51, which is one of the inputs to NAND gate 38. Since input leads 52 and 53 are already at a logical 1 state, output lead 61 of NAND gate 38 assumes a logical 0 state and therefore output lead 62 of inverter 37 assumes a logical 1 state, thereby allowing counter 33 to be incremented. Upon the presence of a logical 1 signal at clock input 49 to NAND gate 32, output lead 54 of NAND gate 32 assumes a logical 0 state and counter 33 is incremented. Assuming that the input energy signal to comparator 26 remains above the predetermined threshold, counter 33 will be incremented with each energy word. When counter 33 reaches a count of 320, which corresponds to a 1 output on leads 56 and 57, output lead 59 of NAND gate 36 assumes a logical 0 state, indicating the beginning of a speech utterance. The presence of a 0 level signal on output lead 59 resets flip-flop 34 so that a logical 1 signal appears on output lead 39 and a logical 0 signal appears on output lead 41. Output lead 58 of NAND gate 35 remains at a logical 1 state.

The resetting of flip-flop 34 causes output lead 59 to return to a logical 1 state and in turn causes input lead 44 to NAND gate 29 to assume a logical 1 state and input lead 42 to NAND gate 28 to assume a logical 0 state. Assuming that the energy signal remains above the threshold, output lead 43 is still at a logical 1 state, but since input lead 42 to NAND gate 28 is now at a logical 0 state, output lead 46 of NAND gate 28 assumes a logical 1 state.
Output lead 45 of comparator 26 is still at a logical 0 state, but input lead 44 to NAND gate 29 is now at a logical 1 state. Thus, output lead 47 of NAND gate 29 is at a logical 1 state. Accordingly, output lead 48 of NAND gate 31 assumes a logical 0 state, as does input lead 51 to NAND gate 38. Input lead 54 to counter 33 assumes a logical 1 state and counter 33 is not incremented. Since input lead 51 is at a logical 0 state and input leads 52 and 53 of NAND gate 38 are at a logical 1 state, output lead 61 of NAND gate 38 is at a logical 1 state and the clear input to counter 33, lead 62, is at a logical 0 state. Thus, the counter is cleared and output leads 58 and 59 remain at a logical 1 state.

When the energy of the code words applied to digital comparator 26 decreases to a level below the threshold level established by register 27, output lead 45 of comparator 26 assumes a logical 1 state and output lead 43 assumes a logical 0 state. Since input lead 42 to NAND gate 28 is at a 0 level, output lead 46 of NAND gate 28 remains at a logical 1 state. Similarly, since input lead 44 to NAND gate 29 is at a logical 1 state, output lead 47 of NAND gate 29 assumes a logical 0 state. Thus, output lead 48 of NAND gate 31 is at a logical 1 state, as is input lead 51 to NAND gate 38. Upon the occurrence of a 1 level on clock input 49 to NAND gate 32, output lead 54 assumes a logical 0 state and increments counter 33. Assuming the input energy level of the code words remains below the predetermined threshold, counter 33 will be successively incremented but no change in the logic states of the circuit will occur until leads 55, 56, and 57 of counter 33 all assume a logical 1 state. This state corresponds to a count of 1024. Upon the occurrence of this condition, output lead 58 assumes a logical 0 state, indicating the end of the speech utterance, while output lead 59 remains at a logical 1 state.
The occurrence of a logical 0 state on output lead 58 sets flip-flop 34 back to its original state, i.e., output lead 39 assumes a logical 0 state and output lead 41 assumes a logical 1 state. Output lead 58 accordingly returns to a logical 1 state and the apparatus of FIG. 5 has returned to the conditions initially assumed prior to the beginning of the speech utterance. The waveforms appearing at output leads 59 and 58 of the apparatus of FIG. 5 indicate the logic state transitions at the beginning and end, respectively, of a speech utterance.

The output signals of the apparatus of FIG. 5 may be used in a variety of ways. For example, they may be used to gate a register which temporarily stores the code words of the apparatus of FIG. 1 so that the code words of the speech utterance, determined by the apparatus of FIG. 5, may be conveyed to a permanent store. Or, if so desired, the signals appearing on leads 58 and 59 may be utilized to activate an alarm circuit to indicate to an operator that the beginning and end of a speech utterance have occurred. Many other applications, of course, will be apparent to those skilled in the art.
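The counts of 320 and 1024 used by counter 33 are power-of-two approximations to the roughly 300 and 1,000 samples mentioned earlier, and the tap arithmetic can be verified with a short sketch. The names `TAP_POWER`, `tap_levels`, `first_count_with` and `duration_ms` are hypothetical, and the 6,000 samples per second rate is an assumption inferred only from the statement that 300 samples span about 50 milliseconds.

```python
from typing import Dict

# Binary weights of the counter 33 output leads, as stated in the text:
# lead 55 -> 2**10, lead 56 -> 2**8, lead 57 -> 2**6.
TAP_POWER: Dict[int, int] = {55: 10, 56: 8, 57: 6}

# Assumption: 300 samples in about 50 ms implies about 6000 samples per second.
ASSUMED_RATE_HZ = 6000


def tap_levels(count: int) -> Dict[int, int]:
    """Logic level on each tapped lead for a given counter value."""
    return {lead: (count >> power) & 1 for lead, power in TAP_POWER.items()}


def first_count_with(*leads: int) -> int:
    """Smallest count at which every listed lead is at logical 1."""
    return sum(1 << TAP_POWER[lead] for lead in leads)


def duration_ms(count: int, rate_hz: int = ASSUMED_RATE_HZ) -> float:
    """Time taken to reach a count at one increment per sample."""
    return 1000.0 * count / rate_hz
```

Under these assumptions, `first_count_with(56, 57)` is 256 + 64 = 320, or about 53 milliseconds, and `first_count_with(55)` is 1024, or about 171 milliseconds. Note that at a count of exactly 1024 only the lead 55 tap is high under the stated weights; how NAND gate 35 combines the taps is not modeled here.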

FIG. 5A is a block diagram depicting the overall operation of this invention, as discussed above. Adaptive encoder 501 corresponds to the encoder shown in FIG. 1, code word energy detector 502 corresponds to the apparatus depicted in FIG. 4, and threshold detector 503 corresponds to the apparatus shown in FIG. 5.

The significant advantages of the instant invention, in determining the beginning and end of a speech utterance, are illustrated by FIGS. 6 through 11. FIG. 6 displays the sequence of code words corresponding to the beginning of the word "three." The left half of line A shows very little code word variation and corresponds to low-level noise. The right half of line A and the next two lines, B and C, correspond to the initial fricative "th" of the word "three." The code words there show markedly greater variation, as do those of the last line, D, which corresponds to the beginning of voicing, i.e., "ree." The marker in the middle of line A denotes the beginning point of the speech utterance, as determined by this invention. FIG. 7 displays the energy of the code words of FIG. 6, as determined by this invention. The marker on line A denotes the point at which the energy of the code words exceeded the threshold and remained above the threshold for approximately 50 milliseconds, as discussed above. It is noted that the code word energy is roughly the same for both the voiced and unvoiced segments of the utterance, while the energy is significantly lower when no speech is present. FIG. 8 displays the actual speech waveform represented by the code word sequence of FIG. 6. The beginning of the word "three" is not nearly as evident as in the code word sequence; indeed, it is hardly discernible. FIG. 9, which displays the energy of the speech waveform of FIG. 8, emphasizes that the beginning of a speech utterance is not readily discernible from an examination of the energy of the speech waveform itself. FIG. 10 displays the code word sequence at the end of the word "three." The marker on line B indicates the end point of the utterance as determined by the instant invention. FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10. The end point of the utterance is clearly not apparent from an examination of the speech waveform itself.

The instant invention has been tested extensively in determining the beginning and end of speech entries for a voice response system vocabulary, and has proved to be very reliable. Two other aspects of the coded speech signal, i.e., the energy of the difference signal of the coder of FIG. 1 and the energy of the quantizer output, were also studied as possible candidates for use in the instant invention. However, the results based on the code word samples themselves were found to be far more accurate.

* * * * *

