U.S. patent application number 15/434164 for phonation style detection was filed with the patent office on 2017-02-16 and published on 2018-05-17.
The applicant listed for this patent is GOVERNMENT OF THE UNITED STATES AS REPRESENTED BY THE SECRETARY OF THE AIR FORCE. Invention is credited to DARREN M. HADDAD, STANLEY J. WENNDT.
Application Number | 20180137880 15/434164 |
Document ID | / |
Family ID | 62107298 |
Filed Date | 2017-02-16 |
United States Patent
Application |
20180137880 |
Kind Code |
A1 |
WENNDT; STANLEY J. ; et
al. |
May 17, 2018 |
Phonation Style Detection
Abstract
The invention provides a method for detecting phonation style in
dynamic communication environments and making software control
decisions based on phonation styles enabling an audio message to be
classified based on the phonation style such as, but not limited
to: normal phonation, whispered phonation, softly spoken speech
phonation, high-level phonation, babble phonation, and non-voice
sounds. The purpose of the invention is to introduce the phonation
style as a way to control computer software.
Inventors: |
WENNDT; STANLEY J.; (ROME,
NY) ; HADDAD; DARREN M.; (FRANKFORT, NY) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
GOVERNMENT OF THE UNITED STATES AS REPRESENTED BY THE SECRETARY OF
THE AIR FORCE |
ROME |
NY |
US |
|
|
Family ID: |
62107298 |
Appl. No.: |
15/434164 |
Filed: |
February 16, 2017 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
62422611 | Nov 16, 2016 | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G10L 15/02 20130101;
H04L 63/162 20130101; G10L 25/84 20130101; H04L 2209/08 20130101;
H04L 45/02 20130101; G10L 25/21 20130101; H04L 2209/12 20130101;
G06F 21/32 20130101; G10L 25/90 20130101; H04L 45/22 20130101; H04L
45/54 20130101; H04L 9/002 20130101; G06F 21/72 20130101; G10L
25/24 20130101; G06F 21/79 20130101; G06F 21/60 20130101; G10L
15/16 20130101 |
International
Class: |
G10L 25/84 20060101
G10L025/84; G10L 15/02 20060101 G10L015/02 |
Government Interests
STATEMENT OF GOVERNMENT INTEREST
[0001] The invention described herein may be manufactured and used
by or for the Government for governmental purposes without the
payment of any royalty thereon.
Claims
1. A method for phonation style detection, comprising: detecting
speech activity in a signal; extracting signal features from said
detected speech activity; characterizing said extracted signal
features; and performing a decision process on said characterized
signal features which determines whether said detected speech
activity is one of normally spoken speech, loudly spoken speech,
softly spoken speech, whisper speech, babble, and non-voice
sound.
2. In the method of claim 1, characterizing further comprises
characterizing said extracted signal features in terms of harmonic
measure, signal energy, mixed-excitation, clipping, and
voicing.
3. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as non-voice sounds in
the absence of harmonics and low energy signal features.
4. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as softly spoken speech
in the absence of harmonics and low energy signal features but in
the presence of voicing signal features.
5. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as babble in the
presence of harmonics and mixed excitation signal features.
6. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as loudly spoken speech
in the presence of harmonics and clipping but in the absence of
mixed excitation signal features.
7. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as normally spoken
speech in the presence of harmonics but in the absence of clipping and
mixed excitation signal features.
8. In the method of claim 2, performing a decision process further
comprises classifying said speech activity as whisper speech in the
absence of harmonics and voicing signal features but in the
presence of low energy signal features.
9. In the method of claim 1, speech activity detection is performed
on substantially 10 to 30 millisecond blocks of said signal.
10. In the method of claim 1, speech activity detection further
comprises measurement of any one of the following: energy, pitch
extraction, autocorrelation, spectral tilt, and cepstral
coefficients.
11. In the method of claim 10, said speech activity detection
further comprises coupling said measurements with classifiers and
learning algorithms.
12. In the method of claim 11, said classifiers and learning
algorithms are selected from the group comprising state vector
machines, neural networks, and Gaussian mixture models.
13. In the method of claim 10, said measurement of energy further
comprises computing discrete signal energy in the time domain
according to: $E(n) = \sum_{m=0}^{N-1} [w(m)\,s(n-m)]^2$ where
w(m) is a weighting function; s is signal amplitude; n is a current
discrete time sample; m is a current discrete time sample of a
window of time; and E(n) is computed signal energy over said time
window.
14. In the method of claim 13, said weighting function w(m) is
selected from the group consisting of rectangular, hamming and
triangular window functions.
15. In the method of claim 10, said measurement of energy further
comprises computing signal energy in the frequency domain according
to: $E = \sum_{f=f_1}^{f_2} |X[f]|^2$ where X[f] is the
Fourier transform of said signal; $f_1$ and $f_2$ are the
Fourier transform frequency limits; and E is the computed energy of
said signal.
16. In the method of claim 15, $f_1$ is 0 and $f_2$ is one-half
the Nyquist rate.
Description
BACKGROUND OF THE INVENTION
[0002] Phonation is the rapid, periodic opening and closing of the
glottis through separation and apposition of the vocal folds that,
accompanied by breath under lung pressure, constitutes a source of
vocal sound. Technology exists that detects sound created through
phonation and attempts to understand and decode the sounds.
Examples include Siri, automated phone menus, and Shazam.
Limitations on the current technologies mean that they only work
well when the speaker is using a normal speaking or phonation style
not including loud, babble, whisper, or pitch and they assume that
the speaker wants to be heard and understood.
[0003] Phonation style refers to different speaking styles which
may include normal phonation, whispered speech phonation, low-level
speech phonation, high-level speech phonation, and babble
phonation. Babble phonation occurs when there is more than one
speaker talking at the same time. The current state of the art has
considered noise degraded speech applications where the goal is to
extract the target speech or suppress the interfering speech, but
lacks a decision making process of how to address different
phonation styles. For most speech recognition processes, the speech
recognition algorithm tries to recognize whatever data that it has
been given. Using a probabilistic approach, the speech recognition
algorithm tries to decipher what was the most likely sequence of
spoken words. For constrained environments where the vocabulary is
limited, the speech recognition algorithm can be successful even in
noisy environments due to a priori knowledge about the potential
sequence. For example, if the spoken words are, `Find the nearest
liquor store`, the speech recognizer only needs to recognize
`liquor` and then it can guess what the rest of the words are or,
at least, get close to the lexical content. The detection of babble
speech (where more than one speaker is speaking), allows for the
speech to not be processed by a speech recognizer since the output
would be unreliable.
[0004] Current methods also experience issues in situations of
degraded speech which can occur in almost any communication
setting. Typically, speech degradation is assumed to be due to
environmental noise or communication channel artifacts. However,
speech degradation can also occur due to changes in phonation
style. A speech processing algorithm that is trained using normally
phonated speech but is given whispered or high-volume phonation
style speech will quickly degrade and create nonsensical outputs.
Instead of assuming a phonation style, a pre-processing technique
is used to classify the speech as being normally-phonated,
whispered, low-level volume, high-level volume or babble speech.
Features are extracted and analyzed so as to make a decision. Many
of the same features such as spectral tilt, energy, and envelope
shape are standard for each phonation class. However, each
phonation class may have specific features just for that phonation
class. Based on the outcome of the pre-processing, a decision tree
is used to take the next appropriate step. When speech recognition
is the desired next step and the pre-processor indicates that the
input is whispered speech, then the next step would be to send the data
to a speech recognizer that is trained on whispered speech.
[0005] In an unconstrained, dynamically changing environment,
speech recognizers have not succeeded in being able to accurately
recognize the spoken dialogue. This process is complicated by
multiple speakers, noisy environment, and unstructured lexical
information. The current state of the art lacks an approach for
uniquely classifying the various phonation styles; it also does not
address how to make appropriate follow-on decisions. The current
technology also assumes that there is only one speaker and that the
speaker desires to be understood. When there are multiple speakers
or the speaker wishes to obfuscate his/her communication, the
output of speech recognizers quickly degrades into nonsense
lexical information.
OBJECTS AND SUMMARY OF THE INVENTION
[0006] It is therefore an object of the invention to classify
speech into several possible phonation styles.
[0007] It is a further object of the invention to extract from
speech signals characteristics which lead to the classification of
several phonation styles.
[0008] It is yet a further object of the present invention to make
a determination as to which of several phonation styles is present
based on the detection of the presence or absence of extracted
speech characteristics.
[0009] Briefly stated, the invention provides a method for
detecting phonation style in dynamic communication environments and
making software control decisions based on phonation styles
enabling an audio message to be classified based on the phonation
style such as, but not limited to: normal phonation, whispered
phonation, softly spoken speech phonation, high-level phonation,
babble phonation, and non-voice sounds. The purpose of the
invention is to introduce the phonation style as a way to control
computer software.
[0010] In an embodiment of the invention, a method for phonation
style detection, comprises detecting speech activity in a signal;
extracting signal features from the detected speech activity;
characterizing the extracted signal features; and performing a
decision process on the characterized signal features which
determines whether the detected speech activity is normally spoken
speech, loudly spoken speech, softly spoken speech, whisper speech,
babble, or non-voice sound.
[0011] The above, and other objects, features and advantages of the
invention will become apparent from the following description read
in conjunction with the accompanying drawings, in which like
reference numerals designate the same elements.
REFERENCES
[0012] [1] Stanley J. Wenndt, Edward J. Cupples; "Method and
apparatus for detecting illicit activity by classifying whispered
speech and normally phonated speech according to the relative
energy content of formants and fricatives"; U.S. Pat. No.
7,577,564; Issued: Aug. 18, 2009.
[0013] [2] Darren M. Haddad, Andrew J. Noga; "Generalized
Harmonicity Indicator"; U.S. Pat. No. 7,613,579; Issued: Nov. 3,
2009.
[0014] [3] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L.
(1993), "Discrete-Time Processing of Speech Signals," New York,
N.Y.: Macmillan Publishing Company, pp. 234-240.
[0015] [4] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L.
(1993), "Discrete-Time Processing of Speech Signals," New York,
N.Y.: Macmillan Publishing Company, pp. 110-115.
[0016] [5] Peter J. Watson, Angela H. Ciccia, Gary Weismer; "The
relation of lung volume initiation to selected acoustic properties
of speech"; Journal Acoustical Society of America 113 (5), May
2003; pages 2812-2819.
[0017] [6] Zhi Tao; Xue-Dan Tan; Tao Han; Ji-Hua Gu; Yi-Shen Xu;
He-Ming Zhao (2010), "Reconstruction of Normal Speech from
Whispered Speech Based on RBF Neural Network," 2010 Third
International Symposium on Intelligent Information Technology and
Security Informatics.
[0018] [7] Krishnamachari, K. R.; Yantorno, R. E.; Lovekin, J. M.;
Benincasa, D. S.; Wenndt, S. J., "Use of local kurtosis measure for
spotting usable speech segments in co-channel speech," 2001 IEEE
International Conference on Acoustics, Speech, and Signal
Processing. Proceedings.
[0019] [8] Kizhanatham, A. R.; Yantorno, R. E.; Smolenski, B. Y.,
"Peak difference of autocorrelation of wavelet transform (PDAWT)
algorithm based usable speech measure," 2003 7th World
Multiconference on Systemics, Cybernetics and Informatics
Proceedings.
[0020] [9] Yantorno, R. E.; Smolenski, B. Y.; Chandra, N.; "Usable
speech measures and their fusion," Proceedings of the 2003
International Symposium on Circuits and Systems.
[0021] [10] Krishnamurthy, N.; Hansen, J. H. L.; "Speech babble:
Analysis and modeling for speech systems"; IEEE International
Conference on Acoustics, Speech and Signal Processing, 2008, ICASSP
2008; pages 4505-4508.
[0022] [11] Hayakawa, Makoto; Fukumori, Takahiro; Nakayama, Masato;
Nishiura, Takanobu, "Suppression of clipping noise in observed
speech based on spectral compensation with Gaussian mixture models
and reference of clean speech," Proceedings of Meetings on
Acoustics-ICA 2013.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] FIG. 1 depicts the present invention's process for feature
extraction and phonation style classification.
[0024] FIG. 2 depicts the present invention's decision process for
detection and phonation classification based on extracted
features.
[0025] FIG. 3 depicts a measured time domain and spectrogram for
normally-phonated speech.
[0026] FIG. 4 depicts a measured time domain and spectrogram for
low-level speech.
[0027] FIG. 5 depicts a measured time domain and spectrogram for
high-level speech.
[0028] FIG. 6 depicts a measured time domain and spectrogram for
whispered speech.
[0029] FIG. 7 depicts a measured time domain and spectrogram for
babble speech.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0030] The invention described herein allows an audio stream to be
analyzed and classified based on the phonation style and then,
depending on the application, a software control process is able to
make appropriate control decisions. The goal of spoken language is
to communicate. Communication occurs when the intended recipient of
the spoken message receives the intended message from the speaker.
While communication can occur based on body language, facial
expressions, written words, and the spoken message, this invention
addresses phonation style detection by using only the spoken
message.
[0031] For typical speech applications, normal phonation is assumed
and the audio applications attempt to process all the incoming
audio. Some audio applications try to detect aberrations to this
assumption, for example, if the energy level drops significantly.
In this case, the algorithm may label it as whispered speech. This
invention is unique in that no assumptions are made as to the type
of phonation style. Instead, a multi-feature set is used to
classify the phonation style.
[0032] People communicate in different phonation styles depending on
the communication setting and lexical intent. If a person wishes to
conceal the spoken message, then whispering may be employed.
Whisper speech occurs when there is no vocal fold vibration and the
speech is characterized by a noise-like or hissing quality. For
low-level speech, vocal vibrations occur, but the sound level is at
a reduced level compared to normally phonated speech. If a person
wishes to emphasize a point or needs to speak so as to be heard
above a noisy environment, then a high-level phonation style may be
employed. Additionally, babble-phonation may exist where more than
one speaker is talking simultaneously. Other types of phonation are
possible depending on the environment. For example, loud whispering
may be used for dramatic purposes or emphasis of an opinion. This
invention has applicability to any speech processing
application.
[0033] FIG. 1 displays the invention's feature extraction and
classifier approach to phonation style detection, while FIG. 2
provides a decision tree process by which the extracted features
are analyzed. As in FIG.
1, the invention allows an audio stream to be classified based on
the phonation style and then, as in FIG. 2, depending on the
application, a software control process is able to make appropriate
control decisions. The goal of spoken language is to communicate.
Communication occurs when the intended recipient of the spoken
message receives the intended message from the speaker. While
communication can occur based on body language, facial expressions,
written words, and the spoken message, this invention addresses
phonation style detection by using only the spoken message.
[0034] For typical speech applications, normal phonation is assumed
and the audio applications attempt to process all the incoming
audio. Some audio applications try to detect aberrations to this
assumption, for example, if the energy level drops significantly.
In this case, the algorithm may label it as whispered speech.
Earlier approaches, such as [1], only assumed two phonation states:
normally-phonated speech or whispered speech. This invention is
unique in that no assumptions are made as to the type of phonation
style. Instead, a multi-feature set is used to classify the
phonation style (see FIG. 2).
[0035] Referring to FIG. 1, the first step in classifying the
signal is to decide if there is speech activity or not 110.
Referring momentarily ahead, FIG. 2 shows
how the invention delineates the phonation style once speech
activity has been found.
[0036] Referring back to FIG. 1, the signal is processed in small
blocks of data that are typically 10-30 milliseconds in duration.
Features such as energy, pitch extraction, autocorrelation,
spectral tilt, and/or cepstral coefficients coupled with
classifiers and learning algorithms such as support vector machines,
neural networks, and Gaussian mixture models are typical approaches
to detect speech activity. If speech activity is detected 110, and
signal features are extracted 120, 130, 140, 150, 160 and
characterized through a fusion and decision process 170 under
software control 180, then a multi-step, multi-feature approach is
used to determine the phonation style as depicted in FIG. 2.
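The block-based processing described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name split_into_blocks, the 20 millisecond default, the use of non-overlapping blocks, and the dropping of a trailing partial block are all assumptions. The text specifies only that blocks are typically 10-30 milliseconds in duration.

```python
def split_into_blocks(samples, sample_rate_hz, block_ms=20.0):
    """Split a sequence of samples into consecutive, non-overlapping blocks.

    block_ms is assumed to lie in the 10-30 ms range named in the text.
    A trailing partial block is dropped (an assumption of this sketch).
    """
    block_len = int(round(sample_rate_hz * block_ms / 1000.0))
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    n_blocks = len(samples) // block_len
    return [samples[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]


# One second of audio at 8 kHz yields fifty 20 ms blocks of 160 samples each.
blocks = split_into_blocks([0.0] * 8000, sample_rate_hz=8000)
```

Each block would then be passed to the speech activity detector and, if speech is found, to the feature extractors.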
[0037] Referring again to FIG. 2, the first decision point is to
measure if there is harmonic information 210 in the signal. A
phoneme or a unit of a voice sound has a fundamental frequency
associated with it, which is determined by how many times the vocal
folds open and close in one second (given in units of Hz). The
harmonics are produced when the fundamental frequency resonates
within a cavity; in the case of voice, that cavity is the vocal
tract. Harmonics are integer multiples of the fundamental
frequency. The fundamental frequency is the first harmonic, the
next octave up is the second harmonic, and so on. There are many
methods to determine the harmonics of a voice sample; one method is
found in [2].
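As one simple stand-in for a harmonic measure, the peak of the normalized autocorrelation over plausible pitch lags can indicate whether a block is periodic. This sketch is not the Generalized Harmonicity Indicator of [2]; the function name periodicity_strength, the 50-400 Hz search range, and any threshold applied to the result are assumptions made for illustration.

```python
import math


def periodicity_strength(block, sample_rate_hz, f_min_hz=50.0, f_max_hz=400.0):
    """Return the largest normalized autocorrelation over candidate pitch lags.

    Values near 1 suggest a strongly periodic (harmonic) block; values near 0
    suggest little periodic structure. The search range is an assumption.
    """
    min_lag = max(1, int(sample_rate_hz / f_max_hz))
    max_lag = min(len(block) - 1, int(sample_rate_hz / f_min_hz))
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        head = block[:len(block) - lag]   # samples s(i)
        tail = block[lag:]                # samples s(i + lag)
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if den > 0.0:
            best = max(best, num / den)
    return best


# A 30 ms block of a 100 Hz sine sampled at 8 kHz is almost perfectly periodic.
sine_block = [math.sin(2.0 * math.pi * 100.0 * i / 8000.0) for i in range(240)]
strength = periodicity_strength(sine_block, 8000)
```

A decision block such as 210 could compare this value against a tuned threshold; the threshold itself is not given in the text.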
[0038] If insufficient harmonics are found in the decision block
210, then an energy measure 220 is used to decide if the signal is
either non-voice sounds 230, softly spoken speech 250, or whisper
260. There are many ways to compute the energy of a signal. The
easiest way to compute energy is in the time domain by summing up
the squared values of the signal:
$E = \sum_{n=-\infty}^{+\infty} s^2(n)$ [3]. Since
speech is time-varying and may change very rapidly, a windowing
function is applied:
$E(n) = \sum_{m=0}^{N-1} [w(m)\,s(n-m)]^2$
[0039] where w(m) is a weighting function such as a rectangular,
Hamming, or triangular window and N is the window length in
samples. The length of the window is typically about 10-30
milliseconds due to the time-varying nature of
speech signals. Additionally, the window length should encompass at
least one pitch period, but not too many pitch periods. There are
many variations of computing the energy such as using the absolute
value instead of the squared value or the log of the energy or
using a frequency range. In the frequency domain,
$E = \sum_{f=f_1}^{f_2} |X[f]|^2$ [4] where X[f] is the
Fourier transform of the signal. The bandwidth can be the full
bandwidth between $f_1 = 0$ and $f_2 = f_s/2$, where $f_s$ is
the Nyquist rate (the sampling rate). In this case, the frequency
domain energy calculation
would be the same as the time domain energy calculation. Or, the
bandwidth may be a reduced frequency range to avoid unreliable or
noisy regions. The energy can also be computed using
autocorrelation values. The important concept isn't how the energy
is calculated, but to understand how to use the energy levels.
Higher energy coupled with harmonics would be indicative of normal
spoken speech. Lower energy would be indicative of softly spoken or
whispered speech.
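The time-domain and frequency-domain energy calculations above can be sketched as follows. The function names hamming, short_time_energy, and band_energy are hypothetical, samples before the start of the signal are treated as zero, and a direct (slow) discrete Fourier transform is used so that the sketch needs no external library. None of these choices come from the text.

```python
import cmath
import math


def hamming(length):
    """Hamming window coefficients w(m) for m = 0..length-1."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (length - 1))
            for m in range(length)]


def short_time_energy(s, n, window):
    """E(n) = sum over m = 0..N-1 of [w(m) * s(n - m)]^2, with N = len(window).

    Samples before the start of the signal count as zero (an assumption).
    """
    total = 0.0
    for m, w in enumerate(window):
        sample = s[n - m] if n - m >= 0 else 0.0
        total += (w * sample) ** 2
    return total


def band_energy(s, k1, k2):
    """Sum |X[k]|^2 over DFT bins k1..k2 inclusive, using a direct O(N^2) DFT."""
    size = len(s)
    total = 0.0
    for k in range(k1, k2 + 1):
        x_k = sum(s[i] * cmath.exp(-2j * math.pi * k * i / size)
                  for i in range(size))
        total += abs(x_k) ** 2
    return total


signal = [1.0, 2.0, 3.0, 4.0]
time_energy = short_time_energy(signal, 3, [1.0] * 4)   # rectangular window
freq_energy = band_energy(signal, 0, 3) / len(signal)   # all bins, scaled by 1/N
```

With the unnormalized transform used here, summing over all N bins gives N times the time-domain energy (Parseval's relation), so the two calculations agree after dividing by N. Restricting k1 and k2 to a narrower range corresponds to the reduced-bandwidth variant.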
[0040] If the signal fails the harmonics test 210, but has higher
energy 220, the phonation decision would be labeled Non-Voice
sounds 230. If the signal is non-voice sounds, the next step may be
to avoid these regions of the signal or to employ an additional
step to classify the non-voice signal.
[0041] If the signal fails the harmonics test 210, but has lower
energy 220, the next step would be to test for voicing 240. Voicing
occurs as the vocal folds open and close due to air pressure
building up behind the vocal folds and then being released [5].
The rate of this vibration of the vocal folds opening and closing is
referred to as the fundamental frequency of the speech signal, and
its reciprocal as the pitch period. The bursts of air that are
released as the vocal folds open and close then become an excitation
source for the vocal cavity. Vowels are examples of sustained voicing
where the excitation is quasi-periodic. For males, the vocal folds
open and close at about 110 cycles per second. For females, it is
about 250 cycles per second. Unvoiced sounds occur in speech where
the vocal folds do not vibrate. For unvoiced speech, such as the
fricatives /s/ or /f/, the vocal folds are not vibrating and there
is a constriction in the vocal tract that gives the speech a
noise-like, turbulent quality. Various combinations of these two
main voicing states (voiced and unvoiced) allow other voicing
states to be reached such as voiced fricatives (/z/) or whispered
speech, where no vocal fold vibration occurs, even in the vowels.
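A common, simple cue for separating voiced from unvoiced blocks is the zero-crossing rate: noise-like, turbulent sounds change sign far more often than quasi-periodic voiced sounds. The text does not name this measure; it is swapped in here only as an illustration of a voicing test, and the function name zero_crossing_rate is hypothetical.

```python
def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(block) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(block, block[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(block) - 1)


slow = zero_crossing_rate([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])   # 2 crossings in 5 pairs
fast = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])             # every pair crosses
```

A voicing decision such as 240 might treat a low rate as evidence of voicing, with the threshold chosen from labeled data.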
[0042] If there is voicing 240, albeit at low energy levels 220,
then the signal would be labeled as Softly Spoken speech 250.
Decision processes for softly spoken speech may be to amplify the
signal in order to make it more audible. If there is no voicing 240
with low energy levels 220, then the signal would be labeled as
Whisper speech 260.
[0043] Detecting whisper speech may have many applications for
hearing impaired people. An application may be to amplify the low
energy sounds or to convert the whispered speech to normally
phonated speech [6]. In addition to having less energy, the energy
in whispered speech is at higher frequencies compared to normal
phonation. The combination of the speech having lower energy at
higher frequencies makes it very troublesome for people with hearing
loss. Detecting and converting whispered speech to normal speech
could be very beneficial for the hearing impaired.
[0044] If harmonics are found in the decision block 210, then a
mixed excitation measure 270 could be used to decide if the signal
is either babble 280, loudly spoken speech 300, or normally
phonated speech 310. Mixed excitations are detected 270 when the
pitch is changing rapidly such as a transition between phonemes or
when two or more speakers are talking at the same time. For
single-talker mixed excitation, these regions will be short, with a
duration of about 10-20 milliseconds. For multi-talker mixed
excitation, these regions will be longer and occur when one
speaker's speech is corrupted by another speaker's speech. U.S.
Pat. No. 7,177,808 B addresses how to improve speaker
identification when multiple speakers are present by finding the
"usable" (single-talker excitation) speech and processing only the
usable speech. As with the energy measurement, there are many ways
to estimate the usable speech, using techniques such as
kurtosis, linear predictive residual, autocorrelation, and wavelet
transform, to name a few [7], [8], [9].
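Of the techniques listed above, kurtosis is the simplest to state. The sketch below computes the sample kurtosis of one block; the function name kurtosis is hypothetical, and how the value is thresholded or localized to flag mixed excitation is left to [7] and is not specified here.

```python
def kurtosis(block):
    """Sample kurtosis: fourth central moment divided by the squared variance.

    Under this (non-excess) definition a Gaussian signal has kurtosis 3.
    Returns 0.0 for an empty or constant block (a convention of this sketch).
    """
    n = len(block)
    if n == 0:
        return 0.0
    mean = sum(block) / n
    m2 = sum((x - mean) ** 2 for x in block) / n
    m4 = sum((x - mean) ** 4 for x in block) / n
    if m2 == 0.0:
        return 0.0
    return m4 / (m2 * m2)


flat = kurtosis([-1.0, 1.0, -1.0, 1.0])      # two-valued signal
peaky = kurtosis([0.0] * 9 + [10.0])         # one large outlier
```

The two-valued signal gives a much smaller value than the block dominated by a single large sample, which is the kind of contrast a kurtosis-based measure relies on.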
[0045] If harmonics are detected 210 in the signal, but the signal
has mixed excitations 270, the phonation decision would be to label
the signal as Babble (multiple speakers) sounds 280 [10]. Once
again, it is not important as to how to estimate mixed excitations,
but to know when there is one talker present or multiple.
Applications for Babble speech may be to avoid that region, label
it as unreliable, or try to separate the speakers. If there is no
mixed excitation detected 270, then one talker is present and the
next step is to look for clipping 290.
[0046] Clipping is a form of waveform distortion that occurs when
the speaker speaks loudly enough to overdrive the audio amplifier,
so that the voltage or current that represents the speech exceeds
the amplifier's maximum capability. This typically occurs when the speaker is
shouting. If too much clipping 290 occurs, the signal is labeled as
Loudly Spoken 300. Applications for detecting Loudly Spoken speech
may be to attenuate the signal, detect emotions, or mitigate the
effects of clipping [11]. If there is little or no clipping, the
signal is labeled as Normal Spoken 310 speech. Typical applications
for Normal Spoken speech include speech recognition, speaker
identification, and language identification.
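The decision process of FIG. 2, as described in the preceding paragraphs, can be summarized in a short sketch. It assumes each feature has already been reduced to a yes/no value by some upstream threshold; the function name classify_phonation, the argument names, and the label strings are hypothetical, and the numbers in the comments are the reference numerals used in the text.

```python
def classify_phonation(has_harmonics, low_energy, has_voicing,
                       mixed_excitation, clipped):
    """Walk the decision process of FIG. 2 using already-thresholded features."""
    if not has_harmonics:              # harmonics test 210 fails
        if not low_energy:             # energy measure 220: higher energy
            return "non-voice"         # 230
        if has_voicing:                # voicing test 240
            return "softly spoken"     # 250
        return "whisper"               # 260
    if mixed_excitation:               # mixed excitation measure 270
        return "babble"                # 280
    if clipped:                        # clipping test 290
        return "loudly spoken"         # 300
    return "normally spoken"           # 310


# No harmonics, low energy, no voicing: the whisper branch.
label = classify_phonation(has_harmonics=False, low_energy=True,
                           has_voicing=False, mixed_excitation=False,
                           clipped=False)
```

Features that a branch does not consult are ignored; for example, clipping is never examined once the harmonics test fails.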
[0047] Referring to FIG. 3, for normally-phonated speech the speech
is characterized by sustained energy and vowels. The graph clearly
shows the onsets/offsets of speech and the energy has a larger
standard deviation due to the distinct classes of silence, unvoiced
speech, and voiced speech. For the voiced speech, there is strong
harmonicity which can be measured by a pitch estimator. Features
for normally phonated speech may include total energy, standard
deviation of the energy, spectral tilt, envelope shape and
harmonicity measures.
[0048] Referring to FIG. 4, for low-level speech phonation the
energy levels are lower for the voiced and unvoiced regions. There
still is a glottal pulse but the various regions of silence,
voiced, and unvoiced are not as distinct especially if there is
some background noise present. The dynamic range of the energy
levels is reduced. The voiced speech regions may be shorter and
the unvoiced regions may be longer. Features for low-level speech
phonation may include the same features as normally phonated speech
(total energy, standard deviation of the energy, spectral tilt,
envelope shape and harmonicity measures), but the typical values
for these measurements will be different.
[0049] Referring to FIG. 5, for high-level speech phonation the
energy levels are higher for the voiced and unvoiced regions. The
glottal pulse is very strong due to the strong airflow causing the
vocal folds to abduct and adduct. The voiced speech regions may be
longer and the unvoiced regions may be shorter. Features for
high-level speech phonation may include the same features as
normally phonated speech (total energy, standard deviation of the
energy, spectral tilt, envelope shape and harmonicity measures),
but the typical values for these measurements will be different.
Clipping may be an additional feature to detect high-level speech
phonation.
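A clipping feature of the kind mentioned above could be as simple as the fraction of samples that sit at or near full scale. In this sketch the function name clipping_fraction, the normalization of samples to the range -1 to 1, and the 1 percent tolerance are assumptions; the text does not say how clipping is measured or how much is "too much".

```python
def clipping_fraction(block, full_scale=1.0, tolerance=0.01):
    """Fraction of samples whose magnitude is within `tolerance` of full scale."""
    if not block:
        return 0.0
    limit = full_scale * (1.0 - tolerance)
    clipped = sum(1 for x in block if abs(x) >= limit)
    return clipped / len(block)


# Two of the four samples are pinned at full scale.
fraction = clipping_fraction([0.1, 1.0, -1.0, 0.5])
```

A block would then be flagged as clipped when the fraction exceeds a chosen threshold.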
[0050] Referring to FIG. 6, for whispered-speech phonation the
speech is characterized by lower volume levels (sound pressure) and
the lack of a glottal excitation. Whisper speech occurs when there
are no vocal fold vibrations and the speech is characterized by a
noise-like or hissing quality. There will be little, if any, voiced
speech regions. The same features of total energy, standard
deviation of the energy, spectral tilt, envelope shape and
harmonicity measures can be used to detect whispered-speech
phonation. However, there will be less energy in the lower
frequency regions compared to normally phonated speech.
[0051] Referring to FIG. 7, babble-speech phonation occurs when
there is more than one speaker talking at the same time. This leads
to fewer silence regions and mixed-excitation where unvoiced and
voiced speech overlap. There will be fewer unvoiced regions. The
same features apply, but the reduced amount of silence and unvoiced
regions provide an extra clue about the phonation style.
[0052] Other phonation styles may exist, such as loud whispering for
dramatic purposes, but the use of the same feature set along with a
unique feature for loud whispering would still allow for
successful detection of the new phonation style. The phonation
style detection is not geared towards any one speech processing
application but provides a decision point of how to proceed.
[0053] Referring to FIG. 1, the pre-processing step to phonation
style detection is to extract the speech signal, including but not
limited to extraction from a microphone or an interception of a
transmission containing speech. The speech is then analyzed by a
speech activity detector (SAD) 110 as shown in FIG. 1. The SAD
analyzes a segment of audio data and determines if the signal is
speech, silence, or background noise. If the SAD 110 detects
speech, the feature extraction section 120, 130, 140, 150, 160
follows. The feature extraction section is used to classify the
phonation of an individual's speech. The features include but are
not limited to a harmonic measurement, signal energy measurement,
voice activity detector, time domain measurements, and frequency
domain measurements. Some systems may utilize more features, while
others may utilize fewer. Once the features are extracted from the
speech, the information from the features is fused 170.
[0054] The information fusion/decision 170 is an algorithm under
software control 180 that any type of statistical classifier can
perform to detect the type of phonation. This statistical
classifier can be any type of classifier such as, but not limited
to, iVectors, Gaussian mixture models, support vector machines,
and/or a type of neural network.
[0055] The information fusion and decision 170 will output the type
of phonation, such as, but not limited to, normal phonation,
whispered speech phonation, low-level speech phonation, high-level
speech phonation, and babble phonation. The output can feed a
number of software applications, such as, but not limited to,
speech recognition systems, speaker identification systems,
language identification systems, and/or video gaming applications.
For video gaming and simulated war games, the phonation style
becomes part of the game, where how a person speaks is as important
as what is being said and how the controller is being used.
Additionally, phonation style detection can be used for hearing
aids to detect whisper speech and convert high frequency,
non-phonation information to normal, phonated speech at a lower
frequency region. Phonation style detection can also be used for
voice coaching to give the subject feedback as to his/her
pronunciation and style of pronunciation.
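How a software control process might act on the detected style can be sketched as a simple lookup. The follow-on actions are taken from the examples given in this description (a whisper-trained recognizer, amplification, attenuation, avoiding unreliable regions); the table name NEXT_STEP, the function name next_step, and the label strings are hypothetical, and a real application would call its own handlers instead of returning text.

```python
# Follow-on actions named in the description, keyed by detected phonation style.
NEXT_STEP = {
    "normally spoken": "run speech recognition",
    "whisper": "send to a recognizer trained on whispered speech",
    "softly spoken": "amplify the signal",
    "loudly spoken": "attenuate the signal",
    "babble": "skip the region or label it unreliable",
    "non-voice": "skip the region",
}


def next_step(style):
    """Return the follow-on action for a phonation style label."""
    return NEXT_STEP.get(style, "no action defined")


action = next_step("whisper")
```

Styles that the table does not cover, such as loud whispering, fall through to the default until an entry is added for them.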
[0056] Uses of the present invention include but are not limited to
the following:
[0057] PTSD detection;
[0058] Screening/testing: thyroid cancer screening, eHealth over
the phone diagnosis, stroke detection, intoxication detection,
heart attack detection;
[0059] Hearing aids;
[0060] Pain management monitoring, pain level detection for
non-verbal patients (including animals);
[0061] Improved closed caption translation;
[0062] Military--information exploitation, simulated war games;
[0063] Gaming software--vocal control;
[0064] Voice coaching;
[0065] Call center--customer service rep assist, auditory cues,
suicide prevention;
[0066] Internet of Things--robotic caregivers, accents for robots,
appliance language tone/accent;
[0067] Microphone improvement;
[0068] Apps--voice change tracker, meditation alert, health
monitor, mood/calming, whisper transcription, tonal detection and
feedback, PIN/code replacement, voice to text with emotion, mental
health monitor;
[0069] Security--prioritized listening, voice interception,
scanning chatter, crowd monitoring, airport monitoring/TSA,
profiling, threat detection, interrogation.
[0070] Having described preferred embodiments of the invention with
reference to the accompanying drawings, it is to be understood that
the invention is not limited to those precise embodiments, and that
various changes and modifications may be effected therein by one
skilled in the art without departing from the scope or spirit of
the invention as defined in the appended claims.
* * * * *