Phonation Style Detection WENNDT; STANLEY J. ; et al. [GOVERMENT OF THE UNITED STATES AS REPRESENTED BY TE SECRETARY OF THE AIR FORCE]

Patent Application Summary

U.S. patent application number 15/434164 was filed with the patent office on 2017-02-16 and published on 2018-05-17 for phonation style detection. The applicant listed for this patent is GOVERMENT OF THE UNITED STATES AS REPRESENTED BY TE SECRETARY OF THE AIR FORCE. Invention is credited to DARREN M. HADDAD, STANLEY J. WENNDT.

Publication Number	20180137880
Application Number	15/434164
Family ID	62107298
Filed Date	2017-02-16
Publication Date	2018-05-17

United States Patent Application	20180137880
Kind Code	A1
WENNDT; STANLEY J. ; et al.	May 17, 2018

Phonation Style Detection

Abstract

The invention provides a method for detecting phonation style in dynamic communication environments and making software control decisions based on phonation styles enabling an audio message to be classified based on the phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce the phonation style as a way to control computer software.

Inventors:

WENNDT; STANLEY J.; (ROME, NY) ; HADDAD; DARREN M.; (FRANKFORT, NY)

Applicant:

Name	City	State	Country	Type
GOVERMENT OF THE UNITED STATES AS REPRESENTED BY TE SECRETARY OF THE AIR FORCE	ROME	NY	US

Family ID:

62107298

Appl. No.:

15/434164

Filed:

February 16, 2017

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
62422611	Nov 16, 2016

Current U.S. Class:	1/1
Current CPC Class:	G10L 15/02 20130101; H04L 63/162 20130101; G10L 25/84 20130101; H04L 2209/08 20130101; H04L 45/02 20130101; G10L 25/21 20130101; H04L 2209/12 20130101; G06F 21/32 20130101; G10L 25/90 20130101; H04L 45/22 20130101; H04L 45/54 20130101; H04L 9/002 20130101; G06F 21/72 20130101; G10L 25/24 20130101; G06F 21/79 20130101; G06F 21/60 20130101; G10L 15/16 20130101
International Class:	G10L 25/84 20060101 G10L025/84; G10L 15/02 20060101 G10L015/02

Government Interests

STATEMENT OF GOVERNMENT INTEREST

[0001] The invention described herein may be manufactured and used by or for the Government for governmental purposes without the payment of any royalty thereon.

Claims

1. A method for phonation style detection, comprising: detecting speech activity in a signal; extracting signal features from said detected speech activity; characterizing said extracted signal features; and performing a decision process on said characterized signal features which determines whether said detected speech activity is one of normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, and non-voice sound.

2. In the method of claim 1, characterizing further comprises characterizing said extracted signal features in terms of harmonic measure, signal energy, mixed-excitation, clipping, and voicing.

3. In the method of claim 2, performing a decision process further comprises classifying said speech activity as non-voice sounds in the absence of harmonics and low energy signal features.

4. In the method of claim 2, performing a decision process further comprises classifying said speech activity as softly spoken speech in the absence of harmonics and low energy signal features but in the presence of voicing signal features.

5. In the method of claim 2, performing a decision process further comprises classifying said speech activity as babble in the presence of harmonics and mixed excitation signal features.

6. In the method of claim 2, performing a decision process further comprises classifying said speech activity as loudly spoken speech in the presence of harmonics and clipping but in the absence of mixed excitation signal features.

7. In the method of claim 2, performing a decision process further comprises classifying said speech activity as normally spoken speech in the presence of harmonics but in the absence of clipping and mixed excitation signal features.

8. In the method of claim 2, performing a decision process further comprises classifying said speech activity as whisper speech in the absence of harmonics and voicing signal features but in the presence of low energy signal features.

9. In the method of claim 1, speech activity detection is performed on substantially 10 to 30 millisecond blocks of said signal.

10. In the method of claim 1, speech activity detection further comprises measurement of any one of the following: energy, pitch extraction, autocorrelation, spectral tilt, and cepstral coefficients.

11. In the method of claim 10, said speech activity detection further comprises coupling said measurements with classifiers and learning algorithms.

12. In the method of claim 11, said classifiers and learning algorithms are selected from the group comprising support vector machines, neural networks, and Gaussian mixture models.

13. In the method of claim 10, said measurement of energy further comprises computing discrete signal energy in the time domain according to: $E(n)=\sum_{m=0}^{N-1}[w(m)\,s(n-m)]^{2}$ where w(m) is a weighting function; s is signal amplitude; n is a current discrete time sample; m is a current discrete time sample of a window of time; and E(n) is computed signal energy over said time window.

14. In the method of claim 13, said weighting function w(m) is selected from the group consisting of rectangular, Hamming and triangular window functions.

15. In the method of claim 10, said measurement of energy further comprises computing signal energy in the frequency domain according to: $E=\sum_{f=f_1}^{f_2}|X[f]|^{2}$ where X[f] is the Fourier transform of said signal; $f_1$ and $f_2$ are the Fourier transform frequency limits; and E is the computed energy of said signal.

16. In the method of claim 15, $f_1$ is 0 and $f_2$ is one-half the Nyquist rate.

Description

BACKGROUND OF THE INVENTION

[0002] Phonation is the rapid, periodic opening and closing of the glottis through separation and apposition of the vocal folds that, accompanied by breath under lung pressure, constitutes a source of vocal sound. Technology exists that detects sound created through phonation and attempts to understand and decode the sounds. Examples include Siri, automated phone menus, and Shazam. Current technologies work well only when the speaker uses a normal phonation style, as opposed to loud, babble, or whispered speech or speech with altered pitch, and they assume that the speaker wants to be heard and understood.

[0003] Phonation style refers to different speaking styles which may include normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. Babble phonation occurs when there is more than one speaker talking at the same time. The current state of the art has considered noise-degraded speech applications where the goal is to extract the target speech or suppress the interfering speech, but it lacks a decision-making process for addressing different phonation styles. In most speech recognition processes, the speech recognition algorithm tries to recognize whatever data it has been given. Using a probabilistic approach, the algorithm tries to decipher the most likely sequence of spoken words. In constrained environments where the vocabulary is limited, the algorithm can be successful even in noisy conditions because of a priori knowledge about the potential sequence. For example, if the spoken words are "Find the nearest liquor store", the speech recognizer only needs to recognize "liquor" and then it can guess the rest of the words or, at least, get close to the lexical content. Detecting babble speech (where more than one speaker is speaking) allows the speech to be withheld from a speech recognizer, since the output would be unreliable.

[0004] Current methods also experience issues in situations of degraded speech, which can occur in almost any communication setting. Typically, speech degradation is assumed to be due to environmental noise or communication channel artifacts. However, speech degradation can also occur due to changes in phonation style. A speech processing algorithm that is trained using normally phonated speech but is given whispered or high-volume speech will quickly degrade and create nonsensical outputs. Instead of assuming a phonation style, a pre-processing technique is used to classify the speech as being normally-phonated, whispered, low-level volume, high-level volume or babble speech. Features are extracted and analyzed so as to make a decision. Many of the same features, such as spectral tilt, energy, and envelope shape, are common to every phonation class. However, each phonation class may also have features specific to that class. Based on the outcome of the pre-processing, a decision tree is used to take the next appropriate step. For example, when speech recognition is the desired next step and the pre-processor indicates whispered speech, the next step would be to send the data to a speech recognizer that is trained on whispered speech.

[0005] In an unconstrained, dynamically changing environment, speech recognizers have not succeeded in accurately recognizing the spoken dialogue. This process is complicated by multiple speakers, noisy environments, and unstructured lexical information. The current state of the art lacks an approach for uniquely classifying the various phonation styles; it also does not address how to make appropriate follow-on decisions. The current technology also assumes that there is only one speaker and that the speaker desires to be understood. When there are multiple speakers or the speaker wishes to obfuscate his/her communication, the output of speech recognizers quickly degrades into nonsense lexical information.

OBJECTS AND SUMMARY OF THE INVENTION

[0006] It is therefore an object of the invention to classify speech into several possible phonation styles.

[0007] It is a further object of the invention to extract from speech signals characteristics which lead to the classification of several phonation styles.

[0008] It is yet a further object of the present invention to make a determination as to which of several phonation styles is present based on the detection of the presence or absence of extracted speech characteristics.

[0009] Briefly stated, the invention provides a method for detecting phonation style in dynamic communication environments and making software control decisions based on phonation styles enabling an audio message to be classified based on the phonation style such as, but not limited to: normal phonation, whispered phonation, softly spoken speech phonation, high-level phonation, babble phonation, and non-voice sounds. The purpose of the invention is to introduce the phonation style as a way to control computer software.

[0010] In an embodiment of the invention, a method for phonation style detection, comprises detecting speech activity in a signal; extracting signal features from the detected speech activity; characterizing the extracted signal features; and performing a decision process on the characterized signal features which determines whether the detected speech activity is normally spoken speech, loudly spoken speech, softly spoken speech, whisper speech, babble, or non-voice sound.

[0011] The above, and other objects, features and advantages of the invention will become apparent from the following description read in conjunction with the accompanying drawings, in which like reference numerals designate the same elements.

REFERENCES

[0012] [1] Stanley J. Wenndt, Edward J. Cupples; "Method and apparatus for detecting illicit activity by classifying whispered speech and normally phonated speech according to the relative energy content of formants and fricatives"; U.S. Pat. No. 7,577,564; Issued: Aug. 18, 2009.

[0013] [2] Darren M. Haddad, Andrew J. Noga; "Generalized Harmonicity Indicator"; U.S. Pat. No. 7,613,579; Issued: Nov. 3, 2009.

[0014] [3] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), "Discrete-Time Processing of Speech Signals," New York, N.Y.: Macmillan Publishing Company, pp. 234-240.

[0015] [4] Deller, Jr., J. R., Proakis, J. G., Hansen, J. H. L. (1993), "Discrete-Time Processing of Speech Signals," New York, N.Y.: Macmillan Publishing Company, pp. 110-115.

[0016] [5] Peter J. Watson, Angela H. Ciccia, Gary Weismer; "The relation of lung volume initiation to selected acoustic properties of speech"; Journal of the Acoustical Society of America 113 (5), May 2003; pp. 2812-2819.

[0017] [6] Zhi Tao; Xue-Dan Tan; Tao Han; Ji-Hua Gu; Yi-Shen Xu; He-Ming Zhao (2010), "Reconstruction of Normal Speech from Whispered Speech Based on RBF Neural Network," 2010 Third International Symposium on Intelligent Information Technology and Security Informatics.

[0018] [7] Krishnamachari, K. R.; Yantorno, R. E.; Lovekin, J. M.; Benincasa, D. S.; Wenndt, S. J., "Use of local kurtosis measure for spotting usable speech segments in co-channel speech," 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings.

[0019] [8] Kizhanatham, A. R.; Yantorno, R. E.; Smolenski, B. Y., "Peak difference of autocorrelation of wavelet transform (PDAWT) algorithm based usable speech measure," 2003 7th World Multiconference on Systemics, Cybernetics and Informatics Proceedings.

[0020] [9] Yantorno, R. E.; Smolenski, B. Y.; Chandra, N.; "Usable speech measures and their fusion," Proceedings of the 2003 International Symposium on Circuits and Systems.

[0021] [10] Krishnamurthy, N.; Hansen, J. H. L.; "Speech babble: Analysis and modeling for speech systems"; IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, ICASSP 2008; pp. 4505-4508.

[0022] [11] Hayakawa, Makoto; Fukumori, Takahiro; Nakayama, Masato; Nishiura, Takanobu, "Suppression of clipping noise in observed speech based on spectral compensation with Gaussian mixture models and reference of clean speech," Proceedings of Meetings on Acoustics-ICA 2013.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] FIG. 1 depicts the present invention's process for feature extraction and phonation style classification.

[0024] FIG. 2 depicts the present invention's decision process for detection and phonation classification based on extracted features.

[0025] FIG. 3 depicts a measured time domain and spectrogram for normally-phonated speech.

[0026] FIG. 4 depicts a measured time domain and spectrogram for low-level speech.

[0027] FIG. 5 depicts a measured time domain and spectrogram for high-level speech.

[0028] FIG. 6 depicts a measured time domain and spectrogram for whispered speech.

[0029] FIG. 7 depicts a measured time domain and spectrogram for babble speech.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0030] The invention described herein allows an audio stream to be analyzed and classified based on the phonation style and then, depending on the application, a software control process is able to make appropriate control decisions. The goal of spoken language is to communicate. Communication occurs when the intended recipient of the spoken message receives the intended message from the speaker. While communication can occur based on body language, facial expressions, written words, and the spoken message, this invention addresses phonation style detection by using only the spoken message.

[0031] For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style.

[0032] People communicate in different phonation styles depending on the communication setting and lexical intent. If a person wishes to conceal the spoken message, then whispering may be employed. Whisper speech occurs when there is no vocal fold vibration and the speech is characterized by a noise-like or hissing quality. For low-level speech, vocal vibrations occur, but the sound level is reduced compared to normally phonated speech. If a person wishes to emphasize a point or needs to speak so as to be heard above a noisy environment, then a high-level phonation style may be employed. Additionally, babble phonation may exist where more than one speaker is talking simultaneously. Other types of phonation are possible depending on the environment. For example, loud whispering may be used for dramatic purposes or emphasis of an opinion. This invention has applicability to any speech processing application.

[0033] FIG. 1 displays the invention's feature extraction and classifier approach to phonation style detection, while FIG. 2 provides the decision tree process by which the extracted features are analyzed. As in FIG. 1, the invention allows an audio stream to be classified based on the phonation style and then, as in FIG. 2, depending on the application, a software control process is able to make appropriate control decisions.

[0034] For typical speech applications, normal phonation is assumed and the audio applications attempt to process all the incoming audio. Some audio applications try to detect aberrations to this assumption, for example, if the energy level drops significantly. In this case, the algorithm may label it as whispered speech. Earlier approaches, such as [1], only assumed two phonation states: normally-phonated speech or whispered speech. This invention is unique in that no assumptions are made as to the type of phonation style. Instead, a multi-feature set is used to classify the phonation style (see FIG. 2).

[0035] Referring to FIG. 1, the first step in classifying the signal is to decide whether or not there is speech activity 110. FIG. 2 shows how the invention delineates the phonation style once speech activity has been found.

[0036] Referring back to FIG. 1, the signal is processed in small blocks of data that are typically 10-30 milliseconds in duration. Features such as energy, pitch extraction, autocorrelation, spectral tilt, and/or cepstral coefficients, coupled with classifiers and learning algorithms such as support vector machines, neural networks, and Gaussian mixture models, are typical approaches to detecting speech activity. If speech activity is detected 110 and signal features are extracted 120, 130, 140, 150, 160 and characterized through a fusion and decision process 170 under software control 180, then a multi-step, multi-feature approach is used to determine the phonation style as depicted in FIG. 2.
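The block-wise processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 8 kHz sample rate, the 20 ms block length, and the energy threshold are hypothetical choices, and energy is used here merely because it is the simplest of the speech activity features listed; the description does not prescribe a particular detector or threshold.

```python
import math

def split_into_blocks(signal, sample_rate, block_ms=20):
    """Split a list of samples into consecutive, non-overlapping blocks."""
    block_len = int(sample_rate * block_ms / 1000)
    return [signal[i:i + block_len]
            for i in range(0, len(signal) - block_len + 1, block_len)]

def detect_speech_activity(signal, sample_rate, block_ms=20, threshold=0.01):
    """Return one flag per block: True if the mean squared amplitude
    exceeds `threshold` (an assumed, untuned value)."""
    flags = []
    for block in split_into_blocks(signal, sample_rate, block_ms):
        mean_energy = sum(x * x for x in block) / len(block)
        flags.append(mean_energy > threshold)
    return flags

# Usage: 40 ms of silence followed by 40 ms of a 200 Hz tone at 8 kHz.
sample_rate = 8000
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(320)]
activity = detect_speech_activity(silence + tone, sample_rate)
print(activity)  # [False, False, True, True]
```

A practical detector would likely combine several of the listed features with a trained classifier rather than rely on a fixed energy threshold.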

[0037] Referring again to FIG. 2, the first decision point is to measure if there is harmonic information 210 in the signal. A phoneme or a unit of a voice sound has a fundamental frequency associated with it, which is determined by how many times the vocal folds open and close in one second (given in units of Hz). Harmonics are produced when the fundamental frequency resonates within a cavity; in the case of voice, that cavity is the vocal tract. Harmonics are integer multiples of the fundamental frequency. The fundamental frequency is the first harmonic, the second harmonic lies one octave above it, and so on. There are many methods to determine the harmonics of a voice sample; one method is found in [2].
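The integer-multiple relationship can be illustrated with a short sketch. The `harmonic_energy_ratio` helper below is a hypothetical measure introduced only for illustration and is not the method of [2]: given an assumed fundamental frequency, it reports what fraction of the spectral energy falls on discrete Fourier transform bins at integer multiples of that frequency. The sample rate, frame length, and test tones are likewise assumptions, chosen so that every tone lands exactly on a bin.

```python
import cmath
import math

def dft_power(frame):
    """Power |X[k]|^2 for bins 0..N/2, using a naive O(N^2) DFT."""
    n_samples = len(frame)
    power = []
    for k in range(n_samples // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power

def harmonic_energy_ratio(frame, sample_rate, f0):
    """Fraction of spectral energy (DC excluded) found at multiples of f0."""
    power = dft_power(frame)
    bin_hz = sample_rate / len(frame)
    total = sum(power[1:])
    if total == 0.0:
        return 0.0
    harmonic = 0.0
    h = 1
    while h * f0 <= sample_rate / 2:
        k = round(h * f0 / bin_hz)       # nearest bin to the h-th harmonic
        if 1 <= k < len(power):
            harmonic += power[k]
        h += 1
    return harmonic / total

sample_rate, n_samples, f0 = 8000, 400, 100.0
# Three harmonics of 100 Hz versus a single tone at 140 Hz (not a multiple).
voiced_like = [math.sin(2 * math.pi * f0 * n / sample_rate)
               + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
               + 0.25 * math.sin(2 * math.pi * 3 * f0 * n / sample_rate)
               for n in range(n_samples)]
off_harmonic = [math.sin(2 * math.pi * 140.0 * n / sample_rate)
                for n in range(n_samples)]
ratio_voiced = harmonic_energy_ratio(voiced_like, sample_rate, f0)
ratio_off = harmonic_energy_ratio(off_harmonic, sample_rate, f0)
print(ratio_voiced > 0.99, ratio_off < 0.01)  # True True
```

With real speech the fundamental frequency is not known in advance and tones do not fall exactly on bins, so a usable harmonicity test would need a pitch estimate and some tolerance around each harmonic.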

[0038] If insufficient harmonics are found in the decision block 210, then an energy measure 220 is used to decide if the signal is non-voice sounds 230, softly spoken speech 250, or whisper 260. There are many ways to compute the energy of a signal. The easiest way to compute energy is in the time domain by summing up the squared values of the signal: $E=\sum_{n=-\infty}^{+\infty}s^{2}(n)$ [3]. Since speech is time-varying and may change very rapidly, a windowing function is applied: $E(n)=\sum_{m=0}^{N-1}[w(m)\,s(n-m)]^{2}$

[0039] where w(m) is a weighting function such as a rectangular, Hamming, or triangular window. The length of the window tends to be approximately 10-30 milliseconds due to the time-varying nature of speech signals. Additionally, the window length should encompass at least one pitch period, but not too many pitch periods. There are many variations of computing the energy such as using the absolute value instead of the squared value or the log of the energy or using a frequency range. In the frequency domain, $E=\sum_{f=f_1}^{f_2}|X[f]|^{2}$ [4] where X[f] is the Fourier transform of the signal. The bandwidth can be the full bandwidth between $f_1=0$ and $f_2=f_s/2$ where $f_s$ is the Nyquist rate. In this case, the frequency domain energy calculation would be the same as the time domain energy calculation. Or, the bandwidth may be a reduced frequency range to avoid unreliable or noisy regions. The energy can also be computed using autocorrelation values. The important concept is not how the energy is calculated, but how to use the energy levels. Higher energy coupled with harmonics would be indicative of normally spoken speech. Lower energy would be indicative of softly spoken or whispered speech.
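Both energy measures can be sketched directly from the formulas above. The code below is a minimal illustration, not a definitive implementation: the function names are hypothetical, samples before the start of the signal are treated as zero, and the frequency-domain sum uses a naive, unnormalized discrete Fourier transform over the bins from 0 to half the sample rate.

```python
import cmath
import math

def hamming(length):
    """Hamming window coefficients w(m) for m = 0..length-1."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (length - 1))
            for m in range(length)]

def short_time_energy(signal, n, window):
    """E(n) = sum over m of [w(m) * s(n - m)]^2, with s taken as 0
    for indices outside the signal."""
    total = 0.0
    for m, w in enumerate(window):
        if 0 <= n - m < len(signal):
            total += (w * signal[n - m]) ** 2
    return total

def band_energy(frame, sample_rate, f1, f2):
    """E = sum of |X[f]|^2 over DFT bins whose frequency lies in [f1, f2]."""
    n_samples = len(frame)
    energy = 0.0
    for k in range(n_samples // 2 + 1):
        freq = k * sample_rate / n_samples
        if f1 <= freq <= f2:
            x_k = sum(x * cmath.exp(-2j * math.pi * k * i / n_samples)
                      for i, x in enumerate(frame))
            energy += abs(x_k) ** 2
    return energy

# Rectangular window over the last four samples: 16 + 9 + 4 + 1.
print(short_time_energy([1.0, 2.0, 3.0, 4.0], 3, [1.0] * 4))  # 30.0

# A 10 ms frame of a 1 kHz tone: the energy sits in the band around 1 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(80)]
in_band = band_energy(tone, sample_rate, 900, 1100)
out_of_band = band_energy(tone, sample_rate, 2000, 4000)
```

Note that with an unnormalized transform the full-band sum matches the time-domain energy only up to a constant scale factor that depends on the transform convention, so thresholds should be set consistently for whichever form is used.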

[0040] If the signal fails the harmonics test 210, but has higher energy 220, the phonation decision would be labeled Non-Voice sounds 230. If the signal is non-voice sounds, the next step may be to avoid these regions of the signal or to employ an additional step to classify the non-voice signal.

[0041] If the signal fails the harmonics test 210, but has lower energy 220, the next step would be to test for voicing 240. Voicing occurs as the vocal folds open and close due to air pressure building up behind the vocal folds and then being released [5]. The rate of this vibration is the fundamental frequency, and its period is the pitch period of the speech signal. The bursts of air that are released as the vocal folds open and close then become an excitation source for the vocal cavity. Vowels are examples of sustained voicing where the pitch period is quasi-periodic. For males, the vocal folds open and close at about 110 cycles per second. For females, it is about 250 cycles per second. Unvoiced sounds occur in speech where the vocal folds are not vibrating. For unvoiced speech, such as a fricative /s/ or /f/, the vocal folds are not vibrating and there is a constriction in the vocal tract that gives the speech a noise-like, turbulent quality. Various combinations of these two main voicing states (voiced and unvoiced) allow other voicing states to be reached, such as voiced fricatives (/z/) or whispered speech where no vocal fold vibration occurs, even in the vowels.
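One common way to test for voicing, shown here only as a hedged sketch, is to look for a strong peak in the normalized autocorrelation at lags corresponding to plausible pitch periods. The description does not specify a voicing detector, so the function names, the 60-400 Hz search range (chosen to cover the 110 Hz and 250 Hz figures above), and the 0.5 threshold are all assumptions.

```python
import math
import random

def normalized_autocorrelation(frame, lag):
    """Correlation between the frame and itself shifted by `lag` samples."""
    a = frame[:len(frame) - lag]
    b = frame[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def is_voiced(frame, sample_rate, min_f0=60.0, max_f0=400.0, threshold=0.5):
    """Return (voiced, f0_estimate_hz) from the strongest autocorrelation
    peak inside the assumed pitch range."""
    min_lag = int(sample_rate / max_f0)
    max_lag = min(int(sample_rate / min_f0), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorrelation(frame, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    if best_r >= threshold:
        return True, sample_rate / best_lag
    return False, 0.0

sample_rate = 8000
# 30 ms of a 110 Hz tone stands in for a voiced frame.
tone_frame = [math.sin(2 * math.pi * 110 * n / sample_rate) for n in range(240)]
# 30 ms of seeded uniform noise stands in for an unvoiced frame.
random.seed(0)
noise_frame = [random.uniform(-1.0, 1.0) for _ in range(240)]

tone_voiced, tone_f0 = is_voiced(tone_frame, sample_rate)
noise_voiced, _ = is_voiced(noise_frame, sample_rate)
```

Because the lag is an integer number of samples, the pitch estimate is quantized; for the tone above it lands close to, but not exactly on, 110 Hz.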

[0042] If there is voicing 240, albeit at low energy levels 220, then the signal would be labeled as Softly Spoken speech 250. Decision processes for softly spoken speech may be to amplify the signal in order to make it more audible. If there is no voicing 240 with low energy levels 220, then the signal would be labeled as Whisper speech 260.

[0043] Detecting whisper speech may have many applications for hearing impaired people. An application may be to amplify the low energy sounds or to convert the whispered speech to normally phonated speech [6]. In addition to having less energy, the energy in whispered speech is at higher frequencies compared to normal phonation. The combination of lower energy and higher frequencies makes whispered speech very troublesome for people with hearing loss. Detecting and converting whispered speech to normal speech could be very beneficial for the hearing impaired.

[0044] If harmonics are found in the decision block 210, then a mixed excitation measure 270 could be used to decide if the signal is babble 280, loudly spoken speech 300, or normally phonated speech 310. Mixed excitations are detected 270 when the pitch is changing rapidly, such as at a transition between phonemes or when two or more speakers are talking at the same time. For single-talker mixed excitations, these regions will be short, with a duration of about 10-20 milliseconds. For multi-talker mixed excitation, these regions will be longer and occur when one speaker's speech is corrupted by another speaker's speech. U.S. Pat. No. 7,177,808 B addresses how to improve speaker identification when multiple speakers are present by finding the "usable" (single-talker excitation) speech and processing only the usable speech. As with the energy measurement, there are many ways to estimate the usable speech by using techniques such as the kurtosis, linear predictive residual, autocorrelation, and wavelet transform, to name a few [7], [8], [9].
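The duration distinction above suggests a simple follow-on check, sketched here under stated assumptions: given per-block flags from some mixed excitation detector (not shown, and not specified by the description), a run of flagged blocks longer than about 20 milliseconds is treated as multi-talker. The function names, the 10 ms block length, and the use of 20 ms as a hard cutoff are illustrative choices.

```python
def longest_run_ms(mixed_flags, frame_ms):
    """Duration in ms of the longest run of consecutive True flags."""
    longest = current = 0
    for flag in mixed_flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest * frame_ms

def looks_like_multi_talker(mixed_flags, frame_ms=10, max_single_talker_ms=20):
    """True if mixed excitation persists longer than the assumed
    upper bound for a single-talker phoneme transition."""
    return longest_run_ms(mixed_flags, frame_ms) > max_single_talker_ms

# One flag per 10 ms block.
transition = [False, True, True, False, False, False]  # 20 ms run
overlap = [False, True, True, True, True, True]        # 50 ms run
print(looks_like_multi_talker(transition))  # False
print(looks_like_multi_talker(overlap))     # True
```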

[0045] If harmonics are detected 210 in the signal, but the signal has mixed excitations 270, the phonation decision would be to label the signal as Babble (multiple speakers) sounds 280 [10]. Once again, it is not important as to how to estimate mixed excitations, but to know when there is one talker present or multiple. Applications for Babble speech may be to avoid that region, label it as unreliable, or try to separate the speakers. If there is no mixed excitation detected 270, then one talker is present and the next step is to look for clipping 290.

[0046] Clipping is a form of waveform distortion that occurs when the speaker speaks loudly enough to overdrive the audio amplifier, so that the voltage or current that represents the speech exceeds the amplifier's maximum capability. This typically occurs when the speaker is shouting. If too much clipping 290 occurs, the signal is labeled as Loudly Spoken 300. Applications for detecting Loudly Spoken speech may be to attenuate the signal, detect emotions, or mitigate the effects of clipping [11]. If there is little or no clipping, the signal is labeled as Normal Spoken 310 speech. Typical applications for Normal Spoken speech include speech recognition, speaker identification, and language identification.
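The decision points 210 through 310 walked through above can be summarized in a short sketch. It assumes each feature test has already been reduced to a yes/no outcome by some upstream measurement; the function names, the string labels, the `clipping_ratio` helper, and its 10% threshold are hypothetical and are not taken from the description.

```python
def clipping_ratio(frame, full_scale=1.0, tolerance=0.01):
    """Fraction of samples at or within `tolerance` of full scale."""
    if not frame:
        return 0.0
    limit = full_scale * (1.0 - tolerance)
    return sum(1 for x in frame if abs(x) >= limit) / len(frame)

def classify_phonation(has_harmonics, low_energy, has_voicing,
                       mixed_excitation, clipped):
    """Walk the decision points described for FIG. 2."""
    if not has_harmonics:                   # harmonics test 210 fails
        if not low_energy:                  # energy measure 220: higher energy
            return "non-voice sounds"       # 230
        if has_voicing:                     # voicing test 240
            return "softly spoken speech"   # 250
        return "whisper speech"             # 260
    if mixed_excitation:                    # mixed excitation measure 270
        return "babble"                     # 280
    if clipped:                             # clipping test 290
        return "loudly spoken speech"       # 300
    return "normally spoken speech"         # 310

# Usage: three of five samples sit at (or within 1% of) full scale.
frame = [0.2, 1.0, -1.0, 0.995, 0.1]
ratio = clipping_ratio(frame)
clipped = ratio > 0.10                      # assumed threshold
label = classify_phonation(True, False, True, False, clipped)
print(ratio, label)  # 0.6 loudly spoken speech
```

Hard yes/no inputs are a simplification; the fusion and decision process 170 could equally weigh continuous feature values with a statistical classifier.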

[0047] Referring to FIG. 3, for normally-phonated speech the speech is characterized by sustained energy and vowels. The graph clearly shows the onsets/offsets of speech and the energy has a larger standard deviation due to the distinct classes of silence, unvoiced speech, and voiced speech. For the voiced speech, there is strong harmonicity which can be measured by a pitch estimator. Features for normally phonated speech may include total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures.

[0048] Referring to FIG. 4, for low-level speech phonation the energy levels are lower for the voiced and unvoiced regions. There still is a glottal pulse, but the various regions of silence, voiced, and unvoiced speech are not as distinct, especially if there is some background noise present. The dynamic range of the energy levels is reduced. The voiced speech regions may be shorter and the unvoiced regions may be longer. Features for low-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different.

[0049] Referring to FIG. 5, for high-level speech phonation the energy levels are higher for the voiced and unvoiced regions. The glottal pulse is very strong due to the strong airflow causing the vocal folds to abduct and adduct. The voiced speech regions may be longer and the unvoiced regions may be shorter. Features for high-level speech phonation may include the same features as normally phonated speech (total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures), but the typical values for these measurements will be different. Clipping may be an additional feature to detect high-level speech phonation.

[0050] Referring to FIG. 6, for whispered-speech phonation the speech is characterized by lower volume levels (sound pressure) and the lack of a glottal excitation. Whisper speech occurs when there is no vocal fold vibration and the speech is characterized by a noise-like or hissing quality. There will be few, if any, voiced speech regions. The same features of total energy, standard deviation of the energy, spectral tilt, envelope shape and harmonicity measures can be used to detect whispered-speech phonation. However, there will be less energy in the lower frequency regions compared to normally phonated speech.

[0051] Referring to FIG. 7, babble-speech phonation occurs when there is more than one speaker talking at the same time. This leads to fewer silence regions and mixed excitation where unvoiced and voiced speech overlap. There will be fewer unvoiced regions. The same features apply, but the reduced amount of silence and unvoiced regions provides an extra clue about the phonation style.

[0052] Other phonation styles may exist, such as loud whispering for dramatic purposes, but the use of the same feature set along with a unique feature for loud whispering would still allow for successful detection of the new phonation style. The phonation style detection is not geared towards any one speech processing application but provides a decision point for how to proceed.

[0053] Referring to FIG. 1, the pre-processing step to phonation style detection is to extract the speech signal, including but not limited to extraction from a microphone or an interception of a transmission containing speech. The speech is then analyzed by a speech activity detector (SAD) 110 as shown in FIG. 1. The SAD analyzes a segment of audio data and determines if the signal is speech, silence, or background noise. If the SAD 110 detects speech, the feature extraction section 120, 130, 140, 150, 160 follows. The feature extraction section is used to classify the phonation of an individual's speech. The features include but are not limited to a harmonic measurement, signal energy measurement, voice activity detector, time domain measurements, and frequency domain measurements. Some systems may utilize more features, while others may utilize fewer. Once the features are extracted from the speech, the information from the features is fused 170.

[0054] The information fusion/decision 170 is an algorithm under software control 180 that any type of statistical classifier can perform to detect the type of phonation. This statistical classifier can be any type of classifier, such as but not limited to iVectors, Gaussian mixture models, support vector machines, and/or a type of neural network.

[0055] The information fusion and decision 170 will output the type of phonation, such as, but not limited to, normal phonation, whispered speech phonation, low-level speech phonation, high-level speech phonation, and babble phonation. The output can feed a number of software applications, such as, but not limited to, speech recognition systems, speaker identification systems, language identification systems, and/or video gaming applications. For video gaming and simulated war games, the phonation style becomes part of the game, where how a person speaks is as important as what is being said and how the controller is being used. Additionally, phonation style detection can be used in hearing aids to detect whisper speech and convert high-frequency, non-phonated information to normal, phonated speech in a lower frequency region. Phonation style detection can also be used for voice coaching to give the subject feedback as to his/her pronunciation and style of pronunciation.

[0056] Uses of the present invention include but are not limited to the following:

[0057] PTSD detection;

[0058] Screening/testing: thyroid cancer screening, eHealth over the phone diagnosis, stroke detection, intoxication detection, heart attack detection;

[0059] Hearing aids;

[0060] Pain management monitoring, pain level detection for non-verbal patients (including animals);

[0061] Improved closed caption translation;

[0062] Military--information exploitation, simulated war games;

[0063] Gaming software--vocal control;

[0064] Voice coaching;

[0065] Call center--customer service rep assist, auditory cues, suicide prevention;

[0066] Internet of Things--robotic caregivers, accents for robots, appliance language tone/accent;

[0067] Microphone improvement;

[0068] Apps--voice change tracker, meditation alert, health monitor, mood/calming, whisper transcription, tonal detection and feedback, PIN/code replacement, voice to text with emotion, mental health monitor;

[0069] Security--prioritized listening, voice interception, scanning chatter, crowd monitoring, airport monitoring/TSA, profiling, threat detection, interrogation.

[0070] Having described preferred embodiments of the invention with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention as defined in the appended claims.

* * * * *