U.S. patent application number 12/370424 was filed with the patent office on 2009-02-12 and published on 2009-10-29 for computer-implemented methods and systems for modeling and recognition of speech.
This patent application is currently assigned to THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK. Invention is credited to Marios Athineos, Daniel P.W. Ellis.
Application Number | 20090271182 12/370424 |
Document ID | / |
Family ID | 41215872 |
Filed Date | 2009-02-12 |
United States Patent
Application |
20090271182 |
Kind Code |
A1 |
Athineos; Marios ; et
al. |
October 29, 2009 |
COMPUTER-IMPLEMENTED METHODS AND SYSTEMS FOR MODELING AND
RECOGNITION OF SPEECH
Abstract
In accordance with the present invention, computer implemented
methods and systems are provided for representing and modeling the
temporal structure of audio signals. In response to receiving a
signal, a time-to-frequency domain transformation on at least a
portion of the received signal to generate a frequency domain
representation is performed. The time-to-frequency domain
transformation converts the signal from a time domain
representation to the frequency domain representation. A frequency
domain linear prediction (FDLP) is performed on the frequency
domain representation to estimate a temporal envelope of the
frequency domain representation. Based on the temporal envelope,
one or more speech features are generated.
Inventors: |
Athineos; Marios; (New York,
NY) ; Ellis; Daniel P.W.; (New York, NY) |
Correspondence
Address: |
WilmerHale/Columbia University
399 PARK AVENUE
NEW YORK
NY
10022
US
|
Assignee: |
THE TRUSTEES OF COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
New York
NY
|
Family ID: |
41215872 |
Appl. No.: |
12/370424 |
Filed: |
February 12, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11090728 | Mar 25, 2005 | |
12370424 | | |
11000874 | Dec 1, 2004 | |
11090728 | | |
60525947 | Dec 1, 2003 | |
60578985 | Jun 10, 2004 | |
Current U.S.
Class: |
704/205 ;
704/E21.019 |
Current CPC
Class: |
G10L 25/12 20130101;
G10L 15/02 20130101 |
Class at
Publication: |
704/205 ;
704/E21.019 |
International
Class: |
G10L 21/06 20060101
G10L021/06 |
Government Interests
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0002] The government may have certain rights in the present
invention pursuant to grants from the Effective, Affordable,
Reusable Speech-to-Text (EARS-NA) program at the Defense Advanced
Research Projects Agency (DARPA), Contract No. MDA972-02-1-0024.
Claims
1. A method of extracting speech features from signals for use in
performing automatic speech recognition, the method comprising:
receiving a signal; performing a time-to-frequency domain
transformation on at least a portion of the received signal to
generate a frequency domain representation; performing a frequency
domain linear prediction on the frequency domain representation to
estimate a temporal envelope of the frequency domain
representation; and generating at least one speech feature based at
least in part on the temporal envelope.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application is a continuation of and claims priority
under 35 U.S.C. § 120 to U.S. patent application Ser. No.
11/090,728, filed Mar. 25, 2005, and entitled "Computer-Implemented
Methods and Systems for Modeling and Recognition of Speech," which
is a continuation of U.S. patent application Ser. No. 11/000,874,
filed Dec. 1, 2004, which claims the benefit under 35 U.S.C. §
119(e) of U.S. Provisional Patent Application Nos. 60/525,947,
filed Dec. 1, 2003, and 60/578,985, filed Jun. 10, 2004, which are
hereby incorporated by reference herein in their entireties.
FIELD OF THE INVENTION
[0003] The present invention generally relates to sound
recognition. More particularly, the present invention relates to
modeling audio signals for speech recognition, sound encoding and
decoding, and artificial sound synthesis.
BACKGROUND OF THE INVENTION
[0004] In recent years, automatic speech recognition (ASR) systems
have been employed in a wide variety of areas, such as, for
example, telephone dialing, directory assistance, order entry, home
banking, database inquiry, and dictation. For example, cellular
telephones commonly employ ASR systems to simplify the user
interface. Using ASR systems, many cellular telephones recognize
and execute commands to initiate an outgoing phone call or answer
an incoming phone call. For example, a cellular telephone having an
ASR system may recognize a spoken name from a phone book or a
contact list and automatically initiate a phone call to the phone
number associated with the spoken name.
[0005] In an ASR system, a user speaks into a microphone (i.e.,
inputs a speech signal). The inputted analog signal is digitized
and the blocks of digital data are then transformed from the time
domain into the frequency domain using a digital signal processing
(DSP) chip. Once the ASR system has digitized the signal and
calculated certain parameters, the system compares the signal to a
library of known phrases and finds the closest match.
[0006] To extract the features from the signal for comparison with
data in the library, such ASR systems generally use short-term
spectral features, such as mel-frequency cepstral (i.e.,
frequency-related) coefficients (MFCC). MFCCs are based on a Fast
Fourier Transform (FFT), which converts the inputted signal from a
time domain representation to a frequency domain representation.
The MFCC representation is an example of an approach that further
analyzes the FFT of the signal. The MFCC representation is
generated by using a mathematical transformation called the cepstrum,
which computes the inverse Fourier transform of the log-spectrum of
the speech signal.
[0007] These ASR systems uniformly employ short-time spectral
analysis, usually over windows of about 10 to 30 milliseconds, as
the basis for acoustic representations. It should be noted,
however, that the detailed time structure below this timescale is
lost and the time structure above this level is weakly represented
in the form of deltas. The temporal structure in sub-10 millisecond
transient segments contains important cues for both the perception
of natural sounds as well as the understanding of stop bursts in
speech. The gross temporal distribution of acoustic energy in
windows of up to 1 second is a successful domain for the
recognition of complete phonemes and the description of their
dynamics. Thus, while the spectral structures resulting from the
spectral analysis convey important linguistic information, they are
only a partial representation of speech signals.
[0008] Other feature extraction techniques, such as, for example,
dynamic (delta) features and relative spectra processing technique
(RASTA), have been adopted as post-processing techniques that
operate on sequences of the short-term feature vectors. Such
techniques provide a "locally-global" view in which features to be
used in classification are based upon a speech segment of about one
syllable's length.
[0009] Accordingly, it is desirable to provide systems and methods
that overcome these and other deficiencies of the prior art.
SUMMARY OF THE INVENTION
[0010] In accordance with the present invention, computer
implemented methods and systems are provided for representing and
modeling the temporal structure of audio signals.
[0011] In accordance with some embodiments of the present
invention, computer implemented methods and systems of extracting
speech features from signals for use in performing automatic speech
recognition are provided. In response to receiving a signal, a
time-to-frequency domain transformation on at least a portion of
the received signal to generate a frequency domain representation
is performed. The time-to-frequency domain transformation converts
the signal from a time domain representation to the frequency
domain representation. A frequency domain linear prediction (FDLP)
is performed on the frequency domain representation to estimate a
temporal envelope of the frequency domain representation. Based on
the temporal envelope, one or more speech features are
generated.
[0012] In some embodiments, the time-to-frequency domain
transformation is performed by applying a discrete cosine transform
(DCT) or a discrete Fourier transform on the portion of the
received signal.
[0013] In some embodiments, the frequency domain linear prediction
may include selecting a temporal window to apply the linear
prediction and automatically determining a pole rate to distribute
poles for modeling the temporal envelope. The poles generally
characterize the temporal peaks of the temporal envelope. The pole
rate may be automatically determined to capture both gross
variation and stop burst transients of the signal.
[0014] In some embodiments, an index of sharpness may be extracted
from each of the poles. The index of sharpness of the FDLP poles
$\{\rho_i\}$ is defined as
$$\hat{\rho}_i = \frac{1}{1 - |\rho_i|}.$$
[0015] In some embodiments, the frequency domain linear prediction
is performed by estimating the square of the Hilbert envelope of
the signal or calculating the inverse Fourier transform of the
magnitude-squared Fourier transform of a portion of the frequency
domain representation raised to a given power. When the given power
is 1, the autocorrelation of the single sided (positive frequency)
spectrum is calculated. Alternatively, when the given power is not
1, the pseudoautocorrelation is calculated. The autocorrelation of
the spectral coefficients may be used to predict the temporal
envelope of the signal.
[0016] In accordance with some embodiments of the present
invention, the frequency domain representation may be divided into
a plurality of frequency bands. A FDLP polynomial may then be
fitted to each of the plurality of frequency bands. Temporal
envelopes may be extracted from each of the plurality of frequency
bands using the fitted FDLP polynomial.
[0017] In some embodiments, the frequency domain representation may
be divided by logarithmically splitting the frequency domain
representation into the plurality of frequency bands.
[0018] In accordance with some embodiments of the present
invention, computer implemented methods and systems of extracting
speech features from signals are provided. In response to receiving
a signal, a time-to-frequency domain transformation on at least a
portion of the received signal to generate a frequency domain
representation is performed. The time-to-frequency domain
transformation converts the signal from a time domain
representation to the frequency domain representation. The
frequency domain representation may be divided into a plurality of
frequency bands and a FDLP polynomial may be fitted to each of the
plurality of frequency bands. Temporal envelopes may be extracted
from each of the plurality of frequency bands using the fitted FDLP
polynomial. Spectral envelopes may be constructed by taking
simultaneous points in the temporal envelopes. A smooth envelope
may be fitted to each of the spectral envelopes. Based on the
temporal and spectral envelopes, one or more speech features are
generated. This is sometimes referred to herein as "PLP²
modeling."
[0019] These methods and systems for modeling the temporal
structure of the signal may be used to improve sound recognition
(in particular, speech recognition), sound encoding and decoding,
and artificial sound synthesis.
[0020] There has thus been outlined, rather broadly, the more
important features of the invention in order that the detailed
description thereof that follows may be better understood, and in
order that the present contribution to the art may be better
appreciated. There are, of course, additional features of the
invention that will be described hereinafter and which will form
the subject matter of the claims appended hereto.
[0021] In this respect, before explaining at least one embodiment
of the invention in detail, it is to be understood that the
invention is not limited in its application to the details of
construction and to the arrangements of the components set forth in
the following description or illustrated in the drawings. The
invention is capable of other embodiments and of being practiced
and carried out in various ways. Also, it is to be understood that
the phraseology and terminology employed herein are for the purpose
of description and should not be regarded as limiting.
[0022] As such, those skilled in the art will appreciate that the
conception, upon which this disclosure is based, may readily be
utilized as a basis for the designing of other structures, methods
and systems for carrying out the several purposes of the present
invention. It is important, therefore, that the claims be regarded
as including such equivalent constructions insofar as they do not
depart from the spirit and scope of the present invention.
[0023] These together with other objects of the invention, along
with the various features of novelty which characterize the
invention, are pointed out with particularity in the claims annexed
to and forming a part of this disclosure. For a better
understanding of the invention, its operating advantages and the
specific objects attained by its uses, reference should be had to
the accompanying drawings and descriptive matter in which there are
illustrated preferred embodiments of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] Various objects, features, and advantages of the present
invention can be more fully appreciated with reference to the
following detailed description of the invention when considered in
connection with the following drawings, in which like reference
numerals identify like elements.
[0025] FIG. 1 is a simplified illustration of a spectrogram of a
speech sample and a spectrogram of the discrete cosine
transformation (DCT) of the speech sample in accordance with some
embodiments of the present invention.
[0026] FIG. 2 is a simplified illustration of one example of a
waveform and temporal envelopes of the waveform with various poles
in accordance with some embodiments of the present invention.
[0027] FIG. 3 is a simplified illustration of one example of a
subband frequency-domain linear prediction (FDLP) in accordance
with some embodiments of the present invention.
[0028] FIG. 4 is a simplified illustration of one example of a
waveform, a temporal envelope of the waveform modeled by FDLP, and
a Gaussian window of the waveform in accordance with some
embodiments of the present invention.
[0029] FIG. 5 is a simplified illustration of one example of a
spectrogram of the speech sample, a per-frame maximum of the
temporal envelope of the sample extracted in each band by FDLP, and
sharpness index features in accordance with some embodiments of the
present invention.
[0030] FIG. 6 shows the comparison between word-level confusion
matrices in accordance with some embodiments of the present
invention.
[0031] FIG. 7 is a simplified illustration of one example of a
subband FDLP and one example of a PLP² in accordance with some
embodiments of the present invention.
[0032] FIG. 8 is a simplified illustration of PLP² having pole
locations in accordance with some embodiments of the present
invention.
[0033] FIG. 9 shows the mean-squared differences between the
log-magnitude surfaces obtained in successive iterations of the
PLP² analysis in accordance with some embodiments of the
present invention.
[0034] FIG. 10 is a simplified flowchart illustrating the steps
performed in using frequency domain linear prediction to estimate
the temporal envelope of a frequency domain representation in
accordance with some embodiments of the present invention.
[0035] FIG. 11 is a simplified flowchart illustrating the steps
performed in combining the temporal information extracted by FDLP
with spectral information extracted by PLP to extract one or more
speech features in accordance with some embodiments of the present
invention.
[0036] FIG. 12 is a schematic diagram of an illustrative system
suitable for implementation of an application that uses the
temporal structure model in accordance with some embodiments of the
present invention.
[0037] FIG. 13 is a detailed example of the server and one of the
workstations that may be used in accordance with some embodiments
of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0038] The following description includes many specific details.
The inclusion of such details is for the purpose of illustration
only and should not be understood to limit the invention. Moreover,
certain features which are well known in the art are not described
in detail in order to avoid complication of the subject matter of
the present invention. In addition, it will be understood that
features in one embodiment may be combined with features in other
embodiments of the invention.
[0039] In accordance with the present invention, computer
implemented methods and systems are provided for representing and
modeling the temporal structure of audio signals. More
particularly, the methods and systems provide a compact
representation of an audio signal that includes substantial detail
about its temporal structure such that accurate modeling,
classification, recognition, and/or resynthesis may be
performed.
[0040] In some embodiments, a representation of the temporal
envelope in different frequency bands is provided by exploring the
dual of linear prediction when applied in the transform domain.
With this technique of frequency domain linear prediction, the
poles of the model describe temporal, rather than spectral, peaks.
By using analysis windows on the order of hundreds of milliseconds,
a processor may perform a procedure that automatically determines
how to distribute poles or the pole rate to best model the temporal
structure within the window. By taking an index describing the
sharpness of individual poles within a window, a substantial
improvement to the word error rate is shown.
[0041] Using the representation of the temporal envelope, the
processor may adaptively capture fine temporal nuances with
millisecond accuracy while at the same time summarizing the signal's
gross temporal evolution over timescales of about 500 milliseconds
or more. Fine time-adaptive accuracy may be used to pinpoint
significant moments in time such as, for example, those associated
with transient events like stop bursts. At the same time, the
long-timescale summarization power of temporal envelopes provides
the ability to train, for example, speech recognizers on complete
linguistic units lasting longer than 10 milliseconds and to learn
acoustically-feasible phoneme sequences.
[0042] The representation of the temporal envelope of a signal is
created generally by applying a discrete cosine transform (DCT) on
long time frames and a frequency domain linear prediction (FDLP) on
the output of the DCT.
[0043] The DCT generally appears as a post-processing step in
feature extractors for automatic speech recognition. The forward
DCT of an N point real sequence x[n] may be defined as:
$$X_{DCT}[k] = a[k] \sum_{n=0}^{N-1} x[n] \cos\left(\frac{(2n+1)\pi k}{2N}\right)$$
where k=0, 1, . . . , N-1 and
$$a[k] = \begin{cases} 1 & k = 0 \\ \sqrt{2} & k = 1, 2, \ldots, N-1 \end{cases}$$
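The DCT definition above can be evaluated directly as a matrix product. The following is a minimal NumPy sketch, not the inventors' implementation; the function name `dct_forward` is hypothetical, and the scale factor a[k] is assumed to be 1 for k=0 and the square root of 2 otherwise.

```python
import numpy as np

def dct_forward(x):
    """Forward DCT of a real sequence, evaluated term by term from the definition.

    Assumption: a[0] = 1 and a[k] = sqrt(2) for k >= 1.
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    n = np.arange(N)                      # time index (columns)
    k = np.arange(N)[:, None]             # DCT index (rows)
    basis = np.cos((2 * n + 1) * np.pi * k / (2 * N))   # shape (N, N)
    a = np.full(N, np.sqrt(2.0))
    a[0] = 1.0
    return a * (basis @ x)

x = np.array([1.0, 2.0, 3.0, 4.0])
X = dct_forward(x)
# X[0] is the plain sum of the samples, because cos(0) = 1 and a[0] = 1.
```

With this scaling the transform preserves energy up to a factor of N, so the sum of the squared coefficients equals N times the sum of the squared samples.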
[0044] In some embodiments, the DCT may be used to approximate the
envelope of the discrete Fourier transform (DFT). Denoting as
$X_{DFT}[k]$ the DFT of a length-2N zero-padded version of x[n],
it has been determined that the envelope of the DCT is bounded by
the envelope of the zero-padded DFT, and the two are related by:
$$X_{DCT}[k] = a[k]\, |X_{DFT}[k]| \cos\left(\theta[k] - \frac{\pi k}{2N}\right)$$
where k=0, 1, . . . , N-1, and $|X_{DFT}[k]|$ and $\theta[k]$ are
the magnitude and phase of the zero-padded DFT, respectively.
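The relation between the DCT and the zero-padded DFT can be checked numerically. This sketch compares the DCT computed from its definition with the value obtained from the magnitude and phase of a length-2N FFT; the test signal is arbitrary random data, and a[k] is again assumed to be 1 for k=0 and the square root of 2 otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N)            # arbitrary real test sequence

k = np.arange(N)
n = np.arange(N)
a = np.where(k == 0, 1.0, np.sqrt(2.0))   # assumed scale factor a[k]

# DCT evaluated directly from its definition.
X_dct = a * (np.cos((2 * n[None, :] + 1) * np.pi * k[:, None] / (2 * N)) @ x)

# DFT of the length-2N zero-padded sequence; keep bins k = 0..N-1.
X_dft = np.fft.fft(x, 2 * N)[:N]
mag = np.abs(X_dft)
theta = np.angle(X_dft)

# DCT reconstructed from the DFT magnitude and phase.
X_rel = a * mag * np.cos(theta - np.pi * k / (2 * N))
```

Because the cosine factor never exceeds 1 in magnitude, `a * mag` bounds `abs(X_dct)` from above, which is the sense in which the DFT envelope bounds the DCT envelope.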
[0045] FIG. 1 is a simplified illustration of a spectrogram of a
speech sample and a spectrogram of the discrete cosine
transformation (DCT) of the speech sample in accordance with some
embodiments of the present invention. As shown in FIG. 1,
spectrogram 110 is of a 2 second speech sample and spectrogram 120
is of a DCT transform of the whole sample (treating the DCT output
sequence as a sequence in time). It should be noted that while the
DCT spectrogram 120 appears to be a mirror image of the regular
spectrogram 110, it is not an exact mirror image because of the
cosine modulating term in the above-mentioned equation.
[0046] FDLP, the frequency domain dual of time domain linear
prediction (TDLP), is the part of the model that provides the
time-adaptive behavior. TDLP is fully familiar to those of ordinary
skill in the art. Applying FDLP analysis estimates the temporal
envelope of the signal, in particular the square of its Hilbert
envelope,
$$e(t) = \mathcal{F}^{-1}\left\{ \int \tilde{X}(\zeta)\, \tilde{X}^{*}(\zeta - f)\, d\zeta \right\}$$
That is, the squared Hilbert envelope is the inverse Fourier
transform of the autocorrelation of the single sided (positive
frequency) spectrum $\tilde{X}(f)$. The autocorrelation of the
spectral coefficients may therefore be used to predict the temporal
envelope of the signal.
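The discrete-time counterpart of this identity can be verified with a few lines of NumPy. The sketch below builds the one-sided spectrum of a random test signal, computes the squared Hilbert envelope directly from the analytic signal, and compares it with the inverse FFT of the circular autocorrelation of that spectrum. The helper name `analytic_spectrum` is hypothetical, and the 1/N factor comes from NumPy's FFT normalization.

```python
import numpy as np

def analytic_spectrum(x):
    """One-sided (positive-frequency) spectrum of a real, even-length signal."""
    N = x.size
    X = np.fft.fft(x)
    H = np.zeros(N)
    H[0] = 1.0               # keep DC
    H[N // 2] = 1.0          # keep Nyquist
    H[1:N // 2] = 2.0        # double positive frequencies, drop negative ones
    return X * H

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N)
A = analytic_spectrum(x)

# Squared Hilbert envelope: squared magnitude of the analytic signal.
env_direct = np.abs(np.fft.ifft(A)) ** 2

# Circular autocorrelation of the one-sided spectrum, R[f] = sum_c A[c] conj(A[c - f]).
R = np.array([np.sum(A * np.conj(np.roll(A, f))) for f in range(N)])

# Inverse transform of the autocorrelation (1/N accounts for FFT scaling).
env_autocorr = np.real(np.fft.ifft(R)) / N
```

The two envelopes agree to numerical precision, which is why a predictor fitted to the spectral autocorrelation models the temporal envelope.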
[0047] In some embodiments, the frequency domain linear prediction
is performed by calculating the inverse Fourier transform of the
magnitude-squared Fourier transform of a portion of the frequency
domain representation raised to a given power. When the given power
is 1, the autocorrelation is calculated (as shown in the equation
above). When the given power is not 1, the pseudoautocorrelation is
calculated.
[0048] FIG. 2 is a simplified illustration of one example of a
waveform 210 and temporal envelopes 220, 230, and 240 of the
waveform 210 with various poles 250 in accordance with some
embodiments of the present invention. FIG. 2 shows a 256
millisecond long speech segment at an 8 kHz sample rate. After using
the processor to take the 2048 point DCT of the whole sample, the
processor fits a single FDLP polynomial to the DCT and then
extracts the temporal envelope of the segment. Note that FIG. 2
shows the tradeoffs involved in model order selection (defining
pole rate). When the processor generates an envelope 220 having 10
poles, the resulting envelope is too smooth and provides only a
loose approximation. On the other hand, when the processor
generates an envelope 240 having 40 poles, the resulting envelope
is starting to fit the pitch pulses, which is generally something
to avoid for English-language automatic speech recognition. When
the processor generates an envelope 230 having 20 poles, this
resulting envelope strikes a good balance as it captures both the
gross variation as well as the stop burst transients in the
beginning of the sample. Envelope 230 has a pole rate of 20 poles
per 256 milliseconds or about 0.1 poles/ms. In accordance with some
embodiments of the present invention, 20 poles per 256 milliseconds
or a pole rate of 0.1 poles/millisecond is advantageously used in
order to generate the model. It should be noted, however, that the
poles are distributed adaptively within the 256 ms window, thereby
providing flexibility to the model. It should also be noted that
any suitable number of poles or pole rate may be determined and
used by the processor.
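The full-band procedure described for FIG. 2 (take the DCT of the whole segment, fit one predictor to the DCT coefficients, read off the temporal envelope) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the patented implementation: the function names, the small `ridge` regularization, the direct solve of the normal equations, and the synthetic test burst are all hypothetical choices, and the envelope is recovered only up to an overall scale.

```python
import numpy as np

def dct_via_fft(x):
    """DCT with a[0]=1, a[k]=sqrt(2) (assumed), computed from a length-2N FFT."""
    N = x.size
    k = np.arange(N)
    a = np.where(k == 0, 1.0, np.sqrt(2.0))
    X = np.fft.fft(x, 2 * N)[:N]
    return a * np.real(np.exp(-1j * np.pi * k / (2 * N)) * X)

def lp_fit(seq, order, ridge=1e-9):
    """Autocorrelation-method linear prediction on a real sequence.

    Returns the polynomial A = [1, -a1, ..., -ap] and the prediction error gain.
    """
    full = np.correlate(seq, seq, mode="full")
    r = full[seq.size - 1: seq.size + order].copy()   # lags 0..order
    r[0] *= 1.0 + ridge                               # hypothetical tiny regularizer
    lag = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    coeffs = np.linalg.solve(r[lag], r[1:])           # Toeplitz normal equations
    gain = r[0] - coeffs @ r[1:]
    return np.concatenate(([1.0], -coeffs)), gain

def fdlp_envelope(x, order=20):
    """All-pole estimate of the squared temporal envelope of x."""
    X = dct_via_fft(np.asarray(x, dtype=float))
    A, gain = lp_fit(X, order)
    N = X.size
    # Angle pi*j/N on the unit circle corresponds to time sample j.
    response = np.fft.rfft(A, 2 * N)[:N]
    return gain / np.abs(response) ** 2

# Synthetic 256 ms segment at 8 kHz with a short burst centered on sample 800.
rng = np.random.default_rng(0)
N = 2048
n = np.arange(N)
burst = np.exp(-0.5 * ((n - 800) / 16.0) ** 2) * np.sin(2 * np.pi * 0.2 * n)
x = burst + 0.01 * rng.standard_normal(N)

env = fdlp_envelope(x, order=20)
peak = int(np.argmax(env))      # expected to fall close to sample 800
```

Changing `order` to 10 or 40 reproduces the tradeoff discussed above: fewer poles give a smoother envelope, while more poles follow finer temporal detail.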
[0049] FIG. 3 is a simplified illustration of one example of a
subband frequency-domain linear prediction (FDLP) in accordance
with some embodiments of the present invention. In FIG. 3, the same
256 ms long sample is used, but the processor applies FDLP on four
logarithmically-split octave bands 310, 320, 330, and 340. More
particularly, each band represents a range of frequencies: 0-0.5,
0.5-1, 1-2, and 2-4 kHz, respectively. It should be noted that the
same pole rate of 20 poles per 256 ms for each band is used. It
should also be noted that the high frequency band is resolving the
transient while the low frequency band is capturing the gross
spectral variation. This approach is sometimes referred to herein
as "subband FDLP." By transforming longer 256 ms blocks of signal
(which is extensible to seconds or more), enough variation is
captured to manifest itself as significantly different temporal
envelopes between bands.
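Subband FDLP as described for FIG. 3 can be sketched by slicing the DCT coefficients into frequency ranges and fitting a separate predictor to each slice. The code below is a hedged illustration: the function names and the `ridge` term are hypothetical, the band edges are converted from the stated frequency ranges (0-0.5, 0.5-1, 1-2, and 2-4 kHz) to DCT indices on the assumption that the DCT index spans 0 to half the sample rate linearly, and each envelope is known only up to scale.

```python
import numpy as np

def dct_via_fft(x):
    """DCT with a[0]=1, a[k]=sqrt(2) (assumed), computed from a length-2N FFT."""
    N = x.size
    k = np.arange(N)
    a = np.where(k == 0, 1.0, np.sqrt(2.0))
    X = np.fft.fft(x, 2 * N)[:N]
    return a * np.real(np.exp(-1j * np.pi * k / (2 * N)) * X)

def lp_envelope(seq, order, n_points, ridge=1e-9):
    """Fit an all-pole model to seq and sample its response at n_points angles."""
    full = np.correlate(seq, seq, mode="full")
    r = full[seq.size - 1: seq.size + order].copy()   # lags 0..order
    r[0] *= 1.0 + ridge                               # hypothetical tiny regularizer
    lag = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    coeffs = np.linalg.solve(r[lag], r[1:])
    gain = r[0] - coeffs @ r[1:]
    A = np.concatenate(([1.0], -coeffs))
    return gain / np.abs(np.fft.rfft(A, 2 * n_points)[:n_points]) ** 2

def subband_fdlp(x, fs=8000, edges_hz=(0, 500, 1000, 2000, 4000), order=20):
    """One temporal envelope per band; rows share the same time axis."""
    x = np.asarray(x, dtype=float)
    N = x.size
    X = dct_via_fft(x)
    # Map band edges in Hz to DCT indices (index N corresponds to fs / 2).
    idx = [int(round(f / (fs / 2) * N)) for f in edges_hz]
    return np.stack([lp_envelope(X[lo:hi], order, N)
                     for lo, hi in zip(idx[:-1], idx[1:])])

rng = np.random.default_rng(3)
x = rng.standard_normal(2048)          # stand-in for a 256 ms frame at 8 kHz
envs = subband_fdlp(x)                 # shape (4, 2048): bands by time samples
```

The same 20-pole order is used in every band, matching the pole rate stated above, so narrower bands are modeled from fewer coefficients.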
[0050] This approach provides a new parameter space from which
features may be extracted for use in, for example, automatic speech
recognition. There are many approaches in which the above-mentioned
temporal envelope information modeled in FDLP may be converted into
features for use in speech recognizers.
[0051] In some embodiments, the temporal envelopes may be used
directly. The envelopes as shown in FIG. 2 are sampled DFTs of the
impulse responses (IR) of the all-pole filters that have been fit
to the frequency domain. The basic linear prediction may be
suitable for direct transformation into temporal-based features
such as modulation spectra. In addition, relationships such as the
direct transformation from prediction coefficients to cepstra may
provide decorrelated features describing the temporal behavior in
different subbands.
[0052] In another suitable embodiment, features may be derived from
each individual pole in the model (i.e., the roots of the predictor
polynomial). The angle of the pole on the z-plane corresponds to
accurate timing information and the magnitude of the pole may
provide knowledge about the energy of the signal. It should be
noted that this is a smoothed approximation to the true Hilbert
envelope. The sharpness of the pole (i.e., how closely it
approaches the unit circle) relates to the dynamics of the
envelope. For example, a sharper pole indicates more rapid
variation of the envelope at that time.
[0053] The index of sharpness of the FDLP poles $\{\rho_i\}$ is
defined by:
$$\hat{\rho}_i = \frac{1}{1 - |\rho_i|}.$$
As pole magnitudes grow from zero toward the unit circle,
$\hat{\rho}_i$ grows from 1 to an unbounded large positive value.
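As a small worked example of the sharpness index, the poles can be obtained as the roots of the predictor polynomial and mapped through 1/(1 - |pole|). The function name `pole_sharpness` and the convention that the polynomial is stored as [1, -a1, ..., -ap] are assumptions made for this sketch.

```python
import numpy as np

def pole_sharpness(A):
    """Return the poles of 1/A(z) and their sharpness index 1 / (1 - |pole|).

    A is the predictor polynomial [1, -a1, ..., -ap] (assumed convention).
    The index is meaningful only for poles strictly inside the unit circle.
    """
    poles = np.roots(A)
    return poles, 1.0 / (1.0 - np.abs(poles))

# A single real pole at 0.9 has sharpness 1 / (1 - 0.9), about 10.
poles_a, sharp_a = pole_sharpness([1.0, -0.9])

# A pole at 0.5 is further from the unit circle and has sharpness 2.
poles_b, sharp_b = pole_sharpness([1.0, -0.5])
```

Poles near the origin give values close to 1, and the index grows without bound as a pole approaches the unit circle.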
[0054] For each analysis frame in time, the full DCT is taken and
FDLP is performed on 4 log bands using 20 poles per band. The
choice of a 256 ms analysis window (2048 samples at 8 kHz) is,
without loss of generality, dictated by computational
considerations. Subbands are formed by breaking up the DCT into
subranges that are exact powers of two (e.g., 128, 256, 512 and
1024 points for a 4-way split). After modeling with 20 poles per
band per frame, the processor calculates the sharpness index. The
sharpness indices may be scaled using a Gaussian window 410 to
achieve a finer time resolution than the 256 ms window, as
illustrated in FIG. 4, and the maximum value in each band in each
frame is retained. The purpose of the window is to localize the
sharpness values in the vicinity of the center of the frame. FIG. 5
visually compares these pole sharpness features with direct
measures of the subband energy. After examining the distributions
of the sharpness parameters, a logarithmic transform was added to
make the distributions closer to Gaussian, and thus a better match
to the statistical models.
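One possible reading of the per-band feature described above is sketched below: each pole's sharpness is weighted by a Gaussian centered on the frame according to the pole's time position (taken from its angle), the maximum is kept, and a logarithm is applied. The function name, the window width `sigma`, and the use of only the upper-half-plane poles are assumptions; the source does not specify these details.

```python
import numpy as np

def sharpness_feature(poles, sigma=0.15):
    """Log of the largest Gaussian-weighted sharpness index in one band and frame.

    sigma is a hypothetical window width, expressed as a fraction of the frame.
    """
    poles = np.asarray(poles, dtype=complex)
    upper = poles[poles.imag >= 0]            # one pole from each conjugate pair
    position = np.angle(upper) / np.pi        # 0 = frame start, 1 = frame end
    sharp = 1.0 / (1.0 - np.abs(upper))       # sharpness index per pole
    weight = np.exp(-0.5 * ((position - 0.5) / sigma) ** 2)
    return np.log(np.max(weight * sharp))

# Two conjugate pole pairs: one at mid-frame, one sharper but near the frame start.
center = 0.95 * np.exp(1j * np.pi * 0.5)
early = 0.99 * np.exp(1j * np.pi * 0.1)
poles = np.array([center, np.conj(center), early, np.conj(early)])

feature = sharpness_feature(poles)
```

In this example the mid-frame pole (sharpness 20) wins over the sharper early pole (sharpness 100) because the window strongly attenuates poles far from the frame center, which illustrates the localization role of the window.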
[0055] Using a conventional HTK recognizer and Gaussian Mixture
Model-Hidden Markov Model (GMM-HMM) models that are trained on a
mixture of conversational and read speech using a combination of
Switchboard, Callhome, and Macrophone databases, the temporal
envelope modeled in FDLP was tested.
TABLE 1. Recognition word error rate (WER) results.
Features | raw 20 k | pad 85 k |
PLP12 | 4.97% | 2.75% |
FDLP-4log | 4.08% | 2.90% |
FDLP-2log + dct | 3.81% | 2.82% |
FDLP-3log + dct | | 2.61% |
FDLP-4log + dct | | 2.63% |
FDLP-5log + dct | | 2.69% |
FDLP-8bark + dct | 4.38% | |
Table 1 shows the recognition word error rate (WER) results. The
first line, "PLP12", is the baseline system employing 12th order
PLP features (plus deltas and double deltas). Subsequent systems
augment these features with FDLP sharpness features in various
guises. "FDLP-4log" adds four elements to each feature vector,
derived from 4 logarithmically-spaced octave subbands (e.g., 0-500
Hz, 500 Hz-1 kHz, 1-2 kHz, and 2-4 kHz). It should be noted that
performing a final DCT decorrelation on each frame of FDLP features
improved recognition, as shown in the "FDLP-Xlog+dct" lines.
Between two and five octave bands (where two bands cover 0-2 kHz and
2-4 kHz, and five bands extend down to a 0-250 Hz band) were used to
find the best compromise between signal detail and model accuracy
(since narrow frequency bands contain fewer frequency samples with
which to estimate the linear prediction parameters). In some
embodiments, dividing the frequency axis on a Bark scale, which is
fully familiar to those of ordinary skill in the art, may be used
to allow the use of more bands (since Bark bands do not get narrow
so quickly in the low frequencies).
[0056] In some embodiments, padding each end of our test utterances
with 100 ms of artificial background noise silence may be
beneficial. In some embodiments, all test set utterances marked as
coming from the same speaker may be normalized. Such changes may
improve the WER from about 4.97% to about 2.75%.
[0057] For the "raw 20 k" system, it should be noted that any kind
of FDLP-derived information improved word error rate, with the
greatest improvement coming from augmenting the PLP features with
decorrelated 4 octave-subband FDLP sharpness features
("FDLP-4log+dct"). The WER changed from 4.97% to 3.81%, which
represents a 23.3% relative improvement. With the larger,
better-performing "pad 85 k" system, the improvements from FDLP were
smaller, with the best improvement from a 2.75% baseline WER to
2.61% for 3-subband decorrelated features ("FDLP-3log+dct"),
constituting a 5% relative improvement.
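The relative improvements quoted above follow from the usual formula, the WER reduction divided by the baseline WER. A short check, using only the figures stated in the text (the function name is hypothetical):

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

raw_20k = relative_improvement(4.97, 3.81)   # "raw 20 k" system
pad_85k = relative_improvement(2.75, 2.61)   # "pad 85 k" system
print(f"{raw_20k:.1f}% {pad_85k:.1f}%")
```

This gives about 23.3% for the "raw 20 k" system and about 5.1% for the "pad 85 k" system, consistent with the rounded figures in the text.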
[0058] FIG. 6 compares the word-level confusion matrices for the
baseline "raw 20 k" PLP system and for the best performing
"FDLP-4log+dct" system. Looking at the absolute differences in error
counts (middle pane), the greatest differences are seen for the
words "four" (fewer confusions with "forty"), "eight" and "six"
(fewer deletions), and "five" (fewer confusions with "nine"). It
should be noted that most of these main differences involve stops
(/t/ in "eight" and "forty", and /k/ in "six"), which is consistent
with the initial motivation for the FDLP sharpness features, namely
capturing information about short-duration transients in the speech
signal.
[0059] Accordingly, FDLP analysis is advantageous because of its
ability to describe temporal structure without frame-rate
quantization, and its rich and flexible representation of temporal
structure in the form of poles. This flexible, adaptive
representation of the temporal structure may be analyzed across the
full-band or for arbitrarily-spaced subbands, and presents many
possibilities for advanced speech recognition features.
[0060] Perceptual Linear Prediction (PLP) is another auditory-based
approach to feature extraction. In contrast to pure linear
predictive analysis of speech, PLP generally uses several
perceptually motivated transforms including Bark frequency, masking
curves, etc. to modify the short-term spectrum of the speech. In
accordance with some embodiments of the present invention, the
temporal information extracted by FDLP may be combined with
spectral information extracted using PLP.
[0061] As described above, a squared Hilbert envelope (the squared
magnitude of the analytic signal) represents the total
instantaneous energy in a signal, while the squared Hilbert
envelopes of sub-band signals are a measure of the instantaneous
energy in the corresponding sub-bands. Deriving these Hilbert
envelopes generally involves either using a Hilbert operator in the
time domain (which is difficult in practice because of its
doubly-infinite impulse response), or the use of two Fourier
transforms with modifications to the intermediate spectrum.
Alternatively, an all-pole approximation of the Hilbert envelope
may be calculated by computing a linear predictor for the positive
half of the Fourier transform of an even-symmetrized input signal,
which is equivalent to computing the predictor from the cosine
transform of the signal. Such FDLP is the frequency-domain dual of
the well-known TDLP. Similar to how the TDLP fits the power
spectrum of an all-pole model to the power spectrum of a signal,
FDLP fits a "power spectrum" of an all-pole model (e.g., in the
time domain) to the squared Hilbert envelope of the input signal.
To obtain such a model for a specific sub-band, the prediction may
be based only on the corresponding range of coefficients from the
original Fourier transform.
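As a rough illustration of the FDLP computation described above, the following Python sketch fits an all-pole model to the cosine transform of a segment and evaluates the model's "power spectrum" along the time axis. The function names (`dct_ii`, `autocorr`, `levinson`, `fdlp_envelope`), the unnormalised DCT-II, and the biased autocorrelation estimate are illustrative assumptions, not details taken from this disclosure.

```python
import cmath
import math

def dct_ii(x):
    """Unnormalised DCT-II computed directly (O(N^2), adequate for a sketch)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
            for k in range(n)]

def autocorr(c, order):
    """Biased autocorrelation of the sequence c for lags 0..order."""
    n = len(c)
    return [sum(c[i] * c[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order, n_points):
    """All-pole estimate of the squared temporal envelope of x."""
    c = dct_ii(x)                      # frequency-domain representation
    a, err = levinson(autocorr(c, order), order)
    env = []
    for i in range(n_points):
        # The predictor's "frequency" axis spans the time span of the segment.
        w = math.pi * (i + 0.5) / n_points
        A = sum(a[k] * cmath.exp(-1j * w * k) for k in range(order + 1))
        env.append(err / abs(A) ** 2)
    return env
```

Under these assumptions, a segment containing a single click should produce an envelope whose largest value lies close to the click. Passing only a sub-range of the DCT coefficients to the predictor would give the corresponding sub-band envelope.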
[0062] To summarize temporal dynamics, rather than capture every
nuance of the temporal envelope, the all-pole approximation to the
temporal trajectory offers parametric control over the degree to
which the Hilbert envelope is smoothed (e.g., the number of peaks
in the smoothed envelope cannot exceed half the order of the
model).
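The bound on the number of envelope peaks can be checked numerically. The sketch below builds an all-pole model from a hypothetical set of conjugate pole pairs (the radii and angles are made-up values) and counts the interior local maxima of its envelope; the helper names are illustrative only.

```python
import cmath
import math

def poly_from_pole_pairs(radii_angles):
    """Real coefficients of A(z) built from conjugate pole pairs (radius, angle)."""
    a = [1.0]
    for radius, angle in radii_angles:
        # One conjugate pair contributes 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
        pair = [1.0, -2.0 * radius * math.cos(angle), radius * radius]
        out = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):
            for j, pj in enumerate(pair):
                out[i + j] += ai * pj
        a = out
    return a

def allpole_envelope(a, n_points):
    """Sample 1 / |A(e^{jw})|^2 on a grid of n_points over (0, pi)."""
    env = []
    for i in range(n_points):
        w = math.pi * (i + 0.5) / n_points
        A = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        env.append(1.0 / abs(A) ** 2)
    return env

def count_peaks(env):
    """Number of strict interior local maxima."""
    return sum(1 for i in range(1, len(env) - 1)
               if env[i - 1] < env[i] > env[i + 1])
```

With three well-separated pole pairs (a 6th-order model), the sampled envelope should show three peaks, which matches the limit of half the model order.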
[0063] Having an approach for estimating temporal envelopes in
individual frequency bands of the original signal permits the
construction of a spectrogram-like signal representation. Just as a
typical spectrogram is constructed by appending individual
short-term spectral vectors alongside each other, a similar
representation may be constructed by vertical stacking of the
temporal vectors approximating the individual sub-band Hilbert
envelopes, recalling the outputs of the separate band-pass filters
used to construct the original, analog Spectrograph. This is shown
in FIG. 7. The top panel 710 shows the time-frequency pattern
obtained by short-term Fourier transform analysis and Bark scale
energy binning to 15 critical bands, which is the way the
short-term critical-band spectrum is derived in PLP feature
extraction. The second panel 720 shows the result of PLP smoothing,
with each 15-point vertical spectral slice now smooth and
continuous as a result of being fit with a linear prediction model.
The third panel 730 is based on a series of 24-pole FDLP models, one
for each Bark band, to give estimates of the 15 subband squared
Hilbert envelopes. Similar to PLP, cube-root compression is applied
here to the sub-band Hilbert envelope prior to computing the
all-pole model of the temporal trajectory. The similarity of all
these patterns is evident, but there are also some important
differences: Whereas the binned, short-time spectrogram is `blocky`
in both time and frequency, the PLP model gives a smooth,
continuous spectral profile at each time step. Conversely, the
temporal evolution of the spectral energy in each sub-band is much
smoother in the all-pole FDLP representation, constrained by the
implicit properties of the temporal all-pole model.
[0064] In PLP, an auditory-like critical-band spectrum, obtained as
the weighted summation of the short-term Fourier spectrum followed
by cube-root amplitude compression, is approximated by an all-pole
model in an approach similar to the way that linear prediction
techniques approximate the linear-frequency short-term power
spectrum of a signal. Subband FDLP offers an alternative approach
to estimate the energy in each critical band as a function of time,
raising the possibility of replacing the short-term critical band
spectrum in PLP with this new estimate. In doing so, a new
representation of the critical-band time-frequency plane is
obtained. However, comparing this new representation to the subband
FDLP spectrotemporal pattern (constrained by the all-pole model
along the temporal axis), the all-pole constraint is now along the
spectral dimension of the pattern.
[0065] In some embodiments, the processor may repeat the processing
along the temporal dimension of the new representation to enforce
the all-pole constraints along the time axis. The outcome of such
processing may be subject to another stage of all-pole modeling on
the spectral axis. It should be noted that this alternation may be
iterated until the difference between successive representations is
negligible.
[0066] As a result, the processor provides a two-dimensional
spectro-temporal auditory-motivated pattern that is constrained by
all-pole models along both the time and frequency axes. This is
sometimes referred to herein as "Perceptual Linear Prediction
Squared" or "PLP²." The perceptual constraints are derived
from the use of a critical-band frequency axis and from the use of
a 250 ms critical-timespan interval, whereas the linear prediction
(LP) portion indicates the use of all-pole modeling and the
"squared" portion comes from the use of all-pole models along both
the time and frequency axes.
[0067] When the processor takes the DCT of a 250 ms speech segment
(equivalent to the Fourier transform of the related 500 ms even
symmetric signal) at a sampling rate of 8 kHz, about 2000 unique
values in the frequency domain are generated. The
processor may then divide these into 15 bands with overlapping
Gaussian windows whose widths and spacing select frequency regions
of approximately one Bark, and apply 24th order FDLP separately on
each of the 15 bands such that each predictor approximates the
squared Hilbert envelope of the corresponding sub-band. The
processor computes the critical-band time-frequency pattern within
the 250 ms time span by sampling each all-pole envelope at 240
points (i.e. every 1.04 ms) and stacks the temporal trajectories
vertically. This provides a 2-dimensional array amounting to a
spectrogram, but constructed row-by-row, rather than
column-by-column as in conventional short-term analysis. This
time-frequency pattern is the starting point for further
processing.
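The band-splitting step can be sketched as follows. The Bark warping formula, the Gaussian width `sigma_bark`, and the placement of the band centres are assumptions chosen for illustration; the disclosure only states that the windows select regions of approximately one Bark.

```python
import math

def hz_to_bark(f_hz):
    """Bark warping often used in PLP-style analysis (an assumed choice)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def bark_gaussian_windows(n_coeffs=2000, fs_hz=8000.0, n_bands=15, sigma_bark=0.5):
    """One Gaussian weighting over the DCT coefficients per band."""
    nyquist = fs_hz / 2.0
    # DCT coefficient k is taken to sit at k * nyquist / n_coeffs Hz.
    bark = [hz_to_bark(k * nyquist / n_coeffs) for k in range(n_coeffs)]
    spacing = hz_to_bark(nyquist) / n_bands      # close to 1 Bark at 8 kHz
    windows = []
    for b in range(n_bands):
        centre = (b + 0.5) * spacing
        windows.append([math.exp(-0.5 * ((z - centre) / sigma_bark) ** 2)
                        for z in bark])
    return windows

def apply_window(dct_coeffs, window):
    """Weight the DCT coefficients so that one band dominates."""
    return [c * w for c, w in zip(dct_coeffs, window)]
```

Each windowed coefficient sequence would then be passed to its own 24th-order FDLP fit, and the resulting envelopes stacked row by row.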
[0068] In response to generating the above-mentioned time-frequency
pattern, the processor may compute 240 12th-order time-domain LP
(TDLP) models to model the spectra constituted by the 15 amplitude
values in a vertical slice from the pattern at each of the 240
temporal sample points. The spectral envelopes of these models are
each sampled at 120 points (i.e. every 0.125 Bark) and stacked next
to each other to form a new 240 × 120 = 28,800 point
spectro-temporal pattern. Each horizontal slice of 240 points is
modeled by the same process of mapping a compressed magnitude
"spectrum" to an autocorrelation and then to an all-pole model,
thereby yielding 120 24th-order FDLP approximations to the temporal
trajectories in the new fractional-Bark subbands. Sampling these
models on the same 240 point grid gives the next iteration of the
28,800 point spectro-temporal pattern. The process may then repeat
until it converges after a number of iterations; the
number of iterations required for convergence appears to depend on
the model orders as well as the compression factor in the all-pole
modeling process. The mean-squared difference between the
logarithmic surfaces of the successive spectro-temporal patterns as
a function of the iteration number is shown in FIG. 9, which shows
stabilization after 10 iterations in this example. (Although this
plot shows that the differences between successive iterations do
not decline all the way to zero, it should be noted that the
residual changes in later iterations are immaterial; inspection of
the time frequency distribution of these differences reveals no
significant structure.)
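The alternation between spectral and temporal all-pole fits can be sketched as below. This is a simplified illustration under stated assumptions: the autocorrelation is taken as a cosine transform of the sampled slice, amplitude compression is omitted, and the names `allpole_smooth` and `plp2_iterate` are hypothetical.

```python
import cmath
import math

def levinson(r, order):
    """Levinson-Durbin recursion; returns (a, err) with a[0] == 1."""
    a, err = [1.0] + [0.0] * order, r[0]
    for m in range(1, order + 1):
        k = -(r[m] + sum(a[j] * r[m - j] for j in range(1, m))) / err
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a, err = new_a, err * (1.0 - k * k)
    return a, err

def allpole_smooth(values, order, n_out):
    """Fit an all-pole model to a sampled positive slice and resample it."""
    m = len(values)
    # Autocorrelation as the inverse cosine transform of the slice.
    r = [sum(v * math.cos(lag * math.pi * (i + 0.5) / m)
             for i, v in enumerate(values)) / m
         for lag in range(order + 1)]
    a, err = levinson(r, order)
    out = []
    for i in range(n_out):
        w = math.pi * (i + 0.5) / n_out
        A = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        out.append(err / abs(A) ** 2)
    return out

def transpose(p):
    return [list(col) for col in zip(*p)]

def plp2_iterate(pattern, order_freq, order_time, n_freq_out, n_iter):
    """pattern[band][time] -> (smoothed pattern[freq][time], log-MSE per iteration)."""
    n_time = len(pattern[0])
    current, diffs = pattern, []
    for _ in range(n_iter):
        # Spectral pass: one all-pole fit per vertical slice (time point).
        cols = [allpole_smooth(c, order_freq, n_freq_out) for c in transpose(current)]
        # Temporal pass: one all-pole fit per horizontal slice (frequency row).
        new = [allpole_smooth(r, order_time, n_time) for r in transpose(cols)]
        if len(new) == len(current):   # skipped on the first pass if shapes differ
            diffs.append(sum((math.log(x) - math.log(y)) ** 2
                             for rn, rc in zip(new, current)
                             for x, y in zip(rn, rc)) / (len(new) * n_time))
        current = new
    return current, diffs
```

The returned list of log-domain mean-squared differences plays the role of the convergence curve discussed above; a stopping rule based on a threshold could replace the fixed iteration count.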
[0069] The final panel 740 of FIG. 7 shows the results of the new
PLP² compared with conventional PLP. The increased temporal
resolution in comparison with the 10 ms sampled PLP (second panel
720) is very clear; the second important property of the PLP²
surface is the increased spectral resolution in comparison with the
15 frequency values at each time for the basic FDLP model (third
panel 730).
[0070] In some embodiments, further insight may be obtained by
plotting the pole locations on the time frequency plane. As shown
in FIG. 8, the pole locations may be superimposed on a grayscale
version of the PLP² pattern presented on the 4th panel of FIG.
7. Red dots show the 12 FDLP poles for each of the 120 subband envelope
estimates. Due to the dense frequency sampling, the poles of
adjacent bands are close in value, and the dots merge into
near-vertical curves in the figure. Blue dots are the 6 TDLP poles at
each of the 240 temporal sample points, and merge into
near-horizontal lines.
[0071] The blue TDLP poles track the smoothed formants in the
t=0.14 to 0.24 s region but fail to capture the transient at around
0.08 s. The red FDLP poles, on the other hand, with their emphasis
on temporal modeling, make an accurate description of this
transient. As expected, neither TDLP nor FDLP models track any
energy peaks in the quiet region between 0 and 0.08 s. But, while
the TDLP models for these temporal slices are obliged to place
their poles somewhere in this region, the FDLP models are free to
shift the majority of their poles into the later portion of the
time window, between 0.08 and 0.25 s, where the bulk of the energy
lies.
[0072] In some embodiments, after receiving a 250 ms segment of
speech, the processor divides its DCT into 15 Bark bands. Each band
is fit with a 12th order FDLP polynomial, and the resulting
smoothed temporal envelope is sampled on a 10 ms grid. The
central-most spectral slices are then smoothed across frequency
using the conventional PLP technique. Note, however, that no
further iterations are performed. Instead, the cepstra
resulting from this stage are taken as replacements for the
conventional PLP features and input to the recognizer.
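The disclosure does not spell out how the cepstra are obtained from the smoothed spectral slices. One common choice, assumed here purely for illustration, is the standard recursion from linear prediction coefficients to cepstral coefficients; the function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of 1/A(z) for a = [1, a1, ..., ap], A(z) = 1 + sum a_k z^-k.

    Returns [c1, ..., c_n_ceps]; the gain term c0 is left out.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a first-order model with a single real pole at 0.5, this recursion reproduces the series 0.5^n / n, which offers a quick sanity check.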
[0073] Thus far, these features have indeed shown performance very
close to standard PLP features, achieving word error rates that differ by
less than 2% relative. Although small, these differences are
statistically very significant, and when the results from a
PLP² system are combined with conventional system outputs
using simple word-level voting, a significant improvement in
overall accuracy is achieved.
[0074] Alternatively, techniques for reducing the smoothed energy
surface to a lower-dimensional description appropriate for
statistical classifiers include conventional basis decompositions
such as Principal Component Analysis or two-dimensional DCTs.
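As one possible form of the two-dimensional DCT reduction mentioned above, the sketch below applies a separable, unnormalised DCT-II to a pattern and keeps only the lowest-order coefficients. The function names and the choice of how many coefficients to keep are assumptions; in practice the transform might be applied to the logarithm of the energy surface.

```python
import math

def dct_1d(x):
    """Unnormalised DCT-II of a sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
            for k in range(n)]

def dct_2d_lowpass(pattern, keep_rows, keep_cols):
    """Separable 2-D DCT keeping a keep_rows x keep_cols block of coefficients."""
    # Transform along time within each row, keeping the slowest components.
    rows = [dct_1d(row)[:keep_cols] for row in pattern]
    # Transform along frequency within each retained column.
    cols = [dct_1d(list(col))[:keep_rows] for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

The retained block could then be flattened into a feature vector for a statistical classifier.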
[0075] In another suitable embodiment, the pole locations may be
viewed as a reduced, parametric description of the energy
concentrations. For example, recording the crossing points of the
nearly-continuous time and frequency pole trajectories may provide
a highly compact description of the principal energy peaks in each
250 ms spectro-temporal window.
[0076] Accordingly, a new modeling scheme to describe the time and
frequency structure in short segments of sound is provided. This
approach of all-pole (linear predictive) modeling, applied in both
the time and frequency domains, allows one to smooth this
representation to adaptively preserve the most significant peaks
within this window in both dimensions.
[0077] FIGS. 10 and 11 are generalized flow charts illustrating the
steps performed in the modeling and representing of the temporal
structure of audio signals in accordance with some embodiments of
the present invention. It will be understood that the steps shown
in these figures may be performed in any suitable order, some may
be deleted, and others added.
[0078] FIG. 10 is a simplified flow chart illustrating the steps
performed in extracting speech features from signals by using FDLP
in accordance with some embodiments of the present invention. At
step 1010, the process may receive a signal (e.g., a waveform). In
response to receiving a signal, a time-to-frequency domain
transformation on at least a portion of the received signal to
generate a frequency domain representation is performed at step
1020. The time-to-frequency domain transformation converts the
signal from a time domain representation to the frequency domain
representation. In some embodiments, the time-to-frequency domain
transformation is performed by applying a discrete cosine transform
(DCT) or a discrete Fourier transform on the portion of the
received signal.
[0079] At step 1030, the processor may divide the frequency domain
representation into a plurality of frequency bands. For example,
subbands may be formed by breaking up the frequency domain
representation into subranges. These subbands may be determined by
logarithmically splitting the frequency domain representation into
the plurality of frequency bands.
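A logarithmic split of the coefficient axis might look like the following sketch. The helper names, the default starting index `first_edge`, and the rounding of the band edges are illustrative assumptions.

```python
def log_band_edges(n_coeffs, n_bands, first_edge=1):
    """Integer band edges spaced geometrically from first_edge to n_coeffs."""
    ratio = (n_coeffs / first_edge) ** (1.0 / n_bands)
    edges = [first_edge]
    for b in range(1, n_bands + 1):
        edge = int(round(first_edge * ratio ** b))
        edges.append(max(edge, edges[-1] + 1))   # keep every band non-empty
    edges[-1] = n_coeffs
    return edges

def split_into_bands(coeffs, edges):
    """Slice the coefficient sequence into consecutive sub-ranges."""
    return [coeffs[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]
```

Coefficients below `first_edge` (for example the DC term) are left out of every band in this sketch.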
[0080] At step 1040, the processor may perform a frequency domain
linear prediction (FDLP) on each of the frequency bands by, for
example, fitting a FDLP polynomial. The frequency domain linear
prediction is performed by estimating the square of the Hilbert
envelope of the signal or calculating the inverse Fourier transform
of the magnitude-squared Fourier transform of a portion of the
frequency domain representation raised to a given power. When the
given power is 1, the autocorrelation of the single sided (positive
frequency) spectrum is calculated. Alternatively, when the given
power is not 1, the pseudoautocorrelation is calculated. The
autocorrelation of the spectral coefficients may be used to predict
the temporal envelope of the signal.
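The autocorrelation and pseudoautocorrelation computations described above can be sketched as follows. A direct (slow) DFT is used to keep the example dependency-free, and the zero-padding length and function names are assumptions made for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def pseudo_autocorr(coeffs, order, power=1.0):
    """Inverse DFT of (|DFT|^2) ** power for lags 0..order.

    power == 1 gives the ordinary autocorrelation of coeffs.
    """
    n_fft = 2 * len(coeffs)            # zero-pad so that lags do not wrap around
    padded = list(coeffs) + [0.0] * (n_fft - len(coeffs))
    shaped = [(abs(v) ** 2) ** power for v in dft(padded)]
    r = dft(shaped, inverse=True)
    return [r[lag].real for lag in range(order + 1)]
```

With `power` set to 1 the result should agree with a direct lag-product sum, while other values give the compressed, pseudoautocorrelation variant.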
[0081] In some embodiments, the frequency domain linear prediction
may include selecting a temporal window to apply the linear
prediction and automatically determining a pole rate to distribute
poles for modeling the temporal envelope. The poles generally
characterize the temporal peaks of the temporal envelope. The pole
rate may be automatically determined to capture both gross
variation and stop burst transients of the signal.
[0082] In some embodiments, an index of sharpness may be extracted
from each of the poles. The sharpness of the pole relates to the
dynamics of the temporal envelope. The index of sharpness of the
FDLP poles $\{\rho_i\}$ is defined as

$$\hat{\rho}_i = \frac{1}{1 - |\rho_i|}.$$
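To make the sharpness index concrete, the sketch below finds the poles of a fitted predictor and maps each pole to a sharpness value. It assumes the index is the reciprocal of one minus the pole magnitude, and it uses a Durand-Kerner iteration as a stand-in root finder; both the function names and the root-finding method are illustrative choices, not part of this disclosure.

```python
def poly_roots(a, n_iter=200):
    """Roots of z^p + a[1] z^(p-1) + ... + a[p], assuming a[0] == 1."""
    p = len(a) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(p)]   # conventional starting guesses
    for _ in range(n_iter):
        for i in range(p):
            value = 0j
            for coeff in a:                         # Horner evaluation at roots[i]
                value = value * roots[i] + coeff
            denom = 1 + 0j
            for j in range(p):
                if j != i:
                    denom *= roots[i] - roots[j]
            roots[i] -= value / denom
    return roots

def sharpness(pole):
    """Assumed sharpness index: grows without bound as |pole| approaches 1."""
    return 1.0 / (1.0 - abs(pole))
```

A pole of magnitude 0.9 would give a sharpness of about 10 under this definition, while a pole of magnitude 0.5 would give about 2, so sharper temporal peaks map to larger values.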
[0083] Temporal envelopes may be extracted from each of the
plurality of frequency bands using the fitted FDLP polynomial at
step 1050.
[0084] In some embodiments, the temporal envelope may be used to
generate at least one speech feature. Speech features may be used
for sound recognition (in particular, speech recognition), sound
encoding and decoding, and artificial sound synthesis.
[0085] FIG. 11 is a simplified flowchart illustrating the steps
performed in combining the temporal information extracted by FDLP
with spectral information extracted by PLP to extract one or more
speech features in accordance with some embodiments of the present
invention. In response to extracting temporal envelopes from the
audio signal using FDLP, the process may construct spectral
envelopes by, for example, taking simultaneous points in the
temporal envelopes (step 1110). For example, the processor may
compute time-domain linear prediction models to model the spectra
constituted by the points in the temporal envelopes. In some
embodiments, the processor may iterate the fitting in frequency and
time domains.
[0086] A smooth envelope may be fitted to each of the spectral
envelopes at step 1120. The smoothing of the spectral envelopes may
be achieved by fitting a linear prediction polynomial to each of
the spectral envelopes. This may be performed by calculating the
inverse Fourier transform of the Fourier transform magnitude of the
spectral envelope raised to a given power. In some embodiments, the
spectral envelopes may be modified by a nonlinear warping of the
frequency axis and/or the time axis.
[0087] Based on both the temporal and spectral envelopes, one or
more speech features are generated at step 1130. Speech features
may be used for sound recognition (e.g., speech recognition), sound
encoding and decoding, and artificial sound synthesis. For example,
an ASR system may be tuned for various speech recognition tasks by
using the improved speech features generated by the methods and
systems of the present invention. Some applications in which such
an ASR system with improved speech modeling may be used are, for
example, cellular telephones (e.g., automatic dialing in response
to receiving a voice command), telephone directories, software for
operating a computer, data entry, automobile controls, etc.
[0088] FIG. 12 is a schematic diagram of an illustrative system
1200 suitable for implementation of an application that generates
and uses the temporal structure model for speech recognition, sound
encoding, sound decoding, and sound synthesis in accordance with
some embodiments of the present invention. Referring to FIG. 12, an
exemplary system 1200 for implementing the present invention is
shown. As illustrated, system 1200 may include one or more
workstations 1202. Workstations 1202 may be local to each other or
remote from each other, and are connected by one or more
communications links 1204 to a communications network 1206 that is
linked via a communications link 1208 to a server 1210.
[0089] In system 1200, server 1210 may be any suitable server for
providing access to the application or to the temporal structure
model, such as a processor, a computer, a data processing device,
or a combination of such devices. Communications network 1206 may
be any suitable computer network including the Internet, an
intranet, a wide-area network (WAN), a local-area network (LAN), a
wireless network, a digital subscriber line (DSL) network, a frame
relay network, an asynchronous transfer mode (ATM) network, a
virtual private network (VPN), or any combination of any of the
same. Communications links 1204 and 1208 may be any communications
links suitable for communicating data between workstations 1202 and
server 1210, such as network links, dial-up links, wireless links,
hard-wired links, etc. Workstations 1202 enable a user to access
features using the temporal structure model. Workstations 1202 may
be personal computers, laptop computers, mainframe computers, dumb
terminals, data displays, Internet browsers, personal digital
assistants (PDAs), two-way pagers, wireless terminals, portable
telephones, etc., or any combination of the same.
[0090] The server and one of the workstations, which are depicted
in FIG. 12, are illustrated in more detail in FIG. 13. Referring to
FIG. 13, workstation 1202 may include processor 1302, display 1304,
input device 1306, and memory 1308, which may be interconnected. In
a preferred embodiment, memory 1308 contains a storage device for
storing a workstation program for controlling processor 1302.
Memory 1308 also preferably contains the application according to
the invention.
[0091] In some embodiments, the application may include an
application program interface (not shown), or alternatively, as
described above, the application may be resident in the memory of
workstation 1202 or server 1210. In another suitable embodiment,
the only distribution to the user may be a Graphical User Interface
which allows the user to interact with the application resident at,
for example, server 1210.
[0092] In one particular embodiment, the application may include
client-side software, hardware, or both. For example, the
application may encompass one or more Web-pages or Web-page
portions (e.g., via any suitable encoding, such as HyperText Markup
Language (HTML), Dynamic HyperText Markup Language (DHTML),
Extensible Markup Language (XML), JavaServer Pages (JSP), Active
Server Pages (ASP), Cold Fusion, or any other suitable
approaches).
[0093] Although the application is described herein as being
implemented on a workstation, this is only illustrative. The
application may be implemented on any suitable platform (e.g., a
personal computer (PC), a mainframe computer, a dumb terminal, a
data display, a two-way pager, a wireless terminal, a portable
telephone, a portable computer, a palmtop computer, a H/PC, an
automobile PC, a laptop computer, a personal digital assistant
(PDA), a combined cellular phone and PDA, etc.) to provide such
features.
[0094] Processor 1302 uses the workstation program to present on
display 1304 the application and the data received through
communication link 1204 and commands and values transmitted by a
user of workstation 1202. Input device 1306 may be a computer
keyboard, a cursor-controller, a microphone, a dial, a switchbank,
lever, or any other suitable input device as would be used by a
designer of input systems or process control systems.
[0095] Server 1210 may include processor 1320, display 1322, input
device 1324, and memory 1326, which may be interconnected. In a
preferred embodiment, memory 1326 contains a storage device for
storing data received through communication link 1208 or through
other links, and also receives commands and values transmitted by
one or more users. The storage device further contains a server
program for controlling processor 1320.
[0096] It will also be understood that the detailed description
herein may be presented in terms of program procedures executed on
a computer or network of computers. These procedural descriptions
and representations are the means used by those skilled in the art
to most effectively convey the substance of their work to others
skilled in the art.
[0097] A procedure is here, and generally, conceived to be a
self-consistent sequence of steps leading to a desired result.
These steps are those requiring physical manipulations of physical
quantities. Usually, though not necessarily, these quantities take
the form of electrical or magnetic signals capable of being stored,
transferred, combined, compared and otherwise manipulated. It
proves convenient at times, principally for reasons of common
usage, to refer to these signals as bits, values, elements,
symbols, characters, terms, numbers, or the like. It should be
noted, however, that all of these and similar terms are to be
associated with the appropriate physical quantities and are merely
convenient labels applied to these quantities.
[0098] Further, the manipulations performed are often referred to
in terms, such as adding or comparing, which are commonly
associated with mental operations performed by a human operator. No
such capability of a human operator is necessary, or desirable in
most cases, in any of the operations described herein which form
part of the present invention; the operations are machine
operations. Useful machines for performing the operation of the
present invention include general purpose digital computers or
similar devices.
[0099] The present invention also relates to apparatus for
performing these operations. This apparatus may be specially
constructed for the required purpose or it may comprise a general
purpose computer as selectively activated or reconfigured by a
computer program stored in the computer. The procedures presented
herein are not inherently related to a particular computer or other
apparatus. Various general purpose machines may be used with
programs written in accordance with the teachings herein, or it may
prove more convenient to construct more specialized apparatus to
perform the required method steps. The required structure for a
variety of these machines will appear from the description
given.
[0100] The system according to the invention may include a general
purpose computer, or a specially programmed special purpose
computer. The user may interact with the system via, e.g., a
personal computer or PDA, over, e.g., the Internet, an intranet, etc.
Either of these may be implemented as a distributed computer system
rather than a single computer. Similarly, the communications link
may be a dedicated link, a modem over a POTS line, the Internet
and/or any other method of communicating between computers and/or
users. Moreover, the processing could be controlled by a software
program on one or more computer systems or processors, or could
even be partially or wholly implemented in hardware.
[0101] Although a single computer may be used, the system according
to one or more embodiments of the invention is optionally suitably
equipped with a multitude or combination of processors or storage
devices. For example, the computer may be replaced by, or combined
with, any suitable processing system operative in accordance with
the concepts of embodiments of the present invention, including
sophisticated calculators, hand held, laptop/notebook, mini,
mainframe and super computers, as well as processing system network
combinations of the same. Further, portions of the system may be
provided in any appropriate electronic format, including, for
example, provided over a communication line as electronic signals,
provided on CD and/or DVD, provided on optical disk memory,
etc.
[0102] Any presently available or future developed computer
software language and/or hardware components can be employed in
such embodiments of the present invention. For example, at least
some of the functionality mentioned above could be implemented
using Visual Basic, C, C++ or any assembly language appropriate in
view of the processor being used. It could also be written in an
object oriented and/or interpretive environment such as Java and
transported to multiple destinations to various users.
[0103] It is to be understood that the invention is not limited in
its application to the details of construction and to the
arrangements of the components set forth in the following
description or illustrated in the drawings. The invention is
capable of other embodiments and of being practiced and carried out
in various ways. Also, it is to be understood that the phraseology
and terminology employed herein are for the purpose of description
and should not be regarded as limiting.
[0104] As such, those skilled in the art will appreciate that the
conception, upon which this disclosure is based, may readily be
utilized as a basis for the designing of other structures, methods
and systems for carrying out the several purposes of the present
invention. It is important, therefore, that the claims be regarded
as including such equivalent constructions insofar as they do not
depart from the spirit and scope of the present invention.
[0105] Although the present invention has been described and
illustrated in the foregoing exemplary embodiments, it is
understood that the present disclosure has been made only by way of
example, and that numerous changes in the details of implementation
of the invention may be made without departing from the spirit and
scope of the invention, which is limited only by the claims which
follow.
[0106] The following references are incorporated by reference
herein in their entireties:
[0107] M. Athineos and D. P. W. Ellis, "Sound texture modeling with
linear prediction in both time and frequency domains," in Proc.
ICASSP, 2003, vol. 5, pp. 648-651.
[0108] H. Hermansky and S. Sharma, "Temporal patterns (TRAPs) in
ASR of noisy speech," in Proc. ICASSP, March 1999, vol. 1, pp.
289-292.
[0109] H. Hermansky and N. Morgan, "RASTA processing of speech," in
Trans. Speech and Audio Processing, October 1994, vol. 2:4, pp.
578-589.
[0110] J. Tribolet and R. Crochiere, "Frequency domain coding of
speech," in Trans. ASSP, October 1979, vol. 27, pp. 512-530.
[0111] J. Herre and J. D. Johnston, "Enhancing the Performance of
Perceptual Audio Coders by Using Temporal Noise Shaping (TNS)," in
Proc. 101st AES Conv., November 1996.
[0112] L. Rabiner and R. Schafer, Digital processing of speech
signals, Prentice Hall, 1978.
[0113] Ozgur Cetin and Mari Ostendorf, "Cross-stream observation
dependencies for multi-stream speech recognition," in Eurospeech,
Geneva, 2003.
[0114] P. Somervuo, B. Chen, and Q. Zhu, "Feature transformations
and combinations for improving ASR performance," in Eurospeech,
Geneva, 2003.
[0115] H. Hermansky, H. Fujisaki, and Y. Sato, "Analysis and
synthesis of speech based on spectral transform linear predictive
method," in Proc. ICASSP, April 1983, vol. 8, pp. 777-780.
[0116] S. Sharma, H. Versnel, and N. Kowalski, "Ripple analysis in
ferret primary auditory cortex: 1. Response characteristics of
single units to sinusoidally rippled spectra," Aud. Neurosci., vol.
1, 1995.
[0117] D. Klein, D. Depireux, J. Simon, and S. Sharma, "Robust
spectro-temporal reverse correlation for the auditory system:
Optimizing stimulus design," J. Comput. Neurosci, vol. 9, 2000.
[0118] H. Hermansky, "Exploring temporal domain for robustness in
speech recognition," in Proc. of 15th International Congress on
Acoustics, vol. 11, Trondheim, Norway, June 1995.
[0119] H. Hermansky, "Should recognizers have ears?" Speech
Communication, vol. 25, 1998.
[0120] H. Hermansky and S. Sharma, "TRAPS--classifiers of temporal
patterns," in Proc. ICSLP, Sydney, Australia, 1998.
[0121] P. Jain and H. Hermansky, "Beyond a single critical band in
TRAP based ASR," in Proc. Eurospeech, Geneva, Switzerland, November
2003.
[0122] S. Makino, T. Kawabata, and K. Kido, "Recognition of
consonant based on the perception model," in Proc. ICASSP, Boston,
Mass., 1983.
[0123] P. Brown, "The acoustic-modeling problem in automatic speech
recognition," Ph.D. dissertation, Computer Science Department,
Carnegie Mellon University, 1987.
[0124] H. Hermansky, D. Ellis, and S. Sharma, "Connectionist
feature extraction for conventional hmm systems," in Proc. ICASSP,
Istanbul, Turkey, 2000.
[0125] M. Fanty and R. Cole, "Spoken letter recognition," in
Advances in Neural Information Processing Systems 3, Morgan
Kaufmann Publishers, Inc., 1990.
[0126] S. Sharma, D. Ellis, S. Kajarekar, P. Jain, and H.
Hermansky, "Feature extraction using non-linear transformation for
robust speech recognition on the AURORA data-base," in Proc.
ICASSP, Istanbul, Turkey, 2000.
[0127] P. Schwartz, P. Matejka, and J. Cernocky, "Recognition of
phoneme strings using TRAP technique," in Proc. Eurospeech, Geneva,
Switzerland, September 2003.
[0128] M. Athineos and D. Ellis, "Frequency-domain linear
prediction for temporal features," in Proc. IEEE ASRU Workshop, St.
Thomas, US Virgin Islands, December 2003.
[0129] H. Hermansky, H. Fujisaki, and Y. Sato, "Analysis and
synthesis of speech based on spectral transform linear predictive
method," in Proc. ICASSP, vol. 8, April 1983, pp. 777-780.
[0130] R. Koenig, H. Dunn, and L. Lacey, "The sound spectrograph,"
J. Acoust. Soc. Am., vol. 18, pp. 19-49, 1946.
[0131] H. Hermansky, "Perceptual linear predictive (PLP) analysis
of speech," J. Acoust. Soc. Am., vol. 87:4, April 1990.
[0132] M. Athineos, H. Hermansky, and D. Ellis, "PLP²:
Autoregressive modeling of auditory-like 2-D spectro-temporal
patterns," Submitted to SAPA-04, Jeju Island, Korea, October 2004.
* * * * *