U.S. patent application number 12/042111 was filed with the patent office on 2008-03-04 and published on 2008-06-26 as publication number 20080154595 for a system for classification of voice signals.
This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Israel Nelken.
Publication Number: 20080154595
Application Number: 12/042111
Family ID: 39510490
Publication Date: 2008-06-26

United States Patent Application 20080154595
Kind Code: A1
Nelken; Israel
June 26, 2008
SYSTEM FOR CLASSIFICATION OF VOICE SIGNALS
Abstract
A system and method for classifying a voice signal to one of a
set of predefined categories, based upon a statistical analysis of
features extracted from the voice signal. The system includes an
acoustic processor and a classifier. The acoustic processor
extracts features that are characteristic of the voice signal and
generates feature vectors using the extracted spectral features.
The classifier uses the feature vectors to compute the probability
that the voice signal belongs to each of the predefined categories
and classifies the voice signal to a predefined category that is
associated with the highest probability.
Inventors: Nelken; Israel (Jerusalem, IL)
Correspondence Address: GATES & COOPER LLP, 6701 CENTER DRIVE WEST, SUITE 1050, LOS ANGELES, CA 90045, US
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY
Family ID: 39510490
Appl. No.: 12/042111
Filed: March 4, 2008
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
10421356 | Apr 22, 2003 |
12042111 | |
Current U.S. Class: 704/240; 704/E15.001; 704/E15.024; 704/E15.037
Current CPC Class: G10L 15/14 20130101; G10L 15/1815 20130101
Class at Publication: 704/240; 704/E15.001
International Class: G10L 15/00 20060101 G10L 015/00
Claims
1. A system for classifying a voice signal, comprising: an acoustic
processor configured to receive the voice signal, to generate
feature vectors that characterize the voice signal, and to assign
an integer label to each generated feature vector; and a classifier
coupled to the acoustic processor to classify the voice signal to
one of a set of predefined categories based upon a statistical
analysis of the integer labels associated with the feature vectors,
wherein the classifier uses one or more probability suffix trees
(PSTs) to compute a probability of occurrence of the integer labels
being classified in the set of predefined categories.
2. The system of claim 1, wherein the system further comprises a
framer configured to segment the voice signal into frames.
3. The system of claim 1, wherein the acoustic processor comprises
a feature extractor configured to extract statistical features
characteristic of the voice signal.
4. The system of claim 1, further comprising a memory for storing
identities of agents, each agent being associated with one of the
set of predefined categories.
5. The system of claim 1, wherein the classifier computes a
probability that the voice signal belongs to each of the set of
predefined categories using the integer labels assigned to the
feature vectors.
6. The system of claim 5, wherein the classifier classifies the
voice signal to the predefined category in the set of predefined
categories that is associated with the highest probability.
7. The system of claim 1, wherein the classifier routes a caller
associated with the voice signal to an agent associated with the
predefined category.
8-14. (canceled)
15. A system for classifying a voice signal, comprising: means for
generating a digital discrete-time representation of the voice
signal; means for segmenting the digital discrete-time
representation of the voice signal into frames; means for
extracting statistical features from each frame that characterize
the voice signal; means for generating a feature vector from each
frame using the extracted statistical features; means for
associating an integer label to each feature vector; and means for
classifying the voice signal to one of a set of predefined
categories based upon a statistical analysis of the integer labels,
wherein the means for classifying uses one or more probability
suffix trees (PSTs) to compute a probability of occurrence of the
integer labels being classified in the set of predefined
categories.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] This invention relates generally to electronic voice
processing systems, and relates more particularly to a system and
method for voice signal classification based on statistical
regularities in voice signals.
[0003] 2. Description of the Background Art
[0004] Speech recognition systems may be used for interaction with
a computer or other device. Speech recognition systems usually
translate a voice signal into a text string that corresponds to
instructions for the device. FIG. 1 is a block diagram of a speech
recognition system of the prior art. The speech recognition system
includes a microphone 110, an analog-to-digital (A/D) converter
115, a feature extractor 120, a speech recognizer 125, and a text
string 130. Microphone 110 receives sound energy via pressure waves
(not shown). Microphone 110 converts the sound energy to an
electronic analog voice signal and sends the analog voice signal to
A/D converter 115. A/D converter 115 samples and quantizes the
analog signal, converting the analog voice signal to a digital
voice signal. Typical sampling frequencies are 8 kHz and 16 kHz.
A/D converter 115 then sends the digital voice signal to feature
extractor 120. Typically, feature extractor 120 segments the
digital voice signal into consecutive data units called frames, and
then extracts features that are characteristic of the voice signal
of each frame. Typical frame lengths are ten, fifteen, or twenty
milliseconds. Feature extractor 120 performs various operations on
the voice signal of each frame. Operations may include
transformation into a spectral representation by mapping the voice
signal from time to frequency domain via a Fourier transform,
suppressing noise in the spectral representation, converting the
spectral representation to a spectral energy or power signal, and
performing a second Fourier transform on the spectral energy or
power signal to obtain cepstral coefficients. The cepstral
coefficients represent characteristic spectral features of the
voice signal. Typically, feature extractor 120 generates a set of
feature vectors whose components are the cepstral coefficients.
Feature extractor 120 sends the feature vectors to speech
recognizer 125. Speech recognizer 125 includes speech models and
performs a speech recognition procedure on the received feature
vectors to generate the text string 130. For example, speech
recognizer 125 may be implemented as a Hidden Markov Model (HMM)
recognizer.
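As a minimal Python sketch of this front-end cepstral computation: the Hamming window and the logarithm applied before the second transform are conventional assumptions, not steps recited above, and the function name and use of NumPy are illustrative.

    import numpy as np

    def cepstral_coefficients(frame, num_coeffs=13):
        # Window the frame to reduce spectral leakage (assumed; not recited above).
        windowed = frame * np.hamming(len(frame))
        # Map the voice signal from the time domain to the frequency domain.
        spectrum = np.fft.rfft(windowed)
        # Convert the spectral representation to a spectral power signal.
        power = np.abs(spectrum) ** 2
        # A second transform of the (log) power signal yields cepstral coefficients.
        cepstrum = np.fft.irfft(np.log(power + 1e-10))
        return cepstrum[:num_coeffs]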
[0005] Speech recognition systems translate voice signals into
text; however, speaker-independent speech recognition systems are
generally rigid, inaccurate, computationally intensive, and unable
to recognize true natural language. For example, typical
speech recognition systems have a voice-to-text translation
accuracy rate of 40%-50% when processing true natural language
voice signals. It is difficult to design a highly accurate natural
language speech recognition system that generates unconstrained
voice-to-text translation in real-time, due to the complexity of
natural language, the complexity of the language models used in
speech recognition, and the limits on computational power.
[0006] In many applications, the exact text of a speech message is
unimportant, and only the topic of the speech message needs to be
recognized. It would be desirable to have a flexible, efficient,
and accurate speech classification system that categorizes natural
language speech based upon the topics of a speech message.
In other words, it would be advantageous to implement a speech
classification system that categorizes speech based upon what is
talked about, without generating an exact transcript of what is
said.
SUMMARY OF THE INVENTION
[0007] In accordance with the present invention, a system and
method are disclosed for classifying a voice signal to a category
from a set of predefined categories, based upon a statistical
analysis of features extracted from the voice signal.
[0008] The system includes an acoustic processor that generates a
feature vector and an associated integer label for each frame of
the voice signal, a memory for storing statistical
characterizations of a set of predefined categories and agents
associated with each predefined category, and a classifier for
classifying the voice signal to a predefined category based upon a
statistical analysis of the received output of the acoustic
processor.
[0009] In one embodiment the acoustic processor includes an FFT for
generating a spectral representation from the voice signal, a
feature extractor for generating feature vectors characterizing the
voice signal, a vector quantizer for quantizing the feature vectors
and generating an integer label for each feature vector, and a
register for storing the integer labels.
[0010] The classifier computes a probability of occurrence for the
output of the acoustic processor based on each of the statistical
characterizations of the predefined categories, and classifies the
voice signal to the predefined category with the highest
probability or to a set of predefined categories with the highest
probabilities. Furthermore, the classifier accesses memory to
determine an agent associated with the predefined category or
categories and routes a caller associated with the voice signal to
the agent. The agent may be a human agent or a software agent.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 is a block diagram of a speech recognition system of
the prior art;
[0012] FIG. 2 is a block diagram of one embodiment of a voice
signal classification system, according to the present
invention;
[0013] FIG. 3 is a block diagram of one embodiment of the acoustic
processor of FIG. 2, according to the invention;
[0014] FIG. 4A is a block diagram of one embodiment of the
classifier of FIG. 2, according to the invention;
[0015] FIG. 4B is a block diagram of one embodiment of
probabilistic suffix tree PST11 of FIG. 4A, according to the
invention;
[0016] FIG. 4C is a block diagram of one embodiment of
probabilistic suffix tree PST21 of FIG. 4A, according to the
invention;
[0017] FIG. 5 is a block diagram of another embodiment of the
classifier of FIG. 2, according to the invention;
[0018] FIG. 6 is a block diagram of one embodiment of a
hierarchical structure of classes, according to the invention;
and
[0019] FIG. 7 is a flowchart of method steps for classifying
speech, according to one embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0020] The present invention classifies a voice signal based on
statistical regularities in the signal. The invention analyzes the
statistical regularities in the voice signal to determine a
classification category. In one embodiment, the voice signal
classification system of the invention applies digital signal
processing techniques to a voice signal. The system receives the
voice signal and computes a set of quantized feature vectors that
represents the statistical characteristics of the voice signal. The
system then analyzes the feature vectors and classifies the voice
signal to a predefined category from a plurality of predefined
categories. Finally, the system contacts an agent associated with
the predefined category. The agent may be a person or an automated
process that provides additional services to a caller.
[0021] FIG. 2 is a block diagram of one embodiment of a voice
signal classification system 200, according to the invention. Voice
classification system 200 includes a sound sensor 205, an amplifier
210, an A/D converter 215, a framer 220, an acoustic processor 221,
a classifier 245, a memory 250, and an agent 255. System 200 may
also include noise-reduction filters incorporated in A/D converter
215, acoustic processor 221, or as separate functional units. Sound
sensor 205 detects sound energy and converts the detected sound
energy into an electronic analog voice signal. In one embodiment,
sound energy is input to system 200 by a speaker via a telephone
call. Sound sensor 205 sends the analog voice signal to amplifier
210. Amplifier 210 amplifies the analog voice signal and sends the
amplified analog voice signal to A/D converter 215. A/D converter
215 converts the amplified analog voice signal into a digital voice
signal by sampling and quantizing the amplified analog voice
signal. A/D converter 215 then sends the digital voice signal to
framer 220.
[0022] Framer 220 segments the digital voice signal into successive
data units called frames, where each frame occupies a time window
of duration T. A frame generally includes several hundred
digital voice signal samples, with a typical duration T of ten,
fifteen, or twenty milliseconds. However, the scope of the
invention includes frames of any duration T and any number of
signal samples. Framer 220 sends the frames to acoustic processor
221. Sound sensor 205, amplifier 210, A/D converter 215, and framer
220 are collectively referred to as an acoustic front end to
acoustic processor 221. The scope of the invention covers other
acoustic front ends configured to receive a voice signal and
generate a digital discrete-time representation of the voice
signal.
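A minimal sketch of such a framer in Python, assuming a 16 kHz sampling rate and non-overlapping 20 ms frames; other rates, durations, and overlapping windows are equally within the description above.

    import numpy as np

    def frame_signal(samples, sample_rate=16000, frame_ms=20):
        # Each frame occupies a time window of duration T; at 16 kHz and
        # T = 20 ms, a frame holds 320 samples (several hundred is typical).
        samples = np.asarray(samples, dtype=float)
        frame_len = int(sample_rate * frame_ms / 1000)
        num_frames = len(samples) // frame_len
        return samples[:num_frames * frame_len].reshape(num_frames, frame_len)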
[0023] Acoustic processor 221 generates a feature vector and an
associated integer label for each frame of the voice signal based
upon statistical features of the voice signal. Acoustic processor
221 is described below in conjunction with FIG. 3.
[0024] In one embodiment, classifier 245 classifies the voice
signal to one of a set of predefined categories by performing a
statistical analysis on the integer labels received from acoustic
processor 221. In another embodiment of the invention, classifier
245 classifies the voice signal to one of the set of predefined
categories by performing a statistical analysis on the feature
vectors received from acoustic processor 221. Classifier 245 is not
a speech recognition system that outputs a sequence of words.
Classifier 245 classifies the voice signal to one of the set of
predefined categories based upon the most likely content of the
voice signal. Classifier 245 computes the probabilities that the
voice signal belongs to each of a set of predefined categories
based upon a statistical analysis of the integer labels generated
by acoustic processor 221. Classifier 245 assigns the voice signal
to the predefined category that produces the highest probability.
Classifier 245, upon assigning the voice signal to one of the set
of predefined categories, accesses memory 250 to determine which
agent is associated with the predefined category. Classifier 245
then routes a caller associated with the voice signal to the
appropriate agent 255. Agent 255 may be a human agent or a software
agent.
[0025] FIG. 3 is a block diagram of one embodiment of acoustic
processor 221 of FIG. 2, according to the invention. However, the
scope of the invention covers any acoustic processor that
characterizes voice signals by extracting statistical features from
the voice signals. In the FIG. 3 embodiment, acoustic processor 221
includes an FFT 325, a feature extractor 330, a vector quantizer
335, and a register 340. FFT 325 generates a spectral
representation for each frame received from framer 220 by using a
computationally efficient algorithm to compute the discrete Fourier
transform of the voice signal. FFT 325 transforms the time-domain
voice signal to the frequency-domain spectral representation to
facilitate analysis of the voice signal by signal classification
system 200. FFT 325 sends the spectral representation of each frame
to feature extractor 330. Feature extractor 330 extracts
statistical features of the voice signal and represents those
statistical features by a feature vector, generating one feature
vector for each frame. For example, feature extractor 330 may
generate a smoothed version of the spectral representation called a
Mel spectrum. The statistical features are identified by the
relative energy in the Mel spectrum coefficients. Feature extractor
330 then computes the feature vector whose components are the Mel
spectrum coefficients. Typically, the components of the feature
vector are cepstral coefficients, which feature extractor 330
computes from the Mel spectrum. All other techniques for extracting
statistical features from the voice signal and processing the
statistical features to generate feature vectors are within the
scope of the invention. Feature extractor 330 sends the feature
vectors to vector quantizer 335. Vector quantizer 335 quantizes the
feature vectors and assigns each quantized vector one integer label
from a set of predefined integer labels.
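A sketch of such a feature extractor in Python: the triangular Mel filterbank construction, the filter count, and the log step are conventional assumptions rather than details recited in the disclosure.

    import numpy as np

    def mel_filterbank(num_filters, frame_len, sample_rate):
        # Triangular filters spaced evenly on the Mel scale (one conventional
        # construction; the disclosure does not fix a particular one).
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = inv_mel(np.linspace(0.0, mel(sample_rate / 2.0), num_filters + 2))
        bins = np.floor((frame_len + 1) * edges / sample_rate).astype(int)
        bank = np.zeros((num_filters, frame_len // 2 + 1))
        for i in range(1, num_filters + 1):
            up, down = bins[i] - bins[i - 1], bins[i + 1] - bins[i]
            bank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0.0, 1.0, up, endpoint=False)
            bank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1.0, 0.0, down, endpoint=False)
        return bank

    def mel_feature_vector(frame, sample_rate=16000, num_filters=20):
        # Relative energy in the Mel spectrum coefficients, as a feature vector.
        power = np.abs(np.fft.rfft(frame)) ** 2
        bank = mel_filterbank(num_filters, len(frame), sample_rate)
        return np.log(bank @ power + 1e-10)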
[0026] In an exemplary embodiment, vector quantizer 335 snaps
components of an n-dimensional feature vector to the nearest
quantized components of an n-dimensional quantized feature vector.
Typically there are a finite number of different quantized feature
vectors that can be enumerated by integers. Once the components of
the feature vectors are quantized, vector quantizer 335 generates a
single scalar value for each quantized feature vector corresponding
to a unique integer label of this vector among all different
quantized feature vectors. For example, given a quantized
n-dimensional feature vector v with quantized components (a_1,
a_2, a_3, ..., a_n), a scalar value (SV) may be
generated by a function SV = f(a_1, a_2, a_3, ..., a_n),
where SV is equal to a function f of the quantized
components (a_1, a_2, a_3, ..., a_n). Vector
quantizer 335 then assigns an integer label from the set of
predefined integer labels to each computed SV.
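One possible f, sketched in Python: the quantized components are read as digits of a fixed-radix number, so each distinct quantized feature vector receives a unique integer label. The step size and level count are illustrative assumptions.

    import numpy as np

    def quantize_and_label(vector, step=0.5, levels=8):
        # Snap each component to its nearest quantization level (components
        # are clipped into [0, levels - 1] for this sketch).
        q = np.clip(np.round(np.asarray(vector) / step), 0, levels - 1).astype(int)
        # SV = f(a_1, ..., a_n): read the quantized components as digits of a
        # base-`levels` number, giving a unique scalar per quantized vector.
        label = 0
        for a in q:
            label = label * levels + int(a)
        return label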
[0027] Vector quantizer 335 sends the integer labels to register
340, which stores the labels for all frames in the voice signal.
Register 340 may alternatively comprise a memory of various
storage-device configurations, for example Random-Access Memory
(RAM) and non-volatile storage devices such as floppy disks or
hard disk drives. Once the entire sequence of integer labels that
represents the voice signal is stored in register 340, register 340
sends the entire sequence of integer labels to classifier 245.
[0028] In alternate embodiments, acoustic processor 221 may
functionally combine FFT 325 with feature extractor 330, or may not
include FFT 325. If acoustic processor 221 does not perform an
explicit FFT on the voice signal at any stage, acoustic processor
221 may use indirect methods known in the art for extracting
statistical features from the voice signal. For example, in the
absence of FFT 325, feature extractor 330 may generate an LPC
spectrum directly from the time domain representation of the
signal. The statistical features are identified by spectral peaks
in the LPC spectrum and are represented by a set of LPC
coefficients. Then, in one embodiment, feature extractor 330
computes the feature vector whose components are the LPC
coefficients. In another embodiment, feature extractor 330 computes
the feature vector whose components are cepstral coefficients,
which feature extractor 330 computes from the LPC coefficients by
taking a fast Fourier transform of the LPC spectrum.
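A sketch of obtaining LPC coefficients directly from the time-domain frame, using the autocorrelation method with the Levinson-Durbin recursion; this is one standard technique, offered here as an assumption rather than the specific method intended above.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        # Autocorrelation of the time-domain frame at lags 0..order.
        n = len(frame)
        r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
        # Levinson-Durbin recursion; assumes a non-silent frame (r[0] > 0).
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            k = -np.dot(a[:i], r[i:0:-1]) / err
            a[:i + 1] = a[:i + 1] + k * a[i::-1]
            err *= 1.0 - k * k
        return a[1:]  # prediction coefficients, usable as feature vector components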
[0029] FIG. 4A is a block diagram of one embodiment of classifier
245 of FIG. 2, according to the invention. Classifier 245 includes
one or more probabilistic suffix trees (PSTs) grouped together by
voice classification category 410. For example, category 1 410a may
be "pets" and includes PST11, PST12, and PST13. Category 2 410b may
be "automobile parts" and includes PST21, PST22, PST23, and PST24.
Any number and type of voice classification categories 410 and any
number of PSTs per category are within the scope of the
invention.
[0030] FIG. 4B is a block diagram of one embodiment of PST11 from
category 1 410a and FIG. 4C is a block diagram of one embodiment of
PST21 from category 2 410b. The message information stored in
register 340 (FIG. 3) can be considered as a string of integer
labels. For each position in this string, a suffix is a contiguous
set of integer labels that terminates at that position. Suffix
trees are data structures comprising a plurality of suffixes for a
given string, allowing problems on strings, such as substring
matching, to be solved efficiently and quickly. A PST is a suffix
tree in which each vertex is assigned a probability. Each PST has a
root vertex and a plurality of branches. A path along each branch
comprises one or more substrings, and the substrings in combination
along a specific branch define a particular suffix.
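One way such a PST might be realized as a data structure, sketched in Python: branches are stored as child maps keyed by integer label, suffixes are read most-recent-label first (matching the branch labels of FIG. 4B), and all names are illustrative.

    class PSTNode:
        # A vertex of a probabilistic suffix tree; each vertex carries a
        # probability, and each outgoing branch is keyed by an integer label.
        def __init__(self, probability=1.0):
            self.probability = probability
            self.children = {}

        def add_suffix(self, labels, probability):
            # Install one suffix (most recent label first, e.g., (7, 1, 4))
            # as a branch and assign the given probability to its end vertex.
            node = self
            for label in labels:
                node = node.children.setdefault(label, PSTNode())
            node.probability = probability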
[0031] For example, PST11 of FIG. 4B includes 9 suffixes
represented by 9 branches, where a substring of each branch is
defined by an integer label. For example, a 7-1-2 sequence of
integer labels along a first branch defines a first suffix, a 7-1-4
sequence of integer labels along a second branch defines a second
suffix, a 7-8-2 sequence of integer labels along a third branch
defines a third suffix, and a 7-8-4 sequence of integer labels
along a fourth branch defines a fourth suffix. In one embodiment, a
probability is assigned to each vertex of each PST in each category
410, based upon suffix usage statistics in each category 410. For
example, suffixes specified by the PSTs of category 1 410a (FIG.
4A) common to words typically used to describe "pets" are assigned
higher probabilities than suffixes used less frequently. In
addition, a probability assigned to a given suffix from category 1
410a is typically different than a probability assigned to the
given suffix from category 2 410b (FIG. 4A).
[0032] In one embodiment, the PSTs associated with each voice
classification category 410 are built from training sets. The
training sets for each category include voice data from a variety
of users such that the PSTs are built using a variety of
pronunciations, inflections, and other such criteria.
[0033] In operation, classifier 245 receives a sequence of integer
labels from acoustic processor 221 associated with a voice message.
Classifier 245 computes the probability of occurrence of the
sequence of integer labels in each category using the PSTs. In one
embodiment, classifier 245 determines a total probability for the
sequence of integer labels for each PST in each category.
Classifier 245 determines the total probability for a sequence of
integer labels applied to a PST by determining a probability at
each position in the sequence based on the longest suffix present
in that PST, then calculating the product of the probabilities at
each position. Classifier 245 then determines which category
includes the PST that produced the highest total probability, and
assigns the message to that category.
[0034] Using PST11 of FIG. 4B and a sequence of integer labels
4-1-7-2-3-1-10 as an example, classifier 245 determines the
probability of a longest suffix at each of the seven locations in
the integer label sequence. Classifier 245 reads the first location
in the sequence of integer labels as the integer label 4. Since the
integer label 4 is not associated with a branch labeled 4 that
originates from a root vertex 420 of PST11, classifier 245 assigns
a probability of root vertex 420 (e.g., 1) to the first location.
The second location in the sequence of integer labels is the
integer label 1. The longest suffix associated with the second
location that is also represented by a branch originating from root
vertex 420 is the suffix corresponding to the integer label 1,
since the longest suffix corresponding to the integer label
sequence 1-4 does not correspond to any branches similarly labeled
originating from root vertex 420. That is, PST11 does not have a
branch labeled 1-4 that originates from root vertex 420. Therefore,
classifier 245 assigns the probability defined at a vertex 422
(P(1)) to the second location. The third location in the sequence
of integer labels is the integer label 7. Since the longest suffix
ending at the integer label 7 (i.e., suffix 7-1-4) exists in PST11
as the branch labeled 7-1-4 originating from root vertex 420,
classifier 245 assigns a probability associated with a vertex 424
(P(7-1-4)) to the third location. The next two locations in the
sequence of integer labels correspond to the integers 2 and 3,
respectively, and are not associated with any similarly labeled
branches that originate from root vertex 420, and therefore
classifier 245 assigns the probability of root vertex 420 to these
next two locations. The sixth location in the sequence corresponds
to the integer label 1, and the longest suffix ending at the sixth
location that is represented by a branch in PST11 is the suffix
1-3-2. Therefore, classifier 245 assigns a probability associated
with a vertex 426 (P(1-3-2)) to the sixth location along the
sequence. Next, since the seventh location corresponding to the
integer label 10 is not represented by a branch in PST11
originating from root vertex 420, classifier 245 assigns the
probability of root vertex 420 to the seventh location in the
sequence.
[0035] Next, classifier 245 calculates the total probability for
the sequence of integer labels 4-1-7-2-3-1-10 applied to PST11,
where the total probability is the product of the location
probabilities:
PT(PST11) = 1 × P(1) × P(7-1-4) × 1 × 1 × P(1-3-2) × 1.
In another embodiment of the invention, classifier 245
calculates the total probability by summing the logarithms of the
location probabilities. Although the sequence of integer labels for
this example includes only seven integer labels, any number of
integer labels is within the scope of the invention. The number of
integer labels in the sequence depends on the number of frames of
the message, which in turn depends on the duration of the voice
signal input to system 200.
[0036] FIG. 5 is a block diagram of another embodiment of
classifier 245, according to the invention. The FIG. 5 embodiment
of classifier 245 includes three states and nine arcs, but the
scope of the invention includes classifiers with any number of
states and associated arcs. Since each state is associated with one
of the predefined integer labels, the number of states is equal to
the number of predefined integer labels. The FIG. 5 embodiment of
classifier 245 comprises three predefined integer labels, where
state 1 (505) is identified with integer label 1, state 2 (510) is
identified with integer label 2, and state 3 (515) is identified
with integer label 3. The arcs represent the probability of a
transition from one state to another state or the same state. For
example, a_12 is the probability of transition from state 1
(505) to state 2 (510), a_21 is the probability of transition
from state 2 (510) to state 1 (505), and a_11 is the
probability of transition from state 1 (505) to state 1 (505). The
transition probabilities a_ij(L) depend on the integer labels L
of the quantized speech.
[0037] In the FIG. 5 embodiment, classifier 245 computes all
permutations of the integer labels received from acoustic processor
221 and computes a probability of occurrence for each permutation.
Classifier 245 associates each permutation of the received integer
labels to a unique sequence of states. The total number of
sequences that classifier 245 can compute is the total number of
predefined integer labels raised to an integer power, where the
integer power is the total number of integer labels sent to
classifier 245. If m is the total number of predefined integer labels,
n is the integer power, and ns is the total number of sequences of
states, then ns = m^n. Classifier 245 comprises three predefined
integer labels (m = 3). Thus, if register 340 sends classifier 245
three integer labels (n = 3), then classifier 245 can compute 3^3 = 27
possible sequences of states. The sequences of states include, for
example, 1→1→1, 1→1→2, 1→2→1, 1→1→3, 1→3→1, 1→2→2, 1→3→3, and
1→2→3. The total number of transition probabilities
is the total number of predefined integer labels squared. If
np is the total number of transition probabilities, then np = m^2. Thus
there are 3^2 = 9 transition probabilities. For each integer
label L that can be assigned by quantizer 335 (FIG. 3), there is
possibly a different set of transition probabilities. The
transition probabilities are a_11(L), a_22(L), a_33(L),
a_12(L), a_21(L), a_13(L), a_31(L), a_23(L),
and a_32(L).
[0038] When a user or system administrator initializes voice signal
classification system 200, classifier 245 assigns an initial
starting probability to each state. For example, classifier 245
assigns to state 1 (505) a probability a_11, which represents
the probability of starting in state 1, to state 2 (510) a
probability a_12, which represents the probability of starting
in state 2, and to state 3 (515) a probability a_13, which
represents the probability of starting in state 3.
[0039] If classifier 245 receives integer labels (1,2,3), then
classifier 245 computes six sequences of states
1→2→3, 1→3→2, 2→1→3,
2→3→1, 3→1→2, and 3→2→1,
and an associated probability of occurrence for each sequence. The
six sequences of states are a subset of the 27 possible sequences
of states. For example, classifier 245 computes the total
probability of the 1→2→3 sequence of states by
multiplying the probability of starting in state 1, a_11, by
the probability a_12(L_1) of a transition from state 1 to
state 2 when the first integer label of a sequence of integer
labels appears, by the probability a_23(L_2) of a
transition from state 2 to state 3 when the second integer label of
the sequence appears. The total probability is
P(1→2→3) = a_11 × a_12(L_1) × a_23(L_2). Similarly, the total
probability of the 2→3→1 sequence of states is
P(2→3→1) = a_12 × a_23(L_1) × a_31(L_2). Classifier
245 calculates the total probabilities for the
remaining four sequences of states in a similar manner. Classifier
245 then classifies the voice signal to one of a set of predefined
categories associated with the sequence of states with the highest
probability of occurrence. Some of the sequences of states may not
have associated categories, and some of the sequences of states may
have the same associated category. If there is no predefined
category associated with the sequence of states with the highest
probability of occurrence, then classifier 245 classifies the voice
signal to a predefined category associated with the sequence of
states with the next highest probability of occurrence.
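A sketch of this computation in Python, assuming starting probabilities in a dict keyed by state and label-dependent transition probabilities in a dict keyed by integer label and then by (from_state, to_state); the names and container choices are illustrative.

    from itertools import permutations

    def sequence_probability(start, trans, states, labels):
        # P(s1 -> s2 -> ...) = start(s1) * a_{s1 s2}(L_1) * a_{s2 s3}(L_2) * ...
        prob = start[states[0]]
        for i in range(1, len(states)):
            prob *= trans[labels[i - 1]][(states[i - 1], states[i])]
        return prob

    def best_sequence(start, trans, labels):
        # For received labels (1, 2, 3) this evaluates the six permutations
        # described above and returns the most probable sequence of states.
        return max(set(permutations(labels)),
                   key=lambda s: sequence_probability(start, trans, s, labels))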
[0040] Voice classification system 200 may be implemented in a
voice message routing system, a quality-control call center, an
interface to a Web-based voice portal, or in conjunction with a
speech-to-text recognition engine, for example. A retail store may
use voice signal classification system 200 to route telephone calls
to an appropriate department (agent) based upon a category to which
a voice signal is classified. For example, a person may call the
retail store to inquire whether the store sells a particular brand
of cat food. More specifically, a person may say the following: "I
was wondering if you carry, . . . uh, . . . well, if you stock or
have in store cat food X, well actually cat food for my kitten, and
if so, could you tell me the price of a bag. Also, how large of bag
can I buy? (Pause). Oh wait, I almost forgot, do you have monkey
chow?" Although this is a complex, natural language speech pattern,
voice signal classification system 200 classifies the received
natural language voice signal into a category based upon the
content of the voice signal. For example, system 200 may classify
the voice signal to a pet department category, and therefore route
the person's call to the pet department (agent). However, in
addition, system 200 may classify the speech into other categories,
such as billing, accounting, employment opportunities, deliveries,
or others. For example, system 200 may classify the speech to a
pricing category that routes the call to an associated agent that
can immediately answer the caller's questions concerning inventory
pricing.
[0041] System 200 may classify voice signals to categories
associated with predefined items on a menu. For example, a voice
signal may be classified to a category associated with a software
agent that activates a playback of a predefined pet department
menu. The caller can respond to the pet department menu with
additional voice messages or a touch-tone keypad response. Or the
voice signal may be classified to another category whose associated
software agent activates a playback of a predefined pricing
menu.
[0042] In another embodiment, system 200 may be implemented in a
quality control call center that classifies calls into complaint
categories, order categories, or personal call categories, for
example. An agent then selects calls from the various categories
based upon the agent's priorities at the time. Thus, system 200
provides an effective and efficient manner of customer-service
quality control.
[0043] In yet another embodiment of speech classification system
200, system 200 may be configured as an interface to voice portals,
classifying calls to various categories such as weather, stock, or
traffic, and then routing and connecting the call to an appropriate
voice portal.
[0044] In yet another embodiment of the present invention, system
200 is used in conjunction with a speech-to-text recognition
engine. For example, a voice signal is assigned to a particular
category that is associated with a predefined speech model
including a defined vocabulary set for use in the recognition
engine. For instance, a caller inquiring about current weather
conditions in Oklahoma City would access the recognition engine
with a speech model/vocabulary set including voice-to-text
translations for words such as "storm", "rain", "hail", and
"tornado." The association of speech models/vocabulary sets with
each voice signal category reduces the complexity of the
speech-to-text recognition engine and consequently reduces
speech-to-text processing times.
[0045] The combination of system 200 with the speech-to-text
recognition engine may classify voice signals into language
categories, thus making the combination of system 200 and the
speech-to-text recognition engine language independent. For
example, if voice classification system 200 classifies a voice
signal to a German language category, then the recognition engine
uses a speech model/vocabulary set associated with the German
language category to translate the voice signal.
[0046] In other embodiments, system 200 may be implemented to
classify voice signals into categories that are independent of the
specific spoken words or text of the call. For example, system 200
may be configured to categorize a caller as male or female as the
content of a male voice signal typically is distinguishable from
the content of a female voice signal. Similarly, system 200 may be
configured to identify a caller as being one member of a
predetermined group of persons as the content of the voice signal
of each person in the group would be distinguishable from that of
the other members of the group. System 200 therefore may be used,
for example, in a caller identification capacity or a password
protection or other security capacity.
[0047] In addition, just as system 200 may be used to categorize
voice signals as either male or female, system 200 may be used to
distinguish between any voice signal sources where the voice
signals at issue are known to have different content. Such voice
signals are not required to be expressed in a known language. For
example, system 200 may be used to distinguish between various
types of animals, such as cats and dogs or sheep and cows. Further,
system 200 may be used to distinguish among different animals of
the same type, such as dogs, where a predetermined group of such
animals exists and the voice signal content of each animal in the
group is known. In this case, system 200 may be used to identify
any one of the animals in the group in much the same way that
system 200 may be used to identify a caller as described above.
[0048] Voice classification system 200 may be implemented in a
hierarchical classification system. FIG. 6 is a block diagram of
one embodiment of a hierarchical structure of classes 600,
according to the invention. The hierarchical structure includes a
first level class 605, a second level class 610, and a third level
class 615. In the FIG. 6 exemplary embodiment of the hierarchical
structure of classes 600, the first level class 605 includes
language categories, such as an English language category 620, a
German language category 625, and a Spanish language category 630.
The second level class 610 includes a pricing category 635, a
complaint category 640, and an order category 645. The third level
class 615 includes a hardware category 650, a sporting goods
category 655, and a kitchen supplies category 660.
[0049] For example, voice classification system 200 receives a call
and classifies the caller's voice signal 601 into English category
620, then classifies voice signal 601 into order 645 subcategory,
and then classifies voice signal 601 into sporting goods 655
sub-subcategory. Finally, system 200 routes the call to an agent
665 associated with ordering sporting goods supplies in English.
The configuration of system 200 with the hierarchical structure of
classes 600 permits more flexibility and refinement in classifying
voice signals to categories. The scope of the present invention
includes any number of class levels and any number of categories in
each class level.
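A sketch of routing through such a hierarchy, assuming one classifier callable per class level, each returning the highest-probability category for the sequence of integer labels; the level names follow FIG. 6 and the agent-lookup table is a hypothetical.

    def route_call(labels, level_classifiers, agents):
        # Classify at each class level in turn, e.g., language, then request
        # type, then department, yielding a path such as
        # ('English', 'order', 'sporting goods'); then look up the agent.
        path = tuple(classify(labels) for classify in level_classifiers)
        return agents[path]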
[0050] FIG. 7 is a flowchart of method steps for classifying
speech, according to one embodiment of the invention. Although the
steps of the FIG. 7 method are described in the context of system
200 of FIG. 2, any other system configured to implement the method
steps is within the scope of the invention. In a step 705, sound
sensor 205 detects sound energy and converts the sound energy into
an analog voice signal. In a step 710, amplifier 210 amplifies the
analog voice signal. In a step 715, A/D converter 215 converts the
amplified analog voice signal into a digital voice signal. In a
step 720, framer 220 segments the digital voice signal into
successive data units called frames. In a step 725, acoustic
processor 221 processes the frames and generates a feature vector
and an associated integer label for each frame. Typically, acoustic
processor 221 extracts features (such as statistical features) from
each frame, processes the extracted features to generate feature
vectors, and assigns an integer label to each feature vector.
Acoustic processor 221 may include one or more of the following: an
FFT 325, a feature extractor 330, a vector quantizer 335, and a
register 340. In a step 730, classifier 245 performs a statistical
analysis on the integer labels and in a step 735, classifier 245
classifies the voice signal to a predefined category based upon the
results of the statistical analysis. In a step 740, classifier 245
accesses memory 250 to determine which agent 255 is associated with
the predefined category assigned to the voice signal. The agent may
either be a human agent or a software agent. In a step 745, a
caller associated with the voice signal is routed to the agent
corresponding to the predefined category.
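The FIG. 7 flow from step 720 onward, sketched end to end by composing the illustrative helpers suggested in the earlier sketches (frame_signal, mel_feature_vector, quantize_and_label, total_probability); the per-category PST tables and agent map are assumptions.

    import numpy as np

    def classify_voice_signal(digital_signal, category_psts, agents):
        # Steps 720-745: frame the digital voice signal, generate a feature
        # vector and an integer label per frame, classify, and route the caller.
        frames = frame_signal(np.asarray(digital_signal, dtype=float))
        labels = [quantize_and_label(mel_feature_vector(f)) for f in frames]
        category = max(category_psts, key=lambda c: max(
            total_probability(pst, labels) for pst in category_psts[c]))
        return category, agents[category]  # route the caller to this agent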
[0051] The invention has been explained above with reference to
specific embodiments. Other embodiments will be apparent to those
skilled in the art in light of this disclosure. The present
invention may readily be implemented using configurations other
than those described in the embodiments above. Therefore, these and
other variations upon the specific embodiments are intended to be
covered by the present invention, which is limited only by the
appended claims.
* * * * *