U.S. patent application number 11/549173 was filed with the patent office on 2008-04-17 for generation of domain models from noisy transcriptions.
Invention is credited to Shourya Roy, Laxminarayan Venkata Subramaniam.
Application Number: 20080091423 (11/549173)
Document ID: /
Family ID: 39325648
Filed Date: 2008-04-17

United States Patent Application 20080091423
Kind Code: A1
Roy; Shourya; et al.
April 17, 2008
Generation of domain models from noisy transcriptions
Abstract
A method of building a domain specific model from transcriptions
is disclosed. The method starts by applying text clustering to the
transcriptions to form text clusters. The text clustering is
applied at a plurality of different granularities, and groups
topically similar phrases in the transcriptions. The relationship
between text clusters resulting from the text clustering at
different granularities is then identified to form a taxonomy. The
taxonomy is augmented with topic specific information.
Inventors: Roy; Shourya (New Delhi, IN); Subramaniam; Laxminarayan Venkata (Gurgaon, IN)
Correspondence Address: FREDERICK W. GIBB, III; Gibb & Rahman, LLC, 2568-A RIVA ROAD, SUITE 304, ANNAPOLIS, MD 21401, US
Family ID: 39325648
Appl. No.: 11/549173
Filed: October 13, 2006
Current U.S. Class: 704/235; 704/E15.045
Current CPC Class: G10L 15/26 20130101
Class at Publication: 704/235
International Class: G10L 15/26 20060101 G10L015/26
Claims
1. A method of building a domain specific model from
transcriptions, said method comprising the steps of: applying text
clustering to said transcriptions to form text clusters, where said
text clustering is applied at a plurality of different
granularities and groups topically similar phrases in said
transcriptions; identifying the relationship between text clusters
resulting from said text clustering at different granularities to
form a taxonomy; and augmenting said taxonomy with topic specific
information.
2. The method according to claim 1 wherein said taxonomy is
augmented with one or more of: typical issues raised and solutions
to those issues; typical questions and answers; and statistics
relating to conversations associated with said transcriptions.
3. The method according to claim 1 comprising the initial step of:
extracting from said transcriptions n-grams based upon the
respective frequencies of occurrences of the n-grams.
4. The method according to claim 1 wherein said identifying step
identifies relationships between text clusters resulting from said
text clustering at granularities of adjacent levels.
5. The method according to claim 1 comprising the initial steps of:
removing stopwords from said transcriptions; and removing pause
filling words from said transcriptions.
6. The method according to claim 5 wherein said stopwords include
generic stopwords and domain specific stopwords.
7. A method comprising the steps of: receiving a domain specific
model; receiving a transcription of a part of a conversation; and
mapping said transcription to a node in said domain specific
model.
8. A method comprising the steps of: receiving a domain specific
model; receiving a transcription of a conversation; calculating
statistics of said transcription; and comparing at least said
statistics of said transcription with statistics from said domain
specific model.
9. An apparatus for building a domain specific model from
transcriptions, said apparatus comprising a processor configured to
perform steps comprising: applying text clustering to said
transcriptions to form text clusters, where said text clustering is
applied at a plurality of different granularities and groups
topically similar phrases in said transcriptions; identifying the
relationship between text clusters resulting from said text
clustering at different granularities to form a taxonomy; and
augmenting said taxonomy with topic specific information.
10. The apparatus according to claim 9 wherein said taxonomy is
augmented with one or more of: typical issues raised and solutions
to those issues; typical questions and answers; and statistics
relating to conversations associated with said transcriptions.
11. The apparatus according to claim 9 wherein said processor is
configured to perform the further step of: extracting from said
transcriptions n-grams based upon the respective frequencies of
occurrences of the n-grams.
12. The apparatus according to claim 9 wherein said identifying
step identifies relationships between text clusters resulting from
said text clustering at granularities of adjacent levels.
13. An apparatus configured to perform steps comprising:
receiving a domain specific model; receiving a transcription of a
part of a conversation; and mapping said transcription to a node in
said domain specific model.
14. An apparatus configured to perform steps comprising:
receiving a domain specific model; receiving a transcription of a
conversation; calculating statistics of said transcription; and
comparing at least said statistics of said transcription with
statistics from said domain specific model.
Description
FIELD OF THE INVENTION
[0001] The present invention relates generally to speech processing
and, in particular, to the generation of domain models from
transcriptions of conversation recordings.
BACKGROUND
[0002] "Call-center" is a general term used in relation to help
desks, information lines and customer service centers. Many
companies today operate call-centers to handle a diversity of
customer issues, including product and service related issues and
grievance redress. Call-centers constantly try to increase customer
satisfaction and call handling efficiency by aiding agents and by
monitoring agents.
[0003] Call-centers may use dialog-based support, relying on voice
conversations and online chat, as well as email support, where a
user communicates with a professional agent via email. A typical
call-center agent handles over a hundred calls in a day. These
calls are typically recorded. As a result, gigabytes of data are
produced every day in the form of speech audio, speech transcripts,
email etc. This data is valuable for doing analysis at many levels.
For example, the data may be used to obtain statistics about the
type of problems and issues associated with different products and
services. The data may also be used to evaluate call center agents
and train the agents to improve their performance.
[0004] Today's call-centers handle a wide variety of domains, such
as computer sales and support, mobile phones, apparel, car rental,
etc. To analyze the calls in any domain, analysts need to identify
the key issues in the domain. Further, there may be variations
within a domain based on the service providers. An example of a
domain where variations within the domain exist that are based on
the service providers is the domain of mobile phones.
[0005] In the past, an analyst would generate a domain model
through manual inspection of the data. Such a domain model can
include a listing of the call categories, types of problems solved
in each category, listing of the customer issues, typical
question-answers, appropriate call opening and closing styles etc.
In essence, these domain models provide a structured view of the
domain.
[0006] Manually building such domain models for various domains may
become prohibitively resource intensive. Many of the domain models
are also dynamic in nature and therefore change over time. For
example, when a new version of a mobile phone is introduced, when a
new software product is launched in a country, or when a new
computer virus starts an attack, the domain model may need to be
refined.
[0007] In view of the foregoing, a need exists for an automated
approach of creating and maintaining domain models.
SUMMARY
[0008] It is an object of the present invention to substantially
overcome, or at least ameliorate, one or more disadvantages of
existing arrangements.
[0009] According to an aspect of the present invention there is
provided a method of building a domain specific model from
transcriptions. The method starts by applying text clustering to
the transcriptions to form text clusters. The text clustering is
applied at a plurality of different granularities, and groups
topically similar phrases in the transcriptions. The relationship
between text clusters resulting from the text clustering at
different granularities is then identified to form a taxonomy. The
taxonomy is augmented with topic specific information.
[0010] Preferably the taxonomy is augmented with one or more of:
typical issues raised and solutions to those issues; typical
questions and answers; and statistics relating to conversations
associated with said transcriptions.
[0011] In a further preferred implementation the method starts with
the initial step of extracting from the transcriptions n-grams
based upon their respective frequencies of occurrence.
[0012] The identifying step preferably identifies relationships
between text clusters resulting from the text clustering at
granularities of adjacent levels.
[0013] Other aspects of the invention are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] One or more embodiments of the present invention will now be
described with reference to the drawings, in which:
[0015] FIG. 1 shows a schematic flow diagram of a method of
building a domain specific model from a collection of telephonic
conversation recordings;
[0016] FIG. 2 shows a partial transcript of a dialog from the
internal IT help desk of a company;
[0017] FIG. 3 shows a part of the taxonomy obtained from the
dialogs from the internal IT help desk of a company;
[0018] FIG. 4 shows a part of topic specific information that has
been generated for the "default properti" node in FIG. 3 from an
example case;
[0019] FIG. 5 shows average call duration and corresponding average
transcription lengths for topics of interest;
[0020] FIG. 6 shows variation in prediction accuracy as a function
of the fraction of a call;
[0021] FIG. 7 shows a graph of the prediction accuracy achieved for
various clusters after analyzing 25%, 50%, 75% and 100% of the
call; and
[0022] FIG. 8 is a schematic block diagram of a general purpose
computer upon which arrangements described can be practiced.
DETAILED DESCRIPTION
[0023] FIG. 1 shows a schematic flow diagram of a method 100 of
building a domain specific model 190 from a collection of
conversation recordings 105, such as telephonic conversation
recordings. The domain specific model 190 built by the method 100
comprises primarily a topic taxonomy, where every node in the
taxonomy is characterized for example by topic(s), typical
Question-Answers (QAs), typical actions call statistics etc.
[0024] The method 100 is described in the context of a call-center
where a customer phones a call-center agent for assistance.
However, the method 100 is not so limited. In the embodiment
described the conversation recordings 105 comprise dialog between
the call-center agent and the customer.
[0025] The method 100 of building a domain specific model from a
collection of conversation recordings 105 may be practiced using a
general-purpose computer system 800, such as that shown in FIG. 8
wherein the processes of FIG. 1 may be implemented as software,
such as an application program executing within the computer system
800. In particular, the steps of the method 100 are effected by
instructions in the software that are carried out by the computer
system 800. The software may be stored in a computer readable
medium. The software is loaded into the computer system 800 from
the computer readable medium, and then executed by the computer
system 800. A computer readable medium having such software or
computer program recorded on it is a computer program product. The
use of the computer program product in the computer system 800
preferably effects an advantageous apparatus for building a domain
specific model 190 from a collection of conversation recordings
105.
[0026] The computer system 800 is formed by a computer module 801,
input devices such as a keyboard 802, output devices including a
display device 814. The computer module 801 typically includes at
least one processor unit 805, and a memory unit 806. The module 801
also includes a number of input/output (I/O) interfaces including
a video interface 807 that couples to the display device 814, and
an I/O interface 813 for the keyboard 802. A storage device 809 is
provided and typically includes at least a hard disk drive and a
CD-ROM drive. The components 805 to 813 of the computer module 801
typically communicate via an interconnected bus 804 and in a manner
which results in a conventional mode of operation of the computer
system 800 known to those in the relevant art.
[0027] Typically, the application program is resident on the
storage device 809 and read and controlled in its execution by the
processor 805. In some instances, the application program may be
supplied to the user encoded on a CD-ROM or floppy disk and read
via a corresponding drive, or alternatively may be read by the user
from a network via a modem device. Still further, the software can
also be loaded into the computer system 800 from other computer
readable media. The term "computer readable medium" as used herein
refers to any storage medium that participates in providing
instructions and/or data to the computer system 800 for execution
and/or processing.
[0028] The method 100 may alternatively be implemented in dedicated
hardware such as one or more integrated circuits performing the
functions or sub functions of method 100.
[0029] Referring again to FIG. 1, the method 100 starts in step 110
where the dialog of the conversation recordings 105 is
automatically transcribed to form transcription output 115. For
example, in the case of telephonic conversation recordings
automatic speech recognition (ASR) may be used to automatically
transcribe the dialog of the conversation recordings 105.
[0030] The transcription output 115 comprises information about the
recognized words along with their durations, i.e. start and end
times of the words. Further, speaker turns are marked, so the
speaker portions are demarcated without exactly naming which part
belongs to which speaker.
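The structure of this transcription output can be sketched as a small data model. The following is a minimal illustration; the class and field names are assumptions chosen for exposition, not taken from the application:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    text: str     # recognized word
    start: float  # start time in seconds
    end: float    # end time in seconds

@dataclass
class Turn:
    # Speaker portions are demarcated without naming the speaker,
    # so a turn carries only an anonymous index, not an identity.
    turn_index: int
    words: List[Word] = field(default_factory=list)

def duration(word: Word) -> float:
    # A word's duration follows directly from its marked start/end times.
    return word.end - word.start

# A transcription output is then an ordered list of turns.
transcription = [
    Turn(0, [Word("hello", 0.0, 0.4), Word("thanks", 0.5, 0.9)]),
    Turn(1, [Word("hi", 1.2, 1.4)]),
]
```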
[0031] However, the transcription output 115 contains many errors
and a high degree of noise. The method 100 is capable of managing
the errors and noise through various feature engineering
techniques.
[0032] Before describing the feature engineering techniques in more
detail, the origin of the errors and noise is first discussed.
Current ASR technology, when applied to telephone conversations,
has moderate to high word error rates. This is particularly true
for telephone conversations arising from call-center calls, because
call-centers are now located in different parts of the world,
resulting in a diversity of accents that the ASR has to contend
with. This high error rate implies that erroneous deletions of
actual words and erroneous insertions of dictionary words are common
phenomena. Also, speaker turns are often not correctly identified
and portions of both speakers are assigned to a single speaker.
[0033] In addition to speech recognition errors, there are other
challenges that arise from recognition of spontaneous speech. For
example, there are no punctuation marks. Silence periods are
marked, but it is not possible to find sentence boundaries based on
these silences. There are also repeated words, false starts, many
pause filling words such as "um" and "uh", etc.
[0034] FIG. 2 shows a partial transcript of a dialog from the
internal IT help desk of a company. As can be seen from the partial
transcript, due to the high error rate and introduced noise, the
transcription output 115 is difficult for humans to interpret.
[0035] Referring again to FIG. 1, in order to combat the noise
introduced by the ASR, the method 100 continues to step 120 where
various feature engineering techniques are employed to perform
noise removal. More particularly, a sequence of cleansing
operations is performed to remove generic stopwords such as "the",
"of", etc., as well as domain specific stopwords such as "serial",
"seven", "dot" etc. Pause filling words, such as "um", "uh", and
"huh" are also removed.
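The cleansing operations of step 120 can be sketched as a single filtering pass. The stopword and filler lists below are illustrative samples drawn from the examples in the text, not the full lists of the preferred implementation:

```python
# Sketch of the cleansing pass in step 120: generic stopwords,
# domain specific stopwords, and pause-filling words are removed.
GENERIC_STOPWORDS = {"the", "of", "a", "and", "to"}
DOMAIN_STOPWORDS = {"serial", "seven", "dot"}
PAUSE_FILLERS = {"um", "uh", "huh"}

def cleanse(tokens):
    removable = GENERIC_STOPWORDS | DOMAIN_STOPWORDS | PAUSE_FILLERS
    return [t for t in tokens if t.lower() not in removable]
```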
[0036] The words remaining in the transcription output are passed
through a stemmer. The stemmer determines a root form of a given
inflected (or, sometimes, derived) word form.
[0037] For example, the root "call" is determined from the word
"called". In the preferred implementation Porter's stemmer,
available from http://www.tartarus.org/martin/PorterStemmer, is
used.
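The full Porter algorithm applies several ordered rule passes; the highly simplified suffix stripper below only illustrates the idea of reducing inflected forms to a common root, and is not a substitute for the real stemmer:

```python
def simple_stem(word):
    # Strip one common inflectional suffix, keeping a stem of at
    # least three characters. Illustrative only; Porter's stemmer
    # uses a much richer, measure-based rule set.
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```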
[0038] The next action performed in step 120 is that all n-grams
which occur more frequently than a threshold, and do not contain
any stopword, are extracted from the noisy transcriptions.
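This extraction can be sketched with a frequency counter; the threshold semantics ("more frequently than a threshold") follow the text:

```python
from collections import Counter

def frequent_ngrams(tokens, n, threshold, stopwords=frozenset()):
    # Extract all n-grams occurring more often than `threshold`
    # that contain no stopword, as in step 120.
    counts = Counter(
        tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
    )
    return {
        gram: c
        for gram, c in counts.items()
        if c > threshold and not any(t in stopwords for t in gram)
    }
```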
[0039] Next, in step 130, text clustering is applied to the output
of step 120 to group topically similar conversations together. In
the preferred implementation the clustering high dimensional
datasets package (CLUTO package) available from
http://glaros.dtc.umn.edu/gkhome/views/cluto is used for the text
clustering, with the default repeated bisection technique and the
cosine function as the similarity metric.
[0040] The transcriptions do not contain well formed sentences.
Therefore, step 130 applies clustering at different levels of
granularity. In the preferred implementation, 5 clusters are first
generated from the transcriptions. Next, 10 clusters are generated
from the same set of transcriptions, and so on. At the finest level,
100 clusters are generated.
[0041] The relationship between groups at different levels of
granularity is identified by generating a taxonomy of conversation
types in step 140. Step 140 firstly removes clusters containing
fewer than a predetermined number of transcriptions and secondly
introduces a directed edge from cluster ν1 to cluster ν2 if
clusters ν1 and ν2 share at least one transcription between them,
where cluster ν2 is one level finer than cluster ν1. Clusters ν1
and ν2 thereby become nodes in adjacent layers in the taxonomy.
Each node in the taxonomy may be termed a topic.
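The two actions of step 140 can be sketched from set membership alone. Cluster contents are represented as sets of transcription identifiers, and `min_size` stands in for the predetermined threshold:

```python
def taxonomy_edges(coarse, fine, min_size=1):
    # `coarse` and `fine` map cluster ids (at adjacent granularities)
    # to sets of transcription ids. Small clusters are dropped, and a
    # directed edge v1 -> v2 is added whenever the two clusters share
    # at least one transcription, v2 being one level finer than v1.
    coarse = {c: t for c, t in coarse.items() if len(t) >= min_size}
    fine = {c: t for c, t in fine.items() if len(t) >= min_size}
    return [
        (v1, v2)
        for v1, t1 in coarse.items()
        for v2, t2 in fine.items()
        if t1 & t2
    ]
```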
[0042] A top-down approach is preferred over a bottom-up approach
because it indicates the link between clusters of various levels of
granularity and also gives the most descriptive and discriminative
set of features associated with each node (topic). Descriptive
features are the set of features which contribute the most to the
average similarity between transcriptions belonging to the same
cluster. Similarly, discriminative features are the set of features
which contribute the most to the average dissimilarity between
transcriptions belonging to different clusters. These features are
later used for generating topic specific information.
[0043] FIG. 3 shows a part of the taxonomy obtained from the
dialogs from the internal IT help desk of a company. The labels
shown in FIG. 3 are the most descriptive and discriminative
features of a node given the labels of its ancestors.
[0044] Referring again to method 100 shown in FIG. 1, the taxonomy
generated in step 140 is augmented in step 150 with various topic
specific information related to each node, thereby creating an
augmented taxonomy. The topic specific information includes phrases
that describe typical actions, typical QAs and call statistics for
each topic (node) in the taxonomy.
[0045] Typical actions correspond to typical issues raised by the
customer, problems and strategies for solving such problems. The
inventors have observed that action related phrases are mostly
found around topic features. Accordingly, step 150 starts by
searching and collecting all the phrases containing topic words
from the documents belonging to the topic. In the preferred
implementation a 10-word window is defined around the topic
features, and all phrases from the documents are harvested. The set
of collected phrases are then searched for n-grams with frequency
above a preset threshold. For example, both the 10-grams "note in
click button to set up for all stops" and "to action settings and
click the button to set up" increase the support count of the
5-gram "click button to set up".
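The harvesting of phrases around topic features and the counting of n-gram support can be sketched as follows. This version credits exact containment only, whereas the application also credits near-matches via an insertion-tolerant distance:

```python
from collections import Counter

def harvest_phrases(tokens, topic_words, window=10):
    # Collect a `window`-word phrase around every occurrence of a
    # topic feature, as in step 150.
    half = window // 2
    return [
        tuple(tokens[max(0, i - half) : i + half])
        for i, tok in enumerate(tokens)
        if tok in topic_words
    ]

def ngram_support(phrases, n):
    # Every n-gram contained in a harvested phrase adds to the
    # support count of that n-gram.
    counts = Counter()
    for phrase in phrases:
        for i in range(len(phrase) - n + 1):
            counts[phrase[i : i + n]] += 1
    return counts
```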
[0046] The search for the n-grams proceeds based on a threshold on
a distance function that counts the insertions necessary to match
two phrases. For example, the phrase "can you" is closer to the
phrase "can<:::>you" than to the phrase
"can<:::><:::>you". Longer n-grams are allowed a higher
distance threshold than shorter n-grams. Next, all the phrases that
frequently occur within the cluster are extracted.
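One way to realize such an insertion-counting distance is to require the shorter phrase to be a subsequence of the longer one and count the leftover tokens. The sketch below makes that assumption; the application does not specify the distance in more detail:

```python
def is_subsequence(short, long):
    # True if `short` can be obtained from `long` by deletions only.
    it = iter(long)
    return all(tok in it for tok in short)

def insertion_distance(short, long):
    # Number of insertions needed to match the two phrases, as in the
    # "can you" vs "can <:::> you" example; None if they cannot be
    # aligned by insertions alone.
    if not is_subsequence(short, long):
        return None
    return len(long) - len(short)

def matches(short, long, max_insertions):
    # Longer n-grams would be given a higher threshold than shorter
    # ones; here the threshold is supplied by the caller.
    d = insertion_distance(short, long)
    return d is not None and d <= max_insertions
```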
[0047] Step 150 continues by performing phrase tiling and ordering.
Phrase tiling constructs longer n-grams from sequences of
overlapping shorter n-grams. The inventors have noted that the
phrases have more meaning if they are ordered by their respective
appearance. For example, if the phrase "go to the program menu"
typically appears before the phrase "select options from program
menu", then it is more useful to present the phrases in the order
of their appearance. The order is established based on the average
turn number in the dialog where a phrase occurs.
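Tiling and ordering can be sketched as two small helpers: tiling merges phrases on their longest suffix/prefix overlap, and ordering sorts phrases by the average dialog turn at which they occur:

```python
def tile(a, b):
    # Merge phrase `b` onto phrase `a` using the longest suffix of
    # `a` that is also a prefix of `b`; None if there is no overlap.
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

def order_by_turn(phrase_turns):
    # `phrase_turns` maps a phrase to the dialog turn numbers at
    # which it occurs; phrases are presented in order of their
    # average turn number.
    avg = lambda turns: sum(turns) / len(turns)
    return [p for p, t in sorted(phrase_turns.items(), key=lambda kv: avg(kv[1]))]
```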
[0048] Consider next the case of typical Question-Answer sessions
in a call center. To understand a caller's issue, the call-center
agent needs to ask the appropriate set of questions. Asking the
right questions is the key to handling calls effectively. All the
questions within a topic are searched
for by defining question templates. The question templates look for
all phrases beginning with the terms "how", "what", "can I", "can
you", "were there" etc. All 10-word phrases conforming to the
question templates are collected and phrase harvesting, tiling and
ordering is done on those 10-word phrases. For the answers a search
is conducted for phrases in the vicinity immediately following the
question.
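The question-template search can be sketched as a scan for fixed-length windows that begin with one of the template terms; the starter list mirrors the examples given above:

```python
QUESTION_STARTERS = ("how", "what", "can i", "can you", "were there")

def harvest_questions(tokens, phrase_len=10):
    # Collect every `phrase_len`-word phrase beginning with one of
    # the question-template terms.
    text = [t.lower() for t in tokens]
    questions = []
    for i in range(len(text)):
        window = text[i : i + phrase_len]
        joined = " ".join(window)
        if any(joined.startswith(s) for s in QUESTION_STARTERS):
            questions.append(tuple(window))
    return questions
```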
[0049] FIG. 4 shows a part of the topic specific information that
has been generated for the "default properti" node in FIG. 3 from
an example case comprising 200 transcription outputs 115, of which
123 contained the topic "default properti". In the example case,
phrases that occurred at least 5 times in these 123 transcription
outputs 115 were selected. The general
opening and closing styles used by the call-center agents in
addition to typical actions and QAs for the topic have been
captured. The transcription outputs 115 associated with the
"default properti" node in FIG. 3 pertain to queries on setting up
a new network connection, for example from AT&T. Most of the
topic specific issues that have been captured relate to the
call-center agent leading the caller through the steps for setting
up the network connection.
[0050] The following observations can be made from the topic
specific information that has been generated: [0051] Despite the
fact that the ASR step 110 introduces a lot of noise, the phrases
captured are well formed. The resulting phrases, when collected
over the clusters, are clean. [0052] Some phrases appear in
multiple forms. Consider for example the phrases, "thank you for
calling how can i help you", "how may i help you today", "thanks
for calling can i be of help today". While tiling is able to merge
matching phrases, semantically similar phrases are not merged.
[0053] With regard to call statistics, various aggregate
statistics for each node in the topic taxonomy are captured as part
of the domain specific model, namely: (1) average call duration (in
seconds), (2) average transcription length (in words), (3) average
number of speaker turns, and (4) number of calls. Generally, the
call durations and numbers of speaker turns vary significantly
from one topic to another.
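These four aggregates can be computed per taxonomy node with a single pass over the node's calls; the call-record field names used here are illustrative:

```python
def call_statistics(calls):
    # Each call is a dict with `duration` (seconds), `transcript`
    # (list of words) and `turns` (number of speaker turns).
    n = len(calls)
    return {
        "avg_call_duration": sum(c["duration"] for c in calls) / n,
        "avg_transcription_length": sum(len(c["transcript"]) for c in calls) / n,
        "avg_speaker_turns": sum(c["turns"] for c in calls) / n,
        "number_of_calls": n,
    }
```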
[0054] FIG. 5 shows average call duration and corresponding average
transcription lengths for a few topics of interest. It can be seen
that in topic cluster-1, which relates to expense reimbursement and
associated issues, most of the queries were answered relatively
quickly when compared to topic cluster-5, for example. Cluster-5
relates to connection related issues. Calls associated with
connection related issues require more information from callers and
are therefore generally longer in duration. Interestingly, topic
cluster-2 and topic cluster-4 have similar average call durations
but substantially different average transcription lengths.
Cluster-4 is primarily about printer related queries where the
customer is often not ready with details, such as printer name and
the Internet Protocol (IP) address of the printer, resulting in
long hold times. In contrast thereto, cluster-2, which relates to
online courses, has a shorter average transcription length because
users generally have details, such as the course name, ready, and
the calls are interactive in nature. It should be apparent to a
person skilled in
the art that various other clusters may be formed and used, which
fall within the scope of the present invention.
[0055] A hierarchical index of type {topic → information}
based on this automatically generated domain specific model is then
built for each topic in the topic taxonomy. An entry of this index
contains topic specific information namely (1) typical QAs, (2)
typical actions, and (3) call statistics. The information
associated with each topic becomes more and more specific the
further one goes down this hierarchical index.
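The {topic → information} index can be sketched as a nested mapping with a walk from coarse to fine topics. The node labels follow FIG. 3; the information payloads are left empty as placeholders:

```python
# Nested {topic -> information} index; each entry holds the node's
# topic specific information and its finer-grained children.
index = {
    "lotusnot": {
        "info": {"qas": [], "actions": [], "stats": {}},
        "children": {
            "copi archive replic": {
                "info": {"qas": [], "actions": [], "stats": {}},
                "children": {},
            },
        },
    },
}

def lookup(index, path):
    # Walk from coarse to fine; the information associated with each
    # topic becomes more and more specific further down the index.
    node = {"children": index}
    for topic in path:
        node = node["children"][topic]
    return node["info"]
```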
[0056] The domain specific model may be further refined by
semantically clustering topic specific information so that
redundant topics are eliminated. Topics in the model may also be
linked to
technical manuals, catalogs etc. already available on the different
topics in the given domain.
[0057] Having described the method 100 of building a domain
specific model from a collection of telephonic conversation
recordings 105, applications of the domain specific model are next
described.
[0058] Information retrieval from spoken dialog data is an
important requirement for call-centers. Call-centers constantly
endeavor to improve the call handling efficiency and identify key
problem areas. The described domain specific model provides a
comprehensive and structured view of the domain that can be used to
do both. The domain specific model encodes three levels of
information about the domain, namely: [0059] General: The taxonomy
along with the labels gives a general view of the domain. The
general information can be used to monitor trends on how the number
of calls in different categories changes over time e.g. daily,
weekly, monthly. [0060] Topic level: This includes a listing of the
specific issues related to the topic, typical customer questions
and problems, usual strategies for solving the problems, average
call durations, etc. The topic level of information can be used to
identify primary issues, problems and solutions pertaining to any
category. [0061] Dialog level: This includes information on how
agents typically open and close calls, ask questions and guide
customers, average number of speaker turns, etc. The dialog level
information can be used to monitor whether agents are using
courteous language in their calls, whether they ask pertinent
questions, etc.
[0062] The {topic → information} index requires identification of
the topic of each call in order to make use of the information
available in the domain specific model. Many of the callers'
complaints can be categorized into coarse as well as fine topic
categories by analyzing only the initial part of the call.
Exploiting this observation, fast topic identification is performed
using a simple technique based on the distribution of topic
specific descriptive and discriminative features within the initial
portion of the call.
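A minimal version of this fast identification scores each candidate topic by how many of its descriptive and discriminative features appear in the observed prefix of the call; the real technique's feature weighting may differ:

```python
def predict_topic(prefix_tokens, topic_features):
    # `topic_features` maps each topic to its set of descriptive and
    # discriminative features; the topic whose features occur most
    # often in the call prefix is predicted.
    scores = {
        topic: sum(1 for t in prefix_tokens if t in feats)
        for topic, feats in topic_features.items()
    }
    return max(scores, key=scores.get)
```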
[0063] FIG. 6 shows variation in prediction accuracy using the
method 100 as a function of the fraction of a call observed for 5,
10 and 25 clusters. It can be seen that, at the coarse level,
nearly 70% prediction accuracy can be achieved by analyzing the
initial 30% of the call, and that more than 80% of the calls can be
correctly categorized by analyzing only the first half of the call.
[0064] FIG. 7 shows a graph of the prediction accuracy achieved for
various clusters after analyzing 25%, 50%, 75% and 100% of the
call. It can be seen that calls related to some clusters can be
detected more quickly than calls related to other clusters.
[0065] A further application of the domain specific model is in
aiding and administrative tools. One such tool operates by aiding
call-center agents to handle calls efficiently, thereby improving
customer satisfaction as well as reducing call handling time.
Another is an administrative tool for agent appraisal and
training.
[0066] Call-center agent aiding is primarily driven by topic
identification. As can be seen from FIG. 6, in order to achieve 75%
prediction accuracy, a "level-1" topic, which corresponds to the
5-cluster level, can be identified within the first 30% of the
calls. Similarly, a "level-2" topic, which corresponds to a
10-cluster level, can be identified within the first 42% of the
calls, and a "level-3" topic, which corresponds to a 25-cluster
level, can be identified within the first 62% of the calls.
[0067] The hierarchical nature of the model assists in providing
generic to specific information to the agent as the call
progresses. For example, once a call has been identified as
belonging to topic {lotusnot} (FIG. 3), the call-center agent is
prompted with generic Lotus Notes related QAs and actions. Within
the next, say, 45 seconds the tool identifies the topic to be the
{copi archive replic} topic, and the typical QAs and actions in the
prompts change accordingly. Finally, the tool identifies the topic
as the {servercopi localcopi} topic and comes up with suggestions
for solving the replication problem in Lotus Notes.
[0068] The administrative tool is primarily driven by dialog and
topic level information. This post-processing tool is used for
comparing completed individual calls with corresponding topics
based on the distribution of QAs, actions and call statistics.
Based on the topic level information it can be verified whether the
agent identified the issues correctly, and offered the known
solutions on a given topic. The dialog level information is used to
check whether the call-center agent used courteous opening and
closing sentences. Calls that deviate from the topic specific
distributions are identified in this way and agents handling these
calls can be offered further training on the subject matter,
courtesy etc. This kind of post-processing tool may also be used to
identify abnormally long calls, agents with high average call
handle times etc.
[0069] The foregoing describes only some embodiments of the present
invention, and modifications and/or changes can be made thereto
without departing from the scope and spirit of the invention, the
embodiments being illustrative and not restrictive.
* * * * *