U.S. patent application number 13/012691 was filed with the patent office on 2011-01-24 and published on 2011-12-22 for a method for identifying emerging issues from textual customer feedback.
This patent application is currently assigned to OVERTONE, INC. Invention is credited to Scott Austin, Grant Foster, Guy Jones, Eric Scott.
Publication Number | 20110313962 |
Application Number | 13/012691 |
Family ID | 46327065 |
Filed Date | 2011-01-24 |
Publication Date | 2011-12-22 |
United States Patent Application | 20110313962 |
Kind Code | A1 |
Jones; Guy; et al. | December 22, 2011 |
METHOD FOR IDENTIFYING EMERGING ISSUES FROM TEXTUAL CUSTOMER
FEEDBACK
Abstract
Systems, methods and software products identify emerging issues
from textual customer feedback. A message stream of customer
feedback is received. The message stream includes a plurality of
unstructured text messages from at least one homogeneous source. A
time interval is established. The volume of text messages for the
time interval is determined to establish a reference volume. The
volume of text messages in subsequent time intervals is measured to
establish a trend volume. The trend volume is compared to the
reference volume to determine a volumetric change. At least one
action is initiated in response to a volumetric change above a
pre-determined threshold. At least one action is initiated in
response to a volumetric change below a pre-determined
threshold.
Inventors: | Jones; Guy (Vista, CA); Austin; Scott (Carlsbad, CA); Foster; Grant (Portland, ME); Scott; Eric (El Cajon, CA) |
Assignee: | OVERTONE, INC., Carlsbad, CA |
Family ID: | 46327065 |
Appl. No.: | 13/012691 |
Filed: | January 24, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11623694 | Jan 16, 2007 | 7899769 |
13012691 | | |
11286572 | Nov 23, 2005 | 7853544 |
11623694 | | |
60758813 | Jan 13, 2006 | |
60630858 | Nov 24, 2004 | |
Current U.S. Class: | 706/20 |
Current CPC Class: | G06F 40/30 20200101; G06F 16/353 20190101; G06F 2216/03 20130101; G06F 16/2477 20190101; G06F 16/35 20190101 |
Class at Publication: | 706/20 |
International Class: | G06F 15/18 20060101 G06F015/18 |
Claims
1. A method for processing unstructured text, comprising: receiving
a message stream, the message stream including a plurality of
unstructured text messages originating from at least one
homogeneous source; capturing at least a subset of the text
messages as an exploration set; displaying the text messages of the
exploration set to an analyst for review; receiving at least one
text category from the analyst for each displayed text message;
associating each text category with at least one text message
within the exploration set, the associated categories and messages
providing a classification model; initiating an automated training
process to categorize text messages based on the classification
model.
Description
RELATED APPLICATIONS
[0001] This patent application is a continuation of U.S.
application Ser. No. 11/623,694, filed Jan. 16, 2007, which claims
priority to Provisional Patent Application Ser. No. 60/758,813,
filed Jan. 13, 2006, and is a continuation-in-part of U.S. patent
application Ser. No. 11/286,572, filed Nov. 23, 2005, which issued
as U.S. Pat. No. 7,853,544 on Dec. 14, 2010, which claims priority
to Provisional Patent Application Ser. No. 60/630,858, filed on
Nov. 24, 2004, all of which are incorporated herein by
reference.
TECHNICAL FIELD
[0002] This invention is related in general to information
management systems and methods, and more particularly to a workflow
system that uses a human-trained text categorization engine to
analyze, process, and categorize data that contain natural language
text.
BACKGROUND
[0003] The availability of on-line communication that includes but
is not limited to e-mail, web-based feedback, and on-line chat has
generated an explosive growth in data communication that does not
come in the form of structured data, but rather as natural language
text in digital form. Consumers and businesses are now able to
communicate, execute transactions, and perform a variety of
electronic business functions online.
[0004] The sheer quantity and lack of structure pertaining to
natural language communications renders the complexity and cost of
extracting value from this information prohibitive in many cases.
Therefore, analyzing unstructured textual data and generating
insight from such content has posed challenges for researchers
analyzing customer communication, interests, and market trends. By
the same token, many messages go unread simply because targeting
large numbers of messages to appropriate parties within an
organization is too costly to be done by current methods.
[0005] With the near real time availability of information, at
times issues may arise which require speedy response, timely
alteration and/or correction of material information being
disseminated to consumers. Often the first opportunity for
awareness of such an issue occurs in the form of consumer feedback.
However, as noted, the lack of structure and the volume of feedback may
frustrate the ability to recognize and respond to such issues.
SUMMARY OF THE INVENTION
[0006] From the foregoing, it may be appreciated that a need has
arisen for a method for identifying emerging issues from textual
customer feedback.
[0007] In particular and by way of example only, according to one
embodiment, provided is a method for identifying emerging issues
from textual customer feedback, including: receiving a message
stream of customer feedback, the message stream including a
plurality of unstructured text messages from at least one
homogeneous source; establishing a time interval; determining the
volume of text messages for the time interval to establish a
reference volume; measuring the volume of text messages in
subsequent time intervals to establish a trend volume; comparing
the trend volume to the reference volume to determine a volumetric
change; initiating in response to a volumetric change above a
pre-determined threshold at least one action; and initiating in
response to a volumetric change below a pre-determined threshold at
least one action.
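By way of illustration only, and not as a limitation, the volumetric
comparison described in this embodiment might be sketched as follows.
The function names, the relative-change formula, the threshold values,
and the action strings are assumptions made for this sketch; the
application does not prescribe them.

```python
from collections import Counter
from datetime import datetime, timedelta


def volumes_by_interval(timestamps, start, interval):
    """Count messages per interval index, counting intervals from `start`."""
    counts = Counter()
    for ts in timestamps:
        if ts >= start:
            counts[(ts - start) // interval] += 1
    return counts


def volumetric_change(reference_volume, trend_volume):
    """Relative change of the trend volume against the reference (assumed formula)."""
    if reference_volume == 0:
        raise ValueError("reference volume must be non-zero")
    return (trend_volume - reference_volume) / reference_volume


def choose_action(change, upper=0.5, lower=-0.5):
    """Map a change to a hypothetical action; both thresholds are illustrative."""
    if change > upper:
        return "alert: volume spike"
    if change < lower:
        return "alert: volume drop"
    return None
```

Interval 0 would serve as the reference interval in this sketch, and
each later interval index would supply a trend volume for comparison.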
[0008] In yet another embodiment, provided is a method for
identifying emerging issues from textual customer feedback,
including: receiving a message stream of customer feedback, the
message stream including a plurality of unstructured text messages
from at least one homogeneous source; evaluating the volume of
customer feedback to establish a time interval; capturing within
a first time interval a subset of the text messages as a reference
set; establishing from the reference set a key term set, the key
term set including the frequency of occurrence for each key term;
capturing within a second time interval a subset of the text
messages as a sample set; evaluating the text messages of the
sample set to determine a change in the frequency of occurrences of
members of the key term set; initiating in response to a frequency
change above a pre-determined threshold at least one action; and
initiating in response to a frequency change below a pre-determined
threshold at least one action.
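By way of illustration only, the key term comparison of this
embodiment might be sketched as follows. The tokenization rule, the
choice of the most frequent terms as the key term set, and the `size`
parameter are assumptions of the sketch, not features recited in the
application.

```python
import re
from collections import Counter


def term_frequencies(messages):
    """Relative frequency of each lower-cased word token across the messages."""
    counts = Counter()
    for text in messages:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def key_term_set(reference_messages, size=3):
    """Keep the `size` most frequent terms of the reference set (assumed rule)."""
    freqs = term_frequencies(reference_messages)
    top = sorted(freqs, key=lambda t: (-freqs[t], t))[:size]
    return {t: freqs[t] for t in top}


def frequency_changes(key_terms, sample_messages):
    """Sample frequency minus reference frequency for each key term."""
    sample = term_frequencies(sample_messages)
    return {t: sample.get(t, 0.0) - ref for t, ref in key_terms.items()}
```

Each resulting change could then be compared against an upper and a
lower pre-determined threshold to decide whether an action is
initiated.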
[0009] In still yet another embodiment, provided is a computer
implemented method for identifying emerging issues from textual
customer feedback, including: receiving a message stream of
customer feedback, the message stream including a plurality of
unstructured text messages from at least one homogeneous source;
evaluating the volume of customer feedback to establish a time
interval; capturing within a first time interval a subset of the
text messages as a reference set; establishing from the reference
set a key term set, the key term set including the frequency of
occurrence for each key term; identifying emerging issues from the
stream of text messages, the identification including: capturing
within a second time interval a subset of the text messages as a
sample set; evaluating the text messages of the sample set to
determine a change in the frequency of occurrences of members of
the key term set; initiating in response to a frequency change
above a pre-determined threshold at least one action; and
initiating in response to a frequency change below a pre-determined
threshold at least one action.
[0010] Many advantages of the methods set forth are readily
apparent to one skilled in the art from the following figures,
descriptions, and claims.
BRIEF DESCRIPTION OF THE FIGURES
[0011] For a more complete understanding of the present invention
and advantages thereof, reference is now made to the following
descriptions taken in conjunction with the accompanying drawings,
wherein like reference numerals represent like parts, and in
which:
[0012] FIG. 1 is a flow chart showing one exemplary process for
exploring message data, training one or more text categorization
engines, publishing text classifiers and classifying unstructured
data for further processing;
[0013] FIG. 2 is an illustration of one embodiment of a system that
provides the capturing and loading of verbatim into a system data
source;
[0014] FIG. 3 is an illustration of one embodiment of a system that
manages the data set;
[0015] FIG. 4 is an illustration of one embodiment of a system that
provides the exploration of verbatim and the flagging of
interesting concepts and generation of their classifiers;
[0016] FIG. 5 is an illustration of one embodiment of a system that
monitors the performance of the concept classifier during the
training of the concept;
[0017] FIG. 6 is an illustration of one embodiment of a system that
provides the method of publishing the trained concepts;
[0018] FIGS. 7-A to 7-F are flowcharts that elaborate on an
embodiment of selected portions of the method of claim 1;
[0019] FIG. 8 shows a flowchart of an exemplary process for
providing early alerts to emerging issues and pattern anomalies in
customer feedback; and
[0020] FIG. 9 shows a flowchart of an exemplary process for
providing early alerts to emerging issues and pattern anomalies in
customer feedback.
DETAILED DESCRIPTION
[0021] In the following discussion, experience in computer
programming, database design and management, computer interface
design and implementation and use of software development tools is
assumed. The following references may also provide background
information to this application: Mitchell, T., Machine Learning,
Boston: McGraw-Hill, 1997; Manning, C. and Schütze, H., Foundations
of Statistical Natural Language Processing, Cambridge, Mass.: MIT
Press, 1999; and Sheskin, D. J., Handbook of Parametric and
Nonparametric Statistical Procedures (second edition), Boca Raton:
Chapman & Hall, 2000, each of which is incorporated herein by
reference.
[0022] Natural language text, encoded in digital form, is herein
termed unstructured data. This contrasts with structured data, in
which data is represented in some canonical format, from which
well-defined semantics can be inferred and automatically processed.
On the other hand, semantics of unstructured data defies inference
by a computer, making automated processing of the unstructured data
problematic. One method of processing unstructured data is to use
statistical methods, where categorization judgments are encoded as
structured attributes to facilitate determining semantics of the
unstructured data.
[0023] Natural language text comes in many varieties, serves many
purposes and arises from many sources. As such, different kinds of
text may differ dramatically from one another. Conversely, natural
language text that arises from a narrowly defined range of sources
may tend to range over a similarly narrow range of topics (such a
range of topics is called a domain herein). For example, email
messages from customers of a business directed to the business may
be one such domain, and news feeds issued by a news agency may be
another such domain. The email messages directed to the business
are probably relatively short messages relating primarily to issues
surrounding the goods and services produced by the business. The
news feeds are probably longer messages relating to world affairs.
The statistical distribution of linguistic features germane to each
of these domains is probably very different.
[0024] Unstructured data from a single domain may therefore be
considered to contain messages with characteristics that are
consistent to some degree, even for messages yet to be seen. Such
messages are called the message stream herein. One or more of such
message streams may be referred to hereinafter as `at least one
homogeneous source`.
[0025] One desire may be to automatically categorize messages from
a message stream into one or more categories. For example, a
message stream consisting of incoming emails may be categorized
into sales opportunities. In another example, a subscriber to a
news service may only be interested in stories that relate to the
economy and therefore desire that the message stream be so
categorized. In another example, an individual may wish to flag
unsolicited advertisements (`spam`) in his or her email message
stream for special treatment.
[0026] A user receiving a message stream may therefore wish to
train an automated system to automatically categorize messages from
the message stream. The automated system may include one or more
text categorization engines that use one or more text
categorization algorithms. Examples of text categorization
algorithms may be found in the field of machine learning, for
example naive Bayes, k-nearest-neighbor, etc.
[0027] Text categorization engines require training for a
particular message stream, herein called the target message stream.
A training corpus may be extracted from the target message stream
and typically consists of a large number of messages (usually
numbered in the hundreds or thousands) drawn randomly from the
target message stream. The text categorization algorithm also
requires that binary truth values be assigned to each message in
the training corpus. Each message is considered positive if it is
judged an instance of the target category, and negative if it is
not. Each text categorization algorithm extracts a set of features
from each message, and captures important aspects of these
features' statistical distributions to render judgments reflecting
future messages' likelihood of being positive with respect to the
target category. The features extracted from the messages are, for
example, words occurring in the message, but may also include other
attributes of the text within the message such as word length, word
stems, thesaurus entries associated with certain words, or any
other value which may be automatically associated with the content
of the messages.
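By way of illustration only, one text categorization algorithm of
the kind referred to above, a naive Bayes classifier over word
features, might be sketched as follows. The tokenizer, the add-one
smoothing, and all function names are assumptions of the sketch, and
the sketch presumes the training corpus contains at least one
positive and one negative message.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Extract lower-cased word features from a message."""
    return re.findall(r"[a-z']+", text.lower())


def train(corpus):
    """corpus: list of (message, is_positive) pairs carrying binary truth values."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in corpus:
        doc_counts[label] += 1
        word_counts[label].update(tokens(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_odds(model, text):
    """Log-odds that `text` is positive for the category (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    score = math.log(doc_counts[True] / doc_counts[False])
    totals = {c: sum(word_counts[c].values()) + len(vocab) for c in (True, False)}
    for w in tokens(text):
        if w in vocab:  # words never seen in training carry no evidence
            score += math.log((word_counts[True][w] + 1) / totals[True])
            score -= math.log((word_counts[False][w] + 1) / totals[False])
    return score
```

A positive log-odds value indicates that the message is judged more
likely than not to belong to the target category.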
[0028] One approach to assigning truth values to messages in a
training corpus involves presenting instances from the corpus to a
human user, who assigns truth values to the messages based on his
or her own subjective understanding of the category. An interactive
computer program dedicated to facilitating the training of a text
categorization engine is called a training system. In general, the
more truth values assigned to messages in a training corpus, the
more data points there are to inform the text categorization
algorithm, and, therefore, the more accurate a text categorization
engine based on that algorithm becomes. However, the user faces
diminishing returns since the incremental improvement amount
reduces with each successive new data point. This raises the
question: how will the user know when sufficient truth values have
been provided to adequately train the text classification engine?
This concept is addressed below, and is one major advantage of
systems and methods disclosed herein.
[0029] The preceding discussion assumes that a user already knows
which text categories are of interest, or at least which text
categories pertain to the target message stream. The proposed
system may also provide tools to allow a user to explore messages
of a message stream, and to tentatively identify interesting and
useful categories within those messages, prior to training one or
more text categorization engines.
[0030] Once a category of interest is identified, and a text
categorization engine is trained such that its categorization
performance meets a specified performance criterion, the text
categorization engine is ready for use. In the system described
below, a text classifier, which includes the categorization
information trained into the text categorization engine, may be
published so that it is automatically applied to new messages
received from the message stream, and defines rules to
automatically process messages based on the categories assigned to
them.
[0031] FIG. 1 is a flow chart showing one exemplary process 10 for
exploring, training, publishing, and classifying unstructured data.
In process 10, a user may explore data from a target message stream
12 to identify interesting categories, train one or more text
categorization engines to categorize received messages, publish one
or more text classifiers, and use the text classifiers within text
categorization engines to categorize messages from the target
stream for further processing.
Specifying and Capturing Data from a Message Stream
[0032] In a data capture phase of process 10, represented by steps
14, 16, 18, 20 and 22 of FIG. 1, messages are captured from target
message stream 12 and imported for analysis. Messages may initially
be in the form of files stored on disk, records in a database, a
live feed from incoming internet traffic or some other source. In
one example, a user may have already collected and stored exemplary
messages from a target data stream as a tabulated data file where
each line in the file expresses a record associated with a message,
and each field relates to one datum associated with the record. The
first line of the file may, for example, be a `header` naming each
field.
[0033] In this example, the user may be prompted to indicate a
location of one or more files rendered in this tabular format
within a file system. FIG. 2 shows one exemplary screen that allows
the user to select a data source for import. During such an import,
the user may be prompted to associate each file with a `data
source` label to identify the message stream from which the data is
taken. Tools may also be provided to allow the user to specify
details of the format of imported data, including record and column
delimiters, and a mapping between one column in each record and the
target message body. The user may also be prompted to specify a
mapping between other columns and the names of any other structured
attributes to be associated with each message during import.
Preferably, each message has at least one structured attribute
assigned to it, such as a date or timestamp indicating or
reflecting the time of origin.
[0034] In step 14 of process 10, the data stored in such tabular
form is imported. For example, step 14 may read in and parse one or
more files whose fields are delimited by some character, specified
by the user when prompted. The important fields may be captured and
stored in a database to be used by the system, with the user
selecting fields for the text of each message, a data stream
identifier, a timestamp, and imported text attributes, as
applicable. A standard graphical user interface may be used to
prompt the user as necessary. FIG. 7-A shows exemplary user
interaction during step 14 of process 10.
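By way of illustration only, the import of step 14 might be sketched
as follows for a delimited file whose first line is a header. The
function name, the parameter names, and the dictionary layout of each
stored record are assumptions of the sketch.

```python
import csv


def import_messages(handle, delimiter, body_column, timestamp_column, data_source):
    """Parse a delimited file whose first line is a header naming each field."""
    records = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        records.append({
            "data_source": data_source,           # label chosen by the user
            "body": row.pop(body_column),         # column mapped to the message text
            "timestamp": row.pop(timestamp_column),
            "attributes": dict(row),              # remaining columns kept as attributes
        })
    return records
```

In a fuller implementation the resulting records would be written to
the system database rather than returned as a list.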
[0035] Since each body of imported data may be assigned a `data
source` label, one organization may deal with one or more uniquely
named message streams, each with its own statistical distributions.
There may be a predefined `default` data stream, which assumes all
the user's data ranges over the same domain.
[0036] A text categorization engine trained with data from a given
message stream is preferably applied to future messages from the
same, or an equivalent, message stream, since the basis for
statistically driven text categorization engines assumes that
messages within a given message stream have similar distributions
of features; performance may degrade to the extent that this
assumption is violated.
[0037] In a preferred embodiment, a new message stream may be
created by specifying a filter that takes data from an already
specified source and applies a condition/action rule to make some
subset of the input available to the new message stream (see below
for a discussion of condition/action rules). Such a filter would
have the advantage of providing a richer vein of data for
relatively rare text categories.
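By way of illustration only, such a filter might be sketched as
follows. The record layout and the representation of the condition as
a callable are assumptions of the sketch.

```python
def derive_stream(messages, condition, new_source):
    """Yield relabeled copies of the messages that satisfy `condition`."""
    for msg in messages:
        if condition(msg):
            # Copy the record so the original stream is left unchanged.
            yield {**msg, "data_source": new_source}
```

For example, a condition that tests for the word "refund" would make
a narrower stream available in which refund-related categories are
less rare.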
[0038] Thus, at the end of the data capture phase of process 10, a
body of text messages and associated attributes may exist in a
database for use in the remaining phases of the process.
[0039] In one embodiment, an explicit data stream label is
eschewed. This simplifies the user interface and the implementation
in the code base, but may lead to sub-optimal categorization
performance if statistical distributions between the text that is
used for training, and the text to be processed by the resulting
text categorization engine, differ radically.
[0040] In another embodiment, the requirement that structured
attributes be associated with the text to be imported may be
removed to allow for the import of one or more `flat files` of text
and text in some other format.
[0041] In another embodiment, the data to be imported is taken from
a live feed of data from a source such as a stream of incoming
email messages.
[0042] In another embodiment, the tabular data is imported from
some external database, rather than spreadsheet files, and field
mappings are made through database queries rather than reading a
header file. In such cases, the user may not be required to provide
information about file delimiters.
[0043] In another embodiment, the above mentioned filter is not
used to define new data streams.
The Exploration Process
[0044] In order for a human analyst to subjectively understand the
contents of the data under examination, it is often useful to allow
the analyst to review the contents of a number of messages, and
simply make notes, group similar messages together, and flag
interesting messages in a free-form environment. The system discussed
below provides tools to facilitate such an exploration.
[0045] The exploration process begins when the user specifies a
message stream by indicating the `data source` label, then
extracting an exploration set 16 from the message stream associated
with that label during the data capture phase (see FIG. 1, steps 14
and 16). In the preferred embodiment, each message has a timestamp,
and the user specifies a date range and a numeric upper limit on
the size of the exploration set. A sample of messages corresponding
to this specification is then drawn randomly from the available
messages of the target message stream imported into the system.
Data-source labels and timestamp fields are, for example,
associated with message texts in the database during the data
import process described above.
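By way of illustration only, the drawing of exploration set 16 might
be sketched as follows. The record layout, the inclusive treatment of
the date range, and the optional `seed` parameter are assumptions of
the sketch.

```python
import random
from datetime import date


def draw_exploration_set(messages, source, start, end, limit, seed=None):
    """Randomly sample up to `limit` messages of `source` dated within [start, end]."""
    eligible = [m for m in messages
                if m["data_source"] == source and start <= m["timestamp"] <= end]
    rng = random.Random(seed)
    return rng.sample(eligible, min(limit, len(eligible)))
```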
[0046] During the exploration process of step 18, messages from
exploration set 16 are displayed in a graphical user interface,
where, at any one time, a plurality of messages of the exploration
set may be presented to the user. The user reviews the contents of
each presented message and determines whether the message belongs
to a text category worth training. The user may either assign a new
`flag` to the message or assign one of a set of already declared
flags to the message. In the exploration interface, each message
bears the mark of as many flags as have been assigned to it during
exploration. The name of each flag can be easily edited, and in a
preferred embodiment a set of notes is kept for each flag, so that
the representation of each flag can evolve as the user continues
the exploration process. The implementation of a flag can use a
standard `container` software pattern, with persistent properties
for fields for a label and notes, the contents of such a container
being each message record associated with the flag during
exploration. FIG. 7-B shows exemplary user interaction during
exploration of exploration set 16.
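By way of illustration only, the container pattern for a flag might
be sketched as follows. The class and field names are assumptions of
the sketch, and persistence to a database is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """A named container of message ids, with editable label and notes."""
    label: str
    notes: str = ""
    message_ids: set = field(default_factory=set)

    def assign(self, message_id):
        """Associate a message with this flag."""
        self.message_ids.add(message_id)


def flags_on(message_id, flags):
    """Labels of every flag that has been assigned to the message."""
    return sorted(f.label for f in flags if message_id in f.message_ids)
```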
[0047] In a preferred embodiment, at any time the user may indicate
a specific word or phrase, and a message listing may render
messages containing that word or phrase. This may allow the user to
identify a number of messages which have important lexical content
in common. This can be implemented in a code base if each message
is tokenized into a sequence of word tokens, and `indexed` so that
mappings between each token's associated word form, and messages
containing tokens of that word are stored in the database.
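By way of illustration only, the tokenizing and indexing described
above might be sketched as follows, using an in-memory mapping in
place of the database. The tokenizer and the function names are
assumptions of the sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a message into lower-cased word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(messages):
    """Map each word form to the ids of the messages containing it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for tok in tokenize(text):
            index[tok].add(msg_id)
    return index


def find(messages, index, query):
    """Ids of the messages containing the query word or phrase."""
    words = tokenize(query)
    if not words:
        return set()
    # Narrow to messages containing every word, then confirm adjacency.
    candidates = set.intersection(*(index.get(w, set()) for w in words))

    def has_phrase(text):
        toks = tokenize(text)
        return any(toks[i:i + len(words)] == words
                   for i in range(len(toks) - len(words) + 1))

    return {m for m in candidates if has_phrase(messages[m])}
```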
[0048] In a preferred embodiment, the user may specify such words
and phrases by consulting a listing of words found in the
exploration set and indexed (see FIG. 4 item 410 and FIGS. 7-C, 7-D
and 7-E). Let us call this a lexical listing. The user may specify
whether such a listing can be ordered alphabetically, or sorted by
`significance`, which reflects the frequency of each word in the
exploration set, compared to what would be expected given its
observed frequency in a background corpus of the target natural
language, whose lexical frequencies are derived from a broad
range of messages from many domains. This feature can be
implemented by collecting a number of texts in the same language as
the data set under examination (but ranging over a wider set of
domains), then counting the number of occurrences of each token in
the background corpus, along with the sum of all tokens in the
background corpus. It should be straightforward to one skilled in
the art of corpus linguistics to calculate an estimate of the
expected frequency of occurrences for each word form per unit of
text, and derive a significance score based on the observed
frequency of each word form in the data set under examination as
compared to this expected frequency.
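By way of illustration only, one possible significance score might be
sketched as follows. The application does not specify a formula; the
log ratio of observed to expected count and the add-one smoothing of
the background rate are assumptions of the sketch.

```python
import math
from collections import Counter


def significance(observed_counts, background_counts):
    """Log ratio of observed to expected count for each word form (assumed score)."""
    observed_total = sum(observed_counts.values())
    background_total = sum(background_counts.values())
    vocab_size = len(set(observed_counts) | set(background_counts))
    scores = {}
    for word, observed in observed_counts.items():
        # Add-one smoothing keeps words absent from the background finite.
        rate = (background_counts.get(word, 0) + 1) / (background_total + vocab_size)
        expected = rate * observed_total
        scores[word] = math.log(observed / expected)
    return scores
```

Sorting the lexical listing by descending score would then place
words that are unusually frequent in the exploration set at the top.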
[0049] Again in the preferred embodiment, at any time during the
exploration process, the user may indicate a specific flag and
review messages to which the flag has been assigned. Let us call
this a by-flag listing. This can be implemented by a simple
retrieval of the contents associated with the flag in question, and
displayed using practices common to graphical user interfaces.
After reviewing the set of messages so flagged, the user may gain
insight into whether the category has been properly named, or
realize that the flag represents a very broad category, and that
other flags with narrower scope might be better assigned.
[0050] The exploration process is advantageous because in practice,
a user may not be completely familiar with the data s/he is dealing
with, or s/he may have misconceptions about them. This allows the
user to get a good subjective understanding of the contents of the
message stream under review, and be able to identify which
categories are interesting, and which may be good subjects on which
to train a text classifier.
[0051] An important output of the exploration process is a
persistent set of named `flags` (see FIG. 1, step 20), with which
are associated several positive instances, and notes the user has
taken. The user may then promote any one or more of these flags
into a training session 24, which produces a classifier for the
category associated with the flag in the next phase of the
workflow, as described below.
[0052] One embodiment eschews the use of the background corpus to
render `significance` information in the lexical listing. Another
embodiment excludes the use of a lexical listing altogether.
Another embodiment does not include a flag listing. Yet another
embodiment forgoes use of an exploration phase altogether; the user
simply specifies the names of categories to train, proceeding
directly to training phase 24.
The Training and Audit Process
[0053] Having identified the important categories of text one is
likely to encounter in one's data, it is greatly advantageous to be
able to automatically recognize messages that are likely to be
members of those categories. To do this, training data is provided
whose statistical distributions can inform automatic text
categorization algorithms.
[0054] In a preferred embodiment, this process begins with the
creation of a training session, with two subsets of the exploration
set: a training set and an audit set. Training and audit sets are
also disjoint, so that no message in one set belongs to the other.
The audit set may be large enough to guarantee a suitably small
sampling error, determinable through standard statistical means.
The training session may also be dedicated to one or more target
categories (see FIG. 1 step 24). In the preferred embodiment, more
than one category may be trained during any training session. This
may be implemented by maintaining separate persistent containers
for training and audit sets, whose members are drawn from the
exploration set described above, which should be straightforward to
an experienced programmer with database programming skills.
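By way of illustration only, the creation of disjoint training and
audit sets might be sketched as follows. The sizing rule shown, which
uses the worst-case margin of error for a sampled proportion at
roughly 95% confidence, is one standard statistical means and is an
assumption of the sketch, as are the function names.

```python
import math
import random


def audit_size_for_margin(margin, z=1.96):
    """Smallest sample size whose worst-case margin of error is at most `margin`."""
    return math.ceil(z * z * 0.25 / (margin * margin))


def split_training_audit(exploration_ids, audit_size, seed=None):
    """Partition the exploration set ids into disjoint (training, audit) lists."""
    ids = list(exploration_ids)
    if audit_size > len(ids):
        raise ValueError("audit set cannot exceed the exploration set")
    random.Random(seed).shuffle(ids)
    return ids[audit_size:], ids[:audit_size]
```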
[0055] After the user specifies the categories to be targeted in
training (FIG. 1 step 22), the system automatically creates a
training session 24 with training and audit sets, and composes the
first page of training data (see FIG. 1 step 26). In a preferred
implementation, for each page of the training session (FIG. 1 step
28), messages are displayed in one column, with `check box` columns
corresponding to each category being trained during the session
(see FIG. 4).
[0056] During training session 24, messages from the training and
audit sets are presented one page at a time to the user. The user
reviews each message in turn, and indicates truth values as to
whether the message is positive for each category being trained
during the session (FIG. 1 step 30). The system may maintain and
store a mapping between each message and its truth value with
respect to each category associated with the training session.
[0057] In a preferred implementation, training and audit messages
are displayed interleaved on each page so that half of the messages
under review are from the audit set until the audit set has been
exhausted. This ensures that early on in the training process the
margin of error may be relatively small, and performance feedback
may be stable.
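By way of illustration only, the composition of an interleaved page
might be sketched as follows. The queue representation and the
function name are assumptions of the sketch. The origin tag attached
to each message is for internal bookkeeping and would not be shown to
the user.

```python
def compose_page(training_queue, audit_queue, page_size):
    """Pop messages for one page, alternating audit and training items."""
    page = []
    while len(page) < page_size and (training_queue or audit_queue):
        # Audit items fill even slots, and every slot once training runs out.
        take_audit = audit_queue and (len(page) % 2 == 0 or not training_queue)
        if take_audit:
            page.append((audit_queue.pop(0), "audit"))
        else:
            page.append((training_queue.pop(0), "training"))
    return page
```

Once the audit queue is exhausted, the remaining slots of each page
are filled from the training queue alone.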
[0058] In a preferred implementation, messages which have been
flagged for the target category during exploration are presented as
`checked`, and unchecked otherwise, until the user explicitly
checks/unchecks them.
[0059] The user may be unaware as to which messages were drawn from
the audit set and which from the training set, to avoid introducing
a bias.
[0060] When the user has assigned truth values for each message on
a page for each category being trained, s/he presses a `next page`
button, and the following things happen for each target category:
[0061] Messages in the page just visited drawn from the training
pool are added to the training instances that inform a text
categorization engine, which recalculates its model of its
associated category (FIG. 1 step 32). [0062] Messages in the page
just visited from the audit pool are added to the audit set for the
target category (FIG. 1 step 34), and the messages in the newest
version of the audit set are classified with the newly updated
classification engine (FIG. 1 step 36). [0063] Judgments of the
categorization engine are compared to truth values assigned by the
user to messages in the audit set, and a performance in text
categorization such as Precision, Recall, F1 and Correlation
Coefficient may be utilized. [0064] Performance score 38 is
compared to a threshold specified prior to the training session by
the user (FIG. 1 step 40). To implement this, the interface may
provide a means for the user to specify this threshold, and to
store this threshold per-category in the database. [0065] If
performance meets or exceeds the performance threshold, the user is
prompted with the option to discontinue training (FIG. 1 step 42),
and publish the classifier (FIG. 1 step 46). Publication is
discussed below. [0066] If performance has not yet met its target,
or the user opts to continue training, a new page of training and
audit data is composed and presented to the user, and the training
process continues (FIG. 1 step 44).
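By way of illustration only, the comparison of engine judgments to
user-assigned truth values might be sketched as follows. The function
names are hypothetical and not part of this disclosure, and the
correlation coefficient is assumed here to be the Pearson correlation
of the two Boolean series (the phi coefficient); the text above does
not specify which form is used.

```python
import math


def confusion_counts(judgments, truths):
    """Count true/false positives and negatives for paired Boolean lists."""
    pairs = list(zip(judgments, truths))
    tp = sum(1 for j, t in pairs if j and t)
    fp = sum(1 for j, t in pairs if j and not t)
    fn = sum(1 for j, t in pairs if not j and t)
    tn = sum(1 for j, t in pairs if not j and not t)
    return tp, fp, fn, tn


def performance_scores(judgments, truths):
    """Return Precision, Recall, F1 and correlation for one audit set."""
    tp, fp, fn, tn = confusion_counts(judgments, truths)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumption: phi coefficient, i.e. Pearson correlation of two
    # binary variables; defined as 0.0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "correlation": correlation}
```

A caller would pass the engine's positive/negative judgments and the
user's truth values for the same audit messages, in the same order.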
[0067] Thus, with each new page, the performance score is presented
for each target category in the training session indicating how
well the text classifier is performing against the audit set.
[0068] In a preferred embodiment, performance score 38 is a
correlation coefficient. This has certain advantages, such as
having a well understood and easily calculated margin of error, so
that a lower bound of the confidence interval can be taken as a
critical benchmark of performance. Correlation coefficient is a
standard measure of correlation known to the art of statistics.
[0069] Thus, a lower bound on performance against the audit set can
be compared to some predetermined threshold to determine when
enough training has been performed for any given target category.
At such point, the user may feel confident that s/he can stop
training and publish a classifier which is likely to meet its
specified performance goals.
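One possible way to obtain such a lower bound is sketched below. It
assumes the Fisher z-transformation, a standard approximation for the
confidence interval of a correlation coefficient; this disclosure
does not state which method is used, and the function names and the
95% critical value are illustrative assumptions.

```python
import math


def correlation_lower_bound(r, n, z_crit=1.96):
    """Approximate lower confidence bound for correlation r on n messages.

    Assumption: Fisher z-transformation with standard error
    1 / sqrt(n - 3); z_crit=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        return -1.0  # too few audit messages to bound performance
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * standard_error)


def enough_training(r, n, threshold):
    """True when the lower bound meets the user-specified threshold."""
    return correlation_lower_bound(r, n) >= threshold
```

Under this approximation the bound tightens toward the observed
correlation as the audit set grows, which is consistent with the
margin of error shrinking as more audit messages are reviewed.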
[0070] Many classification algorithms produce a numeric interim
score for each message reflecting a probabilistic judgment as to
the degree of match, then make binary (positive/negative)
classification judgments comparing the interim score to some
threshold. In such cases, it is possible to select an optimal
threshold by searching a set of possible thresholds, and selecting
the threshold that maximizes the performance score. Such a
calculation uses values calculated by the text categorization
algorithm.
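A minimal sketch of such a threshold search follows. It assumes that
the candidate thresholds are the distinct interim scores observed on
the audit set and that F1 is the performance score being maximized;
both choices, and all names, are illustrative rather than prescribed
by this disclosure.

```python
def f1_score(judgments, truths):
    """F1 for paired Boolean lists; 0.0 when undefined."""
    tp = sum(1 for j, t in zip(judgments, truths) if j and t)
    fp = sum(1 for j, t in zip(judgments, truths) if j and not t)
    fn = sum(1 for j, t in zip(judgments, truths) if not j and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, truths, candidates=None, metric=f1_score):
    """Return (threshold, score) maximizing metric over the candidates."""
    if candidates is None:
        # Assumption: try each distinct interim score as a threshold.
        candidates = sorted(set(scores))
    best = None
    for threshold in candidates:
        judgments = [s >= threshold for s in scores]
        value = metric(judgments, truths)
        if best is None or value > best[1]:
            best = (threshold, value)
    return best
```

Any other performance score may be substituted by passing a
different `metric` function with the same two arguments.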
[0071] Thus, at the end of the training phase, the system may
represent, for each of its target categories, the parameters
pertinent to at least one text classification algorithm, informed
by the training data provided, so that the classification algorithm
renders a classification score for any text presented to it as
input. In a typical case, this would involve identifying a set of
significant features attainable from any text, where each such
feature is assigned a weight or coefficient, such that a
calculation can be performed to return a value reflecting the
likelihood that the text in question is an instance of the target
category. In the typical case there may also be global parameters
such as a prior probability for the category, and an optimal
threshold that applies to the target category generally rather than
being associated with any particular feature. These parameters may
exist in a database representation of the classifier so that they
can be retrieved and applied to future texts.
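For illustration, the stored parameters of one classifier might be
represented as sketched below. The sketch assumes word-presence
features, additive log-odds weights and a logistic link; these are
assumptions made for the example, not a statement of the algorithm
used, and the class and field names are hypothetical.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class StoredClassifier:
    """Per-category parameters as they might be loaded from a database."""
    category: str
    weights: dict      # feature (word) -> coefficient
    prior: float       # prior probability of the category (global)
    threshold: float   # optimal threshold for the category (global)

    def score(self, text):
        """Return a value in (0, 1) reflecting likelihood of a match."""
        features = set(re.findall(r"[a-z0-9']+", text.lower()))
        log_odds = math.log(self.prior / (1.0 - self.prior))
        log_odds += sum(self.weights.get(f, 0.0) for f in features)
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text):
        """Binary judgment: compare the interim score to the threshold."""
        return self.score(text) >= self.threshold
```

A text containing no weighted features scores at the prior
probability, so the global parameters alone determine its judgment.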
[0072] In an alternative embodiment, performance score 38 is
displayed on a screen (see FIG. 4 item 410), and the user is not
prompted with the option to publish, but rather chooses for himself
when to publish in consultation with the performance score
display.
[0073] In an alternate embodiment, in circumstances where instances
of the category can be identified with one or more structured
attributes (such as a Boolean attribute suggesting that the author
bought a product), the training process can be fully automated,
with truth values assigned automatically by referring to the
structured attributes in question. We refer to this as a `Learn by
Example` feature (see FIG. 6). In this embodiment, the user creates
data for import as described above, but after indicating a field
whose values are a basis for truth values, training and audit is
done without having to page through each training instance.
[0074] In another alternate implementation, performance score 38 is
optimized for a weighted score combining 1) the risk associated
with a false negative (a missed positive instance), and 2) the cost associated with
processing a false positive, and this process provides a means for
specifying these risks and costs as they pertain to the user's
particular application.
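One way such a weighted score might be applied is sketched below,
assuming the user supplies a single numeric risk per false negative
and a single numeric cost per false positive, and that the quantity
minimized is their weighted sum over the audit set. The names and the
linear form of the weighting are illustrative assumptions.

```python
def weighted_cost(judgments, truths, fn_risk, fp_cost):
    """Total of user-specified risk and cost over one audit set."""
    fn = sum(1 for j, t in zip(judgments, truths) if t and not j)
    fp = sum(1 for j, t in zip(judgments, truths) if j and not t)
    return fn_risk * fn + fp_cost * fp


def cost_optimal_threshold(scores, truths, fn_risk, fp_cost):
    """Pick the threshold on interim scores that minimizes weighted cost."""
    # Assumption: candidates are the observed scores, plus infinity
    # to represent "never classify as positive".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda th: weighted_cost(
            [s >= th for s in scores], truths, fn_risk, fp_cost),
    )
```

With a high false-negative risk the selected threshold moves lower,
admitting more false positives; with a high false-positive cost it
moves higher.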
Publishing a Classifier
[0075] Having provided training data to train a text categorization
engine, and evaluated the categorization performance of that engine
with respect to the target category, it is useful to save important
parameters of the categorization engine so that they can be
retrieved and applied to categorize future texts.
[0076] In a preferred embodiment, when the minimum performance
criterion has been met, process 10 gives the user the option of
publishing the classifier (FIG. 1 step 46), after which point the
set of published classifiers (FIG. 1 step 48) actively assigns
categorization scores to new messages drawn from the target message
stream as they are imported into the system. This means that each
message from the message stream processed by the system may be
associated with a structured attribute named after the category,
and be assigned the score returned by the classification engine
reflecting the likelihood that the message is a positive instance
of the category. To implement this in the code base, the system may capture
text messages from the same (or equivalent) source as the original
training messages. Then it applies the classification algorithm
trained during the training phase, using the parameters specific to
each target category for that algorithm stored during the training
phase. This should be straightforward for a programmer with skills
in text categorization and with database skills.
[0077] In the preferred implementation, more than one classifier
may be available, based on competing algorithms or configuration
parameters. These may each be trained with the target classifier's
training set, and tested against its audit set, in turn. A
classifier whose performance score 38 on the audit set is highest
is then selected as the preferred classifier for that category.
[0078] Thus, at the end of the training phase of the process, the
system may represent and persist whatever data are needed to inform
an optimal classification engine, so that it can classify any
arbitrary text for any of the target categories trained.
[0079] An alternative embodiment might only use a single
classification algorithm rather than using the competing algorithms
described above.
Programming the System to Respond to Message Categorizations and
Attributes
[0080] Having attained an ability to automatically categorize texts
with a text classification engine, the ability to further automate
a process dealing with texts falling into one or more of those
categories (for example, automatically routing emails to some
member within the user's organization) may be useful. To accomplish
this, the next phase in the workflow described here allows the user
to specify which actions the system may take, and under what
circumstances to take them.
[0081] A preferred embodiment of the invention also has a feature
that allows the user to write executable rules to automatically
assess one or more attributes assigned to a message taken from a
given message stream, and take action accordingly. As a general
process, this is done as follows: [0082] The user identifies a
target message stream (or relies on a default); [0083] The user
defines a set of conditions that may hold per message in order to
trigger some action or actions; [0084] The user defines a set of
actions that may be taken when the set of conditions is met.
[0085] In the preferred embodiment, this step is also included:
[0086] The user names the rule, and optionally adds notes.
Specifying Conditions
[0087] In order to implement a condition/action rule-based system,
it is necessary to specify the `condition` part of the rule, which
may evaluate to `true` when the conditions specified are met, and
automatically triggers the `action` part of the rule as the rule is
systematically applied to each message fed into the system.
[0088] In general, we assume a model where each message has some
set of attributes which may be associated with it. Such attributes
may be structured attributes, which were assigned to the message
from some process external to the invention, and imported into the
system coincidentally during import of the message. Examples of
structured attributes might be an author's name, social security
number or zip code, or any other datum which the user was able to
associate with the message during whatever process captured the
message prior to its import into the system. This may be done by
representing the data in tabular form prior to use by the system,
as discussed above.
[0089] Other attributes may simply be inherent to the data, such as
whether its text matches a given regular expression.
[0090] Of course one important type of attribute is the kind of
attribute that was trained by the classifier training algorithm
described above, and assigned to each message on import. It should
be clear to one skilled in the art of text categorization how to
associate such attributes to arbitrary texts.
[0091] The rules described here are grouped into suites called
programs, and another type of attribute is one which was itself
established as a temporary variable, established by the action of
some rule that executed for the target message prior to the
assessment of conditions for the rule at hand (see below).
[0092] Simple conditions within a rule can then be built by
applying standard comparison operators (>, <, =, etc.) to the
values associated with each such attribute in the
manner typical of rule-based systems.
[0093] The interface provided by the system allows the user to use
the standard Boolean operators (and, or, etc.) to recursively build
up complex conditions from combinations of simple conditions. The
process of building complex Boolean expressions out of simple ones
is commonplace in computer programming, and it should be clear to
one skilled in the art how to implement such functionality.
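As a non-limiting sketch, simple and complex conditions might be
represented as nested tuples and evaluated recursively against the
attributes of a message. The tuple layout, the operator names and the
`matches` form for regular expressions are hypothetical choices made
for this example.

```python
import operator
import re

COMPARATORS = {
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
}


def evaluate(condition, attributes):
    """Evaluate a nested condition against one message's attributes."""
    kind = condition[0]
    if kind == "and":
        return all(evaluate(c, attributes) for c in condition[1:])
    if kind == "or":
        return any(evaluate(c, attributes) for c in condition[1:])
    if kind == "not":
        return not evaluate(condition[1], attributes)
    if kind == "matches":
        # Attribute inherent to the data: text matches a regular expression.
        _, name, pattern = condition
        return re.search(pattern, str(attributes.get(name, ""))) is not None
    # Simple condition: (comparison operator, attribute name, value).
    op, name, value = condition
    if name not in attributes:
        return False  # assumption: a missing attribute never satisfies
    return COMPARATORS[op](attributes[name], value)
```

For example, a condition requiring a classifier score above 0.8
together with either a given zip code or a text pattern would be
written as `("and", (">", "login_problem", 0.8), ("or", ("=", "zip",
"92008"), ("matches", "text", "refund")))`.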
Specifying Actions
[0094] An action is some programmatic behavior that is
systematically taken upon each message examined by a program,
whenever that message evaluates as positive with respect to the
program's condition component.
[0095] Possible actions to specify include, but are not limited to:
[0096] Assigning a structured attribute to the message at hand
within some relational table. [0097] Assigning a value to a
temporary variable, which a condition of some other rule can
reference when applied to the message at hand. [0098] Calling some
program external to this invention, with some set of input
arguments.
[0099] Whenever a structured attribute is assigned to a message
during the execution of a rule, it becomes visible in the database
managed by the system, allowing future users to use that attribute
as a filter in defining future message streams, or as a basis for
business reporting about the nature and contents of incoming
messages.
[0100] The fact that temporary variables can be assigned in one
rule and assessed in another rule facilitates a regime in which a
number of relatively simple rules can combine in a suite. When
implementing, the implementer guarantees that evaluation of such
rules is scheduled so that variables are set first and referenced
afterward, in a manner typical of any standard reasonably
sophisticated rule-based system.
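One way to schedule rules so that variables are set before they are
referenced is a topological ordering, sketched below using the
`graphlib` module of the Python standard library (Python 3.9 or
later). The representation of a rule as sets of variable names it
reads and sets is an assumption made for this example.

```python
from graphlib import TopologicalSorter


def schedule_rules(rules):
    """Order rule names so that setters run before readers.

    rules maps a rule name to {"reads": set_of_vars, "sets": set_of_vars}.
    Raises graphlib.CycleError if the rules depend on each other circularly.
    """
    setters = {}
    for name, rule in rules.items():
        for var in rule["sets"]:
            setters.setdefault(var, set()).add(name)
    # Map each rule to the rules that must run before it.
    graph = {name: set() for name in rules}
    for name, rule in rules.items():
        for var in rule["reads"]:
            graph[name] |= setters.get(var, set()) - {name}
    return list(TopologicalSorter(graph).static_order())
```

A rule that reads a temporary variable is thereby placed after every
rule that may set it, regardless of the order in which the user
wrote the rules.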
[0101] When an external program is invoked, the system may be built
in such a way that it has a means of invoking the program, such as
a shell call within the native operating system, or an IP address
to which a server request can be directed. The system may also
encode and implement knowledge as to the set of input arguments
required by the external program. In our preferred implementation,
the system may be configured to reference external programs running
on HTTP servers with known URLs, and arguments are passed with
standard HTTP posts. Both shell calls and client/server web service
calls are well established technologies.
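By way of example, passing arguments to an external program with a
standard HTTP post might be sketched as follows using the Python
standard library. The function names, the form-encoded body and the
URL shown in the usage note are hypothetical; the disclosure does not
specify an encoding.

```python
import urllib.parse
import urllib.request


def build_action_request(url, arguments):
    """Build an HTTP POST request carrying the rule's input arguments."""
    # Assumption: arguments are sent as a form-encoded body.
    body = urllib.parse.urlencode(arguments).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def invoke_external_program(url, arguments, timeout=10):
    """Send the request and return (HTTP status, response body)."""
    request = build_action_request(url, arguments)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

A rule action might then call `invoke_external_program` with a
configured URL such as `http://example.invalid/route` and arguments
such as a message identifier and a category name.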
[0102] In an alternative embodiment, the definition of `local
variables` is eschewed, which may simplify the user interface,
although it may reduce the expressive power of the system.
[0103] One might also make this a simpler system which does not
import structured attributes, and whose conditions deal exclusively
with attributes assigned by the categorization engine.
[0104] One might limit the scope of the actions of rules so that
they only pertain to changes in the database, or conversely only
trigger actions by external programs, instead of being able to
specify both possible actions.
[0105] One could reduce the response to a simple system whereby
texts are provided as input to the system, and the only output
would be a categorization judgment. This could be done by embedding
the categorization engine in another application, or making the
system available as a shell call, or providing a simple web
service. Implementing any of these alternatives should be
straightforward to a skilled programmer.
[0106] FIG. 8 shows a flowchart of an exemplary process 800 for
providing early alerts to emerging issues and pattern anomalies in
customer feedback.
[0107] In step 802, process 800 determines a time scale of a
reference set and a base set of customer feedback based upon volume
of customer feedback. Step 804 is a decision. If, in step 804,
process 800 determines that a reference and base set are
established, process 800 continues with step 806; otherwise process
800 terminates. In step 806, process 800 trains a text
classification model using the reference set as truth and the base
set as false. Step 808 is a decision. If, in step 808, the text
classification model is determined as significant, process 800
continues with step 810; otherwise process 800 terminates. In step
810, process 800 applies the text classification model to both the
reference set and the base set of customer feedback. Step 812 is a
decision. If, in step 812, process 800 determines that there are
significant results from the application of the text classification
model, process 800 continues with step 814; otherwise process 800
terminates. In step 814, process 800 creates an alert highlighting
significant customer feedback.
[0108] FIG. 9 shows a flowchart of an exemplary process 900 for
providing early alerts to emerging issues and pattern anomalies in
customer feedback.
[0109] In step 902, process 900 sets the time interval. In step
904, process 900 determines a sample set. In step 906, process 900
establishes a database of terms. In step 908, process 900 acquires
a test set sample. For example, process 900 may acquire a test
set sample from electronic storage of customer feedback. In step
910, process 900 compares the test set sample to the database of
terms. Step 912 is a decision. If, in step 912, process 900
determines that an anomaly is identified in the test set sample,
process 900 continues with step 914; otherwise process 900
continues with step 916. Step 916 is a decision. If, in step 916,
process 900 determines that the database is to be reused, process
900 continues with optional step 918; otherwise process 900
continues with step 906. If optional step 918 is not included,
process 900 continues with step 908. In step 914, process 900
raises an alert for the anomaly identified in steps 910 and 912.
Process 900 then continues with step 906. In optional step 918,
process 900 augments the database of terms. Process 900 then
continues with step 908.
[0110] Steps 906-918 repeat to process additional test set
samples.
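The comparison and augmentation steps of process 900 might be
sketched as follows. The description above does not state how terms
are compared, so this sketch assumes that the database of terms holds
the share of messages containing each term and that an anomaly is a
term whose share in the test set sample is both large and several
times its share in the database. All names and numeric defaults are
illustrative assumptions.

```python
import re
from collections import Counter


def term_shares(messages):
    """Share of messages containing each term (database of terms)."""
    counts = Counter()
    for message in messages:
        counts.update(set(re.findall(r"[a-z']+", message.lower())))
    total = max(len(messages), 1)
    return {term: count / total for term, count in counts.items()}


def find_anomalies(term_db, test_messages, ratio=3.0, min_share=0.5,
                   floor=0.01):
    """Terms far more common in the test set sample than in the database."""
    test = term_shares(test_messages)
    return sorted(
        term for term, share in test.items()
        if share >= min_share
        and share / max(term_db.get(term, 0.0), floor) >= ratio
    )


def augment(term_db, test_messages, weight=0.5):
    """Optional augmentation: blend the test set sample into the database."""
    test = term_shares(test_messages)
    return {
        term: (1 - weight) * term_db.get(term, 0.0)
        + weight * test.get(term, 0.0)
        for term in set(term_db) | set(test)
    }
```

Each term returned by `find_anomalies` would correspond to an alert
raised in step 914, and `augment` to optional step 918.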
Conclusions
[0111] From the foregoing discussion, it should be clear to one
skilled in the art that this system can be built using one or more
standard digital computers, with a standard software development
environment, including a means of building and maintaining a
database, internet connectivity, building a graphical user
interface or HTML interface, using standard practices established
for statistics, for machine learning algorithms dedicated to text
categorization, and rule based programming systems.
[0112] The system described herein provides real advantages to any
party dealing with large amounts of unstructured textual data, in
enabling that party to analyze the contents of such data, identify
categories of text found with some frequency within the data in an
exploration session, train automated text categorization engines
within a training session and provide ongoing performance
evaluations during the training process. It further provides a
means for publishing these classifiers so that they automatically
recognize future instances of messages matching these categories,
and for writing programs to respond automatically to those messages
as they are recognized.
[0113] The system described herein may: identify issues without
prior knowledge that those issues exist, send automatic
notifications when a new issue develops, allow reactions to occur
while a problem is still in its infancy, start sending alerts in as
little as one day, and analyze large volumes of customer feedback,
enabling the detection of emerging issues earlier than other
sampled or manual monitoring techniques. Once identified, the
system allows tracking of an emerging issue across time and allows
viewing of the entire life-cycle of the issue. In one analogy, the
system may be likened to an EKG Monitor; the system monitors the
"pulse" of customer feedback and alerts you when there is a
"life-threatening" change. The system is highly sensitive to both
short-term issues and slowly escalating issues.
[0114] The system may automatically alert a default client
point-of-contact immediately. Alternatively, the system may direct
alerts, automatically, to the most appropriate person within the
client's company.
[0115] The system analyzes both structured and unstructured
(textual) data to automatically detect new emerging patterns or
issues and may generate an alert using the textual data alone. The
system may automatically generate an alert and send it to a
designated individual by a variety of electronic means (including,
but not limited to: email, text message, pager, SMS, RSS,
etc.).
[0116] In one example of use, an online content provider uses the
system to raise an alert when readers of the online content send
feedback indicating that there is a critical mistake in one of
their articles so that they can immediately correct it. Thus,
potential legal repercussions may be avoided and a high level of
perceived trust and reliability in the online content may be
maintained.
[0117] In another example of use, an internet services company
receives alerts from the system when customers suddenly start
complaining about being unable to log in. The internet service
company then rapidly identifies and resolves the problem,
preventing further customer complaints and attrition risk.
[0118] In another example of use, an e-commerce company is alerted
by the system when customers start complaining about a new popular
product being much smaller than it appeared online. The company
then updates the online photograph to show the size of the product
relative to a known reference point, thereby ending complaints
immediately.
[0119] In another example of use, a consumer electronics company is
alerted by the system when customer feedback includes complaints
about a charging issue on one of their new cordless telephone
products. The company may then recall inventory of the product at
an early stage, thereby saving thousands of dollars.
[0120] Changes may be made in the above methods and systems without
departing from the scope hereof. It should thus be noted that the
matter contained in the above description or shown in the
accompanying drawings should be interpreted as illustrative and not
in a limiting sense. For example, although email messages are used
in many of the above examples, other textual messages and documents
may be processed without departing from the scope hereof. The
following claims are intended to cover all generic and specific
features described herein, as well as all statements of the scope
of the present method and system, which, as a matter of language,
might be said to fall therebetween.
* * * * *