U.S. patent application number 12/638356 was filed with the patent office on 2009-12-15 and published on 2010-04-15 for forecasting outcomes based on analysis of text strings.
This patent application is currently assigned to AT&T Corp. Invention is credited to Philip E. Brown, Colin Goodall, Sylvia Halasz.
Application Number | 20100094591 12/638356 |
Document ID | / |
Family ID | 38973997 |
Filed Date | 2010-04-15 |
United States Patent
Application |
20100094591 |
Kind Code |
A1 |
Brown; Philip E. ; et
al. |
April 15, 2010 |
Forecasting Outcomes Based on Analysis of Text Strings
Abstract
A method and a processing device are provided for forecasting
whether a record of data for an individual is associated with an
outcome. Whether an n-gram associated with one of a group of
predictors matches an n-gram included in the record of data is
determined. When the n-gram included with the one of the group of
predictors matches the n-gram in the record of data, a prediction
is made as to whether the record of data is associated with the
outcome based on a value assigned to the one of the group of
predictors.
Inventors: |
Brown; Philip E.;
(Westfield, NJ) ; Goodall; Colin; (Rumson, NJ)
; Halasz; Sylvia; (Fair Haven, NJ) |
Correspondence
Address: |
AT & T LEGAL DEPARTMENT - NDQ
ATTN: PATENT DOCKETING, ONE AT & T WAY, ROOM 2A-207
BEDMINSTER
NJ
07921
US
|
Assignee: |
AT&T Corp.
New York
NY
|
Family ID: |
38973997 |
Appl. No.: |
12/638356 |
Filed: |
December 15, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11952182 | Dec 7, 2007 | 7657396 |
12638356 | | |
11275430 | Dec 30, 2005 | 7324918 |
11952182 | | |
Current U.S.
Class: |
702/181 |
Current CPC
Class: |
G16H 50/80 20180101;
G16H 50/70 20180101 |
Class at
Publication: |
702/181 |
International
Class: |
G06F 19/00 20060101
G06F019/00 |
Claims
1. A method for forecasting whether a plurality of records are
associated with an outcome, the method comprising: for a first
record: a) obtaining a pre-assigned classifier associated with the
first record; b) analyzing characters in the first record to
determine n-grams appearing in the first record; c) incrementing a
record counter associated with each n-gram determined to be in the
first record; d) determining whether a pre-assigned classifier is
associated with a possible outcome; and e) if the pre-assigned
classifier is determined to be associated with a possible outcome,
incrementing an outcome counter associated with each n-gram in the
first record; if more records are available, accessing a next
record and carrying out steps (a)-(e); determining a predictive
value associated with each n-gram based on a respective outcome
counter and a record counter associated with each respective
n-gram; and creating an ordered set of predictors corresponding to
n-grams appearing in a minimum number of records.
Description
[0001] The present application is a continuation of U.S.
application Ser. No. 11/952,182, filed Dec. 7, 2007, which is a
divisional of U.S. patent application Ser. No. 11/275,430, filed Dec. 30, 2005,
which is related to U.S. patent application Ser. No. 11/276,503,
filed Mar. 2, 2006, all of which are incorporated herein by
reference in their entirety.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The present invention relates to forecasting an outcome
based on an n-gram found in a text string and more specifically to
a system and method for generating a classifier to forecast an
outcome and a system and method for using the generated
classifier.
[0004] 2. Introduction
[0005] The public and private health sector has been investigating
different approaches to disease outbreak detection using automated
syndromic surveillance systems. In such systems, patient chief
complaint data is collected by medical staff in an emergency
department or outpatient clinic. The medical staff typically enter
the patient chief complaint data in free-form text to be stored in
electronic records. An automated syndromic surveillance system may
perform natural language processing to analyze the free-text of the
patient chief complaint data. Further, various automated syndromic
surveillance systems code and prioritize symptoms differently.
Thus, a syndrome profile for a group of patients may vary depending
on a definition used by the automated syndromic surveillance system
in a given clinical setting.
[0006] Existing automated syndromic surveillance systems suffer
from a number of problems. Changes to a system may be time
consuming and expensive. Because of the time involved to make
changes, a practical size limit for a training set is imposed. New
languages or dialects may require development of new programs for
automated syndromic surveillance systems. Further, some automated
syndromic surveillance systems may require preprocessing of chief
complaint data.
[0007] The International Classification of Diseases (ICD) coding
system is an international classification system which groups
related disease entities and procedures for the purpose of
reporting statistical information. ICD version 9 (ICD9) and ICD
version 10 (ICD10) are widely used codes. The purpose of the ICD
code is to provide a uniform language and thereby serve as an
effective means for reliable nationwide communication among
physicians, patients, and third parties. Several days may pass from
the time that a patient's chief complaint data is recorded, on
first entering an emergency or urgent care department, to the time
that ICD diagnoses are given. One or more ICD9 or ICD10 diagnosis
codes may be assigned by medical professionals based on their
diagnoses of the patient's condition, using a combination of inputs
including physician notes, patient vital signs, laboratory test
results, and medical examination results. (Unfortunately,
assignment of ICD9 or ICD10 diagnosis codes may also be influenced
by treatment and payment options.) The patient's chief complaint
may provide an early indication of these diagnoses.
SUMMARY OF THE INVENTION
[0008] Additional features and advantages of the invention will be
set forth in the description which follows, and in part will be
obvious from the description, or may be learned by practice of the
invention. The features and advantages of the invention may be
realized and obtained by means of the instruments and combinations
particularly pointed out in the appended claims. These and other
features of the present invention will become more fully apparent
from the following description and appended claims, or may be
learned by the practice of the invention as set forth herein.
[0009] In a first aspect of the invention, a method is provided for
forecasting whether a record of data for an individual is
associated with an outcome. Whether an n-gram associated with one
of a group of predictors matches an n-gram included in the record
of data is determined. When the n-gram associated with the one of
the group of predictors matches the n-gram in the record of data, a
prediction is made as to whether the record of data is associated
with the outcome based on a value assigned to the one of the group
of predictors.
[0010] In a second aspect of the invention, a machine-readable
medium is provided which has recorded thereon a group of
instructions for at least one processor. The machine-readable
medium includes instructions for determining a group of n-grams
included in a record of data, instructions for incrementing a
respective record counter associated with each of the plurality of
n-grams included in the record of data, instructions for
determining whether the record of data is associated with an
outcome of interest based, at least in part, on data associated
with the record of data indicative of one or more outcomes,
instructions for incrementing a respective outcome counter
associated with each of the plurality of n-grams included in the
record of data when the record of data is determined to be
associated with the outcome of interest, instructions for repeating
the previous four sets of instructions for each of a group of
records of data, and instructions for assigning a value to each of
the n-grams based on a value of the respective outcome counter
and a value of a corresponding record counter for the respective
n-gram.
[0011] In a third aspect of the invention, a processing device is
provided. The processing device includes at least one processor, a
storage component having a group of instructions stored therein,
and a communications interface arranged to permit communications
between the at least one processor and the storage component. The
at least one processor is arranged to determine a group of n-grams
included in a record of data, increment a respective record counter
associated with each of the group of n-grams included in the record
of data, determine whether the record of data is associated with an
outcome of interest based, at least in part, on data associated
with the record of data indicative of one or more outcomes,
increment a respective outcome counter associated with each of the
group of n-grams included in the record of data when the record of
data is determined to be associated with the outcome of interest,
repeat the previous four acts for each of a group of records of
data, and assign a value to each of the n-grams based on a value of
the respective outcome counter and a value of a corresponding
record counter for the respective n-gram.
[0012] In a fourth aspect of the invention, a processing device is
provided for predicting whether a record of data for an individual
is associated with an outcome. The processing device includes means
for determining whether an n-gram associated with one of a
plurality of predictors matches an n-gram included in the record of
data, and means for predicting whether the record of data is
associated with the outcome based on a value assigned to the one of
the plurality of predictors, when the n-gram associated with the
one of the plurality of predictors matches the n-gram in the record
of data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] In order to describe the manner in which the above-recited
and other advantages and features of the invention can be obtained,
a more particular description of the invention briefly described
above will be rendered by reference to specific embodiments thereof
which are illustrated in the appended drawings. Understanding that
these drawings depict only typical embodiments of the invention and
are not therefore to be considered to be limiting of its scope, the
invention will be described and explained with additional
specificity and detail through the use of the accompanying drawings
in which:
[0014] FIG. 1 illustrates an exemplary processing device in which
implementations consistent with principles of the invention may
execute;
[0015] FIGS. 2A and 2B are diagrams illustrating high level views
of aspects of implementations consistent with the principles of the
invention;
[0016] FIG. 3 is a flowchart of an exemplary process that may be
performed by an implementation of a complaint classifier generator
shown in FIG. 2A;
[0017] FIGS. 4A and 4B are flowcharts of a second exemplary process
that may be performed in another implementation of the complaint
classifier generator shown in FIG. 2A;
[0018] FIGS. 5A and 5B are flowcharts of a third exemplary process
that may be performed in yet another implementation of the
complaint classifier generator shown in FIG. 2A;
[0019] FIGS. 6A and 6B are flowcharts of an exemplary process that
may be performed in an implementation of an optimizer shown in FIG.
2A;
[0020] FIG. 7 is a flowchart of an exemplary process that may be
performed in an implementation of an outcome predictor shown in
FIG. 2B;
[0021] FIG. 8 is a flowchart of another exemplary process that may
be performed in another implementation of the outcome predictor of
FIG. 2B; and
[0022] FIG. 9 is a flowchart of an exemplary process that may be
performed in another implementation of the complaint classifier
generator of FIG. 2A.
DETAILED DESCRIPTION OF THE INVENTION
[0023] Various embodiments of the invention are discussed in detail
below. While specific implementations are discussed, it should be
understood that this is done for illustration purposes only. A
person skilled in the relevant art will recognize that other
components and configurations may be used without departing from the
spirit and scope of the invention.
Exemplary System
[0024] FIG. 1 illustrates a block diagram of an exemplary
processing device 100 which may be used to implement systems and
methods consistent with the principles of the invention. Processing
device 100 may include a bus 110, a processor 120, a memory 130, a
read only memory (ROM) 140, a storage device 150, an input device
160, an output device 170, and a communication interface 180. Bus
110 may permit communication among the components of processing
device 100.
[0025] Processor 120 may include at least one conventional
processor or microprocessor that interprets and executes
instructions. Memory 130 may be a random access memory (RAM) or
another type of dynamic storage device that stores information and
instructions for execution by processor 120. Memory 130 may also
store temporary variables or other intermediate information used
during execution of instructions by processor 120. ROM 140 may
include a conventional ROM device or another type of static storage
device that stores static information and instructions for
processor 120. Storage device 150 may include any type of media,
such as, for example, magnetic or optical recording media and its
corresponding drive. In some implementations consistent with the
principles of the invention, storage device 150 may store and
retrieve data according to a database management system.
[0026] Input device 160 may include one or more conventional
mechanisms that permit a user to input information to processing device 100,
such as a keyboard, a mouse, a pen, a voice recognition device, a
microphone, a headset, etc. Output device 170 may include one or
more conventional mechanisms that output information to the user,
including a display, a printer, one or more speakers, a headset, or
a medium, such as a memory, or a magnetic or optical disk and a
corresponding disk drive. Communication interface 180 may include
any transceiver-like mechanism that enables processing device 100
to communicate via a network. For example, communication interface
180 may include a modem, or an Ethernet interface for communicating
via a local area network (LAN). Alternatively, communication
interface 180 may include other mechanisms for communicating with
other devices and/or systems via wired, wireless or optical
connections.
[0027] Processing device 100 may perform such functions in response
to processor 120 executing sequences of instructions contained in a
computer-readable medium, such as, for example, memory 130, a
magnetic disk, or an optical disk. Such instructions may be read
into memory 130 from another computer-readable medium, such as
storage device 150, or from a separate device via communication
interface 180.
[0028] Processing device 100 may be, for example, a personal
computer (PC), or any other type of processing device capable of
processing textual data. In alternative implementations, such as,
for example, a distributed processing implementation, a group of
processing devices 100 may communicate with one another via a
network such that various processors may perform operations
pertaining to different aspects of the particular
implementation.
Generating a Predictive Model
[0029] FIG. 2A illustrates portions of an exemplary system, which
may be used to build a predictive model for an automated syndromic
surveillance system or other predictive outcome system, consistent
with the principles of the invention. The system may include a
pre-assigned classifier 202, records of medical complaints 204, a
complaint classifier generator 206, a predictive model 208, which
may be generated by the system, and optionally, an optimizer
210.
[0030] In one implementation consistent with the principles of the
invention, pre-assigned classifier 202 may include ICD codes that
have been assigned to each of the records of medical complaint data
204 by a medical professional. The ICD codes may be, for example,
ICD9 or ICD10 codes. ICD codes are well defined and may have a
well-defined mapping to an outcome. An outcome may be, for example,
a syndrome, a diagnosis, or any other information that indicates a
result or conclusion. In alternative implementations consistent
with the principles of the invention, pre-assigned classifier 202
may, instead, be data indicating a particular outcome or may be a
combination of various data items that are indicative of a
particular outcome, such as, for example, ICD code, patient age,
patient gender, and geographic location. Each one of records of
medical complaints 204 may include textual data, such as, for
example, free-form alphanumeric textual characters that describe a
patient's chief complaint. In addition to including alphabetic and
numeric character data, the free-form alphanumeric textual
characters may include spaces as well as punctuation and special
characters, such as "/", "$", "@", as well as other characters.
[0031] Complaint classifier generator 206 may input each one of
records of medical complaints 204, or alternatively, other text
strings that may be available in patients' records. For
the sake of simplifying the description of implementations
consistent with principles of the invention, however, records of medical
complaints 204 will be described as the text string source for
the following implementations of complaint classifier generator
206.
[0032] Based on n-grams appearing in respective ones of records of
medical complaints 204, and further based on an associated
pre-assigned classifier 202, complaint classifier generator 206 may
generate predictors for predictive model 208. Each predictor may be
associated with an n-gram, an outcome, and a value indicating an
expected probability of an outcome associated with a predictor.
[0033] The exemplary system may include optional optimizer 210. In
implementations consistent with the principles of the invention,
optimizer 210 may prune predictors associated with n-grams that
occur in fewer than a specific number of records of medical
complaints 204 from predictive model 208. Such pruning may result
in a more compact predictive model 208 and a more efficient
complaint classifier. In some implementations consistent with the
principles of the invention, the above-mentioned specific number of
records is configurable. In one implementation consistent with the
principles of the invention, predictors associated with n-grams
appearing in fewer than 100 records of medical complaint data 204,
may be pruned. In an alternative implementation consistent with the
principles of the invention, predictors associated with n-grams
appearing in fewer than a specific percentage of records may be
pruned. The specific percentage may be configurable.
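The pruning step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `prune_predictors`, its parameters, and the dictionary layout of the record counters are assumptions introduced here. It keeps only n-grams that appear in at least a configurable number of records, or in at least a configurable fraction of all records.

```python
def prune_predictors(record_counts, total_records, min_records=100, min_fraction=None):
    """Return the n-grams that survive pruning, with their record counts.

    record_counts: dict mapping n-gram -> number of records containing it
                   (hypothetical layout, assumed for this sketch).
    min_records:   absolute threshold; 100 mirrors the example in the text.
    min_fraction:  optional threshold as a fraction of all records; when
                   given, it replaces the absolute threshold.
    """
    if min_fraction is not None:
        threshold = min_fraction * total_records
    else:
        threshold = min_records
    # N-grams appearing in fewer than `threshold` records are pruned.
    return {gram: count for gram, count in record_counts.items() if count >= threshold}


counts = {"ches": 250, "pain": 100, "zing": 40}
kept_by_count = prune_predictors(counts, total_records=1000)
kept_by_fraction = prune_predictors(counts, total_records=1000, min_fraction=0.2)
```

With these sample counts, the absolute threshold keeps "ches" and "pain", while a 20% threshold keeps only "ches".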
[0034] FIG. 2B is a diagram that illustrates exemplary operation of
an exemplary system that may use predictive model 208 created by
the exemplary system of FIG. 2A. The exemplary system may include
newly received records of medical complaints 228, predictive model
208, outcome predictor 230 and predictions 232.
[0035] Newly received records of medical complaints 228 may be
input by outcome predictor 230, which may use predictive model 208,
created by the exemplary system of FIG. 2A. Alternatively, instead
of records of medical complaints 228, other text strings
that may be available in patients' records may be used. For the sake of
simplifying the description of implementations consistent with
principles of the invention, however, the source of text is described as
newly received records of medical complaints 228.
[0036] Using predictive model 208, outcome predictor 230 may
generate one or more predictions 232 for each one of records of
medical complaints 228. The one or more predictions 232 may simply
state one outcome, or one or more outcomes, each with a value
indicating an expected probability of a corresponding outcome.
Alternatively, outcome predictor 230 may produce an expected
probability, for example, a value between 0 and 1, inclusive,
associated with a corresponding outcome.
[0037] FIG. 3 is a flowchart of an exemplary process that may be
employed in implementations of complaint classifier generator 206
consistent with the principles of the invention. In the exemplary
process described in FIG. 3, predictors for one outcome may be
generated for predictive model 208.
[0038] The process may begin with complaint classifier generator
206 accessing a first one of records of medical complaints 204,
which, in this exemplary process, is being used as training data
(act 302). Next, complaint classifier generator 206 may obtain a
corresponding classifier from pre-assigned classifier 202 (act
304). The corresponding classifier may, but need not, be included in
records of medical complaints 204. Complaint classifier generator
206 may then analyze characters in the accessed one of records of
medical complaints 204 to determine n-grams appearing in the record
(act 306). Theoretically, the n-grams may include any number of
characters. Experimentation has been performed with n being 2
characters to 7 characters. A modal value of 4 for n appears to
work well for English language text. In at least some
implementations consistent with the principles of the invention, however,
not all n-grams need have the same length.
[0039] Assuming that n is 4, complaint classifier generator 206 may
analyze the text of the record for 4-grams. Before analyzing the
text, complaint classifier generator 206 may replace special characters, such
as, for example, punctuation, "/", or other special characters with
a delimiter, such as, for example, a space. A word may be defined
as one or more consecutive characters preceded by a delimiter, such
as, for example, a space or a beginning of a record and followed by
a delimiter or end of record, with no intermediate delimiters
between characters. Words having fewer than n characters (4 for
this example) may be included among the n-grams, or, in this
example, 4-grams. A word longer than n characters may form a number
of n-grams consisting of the first through nth characters, the
second through (n+1)st characters, etc. For example, if a
record of a medical complaint includes the following: "chest
pain+wheezing", the 4-grams that may be produced are: "ches",
"hest", "pain", "whee", "heez", "eezi", "ezin" and "zing". In some
implementations consistent with the principles of the invention,
contextual information may be stored with an n-gram. The contextual
information may indicate a beginning of a word and an end of a
word. For example, the 4-grams produced above may be stored as
"_bches", "hest_b", "_bwhee", "heez", "eezi", "ezin" and
"zing_b", where the marker "_b" indicates a beginning or end of a
word. In some implementations, contextual information is stored only
for beginnings of words. In other implementations,
contextual information is stored only for endings of words.
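The n-gram extraction described in this paragraph can be sketched as follows. This is an illustrative reading of the text, not code from the application: the function name `extract_ngrams`, the `"_b"` marker string, and the choice to treat every character other than an ASCII letter, digit, or space as a special character are all assumptions. Under this reading, a word no longer than n (such as "pain") would carry both a beginning and an end marker when context is stored.

```python
import re


def extract_ngrams(text, n=4, mark_context=False, marker="_b"):
    """Return the distinct n-grams of a free-text record, in first-seen order."""
    # Replace special characters (assumed here: anything that is not an
    # ASCII letter, digit, or space) with a space delimiter.
    cleaned = re.sub(r"[^A-Za-z0-9 ]", " ", text)
    ngrams = []
    for word in cleaned.split():
        if len(word) <= n:
            # Words with n or fewer characters are kept whole.
            grams = [word]
        else:
            # Slide a window of width n across the word.
            grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if mark_context:
            # Optional contextual information: flag word start and word end.
            grams[0] = marker + grams[0]
            grams[-1] = grams[-1] + marker
        for gram in grams:
            if gram not in ngrams:
                ngrams.append(gram)
    return ngrams


plain = extract_ngrams("chest pain+wheezing")
marked = extract_ngrams("chest pain+wheezing", mark_context=True)
```

For the record "chest pain+wheezing", `plain` holds the eight 4-grams listed in the text, from "ches" through "zing".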
[0040] In implementations for use with other languages, a word may
be defined differently depending on a type of word delimiter used
in a particular language. In one implementation consistent with the
principles of the invention, a text string search may categorize
each character as 1) a word character, 2) a non-word character used
to define word boundaries, and 3) mapped characters, where a mapped
character may be mapped to a space or an empty string (which, in
effect, is character deletion). Perl is a programming language
especially designed for processing text. In an embodiment of the
invention, a process for performing a text string search may be
programmed in Perl, which defines a word boundary as \w\W, \W\w,
^\w, \w$, where \w and \W refer to a word and non-word character,
respectively, and ^ and $ are, respectively, the start and end of the
text string. In other implementations, other programming languages
may be used to program a text string search.
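As a sketch of the three character categories in a language other than Perl, the following Python fragment first applies a mapping table and then splits on non-word characters. The mapping table `CHAR_MAP` and the function name `split_words` are hypothetical; which characters are mapped to a space and which are deleted would depend on the language and data at hand.

```python
import re

# Hypothetical mapping table: each mapped character becomes either a space
# or the empty string (which, in effect, deletes the character).
CHAR_MAP = {"+": " ", "/": " ", "'": ""}


def split_words(text, char_map=CHAR_MAP):
    """Split a text string into words using mapped, word, and non-word characters."""
    # Step 1: apply the character mapping.
    mapped = "".join(char_map.get(ch, ch) for ch in text)
    # Step 2: \w+ collects maximal runs of word characters; every non-word
    # character (\W), and the start and end of the string, acts as a boundary.
    return re.findall(r"\w+", mapped)


words = split_words("pt can't breathe/chest pain+wheezing")
```

Here the apostrophe is deleted, so "can't" becomes the single word "cant", while "/" and "+" separate words.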
[0041] After performing act 306, complaint classifier generator 206
may increment a record counter associated with each n-gram
appearing in the one of records of medical complaints 204,
indicating that the corresponding n-gram was observed in a record
of medical complaint data (act 308).
[0042] Next, because pre-assigned classifier 202 may be data
indicative of an outcome, complaint classifier generator 206 may
determine whether the corresponding classifier is associated with
a particular outcome (act 310). If complaint classifier generator
206 determines that the corresponding classifier is associated with
a particular outcome, then complaint classifier generator 206 may
increment an outcome counter associated with each n-gram in the one
of records of medical complaints 204 (act 312). The outcome counter
for a particular n-gram keeps track of a number of records
associated with the particular outcome in which the n-gram
appeared.
[0043] After determining that the pre-assigned classifier is not
associated with the particular outcome (act 310), or after
incrementing the corresponding outcome counters (act 312),
complaint classifier generator 206 may determine whether any
additional records of medical complaints 204 are to be analyzed
(act 314). If there are additional records of medical complaints 204
to analyze, complaint classifier generator 206 may access the next
one of records of medical complaints 204 and may again perform acts
304-314.
[0044] If, at act 314, complaint classifier generator 206
determines that there are no more records of medical complaints 204
to analyze, complaint classifier generator 206 may determine a
respective predictive value to be associated with each n-gram (act
318). The value may be based on the respective outcome counter and
the record counter associated with each respective n-gram. This can
be expressed mathematically for each n-gram as:
predictive value=(outcome counter)/(record counter associated with
n-gram) (Eq. 1)
That is, complaint classifier generator 206 may calculate the
predictive value corresponding to an n-gram being associated with
the particular outcome by dividing the outcome counter associated
with the particular n-gram by the record counter associated with
the particular n-gram. The result may be a value between 0 and 1,
inclusive, that may be used as an expected probability of a record
of medical complaint, including the particular n-gram, being
associated with the particular outcome.
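The counting loop of FIG. 3 and Eq. 1 can be sketched together as follows. This is a simplified illustration under stated assumptions: the names `simple_ngrams` and `train_single_outcome` are introduced here, the training data is assumed to arrive as (text, has_outcome) pairs with the pre-assigned classifier already reduced to a boolean, and words are split on whitespace only.

```python
from collections import Counter


def simple_ngrams(text, n=4):
    """Whitespace-delimited n-gram extraction (simplified for this sketch)."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def train_single_outcome(records, n=4):
    """records: iterable of (text, has_outcome) pairs.

    Returns a dict mapping each n-gram to its predictive value (Eq. 1).
    """
    record_counter = Counter()   # records in which each n-gram appeared
    outcome_counter = Counter()  # of those, records associated with the outcome
    for text, has_outcome in records:
        grams = set(simple_ngrams(text, n))  # count each n-gram once per record
        record_counter.update(grams)         # act 308
        if has_outcome:                      # act 310
            outcome_counter.update(grams)    # act 312
    # Eq. 1: predictive value = outcome counter / record counter.
    return {g: outcome_counter[g] / record_counter[g] for g in record_counter}


values = train_single_outcome([
    ("chest pain", True),
    ("back pain", False),
    ("chest cold", True),
])
```

In this toy data set, "pain" appears in two records, one of which has the outcome, so its predictive value is 0.5.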
[0045] Next, the results may be arranged as an ordered set of
predictors (act 320). Each predictor may include an n-gram and a
predictive value. The ordered set of predictors may be arranged in
descending order based on the predictive values associated with the
predictors. Each n-gram may appear only once in the ordered
set.
[0046] FIGS. 4A and 4B are flowcharts of another exemplary process
that may be employed in other implementations of complaint
classifier generator 206 consistent with the principles of the
invention. In this exemplary process, predictors for two possible
outcomes may be generated for predictive model 208. The possible
outcomes may be, for example, that a record of a medical complaint
is associated with a particular outcome, and that the record of
medical complaint is not associated with the particular
outcome.
[0047] The exemplary process may begin with complaint classifier
generator 206 accessing a first one of records of medical
complaints 204, which in this process is used as training data (act
402). Next, complaint classifier generator 206 may obtain a
classifier corresponding to the record of medical complaints from
pre-assigned classifier 202 (act 404). As mentioned above, the
classifier from pre-assigned classifier 202 may be indicative of one
or more outcomes.
[0048] Complaint classifier generator 206 may then analyze
characters in the accessed one of records of medical complaints 204 to
determine n-grams appearing in the record (act 406). The n-grams
may be determined as discussed previously with respect to act 306.
Further, complaint classifier generator 206 may store contextual
information with the determined n-grams, as previously discussed
with respect to FIG. 3. Complaint classifier generator 206 may then
increment a record counter associated with each n-gram that
appeared in the accessed one of records of medical complaints 204 (act
407) in order to keep track of a number of records in which each
n-gram appeared.
[0049] Next, complaint classifier generator 206 may determine
whether the classifier from pre-assigned classifier 202,
corresponding to the record of medical complaint, is associated
with the particular outcome (act 408). If the pre-assigned
classifier is associated with the particular outcome, then
complaint classifier generator 206 may increment a first outcome
counter associated with each n-gram determined to appear in the
record of medical complaints (act 410).
[0050] If, at act 408, complaint classifier generator 206
determines that the pre-assigned classifier is not associated with
the particular outcome, then complaint classifier generator 206 may
increment a second outcome counter associated with each n-gram that
is determined to appear in the record of medical complaint (act
412).
[0051] After incrementing the first or the second outcome counters,
complaint classifier generator 206 may determine whether there are
additional ones of records of medical complaints 204 to analyze
(act 414). If there are one or more records of medical complaints
204 to analyze, then complaint classifier generator 206 may access
a next one of records of medical complaints 204 and may again
perform acts 404-414.
[0052] If, at act 414, complaint classifier generator 206
determines that no additional records of medical complaints 204
exist to analyze, then complaint classifier generator 206 may
determine a first predictive value of each n-gram based on the
corresponding first outcome counter and the record counter for the
respective n-gram (FIG. 4B: act 418). The first predictive value
may be equal to a value of the corresponding first outcome counter
divided by the record counter for the corresponding n-gram (see Eq.
1). The first predictive value may be a value between 0 and 1,
inclusive, that represents an expected probability that a record of
records of medical complaints 204 is associated with the particular
outcome.
[0053] Next, complaint classifier generator 206 may determine
a second predictive value of each n-gram based on the
corresponding second outcome counter and the record counter for the
respective n-gram (act 420). The second predictive value may be
equal to a value of the corresponding second outcome counter
divided by the record counter for the corresponding n-gram (see Eq.
1). The second predictive value may be a value between 0 and 1,
inclusive, that represents an expected probability that a record of
records of medical complaints 204 is not associated with the
particular outcome.
[0054] Next, the resulting statistics may be used to create an
ordered set of predictors appearing in a minimum number of records
(act 422). The ordered set of predictors may be arranged in
descending order based on the first or second predictive value
corresponding to an n-gram. Each predictor may include an n-gram, a
predictive value, and an indication of a particular outcome
associated with the respective predictor. As an example, a first
predictor of an ordered set may include n-gram "diah", a predictive
value of 0.93, and an indication that the predictor is for outcome
1; a second predictor in the ordered set of predictors may include
n-gram "naus", a predictive value of 0.88, and an indication that
the predictor is for outcome 1; and a third predictor in the
ordered set of predictors may include n-gram "feve", a predictive
value of 0.75, and an indication that the predictor is for outcome
2. In implementations consistent with the principles of the
invention, each n-gram may be associated with only one predictor in
the ordered set of predictors, as determined by the highest
predictive value associated with the n-gram.
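As a non-limiting sketch, the ordering described above may be expressed in Python as follows. The `Predictor` type, the function name, the treatment of "a minimum number of records" as an inclusive lower bound, and the tie-breaking rule are all assumptions made for illustration.

```python
from typing import NamedTuple


class Predictor(NamedTuple):
    ngram: str
    value: float
    outcome: int  # 1 = the particular outcome, 2 = not that outcome


def build_ordered_predictors(record_counter, first_values, second_values, min_records):
    """Keep one predictor per n-gram (its higher value), sorted in descending order."""
    predictors = []
    for ngram, n_records in record_counter.items():
        if n_records < min_records:
            continue  # n-gram appears in too few records
        v1, v2 = first_values[ngram], second_values[ngram]
        # Assumption: ties go to outcome 1.
        if v1 >= v2:
            predictors.append(Predictor(ngram, v1, 1))
        else:
            predictors.append(Predictor(ngram, v2, 2))
    predictors.sort(key=lambda p: p.value, reverse=True)
    return predictors


# Hypothetical statistics chosen to mirror the example n-grams in the text.
record_counter = {"diah": 100, "naus": 50, "feve": 40, "xqzt": 1}
first_values = {"diah": 0.93, "naus": 0.88, "feve": 0.25, "xqzt": 1.0}
second_values = {"diah": 0.07, "naus": 0.12, "feve": 0.75, "xqzt": 0.0}
ordered = build_ordered_predictors(record_counter, first_values, second_values, 5)
```

In this sketch the n-gram "xqzt" is dropped even though its first predictive value is 1.0, because it appears in only one record.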
[0055] FIGS. 5A-5B are flowcharts that illustrate a third
embodiment of complaint classifier generator 206. In this
embodiment, complaint classifier generator 206 may generate an
ordered set of predictors for J+1 possible outcomes, where J of the
possible outcomes indicate particular outcomes and a (J+1)st
outcome indicates that the outcome is not one of the J particular
outcomes.
[0056] Complaint classifier generator 206 may begin by accessing a
first record of records of medical complaints 204, which may be
training data (act 502). Next, complaint classifier generator 206
may obtain a classifier corresponding to the record of medical
complaints from pre-assigned classifier 202 (act 504). As mentioned
above, the obtained classifier may be indicative of a particular
outcome, such as, for example, an ICD code assigned by a medical
professional upon reviewing a patient's chief complaint data
included in a corresponding one of records of medical complaints
204. Alternatively, the obtained classifier may be any data or
combination of data items indicative of a particular outcome.
[0057] Complaint classifier generator 206 may then analyze
characters in the accessed one of records of medical complaints 204 to
determine n-grams appearing in the record (act 506). Further,
complaint classifier generator 206 may store contextual information
with the determined n-grams, as discussed with respect to FIG.
3.
[0058] Complaint classifier generator 206 may then increment a record counter
associated with each n-gram appearing in the record in order to
keep track of a number of records in which each n-gram appears (act
508).
[0059] Next, complaint classifier generator 206 may determine
whether the pre-assigned classifier corresponding to the record
of medical complaints 204 is associated with one of the J
particular outcomes (act 510). If the pre-assigned classifier is
associated with one of the J particular outcomes, then complaint
classifier generator 206 may increment an element of an outcome
counter array associated with each n-gram that is determined to
exist in the record of records of medical complaints 204 (act 512).
The particular element of the outcome counter array incremented
depends on the particular outcome associated with the pre-assigned
classifier. For example, when a pre-assigned classifier is
determined to be associated with an i-th particular outcome,
where 1 ≤ i ≤ J, then the i-th counter in the array
of outcome counters with respect to an n-gram may be
incremented.
[0060] If, at act 510, complaint classifier generator 206
determines that the pre-assigned classifier is not associated with
any of the J particular outcomes, then complaint classifier
generator 206 may increment the element of the outcome counter array
that is associated with the (J+1)st outcome for each n-gram that is
determined to exist in the record of records of medical complaints
204 (act 514).
[0061] Complaint classifier generator 206 may then determine
whether any additional records of medical complaints 204 are to be
analyzed (act 516). If more records of medical complaints 204 are
to be processed, then complaint classifier generator 206 may access
the next record of records of medical complaints 204 (act 518) and
acts 504-516 may be repeated.
[0062] If complaint classifier generator 206 determines that no
additional records of medical complaints 204 are to be analyzed
(act 516), then each of the n-grams found in records of medical
complaints 204 may be assigned up to J+1 predictive values by
calculating a value of a corresponding outcome counter for an
outcome divided by a value of a record counter for the
corresponding n-gram (see Eq. 1) (act 522).
[0063] Complaint classifier generator 206 may then create an
ordered set of predictors corresponding to n-grams appearing in a
minimum number of records, as described with respect to act 422
(act 524).
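The counting of acts 502-522 may be sketched, purely as an example, in Python. The helper `char_ngrams`, the choice of 4-character n-grams, the `outcome_index` mapping, and the function names are assumptions introduced for illustration; contextual information stored with the n-grams is omitted.

```python
from collections import defaultdict


def char_ngrams(text, n=4):
    """Distinct character n-grams in a record (n=4 is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def count_and_score(records, outcome_index, num_particular):
    """records: iterable of (complaint_text, pre_assigned_classifier) pairs.
    outcome_index: hypothetical map from a classifier to a slot 0..J-1.
    num_particular: J, the number of particular outcomes."""
    record_counter = defaultdict(int)
    outcome_counters = defaultdict(lambda: [0] * (num_particular + 1))
    for text, classifier in records:
        # Slot J (the last one) plays the role of the (J+1)st outcome.
        slot = outcome_index.get(classifier, num_particular)
        for ngram in char_ngrams(text):
            record_counter[ngram] += 1
            outcome_counters[ngram][slot] += 1
    # Up to J+1 predictive values per n-gram: outcome count / record count.
    values = {
        ngram: [count / record_counter[ngram] for count in counts]
        for ngram, counts in outcome_counters.items()
    }
    return dict(record_counter), values
```

Because `char_ngrams` returns a set, an n-gram that occurs several times in one record increments the record counter only once for that record, consistent with counting the number of records in which each n-gram appears.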
Optimizer
[0064] Due to a large number of n-grams that complaint classifier
generator 206 may generate, it may be desirable to prune or
eliminate n-grams (and as a result, predictors) that may not have
much predictive value. For example, an n-gram that is associated
with only a small number of records of medical complaints 204
may in fact be a result of one or more typographical errors. By
eliminating such n-grams from predictive model 208, predictive
model 208 may become more compact and may result in more accurate
and efficient classification of records of medical complaints 228
in an operational environment. Further, because implementations of
complaint classifier generator 206 may build an ordered set of
predictors based on relative frequencies, it may be desirable to
perform pruning of predictors of the ordered set by executing
optimizer 210 at least once.
[0065] FIGS. 6A and 6B are flowcharts that illustrate exemplary
operation of optimizer 210 for pruning predictors in
implementations consistent with the principles of the invention.
First, optimizer 210 may access a first record of training data
used to create the ordered set of predictors, such as, for example,
a first record of medical complaints 204 (act 602). Next, optimizer
210 may obtain a classifier, from pre-assigned classifier 202,
associated with the record (act 604). Optimizer 210 may then access
a first predictor in the ordered set of predictors (act 606) and
may determine whether the n-gram associated with the accessed predictor
appears in the record (act 608).
[0066] If optimizer 210 determines that the n-gram associated with
the predictor appears in the record, then optimizer 210 may
increment a record counter associated with the n-gram (act 610).
Optimizer 210 may then determine whether the classifier is
associated with an outcome-of-interest (act 612). If optimizer 210
determines that the classifier is associated with the
outcome-of-interest, then optimizer 210 may increment an outcome
counter for the n-gram associated with the predictor (act 614).
Next, optimizer 210 may determine whether there are additional
records to process (act 616). If optimizer 210 determines that
there are additional records to process, then a next record is
accessed (act 618) and optimizer 210 may repeat acts 604-616.
[0067] If, at act 616, optimizer 210 determines that there are no
additional records to process, then optimizer 210 may determine the
predictive value of each of the n-grams based on the respective
outcome counter and the record counter (FIG. 6B: act 624). This may
be performed according to the formula of Eq. 1. Optimizer 210 may
then create a new ordered set of predictors corresponding to the
n-grams, associated with the previous ordered set, that appear in a
new minimum number of records (act 626). The new minimum number of
records may be configurable and, in some implementations, may be a
number, such as, for example, 100 records or another suitable
value, or, in other implementations, may be derived from a
configurable minimum percentage of records, such as, for example,
5% or another suitable percentage.
[0068] If, at act 608, optimizer 210 determines that the
n-gram associated with the predictor does not appear in the record,
then optimizer 210 may determine whether the predictor is the last
predictor in the ordered set (FIG. 6A: act 620). If optimizer 210
determines that the predictor is the last predictor in the ordered
set, then optimizer 210 may proceed to act 616 to determine whether
there are more records to process. If optimizer 210 determines that
the predictor is not the last predictor in the ordered set (act
620), then optimizer 210 may access a next predictor from the
ordered set (act 622) and may perform act 608 to determine whether
an n-gram from the next predictor appears in the record.
[0069] Optimizer 210 may be executed multiple times with respect to
a predictive model, using different minimum values (minimum
number of records or minimum percentage of records) until an
optimum size predictive model is built.
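One pass of the pruning described for FIGS. 6A and 6B may be sketched in Python as follows. This is an illustrative reading only: the function and parameter names are hypothetical, a plain substring test stands in for the n-gram match, and the new minimum is treated as an inclusive lower bound.

```python
def optimize(ordered_predictors, records, is_outcome_of_interest, new_min_records):
    """ordered_predictors: list of (ngram, value), descending by value.
    records: list of (complaint_text, pre_assigned_classifier) pairs."""
    record_counter = {ngram: 0 for ngram, _ in ordered_predictors}
    outcome_counter = {ngram: 0 for ngram, _ in ordered_predictors}
    for text, classifier in records:
        for ngram, _ in ordered_predictors:
            if ngram in text:  # n-gram of this predictor appears in the record
                record_counter[ngram] += 1
                if is_outcome_of_interest(classifier):
                    outcome_counter[ngram] += 1
                # As the acts are described, processing moves to the next
                # record after the first predictor that matches.
                break
    pruned = [
        (ngram, outcome_counter[ngram] / record_counter[ngram])
        for ngram, _ in ordered_predictors
        if record_counter[ngram] > 0 and record_counter[ngram] >= new_min_records
    ]
    pruned.sort(key=lambda p: p[1], reverse=True)
    return pruned
```

Calling such a function repeatedly with different values of `new_min_records`, and comparing the resulting models, would correspond to executing the optimizer multiple times.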
Exemplary Processing of Outcome Predictor
[0070] FIG. 7 is a flowchart that illustrates exemplary processing
of outcome predictor 230, in implementations consistent with the
principles of the invention, when processing one of records of
medical complaints 228 to produce one or more predictions 232
corresponding to the record. Although the following implementations
of outcome predictor 230 may refer to using records of medical
complaints 228 as textual input, other information from patients'
records may be used as textual input in other implementations.
[0071] This particular implementation of outcome predictor 230 may
use an ordered set of predictors associated with one outcome. Outcome
predictor 230 may begin by accessing a first predictor from the
ordered set of predictors (act 702). Each predictor may include
information with respect to an n-gram and a predictive value. The
ordered set may be arranged in descending order with respect to the
predictive value of each of the predictors. Outcome predictor 230
may then perform a text string search to determine whether the
n-gram corresponding to the predictor appears in the current one of
records of medical complaints 228 (act 704). If the n-gram
appears in the current one of records of medical complaints 228,
then outcome predictor 230 may assign the predictive value from the
predictor to be an expected probability of the outcome (act 706)
and may determine whether the predictive value is greater than a
threshold (act 708). If the predictive value is greater than the
threshold, then outcome predictor 230 may provide output indicating
that the current one of records of medical complaints 228 is
associated with the outcome (act 710). If the predictive value is
not greater than the threshold, then outcome predictor 230 may
provide output indicating that the current one of records of
medical complaints 228 is not associated with the outcome (act
712).
[0072] If outcome predictor 230 determines that the n-gram
associated with the predictor does not appear in the record (act
704), then outcome predictor 230 may determine whether the
predictor is the last predictor in the ordered set (act 714). If
the predictor is the last one in the ordered set, then outcome
predictor 230 may assign a value of 0 to be the expected
probability of the outcome for the current one of records of
medical complaints 228 (act 716). Outcome predictor 230 may then
proceed to act 712 to predict that the current one of records of
medical complaints 228 is not associated with the outcome.
[0073] If outcome predictor 230 determines that the predictor is
not the last predictor in the ordered set of predictors, then
outcome predictor 230 may access a next predictor from the ordered set
of predictors (act 718) and may then proceed to act 704 to
determine whether the corresponding n-gram appears in the current
one of records of medical complaints 228.
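The single-outcome processing of FIG. 7 may be sketched, by way of example and not limitation, in Python. The function name, the tuple layout of a predictor, and the use of a plain substring test as the text string search are assumptions made for illustration.

```python
def predict_single(ordered_predictors, record_text, threshold):
    """ordered_predictors: list of (ngram, value), descending by value.
    Returns (expected_probability, associated_with_outcome)."""
    for ngram, value in ordered_predictors:
        if ngram in record_text:             # text string search (act 704)
            return value, value > threshold  # acts 706 through 712
    # No predictor matched: expected probability 0, not associated (acts 716, 712).
    return 0.0, False


# Hypothetical model and records.
model = [("diah", 0.93), ("naus", 0.88), ("feve", 0.4)]
hit = predict_single(model, "pt c/o nausea x2 days", 0.5)
below = predict_single(model, "fever and chills", 0.5)
miss = predict_single(model, "ankle pain", 0.5)
```

Because the ordered set is arranged in descending order of predictive value, the first predictor whose n-gram appears in the record is also the matching predictor with the highest predictive value.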
[0074] FIG. 8 is a flowchart that illustrates processing in another
implementation of outcome predictor 230 consistent with the
principles of the invention. In this implementation, predictive
model 208 may have an ordered set of predictors for multiple
outcomes. For example, there may be J+1 possible outcomes, in which
J of the outcomes may be particular outcomes and a (J+1)st
outcome may be any outcome other than the J particular outcomes. In
one implementation consistent with the principles of the invention,
J may be 10.
[0075] Outcome predictor 230 may begin by accessing a first
predictor from the ordered set of predictors (act 802). Each
predictor may include information with respect to an n-gram, a
predictive value, and an outcome to which the predictor pertains.
The ordered set may be arranged in descending order with respect to
the predictive value of each of the predictors. Outcome predictor
230 may then perform a text string search to determine whether the
n-gram corresponding to the predictor appears in the current one of
records of medical complaints 228 (act 804). If the n-gram
appears in the current one of records of medical complaints 228,
then outcome predictor 230 may assign the predictive value from the
predictor to be an expected probability of an outcome and may
assign a value indicating to which one of the possible outcomes the
predictor pertains (act 806). Outcome predictor 230 may then
proceed to act 816 to output the value and outcome from the ordered
set of predictors.
[0076] If outcome predictor 230 determines that the n-gram
associated with the predictor does not appear in the record (act
804), then outcome predictor 230 may determine whether the
predictor is the last predictor in the ordered set (act 808). If
the predictor is the last one in the ordered set, then outcome
predictor 230 may assign a value of 0 to be the expected
probability of all of the outcomes (act 814). Outcome predictor 230
may then output the highest predictive value and corresponding
outcome indication from the ordered set of predictors (act
816).
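The multiple-outcome processing of FIG. 8 may be sketched similarly. In this illustrative Python fragment the function name and tuple layout are hypothetical, and returning `None` as the outcome indication when no n-gram matches is an assumption, since the description does not state which outcome indication accompanies a value of 0.

```python
def predict_multi(ordered_predictors, record_text):
    """ordered_predictors: list of (ngram, value, outcome), descending by value.
    Returns (expected_probability, outcome_indication)."""
    for ngram, value, outcome in ordered_predictors:
        if ngram in record_text:   # text string search (act 804)
            return value, outcome  # acts 806 and 816
    # No predictor matched: expected probability 0 for all outcomes (act 814).
    return 0.0, None


# Hypothetical model reusing the example n-grams from the description.
model = [("diah", 0.93, 1), ("naus", 0.88, 1), ("feve", 0.75, 2)]
both = predict_multi(model, "fever and nausea")
fever_only = predict_multi(model, "fever")
nothing = predict_multi(model, "ankle pain")
```

For the record "fever and nausea" the sketch reports outcome 1, because the predictor for "naus" precedes the predictor for "feve" in the ordered set.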
[0077] Although the exemplary processing of FIGS. 7 and 8 employs
one ordered set of predictors to produce a most likely outcome,
alternative implementations of outcome predictor 230 may use
multiple ordered sets of predictors, where each of the ordered sets
of predictors may pertain to different outcomes, to produce
multiple outputs.
Miscellaneous
[0078] The above-described implementations of complaint classifier
generator 206 may calculate predictive values based on a relative
frequency of an n-gram appearing in records associated with an
outcome of interest and based on a total number of records in which
the n-gram appears. Another implementation of complaint classifier
generator 206, consistent with the principles of the invention, is
illustrated with reference to the flowchart of FIG. 9. In this
implementation, complaint classifier generator 206 may use a
stepwise variable selection approach.
[0079] Complaint classifier generator 206 may begin by accessing a
first record from training data, such as, for example, records of
medical complaints 204 (act 902). Complaint classifier generator
206 may then clear, or blank-out, the ordered set of predictors and
may reset counters, such as record counters and outcome counters
(act 903). Next, complaint classifier generator 206 may obtain a
classifier associated with the accessed record from pre-assigned
classifier 202 (act 904). Complaint classifier generator 206 may
then analyze characters in the record to determine all of the
n-grams that appear in the record (act 906). This may be performed
as described with respect to the implementation of FIG. 3.
Complaint classifier generator 206 may then increment a record
counter associated with each n-gram that appears in the record (act
908).
[0080] Next, complaint classifier generator 206 may determine
whether the record is associated with an outcome-of-interest based
on the associated classifier from pre-assigned classifier 202 (act
910). If complaint classifier generator 206 determines that the
classifier is associated with the outcome-of-interest, then
complaint classifier generator 206 may increment an outcome counter
associated with each of the n-grams that appear in the record (act
912). Complaint classifier generator 206 may then determine whether
there are more records to process (act 914). If there are more
records to process, then complaint classifier generator 206 may
access a next record (act 916) and may repeat acts 904-914. If
complaint classifier generator 206 determines that there are no
additional records to process (act 914), then complaint classifier
generator 206 may determine whether any of the n-grams found in the
records appear in more than a minimum number of records (act 917).
If none of the n-grams found in the records appear in more than a
minimum number of records, then the process is completed.
[0081] If, at act 917, complaint classifier generator 206
determines that there are n-grams that appear in more than a
minimum number of the records, then complaint classifier generator
206 may determine a predictive value of each of the n-grams that
appeared in more than the minimum number of records, based on the
outcome counter and the record counter associated with the
respective n-grams. The predictive value may be calculated
according to Eq. 1.
[0082] Next, complaint classifier generator 206 may select an
n-gram having a highest calculated predictive value and add a
predictor to the ordered set of predictors that corresponds to this
n-gram and predictive value (act 920). Complaint classifier
generator 206 may then exclude all of the records that include an
n-gram associated with the ordered set of predictors (act 922).
Complaint classifier generator 206 may then determine whether any
records remain after excluding ones of the records that include
n-grams that are associated with the ordered set of predictors (act
924). If no records remain, then the process is completed.
Otherwise, complaint classifier generator 206 may repeat acts
902-924.
[0083] In an alternative implementation of complaint classifier
generator 206, while performing act 920, complaint classifier
generator 206 may check the predictive value of the selected n-gram
before adding a new predictor to the ordered set of predictors. If
the predictive value is less than a predetermined amount, such as,
for example, 0.01, then the process may end. In this alternative
implementation, complaint classifier generator 206 will not add
predictors to the ordered set when the predictor has a very small
predictive value.
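The stepwise variable selection approach of FIG. 9, including the alternative cutoff on small predictive values, may be sketched in Python as follows. This is one possible reading and not a definitive implementation: the names are hypothetical, 4-character n-grams are assumed, ties between equal predictive values are broken arbitrarily by n-gram text, and the sketch resets only the counters on each pass while retaining the predictors already selected.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Distinct character n-grams in a record (n=4 is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def stepwise_select(records, is_outcome_of_interest, min_records, min_value=0.01):
    """records: list of (complaint_text, pre_assigned_classifier) pairs.
    Returns a list of (ngram, predictive_value) in order of selection."""
    remaining = list(records)
    ordered = []
    while remaining:
        # Recount over the records that have not yet been excluded.
        record_counter, outcome_counter = Counter(), Counter()
        for text, classifier in remaining:
            hit = is_outcome_of_interest(classifier)
            for ngram in char_ngrams(text):
                record_counter[ngram] += 1
                if hit:
                    outcome_counter[ngram] += 1
        # Only n-grams appearing in more than the minimum number of records.
        candidates = {
            ngram: outcome_counter[ngram] / count
            for ngram, count in record_counter.items()
            if count > min_records
        }
        if not candidates:
            break
        best = max(candidates, key=lambda g: (candidates[g], g))
        if candidates[best] < min_value:
            break  # alternative implementation: stop on very small values
        ordered.append((best, candidates[best]))
        # Exclude every record that contains the newly selected n-gram.
        remaining = [(t, c) for t, c in remaining if best not in char_ngrams(t)]
    return ordered
```

Each pass removes at least one record, because the selected n-gram appears in at least one remaining record, so the loop in this sketch terminates.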
CONCLUSION
[0084] The above-described embodiments are exemplary and are not
limiting with respect to the scope of the invention. Although the
above-described embodiments discuss either creating a predictive
model from records of medical complaints 204, or using the created
predictive models to predict or forecast an outcome associated with
records of medical complaints 228, other types of records may be
used in other embodiments to generate predictive models and to
forecast one or more outcomes. Further, embodiments of the
invention may not be limited to any particular language, but may be
used with records of any language to create predictive models and
to forecast one or more outcomes.
[0085] Embodiments within the scope of the present invention may
also include computer-readable media for carrying or having
computer-executable instructions or data structures stored thereon.
Such computer-readable media can be any available media that can be
accessed by a general purpose or special purpose computer. By way
of example, and not limitation, such computer-readable media can
comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage,
magnetic disk storage or other magnetic storage devices, or any
other medium which can be used to carry or store desired program
code means in the form of computer-executable instructions or data
structures. When information is transferred or provided over a
network or another communications connection (either hardwired,
wireless, or combination thereof) to a computer, the computer
properly views the connection as a computer-readable medium. Thus,
any such connection is properly termed a computer-readable medium.
Combinations of the above should also be included within the scope
of the computer-readable media.
[0086] Computer-executable instructions include, for example,
instructions and data which cause a general purpose computer,
special purpose computer, or special purpose processing device to
perform a certain function or group of functions.
Computer-executable instructions also include program modules that
are executed by computers in stand-alone or network environments.
Generally, program modules include routines, programs, objects,
components, and data structures, etc. that perform particular tasks
or implement particular abstract data types. Computer-executable
instructions, associated data structures, and program modules
represent examples of the program code means for executing steps of
the methods disclosed herein. The particular sequence of such
executable instructions or associated data structures represents
examples of corresponding acts for implementing the functions
described in such steps.
[0087] Those of skill in the art will appreciate that other
embodiments of the invention may be practiced in network computing
environments with many types of computer system configurations,
including personal computers, hand-held devices, multi-processor
systems, microprocessor-based or programmable consumer electronics,
network PCs, minicomputers, mainframe computers, and the like.
Embodiments may also be practiced in distributed computing
environments where tasks are performed by local and remote
processing devices that are linked (either by hardwired links,
wireless links, or by a combination thereof) through a
communications network. In a distributed computing environment,
program modules may be located in both local and remote memory
storage devices.
[0088] Although the above description may contain specific details,
they should not be construed as limiting the claims in any way.
Other configurations of the described embodiments of the invention
are part of the scope of this invention. For example, hardwired
logic may be used in implementations instead of processors, or one
or more application specific integrated circuits (ASICs) may be
used in implementations consistent with the principles of the
invention. Further, implementations consistent with the principles
of the invention may have more or fewer acts than as described, or
may implement acts in a different order than as shown. Accordingly,
the appended claims and their legal equivalents should only define
the invention, rather than any specific examples given.
* * * * *