U.S. patent application number 14/886340 was filed with the patent office on 2015-10-19 and published on 2016-11-03 for generating predictive models based on text analysis of medical study data.
The applicant listed for this patent is International Business Machines Corporation. Invention is credited to Dhruv A. Bhatt, Kristin E. McNeil, Nitaben A. Patel.
Application Number | 14/886340 |
Publication Number | 20160321423 |
Family ID | 57204927 |
Filed Date | 2015-10-19 |
United States Patent Application | 20160321423 |
Kind Code | A1 |
Bhatt; Dhruv A. ; et al. | November 3, 2016 |
GENERATING PREDICTIVE MODELS BASED ON TEXT ANALYSIS OF MEDICAL
STUDY DATA
Abstract
Methods for text analysis of medical study data to extract
predictive data. Natural language processing is performed on a
document in a collection of documents to determine whether the
document contains medical model data. In response to determining
that the document contains medical model data, content relating to
the medical model data in the document is annotated. A first
medical model is generated based on the annotations for the
identified medical model data and a certainty threshold. In response
to the certainty threshold meeting a user setting, the first
medical model is added to a predictive model for determining a risk
score, based on the analyzed data.
Inventors: | Bhatt; Dhruv A. (Indian Trail, NC); McNeil; Kristin E. (Charlotte, NC); Patel; Nitaben A. (Charlotte, NC) |
Applicant: |
Name | City | State | Country | Type |
International Business Machines Corporation | Armonk | NY | US | |
Family ID: | 57204927 |
Appl. No.: | 14/886340 |
Filed: | October 19, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14698012 | Apr 28, 2015 | |
14886340 | | |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 40/20 20200101; G16H 50/30 20180101; G16H 15/00 20180101; G16H 50/70 20180101; G06F 40/10 20200101; G06F 40/169 20200101 |
International Class: | G06F 19/00 20060101 G06F019/00 |
Claims
1. A computer-implemented method for text analysis of medical study
data to extract predictive data, comprising: performing natural
language processing on a document in a collection of documents to
determine whether the document contains medical model data; in
response to determining that the document contains medical model
data, annotating content relating to the medical model data in the
document; generating a first medical model based on the annotations
for the identified medical model data and a certainty threshold;
and in response to the certainty threshold meeting a user setting,
adding the first medical model to a predictive model for
determining a risk score, based on the analyzed data.
2. The method of claim 1, wherein the certainty threshold is
configured by a user in a property file or through a user
interface.
3. The method of claim 1, wherein the natural language processing
uses a set of predetermined dictionaries and parsing rules to
determine whether the document contains medical model data.
4. The method of claim 1, further comprising using the predictive
model against unstructured or structured content to generate a
predictive score.
Description
BACKGROUND
[0001] The present invention relates to text analytics, and more
specifically, to using text analytics of medical study data. In the
healthcare industry, a vast number of new studies are
published every day. With the current use of the Internet, these
studies are accessible electronically to people. However, it is
hard to keep up with reading these studies to uncover new pieces of
information, especially for medical personnel like doctors and
nurses, who are often very busy caring for their patients.
[0002] Predictive analytics encompasses a variety of statistical
techniques from modeling, machine learning, and data mining that
analyze current and historical facts to make predictions about
future, or otherwise unknown, events. Predictive analytics can be
used to create models that capture relationships among many factors
to allow assessment of risk or potential associated with a
particular set of conditions. These models can be used to guide
decision making in a variety of areas, including healthcare.
[0003] Currently, a few medical models published on the web could
be used to build a predictive analytics model. Some of these
medical models include:
[0004] A Predictive Model for Delirium in Hospitalized Elderly Medical Patients Based on Admission Characteristics (http://annals.org/article.aspx?articleid=706724)
[0005] A risk assessment model for the identification of hospitalized medical patients at risk for venous thromboembolism: the Padua Prediction Score (http://www.ncbi.nlm.nih.gov/pubmed/20738765)
[0006] Risk Prediction Models for Hospital Readmission (http://jama.jamanetwork.com/article.aspx?articleid=1104511)
[0007] Risk prediction models for patients with chronic kidney disease: a systematic review (http://www.ncbi.nlm.nih.gov/pubmed/23588748)
[0008] Development of a predictive model to identify inpatients at risk of re-admission within 30 days of discharge (PARR-30) (http://bmjopen.bmj.com/content/2/4/e001667.full)
[0009] Framingham diabetes (http://www.framinghamheartstudy.org/risk-functions/diabetes/index.php)
[0010] Framingham Heart Study AF Score (10-year risk) (http://www.framinghamheartstudy.org/risk-functions/atrial-fibrillation/10-year-risk.php)
[0011] However, it takes time to find these and convert the logic
into a model that can be used with software that can produce
predictive models, such as a Statistical Package for the Social
Sciences (SPSS) model. Typically, it is necessary to manually
create the models based on the logic mentioned in the studies, like
the ones referenced above. Thus, there is a need for an improved
way of generating predictive models.
SUMMARY
[0012] According to one embodiment of the present invention,
techniques are described for text analysis of medical study data to
extract predictive data. Natural language processing is performed
on a document in a collection of documents to determine whether the
document contains medical model data. In response to determining
that the document contains medical model data, content relating to
the medical model data in the document is annotated. A first
medical model is generated based on the annotations for the
identified medical model data and a certainty threshold. In response
to the certainty threshold meeting a user setting, the first
medical model is added to a predictive model for determining a risk
score, based on the analyzed data.
[0013] The details of one or more embodiments of the invention are
set forth in the accompanying drawings and the description below.
Other features and advantages of the invention will be apparent
from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0014] FIG. 1 shows a flowchart for generating a predictive model
based on medical model data, in accordance with one embodiment.
[0015] FIG. 2 is a block diagram showing a system for generating a
predictive model based on medical model data, in accordance with
one embodiment.
[0016] Like reference symbols in the various drawings indicate like
elements.
DETAILED DESCRIPTION
[0017] The various embodiments described herein pertain to
techniques for performing text analysis on medical study data to
extract predictive data from medical studies around a category (for
example, that a Chronic Heart Failure (CHF) diagnosis may result in
a 50% mortality rate over the next 5 years). Text analysis, in
particular natural language processing (NLP), uses dictionaries and
rules to annotate content in order to determine whether the content
is related to a medical model. If the content is determined to
contain medical model information, then the text analysis tool
obtains the section of text for the first model. It determines each
instruction phrase of the medical model section. Instruction
phrases are similar to text analytic rules. One example of an
instruction phrase is "if age>70, then risk=3." If there are
more instruction phrases in this model's section of text, then they
are also determined.
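As an illustration of how an instruction phrase might be turned into structured data, the following is a minimal Python sketch. The names `InstructionPhrase` and `parse_instruction_phrase`, and the regular expression itself, are assumptions made for this example; the description does not specify a phrase grammar beyond the sample phrase "if age>70, then risk=3."

```python
import re
from typing import NamedTuple, Optional


class InstructionPhrase(NamedTuple):
    variable: str     # e.g. "age"
    operator: str     # one of >, <, >=, <=, =
    threshold: float  # value the variable is compared against
    risk: float       # risk value assigned when the condition holds


# Hypothetical pattern for phrases shaped like "if age>70, then risk=3".
_PHRASE = re.compile(
    r"if\s+(?P<var>[A-Za-z_]\w*)\s*(?P<op>>=|<=|>|<|=)\s*(?P<num>\d+(?:\.\d+)?)"
    r"\s*,\s*then\s+risk\s*=\s*(?P<risk>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_instruction_phrase(text: str) -> Optional[InstructionPhrase]:
    """Return the first instruction phrase found in text, or None."""
    m = _PHRASE.search(text)
    if m is None:
        return None
    return InstructionPhrase(
        m["var"].lower(), m["op"], float(m["num"]), float(m["risk"])
    )
```

A section containing several instruction phrases could be handled by replacing `search` with `finditer` and collecting every match.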
[0018] A model generator engine generates the medical model based
on the medical model information determined from the text analytics
and based on a certainty threshold. Typically, this certainty
threshold is configured by a user in some kind of property file or
user interface, but it should be realized that there are also many
other ways to configure certainty thresholds that are available to
those having ordinary skill in the art. If the certainty threshold
meets a predefined user setting, then the piece of information is
added to the model. Further, if there are more rules identified for
this medical model, then they are also added to the model. If there
are more sections with medical model information, then those
sections of text are also analyzed to determine the model
information (as annotations) and another model is generated. As is
familiar to those having ordinary skill in the art, an annotation
is the resulting value from the identified rule or dictionary. For
example, if an AgeIndicator rule is <Age dictionary term>
followed by a mathematical symbol followed by a number, then when
the text "Age>70" is analyzed, the rule is fired off and
generates an AgeIndicator annotation with value Age>70.
Annotators and Annotation terms are part of the Unstructured
Information Management Architecture (UIMA) framework, which is one
possible framework implementation in the various embodiments
described herein. The generated predictive model can then be used
to determine a risk score based on the analyzed data. Various
embodiments will now be described in further detail by way of
example and with reference to the figures.
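The AgeIndicator rule and the certainty check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Annotation` and `MedicalModel` classes, the `certainty` score attached to each annotation, and the `>=` comparison against the user setting are all hypothetical, since the description does not define how certainty is computed or compared. The sketch uses plain regular expressions and is not UIMA code.

```python
import re
from dataclasses import dataclass, field

AGE_TERMS = ("age range", "age")  # assumed Age dictionary terms, longest first

# AgeIndicator: <Age dictionary term>, then a mathematical symbol, then a number.
AGE_INDICATOR = re.compile(
    r"\b(?:%s)\s*(?:>=|<=|>|<|=)\s*\d+(?:\.\d+)?" % "|".join(AGE_TERMS),
    re.IGNORECASE,
)


@dataclass
class Annotation:
    type: str
    value: str
    begin: int
    end: int
    certainty: float  # hypothetical score; not defined in the description


@dataclass
class MedicalModel:
    rules: list = field(default_factory=list)


def annotate_age_indicators(text, certainty=1.0):
    """Fire the AgeIndicator rule on every match in text."""
    return [
        Annotation("AgeIndicator", m.group(0), m.start(), m.end(), certainty)
        for m in AGE_INDICATOR.finditer(text)
    ]


def add_if_certain(model, annotation, user_threshold):
    """Add the annotation value to the model only if it meets the user setting."""
    if annotation.certainty >= user_threshold:
        model.rules.append(annotation.value)
        return True
    return False
```

In this sketch the user setting is passed in as `user_threshold`; in practice it could be read from a property file or a user interface, as described above.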
[0019] FIG. 1 shows a flowchart of a process (100) for generating a
predictive model based on medical model data, in accordance with
one embodiment. As can be seen in FIG. 1, the process (100) starts
by inputting a document into the system (step 102). The document is
known to contain medical model information and can be input, for
example, by a user uploading the document to the system, or by
automatically accessing the document from a database.
[0020] Next, it is determined if there are any more documents to be
input (step 104). If there are more documents, the system returns
to step 102 to obtain the next document. If there are no more
documents, the process continues to step 106, where medical study
and model parsing rules and dictionaries are used to perform
natural language processing on a selected document.
[0021] Based on the text annotations generated in step 106, a
determination is made as to whether there is a medical study and
model present in the selected document (step 108). If it is
determined that there is no medical study and model present in the
selected document, the process continues to check if there are any
more documents available (step 110). If there are more documents
available, the process returns to step 106 and continues as
outlined above.
[0022] If it is determined in step 108 that there is a medical
study and model present in the selected document, the text
analytics annotations for the determined section and modeling rule
are obtained. This is based on section and modeling rule
dictionaries and parsing rules (step 112).
[0023] Next, the section that contains the modeling
rules is identified (step 114) based on the annotations generated
in step 112.
[0024] Next, the modeling rules that are located within the section
are identified (step 116). This identification is based on using
the Common Analysis Structure (CAS) Subject of Analysis (SOFA)
index values of the UIMA framework for the section and finding the
modeling rule annotations that are identified within the annotation
index range. SOFA describes a way of storing the information
(annotations) in memory to be able to retrieve and work with them.
For example, at character 11 there is a first name annotation
(Kristin) identified in this line: "My name is Kristin McNeil". The
annotation index range is the beginning and end values for the
identified phrase for the annotation. For example, in the above "My
name is Kristin McNeil" phrase, the begin and end values would be
11 and 17, respectively.
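The index-range lookup can be illustrated with a short Python sketch. The `Span` type and both helper functions are hypothetical names chosen for this example, and the code does not use the UIMA API. It treats the end value as the offset of the last character, matching the 11 and 17 values in the example above; a framework may instead store an exclusive end offset.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    type: str
    begin: int  # offset of the first character
    end: int    # offset of the last character (inclusive, as in the example above)


def find_span(text: str, phrase: str, type_: str) -> Span:
    """Locate phrase in text and return its annotation index range."""
    begin = text.index(phrase)
    return Span(type_, begin, begin + len(phrase) - 1)


def annotations_within(section: Span, annotations: List[Span]) -> List[Span]:
    """Keep only the annotations whose index range lies inside the section."""
    return [
        a for a in annotations
        if section.begin <= a.begin and a.end <= section.end
    ]
```

Calling `find_span("My name is Kristin McNeil", "Kristin", "FirstName")` yields begin and end values of 11 and 17. The same containment test can be used to collect the modeling rule annotations that fall inside a section annotation.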
[0025] Next, in step 118, a model generator program is used to
create/update a predictive model based on the modeling rules
identified in step 116. In one embodiment, the model generator is
integrated with SPSS or some similar modeling software to generate
a predictive model by using REST APIs. The REST API allows a
user to send requests (e.g., add a predictive rule, update details of
a predictive rule) with information via HTTP to the server. The
predictive model can be used against unstructured and/or structured
content to generate a predictive score. For example, patient data
can be used to generate a predictive score for a CHF
readmission.
[0026] After the predictive model has been created, it is examined
whether there are any more sections in the document to be analyzed.
If not, the process returns to step 110. If there are any further
sections, the process returns to step 114 and continues as outlined
above.
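The control flow of process (100) can be summarized in a short Python sketch. Every callable passed to `run_process` is a hypothetical placeholder for the corresponding step; the sketch shows only the order in which the steps run, not how any step is implemented.

```python
def run_process(documents, annotate, has_model, find_sections, find_rules, update_model):
    """Mirror the control flow of steps 106-118 with caller-supplied callables.

    documents is assumed to be the collection already gathered in steps 102-104.
    """
    model = []
    for doc in documents:
        annotations = annotate(doc)            # step 106: NLP with dictionaries and rules
        if not has_model(annotations):         # step 108: medical study and model present?
            continue                           # step 110: move on to the next document
        for section in find_sections(doc):     # steps 112-114: locate model sections
            rules = find_rules(doc, section)   # step 116: modeling rules inside the section
            update_model(model, rules)         # step 118: create/update the predictive model
    return model
```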
Use Example
[0027] To further illustrate the process described above, consider
the following example, in which the following medical model is
provided. Table 1 below indicates a point designation based on
predictors for 8-year risk of Type 2 diabetes in middle-aged
adults. Table 2 indicates an approximate percentage risk of Type 2
diabetes in middle-aged adults, based on the total points obtained
in Table 1.
TABLE 1
Point designation based on predictors for 8-year risk of type 2 diabetes in middle-aged adults
Predictor | Points
Fasting glucose level 100-126 mg/dL | 10
BMI 25.0-29.9 | 2
BMI >30.0 | 5
HDL-C level <40 mg/dL in men or <50 mg/dL in women | 5
Parental History of diabetes mellitus | 3
Triglyceride level >150 mg/dL | 3
Blood pressure >130/85 mmHG or receiving treatment | 2
TABLE 2
Given total points from Table 1, there is an approximate percentage risk for type 2 diabetes in middle-aged adults
Total Points | 8-year risk, %
10 or less | 3 or less
11 | 4
12 | 4
13 | 5
14 | 6
15 | 7
16 | 9
17 | 11
18 | 13
19 | 15
20 | 18
21 | 21
22 | 25
23 | 29
24 | 33
25 or more | 35 or more
[0028] When analyzing this data, the following dictionaries may be
used, in one embodiment:
Medical Model Indicator Dictionary
[0029] Risk Model
[0030] Medical Model
[0031] Risk
[0032] Point Designation
[0033] Risk Score
Age Dictionary
[0034] Population of interest
[0035] Age
[0036] Age Range
Point Dictionary
[0037] Total Points
[0038] Points
Risk Value Dictionary
[0039] Risk
[0040] Percent
[0041] Risk Score
Unit Dictionary
[0042] Year
[0043] Month
[0044] Quarter
[0045] Decade
[0046] Century
Predictor Dictionary
[0047] Fasting glucose
[0048] BMI
[0049] HDL-C
[0050] Parental history of diabetes mellitus
[0051] Triglyceride
[0052] Blood Pressure
[0053] Receiving Treatment
Range Symbol Dictionary
[0054] --
[0055] to
Gender Dictionary
[0056] Men
[0057] Women
[0058] Male
[0059] Female
[0060] Some examples of parsing rules that may be included when
performing natural language processing on the data in the above
document, in accordance with one embodiment, are listed below:
Predictor Factor Rules
[0061] Predictor dictionary followed by a number followed by a measurement.
[0062] Predictor dictionary followed by mathematical sign followed by a number.
[0063] Predictor dictionary followed by number followed by a range symbol followed by a number.
[0064] These rules will identify range, measurement unit,
conditional (i.e., gender) and points.
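Two of the Predictor Factor Rules (the range form and the mathematical-sign form) can be approximated with regular expressions, as in the Python sketch below. The abbreviated predictor list, the optional "level" token, the unit pattern, and the function name `match_predictor` are assumptions made for this example. Ratio values such as 130/85 and the conditional (gender) and points features are not handled here.

```python
import re

# Assumed subset of the Predictor dictionary.
PREDICTORS = ["Fasting glucose", "BMI", "HDL-C", "Triglyceride"]
_PRED = "|".join(re.escape(p) for p in PREDICTORS)
_NUM = r"\d+(?:\.\d+)?"
_UNIT = r"(?:\s+(?P<unit>[A-Za-z]+/[A-Za-z]+))?"  # optional unit such as mg/dL

# Rule: predictor, number, range symbol, number.
RANGE_RULE = re.compile(
    rf"(?P<predictor>{_PRED})(?:\s+level)?\s+(?P<low>{_NUM})\s*(?:-|to)\s*(?P<high>{_NUM})"
    + _UNIT,
    re.IGNORECASE,
)
# Rule: predictor, mathematical sign, number.
SIGN_RULE = re.compile(
    rf"(?P<predictor>{_PRED})(?:\s+level)?\s*(?P<sign>>=|<=|>|<)\s*(?P<value>{_NUM})"
    + _UNIT,
    re.IGNORECASE,
)


def match_predictor(text):
    """Return (rule name, matched features) for the first rule that fires, else None."""
    for name, rule in (("range", RANGE_RULE), ("sign", SIGN_RULE)):
        m = rule.search(text)
        if m:
            return name, {k: v for k, v in m.groupdict().items() if v is not None}
    return None
```

The returned dictionary plays the role of the annotation features (range, measurement unit) that the rules are said to identify.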
Multi Predictor Factor Rule
[0065] Predictor Dictionary followed by the text `and` followed by
Predictor Dictionary.
Risk Value Rule
[0066] Risk Value dictionary followed by %.
[0067] Number followed by unit dictionary followed by Risk Value dictionary.
Risk Conversion Annotation Rule
[0068] Points dictionary followed by Risk Value Rule.
Age Rules
[0069] Age dictionary followed by a number followed by the text `years`==>The number is set to the age minimum and age maximum feature annotations.
[0070] Age dictionary followed by a number followed by range symbol followed by a number followed by the text `years`==>The first number is set to the age minimum and the second number is set to age maximum.
[0071] Number followed by the text `to` followed by a number followed by the text `years`==>The first number is set to the age minimum and the second number is set to age maximum feature annotations.
[0072] Number followed by the text `years`==>The number is set to the age minimum and age maximum feature annotations.
[0073] This Age analytic rule identifies the age or age range for
patients in the model and in one embodiment, the system generates a
Predictive Node Rule based on the age analytic rule. The Predictive
Node Rule is an SPSS concept and represents a step in a process,
similar to a block in a flow diagram. One or more rules can be
implemented in a single SPSS node. Some examples of SPSS nodes
include file input nodes (i.e., configure how to input data), data
mining algorithm node (i.e., decision tree, clustering), etc. One
example of a predictive node rule is:
[0074] If age >70, then risk of heart disease is 45%
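The four Age Rules can be sketched with ordered regular expressions in Python, trying the most specific rule first. The patterns, the optional colon after the Age term, and the names `annotate_age` and `to_node_rule` are assumptions made for this illustration. The output of `to_node_rule` is a plain string, not an SPSS node.

```python
import re

_NUM = r"(\d+)"
AGE_RULES = [
    # Age dictionary term, number, range symbol, number, 'years'
    re.compile(rf"\bage(?:\s+range)?\s*:?\s*{_NUM}\s*(?:-|to)\s*{_NUM}\s+years", re.I),
    # number, 'to', number, 'years'
    re.compile(rf"\b{_NUM}\s+to\s+{_NUM}\s+years", re.I),
    # Age dictionary term, number, 'years'
    re.compile(rf"\bage\s*:?\s*{_NUM}\s+years", re.I),
    # number, 'years'
    re.compile(rf"\b{_NUM}\s+years", re.I),
]


def annotate_age(text):
    """Return age minimum/maximum features from the first rule that fires."""
    for rule in AGE_RULES:
        m = rule.search(text)
        if m:
            nums = [int(g) for g in m.groups()]
            # A single number sets both the minimum and the maximum.
            return {"age_min": nums[0], "age_max": nums[-1]}
    return None


def to_node_rule(age):
    """Render the age features as a textual node criterion."""
    return f"Age>={age['age_min']} and Age<={age['age_max']}"
```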
[0075] If a table is followed by a medical model indicator
annotation, then the table is parsed for column and row headers.
Next the analytic engine parses the text. It reads the text
token-by-token and row-by-row to determine the Predictive Model to
be generated.
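A row-by-row, token-by-token pass over a two-column table can be sketched in Python as follows. The sketch assumes that the table arrives as plain text lines, that the first line holds the two column headers, and that the last token of every later row is an integer for the second column. These are simplifying assumptions for illustration; `parse_points_table` is a hypothetical name.

```python
def parse_points_table(lines):
    """Split each row into tokens; the last token fills the second column."""
    header, *rows = [line.split() for line in lines if line.strip()]
    if len(header) != 2:
        raise ValueError("expected exactly two column headers")
    parsed = []
    for tokens in rows:
        *condition, last = tokens
        parsed.append({header[0]: " ".join(condition), header[1]: int(last)})
    return parsed
```

Each parsed row can then be passed to the dictionary and rule matching step to produce annotations.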
[0076] By applying the dictionaries and rules to Table 1 above, the
following annotations are generated, in accordance with one
embodiment:
Age Range Annotation: 45 to 64 years
[0077] Age minimum annotation=45
[0078] Age maximum annotation=64
[0079] The system would generate a predictive model node rule:
Age>=45 and Age<=64
Predictor Factor Annotations:
[0080] Fasting glucose (range feature=100-126, measurement=mg/dL, points=10)
[0081] BMI (range feature=25.0-29.9, points=2)
[0082] BMI (range feature=`>30`, points=5)
[0083] HDL-C (range feature=`<40`, measurement=mg/dL, conditional=male, points=5)
[0084] HDL-C (range feature=`<50`, measurement=mg/dL, conditional=female, points=5)
[0085] Parental history of diabetes mellitus (points=3)
[0086] Triglyceride (range feature=`>150`, measurement=mg/dL, points=3)
[0087] Blood pressure (range feature=`>130/85`, measurement=mmHG, points=2)
[0088] Receiving treatment (points=2)
[0089] The system can generate a predictive model node rule for
each of these annotations by using the API of the predictive
software, such as SPSS. The predictor factor annotation value
combined with the range feature and measurement unit is used to
generate the model rule. The points feature value is used to assign
a number of points should the predictive model node criteria be met
(e.g., age between 60 and 70, then assign 3 points). If there is no
range feature value then a Boolean predictive model rule node is
generated (e.g., male, then add 1 point to the risk). If the
conditional feature value is set, then it is included in the
predictive model node criteria.
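The way annotation features could drive rule evaluation is illustrated by the Python sketch below. It evaluates the criteria directly rather than calling any SPSS API; the `PredictorAnnotation` class, its field names, and the patient dictionary layout (including the `gender` key) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictorAnnotation:
    predictor: str
    points: int
    low: Optional[float] = None        # inclusive lower bound of the range feature
    high: Optional[float] = None       # inclusive upper bound of the range feature
    above: Optional[float] = None      # strict lower bound, for features like '>30'
    conditional: Optional[str] = None  # e.g. 'male' or 'female'


def rule_applies(a, patient):
    """Check one annotation's criteria against a patient record (a dict)."""
    if a.conditional is not None and patient.get("gender") != a.conditional:
        return False
    value = patient.get(a.predictor)
    if a.low is None and a.high is None and a.above is None:
        return bool(value)  # Boolean rule: no range feature value
    if value is None:
        return False
    if a.above is not None and not value > a.above:
        return False
    if a.low is not None and value < a.low:
        return False
    if a.high is not None and value > a.high:
        return False
    return True


def total_points(annotations, patient):
    """Sum the points of every rule whose criteria are met."""
    return sum(a.points for a in annotations if rule_applies(a, patient))
```

For example, a patient with fasting glucose 110, BMI 31.2, and a parental history of diabetes would collect 10 + 5 + 3 = 18 points under the corresponding three annotations.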
[0090] Analyzing Table 2 generates data for the point to percent
conversion. That is:
Risk Value Rule Annotation:
[0091] 8 year risk, %
Points Annotation:
[0092] Total Points
[0093] Since these annotations are in the table headers, the parser
goes line-by-line and generates the Risk conversion annotation. For
example:
Risk Conversion Annotation:
[0094] Points=11
[0095] Risk value=4
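Once every row of Table 2 has produced a Risk conversion annotation, the point-to-percent conversion amounts to a lookup. The following Python sketch uses the values from Table 2; the name `risk_percent` and the choice to return strings (so that the open-ended first and last rows can be reported as written) are assumptions for this illustration.

```python
# Points-to-risk pairs taken from Table 2 (rows 11 through 24).
RISK_BY_POINTS = {
    11: 4, 12: 4, 13: 5, 14: 6, 15: 7, 16: 9, 17: 11,
    18: 13, 19: 15, 20: 18, 21: 21, 22: 25, 23: 29, 24: 33,
}


def risk_percent(total_points: int) -> str:
    """Convert an integer point total into the approximate 8-year risk, in percent."""
    if total_points <= 10:
        return "3 or less"
    if total_points >= 25:
        return "35 or more"
    return str(RISK_BY_POINTS[total_points])
```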
[0096] It should be noted that this is merely one exemplary
implementation, and that the above concepts can be used in the
context of many other products, such as the IBM Advanced Care
Insights product, the IBM Watson Content Analytics product and the
IBM SPSS Modeler product, all of which are available from
International Business Machines Corporation of Armonk, NY. As the
skilled person realizes, the above list of dictionaries and rules
is not exclusive, and many other types of dictionaries and rules
can also be used within this context.
[0097] FIG. 2 shows a system (200), in accordance with one
embodiment, for generating a predictive model. As can be seen in
FIG. 2, the system (200) includes a text analytics engine (204), a
predictive rule generator (206), a predictive model generator (208)
and a predictive analytics engine (210). FIG. 2 also shows how the
medical study data (202) is ingested into the system (200). In the
illustrated embodiment, the text analytics engine (204) performs
steps 102-112 of FIG. 1, the predictive rule generator (206)
performs steps 114-116 of FIG. 1, the predictive model generator
(208) performs step 118 of FIG. 1, and once the predictive model has been
generated, it can be used by the predictive analytics engine (210).
The system (200) can be advantageously implemented on one or more
computers and/or servers or on specialized computers/modules that
perform each of the steps described in FIG. 1.
[0098] The present invention may be a system, a method, and/or a
computer program product. The computer program product may include
a computer readable storage medium (or media) having computer
readable program instructions thereon for causing a processor to
carry out aspects of the present invention. The computer readable
storage medium can be a tangible device that can retain and store
instructions for use by an instruction execution device. The
computer readable storage medium may be, for example, but is not
limited to, an electronic storage device, a magnetic storage
device, an optical storage device, an electromagnetic storage
device, a semiconductor storage device, or any suitable combination
of the foregoing. A non-exhaustive list of more specific examples
of the computer readable storage medium includes the following: a
portable computer diskette, a hard disk, a random access memory
(RAM), a read-only memory (ROM), an erasable programmable read-only
memory (EPROM or Flash memory), a static random access memory
(SRAM), a portable compact disc read-only memory (CD-ROM), a
digital versatile disk (DVD), a memory stick, a floppy disk, a
mechanically encoded device such as punch-cards or raised
structures in a groove having instructions recorded thereon, and
any suitable combination of the foregoing. A computer readable
storage medium, as used herein, is not to be construed as being
transitory signals per se, such as radio waves or other freely
propagating electromagnetic waves, electromagnetic waves
propagating through a waveguide or other transmission media (e.g.,
light pulses passing through a fiber-optic cable), or electrical
signals transmitted through a wire.
[0099] Computer readable program instructions described herein can
be downloaded to respective computing/processing devices from a
computer readable storage medium or to an external computer or
external storage device via a network, for example, the Internet, a
local area network, a wide area network and/or a wireless network.
The network may comprise copper transmission cables, optical
transmission fibers, wireless transmission, routers, firewalls,
switches, gateway computers and/or edge servers. A network adapter
card or network interface in each computing/processing device
receives computer readable program instructions from the network
and forwards the computer readable program instructions for storage
in a computer readable storage medium within the respective
computing/processing device.
[0100] Computer readable program instructions for carrying out
operations of the present invention may be assembler instructions,
instruction-set-architecture (ISA) instructions, machine
instructions, machine dependent instructions, microcode, firmware
instructions, state-setting data, or either source code or object
code written in any combination of one or more programming
languages, including an object oriented programming language such
as Smalltalk, C++ or the like, and conventional procedural
programming languages, such as the "C" programming language or
similar programming languages. The computer readable program
instructions may execute entirely on the user's computer, partly on
the user's computer, as a stand-alone software package, partly on
the user's computer and partly on a remote computer or entirely on
the remote computer or server. In the latter scenario, the remote
computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area
network (WAN), or the connection may be made to an external
computer (for example, through the Internet using an Internet
Service Provider). In some embodiments, electronic circuitry
including, for example, programmable logic circuitry,
field-programmable gate arrays (FPGA), or programmable logic arrays
(PLA) may execute the computer readable program instructions by
utilizing state information of the computer readable program
instructions to personalize the electronic circuitry, in order to
perform aspects of the present invention.
[0101] Aspects of the present invention are described herein with
reference to flowchart illustrations and/or block diagrams of
methods, apparatus (systems), and computer program products
according to embodiments of the invention. It will be understood
that each block of the flowchart illustrations and/or block
diagrams, and combinations of blocks in the flowchart illustrations
and/or block diagrams, can be implemented by computer readable
program instructions.
[0102] These computer readable program instructions may be provided
to a processor of a general purpose computer, special purpose
computer, or other programmable data processing apparatus to
produce a machine, such that the instructions, which execute via
the processor of the computer or other programmable data processing
apparatus, create means for implementing the functions/acts
specified in the flowchart and/or block diagram block or blocks.
These computer readable program instructions may also be stored in
a computer readable storage medium that can direct a computer, a
programmable data processing apparatus, and/or other devices to
function in a particular manner, such that the computer readable
storage medium having instructions stored therein comprises an
article of manufacture including instructions which implement
aspects of the function/act specified in the flowchart and/or block
diagram block or blocks.
[0103] The computer readable program instructions may also be
loaded onto a computer, other programmable data processing
apparatus, or other device to cause a series of operational steps
to be performed on the computer, other programmable apparatus or
other device to produce a computer implemented process, such that
the instructions which execute on the computer, other programmable
apparatus, or other device implement the functions/acts specified
in the flowchart and/or block diagram block or blocks.
[0104] The flowchart and block diagrams in the Figures illustrate
the architecture, functionality, and operation of possible
implementations of systems, methods, and computer program products
according to various embodiments of the present invention. In this
regard, each block in the flowchart or block diagrams may represent
a module, segment, or portion of instructions, which comprises one
or more executable instructions for implementing the specified
logical function(s). In some alternative implementations, the
functions noted in the block may occur out of the order noted in
the figures. For example, two blocks shown in succession may, in
fact, be executed substantially concurrently, or the blocks may
sometimes be executed in the reverse order, depending upon the
functionality involved. It will also be noted that each block of
the block diagrams and/or flowchart illustration, and combinations
of blocks in the block diagrams and/or flowchart illustration, can
be implemented by special purpose hardware-based systems that
perform the specified functions or acts or carry out combinations
of special purpose hardware and computer instructions.
[0105] The descriptions of the various embodiments of the present
invention have been presented for purposes of illustration, but are
not intended to be exhaustive or limited to the embodiments
disclosed. Many modifications and variations will be apparent to
those of ordinary skill in the art without departing from the scope
and spirit of the described embodiments. The terminology used
herein was chosen to best explain the principles of the
embodiments, the practical application or technical improvement
over technologies found in the marketplace, or to enable others of
ordinary skill in the art to understand the embodiments disclosed
herein.
* * * * *