U.S. patent number 10,332,210 [Application Number 14/534,437] was granted by the patent office on 2019-06-25 for system and method for implementing computer modeling techniques.
This patent grant is currently assigned to Nationwide Mutual Insurance Company. The grantee listed for this patent is Nationwide Mutual Insurance Company. Invention is credited to Kapwon Choi, Spencer Herath, John Kemp, Christopher P. Nicholas, Benjamin A. Ogorek, Shannon Terry, Thomas Michael Winningham.
![](/patent/grant/10332210/US10332210-20190625-D00000.png)
![](/patent/grant/10332210/US10332210-20190625-D00001.png)
![](/patent/grant/10332210/US10332210-20190625-D00002.png)
![](/patent/grant/10332210/US10332210-20190625-D00003.png)
![](/patent/grant/10332210/US10332210-20190625-D00004.png)
![](/patent/grant/10332210/US10332210-20190625-D00005.png)
![](/patent/grant/10332210/US10332210-20190625-D00006.png)
![](/patent/grant/10332210/US10332210-20190625-D00007.png)
United States Patent 10,332,210
Kemp, et al.
June 25, 2019
System and method for implementing computer modeling techniques
Abstract
Improved computer modeling techniques, and uses thereof, are
described herein. A set of unstructured textual data is received.
Certain textual data is removed from the set of unstructured data
to form an initial vocabulary set of textual data. One or more
bigrams are added to the vocabulary set of textual data to form a
final vocabulary set of textual data. The final vocabulary set of
textual data is divided into a plurality of subsets of textual data
based on type. A model is trained using each of the plurality of
subsets of textual data to form a plurality of trained models, each
corresponding to one of the types.
Inventors: Kemp; John (Worthington, OH), Herath; Spencer (Brooklyn, NY), Winningham; Thomas Michael (Columbus, OH), Terry; Shannon (Upper Arlington, OH), Nicholas; Christopher P. (Worthington, OH), Choi; Kapwon (Columbus, OH), Ogorek; Benjamin A. (San Jose, CA)
Applicant: Nationwide Mutual Insurance Company (Columbus, OH, US)
Assignee: Nationwide Mutual Insurance Company (Columbus, OH)
Family ID: 66996494
Appl. No.: 14/534,437
Filed: November 6, 2014
Related U.S. Patent Documents

| Application Number | Filing Date | Patent Number | Issue Date |
| --- | --- | --- | --- |
| 61900576 | Nov 6, 2013 | | |
Current U.S. Class: 1/1
Current CPC Class: G06Q 40/08 (20130101)
Current International Class: G06Q 40/08 (20120101)
References Cited [Referenced By]

U.S. Patent Documents

Other References

Frees et al., "Predictive Modeling of Multi-Peril Homeowners Insurance", Casualty Actuarial Society E-Forum, Winter 2011. cited by examiner.

Allianz soars past goals with IxReveal advanced analytic software. (Jul. 14, 2008). Business Wire. Retrieved from https://dialog.proquest.com/professional/docview/677162343?accountid=142257 (Year: 2008). cited by examiner.

Workers compensation fund implements HNC claims advisor for subrogation. (May 28, 2002). Business Wire. Retrieved from https://dialog.proquest.com/professional/docview/679265909?accountid=142257 (Year: 2002). cited by examiner.
Primary Examiner: Wong; Eric T
Attorney, Agent or Firm: Morgan, Lewis & Bockius, LLP
Parent Case Text
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Patent
Application Ser. No. 61/900,576, filed Nov. 6, 2013, which is
hereby incorporated by reference in its entirety.
Claims
What is claimed is:
1. A computer implemented method comprising: receiving at a
computer processor a set of unstructured textual property claims
data corresponding to a plurality of property loss insurance
claims, wherein the set of unstructured textual property claims
data comprises data describing claims associated with subrogation
and claims not associated with subrogation, wherein a number of the
claims associated with subrogation is the same as a number of the
claims not associated with subrogation; removing, using the
computer processor, at least certain textual data from the set of
unstructured textual property claims data to form an initial
vocabulary set of textual data; adding, using the computer
processor, one or more bigrams to the vocabulary set of textual
data to form a final vocabulary set of textual data; identifying,
using the computer processor, a property loss type for each of the
plurality of property loss insurance claims; dividing, using the
computer processor, the final vocabulary set of textual data into a
plurality of subsets of textual data based on the property loss
type; training, using the computer processor, a model using each of
the plurality of subsets of textual data to form a plurality of
trained models, each corresponding to one of the property loss
types; receiving at the computer processor data describing a
property insurance loss claim; determining, using the computer
processor, whether the property insurance loss claim is a
qualifying claim; identifying, using the computer processor, a
property loss type associated with the property insurance loss
claim; selecting, using the computer processor, one of the
plurality of models based on the identified property loss type;
processing, using the computer processor, the data describing the
property insurance loss claim using the selected model; and
determining, using the computer processor, whether to analyze the
property insurance loss claim for subrogation based on the
processing.
2. The computer implemented method of claim 1 wherein the certain
textual data removed from the set of unstructured textual data
comprises one or more of numbers, punctuation, stop words, and
context-specific words that occur within the set of unstructured
textual property claims data with a predetermined level of
frequency.
3. The computer implemented method of claim 1 wherein the property
loss type comprises one of water, fire and damage by vehicle.
4. The computer implemented method of claim 1 wherein the
determining comprises determining whether the property insurance
loss claim was initiated within the past 7 to 14 days.
5. A non-transitory computer readable storage medium having stored
thereon computer executable instructions that, when executed on a
computer, configure the computer to perform a method comprising:
receiving a set of unstructured textual property claims data
corresponding to a plurality of property loss insurance claims,
wherein the set of unstructured textual property claims data
comprises data describing claims associated with subrogation and
claims not associated with subrogation, wherein a number of the
claims associated with subrogation is the same as a number of the
claims not associated with subrogation; removing at least certain
textual data from the set of unstructured textual property claims
data to form an initial vocabulary set of textual data; adding one
or more bigrams to the vocabulary set of textual data to form a
final vocabulary set of textual data; identifying a property loss
type for each of the plurality of property loss insurance claims;
dividing the final vocabulary set of textual data into a plurality
of subsets of textual data based on the property loss type;
training a model using each of the plurality of subsets of textual
data to form a plurality of trained models, each corresponding to
one of the property loss types; receiving data describing a
property insurance loss claim; determining whether the property
insurance loss claim is a qualifying claim; identifying a property
loss type associated with the property insurance loss claim;
selecting one of the plurality of models based on the identified
property loss type; processing the data describing the property
insurance loss claim using the selected model; and determining
whether to analyze the property insurance loss claim for
subrogation based on the processing.
6. The non-transitory computer readable storage medium of claim 5
wherein the certain textual data removed from the set of
unstructured textual data comprises one or more of numbers,
punctuation, stop words, and context-specific words that occur
within the set of unstructured textual property claims data with a
predetermined level of frequency.
7. The non-transitory computer readable storage medium of claim 5
wherein the property loss type comprises one of water, fire and
damage by vehicle.
8. The non-transitory computer readable storage medium of claim 5
wherein the determining comprises determining whether the property
insurance loss claim was initiated within the past 7 to 14
days.
9. A system comprising: memory operable to store at least one
program; and at least one processor communicatively coupled to the
memory, in which the at least one program, when executed by the at
least one processor, causes the at least one processor to perform a
method comprising: receiving a set of unstructured textual property
claims data corresponding to a plurality of property loss insurance
claims, wherein the set of unstructured textual property claims
data comprises data describing claims associated with subrogation
and claims not associated with subrogation, wherein a number of the
claims associated with subrogation is the same as a number of the
claims not associated with subrogation; removing at least certain
textual data from the set of unstructured textual property claims
data to form an initial vocabulary set of textual data; adding one
or more bigrams to the vocabulary set of textual data to form a
final vocabulary set of textual data; identifying a property loss
type for each of the plurality of property loss insurance claims;
dividing the final vocabulary set of textual data into a plurality
of subsets of textual data based on the property loss type; and
training a model using each of the plurality of subsets of textual
data to form a plurality of trained models, each corresponding to
one of the property loss types; receiving data describing a
property insurance loss claim; determining whether the property
insurance loss claim is a qualifying claim; identifying a property
loss type associated with the property insurance loss claim;
selecting one of the plurality of models based on the identified
property loss type; processing the data describing the property
insurance loss claim using the selected model; and determining
whether to analyze the property insurance loss claim for
subrogation based on the processing.
10. The system of claim 9 wherein the certain textual data removed
from the set of unstructured textual data comprises one or more of
numbers, punctuation, stop words, and context-specific words that
occur within the set of unstructured textual property claims data
with a predetermined level of frequency.
11. The system of claim 9 wherein the property loss type comprises
one of water, fire and damage by vehicle.
12. The system of claim 9 wherein the determining comprises
determining whether the property insurance loss claim was initiated
within the past 7 to 14 days.
Description
FIELD OF THE INVENTION
The invention relates to improved computer modeling techniques.
BRIEF SUMMARY OF EMBODIMENTS OF THE INVENTION
Embodiments of the present invention relate to systems, methods and
computer readable storage media for storing instructions which,
when executed by a processor, perform a method for implementing
improved computer modeling techniques. In connection with the
improved modeling, a set of unstructured textual property claims
data, corresponding to a plurality of property loss insurance
claims, is received. At least certain textual data is removed from
the set of unstructured textual property claims data to form an
initial vocabulary set of textual data. One or more bigrams are
added to the vocabulary set of textual data to form a final
vocabulary set of textual data. A property loss type is identified
for each of the plurality of property loss insurance claims. The
final vocabulary set of textual data is divided into a plurality of
subsets of textual data based on the property loss type. A model is
trained using each of the plurality of subsets of textual data to
form a plurality of trained models, each corresponding to one of
the property loss types.
In some embodiments, the textual data removed from the set of
unstructured textual data includes one or more of numbers,
punctuation, stop words, and context-specific words that occur
within the set of unstructured textual property claims data with a
predetermined level of frequency.
In some embodiments, the property loss type comprises one of water,
fire and damage by vehicle.
Further, in certain embodiments, data describing a property
insurance loss claim is received. It is determined whether the
property insurance loss claim is a qualifying claim. If so, a
property loss type associated with the property insurance loss
claim is identified. One of the plurality of models is selected
based on the identified property loss type. Data describing the
property insurance loss claim is processed using the selected
model. Based on the processing, it is determined whether to analyze
the property insurance loss claim for subrogation. In some
embodiments, the determining includes determining whether the
property insurance loss claim was initiated within the past 7 to 14
days.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing summary, as well as the following detailed
description of embodiments of the invention, will be better
understood when read in conjunction with the appended drawings of
an exemplary embodiment. It should be understood, however, that the
invention is not limited to the precise arrangements and
instrumentalities shown.
In the drawings:
FIG. 1 illustrates an exemplary process of one embodiment of the
present invention;
FIG. 2 illustrates an exemplary process for filtering and preparing
claims for scoring;
FIG. 3 illustrates an exemplary process for scoring claims;
FIG. 4 illustrates an exemplary process for preparing scored claims
for daily extract;
FIG. 5A illustrates an exemplary process for training a model;
FIG. 5B illustrates an exemplary process for executing a model;
FIG. 6 illustrates exemplary software components for use in
carrying out an embodiment of the present invention; and
FIG. 7 illustrates an exemplary system of the present
invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The modeling techniques described herein allow for automatic
detection of recent insurance claims that have the potential for
subrogation. Claims departments of insurance companies have a need
for a standard review process to identify subrogation opportunities
missed by claims adjusters employing a manual process. Subrogation
is the legal doctrine that allows an insurance company to sue to
recover losses caused by a third party. With the sheer volume of
claims to be handled, insurance companies frequently miss
subrogation opportunities, which means lost opportunities to
recover for losses caused by a third party.
Previously, companies have relied on adjusters to detect
subrogation opportunities manually based on the facts of the case.
Typically, there was no formal review process for double-checking
claims, except for claims that resulted in a large payout for the
company. If a claims adjuster missed a subrogation opportunity on a
smaller claim, the opportunity was lost forever.
Described herein is a method and system that employs an improved
computer system for uncovering the claims that are the most likely
candidates for subrogation. In connection with the system and
method, subrogation can be reasonably accurately predicted based on
the unstructured text within the logs of claims adjusters. The
process below describes how a limited amount of structured data and
copious amounts of unstructured text is used to detect missed
subrogation opportunities efficiently.
The following describes a process for building the model in
connection with one embodiment of the invention. The problem to
solve involves binary classification: a claim is either a
subrogation opportunity or it is not. In order to build the model,
a random subset of past claims is chosen, each of which has been
classified as either "subrogation" or "no subrogation." Because
subrogation is a rare event, the majority of claims are not
subrogation opportunities. If a sample representative of the
population were used, any modeling technique would have a difficult
time trying to detect the true differences between the two
classification classes (i.e., subrogation and no subrogation).
Therefore, the same number of "no subrogation" claims and
"subrogation" claims are sampled randomly, so that the response
variable in the sample has a 1:1 ratio.
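The balanced-sampling step described above can be sketched in Python. This is a hypothetical illustration, not the patent's code; the `label` field name and the fixed seed are assumptions made for the example:

```python
import random

def balanced_sample(claims, seed=42):
    """Randomly downsample so "subrogation" and "no subrogation"
    claims appear in a 1:1 ratio in the training sample."""
    pos = [c for c in claims if c["label"] == "subrogation"]
    neg = [c for c in claims if c["label"] == "no subrogation"]
    n = min(len(pos), len(neg))  # subrogation is the rare class
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)
```

Because "no subrogation" claims vastly outnumber "subrogation" claims, `n` is in practice the count of the subrogation class, and the majority class is the one being downsampled.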
The data used in connection with the technique is not structured in
a table. Instead, the data is the unstructured, free-form text
contained in the claims logs written by a claims adjuster. The
differences in words contained within subrogation claims versus
non-subrogation claims are the subject of the model. Generally,
several log entries are attached to a single claim. These log
entries are grouped together to comprise all the text associated
with a single claim. Known natural language processing techniques
are used to first prepare, and then train, the data.
Not all text contained in the log is important. Thus, the text
first must be cleaned. For example, the claims log text is stripped
of numbers, punctuation, stop words, infrequently occurring words,
and a small set of frequently occurring unimportant words (e.g.,
"claim"). The unique set of words that remains comprises the
model's vocabulary. A selection of pre-identified bigrams (two-word
phrases) is then added to this vocabulary. In some circumstances,
two words in combination are more meaningful than they are
separately. For example, the occurrence of "drunk driver" is a more
powerful indicator of subrogation than simply individual
occurrences of "drunk" or "driver." By way of additional example,
"Los Angeles" is a city, but "los" and "angeles" are meaningless
apart.
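A minimal sketch of this cleaning step follows. The stop-word, unimportant-word, and bigram lists here are tiny illustrative stand-ins for the real ones, and the infrequent-word filtering mentioned above is omitted for brevity:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "was"}   # illustrative subset
UNIMPORTANT = {"claim"}                    # frequent but uninformative words
BIGRAMS = {"drunk driver", "los angeles"}  # pre-identified two-word phrases

def build_vocabulary(logs):
    """Strip numbers, punctuation, and stop words from the claims logs,
    then add any pre-identified bigrams found in the text."""
    vocab = set()
    for text in logs:
        text = text.lower()
        vocab.update(bg for bg in BIGRAMS if bg in text)
        words = re.findall(r"[a-z]+", text)  # drops numbers and punctuation
        vocab.update(w for w in words
                     if w not in STOP_WORDS and w not in UNIMPORTANT)
    return vocab
```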
The unedited, unstructured text is then transformed into a vector
space model of the vocabulary words that were contained within the
claims text. This term-document matrix contains the word counts of
each word within each document. Thus, instead of readable
sentences, each document can be expressed by the number of
occurrences of each of the words in vector space.
For example, given the text from two documents as follows:
Doc1="The dog chased the cat."
Doc2="The cat followed the cat."
the following term-document matrix is generated.
TABLE-US-00001 Term-Document Matrix:

| Terms | Doc1 | Doc2 |
| --- | --- | --- |
| cat | 1 | 2 |
| chased | 1 | 0 |
| dog | 1 | 0 |
| followed | 0 | 1 |
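The example above can be reproduced with a few lines of standard-library Python (with the stop word "the" removed, as in the text):

```python
import re
from collections import Counter

def term_document_matrix(docs, stop_words=frozenset({"the"})):
    """Count occurrences of each vocabulary term in each document.
    Returns the sorted term list and a terms-by-documents count matrix."""
    counts = [Counter(w for w in re.findall(r"[a-z]+", d.lower())
                      if w not in stop_words) for d in docs]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]

docs = ["The dog chased the cat.", "The cat followed the cat."]
terms, matrix = term_document_matrix(docs)
# terms  -> ['cat', 'chased', 'dog', 'followed']
# matrix -> [[1, 2], [1, 0], [1, 0], [0, 1]]  (rows = terms, cols = docs)
```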
To improve model performance, the raw word count matrix is
converted to a term frequency-inverse document frequency (TF-IDF)
matrix. TF-IDF is a weighting factor that reflects how important a
word is to a document in the whole collection of documents. The
statistic increases with the number of times a word appears in a
document, but is offset by the frequency of the word in the
collection as a whole. This helps determine what the most important
words are in that specific document. This is the model's input and
it is linked to the claim's response variable ("yes subrogation" or
"no subrogation").
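The TF-IDF weighting can be sketched as follows. This uses the textbook `tf * log(N/df)` form; the patent does not specify which TF-IDF variant is used, so the exact formula here is an assumption:

```python
import math
from collections import Counter

def tfidf_weights(docs_tokens):
    """Given a list of token lists (one per document), return one dict
    per document mapping term -> tf * idf. Terms appearing in every
    document get an idf of 0, offsetting their raw frequency."""
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))  # document frequency
    weighted = []
    for doc in docs_tokens:
        tf = Counter(doc)  # raw term frequency in this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted
```

Note how a word common to all documents (idf = log(1) = 0) contributes nothing, while a word distinctive to one document keeps a positive weight; this is the offsetting behavior described above.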
In the preferred embodiment, the improved modeling technique
employs four separate models, based on four general causes of
damage. The rationale for this multi-model approach is that
different kinds of words can suggest subrogation, depending on what
caused the damage that gave rise to the claim. In order to achieve
this further classification, in one embodiment, available
structured data may be used. For example, each claim may be
associated with a Loss Cause Code, which indicates the general
cause of the damage. Loss Cause Codes may be used to group claims
into five general Loss Categories: "damage by vehicle," "fire,"
"water," "act of God," and "miscellaneous." First, "act of God"
claims (e.g., claims caused by weather and natural disasters) may
be filtered out because they are very rarely ever subrogation
opportunities. Four unique models may then be trained on the
remaining groups. This division allows for achieving a more
accurate view of the true differentiators between the two response
variables, because the claims contained within each category have a
similar cause of damage. For example, every "damage by vehicle"
claim will contain words like "car" and "driver." Because every
claim within the "damage by vehicle" model contains such words, the
model can actually look beyond the similar words and detect the
more powerful words that detect subrogation opportunities, such as
"drunk."
The four training sets are then used to build four Support Vector
Machine (SVM) models. An SVM is a supervised classification
technique, which attempts to find the optimal way to split the
response variables (i.e., "yes subrogation" vs. "no subrogation").
Other techniques could be used (e.g., Naive Bayes and Logistic
Regression) within the scope of the present invention. The models
are then validated on a hold-out test set in order to ensure the
results remain robust for unseen data.
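The train-then-validate workflow can be illustrated with a toy linear classifier. This is a simple perceptron stand-in, not an SVM; the actual embodiment trains an SVM, which likewise learns a separating hyperplane, but with a margin-maximizing objective this sketch does not implement:

```python
import random

def train_linear(X, y, epochs=50, lr=0.1, seed=0):
    """Learn a separating hyperplane w.x + b = 0 for binary labels
    y in {0, 1}. A perceptron stand-in for the SVM in the text."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, X[i])) + b > 0 else 0
            if pred != y[i]:  # update weights only on mistakes
                delta = lr * (y[i] - pred)
                w = [wi + delta * xi for wi, xi in zip(w, X[i])]
                b += delta
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

In practice each of the four models would be fit on its Loss Category's TF-IDF vectors and then checked against a hold-out set of claims never seen during training.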
The models are trained using a computer. The models and supporting
code may be executed daily using the model execution engine
environment, as described further herein.
In one embodiment, the model execution occurs regularly, e.g.,
every morning. At each execution (e.g., every morning), new claims
are scored, and the highest-scored claims (i.e., most likely for
subrogation) are passed on to a subrogation specialist for further
review. The following provides a technical overview of that
process, in one exemplary embodiment.
After a model instance is generated, qualifying claims are prepared
for scoring. In the exemplary embodiment, Python programming
language is used to score the claims using the models that have
been previously trained and saved. Other programming languages can
be used within the scope of the present invention. Thus, a computer
is programmed (e.g., in Python) using the particular instructions
described herein, thereby improving the functioning of the
computer. The scored claims are then ordered and prepared for the
daily extract. Finally, the top scoring claims of the day are sent
to a subrogation specialist for further review.
FIG. 1 illustrates the high-level process in an exemplary
embodiment. In step 101, the model instance is generated. In step
102, the claims are filtered and prepared for scoring. In step 103,
new claims are scored. In step 104, scored claims are prepared for
extract. In step 105, extracts are transmitted to a specialist for
review, e.g., daily, via email.
As described previously, only qualifying claims are sent for
scoring. More particularly, in one embodiment, the claim is not
sent immediately for review using the model. Approximately one
week's time passes to allow the claims adjuster to collect relevant
facts and, in some instances, identify a subrogation opportunity
manually. The model only scores claims that are between about 7 and
14 days old, in one embodiment. However, claims that fall within
other time frames in terms of age may be eligible for scoring
within the scope of the present invention. If a subrogation
opportunity is not caught by the 14th day, then the opportunity is
typically lost. In the exemplary embodiment, the model runs every
weekday, and claims are rescored every day while they are in the
7-14 day window. Claims are rescored daily in order to account for
any new information that may have been added to the claims
logs.
Next, claims are grouped into Loss Categories based on their
individual Loss Cause Codes. Claims that fall under the categories
"Act of God" and "Liability" are filtered out and are not scored.
These types of claims have an extremely low rate of subrogation
simply due to the nature of the claims. For "Damage by Vehicle"
claims, the data regarding such claims can be processed to
determine whether the homeowner was the driver of the vehicle
(i.e., if the policyholder drove into his own house, there is no
third party at fault and, therefore, there is no subrogation
opportunity). Claims for which the policyholder was at fault are
filtered out. Thus, a claim that is qualified to be scored, in the
preferred embodiment, (1) is between 7 and 14 days old, (2) is not
an "Act of God" or "Liability" claim, and (3) is not a "Damage by
Vehicle" claim in which the policyholder was the driver at fault.
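Those three qualification rules can be expressed as a single predicate. The field names here are illustrative assumptions, not from the patent:

```python
def is_qualified(claim):
    """Return True only for claims that may be scored: 7-14 days old,
    not "Act of God" or "Liability", and not a "Damage by Vehicle"
    claim in which the policyholder was the driver at fault."""
    if not 7 <= claim["age_days"] <= 14:
        return False
    if claim["loss_category"] in {"Act of God", "Liability"}:
        return False
    if claim["loss_category"] == "Damage by Vehicle" and claim["policyholder_at_fault"]:
        return False
    return True
```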
At this stage, the process has made use of the structured data
available. Now, the unstructured text of the claims logs is
brought in and linked up with their respective claims. Claims Level
Data contains data at the claim level. Activity Level Data contains
data at the claims log level, which includes several claims logs,
as information comes in at different times.
FIG. 2 provides an overview of the process for filtering and
preparing claims for scoring in accordance with an exemplary
embodiment of the invention. In step 201, structured data,
including base claim data and loss event data, is obtained from one
or more databases. In step 202, it is determined whether the claim
is between 7-14 days old. If not, the claim is not appropriate for
scoring using the model. If so, claims associated with similar loss
cause codes are grouped into loss categories in step 203. In step
204, it is determined if the loss category is an "Act of God" or
"Liability". If not, it is determined if the owner was at fault, in
step 205. In step 206, unstructured data from the claims logs is
provided and, in step 207, the structured claims data is merged
with the unstructured data from the claim logs for their respective
claims.
Claims level data include attributes about the overall claim
(identifiers, reporting dates, etc. including the linkage to key
relationships such as the parties associated with the claim and any
opened claim coverages that are identified as potential insurance
company obligations with respect to the claim by the claims
adjuster). Activity level data are updates and modifications that
are appended to the claim (e.g., via database record linking and
relationships) over time and after the initial claim reporting
event. For the purposes of the embodiments herein, these are
textual entries or log notes that are stamped with activity
date-time information, the user id of the individual making the
update, and the raw notes entered. Each activity level record/log
can be treated as an individual document and/or the entire
collection of activity level updates can be considered a document
as well.
The claims logs of qualifying claims are aggregated at the claim
level. The unstructured text may be cleaned in the manner described
above with regard to model-building. The remaining text is
tokenized and vectorized. Each individual claim is scored against
one of the four models, depending on the Loss Category with which
it is associated. After being fed through the model, a claim is
assigned a probability score (e.g., a mathematical representation
of the likelihood of subrogation), based on the previously trained
model. The claims and their respective scores across all four Loss
Categories are then grouped together and saved.
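The routing-and-scoring step can be sketched as follows. Each entry in `models` stands for one of the four trained Loss Category models; in this illustration they are dummy callables, and the field names are assumptions:

```python
def score_claims(claims, models):
    """Route each qualifying claim to the model for its Loss Category
    and attach the resulting probability score."""
    scored = []
    for claim in claims:
        model = models[claim["loss_category"]]  # one of the four trained models
        scored.append({**claim, "score": model(claim["vector"])})
    return scored
```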
With reference to FIG. 3, this process is illustrated. In step 301,
data regarding qualifying claims is received. In step 302, the
textual data is tokenized and otherwise processed. In step 303, the
Loss Category for the claim is determined and the data for the
claim is fed into the appropriate model. In step 304, the claims
are scored using each of the models using, e.g., the SVM modeling
as described elsewhere herein. In step 305, the scored claims are
merged and, in step 306, the data comprising the claims and
associated scores are output.
Referring to step 304 in particular, each model consists of an SVM
classifier, which is an established mathematical technique for
handling classification (i.e., separating data into two classes in
an extremely high-dimensional context--in the embodiments herein,
thousands of dimensions). The modeling technique relies on
extensive matrix and vector mathematics using extremely large
volumes of standard linear algebra calculations. In the embodiment
herein, the classification output is binary (i.e., subrogatable
claims and non-subrogatable claims). The SVM model built for each
category of claims essentially constructs a mathematical hyperplane
(also known as a decision surface) that splits the m-dimensional
space (m is roughly 2200+ dimensions in the exemplary embodiment,
but could be tens of thousands in alternative iterations because
each dimension is derived from the words in the document that have
a high degree of discriminative power) into the two classes such
that each claim can be identified as residing in the subrogatable
space or the non-subrogatable space. Raw statistics from the method
allow for ranking each claim from 1 to N based on the order of the
predictions (e.g., the first claim has the highest relative
predicted probability of indeed being subrogatable, the second has
the second-highest predicted probability, etc.). As described
herein, logistic regression or Naive Bayes (and potentially other
techniques) can be used, but the exemplary embodiment employs an
SVM classifier. TF-IDF, which is also used in the solution of an
exemplary embodiment as an input to the current SVM classifier, is
another means of identifying relevant documents and ranking them
(e.g., you could identify a document from a corpus that is most
similar to another document that had been previously identified as
a positive subrogation example). Models of the same or different
types can be combined into an ensemble where each classification
decision from such a plurality of models is arrived at via
weighted voting, voting with tie-breaking rules, or other means of
choosing a decision from the classification scores from each
technique (e.g., another example from such an ensemble might be to
choose the class based on a "highest probability wins" rule).
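A weighted-voting ensemble of the kind described might look like the following. The equal default weights and the 0.5 threshold are illustrative assumptions:

```python
def ensemble_vote(probs, weights=None, threshold=0.5):
    """Combine per-model P(subrogation) estimates by weighted
    averaging and return the winning class."""
    if weights is None:
        weights = [1.0] * len(probs)  # unweighted vote by default
    combined = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return "subrogation" if combined > threshold else "no subrogation"
```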
After scoring, further work is required to prepare the daily
extract file and save the day's scores to the database, as
illustrated with reference to FIG. 4. The claim scores (output of
step 306 of FIG. 3) are merged back together with the Base Claim
and Loss Event tables, so that all relevant information has been
gathered, in step 401. A check is made, in step 402, to ensure that
none of the claims have already been open for subrogation, as this
would result in redundant work. The claims are then ordered
according to their probability scores, in step 403. The top claims
are then selected, in step 404, and compiled in a daily subrogation
extract file, in step 405. The extract file may then, e.g., be sent
to a subrogation specialist for further review.
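The extract-preparation steps above can be sketched as follows; the `top_n` cutoff and the field names are assumptions for illustration:

```python
def daily_extract(scored_claims, top_n=25):
    """Drop claims already open for subrogation, order the rest by
    probability score, and keep the top N for the daily extract file."""
    candidates = [c for c in scored_claims if not c["already_open"]]
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:top_n]
```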
The process described herein is efficient and highly robust and can
significantly increase subrogation opportunities on an ongoing
basis. Whereas a single claim may take an adjuster minutes to
review for subrogation potential, the model is capable of reviewing
tens of thousands of claims in a matter of minutes. In the
exemplary embodiment, the computer that is improved by being
programmed with the code described herein has a nominal run time of
approximately 22 minutes or less.
Further, the only current mechanism for identifying subrogation
opportunities is manual review. The improved computer modeling
process removes the inconsistency inherent in claims adjusters
working manually and independently in the field and reduces
identification to one standardized and highly automated process
that is different from, and cannot be achieved by, a manual
process.
For example, the computerized process described herein can score
approximately 10,000 or more claims on a daily basis. Each claim
may have one or more log notes (activity entries) on day 1 and
could have between 20 and 40 by the time the 7 day window,
described elsewhere herein, has been reached. The average claim
has over 7 entries, each with over 100 words and potentially a few
thousand words. Even after some of the text
processing described herein to reduce the words, a matrix of tens
of thousands of documents by tens of thousands of terms may result.
Mathematical operations must occur in both dimensions of the matrix
(row-based and column-based), and additional operations must
combine those results just to produce the weighted TF-IDF scores.
For example, 10,000 documents × 50,000 terms would yield an initial
base of 500 MM operations just to generate a portion of the data
used in the process. Additional passes over the structure would be
required to produce the final TF-IDF score. The SVM classifier also
must calculate across a similar matrix space focused on a critical
subset of the terms that are identified, which results in a large
volume of additional vector and matrix math. If the current
embodiment is scaled, the resulting size of the matrices would
scale in both dimensions (thus, geometric growth). Given human
error rates on calculations, it is not possible for humans to
process the data and perform the calculations needed to accurately
identify the subrogation opportunities as described herein. Only a
computer can complete these operations quickly enough to allow this
process to be executed on at least a daily basis.
In an exemplary embodiment, the subrogation model is trained and
executed using Python programming language, as described below.
However, as will be understood by those skilled in the art, other
programming languages can be used within the scope of the present
invention. Exemplary modules are illustrated and described with
reference to FIGS. 5A and 5B.
Claims.py Module 502
This module reads in Activity Level Data and performs tokenizing
(i.e., breaking up individual words in a long string into a list of
words) and some processing on the text. The user calls
ClaimsProcessor to read in a tab delimited file with the columns
"Base_Claim_Id", "Activity_Dt", "Activity_Tm", "Log_Detail_Ds". The
other classes in the module assist ClaimsProcessor in performing
its tasks. ClaimsProcessor also has the ability to process a claims
file in connection with model execution.
Subro.py Module 503
This module contains two classes: SubroModelingSet and
SubroDataProcessor. SubroModelingSet is
executed on the analyst's computer to build the actual models from
a training set of past claims. The user can specify the proportion
of data to be excluded as a holdout set upon which to test the
newly trained model. This object provides the option of subsetting
the data by some list of covariate fields. In the case of the
Subrogation Model, the data is subset by the Loss_Category field.
Four separately trained models can then be created, one for each
unique loss category, from a single SubroModelingSet class object.
SubroDataProcessor is used during production. It assumes that model
objects have been saved in the appropriate location. It brings in
new claims and scores them against the saved models, relative to
Loss Category. It returns an ordered list of claims, with the
highest scored claims at the top. Then, it writes that ordered list
to a daily extract file which the user specifies in the initiation
of the class object.
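The per-category training that SubroModelingSet performs might be sketched as follows. The function `train_per_category` and the `train_model` callable are hypothetical stand-ins for the actual SVM-fitting code; only the Loss_Category field name comes from the patent:

```python
from collections import defaultdict

def train_per_category(claims, train_model):
    """Subset training claims by Loss_Category and fit one model
    per unique category. `claims` is an iterable of dicts with a
    'Loss_Category' key; `train_model` fits a model to a subset."""
    by_category = defaultdict(list)
    for claim in claims:
        by_category[claim["Loss_Category"]].append(claim)
    # One separately trained model per unique loss category.
    return {cat: train_model(subset) for cat, subset in by_category.items()}
```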
The modules described herein provide a suite of methods that clean
up the text, such as removing white space and removing instances of
multiple punctuation characters. They make use of predetermined
Regular Expression rules to perform many of their tasks. They also
contain the BigramReplacer and AbbrevReplacer classes to handle
pre-specified bigrams and abbreviations. The clean() method is
called by ClaimLogEntry; replace_bigrams() and replace_abbrevs()
are called by NLP.Vocab as well as NLP.Vectorizer. In other
embodiments, text processing steps may occur in a unified manner,
e.g., bigrams and abbreviations may be handled upfront during the
claims processing stage.
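The regex-based clean-up and the bigram/abbreviation replacement described above might be sketched as below. The specific abbreviation and bigram tables are illustrative assumptions; the patent's predetermined Regular Expression rules are not disclosed here:

```python
import re

ABBREVS = {"veh": "vehicle", "ins": "insured"}   # assumed examples
BIGRAMS = {("rear", "ended"): "rear_ended"}      # assumed example

def clean(text):
    """Strip runs of punctuation and collapse white space."""
    text = re.sub(r"[!?.]{2,}", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def replace_abbrevs(tokens):
    """Expand pre-specified abbreviations token by token."""
    return [ABBREVS.get(t, t) for t in tokens]

def replace_bigrams(tokens):
    """Join pre-specified two-word phrases into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAMS:
            out.append(BIGRAMS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```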
NLP.Vocab analyzes all of the text from all of the documents and
creates a vocabulary of unique words. It finds and handles bigrams
and abbreviations. It also filters out stop words, infrequent
words, numbers, non-dictionary terms, common first names, and
frequent words that occur in every claim regardless of loss class
(e.g., policyholder, damages, etc.).
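A partial sketch of NLP.Vocab's filtering follows: it keeps unique words while dropping stop words, infrequent words, numbers, and words that occur in essentially every claim. The thresholds and the stop-word list are illustrative assumptions, and the non-dictionary-term and first-name filters are omitted for brevity:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of"}  # assumed, truncated stop-word list
MIN_COUNT = 2                    # assumed minimum corpus frequency
MAX_DOC_FRACTION = 0.9           # assumed "occurs everywhere" cutoff

def build_vocab(documents):
    """documents: list of token lists; returns a sorted vocabulary
    of unique words surviving the filters described above."""
    counts = Counter(t for doc in documents for t in doc)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    n_docs = len(documents)
    return sorted(
        t for t, c in counts.items()
        if t not in STOP_WORDS                     # stop words
        and c >= MIN_COUNT                         # infrequent words
        and not t.isdigit()                        # numbers
        and doc_freq[t] / n_docs <= MAX_DOC_FRACTION  # ubiquitous words
    )
```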
NLP.Vectorizer takes a constructed vocabulary list as well as the
document texts and turns each document into a vector of word counts
in which each word in the vocabulary is granted its own column.
This process creates a large, sparse matrix with the rows
representing individual documents and the columns representing
individual words. The class also provides functionality to convert
this large feature matrix into a TF-IDF transform. This is
especially important if the SVM model is to be tested on a holdout
set, because the test set can be hidden from the TF-IDF training
step. By way of further explanation, when a model is fit to data,
the data set may synonymously be referred to as the training,
learning, or fitting set. A separate hold-out sample is used for
testing and validation purposes. The test/hold-out set should not
include the TF-IDF scores that were calculated during the learning
phase; thus, in a preferred embodiment, such scores would not be
included if the goal is to understand the isolated performance of
the SupportVectorClassifier in a pure sense. Thus, the TF-IDF
matrix should not be built with the test set included, in the
preferred embodiment, because it would impact the weights assigned
to each vector.
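The train/test separation above can be sketched minimally: the IDF weights are learned from the training documents only, and held-out documents are transformed with those same weights. This uses the classic idf = log(N/df) formula as an assumption; the patent's exact weighting scheme may differ:

```python
import math
from collections import Counter

def fit_idf(train_docs, vocab):
    """Learn IDF weights from the training set only."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc) if t in vocab)
    return {t: math.log(n / df[t]) if df[t] else 0.0 for t in vocab}

def transform(doc, vocab, idf):
    """Turn one document into a TF-IDF vector, one column per
    vocabulary word, reusing the training-set IDF weights."""
    tf = Counter(t for t in doc if t in vocab)
    return [tf[t] * idf[t] for t in vocab]
```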
Referring now more specifically to the TF-IDF score, the TF-IDF
score is calculated from the matrix of documents (rows) and terms
(columns). TF-IDF is then added to the matrix itself from which it
was calculated. This TF-IDF score establishes a measure of
discriminative power of the term based on its term frequency (how
often it occurs in a given document) and its inverse document
frequency (IDF)--how many documents in which the term occurs. For
example, the word "the" has a high TF score, but a low IDF score
since it will appear in nearly every document. Thus, a word that
actually does occur in every document sends IDF to zero, and thus
the TF × IDF product becomes zero, indicating that the word
(e.g., "the") has no discriminative power in classifying the
document corpus (collection of documents).
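A worked version of that example, assuming the common idf = log(N/df) formulation: a word occurring in every document gets idf = log(1) = 0, so its TF × IDF product is zero no matter how often it appears:

```python
import math

def tf_idf(term_count_in_doc, n_docs, docs_containing_term):
    """TF x IDF with the common idf = log(N / df) formula."""
    idf = math.log(n_docs / docs_containing_term)
    return term_count_in_doc * idf
```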
NLP.SupportVectorClassifier builds the model and classifies new
data against it. This class can build either a support vector
machine or a logistic regression, depending on user input. The two
fit very well into the same class together because they are
essentially the same model, but for their respective cost
functions. Once the class object is instantiated and a model is
built, new claims can be scored using get_probs( ) which returns
the probabilities of "yes subrogation" for new claims.
SubroDataProcessor calls this method to order claims by their
likelihood of subrogation opportunity based on the trained model.
Both classify( ) and get_probs( ) perform a TF-IDF transformation
on the test data using the same weights that were used to transform
the training data matrix.
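The "same model, different cost function" point can be illustrated with a toy stand-in for NLP.SupportVectorClassifier: a single linear classifier whose training loss is either hinge (SVM) or logistic, chosen by user input, trained by plain subgradient descent. This is a sketch of the concept, not the patent's actual implementation, and the class and method names other than get_probs are assumptions:

```python
import math

class LinearClassifier:
    """Illustrative linear model: loss='hinge' gives an SVM-style
    fit, loss='log' a logistic-regression-style fit."""

    def __init__(self, loss="hinge", lr=0.1, epochs=200):
        self.loss, self.lr, self.epochs = loss, lr, epochs

    def fit(self, X, y):  # labels y in {-1, +1}
        self.w = [0.0] * len(X[0])
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                margin = yi * sum(w * x for w, x in zip(self.w, xi))
                if self.loss == "hinge":
                    g = -yi if margin < 1 else 0.0   # hinge subgradient
                else:
                    g = -yi / (1 + math.exp(margin))  # logistic gradient
                self.w = [w - self.lr * g * x for w, x in zip(self.w, xi)]
        return self

    def get_probs(self, X):
        """Probability of 'yes subrogation' via a sigmoid of the score."""
        return [1 / (1 + math.exp(-sum(w * x for w, x in zip(self.w, xi))))
                for xi in X]
```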
NLP.py Module 504
This module is used to build the models. A description of text
classification concepts such as vectorizing, TF-IDF, and SVM is
provided above.
Build_Models.py Module 501/do_CV.py Modules
These modules subset data by Loss Category and then train and test
models. These modules can be used to determine how to call these
classes (and how the classes themselves interact) to create
production-grade SVM models saved to .pkl objects.
Text_processing.py may be a subcomponent (i.e., for convenience of
code maintenance) of the NLP.py module 504 that assists in further
preparing the text after the tokenize step in "ClaimsProcessor".
Specifically, it handles the "Vocab" and "Vectorizer" components
shown in FIG. 5A as part of the NLP.py module 504.
Main.py Module 505
This is the module that is called to run on the server, with
models already built and stored in the models subdirectory. It
requires three inputs: a claims-level file, an activity-level
file, and an output file.
FIG. 6 illustrates components of an exemplary production model
execution environment, with functional dependencies noted. FIG. 6
shows how a production modeling environment may consist of
scheduling and storage components, as well as an execution engine.
FIG. 6 shows that the model execution engine 601 may read and write
to a database 602. It may also invoke models from the model storage
environment 603, and receive notifications from the model storage
environment 603. The model storage environment 603 may call the
scheduler 604 and receive notifications from the same. The database
602 feeds data to consuming applications 605.
Exemplary hardware and software employed by the systems are now
generally described with reference to FIG. 7. Database server(s)
700 may include a database services management application 706 that
manages storage and retrieval of data from the database(s) 701,
702. The databases may be relational databases; however, other data
organizational structures may be used without departing from the
scope of the present invention. One or more application server(s)
703 are in communication with the database server 700. The
application server 703 communicates requests for data to the
database server 700. The database server 700 retrieves the
requested data. The application server 703 may also send data to
the database server for storage in the database(s) 701, 702. The
application server 703 comprises one or more processors 704,
computer readable storage media 705 that store programs (computer
readable instructions) for execution by the processor(s), and an
interface 707 between the processor(s) 704 and computer readable
storage media 705. The application server may store the computer
programs referred to herein, including those used in training and
executing the models.
To the extent data and information are communicated over the
Internet, one or more Internet servers 708 may be employed. The
Internet server 708 also comprises one or more processors 709,
computer readable storage media 711 that store programs (computer
readable instructions) for execution by the processor(s) 709, and
an interface 710 between the processor(s) 709 and computer readable
storage media 711. The Internet server 708 is employed to deliver
content that can be accessed through the communications network,
e.g., by end user 712. When data is requested through an
application, such as an Internet browser, the Internet server 708
receives and processes the request. The Internet server 708 sends
the data or application requested along with user interface
instructions for displaying a user interface.
The computers referenced herein are specially programmed to perform
the functionality described herein as performed by the software
programs.
The non-transitory computer readable storage media may include
volatile and non-volatile, removable and non-removable media
implemented in any method or technology for storage of information
such as computer-readable instructions, data structures, program
modules, or other data. Computer readable storage media may
include, but is not limited to, RAM, ROM, Erasable Programmable ROM
(EPROM), Electrically Erasable Programmable ROM (EEPROM), flash
memory or other solid state memory technology, CD-ROM, digital
versatile disks (DVD), or other optical storage, magnetic
cassettes, magnetic tape, magnetic disk storage or other magnetic
storage devices, or any other medium which can be used to store the
desired information and which can be accessed by the computer
system.
It will be appreciated by those skilled in the art that changes
could be made to the exemplary embodiments shown and described
above without departing from the broad inventive concept thereof.
It is understood, therefore, that this invention is not limited to
the exemplary embodiments shown and described, but it is intended
to cover modifications within the spirit and scope of the present
invention as defined by the claims. For example, specific features
of the exemplary embodiments may or may not be part of the claimed
invention and features of the disclosed embodiments may be
combined. Unless specifically set forth herein, the terms "a", "an"
and "the" are not limited to one element but instead should be read
as meaning "at least one".
It is to be understood that at least some of the figures and
descriptions of the invention have been simplified to focus on
elements that are relevant for a clear understanding of the
invention, while eliminating, for purposes of clarity, other
elements that those of ordinary skill in the art will appreciate
may also comprise a portion of the invention. However, because such
elements are well known in the art, and because they do not
necessarily facilitate a better understanding of the invention, a
description of such elements is not provided herein.
Further, to the extent that the method does not rely on the
particular order of steps set forth herein, the particular order of
the steps should not be construed as limitation on the claims. The
claims directed to the method of the present invention should not
be limited to the performance of their steps in the order written,
and one skilled in the art can readily appreciate that the steps
may be varied and still remain within the spirit and scope of the
present invention.
* * * * *