U.S. patent application number 10/248962 was filed with the patent office on 2003-03-05 and published on 2003-09-18 for system and method for classification of documents.
Invention is credited to Moon, Charles, Torossian, Vasken.
Application Number | 20030177118 10/248962 |
Document ID | / |
Family ID | 27807593 |
Filed Date | 2003-03-05 |
United States Patent
Application |
20030177118 |
Kind Code |
A1 |
Moon, Charles ; et
al. |
September 18, 2003 |
SYSTEM AND METHOD FOR CLASSIFICATION OF DOCUMENTS
Abstract
The invention provides a classification engine for classifying
documents that makes use of functions included in a similarity
search engine. The classification engine executes a classify
command from a client that makes use of similarity search results,
and rules files, classes files, and a classification profile
embedded in the classification command. When the classification engine
receives a classify command from a client, it retrieves a
classification profile and input documents to be classified, sends
extracted values from the input documents based on anchor values to
an XML transformation engine to obtain a search schema, requests a
similarity search by a search manager to determine the similarity
between input documents and anchor values, and classifies the input
documents according to the rules files, classes files, and the
classification profile. The client is then notified that the
classify command has been completed and the classification results
are stored in a database.
Inventors: |
Moon, Charles; (Round Rock,
TX) ; Torossian, Vasken; (Round Rock, TX) |
Correspondence
Address: |
TAYLOR RUSSELL & RUSSELL, P.C.
4807 SPICEWOOD SPRINGS ROAD
BUILDING ONE, SUITE 1200
AUSTIN
TX
78759
US
|
Family ID: |
27807593 |
Appl. No.: |
10/248962 |
Filed: |
March 5, 2003 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60319138 | Mar 6, 2002 | |
Current U.S.
Class: |
1/1 ;
707/999.005; 707/E17.09 |
Current CPC
Class: |
G06F 16/353 20190101;
Y10S 707/99935 20130101 |
Class at
Publication: |
707/5 |
International
Class: |
G06F 007/00 |
Claims
1. A method for classification of documents, comprising the steps
of: receiving a classify command from a client for initiating a
classification of documents, the classify instruction identifying
input documents to be classified, a classification profile, and
anchor values; retrieving the classification profile and input
documents; extracting input values from each input document based
on the anchor values; structuring the input values according to a
search schema identified in the classification profile; performing
similarity searches for determining similarity scores between each
database document and each input document; and classifying the
database documents based on profile and the similarity scores using
classes and rules identified in the classification profile.
2. The method of claim 1, wherein the step of performing similarity
searches comprises performing similarity searches for determining
normalized similarity scores having values of between 0.00 and 1.00
for each database document for indicating a degree of
similarity between each database document and each input document,
whereby a normalized similarity score of 0.00 represents no
similarity matching, a value of 1.00 represents exact similarity
matching, and scores between 0.00 and 1.00 represent degrees of
similarity matching.
3. The method of claim 1, wherein the step of retrieving the
classification profile and input documents comprises retrieving the
classification profile and input documents having repeating
groups.
4. The method of claim 1, further comprising the steps of: storing
the classified database documents in a results database; and
notifying the client of completion of the classify command.
5. The method of claim 4, wherein the step of storing the
classified database documents comprises storing the classified
database documents as a classification results file in a results
database.
6. The method of claim 4, wherein the step of storing the
classified database documents comprises storing the classified
database documents in an output target database identified in the
classification profile.
7. The method of claim 1, wherein each of the classes identified in
the classification profile comprises an identification attribute, a
name element, and a rank element.
8. The method of claim 7, further comprising a low score element
and a high score element for defining lower and upper thresholds
for similarity scores associated with the class.
9. The method of claim 1, wherein each of the rules identified in
the classification profile comprises an identification attribute, a
description element, and a condition element.
10. The method of claim 9, further comprising property elements for
describing conditions for including a document in a parent
class.
11. The method of claim 1, further comprising the step of mapping
between defined classes and defined rules using class rule map
files.
12. The method of claim 1, wherein the step of classifying the
database documents is selected from the group consisting of
classifying a document based on a threshold using a top score from
results of more than one search schema, classifying a document
based on a logical relationship and a threshold using a top score
from results of more than one search schema, classifying a
document based on a number of search results for a single schema
that have scores greater than a threshold, and classifying a
document based on a number of search results from multiple schemas
having scores above a threshold.
13. The method of claim 1, wherein the step of classifying the
database documents further comprises classifying the multiple
database documents based on profile and the similarity scores using
classes and rules identified in the classification profile using a
classify utility.
14. A computer-readable medium containing instructions for
controlling a computer system to implement the method of claim
1.
15. A system for classification of documents, comprising: a
classification engine for receiving a classify command from a
client for initiating a classification of documents, the classify
instruction identifying input documents to be classified, a
classification profile, and anchor values; the classification
engine for retrieving the classification profile and input
documents from a virtual document manager; the classification
engine for extracting input values from each input document based
on the anchor values; an XML transformation engine for structuring
the input values according to a search schema identified in the
classification profile; a search manager for performing similarity
searches for determining similarity scores between each database
document and each input document; and the classification engine for
classifying the database documents based on profile and the
similarity scores using classes and rules identified in the
classification profile.
16. The system of claim 15, wherein the search manager performs
similarity searches for determining normalized similarity scores
having values of between 0.00 and 1.00 for each database document
for indicating a
degree of similarity between each database document and each input
document, whereby a normalized similarity score of 0.00 represents
no similarity matching, a value of 1.00 represents exact similarity
matching, and scores between 0.00 and 1.00 represent degrees of
similarity matching.
17. The system of claim 15, wherein the classification engine retrieves
the classification profile and input documents having repeating groups.
18. The system of claim 15, further comprising the classification
engine for storing the classified database documents in a results
database and notifying the client of completion of the classify
command.
19. The system of claim 18, wherein the classification engine
stores the classified database documents as a classification
results file in a results database.
20. The system of claim 18, wherein the classification engine
stores the classified database documents in an output target
database identified in the classification profile.
21. The system of claim 15, wherein each of the classes identified
in the classification profile comprises an identification
attribute, a name element, and a rank element.
22. The system of claim 21, further comprising a low score element
and a high score element for defining lower and upper thresholds
for similarity scores associated with the class.
23. The system of claim 15, wherein each of the rules identified in
the classification profile comprises an identification attribute, a
description element, and a condition element.
24. The system of claim 23, further comprising property elements
for describing conditions for including a document in a parent
class.
25. The system of claim 15, further comprising the classification
engine for mapping between defined classes and defined rules using class
rule map files.
26. The system of claim 15, wherein the classification engine for
classifying the database documents is selected from the group
consisting of means for classifying a document based on a threshold
using a top score from results of more than one search schema,
means for classifying a document based on a logical relationship
and a threshold using a top score from results of more than
one search schema, means for classifying a document based on a
number of search results for a single schema that have scores
greater than a threshold, and means for classifying a document
based on a number of search results from multiple schemas having
scores above a threshold.
27. The system of claim 15, wherein the classification engine
further comprises means for classifying the multiple database
documents based on profile and the similarity scores using classes
and rules identified in the classification profile using a classify
utility.
28. A system for classification of documents comprising: a
classification engine for accepting a classify command from a
client, retrieving a classification profile, classifying documents
based on similarity scores, rules and classes, storing document
classification results in a database, and notifying the client of
completion of the classify command; a virtual document manager for
providing input documents; an XML transformation engine for
structuring the input values according to a search schema
identified in the classification profile; and a search manager for
performing similarity searches for determining similarity scores
between each database document and each input document.
29. The system of claim 28, further comprising an output queue for
temporarily storing classified documents.
30. The system of claim 28, further comprising a database
management system for storing classification results.
31. A method for classification of documents, comprising: receiving
a classify command from a client, the classify command designating
input document elements for names and search schema, anchor
document structure and values to be used as classification filters,
and a classification profile; retrieving the designated
classification profile, the classification profile designating
classes files for name, rank and score thresholds, rules files for
nested conditions, properties, schema mapping, score threshold
ranges and number of required documents, and class rules maps for
class identification, class type, rule identification, description,
property, score threshold ranges and document count; retrieving the
designated search documents; identifying a schema mapping file for
each input document; determining a degree of similarity between
each input document and anchor document; classifying the input
documents according to the designated classes files and rules
files; and creating and storing a classification results file in a
database.
32. The method of claim 31, wherein the number of documents
classified is designated in the rules files.
33. The method of claim 31, further comprising notifying the client
of completion of the classify command.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/319,138, filed on Mar. 6, 2002.
BACKGROUND OF INVENTION
[0002] The invention relates generally to the field of
classification of documents contained within large enterprise
databases. More particularly, the invention relates to
classification engines that classify documents by performing
similarity searches to match classification profile data to data
found in external databases containing known class data.
[0003] Information resources often contain large amounts of
information that may be useful only if there exists the capability
to segment the information into manageable and meaningful packets.
Database technology provides adequate means for identifying and
exactly matching disparate data records to provide a binary output
indicative of a match. However, in many cases, users wish to
classify information contained in databases based on inexact but
similar attributes. This is particularly true in the case where the
database records may be incomplete, contain errors, or are
inaccurate. It is also sometimes useful to reduce the number of
irrelevant classifications reported by traditional database
classification programs.
Traditional classification methods that make use of exact, partial
and range retrieval paradigms do not satisfy the content-based
retrieval requirements of many users.
[0004] Many existing classification systems require significant
user training and model building to make effective use of the system.
These models are very time-consuming to generate and to maintain.
Another disadvantage with many model-based classification systems
is that they appear as a black box to a user and only provide the
resulting class or grade without any explanation of how the
resultant conclusion was reached. The information regarding the
conclusion is valuable if additional analysis is required to
validate the conclusion. Some classification systems use a large
set of complex rules that process data directly. These rules are
difficult to generate and even more difficult to maintain because
they contain many complex attributes.
SUMMARY OF INVENTION
[0005] The present invention provides a new method of classifying
documents that makes use of many of the desirable characteristics
of similarity search engines. The invention concerns the use of
Similarity Search Technology described in U.S. Provisional
Application No. 60/356,812, entitled "Similarity Search Engine for
Use with Relational Databases," filed on Feb. 14, 2002, to provide a
new method of classifying documents. This document is incorporated
herein by reference. This classification method differs from other
classification methods in that it performs similarity searches to
match data drawn from the documents to be classified to data found
in external databases containing known class data. Because the
similarity search is performed on existing known class data, the
returning search score already contains the grading information
that can be applied directly to the classification criteria.
Matches and near-matches as determined by a similarity search are
evaluated by a set of classification rules to determine whether
documents satisfy predefined classification criteria. In addition
to making classification decisions based on properties derived from
similarity search scores, this method is able to make
classification decisions based on scores obtained from external
analyses of the document in question, and to make classification
decisions based on a combination of similarity scores and external
analytics. The present invention uses a small set of high-level
decision rules that analyze the results returned by a mathematical
scoring engine. Since these rules only contain a small number of
attributes, they are simple to define and maintain.
[0006] A unique feature of the invention is its ability to return
along with the classification result a score that reflects a given
document's rating relative to others in its class according to
predetermined scoring thresholds. Another unique feature of the
present invention is the ability to justify every classification
result. Along with every decision, it provides the user with
reasons why the conclusion for the classification was reached. This
information may be important for many applications, especially when
performing fraud or threat analysis or where additional analysis
needs to be performed to validate the conclusion. Along with
justification data, all additional search results generated by all
classification rules are available following the classification.
This is one of many unique features of the present invention and
separates it from the other classification techniques.
[0007] A method having features of the present invention for
classification of documents comprises the steps of receiving a
classify command from a client for initiating a classification of
documents, the classify instruction identifying input documents to
be classified, a classification profile, and anchor values,
retrieving the classification profile and input documents,
extracting input values from each input document based on the
anchor values, structuring the input values according to a search
schema identified in the classification profile, performing
similarity searches for determining similarity scores between each
database document and each input document, and classifying the
database documents based on profile and the similarity scores using
classes and rules identified in the classification profile. The
step of performing similarity searches may comprise performing
similarity searches for determining normalized similarity scores
having values of between 0.00 and 1.00 for each database
document for indicating a degree of similarity between each
database document and each input document, whereby a normalized
similarity score of 0.00 represents no similarity matching, a value
of 1.00 represents exact similarity matching, and scores between
0.00 and 1.00 represent degrees of similarity matching. The step of
retrieving the classification profile and input documents may
comprise retrieving the classification profile and input documents
having repeating groups. The method may further comprise the steps
of storing the classified database documents in a results database,
and notifying the client of completion of the classify command. The
step of storing the classified database documents may comprise
storing the classified database documents as a classification
results file in a results database. The step of storing the
classified database documents may comprise storing the classified
database documents in an output target database identified in the
classification profile. Each of the classes identified in the
classification profile may comprise an identification attribute, a
name element, and a rank element. The method may further comprise a
low score element and a high score element for defining lower and
upper thresholds for similarity scores associated with the class.
Each of the rules identified in the classification profile may
comprise an identification attribute, a description element, and a
condition element. The method may further comprise property
elements for describing conditions for including a document in a
parent class. The method may further comprise the step of mapping
between defined classes and defined rules using class rule map
files. The step of classifying the database documents may be
selected from the group consisting of classifying a document based
on a threshold using a top score from results of more than one
search schema, classifying a document based on a logical
relationship and a threshold using a top score from results of
more than one search schema, classifying a document based on a
number of search results for a single schema that have scores
greater than a threshold, and classifying a document based on a
number of search results from multiple schemas having scores above
a threshold. The step of classifying the database documents may
further comprise classifying the multiple database documents based
on profile and the similarity scores using classes and rules
identified in the classification profile using a classify utility.
A computer-readable medium may contain instructions for controlling
a computer system to implement the method above.
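The four classification alternatives listed above can be illustrated with a minimal Python sketch. It is one possible reading of that list, not the method as implemented by the inventors: the function names, the dictionary layout (schema name mapped to the scores returned by one search), and the use of "and"/"or" as the logical relationship are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: search schema name -> similarity scores from one search.
Results = Dict[str, List[float]]


def top_score_overall(results: Results, threshold: float) -> bool:
    """Compare the single best score across all schemas to a threshold."""
    all_scores = [s for scores in results.values() for s in scores]
    return bool(all_scores) and max(all_scores) >= threshold


def top_scores_logical(results: Results, threshold: float, op: str = "and") -> bool:
    """Combine per-schema top scores with an assumed AND/OR relationship."""
    tops = [max(scores) for scores in results.values() if scores]
    combine = all if op == "and" else any
    return bool(tops) and combine(t >= threshold for t in tops)


def count_single(results: Results, schema: str, threshold: float, required: int) -> bool:
    """Count results of one schema that score above the threshold."""
    return sum(s > threshold for s in results.get(schema, [])) >= required


def count_multiple(results: Results, threshold: float, required: int) -> bool:
    """Count results across all schemas that score above the threshold."""
    hits = sum(s > threshold for scores in results.values() for s in scores)
    return hits >= required


# Illustrative data only; schema names and scores are invented.
example = {"doctor": [0.95, 0.60], "address": [0.70, 0.65]}
```

With the example data, the overall top score (0.95) passes a 0.90 threshold, while the "and" form fails because the best address score is only 0.70.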
[0008] Another embodiment of the present invention is a system for
classification of documents, comprising a classification engine for
receiving a classify command from a client for initiating a
classification of documents, the classify instruction identifying
input documents to be classified, a classification profile, and
anchor values, the classification engine for retrieving the
classification profile and input documents from a virtual document
manager, the classification engine for extracting input values from
each input document based on the anchor values, an XML
transformation engine for structuring the input values according to
a search schema identified in the
classification profile, a search manager for performing similarity
searches for determining similarity scores between each database
document and each input document, and the classification engine for
classifying the database documents based on profile and the
similarity scores using classes and rules identified in the
classification profile. The search manager may perform similarity
searches for determining normalized similarity scores having values
of between 0.00 and 1.00 for each database document for indicating a
degree of
similarity between each database document and each input document,
whereby a normalized similarity score of 0.00 represents no
similarity matching, a value of 1.00 represents exact similarity
matching, and scores between 0.00 and 1.00 represent degrees of
similarity matching. The classification engine may retrieve the
classification profile and input documents having repeating groups. The
system may further comprise the classification engine for storing
the classified database documents in a results database and
notifying the client of completion of the classify command. The
classification engine may store the classified database documents
as a classification results file in a results database. The
classification engine may store the classified database documents
in an output target database identified in the classification
profile. Each of the classes identified in the classification
profile may comprise an identification attribute, a name element,
and a rank element. The system may further comprise a low score
element and a high score element for defining lower and upper
thresholds for similarity scores associated with the class. Each of
the rules identified in the classification profile may comprise an
identification attribute, a description element, and a condition
element. The system may further comprise property elements for
describing conditions for including a document in a parent class.
The system may further comprise the classification engine for mapping
between defined classes and defined rules using class rule map
files. The classification engine for classifying the database
documents may be selected from the group consisting of means for
classifying a document based on a threshold using a top score from
results of more than one search schema, means for classifying a
document based on a logical relationship and a threshold using a
top score from results of more than one search schema, means
for classifying a document based on a number of search results for
a single schema that have scores greater than a threshold, and
means for classifying a document based on a number of search
results from multiple schemas having scores above a threshold. The
classification engine may further comprise means for classifying
the multiple database documents based on profile and the similarity
scores using classes and rules identified in the classification
profile using a classify utility.
[0009] An alternative embodiment of the present invention is a
system for classification of documents comprising a classification
engine for accepting a classify command from a client, retrieving a
classification profile, classifying documents based on similarity
scores, rules and classes, storing document classification results
in a database, and notifying the client of completion of the
classify command, a virtual document manager for providing input
documents, an XML transformation engine for structuring the input
values according to a search schema identified in the
classification profile, and a search manager for performing
similarity searches for determining similarity scores between each
database document and each input document. The system may further
comprise an output queue for temporarily storing classified
documents. The system may further comprise a database management
system for storing classification results.
[0010] In yet another embodiment of the present invention, a method
for classification of documents comprises receiving a classify
command from a client, the classify command designating input
document elements for names and search schema, anchor document
structure and values to be used as classification filters, and a
classification profile, retrieving the designated classification
profile, the classification profile designating classes files for
name, rank and score thresholds, rules files for nested conditions,
properties, schema mapping, score threshold ranges and number of
required documents, and class rules maps for class identification,
class type, rule identification, description, property, score
threshold ranges and document count, retrieving the designated
search documents, identifying a schema mapping file for each input
document, determining a degree of similarity between each input
document and anchor document, classifying the input documents
according to the designated classes files and rules files, and
creating and storing a classification results file in a database.
The number of documents classified may be designated in the rules
files. The method may further comprise notifying the client of
completion of the classify command.
BRIEF DESCRIPTION OF DRAWINGS
[0011] These and other features, aspects and advantages of the
present invention will become better understood with regard to the
following description, appended claims, and accompanying drawings
wherein:
[0012] FIG. 1 shows a classification engine within the framework of
a similarity search engine;
[0013] FIG. 2 shows a search that is for a claim containing a
doctor with a name Falstaff;
[0014] FIG. 3A shows the CLASSES file;
[0015] FIG. 3B shows a reserved system-defined CLASS attribute;
[0016] FIG. 3C shows an example CLASSES instance;
[0017] FIG. 4A shows a RULES file;
[0018] FIG. 4B shows an example of a RULES instance;
[0019] FIG. 5A shows a CLASS_RULE_MAPS file;
[0020] FIG. 5B shows an example of a CLASS_RULES_MAPS instance;
[0021] FIG. 6A shows a SCHEMA_MAPPING file;
[0022] FIG. 6B shows an example of a SCHEMA_MAPPING instance;
[0023] FIG. 7A shows a CLASSIFICATION_RESULTS file;
[0024] FIG. 7B shows an example of a CLASSIFICATION_RESULTS
instance;
[0025] FIG. 7C shows the normalization formulas used for computing
Class Scores;
[0026] FIG. 8A shows a CLASSIFICATION_PROFILE file;
[0027] FIG. 8B shows an example of a CLASSIFICATION_PROFILE
instance;
[0028] FIG. 9 shows a flowchart that depicts transaction steps of a
classification engine;
[0029] FIG. 10 shows a flowchart of the classification process;
[0030] FIG. 11 shows an XCL CLASSIFY command;
[0031] FIG. 12A shows a FROM-clause;
[0032] FIG. 12B shows an example of a FROM-clause instance with
multiple input documents;
[0033] FIG. 12C shows an example of a FROM-clause instance for an
entire set;
[0034] FIG. 12D shows an example of a FROM-clause instance with
specific documents;
[0035] FIG. 13A shows a WHERE-clause;
[0036] FIG. 13B shows an example of a WHERE-clause instance;
[0037] FIG. 14A shows a USING-clause;
[0038] FIG. 14B shows an example of a USING-clause instance;
and
[0039] FIG. 15 shows a RESPONSE for a CLASSIFY command.
DETAILED DESCRIPTION
[0040] Turning to FIG. 1, the Classification Engine (CE) operates
within the framework of the Similarity Search Engine (SSE),
employing the services of the SSE's Virtual Document Manager (VDM),
Search Manager (SM), and XML Transformation Engine (XTE). The VDM
is used by the CE to access the documents to be classified, and by
the SM to access the databases the CE needs to search. The SM
performs similarity searches requested by the CE and returns the
results indicating the degree of similarity between the anchor
values drawn from the input documents and the target values found
in the search databases. The XTE enables the CE to move data from
one hierarchical form to another, which is necessary for searching
across disparate databases.
[0041] The CE is a general-purpose classification server designed
to support a range of client applications. A typical interactive
client might employ the CE to classify incoming documents as they
are received--for instance, an insurance claim form being
considered for immediate payment or referral for investigation. A
batch client might use the CE to classify a collection of
documents--for instance, to re-evaluate a set of insurance claims
based on new information received regarding a claimant. Though
these examples are drawn from the insurance industry, the CE can
operate on any sort of document and with any set of document
categories.
[0042] The Classification Client interacts with the CE by means of
a CLASSIFY command, which is part of the XML Command Language for
the SSE. The client issues a CLASSIFY command to request the CE to
classify the indicated documents and deposit the results into a
designated database. A batch utility has been developed in
conjunction with the CE and represents one implementation of a
batch-mode CE client.
[0043] The Classification Engine is the server program that carries
out CLASSIFY commands, assisted by the VDM, SM, and XTE. It
receives input documents from a staging database via VDM and places
them into an input queue for classification. The CE uses a
Classification Profile (see File Descriptions) to determine what
searches to conduct in order to classify the document. It uses XTE
to extract data values from the input document for use as search
criteria. It then passes the SM a set of queries to execute to
determine whether values similar to those from the input document
are to be found in the databases available to the SM. Using a set
of classification rules, the CE compares the similarity scores from
the completed queries to predefined thresholds. If the requisite
number of searches returns scores within the designated thresholds,
the rule is regarded as true and the input document is classified
accordingly. The CE contains one or more classes and one or more
classification rules. Each defined class has one or more rules that
are used to identify the class criteria. Once all the rules are
executed and the classification is complete, the classified
documents are moved onto an output queue and the classifications
are written to tables in a specified database.
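The rule evaluation described in this paragraph can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CE's actual logic: the `Rule` fields, the inclusive threshold comparison, and the choice that a class applies when any one of its rules is true are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    """Hypothetical rule: true when enough scores fall within [low, high]."""
    rule_id: str
    low_score: float
    high_score: float
    required_count: int


def rule_is_true(rule: Rule, scores: List[float]) -> bool:
    """Count scores inside the rule's threshold range and compare to the quota."""
    hits = sum(1 for s in scores if rule.low_score <= s <= rule.high_score)
    return hits >= rule.required_count


def classify(scores: List[float], class_rules: Dict[str, List[Rule]]) -> List[str]:
    """Return each class with at least one true rule (an assumed policy)."""
    return [name for name, rules in class_rules.items()
            if any(rule_is_true(r, scores) for r in rules)]


# Illustrative classes and thresholds only.
class_rules = {
    "REFER": [Rule("R1", 0.90, 1.00, 2)],
    "PAY": [Rule("R2", 0.00, 0.50, 3)],
}
assigned = classify([0.95, 0.91, 0.40], class_rules)
```

Here two of the three scores fall in the 0.90 to 1.00 range, so rule R1 is true and the document is placed in the hypothetical REFER class.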
[0044] The CE is designed in such a way that it can use any scoring
module that behaves similarly to the SSE. It has the ability to classify
using rules and scores representing the likelihood of finding the
search document inside a known class dataset. This includes
datasets represented by predictive models trained by other
mathematical model-based systems, e.g., neural networks. By using
rules and thresholds, it is able to reach a conclusion about the
class by analyzing the combination of all scores returned from all
scoring modules.
[0045] The Virtual Document Manager is responsible for reading
documents for classification by the CE and for providing the Search
Manager with access to the databases containing the data used in
the classification operation. The documents managed by VDM are
structured hierarchically according to the industry-standard
Extensible Markup Language (XML). These hierarchical documents have
a top-level element (known as the root) that contains other
elements (known as its children). Child elements can have children
of their own, and they can contain individual data values (known as
leaf elements). It is the nesting of child elements that gives the
XML document its hierarchical form. Because an element can have
zero, one, or multiple occurrences of a child, the XML format can
be used to represent any kind of document. Multiply occurring
elements are known as repeating groups.
[0046] The documents managed by VDM are virtual in the sense that
their values are not stored in text strings, as is the case with
most XML documents. Instead, when a document is accessed, the VDM
obtains the appropriate values from a designated datasource, often
a relational database but not limited to this storage method. It
uses a document definition known as a search schema to create the
structure of the document and to map values to its elements. To
clients of the VDM, it appears that XML text strings are being read
and written.
[0047] The Search Manager (SM) performs similarity searches
according to QUERY commands from its clients. A QUERY command
contains a WHERE-clause that sets out the criteria for the search,
a list of measures to be used to assess the similarity of the
database values in the databases being searched to the values given
in the QUERY, and (optionally) some limits on the volume of output
documents to be returned.
[0048] The SM has a library of similarity measures developed to
handle different kinds of data. Some of these compare whole values
and others break complex values down into their constituent parts.
Each measure is able to compare two values of the same type and to
return a score indicating the level of similarity between the two.
Measures differ in kinds of data they examine, so that the score
coming from a specialized "personal_address" measure might be more
accurate than the score from the more generic "text" measure that
does not have knowledge of how addresses are formatted. When a
search involves more than one element, the scores for all the
comparisons are combined using a weighted average. These weights
reflect the relative importance of the elements such that those of
the highest importance can be assigned higher weights and therefore
contribute more to the overall score for the search.
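The weighted average described above can be illustrated with a short Python sketch. The function name `combine_scores`, the element names, and the weights are assumptions made for this example; they are not taken from the SSE.

```python
def combine_scores(scores, weights):
    """Return the weighted average of per-element similarity scores.

    scores  -- dict mapping element name to a score in [0.0, 1.0]
    weights -- dict mapping element name to a non-negative weight
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical two-element search: NAME is weighted three times as
# heavily as ADDRESS, so it dominates the overall score.
scores = {"NAME": 0.9, "ADDRESS": 0.5}
weights = {"NAME": 3.0, "ADDRESS": 1.0}
overall = combine_scores(scores, weights)  # approximately 0.8
```

With equal weights the same inputs would average to about 0.7, which shows how a higher weight pulls the overall score toward the more important element.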
[0049] Similarity scores range from 0.00 to 1.00 where a zero score
means no similarity and one means the values are identical. By
default, the SM examines all the values in the designated database,
scores them all against the search criteria, and returns a Result
Set containing a score for each document drawn from the database.
However, since the default Result Set could contain an entry for
every document in the database and the lower scores may not be of
interest to the application, the SM can filter the Result Set
according to the number of documents or range of scores. This is
controlled by the SELECT-clause in the query.
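As a minimal sketch of the filtering just described, the following Python function trims a result set by a minimum score and by a maximum number of documents. The function name, the parameter names, and the tuple layout are hypothetical; the actual SELECT-clause syntax is not shown here.

```python
def filter_result_set(results, min_score=0.0, max_documents=None):
    """Filter (document_id, score) pairs as a SELECT-clause might.

    Keeps only entries scoring at least min_score, orders them from
    highest to lowest score, and truncates to max_documents if given.
    """
    kept = [entry for entry in results if entry[1] >= min_score]
    kept.sort(key=lambda entry: entry[1], reverse=True)
    if max_documents is not None:
        kept = kept[:max_documents]
    return kept


# Hypothetical default Result Set with one entry per database document.
results = [("a", 0.20), ("b", 0.95), ("c", 0.70)]
top = filter_result_set(results, min_score=0.5, max_documents=1)
```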
[0050] The XML Transformation Engine (XTE) is an internal service
to the SSE, responsible for moving values from one hierarchical
document format to another. For instance, the XTE can transform a
personal name given as a single string into separate values for
First, Middle, and Last. It does this by applying matching rules to
the names of the data items to find counterparts and by
decomposing/recomposing their data values according to another set
of rules. The XTE can also employ a synonym table to quickly
resolve known mappings. The CE uses the XTE to extract data values
from the input documents into the formats required for the searches
it issues. This allows the CE to search across multiple databases,
even when they differ in the way their data is structured.
[0051] As part of the SSE, the CE uses schemas and result documents
maintained by the SSE. In the vernacular of the SSE, a schema is an
XML document that contains a <STRUCTURE> element defining the
structure of the document, a <MAPPING> element that ties
elements of the document to fields in the database, and a
<SEMANTICS> element that associates similarity measures with
the elements of the documents that the schema describes.
[0052] The SSE Document Schema describes the contents of a database
to be searched by the SM. However, it is not used directly in the
search. Instead the XTE uses the document schema to locate
counterparts for the elements of the input document in the database
to be searched. Only the <STRUCTURE> portion of the schema is
used. The measures for the searches come from the search schemas.
Through the VDM, the contents of the database can thereby be seen
as a collection of XML documents, structured according to the
hierarchy defined in the document schema.
[0053] The SSE Search Schema describes a search to be performed
when the CE evaluates an input document to determine whether it
conforms to the classification rules. Its STRUCTURE-clause may
consist of one or several elements structured hierarchically
according to the document structure defined by the document schema.
However, it typically contains a subset of those elements--i.e. the
ones for the data values involved in the search. Its MAPPING-clause
indicates the mapping of elements to fields in the datasource to be
searched--i.e. the database described by the document schema. Its
WHERE-clause is populated by XTE using values from the input
document. Its SEMANTICS-clause specifies the measures to be used in
evaluating target documents for similarity to the values taken from
the input document.
[0054] The XTE profile describes the mapping of values from the
input document to the structure of a search schema. It contains a
STRATEGIES element that lists the comparisons made to find the
counterpart for a given element in the target database, a set of
MAPPING elements that pair source values to target elements, and a
set of SYNONYMS that allow elements to be recognized under several
names.
[0055] The Input Profile is an SSE document schema that describes
the structure of the input documents. Only the <STRUCTURE>
and <MAPPINGS> are used. Since the input documents are not
used directly--they provide values for the search schemas--no
<SEMANTICS> are required.
[0056] The Input Search Criterion document (WHERE-clause) used to
anchor the searches issued by the CE is drawn from the input
documents by the XTE. The output of the XTE is a structure that
conforms to the schema of the datasource to be searched and
populated with the corresponding values from the input document.
This structure becomes the contents of the WHERE-clause in the
QUERY issued to the SSE that carries out the search.
[0057] Turning to FIG. 2, FIG. 2 shows a search that is for a CLAIM
that contains a DOCTOR element containing a NAME element with the
value "Falstaff". In the case of a repeating group, each instance
of the group is used to generate a different Input Search Criterion
document. If there are multiple repeating groups, all permutations
are generated.
[0058] The XML Command Language defines a standard format for the
SSE result document. Options are available for including additional
data with the results, but the default format is used by the CE.
The results of a search are presented as an XML document containing
a <RESPONSE> element that (in the case of a successfully
completed search) contains a <RESULT> element, that in turn
contains a set of DOCUMENT elements. The DOCUMENT elements have no
children. Each contains three attributes: the similarity score
computed for the document, the unique Identifier of the document,
and the name of the schema used for the search. By default,
<RESULT> contains a DOCUMENT element for every document in
the database. Since low-scoring documents are seldom of interest,
it is possible to limit the number of <DOCUMENT> elements in
the <RESULT> set by specifying a threshold score or maximum
number of documents to return. The CE obtains these values from the
Rules used to classify the documents.
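A client could read such a result document roughly as sketched below in Python. The nesting of RESPONSE, RESULT, and DOCUMENT follows the description above, but the attribute names `score`, `id`, and `schema` and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical result document; attribute names are assumed.
RESULT_XML = """
<RESPONSE>
  <RESULT>
    <DOCUMENT score="0.97" id="1001" schema="SANCTIONED_DOCS"/>
    <DOCUMENT score="0.42" id="1002" schema="SANCTIONED_DOCS"/>
  </RESULT>
</RESPONSE>
"""


def read_result_set(xml_text):
    """Return one dict per DOCUMENT element in a result document."""
    root = ET.fromstring(xml_text.strip())
    documents = []
    for element in root.findall("RESULT/DOCUMENT"):
        documents.append({
            "id": element.get("id"),
            "schema": element.get("schema"),
            "score": float(element.get("score")),
        })
    return documents


documents = read_result_set(RESULT_XML)
```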
[0059] The Classification Engine uses a set of XML files, referred
to as CE Classification Files, to define the classification rules
and the searches they require. Ordinarily, these are located in the
local filesystem of the server where the CE resides. The .xml
extension is used to indicate that the file is formatted as a text
string according to the rules of XML.
[0060] Turning to FIG. 3A, FIG. 3A shows the CLASSES.xml file that
describes the categories into which documents are classified. The
file contains one or more CLASS elements, each defining one of the
categories. Each class has an ID attribute, a NAME element, and a
RANK element. The value of the ID attribute is a unique identifier
for the class. The value of the NAME element provides a descriptive
name for use in displays and reports. The value of the RANK element
indicates the place of this class in the hierarchy of classes. A
RANK value of 1 is the highest level in the hierarchy. It is
possible for more than one class to have the same rank. Each class
may optionally have LOW_SCORE and HIGH_SCORE elements that define
the lower and upper thresholds for scores associated with the
class.
[0061] FIG. 3B shows a system-defined CLASS attribute that is
reserved for documents that do not fall into any defined class.
[0062] Turning to FIG. 3C, FIG. 3C shows an example of a CLASSES
instance where four classes are defined, each with a unique integer
ID. The class hierarchy is reflected in Table 1 and the CLASS file
example is shown in FIG. 3C. Note that BLUE and GREEN have the same
rank. This system is designed to handle thousands of hierarchically
defined classes, in terms of grades. The hierarchy identifies the
priority or the rank of each grade and is used to order the
execution priority of rules for each class. The higher ranked class
and its rules will always override the lower ranked ones.
TABLE 1
RANK | ID | NAME
1 | 1 | RED
2 | 2 | YELLOW
3 | 3 | GREEN
3 | 4 | BLUE
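The override behavior of ranked classes can be sketched in Python using the four classes of Table 1. The function name `highest_ranked` and the decision to return every class that ties for the best rank are assumptions of this sketch, not details given in the description.

```python
# Classes from Table 1 as (rank, id, name); rank 1 is the highest level.
CLASSES = [
    (1, 1, "RED"),
    (2, 2, "YELLOW"),
    (3, 3, "GREEN"),
    (3, 4, "BLUE"),
]


def highest_ranked(matched_ids):
    """Return the names of the matched classes that share the best rank."""
    matched = [c for c in CLASSES if c[1] in matched_ids]
    if not matched:
        return []
    best_rank = min(rank for rank, _, _ in matched)
    return [name for rank, _, name in matched if rank == best_rank]


# A document matching YELLOW, GREEN, and BLUE is reported as YELLOW,
# because YELLOW outranks the other two.
winner = highest_ranked({2, 3, 4})
```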
[0063] Turning to FIG. 4A, FIG. 4A shows a RULES.xml file. The
RULES file itemizes the rules used for classification. The file
contains one or more RULE elements (each with an ID attribute), a
DESCRIPTION element, and a CONDITION element. The value for the ID
attribute must be unique to distinguish the rule from others.
[0064] The value of the DESCRIPTION element is descriptive text for
use in user displays and reports. The CONDITION element may contain
PROPERTY elements that describe the search results to indicate that
a document meets the conditions for inclusion in the parent CLASS.
CONDITION elements can be nested, using the optional OP attribute
to indicate how to combine the results of the child CONDITION
elements. (Default is "AND". Absence of the OP attribute means only
one PROPERTY is specified.)
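Nested CONDITION elements can be evaluated recursively, as in the following Python sketch. The dictionary representation of a condition and the `OR` operator value are assumptions; only the `AND` default is stated in the description.

```python
def evaluate_condition(condition, property_results):
    """Evaluate a nested condition tree.

    condition        -- dict with an optional "op" key ("AND" by default)
                        and a "children" list; each child is either a
                        nested condition dict or a PROPERTY id string
    property_results -- dict mapping PROPERTY id to True or False
    """
    op = condition.get("op", "AND")
    outcomes = []
    for child in condition["children"]:
        if isinstance(child, dict):
            outcomes.append(evaluate_condition(child, property_results))
        else:
            outcomes.append(property_results[child])
    if op == "AND":
        return all(outcomes)
    if op == "OR":  # assumed operator value
        return any(outcomes)
    raise ValueError("unsupported OP value: " + op)


# Hypothetical rule: P1 AND (P2 OR P3).
rule = {"op": "AND",
        "children": ["P1", {"op": "OR", "children": ["P2", "P3"]}]}
met = evaluate_condition(rule, {"P1": True, "P2": False, "P3": True})
```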
[0065] A simple rule has only one PROPERTY element. A complex rule
has multiple PROPERTY elements grouped together with CONDITION
elements. Each PROPERTY element is uniquely identified (within the
scope of the rule) by the value of its associated ID attribute. Two
kinds of PROPERTY elements are defined: threshold PROPERTY elements
and value PROPERTY elements. Both kinds of PROPERTY element contain
a SCHEMA_MAP_ID element, and a DOCUMENT_COUNT element.
[0066] The SCHEMA_MAP_ID element is a reference to a MAP element ID
in the SCHEMA_MAPPING file. (The MAP element indicates the schema
for the search and any XTE transformation required. See
CLASS_RULE_MAPS.xml). The DOCUMENT_COUNT element defines a default
value for the required minimum number of documents with scores in
the default range.
[0067] In a threshold PROPERTY, the THRESHOLD element describes a
default range of scores for documents returned by the searches
required for this rule. A THRESHOLD element has START and END
elements, whose values define the bottom and top of the default
range. Values of the OP attributes for START and END indicate
whether the bottom and top values are themselves included in the
range.
[0068] A combination of the THRESHOLD and DOCUMENT_COUNT elements
defines the condition when a predefined number of documents meets
the score range criteria. The THRESHOLD element can be used to
reach a conclusion about a class when using other model-based
scoring engines. The DOCUMENT_COUNT element is primarily used with
the SSE to identify the likelihood, in terms of the probability, of
the anchor document in the target dataset.
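The combination of THRESHOLD and DOCUMENT_COUNT can be sketched in Python as follows. The function names and the boolean flags standing in for the OP attributes of START and END are hypothetical, since the permitted OP values are not listed here.

```python
def in_range(score, start, end, include_start=True, include_end=True):
    """Check whether a score lies between start and end.

    The include_* flags play the role of the OP attributes on START
    and END (assumed encoding).
    """
    above = score >= start if include_start else score > start
    below = score <= end if include_end else score < end
    return above and below


def threshold_property_met(scores, start, end, document_count,
                           include_start=True, include_end=True):
    """True when at least document_count scores fall within the range."""
    hits = sum(1 for s in scores
               if in_range(s, start, end, include_start, include_end))
    return hits >= document_count


# Hypothetical search returning four scores; three lie in 0.90 to 1.00.
scores = [0.95, 0.91, 0.99, 0.40]
met = threshold_property_met(scores, 0.90, 1.00, document_count=3)
```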
[0069] A value PROPERTY addresses the values themselves and
contains a VALUE element that specifies the criterion value and
comparison operator and contains a FIELD element that references
the database table and field (column) containing target values for
the comparison. A combination of the VALUE and DOCUMENT_COUNT
elements defines the condition when a predefined number of
documents meet the value-matching criterion.
[0070] An example rules file is shown in FIG. 4B. In the example,
two rules are defined. RULE 1 specifies a search using the
SANCTIONED_DOCS schema, indicated by MAP 2. Default values for the
bottom and top of the threshold are set at 0.90 and 1.00. The
default DOCUMENT_COUNT is set at 3. RULE 2 requires two searches,
both of which must be satisfied as indicated by the AND operator in
the CONDITION. The first search uses the STOLEN_VEHICLES schema, as
indicated by MAP 1, and specifies an inclusive range of scores from
0.90 to 1.00. The second search uses the SANCTIONED_LAWYERS schema,
as indicated by MAP 3, and specifies an inclusive range of scores
from 0.90 to 1.00. Table 2 shows the hierarchy of the RULES file
example.
TABLE 2
RANK | NAME | ID | Match Type | Rules | Rules Overrides
1 | RED | C1 | Single | R1 | Threshold = .90-1.00
  |  |  |  | R2 | Record count = 4
2 | YELLOW | C2 | Multi | R1, R2 |
3 | GREEN | C3 | Single | R1 | Threshold = .85-1.00
3 | BLUE | C4 | Multi | R1, R2, R3 |
[0071] Turning to FIG. 5A, FIG. 5A shows a CLASS_RULE_MAPS.xml file
that defines the mapping between defined classes and defined rules
(See CLASSES.xml and RULES.xml). The CLASS_RULE_MAPS contains one
or more CLASS_RULE_MAP elements. Each element is uniquely
identified by its associated ID attribute. The CRITERIA_MATCH_TYPE
attribute of the CLASS_RULE_MAP element has two possible
values that govern the processing of input documents containing
repeating groups. The (default) value of "Single" indicates that
once CE has search results that satisfy a rule, other repetitions
do not need to be checked. A value of "Multi" means that the
results of all repetitions are to be examined and retained. A
CLASS_RULE_MAP element contains one or more CLASS_ID elements whose
values correspond to classes defined in the CLASSES file. The
RULE_MATCH_TYPE attribute for CLASS_ID has two possible values. The
(default) value of "Single" indicates that rule checking can stop
as soon as a single rule is met. A value of "Multi" indicates that
the rule checking should continue until all rules for the class are
checked and that results for all rules met are to be saved. The
CLASS_ID element contains a RULE_ID element whose values correspond
to rules defined in the RULES file. These are the rules to be
checked for the class. A RULE_ID element can contain DESCRIPTION
and PROPERTY_ID elements whose values override the defaults given
in the RULES file. The value for PROPERTY_ID references the
corresponding PROPERTY for the associated rule and contains
elements with override values for the THRESHOLD and DOCUMENT_COUNT.
The values for LOW_SCORE and HIGH_SCORE reference the associated
class and provide override values for score thresholds set in
CLASSES.
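The difference between the "Single" and "Multi" values of RULE_MATCH_TYPE can be sketched in Python. The function `check_rules` and the callable `rule_is_met` are hypothetical names; the sketch shows only the stopping behavior, not how a rule is evaluated.

```python
def check_rules(rule_ids, rule_is_met, rule_match_type="Single"):
    """Return the ids of the rules that were met for a class.

    rule_ids        -- rule ids to check, in order
    rule_is_met     -- callable taking a rule id and returning a bool
    rule_match_type -- "Single" stops at the first rule met;
                       "Multi" checks every rule and keeps all hits
    """
    met = []
    for rule_id in rule_ids:
        if rule_is_met(rule_id):
            met.append(rule_id)
            if rule_match_type == "Single":
                break
    return met


# Hypothetical outcomes in which both rules are satisfied.
outcomes = {1: True, 2: True}
single = check_rules([1, 2], outcomes.get, "Single")  # [1]
multi = check_rules([1, 2], outcomes.get, "Multi")    # [1, 2]
```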
[0072] Turning to FIG. 5B, FIG. 5B shows an example of a
CLASS_RULE_MAPS file where three mappings are specified. The first
mapping assigns RULE 1 and RULE 2 to CLASS 1, which has the NAME
"RED" assigned in the CLASSES file. The default values for the
rules are used because no values of the rule are overridden. The
second mapping assigns RULE 1 and RULE 2 to CLASS 2, which has the
NAME "YELLOW". However, in this definition the defaults for the
rules are overwritten. The third mapping assigns RULE 1 and RULE 2
to CLASS 3, providing a different set of override values.
CLASS 1 and CLASS 2 have a RULE_MATCH_TYPE of "Multi", which means
RULE 2 is checked even when RULE 1 evaluates true. Because
CLASS_RULE_MAP 1 has a value of "Multi" for CRITERIA_MATCH_TYPE,
all repetitions of the document's repeating groups are checked and
all the search results are saved.
[0073] Turning to FIG. 6A, FIG. 6A shows a SCHEMA_MAPPING.xml file
that describes how to map values from the input document into a
schema for the search. The file contains one or more MAP elements, each
with an integer value for the ID attribute that uniquely identifies
the map. The MAP element contains a SEARCH_SCHEMA element and an
XTE_MAP element. The value of the SEARCH_SCHEMA element is the name
of the schema used in the search. The schema is stored in the
SCHEMAS file for the SSE that conducts the search. The value of the
XTE_MAP element is the name of the XTE element in the XTE_PROFILE
file. The XTE_PROFILE contains the mapping STRATEGIES, the
SOURCE/TARGET mappings, and the SYNONYMS used in the
transformation. The result is a SCHEMA_MAPPING suitable for use in
the WHERE-clause of the QUERY command issued for the search.
[0074] Turning to FIG. 6B, FIG. 6B shows an example of a
SCHEMA_MAPPING file where three schema mappings are specified.
[0075] Turning to FIG. 7A, FIG. 7A shows a
CLASSIFICATION_RESULTS.xml file that describes the output produced
by the CE. The TARGET element indicates where to save the results
of a classification, and (optionally) the additional search results
to save. Each TARGET element is uniquely identified by the value of
its ID attribute, and contains exactly one DATASET element. The
DATASET element contains the name of the datasource to receive the
output. In the present implementation, this is a relational
database. Datasources for the SSE are defined in the DATASOURCES
file. The SEARCH_RESULTS element is optional. The value of the
SEARCH element corresponds to the identifier of a MAP in the
SCHEMA_MAPPING file that indicates the schema used in the search.
The value of the COUNT element indicates the number of results to
save. The SEARCH_RESULTS element may contain multiple
<SEARCH> elements, but only one <COUNT> element.
[0076] Turning to FIG. 7B, FIG. 7B shows an example of a
CLASSIFICATION_RESULTS.xml file where results are sent to the
datasource named "classification_output". Up to 20 results from
searches of the schemas specified for MAP 1 and MAP 2
(STOLEN_VEHICLES and SANCTIONED_DOCS) are saved. The SCORE
associated with the classification of a document is derived as
follows: The highest similarity search score returned from among
all Properties contained in the RULE that resulted in the
classification is normalized such that the lower threshold from the
Property equates to 0.00 and the upper threshold from the Property
equates to 1.00. This score is renormalized according to the
LOW_SCORE and HIGH_SCORE thresholds for the resulting CLASS to
yield a score within the CLASS thresholds proportional to its place
within the thresholds for the Property.
[0077] The normalization formulas are shown in FIG. 7C. An example
is a document that scores 0.60 with a Property whose thresholds are
0.50 to 0.90. The computation (0.60-0.50)/(0.90-0.50) gives 0.25 as
the score normalized for those thresholds. To renormalize the score
for a Class where LOW_SCORE is 0.60 and HIGH_SCORE is 0.80, the
computation (0.80-0.60)*0.25+0.60 produces a renormalized class
score of 0.65.
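The worked example can be reproduced with the short Python sketch below. FIG. 7C is not reproduced here, so the two functions are inferred from the computations in the example; their names are assumptions.

```python
def normalize(score, property_low, property_high):
    """Map a score within the Property thresholds onto 0.00 to 1.00."""
    return (score - property_low) / (property_high - property_low)


def renormalize(normalized, class_low, class_high):
    """Map a normalized score onto the LOW_SCORE to HIGH_SCORE range."""
    return (class_high - class_low) * normalized + class_low


# Values from the example: a score of 0.60 against Property thresholds
# 0.50 to 0.90, then a Class with LOW_SCORE 0.60 and HIGH_SCORE 0.80.
normalized = normalize(0.60, 0.50, 0.90)           # about 0.25
class_score = renormalize(normalized, 0.60, 0.80)  # about 0.65
```

Because the values are floating-point numbers, the results should be compared with a small tolerance rather than for exact equality.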
[0078] Turning now to FIG. 8A, FIG. 8A shows a
CLASSIFICATION_PROFILE.xml file that drives the classification
process. It describes how a classification is to be performed, what
classes are to be generated, and what actions to take with a
classified record. The CLASSIFICATION_PROFILE contains one or more
PROFILE elements that define the kinds of classification processes
available. The value for the ID attribute uniquely identifies the
PROFILE. A PROFILE element contains a SOURCE_SCHEMA element and a
TARGET_ID element. It may also contain a CLASS_RULE_MAP_ID. The
SOURCE_SCHEMA element has only a NAME attribute whose value
corresponds to the name of a schema in the SSE's SCHEMAS file. This
schema is used to read the input documents to be classified. Only
the STRUCTURE and MAPPING elements are used. SEMANTICS are ignored
since the schema is used only for reading and mapping input
documents, not for searching them. This is carried out by the
search schemas. The DATASET element has only an ID attribute whose
value corresponds to the identifier of a TARGET element in the
CLASSIFICATION_RESULTS file that specifies the datasource to
receive the output of the classification. The CLASS_RULE_MAP
element has only an ID attribute whose value corresponds to the
identifier of a CLASS_RULE_MAP in the CLASS_RULE_MAPS file
that describes the rule mapping to use in the classification.
[0079] Turning now to FIG. 8B, FIG. 8B shows an example of the
CLASSIFICATION_PROFILE.xml where the NEW_CLAIMS source schema is
used to get the records to be classified. The results go to the
dataset referenced by the TARGET element with the ID value of "1".
The CLASS_RULE_MAP with ID value of "1" indicates the class
definitions and rules employed.
[0080] Database Result tables are created when a new TARGET is
defined in the CLASSIFICATION_RESULTS file. In the present
embodiment, the target datasource must be a relational database
where the CE has the permissions necessary for creating and
updating tables. When the TARGET is specified in the
CLASSIFICATION_PROFILE, it receives the output from the
classification.
[0081] A number of tables are created in the specified datasource.
These include a HEADER table, a CLASSRESULTS table, a
SEARCHCRITERIA table, and a RULE_CRITERIA table.
[0082] A table having a HEADER TABLENAME is shown in TABLE 3.
TABLE 3
Column Name | Description | Characteristics
ClassificationID | The id value of the classification profile used for this classification. | INTEGER
PKEY_VALUE | Primary key values from the input data. | Primary Key CHAR(30) (Note: Should be defined large enough so that the largest primary key value from the input source can be stored in this column)
CLASS_ID | Generated highest-ranking classification class id for this key. | INTEGER
[0083] A table having a CLASSRESULTS TABLENAME is shown in TABLE
4.
TABLE 4
Column Name | Description | Characteristics
PKEY_VALUE | Primary key values from the input document | Primary Key CHAR(30) (Note: Should be defined large enough so that the largest primary key value from the input source can be stored in this column)
RULE_ID | ID value for the rule, specified by the ID attribute in the rule definition | INTEGER
RULE_CRITERION_ID | System generated ID. Used to locate search criteria for the rule. | INTEGER (Note: This value is unique per record.)
CLASS_ID | Generated classification class id for this search criterion. | INTEGER
[0084] A table having a SEARCHCRITERIA TABLENAME is shown in TABLE
5. Each input record generates one or more search criteria
documents. A document without repeating groups generates one search
criteria document. A document with repeating groups generates a
search criteria document for each permutation of values.
TABLE 5
Column Name | Description | Characteristics
PKEY_VALUE | Primary key values from the input document | CHAR(30) (Note: Should be defined large enough so that the largest primary key value from the input source can be stored in this column)
SEARCH_CRIT_ID | System generated ID. Used to uniquely identify search criteria. | INTEGER (Note: This value is unique per PKEY_VALUE.)
SCHEMA_MAP_ID | ID value of the MAP, specified by ID attribute in SEARCH_SCHEMA_MAP definition. | INTEGER
SEARCH_CRIT | XML document containing the input search criteria. (See INPUT_SEARCH_CRITERIA.) | BLOB
RESULT_DOC | SSE result document containing document name, scheme name, and similarity score. See SSE_RESULT_DOCUMENT. | BLOB
[0085] A table having a RULE_CRITERIA TABLENAME is shown in TABLE
6.
TABLE 6
Column Name | Description | Characteristics
PKEY_VALUE | Primary key values from the input data. | Primary Key CHAR(30) (Note: Should be defined large enough so that the largest primary key value from the input source can be stored in this column)
RULE_ID | Unique identifier of the rule specified by the ID attribute in the rule definition. | INTEGER
RULE_CRITERION_ID | Identifier of the criterion. | INTEGER. Matches values found in RULE_CRITERION_ID in the CLASSRESULTS table.
ATTRIBUTE_ID | Unique identifier of the PROPERTY specified by the ID attribute in the RULE definition. | INTEGER. Unique within the scope of a RULE ID.
CRITERION_ID | Identifier of the search criterion. | INTEGER. Matches values found in SEARCH_CRITERIA_ID in the SEARCH_CRITERIA table.
[0086] The following provides a narrative account of the process
flow of a transaction carried out by the Classification Engine. As
part of the SSE, the CE has access to the services of the VDM, SM,
and XTE components of the SSE framework. For the most part, this
narrative focuses on the actions of the CE itself with only brief
descriptions of the actions of the other components as they
interact with the CE.
[0087] Turning now to FIG. 9, FIG. 9 shows a flowchart that depicts
the main steps of the transactions carried out by the
Classification Engine. As part of the preparations for a
classification run, the collection of documents to be classified
are stored in a datasource accessible to the SSE. Ordinarily this
will be a staging database devoted to this purpose. The SSE has a
schema that describes the input documents so that they can be read
by the CE using the XML Command Language's (XCL's) DOCUMENT
command. The SSE also has schemas for the searches to be conducted
by the CE during the classification run, and datasource definitions
for the databases to be searched. The CE's set of Classification
files has been edited according to the requirements of the
run.
[0088] Step 1: C-ENGINE Accepts CLASSIFY Command from Client
[0089] To request the CE to conduct a classification run, the
client passes a CLASSIFY command to the SSE, using the execute
method of the SSE's Java Connection class. In the SSE, a Command
Handler object is instantiated to carry out the CLASSIFY command.
This is one implementation of a general-purpose command interface
using XML documents to represent client requests and their results.
The CLASSIFY command contains clauses that specify the source and
selection criteria of the documents to be classified and a profile
describing the classifications to be performed. The FROM-clause
contains one or more DOCUMENT commands that can be carried out by
the VDM to provide the CE with the documents to be classified--i.e.
the input documents. The WHERE-clause contains selection criteria
to filter the collection of documents defined by the FROM-clause.
To qualify for processing by the CE, a document's values must match
those given for the corresponding elements in the WHERE-clause.
(For details on the WHERE-clause, see APPENDIX A). The USING-clause
has a profile attribute that identifies the classification profile
for the run. (The classification profile is described in
CLASSIFICATION_PROFILE.xml.)
[0090] Step 2: C-ENGINE Retrieves Classification Profile to
Identify Required Searches
[0091] The CE prepares to begin classifying the input documents by
reading the CLASSIFICATION_PROFILE file to find the PROFILE
specified in the USING-clause of the CLASSIFY command. From this
PROFILE, the CE obtains the SOURCE_SCHEMA, DATASET, and
CLASS_RULE_MAP to use for the classification run. SOURCE_SCHEMA is
the schema that describes the structure and mapping of the input
documents. The semantics (similarity measures) are not used.
DATASET is the XCL definition of the datasource to receive the
output of the classification. In the current implementation, this
is a relational database for which the CE has authorizations to
create and update tables. CLASS_RULE_MAP is the identifier of a
CLASS_RULE_MAP in the CLASS_RULE_MAPS file that defines the
classification scheme to be employed in the run. The classification
process is shown in Step 10 and explained in detail later in the
document.
[0092] Step 3: C-Engine Issues Document Command(s) to Read Input
Documents
[0093] To obtain input documents to classify, the CE issues the
DOCUMENT commands contained in the FROM-clause to the VDM. If there
is no FROM-clause, the entire datasource represented by the input
schema is used. There are three main forms of the FROM-clause:
1.
<FROM>
  <DOCUMENT name="document1" schema="schema"/>
  <DOCUMENT name="document2" schema="schema"/>
  ...
</FROM>
[0094] In this form, the FROM-clause contains a DOCUMENT command
for each input document, identifying documents by their name and
schema. With this information, the VDM is able to construct the
document as specified in the STRUCTURE-clause of the schema,
drawing values from the datasource specified in the MAPPING-clause
of the schema.
2.
<FROM>
  <DOCUMENT name="*" schema="schema"/>
</FROM>
[0095] In this form, the FROM-clause contains a single DOCUMENT
command, using the * wildcard symbol to indicate that all documents
in the set are to be returned.
[0096] VDM is then able to construct the documents as specified in
the STRUCTURE-clause of the schema, drawing values from the
datasource specified in the MAPPING-clause of the schema.
3.
<FROM>
  <DOCUMENT schema="schema">
    <Contents />
  </DOCUMENT>
  <DOCUMENT schema="schema">
    <Contents />
  </DOCUMENT>
  ...
</FROM>
[0097] In this form, the FROM-clause contains the documents
themselves. The DOCUMENT commands specify the schema that describes
the structure of the documents. The values are taken from the
DOCUMENT commands themselves, not the datasource referenced in the
MAPPING-clause of the schema. This form is most often used by a
client that already has a document to be classified.
[0098] Step 4: C-Engine Receives Input Documents
[0099] When the VDM has carried out the DOCUMENT commands from the
CE, it returns them as an XCL Result Set to the CE. In the current
implementation, to avoid swamping the CE with documents, the VDM
passes them in batches.
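Batched delivery of this kind can be sketched in Python with a generator. The function name `batches` and the batch size are assumptions; the actual batching mechanism of the VDM is not described.

```python
def batches(documents, batch_size):
    """Yield successive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for document in documents:
        batch.append(document)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Deliver the final, possibly shorter, batch.
        yield batch


# Five hypothetical documents delivered in batches of two.
delivered = list(batches(["d1", "d2", "d3", "d4", "d5"], 2))
```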
[0100] Step 5: C-Engine Extracts Values From Input Documents to
Anchor Searches
[0101] In this step, the CE prepares to launch the searches
required to classify the document by extracting values to serve as
the anchor criteria for the search. The combinations of values
needed depend on the requirements of the searches. If the input
document contains repeating groups--i.e. elements with more than
one value--the CE launches searches for each repetition. That is,
each resulting set of search criteria contains a different set of
values for the elements in the repeating group. In the case of
multiple repeating groups, the CE creates a separate document for
each permutation. For example a document with two repeating groups,
one with 5 repetitions and one with 4 repetitions, would be
decomposed into 20 sets of searches.
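The decomposition can be sketched in Python with `itertools.product`, which enumerates one combination per choice of instance from each repeating group. The group names and values below are hypothetical; only the counts of 5 and 4 repetitions come from the example.

```python
from itertools import product


def expand_repeating_groups(groups):
    """Return one dict per combination of repeating-group instances.

    groups -- dict mapping a group name to the list of its instances
    """
    names = list(groups)
    combos = product(*(groups[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical document with two repeating groups of 5 and 4 instances.
groups = {
    "DOCTOR": ["doctor%d" % i for i in range(5)],
    "VEHICLE": ["vehicle%d" % i for i in range(4)],
}
criteria = expand_repeating_groups(groups)
assert len(criteria) == 20
```

A document with no repeating groups yields a single empty combination, which corresponds to one set of search criteria.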
[0102] Step 6: C-ENGINE Passes Input Values, Search Schema to
XTE
[0103] For each document to be classified, one or more searches may
be required. The schemas for these searches are located in the
SCHEMAS directory in the filesystem used by the SSE. In order to
locate matching values in the databases to be searched, the CE must
issue a QUERY command to the appropriate SM. The WHERE-clause of
the QUERY command gives the values to be used as search criteria.
However, there is no assurance that the structure of these anchor
values in the input document is the same as the structure needed in
the WHERE-clause, which needs to reflect the structure of the
target database. In some cases, complex values may need to be
broken down into constituent parts. In others, simple values may
need to be combined. Sometimes, a synonym table is used to make
substitutions. This kind of conversion is performed by the XTE. For
each search schema defined in the maps specified in the
CLASS_RULE_MAP indicated in the CLASSIFY command, the CE issues a
request to the XTE containing the input document and the target
schema.
[0104] Step 7: XTE Returns Input Values Structured for Search
Schema
[0105] The XTE receives XML transformation requests from the CE and
returns an Input Search Criterion document suitable for use in the
WHERE-clause of a query. (For details on the operation of the XTE,
refer to patent description XXXX.)
[0106] Step 8: C-ENGINE Issues QUERY Commands to Search
Managers
[0107] For each search indicated by the CLASS_RULE_MAP, the CE
issues a QUERY command to the SM to find documents with values that
match those taken from the input document. The QUERY command
consists of a WHERE-clause and a FROM-clause.
[0108] WHERE-clause: Using the Input Search Criterion document, the
CE is able to construct a WHERE-clause that contains the anchor
values from the input document in the structure required by the
search schema.
[0109] FROM-clause: The CE constructs a FROM-clause consisting of a
single DOCUMENT command that uses the wildcard designation to
indicate that all documents should be searched.
[0110] Step 9: SM Processes QUERY Commands, Returns Similarity
Scores
[0111] For each QUERY issued by the CE, the SM returns an XCL
Result Set consisting of a DOCUMENT element for every document
drawn from the database being searched. The DOCUMENT element has a
score attribute that indicates how well the document's values match
the anchor values given as search criteria in the QUERY command.
Scores range from 0.00 to 1.00, with zero indicating a total
mismatch and one indicating an exact match. The score depends on
the similarity measure assigned to the element in the search
schema. As the SM completes the searches, it places the results on
a return queue for processing by the CE.
[0112] Step 10: C-ENGINE Classifies Document Based on Profile and
Scores
[0113] As search results become available from the SM, the CE is
able to classify the input documents according to the prevailing
rules. In this implementation, a rule is expressed as a set of
conditions that must be satisfied in order for the document to be
placed in a defined class. Boolean operators (AND, OR) allow for
combinations of conditions. A condition is deemed satisfied if the
results of a search include a required number of documents with
similarity scores within a specified range. The classification
algorithm uses the CLASSES.xml file shown in FIG. 3, and described
subsequently in relation to FIG. 10. In Step 8, on encountering a
document with repeated groups of data, the CE launched searches for
each repetition. The value of the CRITERIA_MATCH_TYPE element in
the specified CLASS_RULE_MAP determines whether the CE treats a
classification rule as True as soon as any one of the repetitions
fulfills the conditions of the rule, or whether the CE waits for
the results for all the resulting documents before completing the
classification.
[0114] Step 11: C-ENGINE Places Classified Document on Output
Queue
[0115] Documents for which classification rules evaluate as True
are placed on the Output Queue for assignment to the appropriate
class.
[0116] Step 12: C-ENGINE Reads Documents from Output Queue
[0117] On completion of the classification process, the CE reads
the documents from the Output Queue.
[0118] Step 13: C-ENGINE Adds Results to Results Database
[0119] On completion of the classification process, the CE writes
the identifier of the PROFILE to the HEADER table. (See "Output
Files".)
[0120] For each classified document, the CE adds a row to the
CLASSRESULTS table.
[0121] For each successful search, the CE adds a row to the
SEARCHCRITERIA table.
[0122] For each rule evaluated as True, the CE adds a row to the
RULE_CRITERIA table.
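The writes described in Step 13 may be sketched as follows, purely as an illustration. Only the four table names come from the description; the column layouts, the function name store_results, and the use of an in-memory SQLite database are assumptions made for the sake of a runnable example.

```python
import sqlite3

# Hypothetical column layouts; only the table names appear in the text.
SCHEMA = """
CREATE TABLE HEADER (profile_id TEXT);
CREATE TABLE CLASSRESULTS (pkey TEXT, class_id INTEGER, class_name TEXT,
                           class_rank INTEGER, score REAL);
CREATE TABLE SEARCHCRITERIA (pkey TEXT, search_schema TEXT, hits INTEGER);
CREATE TABLE RULE_CRITERIA (pkey TEXT, rule_id TEXT);
"""


def store_results(conn, profile_id, classified, searches, fired_rules):
    """Write one classification run; all rows commit together or not at all."""
    with conn:
        conn.execute("INSERT INTO HEADER VALUES (?)", (profile_id,))
        conn.executemany(
            "INSERT INTO CLASSRESULTS VALUES (?, ?, ?, ?, ?)", classified)
        conn.executemany(
            "INSERT INTO SEARCHCRITERIA VALUES (?, ?, ?)", searches)
        conn.executemany(
            "INSERT INTO RULE_CRITERIA VALUES (?, ?)", fired_rules)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_results(
    conn,
    profile_id="1",
    classified=[("doc-1", 10, "high risk", 1, 0.93)],
    searches=[("doc-1", "SANCTIONED_DOCS", 1), ("doc-1", "STOLEN_VEHICLES", 1)],
    fired_rules=[("doc-1", "R1")],
)

tables = ("HEADER", "CLASSRESULTS", "SEARCHCRITERIA", "RULE_CRITERIA")
counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in tables}
print(counts)
```

Grouping the inserts in a single transaction is a design choice of this sketch, so that a failed run leaves no partial results behind.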
[0123] Step 14: C-ENGINE Notifies Client of Completion of CLASSIFY
Command
[0124] On completion of the classification process, the CE notifies
the client with an XCL Response indicating the success of the
operation or the likely cause of failure. The classification result
APIs allow CE clients to access the results of a classification via
XCL commands. Java utilities are available to read the results
tables and generate the appropriate XCL commands. The generated XCL
document is used with the SSE Java Connection class to execute the
request. The three classes that represent the API are: Cresults;
CresultDocument; and Cjustification.
[0125] The following describes a method of document classification
using similarity search results. The process flow here is
summarized as Step 10 of the main processing narrative. It is
broken out for further detailing because it embodies the essential
invention being described.
[0126] This method is based on the premise that documents can be
classified according to how well their values match documents in
other databases. For instance, an insurance claim might be
classified as suspicious based on a match between the name of the
claimant and a document with the same name drawn from a database of
known fraud perpetrators. While exact match searches could find the
corresponding record when the name is stored in exactly the same
way, they are often defeated by inconsequential differences in the
way the name is stored. For instance, on the insurance claim, the
name might be written as a single string, while in the database it
is broken down into First, Middle, and Last Names. Furthermore,
minor differences or irregularities in the way the name is spelled
or entered could foil the exact match search. For instance, the
claim form may say "Charley" while the database says "Charles".
[0127] The application of similarity technology is able to overcome
these barriers to finding the match in several ways. First, the
ability to recognize near-matches, such as "Charley" and "Charles"
means that minor differences do not eliminate a document from
consideration, as is the case with exact match methods. Second, the
ability of the SSE's XTE service to restructure anchor values to
match the structure of the search database overcomes differences in
how the data is organized, as with the case of full names vs.
first-middle-last. Finally, the calculation of a similarity score
as a weighted average of the scores for matches of individual
values gives the SSE the ability to find the best overall matches,
based on all the relevant values, and even to find a good overall
match when none of the values are exactly the same.
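The weighted-average scoring described above may be illustrated with the following minimal sketch. The similarity measures actually used by the SSE are not specified here, so Python's standard difflib.SequenceMatcher is substituted as a stand-in measure; the function names, element names, and weights are likewise hypothetical.

```python
from difflib import SequenceMatcher


def value_similarity(a, b):
    """Stand-in similarity measure in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def document_score(anchor, target, weights):
    """Weighted average of the per-element similarity scores."""
    total_weight = sum(weights[name] for name in anchor)
    weighted_sum = sum(
        weights[name] * value_similarity(value, target.get(name, ""))
        for name, value in anchor.items()
    )
    return weighted_sum / total_weight


# Anchor values already restructured into first/last name elements.
anchor = {"first": "Charley", "last": "Moon"}
target = {"first": "Charles", "last": "Moon"}
weights = {"first": 1.0, "last": 2.0}  # hypothetical element weights

score = document_score(anchor, target, weights)
exact = document_score(anchor, anchor, weights)
print(round(score, 2), exact)
```

In this sketch the near-match on the first name lowers the overall score only slightly, so the target document remains a strong candidate, whereas an exact-match search would have rejected it outright.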
[0128] On the other hand, similarity technology is also able to
confirm non-matches with the same tolerance for differences in data
representation described above. For instance, the ability to
confirm that a person's name and all reasonable variations do not
appear in a database of approved customers may be sufficient to
classify that person as a new customer.
[0129] The CE Offers Four Ways to Classify a Document Based on
Similarity Search Results:
[0130] 1) Take the top score from among all results from one search
schema and use it to classify the input document based on a threshold. For
example, if the highest scoring document in SANCTIONED_DOCS matches
the input document with a score of 0.90 or more, then classify the
input document as "high risk".
[0131] 2) Take the top score from among the results from more than
one search schema and classify based on an AND/OR relationship and
some threshold. For example, if the highest scoring document in
SANCTIONED_DOCS matches with a score of 0.90 or more AND the
highest scoring document in STOLEN_VEHICLES matches with a score of
0.80 or more, then classify the input document as "high risk".
[0132] 3) Classify based on the number of search results for a
single schema that have scores above some threshold. For example,
if 6 monthly payment documents in PAYMENTS_RETURNED match with a
score of 0.90 or better then classify the input document as "high
risk".
[0133] 4) Classify based on the number of search results from
multiple schemas that have scores above some threshold. For
example, if 6 monthly payment documents in
PAYMENTS_RETURNED_2000 match with a score of 0.90 or more AND
6 monthly payment documents in PAYMENTS_RETURNED_2001 match
with a score of 0.80 or more, then classify the input document as
"high risk".
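The four approaches listed above may be expressed, as a sketch only, in terms of two small helper functions. The helper names and the sample scores are hypothetical; the thresholds and counts are taken from the examples given above.

```python
def top_score(scores):
    """Highest similarity score in a result set (0.0 if it is empty)."""
    return max(scores, default=0.0)


def count_at_least(scores, threshold):
    """Number of results whose score meets or exceeds the threshold."""
    return sum(1 for s in scores if s >= threshold)


# Hypothetical similarity scores returned for one input document.
results = {
    "SANCTIONED_DOCS": [0.93, 0.41],
    "STOLEN_VEHICLES": [0.85, 0.10],
    "PAYMENTS_RETURNED": [0.95] * 6 + [0.30],
    "PAYMENTS_RETURNED_2000": [0.91] * 6,
    "PAYMENTS_RETURNED_2001": [0.82] * 6,
}

# 1) Top score from one search schema against a threshold.
way1 = top_score(results["SANCTIONED_DOCS"]) >= 0.90
# 2) Top scores from more than one schema combined with AND.
way2 = way1 and top_score(results["STOLEN_VEHICLES"]) >= 0.80
# 3) Number of results in one schema above a threshold.
way3 = count_at_least(results["PAYMENTS_RETURNED"], 0.90) >= 6
# 4) Numbers of results in multiple schemas combined with AND.
way4 = (count_at_least(results["PAYMENTS_RETURNED_2000"], 0.90) >= 6
        and count_at_least(results["PAYMENTS_RETURNED_2001"], 0.80) >= 6)

high_risk = way1 or way2 or way3 or way4
print(way1, way2, way3, way4)
```

With the sample scores shown, each of the four tests is satisfied, so any one of them alone would place the input document in the "high risk" class.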
[0134] The classification rules are given in the CE's
classification files, described in "CE Classification Files". These
are:
[0135] CLASSES--defines classes by name and rank
[0136] RULES--defines rules and conditions for evaluation
[0137] CLASS_RULE_MAPS--defines type of mapping and which rules
apply to classes
[0138] The processing flow for document classification is shown in
FIG. 9. At this point, the searches have completed and results have
been tabulated so that for each search the CE knows the number of
results with scores above the given threshold.
[0139] For a simple document, the CE processes each RULE to
determine whether the rule evaluates as True according to the
search results. The properties in each rule are evaluated and
combined into an overall rule evaluation. Each PROPERTY uses a
single search result score. A CONDITION is used to logically
combine its individual PROPERTY and CONDITION evaluations to
compute an overall True or False result.
[0140] The rule evaluation process provides two modes of operation.
One mode evaluates rules against all possible combinations of
search results, regardless of whether the conditions for
classification have already been satisfied. This provides extensive
evaluation and classification justification information. The other
mode evaluates rules in an order such that once the conditions for
a classification have been satisfied, further rule processing is
terminated. This provides a simple classification with minimal
justification information but can result in improved operational
performance. The settings for these modes of operation are defined
by the CLASS_RULE_MAP CRITERIA_MATCH_TYPE.
[0141] CRITERIA_MATCH_TYPE governs the processing mode at the Class
level. When CRITERIA_MATCH_TYPE is "Single", as soon as a rule
fires that allows a document to be placed in that Class, its
results are saved and other scores are no longer considered. This
means once a classification is achieved for a Class, then no
further processing is needed at that Class rank or lower. When
CRITERIA_MATCH_TYPE is "Multi", all rules must be evaluated and
processing continues. This provides a more complete account of the
classification, since it evaluates every rule for which search
results are available.
[0142] RULE_MATCH_TYPE governs the evaluation of rules in classes
that contain more than one rule. When RULE_MATCH_TYPE is "Multi",
then all the rules for a class must be evaluated. When
RULE_MATCH_TYPE is "Single", then as soon as a rule evaluates as
True, the document can be placed in that Class and no further
processing is needed for that Class.
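The difference between the "Single" and "Multi" settings of RULE_MATCH_TYPE may be sketched as follows. The function evaluate_class and the rule identifiers are hypothetical; each rule is modeled simply as a callable returning True or False, and the list named evaluated exists only to show which rules were actually examined.

```python
def evaluate_class(rules, rule_match_type):
    """Return the ids of rules that evaluate as True for one class.

    rules           -- list of (rule_id, check) pairs; check() returns a bool
    rule_match_type -- "Single" stops at the first True rule,
                       "Multi" evaluates every rule
    """
    fired = []
    for rule_id, check in rules:
        if check():
            fired.append(rule_id)
            if rule_match_type == "Single":
                break  # the document can already be placed in the class
    return fired


evaluated = []  # records which rules were examined, for illustration


def make_rule(rule_id, outcome):
    def check():
        evaluated.append(rule_id)
        return outcome
    return rule_id, check


rules = [make_rule("R1", False), make_rule("R2", True), make_rule("R3", True)]

single = evaluate_class(rules, "Single")
evaluated_single = list(evaluated)

evaluated.clear()
multi = evaluate_class(rules, "Multi")
evaluated_multi = list(evaluated)

print(single, evaluated_single)  # ['R2'] ['R1', 'R2']
print(multi, evaluated_multi)    # ['R2', 'R3'] ['R1', 'R2', 'R3']
```

The "Single" run never examines R3, which illustrates the trade-off described above between reduced processing and more complete justification information. The same early-exit idea would apply at the Class level under CRITERIA_MATCH_TYPE.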
[0143] Turning to FIG. 10, FIG. 10 shows a flowchart of the
classification process. The classification takes different paths
for each type of condition.
[0144] 1) For each property, if the required number of documents
produce scores within the specified range, a property evaluates as
True. Otherwise, the property evaluates as False.
[0145] 2) For a condition with the AND operator, to evaluate as
True, all the properties and conditions it contains must evaluate
True.
[0146] 3) For a condition with OR operator, to evaluate as True,
any property or condition it contains must evaluate True.
[0147] Conditions are tested recursively until the topmost
condition has been evaluated. If True, then the rule has been
evaluated as True.
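The recursive evaluation of properties and conditions may be sketched as follows. This is an illustrative model only: the class names Property and Condition mirror the terms used above, but their fields (schema, min_score, max_score, required_count, operator, children) are assumptions and do not reproduce the format of the RULES file.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Property:
    schema: str          # search schema whose results are tested
    min_score: float     # lower bound of the score range
    max_score: float     # upper bound of the score range
    required_count: int  # number of documents that must fall in range

    def evaluate(self, results: Dict[str, List[float]]) -> bool:
        in_range = sum(
            1 for s in results.get(self.schema, [])
            if self.min_score <= s <= self.max_score
        )
        return in_range >= self.required_count


@dataclass
class Condition:
    operator: str  # "AND" or "OR"
    children: List[Union["Property", "Condition"]] = field(default_factory=list)

    def evaluate(self, results: Dict[str, List[float]]) -> bool:
        # Each child is a Property or a nested Condition, so this recurses.
        outcomes = (child.evaluate(results) for child in self.children)
        if self.operator == "AND":
            return all(outcomes)
        if self.operator == "OR":
            return any(outcomes)
        raise ValueError(f"unknown operator: {self.operator!r}")


# Topmost condition of a hypothetical rule.
rule = Condition("AND", [
    Property("SANCTIONED_DOCS", 0.90, 1.00, 1),
    Condition("OR", [
        Property("STOLEN_VEHICLES", 0.80, 1.00, 1),
        Property("PAYMENTS_RETURNED", 0.90, 1.00, 6),
    ]),
])

results = {
    "SANCTIONED_DOCS": [0.95, 0.20],
    "STOLEN_VEHICLES": [0.50],
    "PAYMENTS_RETURNED": [0.92] * 6,
}
print(rule.evaluate(results))  # True
```

Here the nested OR condition is satisfied by the PAYMENTS_RETURNED property even though the STOLEN_VEHICLES property is False, and the topmost AND condition therefore evaluates as True.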
[0148] Turning to FIG. 11, FIG. 11 shows the XCL CLASSIFY command.
The XCL CLASSIFY command is an XML document which contains the
necessary elements for performing a classification using the
Classification Engine. The document is transmitted via the SSE
execute method on the SSE Java Connection class.
[0149] Turning now to FIG. 12A, FIG. 12A shows the FROM-clause. The
FROM-clause identifies the document set being classified. These are
virtual documents drawn from relational datasources according to a
predefined input schema. The FROM-clause offers three ways to
identify the input documents. The first lists the documents
individually by name. The second uses the wildcard designation "*"
to request all documents in the set. The third (used primarily for
debugging) includes the documents themselves in the command.
Examples of each are given below.
[0150] Turning to FIG. 12B, FIG. 12B shows an example of a
FROM-clause that indicates that CLASSIFY should get its input from
the documents named "1", "2", and "3" in the set defined for the
search schema "acme_products".
[0151] Turning to FIG. 12C, FIG. 12C shows an example of a
FROM-clause that indicates CLASSIFY should examine the entire set
for "acme_products".
[0152] Turning to FIG. 12D, FIG. 12D shows an example of a
FROM-clause that indicates CLASSIFY should examine the documents
shown. Note that the documents are unnamed and are therefore
unidentified in classification outputs.
[0153] Turning now to FIG. 13A, FIG. 13A shows the WHERE-clause.
The CLASSIFY command uses the WHERE-clause to filter documents for
classification. The WHERE-clause indicates the anchor to be
compared to target values drawn from the datasources specified in
the FROM-clause. The anchor document is structured as a hierarchy
to indicate parent/child relationships, reflecting the
STRUCTURE-clause of the schema. Only those documents that contain
values matching the anchor values are considered for
classification. Matches are determined by the measures specified in
the schema.
[0154] For the Classification Engine, the WHERE-clause takes the
form of an XML document structure populated with anchor
values--i.e. the values that represent the "ideal" for the document
filter. This document's structure conforms to the structure of the
input schema. However, only the elements contributing to the filter
need to be included. Hierarchical relationships among elements,
which would be established with JOIN operations in SQL, are
represented in SSE Command Language by the nesting of elements in
the WHERE-clause. No matter where they occur in the document
structure, all elements included in the WHERE-clause are used to
filter the documents found in the associated input datasource. A
classification engine WHERE-clause is used for selection. A
WHERE-clause is optional in any CLASSIFY that does classification.
Without a WHERE-clause, a CLASSIFY will use all documents in the
FROM clause. FIG. 13B shows an example of a WHERE-clause.
[0155] Turning to FIG. 14A, FIG. 14A shows a USING-clause. The
USING-clause defines which classification profile the
Classification Engine should use to classify the input
documents.
[0156] Turning to FIG. 14B, FIG. 14B shows an example of a
USING-clause that indicates that the CLASSIFY command should use
the profile with the ID "1" (MyClassification) to perform the
classifications on the input documents.
[0157] A Classify utility is useful for classifying multiple
documents at once. The batch classification utility allows the use
of the CE without a custom-written client. The SSE SCHEMAS file
must contain a schema for the documents to be classified. Executing
the following command from a system console starts the utility.
Classify profile="classification profile name" [gateway="SSE connection url"] [uid="user id for SSE connection"] [pwd="password for SSE connection"]
profile="classification profile name": (Required) Specifies the name of the classification profile to use for classifying records found in the input database.
gateway="SSE connection url": (Optional) Specifies the URL to use for connecting to the SSE gateway. The default value is localhost. Example value: gateway="raw://localhost:5151"
uid="user id for SSE connection": (Optional) Specifies the user id to use for connecting to the SSE gateway. The default value is Admin. Example value: uid="Admin"
pwd="password for SSE connection": (Optional) Specifies the password for the user that is to be used for the connection to the SSE gateway. The default value is admin. Example value: pwd="admin"
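The handling of the utility's arguments may be sketched as follows, for illustration only. The function parse_classify_args is hypothetical and does not represent the utility's actual implementation; the parameter names and default values are those given above.

```python
# Defaults as stated in the parameter descriptions above.
DEFAULTS = {"gateway": "localhost", "uid": "Admin", "pwd": "admin"}
KNOWN = ("profile", "gateway", "uid", "pwd")


def parse_classify_args(argv):
    """Parse key="value" arguments, applying defaults for optional ones."""
    options = dict(DEFAULTS)
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in KNOWN:
            raise ValueError(f"unrecognized argument: {arg!r}")
        options[key] = value.strip('"')
    if "profile" not in options:
        raise ValueError("profile is required")
    return options


options = parse_classify_args(['profile="MyClassification"', 'uid="Admin"'])
print(options)
```

Only profile lacks a default in this sketch, which reflects its status as the single required parameter.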
[0158] Once the command is executed, the classification process
begins and the utility starts writing messages reporting its
progress.
[0159] Turning to FIG. 15, FIG. 15 shows the RESPONSE element of
the classification log resulting from the classification. The RC
element provides a return code indicating the success of the
operation or the error conditions that resulted in failure. The
MESSAGE element contains a descriptive account of the operation,
including the progress of the classification and its general
results. Each document in the Input File is identified by PKEY
value and classification results are given by CLASS_ID, CLASS,
RANK, and SCORE.
[0160] To stop the batch Classify utility, issue the CLASSIFYSTOP
command. Terminating the program stops the classification of the
remaining records that have not yet been classified. Results for
records already classified are saved.
[0161] Although the present invention has been described in detail
with reference to certain preferred embodiments, it should be
apparent that modifications and adaptations to those embodiments
might occur to persons skilled in the art without departing from
the spirit and scope of the present invention.
* * * * *