U.S. patent application number 09/898948 was filed with the patent office on 2001-07-03 and published on 2002-01-24 for method and apparatus for processing free-format data.
Invention is credited to Hetherington, Greg.
Application Number | 20020010714 09/898948 |
Document ID | / |
Family ID | 3804719 |
Filed Date | 2001-07-03 |
United States Patent
Application |
20020010714 |
Kind Code |
A1 |
Hetherington, Greg |
January 24, 2002 |
Method and apparatus for processing free-format data
Abstract
A method and apparatus for processing free-format data (301) to
produce a "text object" associated with the free-format data. The
text object comprises a plurality of "component nodes" (302-312)
containing attribute-type identifiers for elements of the
free-format text and other data facilitating access to the text
object to obtain information and/or change or add the free-format
data. This arrangement obviates the need for the provision of
separate database fields for each element of the information.
Free-format data can therefore be processed in a similar manner to
the way a human being processes free-format data. All elements can
be accessed via the constructed text object.
Inventors: |
Hetherington, Greg;
(Kareela, AU) |
Correspondence
Address: |
DAVIS & BUJOLD, P.L.L.C.
500 NORTH COMMERCIAL STREET
FOURTH FLOOR
MANCHESTER
NH
03101
US
|
Family ID: |
3804719 |
Appl. No.: |
09/898948 |
Filed: |
July 3, 2001 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
09898948 | Jul 3, 2001 |
09117776 | Aug 6, 1998 |
Current U.S.
Class: |
715/256 ;
707/E17.058 |
Current CPC
Class: |
Y10S 707/99934 20130101;
G06F 40/131 20200101; G06F 40/279 20200101; G06F 40/117 20200101;
G06F 40/30 20200101; G06F 16/3344 20190101; Y10S 707/99943
20130101; G06F 40/12 20200101; G06F 40/151 20200101; G06F 40/205
20200101; G06F 40/14 20200101; G06F 16/30 20190101; Y10S 707/99942
20130101 |
Class at
Publication: |
707/505 ;
707/508 |
International
Class: |
G06F 007/00 |
Foreign Application Data
Date | Code | Application Number
Apr 22, 1997 | AU | PP0439
Claims
1. A method of processing free-format data stored in a computing
system, comprising the steps of examining elements of the data to
determine attributes of the data, by examining the content of the
elements and the contextual relationships of elements to each
other, to determine semantic and syntactic information (attributes)
about the data, producing additional data relating to this
information, in the form of a text object which includes pointer
means enabling access to the elements of the free-format data, and
the additional data being accessible by a query processing means to
provide answers to queries relating to the semantic and syntactic
information about the data and/or to access the data to manipulate
the data.
2. A method in accordance with claim 1, wherein the free-format
data is stored as a record in a free-format field of a
database.
3. A method in accordance with claim 1 or claim 2, wherein the data
remains stored in the computing system as it was originally stored,
whereby it may be accessed by other applications.
4. A method in accordance with any preceding claim, wherein the
text object includes an attribute-type identifier which identifies
an attribute type of an element of the data.
5. A method in accordance with any preceding claim, wherein the
text object includes a value indicating the character length of an
element of the data.
6. A method in accordance with claim 4 or claim 5, wherein the text
object includes a value indicating whether an element is low level
in a syntactic hierarchy or higher level whereby the value may be
used for matching purposes when matching data with other data
processed in accordance with the method.
7. A method in accordance with any preceding claim, the text object
including a match weighting value for an element of the data, which
can be used to determine the significance of the element when
matching with other free format data.
8. A method in accordance with any preceding claim, wherein the
text object comprises a plurality of component nodes arranged
according to the semantic structure of the free-format data, the
component nodes being arranged in a hierarchy corresponding to the
semantic structure of the free-format data and each component node
including additional data relating to the corresponding element of
the free-format data.
9. A method in accordance with any preceding claim, comprising the
further step of generating matching values for comparing an element
of the free-format data with an element of other free-format data
processed in accordance with the present method.
10. A method in accordance with claim 9 where the matching value is
a phonetic value for phonetically comparing elements of free-format
data.
11. A method in accordance with any preceding claim, wherein the
text object includes implied data relating to information implied
from the free-format data.
12. A method in accordance with any preceding claim, wherein a
plurality of free-format data records are processed and a text
object associated with each free-format data record is
produced.
13. A method in accordance with claim 12, wherein the text object
is stored in the computer system whereby it is available for
queries on the associated free-format data record via the query
processing means.
14. A method in accordance with claim 12 comprising the further
step of producing a text object index including attribute type
identifiers for elements of each data record and pointers to each
data record, whereby the index may be queried by queries relating
to semantic and syntactic information about the data and the data
may be accessed via the index.
15. A method in accordance with claim 14 wherein each entry in the
text object index includes a representative value key, which gives
a value representative of a feature of the element associated with
the attribute-type identifier.
16. A method in accordance with any preceding claim, comprising the
further step of carrying out a domain construction process to
construct a domain object from domain definition data files, the
domain object being arranged to carry out the examination process
by parsing the free-format data in accordance with grammar
rules.
17. A method in accordance with claim 16, wherein the domain
definition data files include character definition data, regular
expression definition data and grammar data.
18. A method in accordance with any preceding claim, wherein the
free-format data is postal address data.
19. A method in accordance with any preceding claim wherein the
query processing means can carry out normal database operations on
the data via the additional data.
20. A processing system for processing free-format data stored in a
computing system, the apparatus including means for examining
elements of the data to determine attributes of the data, by
examining the content of the elements and the contextual
relationships of elements to each other, to determine semantic and
syntactic information (attributes) about the data, means for
producing additional data relating to this information, in the form
of a text object which includes pointer means enabling access to
the elements of the free-format data, and a query processing means
which is arranged to access the additional data to provide answers
to queries relating to the semantic and syntactic information about
the data and/or to access the data to manipulate the data.
21. A processing system in accordance with claim 20, wherein the
free-format data is stored as a record in a free-format field of a
database.
22. A processing system in accordance with claim 20 or claim 21,
wherein the examining means does not affect the storage of the
data.
23. A processing system in accordance with any one of claims 20 to
22, wherein the text object includes an attribute-type identifier
which identifies an attribute type of an element of the data.
24. A processing system in accordance with any one of claims 20 to
23, wherein the text object includes a value indicating the
character length of an element of the data.
25. A processing system in accordance with claim 23 or claim 24,
wherein the text object includes a value, indicating whether an
attribute-type of an element is low level in a syntactic hierarchy
or high level whereby the value may be used for matching purposes
when matching with other free-format data processed in accordance
with this system.
26. A processing system in accordance with any one of claims 20 to
25, wherein the text object includes a match weighting value for an
element of the data, which can be used to determine the
significance of the element when matching with other free-format
data.
27. A processing system in accordance with any one of claims 20 to
26, wherein the text object comprises a plurality of component
nodes arranged according to the semantic structure of the
free-format data, the component nodes being arranged in a hierarchy
corresponding to the semantic structure of the free-format data,
and each component node including additional data relating to the
corresponding element of free-format data.
28. A processing system in accordance with any one of claims 20 to
27, the text object means for generating matching values for
comparing an element of the free-format data with an element of
other free-format data processed by the processing system.
29. A processing system in accordance with claim 28, wherein the
matching value is a phonetic value for phonetically comparing
elements of free-format data.
30. A processing system in accordance with any one of claims 20 to
29, wherein the text object includes implied data relating to
information implied from the free-format data.
31. A processing system in accordance with any one of claims 20 to
30, wherein the system is arranged to process a plurality of
free-format data records and produce a text object associated with
each free-format data record.
32. A processing system in accordance with claim 31, wherein the
means for producing additional data is arranged to produce a text
object index including attribute-type identifiers for elements of
each data record and pointers to each data record and wherein the
query processing means is arranged to access the text object index
to provide answers to queries relating to the semantic and
syntactic information about the data and/or to access the data to
manipulate the data.
33. A processing system in accordance with claim 32, wherein the
text object index includes representative value keys for entries,
which give a value representative of a feature of the element
associated with the attribute-type identifier for the entry for
facilitating matching with other free-format data processed in
accordance with this system.
34. A processing system in accordance with any one of claims 20 to
33, further comprising a domain object, the domain object being
arranged to carry out the examination process by parsing the
free-format data in accordance with grammar rules.
35. A processing system in accordance with claim 34, wherein the
domain object is produced by a domain construction process from
domain definition data files.
36. A processing system in accordance with claim 35, further
comprising a domain constructor for carrying out the domain
construction process.
37. A processing system in accordance with claim 35 or claim 36,
wherein the domain definition data files include character
definition data, regular expression definition data and grammar
data.
38. A processing system in accordance with any one of claims 20 to
37, wherein the free-format data is postal address data.
39. A processing system in accordance with any one of claims 20 to
38, wherein the query processing means is arranged to carry out
normal database operations on the data via the additional data.
40. A method of enabling access to free-format data stored in a
computing system, including a plurality of free-format data
records, comprising the steps of storing additional data relating
to semantic and syntactic information (attributes) about the data
for each data record, the additional data being in the form of a
text object associated with each data record, the text object
including pointer means enabling access to elements of each
free-format data record, the additional data being accessible by a
query processing means to provide answers to queries relating to
the semantic and syntactic information about the data and/or to
access the data to manipulate the data.
41. A processing system for enabling access to free-format data
stored in a computing system, including a plurality of free-format
data records, the processing system comprising additional data
relating to semantic and syntactic information (attributes) about
the data for each data record, stored and accessible by the
processing system, the additional data being in the form of a text
object associated with each data record, the text object including
pointer means enabling access to elements of each free-format data
record, and a query processing means arranged to access the
additional data to provide answers to queries relating to the
semantic and syntactic information about the data and/or to access
the data to manipulate the data.
42. A method of enabling access to free-format data stored in a
computing system, including a plurality of free-format data
records, comprising the steps of storing additional data relating
to semantic and syntactic information (attributes) about the data
of each data record, the additional data being in the form of a
text object index which includes attribute-type identifiers for
elements of each data record and pointers to each data record, the
text object index being accessible by a query processing means to
provide answers to queries relating to the semantic and syntactic
information about the data and/or to access the data to manipulate
the data.
43. A processing system for enabling access to free-format data
stored in a computing system, including a plurality of free-format
data records, the processing system comprising the additional data
relating to semantic and syntactic information (attributes) about
the free-format data for each data record, the additional data
being in the form of a text object index which includes attribute
type identifiers for elements of each data record and pointers to
each data record, and a query processing means arranged to access
the additional data to provide answers to queries relating to the
semantic and syntactic information about the data and/or to access
the data to manipulate the data.
44. A method of accessing free-format data processed in accordance
with the method of any one of claims 1 to 19 comprising the steps
of accessing the additional data to provide answers to queries
relating to the semantic and syntactic information about the data
and/or to access the data to manipulate the data.
45. A processing system for enabling access to free-format data
processed in accordance with the method of any one of claims 1 to
19, the processing system including a query processing means
arranged to access the additional data and provide answers to
queries relating to the semantic and syntactic information about
the data and/or to access the data to manipulate the data.
46. A processing system for processing free-format data stored in a
computing system, comprising means for examining elements of the
data to determine attributes of the data, by examining the content
of the elements and the contextual relationship of elements to each
other, to determine semantic and syntactic information (attributes)
about the data, and a query processing means for utilising this
information to provide answers to queries relating to the semantic
and syntactic information about the data and/or to access the
data.
47. A processing system in accordance with claim 46, wherein the
examining means retains the free-format data as stored in the
computer system, without affecting it.
48. A method of processing free-format data stored in a computing
system, comprising the steps of examining elements of the data to
determine attributes of the data, by examining the content of the
elements and the contextual relationships of elements to each
other, to determine semantic and syntactic information (attributes)
about the data, and querying the data using this information to
provide answers to queries relating to the semantic and syntactic
information about the data and/or to access the data.
49. A method of processing free-format data in accordance with
claim 48, wherein the free-format data is unaffected by the
examining process and remains stored in the computing system as it
was originally stored.
50. A computer readable memory storing instructions for controlling
a computer to process free-format data stored in a computing
system, in accordance with the method of any one of claims 1 to
19.
51. A computer readable memory storing instructions for controlling
a computer to process free-format data stored in a computing
system, in accordance with the method of claim 48.
52. A method of processing a plurality of records of free-format
data stored in a computing system, comprising the steps of, for
each record, examining elements of the data to determine attributes
of the data, by examining the content of the elements and the
contextual relationships of elements to each other, to determine
semantic and syntactic information (attributes) about each record,
and producing virtual data fields associated with each record
enabling access to this information and the associated elements,
whereby each record is provided with associated virtual data fields
enabling access to semantic and syntactic information about that
record and also access to the associated elements.
53. A processing system for processing free-format data records
stored in a computing system, comprising means for examining
elements of the data of each record to determine attributes of the
data, by examining the content of the elements and the contextual
relationship of elements to each other, to determine semantic and
syntactic information (attributes) about the data, and means for
producing virtual data fields associated with each record enabling
access to this information and the associated elements, whereby
each record is provided with associated virtual data fields
enabling access to semantic and syntactic information about that
record and also access to the associated elements.
Description
[0001] The present invention relates generally to the processing,
storage and analysis of information in the form of free-format
data, and particularly, but not exclusively, to a method and
apparatus for interpreting free-format text.
BACKGROUND OF THE INVENTION
[0002] One of the main purposes of computer systems is to manage
information. This management of information is performed internally
by data management systems. Generally, data management systems may
be divided into two categories: 1) Database management systems; and
2) Text search and retrieval systems.
[0003] The first type of data management system imports and
manipulates data into internal representations so that the data may
be located and modified. When required, these systems generate a
suitable representation of this data which is read by humans or
used by another system. This category of data management system
includes: hierarchical, network, relational, object-oriented
database management systems and knowledge based management
systems.
[0004] Within hierarchical, network and relational databases,
information about an entity (a transaction, a stock item, a person,
a company, an address etc.) is usually referred to as a "record"
(although sometimes a record may contain information about many
entities). Within each record the various "attributes" of the
entity are usually classified into "fields".
[0005] Within object-oriented database management systems and
knowledge based management systems these basic units may have other
names such as "object" and the information regarding the object may
have names such as "slot" or "member". Each of the attribute
fields/slots has a format which can be, for example, integer, real
number, boolean, character etc. Others are records/objects. Some
fields/slots have specific formats (e.g., date, time), but yet
others are free-format text.
[0006] Once the database has been constructed, it may be used to
perform the following operations:
[0007] Add a record/object
[0008] Locate and change a record/object
[0009] Locate and delete a record/object
[0010] Retrieve information
[0011] These operations will be referred to as "normal database
operations".
[0012] Storing of information about an entity in fields/slots is
suitable for many types of data. There are, however, some types of
data which do not have a suitable standard structure. One of the best
examples of data which does not have a standard structure is
"address" data. As most databases store people's address
information in one, two or three free-format fields, performing
normal database operations on individual attributes of the address
is very difficult. Note that the term "attribute" is used in this
specification to refer to a property of an "element" of data.
[0013] For example, the free-format data "12 Pitt Street, NORTH
SYDNEY" has a number of "elements". Each element has an associated
"attribute". An attribute of the element "NORTH" is that it is a
"geographical indicator". An attribute of the element "12" is that
it is a "number". Note that the "low level" elements correspond to
the "tokens" of data i.e., the element "NORTH" is a token of the
data. The data also includes higher level elements, however, e.g.,
"NORTH SYDNEY" is an element which includes two tokens and this
element has the attribute of being a "town". An attribute of the
entire data "12 Pitt Street, NORTH SYDNEY", i.e., the total
"element" is that it is an "address". An alternative term for
element is "component".
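The element and attribute terminology above can be illustrated with a minimal sketch. The Element class, its field names and the hand-assigned attribute labels are hypothetical and are not taken from this specification; the offsets simply point back into the unchanged address string.

```python
from dataclasses import dataclass

ADDRESS = "12 Pitt Street, NORTH SYDNEY"


@dataclass(frozen=True)
class Element:
    """Hypothetical record: an attribute label plus a span of the original text."""
    attribute: str
    start: int   # offset of the element's first character in the source
    length: int  # character length of the element

    def text(self, source: str) -> str:
        return source[self.start:self.start + self.length]


def element_for(source: str, fragment: str, attribute: str) -> Element:
    """Locate a fragment in the source and label it (labels are assigned by hand here)."""
    return Element(attribute, source.index(fragment), len(fragment))


elements = [
    element_for(ADDRESS, "12", "number"),                     # low-level token
    element_for(ADDRESS, "NORTH", "geographical indicator"),  # low-level token
    element_for(ADDRESS, "NORTH SYDNEY", "town"),             # higher-level element
    element_for(ADDRESS, ADDRESS, "address"),                 # the total element
]

for e in elements:
    print(f"{e.attribute}: {e.text(ADDRESS)!r}")
```

Note that "NORTH" appears twice in the list, once as a token and once inside the higher-level element "NORTH SYDNEY", which is the point the paragraph makes about low-level and higher-level elements.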
[0014] Providing each element of this free-format data
with its own field for the associated attribute would increase the
size and complexity of the database quite significantly, even for
this simple example of addresses. Where the database includes
information on people, together with their addresses, for example,
in order to avoid complexity, and particularly with older
databases, address data may be stored in a single field labelled
"address". This field contains the address in free-format form and
it is therefore not possible with present database technology to
perform normal database operations on individual elements of the
address; those elements cannot be accessed separately (apart from
the total combination of elements which makes up the address, which
can, of course, be accessed as a whole, as "address").
[0015] This problem is to some extent addressed by the science of
database scrubbing/cleansing. This field of commercial endeavour
applies parsing processes to free-format text with the objective of
creating new database fields for the attributes of the free-format
text and entering into those fields completely standardised data.
This standardising of data includes converting all spelling
variations into one consistent set (e.g., "Street" -> "St"). The
above example would produce the following:
TABLE 1
House Number | Street Name | Street Type | City
12 | Pitt | St | Sydney
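As a rough illustration of the kind of standardisation described above, the following sketch splits a very simple address layout into fixed fields and normalises the street type. The regular expression, the abbreviation table and the function name scrub are assumptions made for this example only; commercial cleansing packages are far more elaborate.

```python
import re

# Hypothetical abbreviation table; a real cleansing product would use a far larger one.
STREET_TYPES = {"street": "St", "road": "Rd", "avenue": "Ave"}

# Assumed layout: "<number> <street name> <street type>, <city>"
PATTERN = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\w+)\s*,\s*(.+?)\s*$")


def scrub(address: str) -> dict:
    """Split a simple address into fixed fields and standardise the street type."""
    match = PATTERN.match(address)
    if match is None:
        raise ValueError(f"unrecognised address layout: {address!r}")
    number, name, street_type, city = match.groups()
    return {
        "House Number": number,
        "Street Name": name.title(),
        "Street Type": STREET_TYPES.get(street_type.lower(), street_type),
        "City": city.title(),
    }


print(scrub("12 Pitt Street, Sydney"))
```

Any address that does not fit the assumed layout is rejected, which reflects the limitation of the approach: every attribute needs its own field to be decided in advance.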
[0016] The new database fields are then used to perform normal
database operations. An entire industry is devoted to this field,
applying large, complex and expensive software packages to take
information stored in databases, analyse and process the
information to produce new databases including more fields for the
attributes of the information records, thus providing more
flexibility for operations which can be applied to the records.
[0017] Much has been written about the field of database
cleansing/scrubbing (see e.g., "Dealing with Dirty Data" DBMS
Magazine, September, 1996). The process is expensive: a complete
cleansing operation for a large database can cost millions of
dollars, as it is so time consuming and the software packages that
have been developed to cleanse databases are very complex. It
is also still limited by the fundamental requirement that to perform
database operations on an element, the element must have a field to
itself.
[0018] This brings us to the second major problem which afflicts
the present methods of storing computerised information in
commercial databases. Practically all commercial data is stored
within hierarchical, relational databases or flat data files which
have a structure which is fixed at time of design, but information
by its very nature is complex and can have almost an infinite
number of different attributes. To create a database containing
fields for each and every attribute for each and all types of
different information is just not practical, if not totally
impossible, and any attempt to produce a
database containing fields for all the types of information
available to humanity would certainly be cost prohibitive.
[0019] Even a fairly trivial (although very important) example
illustrates the scale of the problem. Consider international
addresses, i.e., addresses the world over. Although four or five
free-format fields can contain any address, a database
table designed with a data field for every possible attribute of all
international addresses would contain hundreds, if not thousands, of
data fields. England has counties, the USA and Australia have states,
Japan has districts and a different ordering of addresses, etc.
[0020] The field of database cleansing/scrubbing is therefore a
partial solution at best. It still requires the same fundamental
database structure of a field for each data attribute. One can
build more and more complex databases but this problem can never be
completely resolved, and limits the computerised handling of
information significantly.
[0021] Natural language processing systems are known that employ
"Semantic Grammars" to encode semantic information into a syntactic
grammar. These systems are mainly used to provide natural language
interface to other systems such as a data base management system.
The following description comes from a book by Patterson, D. W.
"Artificial Intelligence and Expert Systems".
". . . They use context-free rewrite rules with non-terminal
semantic constituents. The constituents are categories or
metasymbols such as attribute, object, present (as in display or
print), and ship, rather than NP (Noun Phrase), VP (Verb Phrase), N
(Noun), V (Verb), and so on. . . . Semantic grammars have proven to
be successful in limited applications including LIFER, a data base
query system distributed by the US Navy . . . and a tutorial system
named SOPHIE which is used to teach the debugging of circuit
faults. Rewrite rules in these systems essentially take the forms
S -> What is <OUTPUT-PROPERTY> of <CIRCUIT-PART>?
OUTPUT-PROPERTY -> the <OUTPUT-PROP>
OUTPUT-PROPERTY -> <OUTPUT-PROP>
CIRCUIT-PART -> C23
CIRCUIT-PART -> D12
OUTPUT-PROP -> voltage
OUTPUT-PROP -> current
In the LIFER system, there are rules to handle numerous forms of
wh-queries such as
What is the name of the carrier nearest to New York?
Who commands the Kennedy?
etc . . . These sentences are analyzed and words matched to
metasymbols contained in lexicon entries. For example, the input
statement `Print the length of the Enterprise` would fit with the
LIFER top grammar rule (LTG) of the form
<LTG> -> <PRESENT> the <ATTRIBUTE> of <SHIP>
where print matches <PRESENT>, length matches <ATTRIBUTE>, and the
Enterprise matches <SHIP>. Other typical lexicon entries that can
match <ATTRIBUTE> include CLASS, COMMANDER, FUEL, BEAM, LENGTH, and
so on."
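The matching of an input statement against a semantic grammar rule, as in the quoted passage, can be sketched as follows. This is a toy illustration only: the LEXICON contents, the RULE encoding and the match function are assumptions made for the example and do not reproduce how LIFER itself is implemented.

```python
# Hypothetical lexicon: metasymbol -> phrases that may fill it (lower-cased).
LEXICON = {
    "PRESENT": {"print", "display"},
    "ATTRIBUTE": {"class", "commander", "fuel", "beam", "length"},
    "SHIP": {"the enterprise", "the kennedy"},
}

# <LTG> -> <PRESENT> the <ATTRIBUTE> of <SHIP>
RULE = ["<PRESENT>", "the", "<ATTRIBUTE>", "of", "<SHIP>"]


def match(rule, words, bindings=None):
    """Return metasymbol bindings if the words fit the rule, otherwise None."""
    bindings = dict(bindings or {})
    if not rule:
        # The rule is exhausted; succeed only if the input is exhausted too.
        return bindings if not words else None
    head, rest = rule[0], rule[1:]
    if head.startswith("<"):
        symbol = head.strip("<>")
        # Try every prefix of the remaining words as a candidate phrase.
        for n in range(1, len(words) + 1):
            phrase = " ".join(words[:n]).lower()
            if phrase in LEXICON[symbol]:
                result = match(rest, words[n:], {**bindings, symbol: phrase})
                if result is not None:
                    return result
        return None
    # Literal word in the rule: it must equal the next input word.
    if words and words[0].lower() == head:
        return match(rest, words[1:], bindings)
    return None


print(match(RULE, "Print the length of the Enterprise".split()))
```

The result binds each metasymbol to the phrase that filled it, which is the structured representation such an interface would then hand to the underlying database query.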
[0022] These types of systems receive information in structured or
free-format form and convert it to their own representations.
[0023] Although the interface is flexible, the database they
interface to has a fixed structure and these systems are unable to
perform changes on the original (human readable) data.
[0024] Indeed there are many prior art systems which provide
"Natural Language" interfaces to structured databases. All of these
systems provide translation from "Natural Language" into some form
of structured data and suffer from the same problems described
above.
[0025] Refer to U.S. Pat. No. 4,787,035, Bourne, D.
"META-INTERPRETER" and U.S. Pat. No. 5,454,106, Burns, L.,
Malhotra, A., "Database retrieval system using natural language for
presenting understood components of . . . " for examples of such
systems.
[0026] As discussed earlier, one type of database management
system is the knowledge based management system (KBMS).
[0027] These systems employ the concept of attribute "slots" on an
object. Slots provide or change information regarding the object
either directly, by acting on the stored values, or indirectly, through
procedures. A simple example of "slots" will illustrate the
concept: a "Square" object has two attribute slots, "Length" and
"Area". The "Area" slot does not need to store a value because its
value can be calculated by squaring the "Length" value.
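The "Square" example can be sketched in a few lines. The class below is an illustrative assumption rather than code from any particular KBMS: "Length" is a stored slot, while "Area" is a slot served by procedures for both reading and changing.

```python
class Square:
    """Object with a stored "Length" slot and a procedural "Area" slot."""

    def __init__(self, length: float) -> None:
        self.length = length  # stored slot: the value is held directly

    @property
    def area(self) -> float:
        # Derived slot: nothing is stored, the value is computed on demand.
        return self.length ** 2

    @area.setter
    def area(self, value: float) -> None:
        # Changing the derived slot indirectly updates the stored slot.
        self.length = value ** 0.5


square = Square(3)
print(square.area)    # read through the procedure
square.area = 16      # change through the procedure
print(square.length)
```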
[0028] Although these types of systems do not require fixed
database structures, they do, however, need to transform the
original data into internal data representations which must be put
through a very processing-intensive "language generation" process to
produce information that is understandable by humans. If these
types of systems were required to maintain the original data for
use by other systems and humans, a small change would require the
whole text string to be regenerated.
[0029] The text search and retrieval category of data management
system does not import the data but builds searchable indices which
point to the original data. This category includes: document
storage & retrieval systems; and Internet search engines.
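A searchable index that points back to the original data, rather than importing it, can be sketched as a simple inverted index. The sample documents, the build_index function and the choice of (document id, character offset) pointers are assumptions made for illustration; they are not drawn from any specific retrieval product.

```python
import re
from collections import defaultdict

# Hypothetical documents; they are only read, never copied or rewritten.
DOCUMENTS = {
    "doc1": "12 Pitt Street, NORTH SYDNEY",
    "doc2": "Report on street lighting in Sydney",
}


def build_index(documents):
    """Map each lower-cased word to (document id, character offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for m in re.finditer(r"\w+", text):
            index[m.group().lower()].append((doc_id, m.start()))
    return index


index = build_index(DOCUMENTS)

# Each hit leads back to the original, human readable text.
for doc_id, offset in index["sydney"]:
    print(doc_id, offset, DOCUMENTS[doc_id][offset:offset + 6])
```

The index records where each individual word occurs but nothing about what the word means in context, so a search for "sydney" cannot tell a town in an address from a word in a report title.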
[0030] These types of systems have been very successful because they
leave the original information in human readable form. This basic
principle means that, unlike the prior art database systems described
above, the underlying data can be very easily shared with many
systems of this type. Another reason for their success is that
improvements in technology can be implemented without requiring
conversion of the original data. Data conversion is not only
extremely expensive, but it is also a major source of data
errors.
[0031] There are, however, major drawbacks in using this type of
system to manage data compared with the database systems described
above. The major limitation is that the data cannot be
manipulated: it cannot be modified, it must remain as it is. Other
database functions which are very difficult to perform include:
[0032] Cross checking and validating the data
[0033] Integrating the data with database systems
[0034] Sorting and classifying the text data
[0035] From these limitations, we can see that this category of
data management system is suited to unstructured data which does
not need to be changed.
[0036] In text search and retrieval systems, it is known to process
a document collection to identify specific attributes of each
document such as its "subject" topic. The types of documents which
have been processed by this type of system include books,
newspapers, reports, manuals and e-mail messages.
[0037] Most of these types of systems, however, only look for
individual words to match and do not look at words in context. Some
others identify words that are nouns but do not classify the type
of noun. Both are unsuitable for data such as address data, which
contains a large proportion of proper nouns.
[0038] Further, the original data cannot be changed within
context.
[0039] For more information regarding this area, refer to the works
published by Gerald Salton.
[0040] Note that the term "text object" as used in the following
description should not be confused with the terminology "text
object" which has been used in systems to describe software
techniques which assist in the storage and transfer of pieces of
text data between computer systems by encapsulating the text
string. Techniques which have used the term "text object" range
from the "String" object employed within Apple Computer's operating
systems (where the object contains a leading two byte "length"
value and the text string) to the "Compound String" object employed
by the X-Windows operating system (where the object encapsulates
multiple encodings, language translations and font styles of one
piece of information.)
SUMMARY OF THE INVENTION
[0041] From a first aspect the present invention provides a method
of processing free-format data stored in a computing system,
comprising the steps of examining elements of the data to determine
attributes of the data, by examining the content of the elements
and the contextual relationships of elements to each other, to
determine semantic and syntactic information (attributes) about the
data, producing additional data relating to this information, in
the form of a text object which includes pointer means enabling
access to the elements of the free-format data, and the additional
data being accessible by a query processing means to provide
answers to queries relating to the semantic and syntactic
information about the data and/or to access the data to manipulate
the data.
[0042] The term "text object" as used in the current specification
does not encapsulate a text string, as discussed above. The text
object in the terms of the present invention provides a "semantic
layer" between the actual text data and, for example, an
application software system which may need to access and/or
manipulate the text data.
[0043] In its simplest form, as defined above, the text object is
the additional data, related to the semantic and syntactic
information obtained from examination of the data elements, and a
pointer means (such as a key) which can lead back to the elements
of the free-format data (e.g., back to the text string which forms
the free-format data).
[0044] The additional data preferably allows identification of the
attributes of the data which have been obtained by the examination
of the data. For example, in the "12 Pitt Street, NORTH SYDNEY"
example given in the preamble, the various attributes of the data,
e.g., "street" equals "12 Pitt Street"; "street number" equals
"12"; "town" equals "NORTH SYDNEY", etc., are identified by the
additional data and the pointer means preferably allows access to
the elements of the data which are associated with those
attributes. The additional data effectively provides "virtual data
fields"--the data fields do not exist as they do in a normal
database which would have a column field head for each attribute.
Nevertheless the free-format data can be accessed on an attribute
by attribute basis using the present invention, as if actual fields
for those attributes did exist. The preferred embodiment of the
invention thus operates to create "virtual data fields" which,
preferably, allow all normal database operations on free-format
text, without having to create actual database fields for the
free-format text. The free-format text can remain stored as it is
in the same location (usually database).
[0045] The significance of this becomes apparent when one considers
the processing of many records of free-format data, for example
international address data. As discussed above, although four or
five address fields could store all international address data in
free-format form, each data record can have many attributes, which
differ from attributes of other addresses, e.g., England has
counties, the USA has states. To produce actual conventional
database fields for all the attributes for international addresses
would be an almost impossible task. However, with the present
invention, each record of free-format data can be taken and
processed to produce a (small) number of virtual data fields for
that particular record in the form of a text object. The text
object for each record can then be queried separately by an
appropriate query processing means to provide all the normal
database operations for that record. The data itself may stay in
place. As a separate text object is created for each record, there
is no problem with having different virtual data fields for each
record. We do not have to create a large database with many fields;
instead we leave the database records as they are and create many
text objects, one for each record, to give many virtual fields
overall, but few virtual fields for each text object.
[0046] The step of examining preferably includes the step of
parsing the free-format data.
[0047] A text object preferably enables manipulation of the data to
carry out all the normal database operations, such as changing the
record, locating an element of the record, retrieving information
from the record, etc. The information which may be provided by the
text object preferably includes information on the elements of the
data. In a preferred embodiment, the information may also include
matching information (such as phonetics) to facilitate comparison
of one record of data with another record of data, parsing priority
information to assist in the processing of ambiguous free-format
text, etc.
[0048] It is believed that this new approach will lead to computers
being able to manipulate free-format data in much the same way as
human beings do. There is no need to disassemble the data record
according to its attributes and place standardised values for each
attribute type into an appropriate field in a database (as is
conventional practice), once the appropriate column names for the
database have been determined. Each text object for each data
record provides all the processing and information the computer
needs to provide all the normal database operations. The attribute
types of, for example, international addresses can be compared,
manipulated, etc., without it being necessary to provide a complex
database with many fields.
[0049] The text object preferably includes attribute-type
identifiers accessible to enable identification of attributes of
the free-format data and pointer means for locating elements of the
data having the particular attribute.
[0050] In a preferred embodiment, the text object comprises a
plurality of parts in the form of "component nodes". Preferably, a
plurality of component nodes may be associated together in a text
object in a predetermined hierarchy. For example, a plurality of
component nodes may be considered to be "nested" together in the
form of a "text node tree" which may have a plurality of branches
associating various component nodes with each other in a
predetermined hierarchy. Each component node may comprise:
[0051] an attribute type identifier (for the classification of an
attribute of the free-format data which is associated with that
component node);
[0052] a pointer to the beginning of a sub-string within the text
object's text string (i.e. beginning of the element associated with
the component node);
[0053] an integer containing the character length of the element
sub-string (of the data);
[0054] zero, one or more other component nodes (nested within this
component node or otherwise associated with the component nodes so
that the other component nodes can be accessed via the component
node) preferably stored as an array;
[0055] a matching weight (to indicate the relative importance of
this element when performing comparisons with other text
objects);
[0056] a boolean variable indicating whether this attribute type
identifier is a low level matching element; and
[0057] depending on time/space considerations, one or more values
to assist in the matching process. (See section on "text string
operations" below for more details.)
[0058] a parsing priority value (giving a notional "priority" to
the elements of the free-format data associated with the component
node so that a priority may be allocated and used in the
determination of the best interpretation of free-format text when
ambiguities exist).
[0059] Other component nodes may not be physically nested within
the component node but each component node may just contain a list
of pointers to subordinate component nodes so that the subordinate
component nodes can be "found" from the component node which
includes the list.
[0060] Each component node preferably relates to one particular
attribute of the free-format data, as identified by the attribute
type identifier in the component node. Component nodes which are
relatively high in hierarchy may contain or point to a plurality of
other component nodes, whereas those component nodes which are the
lowest in the hierarchy may not contain or point to any other
component nodes as the next step down in the hierarchy is the
associated element of the free-format data.
[0061] The hierarchy is determined by the parsing of the
free-format data. E.g., one attribute of a record of address data
may be a <Street>, e.g. "12 Pitt Street". Sub attributes of
the <Street> component are <Street number> "12",
<Street name> "Pitt" and <Street type> "Street". The
<Street> component node will therefore list three other sub
component nodes, having attribute type identifiers <Street
number>, <Street name> and <Street type>.
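By way of illustration only, the component node fields listed above and the "12 Pitt Street" example might be sketched in Python as follows. The class name ComponentNode, its field names and the default values are assumptions made for this sketch and are not taken from the specification; the pointer to the sub-string is modelled here as a character offset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentNode:
    """One node of a text node tree (names are illustrative assumptions)."""
    attribute_type: str            # attribute type identifier, e.g. "Street"
    start: int                     # offset of the element within the text string
    length: int                    # character length of the element sub-string
    children: List["ComponentNode"] = field(default_factory=list)
    matching_weight: float = 1.0   # relative importance when comparing text objects
    low_level_match: bool = False  # True if this is a low level matching element
    parsing_priority: int = 0      # used to choose between ambiguous parses

    def text(self, source: str) -> str:
        """Return the element of the free-format data this node points to."""
        return source[self.start:self.start + self.length]


source = "12 Pitt Street"
street = ComponentNode("Street", 0, 14, children=[
    ComponentNode("Street number", 0, 2, low_level_match=True),
    ComponentNode("Street name", 3, 4, low_level_match=True),
    ComponentNode("Street type", 8, 6, low_level_match=True),
])

for child in street.children:
    print(child.attribute_type, "=", child.text(source))
```

The nodes hold only offsets and lengths, so the free-format string itself is never copied or restructured; each attribute is read back from the original text on demand.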
[0062] Preferably, each component node could itself be considered to
be a text object. This recursive definition allows all the
functions of the text object of the present invention to be applied
to each attribute.
[0063] The text object may also comprise other data structures
which assist in the quick location of specific component nodes. An
example of such a structure is a lookup table containing all the
attribute type identifiers and a pointer to their associated
component nodes.
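A lookup table of this kind might be sketched as follows. This is a minimal illustration only: the nested-dictionary layout of the tree and the function name build_lookup are assumptions, not structures defined in the specification.

```python
from collections import defaultdict

# A text node tree as nested dicts; this layout is an assumption for the sketch.
tree = {
    "type": "Street", "start": 0, "length": 14,
    "children": [
        {"type": "Street number", "start": 0, "length": 2, "children": []},
        {"type": "Street name", "start": 3, "length": 4, "children": []},
        {"type": "Street type", "start": 8, "length": 6, "children": []},
    ],
}


def build_lookup(root):
    """Map each attribute type identifier to the component nodes that carry it."""
    table = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        table[node["type"]].append(node)
        stack.extend(node["children"])
    return table


lookup = build_lookup(tree)
text = "12 Pitt Street"
node = lookup["Street name"][0]
print(text[node["start"]:node["start"] + node["length"]])
```

With the table built once, a query for a given attribute type goes straight to the relevant node rather than walking the tree each time.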
[0064] The query processing means is preferably a software
application engine which is configured to be able to use the text
object to answer questions on the data and access the data to
manipulate it (e.g., correct it if it is in error).
[0065] The method preferably also includes the further step of
preparing an "index" which facilitates comparison of elements of a
plurality of records of free-format data. The index is preferably
in the form of a table (termed by the inventors a "text object
index") including columns, column headings and data, very much in
the same way as a conventional database, except that it is prepared
from the additional data for each of the plurality of data
records.
[0066] The text object index preferably includes a table with a
column for the attribute type identifier, a column for
representative value keys and a column for user supplied record
identifiers. The representative value key preferably provides a
value representative of a feature of the element associated with
the appropriate component type identifier, e.g., a phonetic value
for elements which are proper nouns (e.g., Smith) or a numeric
identifier for common words (e.g., Street). The section on text
string matching below contains more details regarding the values of
the representative key value. The user supplied record identifier
will identify to the user which record of free-format data is being
compared or accessed, i.e., it is a pointer which enables access to the
record.
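The three-column table described above might be sketched as follows. The function representative_key is a crude stand-in that drops vowels; it is an assumption made for illustration and is not the phonetic or numeric keying scheme of the specification. The record identifiers and sample values are likewise hypothetical.

```python
def representative_key(value):
    """Stand-in for a representative value key (an assumption for this sketch).

    Upper-cases the value and drops vowels (and Y) after the first letter, so
    that spelling variants such as "Smith" and "Smyth" share one key.
    """
    upper = value.upper()
    return upper[:1] + "".join(ch for ch in upper[1:] if ch not in "AEIOUY")


# (record identifier, {attribute type: element value}); hypothetical sample data.
parsed_records = [
    ("R1", {"Surname": "Smith", "Street name": "Pitt"}),
    ("R2", {"Surname": "Smyth", "Street name": "Box"}),
]

# Each index row: (attribute type identifier, representative value key, record id).
index = [
    (attr, representative_key(value), record_id)
    for record_id, attributes in parsed_records
    for attr, value in attributes.items()
]


def find(attribute_type, value):
    """Return identifiers of records whose element shares the key of `value`."""
    key = representative_key(value)
    return [rid for attr, k, rid in index if attr == attribute_type and k == key]


print(find("Surname", "Smith"))
```

Because the key is stored per attribute type, a search for the surname "Smith" cannot be satisfied by a street or town that happens to share the same key.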
[0067] Where a text object index is prepared, a text object having
a plurality of component nodes containing attribute-type
identifiers and other data may not be necessary. All that may be
required to access the data and carry out database operations is
the query processing engine and the text object index. The text
object index may be prepared directly from the examination of the
data and the text object index includes text objects for a
plurality of records (i.e., additional data plus pointer to
record). The text object as a separate "component node structure"
can therefore be dispensed with or is not needed in the first place
as a separate entity; instead it is incorporated in the text object
index as additional data plus pointers.
[0068] Where the text object includes "matching" values (or
procedures to create these values) for low level matching elements
of the free-format text, it is possible, for example, to compare
records including elements which are in different written
languages. For example, a free-format record containing a street
name value in Kanji, may be compared with a street name element in
Arabic by comparing respective matching values. The street name for
each record could be the same street, but merely being expressed in
different languages in the free-format data. The matching
information provided by this aspect of the present invention
therefore enables comparison of elements of free-format text
expressed in different written languages.
[0069] Matching values may be generated during processing of text
objects, and need not be stored in the text object. That is, they
can be generated "on the fly" via procedures designated by the
query processing engine. See later on in the description.
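One way such a matching value could be generated "on the fly" is by transliteration into a common script. The sketch below uses Greek rather than Kanji or Arabic purely to keep the table short; the four-letter table and the function name matching_value are assumptions for illustration, and a practical table would cover the whole alphabet.

```python
import unicodedata

# A deliberately tiny Greek-to-Latin table (an assumption for this sketch).
GREEK_TO_LATIN = {"α": "a", "θ": "th", "η": "i", "ν": "n"}


def matching_value(text):
    """Derive a script-independent matching value for a low level element."""
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Letters missing from the table (e.g. Latin letters) pass through unchanged.
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in stripped)


print(matching_value("Αθήνα") == matching_value("Athina"))
```

The two elements are never compared as raw strings; only their derived matching values are compared, which is why nothing extra has to be stored in the text object.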
[0070] In the method of the present invention, the step of
examining the elements of the data to determine the components
preferably comprises the step of parsing the free-format data in
accordance with grammar rules applied by a domain object. The
domain object is preferably constructed by a domain construction
process which uses as input data: character definition data,
regular expression definition data, and grammar data.
[0071] The hierarchy of the component nodes of the text node tree
is preferably determined by the grammar rules for the particular
domain object.
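As a minimal sketch of parsing under grammar rules, the fragment below applies a single hypothetical rule to a street string. The regular expression definitions, the rule itself and the function name parse_street are assumptions for illustration and do not reproduce the domain object of the preferred embodiment.

```python
import re

# Regular expression definitions for token classes (illustrative assumptions).
TOKEN_PATTERNS = {
    "Street number": re.compile(r"\d+"),
    "Street name": re.compile(r"[A-Za-z]+"),
    "Street type": re.compile(r"Street|St|Road|Rd|Avenue|Ave", re.IGNORECASE),
}

# One grammar rule: a <Street> is a number, then a name, then a type.
STREET_RULE = ["Street number", "Street name", "Street type"]


def parse_street(text):
    """Return (attribute type, start, length) triples, or None if the rule fails."""
    tokens = list(re.finditer(r"\S+", text))
    if len(tokens) != len(STREET_RULE):
        return None
    nodes = []
    for token, expected in zip(tokens, STREET_RULE):
        if not TOKEN_PATTERNS[expected].fullmatch(token.group()):
            return None
        nodes.append((expected, token.start(), len(token.group())))
    return nodes


print(parse_street("12 Pitt Street"))
print(parse_street("PO Box 92"))
```

The order of attribute types in the rule is what fixes the order and nesting of the resulting nodes, which is the sense in which the grammar determines the hierarchy.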
[0072] An embodiment of the present invention may be implemented by
a software application which includes a domain object and a query
processing means. The domain object is arranged to examine
free-format data to produce a text object which can be then used by
the query processing means to enable all database operations on the
free-format data. The free-format data may be stored in any
conventional way, such as in a conventional database on a computer
system. The free-format data may also be stored as a string in the
text object. The software application comprising the domain object
and query processing engine would be used to process the data
without affecting its storage in the database. Other software
applications could therefore interface with the database as normal,
i.e., the database remains totally unaffected as far as its
operation is concerned apart from the fact that the domain object
and query processing means can be used to enhance the capabilities
of the database by providing access to all the elements of the
free-format data.
[0073] As well as allowing access to data in free-format data
fields which has previously been unavailable without data cleansing
and preparation of new databases with more fields, the present
invention also has great potential for the future structuring and
ordering of data. For example, using the present invention it may
be possible to greatly reduce the number of fields which are
required to store data in a database. Considering the example given
above, of international name and address data, at present it is not
possible for a database to deal with international address data in
a single field--because international address data has many
different attributes. With the present invention, however,
international addresses may be kept in single free-format field
containing all the international address records. Processing by the
present invention provides each individual international address
record with its own set of virtual data fields allowing comparison
with other records via the query processing means, manipulation and
access to information of all elements of each data record. Indeed,
it is possible to provide a single domain object for all
international addresses. Any free-format data could be processed in
this way. The invention is not limited to address data.
[0074] From yet a further aspect, the present invention provides a
method of enabling access to free-format data stored in a computing
system, including a plurality of free-format data records,
comprising the steps of storing additional data relating to
semantic and syntactic information (attributes) about the data for
each data record, the additional data being in the form of a text
object associated with each data record, the text object including
pointer means enabling access to elements of each free-format data
record, the additional data being accessible by a query processing
means to provide answers to queries relating to the semantic and
syntactic information about the data and/or to access the data to
manipulate the data.
[0075] The text object preferably includes any or all of the
properties of the text object as discussed above in relation to the
first aspect of the invention and the text object is preferably
produced by an examination including any or all of the features as
discussed above. The present invention further provides a method of
enabling access to free-format data stored in a computing system,
including a plurality of free-format data records, comprising the
steps of storing additional data relating to semantic and syntactic
information (attributes) about the data of each data record, the
additional data being in the form of a text object index which
includes attribute-type identifiers for elements of each data
record and pointers to each data record, the text object index
being accessible by a query processing means to provide answers to
queries relating to the semantic and syntactic information about
the data and/or to access the data to manipulate the data.
[0076] The text object index preferably includes any or all of the
properties of the text object index as discussed above in relation
to the first aspect of the invention. The text object index is
preferably produced by process steps as discussed above in relation
to the first aspect of the invention.
[0077] From yet a further aspect, the present invention provides a
processing system for processing free-format data stored in a
computing system, the apparatus including means for examining
elements of the data to determine attributes of the data, by
examining the content of the elements and the contextual
relationships of elements to each other, to determine semantic and
syntactic information (attributes) about the data, means for
producing additional data relating to this information, in the form
of a text object which includes pointer means enabling access to
the elements of the free-format data, and a query processing means
which is arranged to access the additional data to provide answers
to queries relating to the semantic and syntactic information about
the data and/or to access the data to manipulate the data.
[0078] Preferably, the examination means and means for producing are
arranged to produce a text object with any or all of the features
as discussed above in relation to the first aspect of the invention, by
applying, preferably, the same methods of examination.
[0079] The present invention further provides a processing system
for enabling access to free-format data stored in a computing
system, including a plurality of free-format data records, the
processing system comprising additional data relating to semantic
and syntactic information (attributes) about the data for each data
record, stored and accessible by the processing system, the
additional data being in the form of a text object associated with
each data record, the text object including pointer means enabling
access to elements of each free-format data record, and a query
processing means arranged to access the additional data to provide
answers to queries relating to the semantic and syntactic
information about the data and/or to access the data to manipulate
the data.
[0080] The present invention further provides a processing system
for enabling access to free-format data stored in a computing
system, including a plurality of free-format data records, the
processing system comprising the additional data relating to
semantic and syntactic information (attributes) about the
free-format data for each data record, the additional data being in
the form of a text object index which includes attribute type
identifiers for elements of each data record and pointers to each
data record, and a query processing means arranged to access the
additional data to provide answers to queries relating to the
semantic and syntactic information about the data and/or to access
the data to manipulate the data.
[0081] The present invention yet further provides an apparatus
including a domain object arranged to process free-format data to
produce a text object, the text object including any or all of the
features of the text object as discussed above in relation to
previous aspects of the present invention.
[0082] In a preferred embodiment, the step of accessing the text
object may comprise querying one or more text objects for
attributes and obtaining the value of the element corresponding to
the queried attribute. For example, where the free-format data is
name and address data, a person may query the text object or
objects to see if there is a <Street> element, and, if so,
obtain the value of the element (e.g., "12 Pitt St"). This is
something that cannot be done with present databases where the
"address" field merely includes all the <address> in
free-format form. Other older systems provide search facilities
which scan for a particular text string without regard for the
semantics of the text being searched. These systems could be used
to find all addresses with a street name of "Pitt" by searching for
that string. This leads to problems when the string being searched
for can be used in different ways.
[0083] "76 Box Rd Townsville QLD"
[0084] "PO Box 92 Geelong VIC"
[0085] "39 Main St Box Hill NSW"
[0086] Attempting to locate all the addresses with a street name
of "Box" by scanning for the string "Box" will lead to many errors
being generated. The present invention, in the preferred
embodiment, will report only addresses which contain the correct term.
So, searching for a street name of "Box" will return records such
as:
[0087] "8 Box Ave Devonport TAS"
[0088] "76 Box Rd Townsville QLD"
[0089] "110 Box St Parramatta NSW"
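The difference between the two kinds of search can be sketched as follows. The virtual fields are written by hand here, whereas in the described system they would be derived by parsing; the attribute name "Postal delivery" and both function names are assumptions for illustration.

```python
# Each record: the original free-format text plus hand-written virtual fields.
records = [
    ("76 Box Rd Townsville QLD", {"Street name": "Box", "Town": "Townsville"}),
    ("PO Box 92 Geelong VIC", {"Postal delivery": "PO Box 92", "Town": "Geelong"}),
    ("39 Main St Box Hill NSW", {"Street name": "Main", "Town": "Box Hill"}),
]


def string_search(term):
    """Scan the raw text with no regard for what role the term plays."""
    return [text for text, _ in records if term in text]


def attribute_search(attribute_type, value):
    """Return only records where the named attribute has the given value."""
    return [text for text, fields in records if fields.get(attribute_type) == value]


print(string_search("Box"))
print(attribute_search("Street name", "Box"))
```

The string scan returns all three records, while the attribute query returns only the record in which "Box" is actually the street name.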
[0090] Consider the address examples in FIG. 2 of the drawings, and
suppose a system user wishes to locate all the addresses on "Box Rd"
within this data. If the user searches for "Box Rd", the system would
return record 201, but miss records 205 and 207. If the user
changes the search text to "Box", the system would return all the
required records, but would also erroneously return records 202,
203, 204 and 206. Even if the user specified every variation of
"Road" in separate queries, the correct results would not be
obtained. The problem becomes more difficult if the system user
wishes to allow for errors in the data, e.g., returning record 206
when specifying "Box Rd".
[0091] Another example where string searching without considering
the semantics can lead to erroneous results is when <Street
Names> have the same names as <Town Names>. For example:
"123 Sydney Ave, Melbourne VIC". String searching will not allow
one to find only records with "Sydney" as their town name.
[0092] The step of accessing the text object may also include
comparing two text objects and ascertaining and providing a
confidence value that indicates how closely the two text objects
match. For example, two street addresses may be compared by
comparing their respective text objects, and a confidence value (in
percentage points) can be given depending on how closely they
match.
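A confidence value of this kind might, for example, be computed as a weighted percentage of agreeing attributes. The formula, the weights and the function name confidence below are assumptions for illustration only; the specification does not prescribe this calculation.

```python
def confidence(fields_a, fields_b, weights):
    """Weighted percentage of attribute types whose values agree."""
    types = set(fields_a) | set(fields_b)
    total = sum(weights.get(t, 1.0) for t in types)
    if total == 0:
        return 0.0
    matched = sum(
        weights.get(t, 1.0)
        for t in types
        if t in fields_a and t in fields_b
        and fields_a[t].lower() == fields_b[t].lower()
    )
    return 100.0 * matched / total


# Hypothetical matching weights: the street name counts three times as much.
weights = {"Street number": 1.0, "Street name": 3.0, "Street type": 1.0}
a = {"Street number": "12", "Street name": "Pitt", "Street type": "Street"}
b = {"Street number": "14", "Street name": "Pitt", "Street type": "Street"}

print(confidence(a, b, weights))
```

Weighting lets a disagreement in a minor element, such as the street number, lower the score less than a disagreement in the street name would.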
[0093] The step of accessing may also include the step of changing
a value associated with a particular component. Common examples
include changing a woman's surname after marriage and changing a
street name or town name after a mistake has occurred.
[0094] There are also many cases where governments change street
names, postcodes (e.g. Australia's Northern Territory changed its
postcode range from 5800-5999 to 0800-0899), or even whole city
names (e.g. Leningrad to St Petersburg).
[0095] This ability of the present invention to change a value of a
particular element of the original piece of text has the benefit
that the operations of legacy computer systems which use the data
directly (i.e. without using text objects) will not be
affected.
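Changing one element within the original text might be sketched as below. The flat mapping of attribute type to (start, length) and the function name replace_element are simplifying assumptions for illustration; the street name "George" is a hypothetical replacement value.

```python
def replace_element(text, nodes, attribute_type, new_value):
    """Replace one element in the original text and shift later offsets.

    `nodes` maps attribute type -> (start, length); a flat layout is assumed.
    """
    start, length = nodes[attribute_type]
    new_text = text[:start] + new_value + text[start + length:]
    delta = len(new_value) - length
    new_nodes = {}
    for attr, (s, n) in nodes.items():
        if attr == attribute_type:
            new_nodes[attr] = (s, len(new_value))
        elif s >= start + length:
            new_nodes[attr] = (s + delta, n)  # element lies after the change
        else:
            new_nodes[attr] = (s, n)
    return new_text, new_nodes


text = "12 Pitt Street, Leningrad"
nodes = {"Street number": (0, 2), "Street name": (3, 4),
         "Street type": (8, 6), "Town": (16, 9)}

text, nodes = replace_element(text, nodes, "Town", "St Petersburg")
text, nodes = replace_element(text, nodes, "Street name", "George")
print(text)
```

Only the affected sub-string of the stored text changes, so a legacy system that reads the field directly simply sees the updated address.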
[0096] Yet a further aspect of the present invention provides a
processing system for enabling access to free-format data processed
in accordance with the method of any one of claims 1 to 19, the
processing system including a query processing means arranged to
access the additional data and provide answers to queries relating
to the semantic and syntactic information about the data and/or to
access the data to manipulate the data.
[0097] The apparatus may include means for accessing the text
object in accordance with any or all of the method steps given
above.
[0098] The present invention yet further provides a processing
system for processing free-format data stored in a computing
system, comprising means for examining elements of the data to
determine attributes of the data, by examining the content of the
elements and the contextual relationship of elements to each other,
to determine semantic and syntactic information (attributes) about
the data, and a query processing means for utilising this
information to provide answers to queries relating to the semantic
and syntactic information about the data and/or to access the
data.
[0099] The means for examining may comprise a domain object which
examines the elements and produces virtual data (being data
relating to the semantic and syntactic information about the data)
which is used by the query processing means to access the data and
obtain information on attributes of the data.
[0100] The present invention yet further provides a method of
processing free-format data stored in a computing system,
comprising the steps of examining elements of the data to determine
attributes of the data, by examining the content of the elements
and the contextual relationships of elements to each other, to
determine semantic and syntactic information (attributes) about the
data, and querying the data using this information to provide
answers to queries relating to the semantic and syntactic
information about the data and/or to access the data.
[0101] From yet a further aspect the present invention provides a
method of processing a plurality of records of free-format data
stored in a computing system, comprising the steps of, for each
record examining elements of the data to determine attributes of
the data, by examining the content of the elements and the
contextual relationships of elements to each other, to determine
semantic and syntactic information (attributes) about the data, and
producing virtual data fields enabling access to this information
and the associated elements for each data record, whereby each
record is provided with associated virtual data fields enabling
access to semantic and syntactic information about that record and
also access to the associated elements.
[0102] The term "virtual data fields" is used in the same sense as
previously. Unlike prior art conventional databases, where it is
necessary to process the information and produce actual data
fields, no separate data fields need be produced. The data may
remain in place where it is in the database, and instead an
associated "virtual field" is produced for attributes of the
semantic and syntactic information, and the virtual fields can be
queried to obtain all the information required of the record, and
preferably all normal database operations may be implemented.
[0103] The present invention yet further provides a processing
system for processing a plurality of free-format data records
stored in a computing system, comprising means for examining
elements of the data of each record to determine attributes of the
data, by examining the content of the elements and the contextual
relationships of elements to each other, to determine semantic and
syntactic information (attributes) about each record, and means for
producing virtual data fields associated with each record enabling
access to this information and the associated elements, whereby
each record is provided with associated virtual data fields
enabling access to semantic and syntactic information about that
record and also access to the associated elements.
DESCRIPTION OF PREFERRED EMBODIMENT
[0104] Features and advantages of the present invention will become
apparent from the following description of an embodiment thereof,
by way of example only, with reference to the accompanying
drawings, in which:
[0105] FIG. 1 is a diagram illustrating the architecture of a
system for enabling the processing of free-format data in
accordance with an embodiment of the present invention;
[0106] FIG. 2 illustrates sample "address" data;
[0107] FIG. 3 is a more detailed structural view of an example text
object produced by operation of the embodiment of the invention on
free-format data;
[0108] FIG. 4 illustrates sample "address" formats;
[0109] FIG. 5 is a flow chart illustrating a method for getting a
sub-component of a specific type from the text object of the
invention;
[0110] FIG. 6 illustrates the results of the get sub-component
method;
[0111] FIG. 7 is a flow chart illustrating a method for modifying a
sub-component of a text object of the invention;
[0112] FIG. 8 is an illustration of the mechanics of modifying a
text object of the invention; FIGS. 9, 10 and 11 provide an
example of modifying a text object of the invention;
[0113] FIG. 9 shows a text object before modification;
[0114] FIG. 10 shows the replacement text object; and
[0115] FIG. 11 shows the text object referred to in FIG. 9 after it
has been modified;
[0116] FIG. 12 is a flow chart illustrating the node matching
subroutine used by other methods;
[0117] FIG. 13 illustrates examples of text objects in accordance
with embodiments of the present invention for illustrating a method
of comparison of text objects in accordance with an embodiment of
the present invention;
[0118] FIG. 14 is a flow chart illustrating the "adjust node"
subroutine used by other methods;
[0119] FIG. 15 is a diagram illustrating the architecture of the
domain object block of FIG. 1;
[0120] FIG. 16 is an illustration of the domain construction
process of FIG. 1 in more detail;
[0121] FIG. 17 provides two examples of standard transliteration
tables, one for Japanese Katakana and one for Greek;
[0122] FIG. 18 contains tables illustrating Regular Expression
Definition data;
[0123] FIG. 19 illustrates a demonstration grammar data file;
[0124] FIGS. 20 and 21 provide flow charts of the domain object
construction process block of FIG. 1;
[0125] FIG. 22 illustrates an example session with an implementation
of the invention within a SQL relational database system.
[0126] Although the following descriptions use English name and
address examples, the invention can be equally applied to any
domain of free-format text.
[0127] As discussed in the preamble of this specification, the
present invention relates to an entirely new concept and approach
for processing computerised information, in particular free-format
data. As discussed above, the idea is to produce from the
free-format data a "text object" which may be stored in a computer
and which can be used to obtain information about the free-format
data, compare records of free-format data and manipulate the data.
This is achieved without it being necessary to construct complex
databases having many fields.
[0128] FIG. 1 is a diagram showing the configuration of an entire
"virtual data" system in accordance with an embodiment of the
present invention. It comprises a user interface 101 and a processor
102. The processor 102 can be a standard computer system having a
general configuration comprising a CPU, a computer memory and a mass
storage device. The user interface 101 can be a standard keyboard
and VDU, and/or an interface to another computer system. User
interfaces like these, along with other equivalent interfaces, are
well known.
[0129] For the purposes of the internal storage requirements of the
invention, no distinction will be made between the computer memory
and the mass storage device; both will be referred to as memory.
[0130] Loaded into the memory of the processor 102 is standard
system software well known to those skilled in the art, such as an
operating system and a database system (not shown), one or more
application software systems 103 such as an accounting package or
word processor, and an embodiment of the present invention 104, for
producing text objects 105 from free-format data. The system 104
comprises a domain construction process 106 which is arranged to
take a plurality of input data 107 (in this example in the form of
data files) and build a domain object 108 which is used to produce
text objects 105. Each "domain" will include all the grammar and
syntax rules necessary for that particular domain of free-format
data. For example, one domain may be international names and
addresses and will include all the information necessary to analyse
free-format international name and address data to produce a text
object. Another domain may be a commodity description knowledge
base, another one may be a transportation industry knowledge base.
Domains may be produced to handle any free-format data. The domain
construction process 106 is essentially an engine which works on
the knowledge bases (input files) for the particular domain type to
produce the domain object 108 for that type.
[0131] Referring again to FIG. 1, a text object index 109 may be
produced by processing a number of text objects 105, and this will
be described later.
[0132] It should be noted, as shown in FIG. 1, that the invention 104
provides a layer between general application software systems 103
and their stored data 110. Unlike "Knowledge Based Management
Systems" described above, this invention allows the free-format
data to remain in its original location and legacy application
software to operate using the original access paths 111.
[0133] Text Object
[0134] Structure
[0135] FIG. 3 is a schematic diagram of the detailed structure of
an example text object in accordance with an embodiment of the
present invention, in order to assist with illustrating the
concept.
[0136] The example free-format data illustrated in FIG. 3 is a
street address, "12 Pitt Street, North Sydney" (designated by
reference numeral 301). In prior art databases, this information
may have been stored in a single "address" field or may have been
divided into a number of separate fields corresponding to the
various attributes, i.e., street number, street name, street type
and town. Refer to FIG. 4 for other examples of common Australian
address formats. As discussed in the preamble, the prior art
database format requirement for a separate field for each attribute
gives rise to much complexity and, where the information is
intricate, it is cost prohibitive and even impossible to produce a
field for every attribute of the free-format data.
[0137] The text object (illustrated in FIG. 3) comprises a
plurality of component nodes 302-312. The text object can be
represented as a text node tree, having branches (eg 313) wherein
the component nodes 302-312 are positioned in a predetermined
hierarchy. The "lowest" hierarchy is at the bottom of the text node
tree and the "highest" hierarchy is at the top of the text node
tree. The node 302 at the top of the node tree will be referred to as
the "root" node. It will be appreciated that components of the text
object can be stored in any convenient manner in a memory of a
processing means; for example, they could be nested within each other
or refer to each other in some way. The text object is able to be
represented as a text node tree, but that does not mean that it is
stored in memory in this way. As long as the components of the text
object can be processed in such a fashion that the components act
like component nodes of a text node tree as represented in the
figure, then this is sufficient.
[0138] Note that each component node 302-312 could itself be
considered a text object. This recursive definition allows all the
functions of the present invention to be applied to each
component.
[0139] The architecture of each component node 302-312
includes:
[0140] An attribute type identifier (which in this embodiment is an
integer) which identifies an attribute type of the free-format data
301 associated with the text object. For example, component node
303 includes the attribute type identifier <Street>,
indicating that this component node 303 is associated with the
element of the free-format data which gives the Street, i.e.,
"12 Pitt Street". Component node 302 is the main component node for
the text object illustrated in FIG. 3 and includes the attribute
type identifier <Address>. The component node 302 is
therefore associated with the entire free-format data record in
this case, being "12 Pitt Street, North Sydney", which is an
address. Note that component node 302 is "higher" in the hierarchy
in the text node tree than component 303; the <Address>
component includes within it the <Street> component. The
hierarchy of the component nodes 302-312 within the text node tree
is in fact determined by the attribute type identifier of the
component node and by grammatical rules which determine that the
attribute should be of a lower or higher hierarchy.
[0141] A pointer to the starting position of the actual element
sub-string of the free-format data associated with a component
node. The free-format data is stored as a string in memory and each
pointer points to the first character of its element within that
string. In the example, component node 303 would point to the
numeral "1" of the address.
[0142] An integer containing the character length of the element.
In the example, component node 303 would have a length of 14
(including space characters after "12" and "Pitt") which would in
effect point to the last letter "t" of "Street".
[0143] An array of subordinate component nodes. For example, for
component node 303, nodes 306, 307, 308 are all directly
subordinate in the hierarchy and nodes 311, 312 indirectly
subordinate. This array enables the component nodes to be related
to each other in the text node tree construction.
[0144] A boolean variable indicating whether this attribute type
identifier is for a "low level" matching element. "Regular
expression" terms such as <word> and <nbr> are not
matched against each other. Matching of these terms is performed at
the next level up the hierarchy (e.g. <Street Name> 307). A
node is flagged as a low level matching component if it either: is
a literal which was located in the dictionary (e.g. nodes 308,
309); or contains "Regular expression" terms (e.g. nodes 306, 307,
305).
[0145] An integer representing the element's match weighting. This
indicates the relative importance of each of the elements when
performing comparisons between text objects. For example: when
comparing "Level 3, 45 Pitt st" with "3rd Floor, 45 Pitt St" the
fact that the elements "Level" and "Floor" are not equal is
insignificant. The "match weighting" values are specified in the
grammar rules used to construct the domain object.
[0146] Depending on time/space considerations, other optional data
items used to assist the "matching" processes. Refer to the section
on "text string operations" below for more details.
[0147] An integer indicating the parsing priority.
[0148] This will be described later.
[0149] A boolean value indicating whether this component node is
responsible for deleting and moving the piece of text it points to.
The two conditions under which a component is responsible for its
text are: 1) When an outside process requests that the text object
manage the entire text string, the text object "root" node is
flagged as being responsible for the text string. 2) When an implied
value is created. See below for details.
[0150] An integer value representing the free space available at the
end of the buffer in which the free-format text is held. This value
is calculated during the creation of the text object and is usually
only applicable to the "root" node of the text object.
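By way of illustration only, the component node fields listed above could be held in a record such as the following Python sketch. The class name `ComponentNode` and every field name are assumptions made for this example and do not appear in the specification; the attribute type identifier is shown as a string rather than an integer purely for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentNode:
    attr_type: str            # attribute type identifier, e.g. "Street"
    start: int                # offset of the element's first character
    length: int               # character length of the element
    children: List["ComponentNode"] = field(default_factory=list)
    low_level: bool = False   # "low level" matching component flag
    match_weight: int = 0     # relative importance when comparing
    parse_priority: int = 0   # used to choose between ambiguous parses
    owns_text: bool = False   # responsible for deleting/moving its text
    free_space: int = 0       # spare buffer space (root node only)

    def text(self, buffer: str) -> str:
        """Return the element sub-string this node refers to."""
        return buffer[self.start:self.start + self.length]


buffer = "12 Pitt Street, North Sydney"
street = ComponentNode("Street", start=0, length=14)
town = ComponentNode("Town", start=16, length=12)
address = ComponentNode("Address", 0, len(buffer), children=[street, town])
print(street.text(buffer))  # 12 Pitt Street
```

The node stores only an offset and a length, so the free-format string itself is left untouched, which is the property the description relies on.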
[0151] In the text node tree the foot of the hierarchy is a
component node dealing with an element for each token of the
free-format data, in this case being <number> 311,
<word> 312, <street type> 308, <geographic term>
309, <word> 310.
[0152] Further up in the hierarchy are component nodes for more
generic attribute type identifiers. For example these are
<street name> 307 for the word "Pitt", <Street> 303 for
the three tokens "12 Pitt Street", <town> 305 for the tokens
"North Sydney" and, at the top of the hierarchy of this particular
free-format data record, the attribute type identifier
<Address> 302.
[0153] Attribute Type Identifier
[0154] It will be appreciated that the attribute type identifiers
can be stored in any form, i.e., they need not be stored as
integers but could be stored in any representation. A program
engine is provided enabling access to the text node tree and this
engine has the information necessary to identify the attribute type
identifiers as stored.
[0155] Parsing Priority
[0156] To assist in the processing of ambiguous free-format data,
each component node contains an integer indicating the "parsing
priority" of the element. These values are assigned during
construction of the text object and are used to select the best
text node tree if more than one exists for a particular ambiguous
free-format text. For example: "12 Pitt St Nth Sydney" has two
interpretations. Although "12 Pitt St Nth" is a valid street
address, it has a lower priority than "Nth Sydney" and is therefore
not selected. These "parsing priority" values are specified in the
grammar rules used to construct the domain object (see below).
[0157] Implied Fields
[0158] Another feature of the present invention is the production
of extra implied sub fields in a text object, in the form of the
creation of extra component nodes for information that is not
actually explicit in the original text. For example, "Mr John
Smith" has an implied sub field "sex" with a value "male". The text
object can be created with an extra component node dealing with
this element and having the attribute type identifier "sex".
[0159] Normally these implied fields will be created during the
parsing process and are specified in the grammar, but they can be
added manually if required. See the description of the "Add
Sub-component" function below.
[0160] Interface
[0161] The text object acts as a "virtual interface" enabling
access to the free-format data and facilitating all normal database
operations on the free-format data. The user does not "see" the
internals of the text object, but can query the text object via the
associated program engine (query processing means) and, by virtue
of the structure of the stored text object, the attribute type
identifiers and other data being placed in nodes, can perform all
the normal database operations on the free-format text record.
[0162] All the below operations require that the text node tree be
searched for specific attribute types. This searching is performed
by the engine using recursive procedure calls. This technique is
very well known within computer science. Refer to the book "Data
Structures and Program Design" by Robert Kruse (Prentice Hall) for
a description of recursion.
[0163] Another embodiment of this invention may speed up the above
procedure by performing the search once and creating a lookup table
containing every sub-attribute, sorted by the attribute type
identifier. This technique is well known to those skilled in the
art.
[0164] Function Overview
[0165] These operations include:
[0166] "Get Sub-component" Requests the text object to supply
(zero, one or many) values for the respective attribute type.
[0167] "Compare Text Objects" Compares two text objects and reports
a confidence value that indicates how closely they match.
[0168] "Contains component" Tests if a particular text object
contains a specific value for a particular element and returns a
confidence, e.g., one could obtain all free-format data records
which include Pitt Street as the "street". This would be one way of
finding how many people on a database live in Pitt Street where the
database includes free-format data in an address field and without
requiring a string search (which can often give rise to error).
[0169] "Modify Sub-component" Changes the value of a particular
element of a text object to a specific value. For example, change
"Pitt" to "King".
[0170] "Add Component" Adds extra data to the text object by
appending a new sub-component node to the respective node. Future
operations will reference this information.
[0171] Get Sub-component
[0172] When the "Text Object" is queried, an attribute type
identifier is supplied and zero, one or more "Sub-component Nodes"
are returned. These "Sub-component Nodes" point to the text of the
required elements. FIG. 5 illustrates this method. Beginning this
recursive procedure with the "root" node of the text object,
starting at 501, a determination is made (502) as to whether the
attribute type of this node is the same as the required attribute type.
If it is, a pointer to this node is appended to the result list at
step 503. Continuing with step 504, for each sub-component node
referenced by this node recursively call this procedure 505. Then
return to caller 506. FIG. 6 illustrates the node tree for "Mr Fred
and Mrs Mary Smith". Searching the tree for nodes with attribute
type <Given Name> will return a list containing pointers to
two nodes 601, 602. These nodes point to the sub-strings "Fred",
"Mary" respectively.
[0173] Another version of this operation takes a text string as a
parameter. Only nodes containing the same attribute type and same
text string (ignoring case) are added to the list. For example:
calling this function with an attribute type <Given Name> and
text string "FRED" would return a list containing one node.
[0174] Yet another version of this operation takes as parameters a
text string and a confidence level. Only nodes which have the same
attribute type and a text string which matches the supplied
string with a confidence above the supplied level are added to the
list.
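The recursive search of FIG. 5, together with the variant that also filters on a text string, might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Node` class, the function name `get_sub_components` and the flat shape of the example tree are hypothetical, and the actual node tree of FIG. 6 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    attr_type: str
    start: int
    length: int
    children: List["Node"] = field(default_factory=list)


def get_sub_components(node, attr_type, buffer, value=None, found=None):
    """Collect nodes of attr_type, optionally filtered by text (ignoring case)."""
    if found is None:
        found = []
    text = buffer[node.start:node.start + node.length]
    # Steps 502/503: test this node and append it to the result list.
    if node.attr_type == attr_type and (
        value is None or text.lower() == value.lower()
    ):
        found.append(node)
    # Steps 504/505: recurse into every sub-component node.
    for child in node.children:
        get_sub_components(child, attr_type, buffer, value, found)
    return found


buffer = "Mr Fred and Mrs Mary Smith"
root = Node("Name", 0, len(buffer), [
    Node("Title", 0, 2), Node("Given Name", 3, 4),
    Node("Title", 12, 3), Node("Given Name", 16, 4),
    Node("Surname", 21, 5),
])
names = get_sub_components(root, "Given Name", buffer)
print([buffer[n.start:n.start + n.length] for n in names])  # ['Fred', 'Mary']
```

Calling `get_sub_components(root, "Given Name", buffer, value="FRED")` returns a list containing the single node for "Fred", matching the case-insensitive variant described above.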
[0175] Compare Text Objects
[0176] This operation compares two text objects and returns a
confidence level indicating how closely they match. It performs
this by:
[0177] 1. Determining if the "root" nodes of the two text objects
have the same attribute type. If they do not, return a zero
confidence level to the caller.
[0178] 2. Otherwise, call the "Match Node" subroutine (described
below) with the "root" nodes of the two text objects and return the
result of that operation to the caller.
[0179] For example: passing the two following text objects will
return a confidence of 100%.
<Address> "12/34 PITT ST SYDNEY 2000 NSW"
<Address> "Unit 12 34 Pitt Street, SYDNEY N.S.W., 2000"
[0180] Contains Sub-Component
[0181] This operation searches one text object for a sub-component
which matches a second text object. If found, it returns to the
caller a confidence level indicating how well they match. This
operation is achieved by first calling the "Get Sub-component"
function (described above), passing the component type of the second
text object. If successful, it calls the "Match Node" subroutine
(described below) with the "root" node of the second text object
and the node returned by the "Get Sub-component" function.
[0182] For example: passing the two following text objects will
(depending on how the string matching procedures are set up) return
a confidence of approximately 80%.
<Street> "Kathryn Street"
<Address> "12-14 Catherine St, Dubbo NSW 2830"
[0183] Add Sub-Component
[0184] This operation appends an extra component node into the text
object. Although the value of this element is not contained in the
original free-format text, queries performed on the text object
return the correct results. For example: a text object pointing to
a record containing "Dr Chris Smith" may need to be modified to
indicate that the person is a female. Invoking the Add Component
function containing a sex attribute type with a value of "female"
will append the respective component node to the text object.
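A minimal Python sketch of this operation is given below, assuming that an added node carries its own text because the value does not occur in the original string. The names `Node`, `own_text`, `add_sub_component` and `find` are hypothetical and are used only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    attr_type: str
    start: int = 0
    length: int = 0
    children: List["Node"] = field(default_factory=list)
    own_text: Optional[str] = None  # set only for implied/added values

    def text(self, buffer: str) -> str:
        # An added node is responsible for its own text; every other
        # node refers back into the original free-format string.
        if self.own_text is not None:
            return self.own_text
        return buffer[self.start:self.start + self.length]


def add_sub_component(parent: Node, attr_type: str, value: str) -> Node:
    """Append a node whose value is not present in the original text."""
    node = Node(attr_type, own_text=value)
    parent.children.append(node)
    return node


def find(node: Node, attr_type: str) -> List[Node]:
    hits = [node] if node.attr_type == attr_type else []
    for child in node.children:
        hits.extend(find(child, attr_type))
    return hits


buffer = "Dr Chris Smith"
person = Node("Name", 0, len(buffer), [
    Node("Title", 0, 2), Node("Given Name", 3, 5), Node("Surname", 9, 5),
])
add_sub_component(person, "Sex", "female")
print([n.text(buffer) for n in find(person, "Sex")])  # ['female']
```

The original record is not altered; later queries on the text object simply find the extra node alongside those that refer to the stored string.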
[0185] Modify Sub-Component
[0186] FIG. 8 illustrates the mechanics of the "Modify" operation.
The text object to be modified is represented by 801. The actual
text data consists of the sub-string to be replaced 805 and the
sub-strings before 804 and after 806. Within the main text object
801, the sub-tree 803 represents the sub-string to be replaced 805.
The replacement text string 807 is represented by another text
object 802.
[0187] FIG. 7 provides a flow chart of the "Modify" procedure.
Starting at 701, a call to the "Get Component" function (described
above) is performed to locate the required component node at step
702. The results of this function call are tested (step 703) to
ensure that one and only one component node is returned. If zero
nodes or more than one node are returned, an error condition is set
704 and the procedure returns to the caller 714. Otherwise, the
procedure continues with step 705 by calculating the difference
(Diff) in length between the sub-string to be replaced 805 and the
new replacement sub-string 807. If the difference is not zero (i.e.
the strings have unequal lengths) invoke the "Adjust Node Variables"
subroutine 707 (described below). If the subroutine 707 is
unsuccessful, set an error condition 711 and return to caller 714.
Continuing the procedure at step 708, copy the new replacement
string 807 into the location of the old string 805. Replace the old
node sub-tree 803 with the new sub-tree 802 at step 710. For each
node in the new sub-tree 712 adjust the node's "text start address"
variable by adding the starting position of the new sub-string
713. Then terminate this procedure and return to caller 714.
[0188] FIGS. 9, 10 and 11 provide an example of the "Modify"
operation. FIG. 9 shows a text object before modification. FIG. 10
shows the replacement text object and FIG. 11 shows the text object
referred to in FIG. 9 after it has been modified.
[0189] The extra versions of the "Get Sub-component" operation
described above also apply to this operation.
[0190] Subroutines
[0191] The operations described below are invoked from other text
object procedures described above.
[0192] Match Node
[0193] This procedure compares two elements with the same attribute
type and returns a confidence level value indicating how closely
they match.
[0194] FIG. 12 shows a flow chart for the "Match Node" operation.
Starting at 1201, a determination is made as to whether the nodes
being compared are low level matching components at step 1202. If
the two nodes are low level matching components, perform the
"String Comparison" procedure (described below) at step 1203 and
return to caller 1210. Otherwise, if the two nodes contain
sub-component nodes recursively invoke this procedure 1205 with all
combinations of sub-component pairs which have the same attribute
type (step 1204). Record the best confidence level for each 1206.
Multiply each node's confidence level by its respective matching
weight value 1207. Sum all the resulting values into one confidence
value 1208. Divide that value by the sum of the match weightings
1209 and return to the caller 1210.
[0195] FIG. 13 contains an example showing the matching process.
Within the text object's node tree there are three types of
component nodes:
[0196] 1) nodes which contain sub-component nodes;
[0197] 2) low level matching components near the foot of the node
tree; and
[0198] 3) nodes which are contained within the low level matching
components and represent simple "regular expression" terms. (Refer
to the description of the grammar file for details of the terms.)
These nodes are not used in the matching process.
[0199] In this example text object, the nodes 1301, 1302, 1313 and
1314 contain sub-component nodes. The nodes 1304, 1305, 1306, 1307,
1308, 1309, 1315, 1316, 1317 and 1318 are low level matching nodes.
The nodes 1309, 1310, 1311, 1312, 1319, 1320 and 1321 are simple
"regular expression" terms.
[0200] In the following calculations, the first number within each
pair of parentheses is the weighting value for that component and
the second number is the best result from the node matching
procedure for that node. The number preceding each pair of
parentheses is the node's reference label in FIG. 13.
[0201] To calculate the matching confidence for the "Street"
components:
  1304 (20 * 100) + 1305 (0 * 0) + 1306 (10 * 0) + 1307 (60 * 80)
  + 1308 (10 * 100) + 1316 (30 * 100) + 1317 (60 * 80)
  + 1318 (10 * 100) = 15400
  (20 + 0 + 10 + 60 + 10 + 30 + 60 + 10) = 200
  15400 / 200 = 77%
[0202] To calculate the matching confidence for the "Address"
components we perform the same procedure with the "Street" and
"Town" components:
  1302 (60 * 77) + 1303 (40 * 100) + 1314 (60 * 77)
  + 1315 (40 * 100) = 17240
  60 + 40 + 60 + 40 = 200
  17240 / 200 = 86.2%
[0203] This value indicates the two pieces of text match "quite
closely". Values greater than 90% indicate a match that is "very
close".
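The weighted averaging performed by the "Match Node" subroutine can be sketched in Python as shown below. This is an illustrative sketch only: the `Node` class and `match_node` function are hypothetical names, the handling of a component with no counterpart (scored as zero) is an assumption, and `stub_compare` is a stand-in for the "String Comparison" procedure that simply returns the figures used in the "Address" calculation above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    attr_type: str
    text: str
    weight: int = 0
    low_level: bool = False
    children: List["Node"] = field(default_factory=list)


def match_node(a, b, compare):
    """Return a confidence (0-100) for two nodes of the same attribute type."""
    if a.low_level and b.low_level:
        return compare(a.text, b.text)
    scored = []
    # Score the sub-components of both nodes against the other side.
    for mine, theirs in ((a, b), (b, a)):
        for child in mine.children:
            best = max(
                (match_node(child, other, compare)
                 for other in theirs.children
                 if other.attr_type == child.attr_type),
                default=0.0,  # assumption: no counterpart scores zero
            )
            scored.append((child.weight, best))
    total_weight = sum(w for w, _ in scored)
    if total_weight == 0:
        return 0.0
    return sum(w * c for w, c in scored) / total_weight


def stub_compare(x, y):
    # Hypothetical stand-in: identical text (ignoring case) scores 100,
    # anything else scores the 77 used for "Street" in the example.
    return 100.0 if x.lower() == y.lower() else 77.0


first = Node("Address", "", children=[
    Node("Street", "12 Pitt St", 60, True),
    Node("Town", "Sydney", 40, True),
])
second = Node("Address", "", children=[
    Node("Street", "12 Pitt Street", 60, True),
    Node("Town", "SYDNEY", 40, True),
])
confidence = match_node(first, second, stub_compare)
print(round(confidence, 1))  # 86.2
```

With these inputs the weighted sum is 17240 and the sum of the weightings is 200, giving the 86.2% of the "Address" calculation.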
[0204] The above procedure may be improved by applying "Fuzzy
Logic" techniques. Fuzzy logic techniques are well known to those
skilled in the art and many suitable reference books are
available.
[0205] Adjust Node Variables
[0206] This subroutine is called from the "Modify Component"
procedure described above. The purpose of this routine is to adjust
the actual free-format text, and all corresponding sub-component
nodes located after the node being replaced, so that the new
replacement sub-string and sub-tree fit exactly. If the old
sub-string and the new replacement sub-string are the same length,
this subroutine is not invoked.
[0207] FIG. 14 shows a flow chart of the steps required. Starting
at 1401, a determination is made at step 1402 as to whether there
is enough space in the current text buffer to accommodate the
change. This is done by referring to the "free space" variable
(described above) of the "root" node of the text object. If there
is not enough space, the "Relocate Text Data" subroutine is invoked
1403 to create free space in the text object. If this routine is
unsuccessful 1404, an error condition is set 1415, the procedure
terminates and returns to the caller 1416. Otherwise, the procedure
continues at 1405 and calculates the extra space requirements of the
modified text object by subtracting the size of the old sub-tree
being replaced from the size of the new replacement sub-tree. A
zero or negative value indicates that the text object has enough
space to accommodate the change. If the text object requires more
space 1406, the "Relocate Text Object" subroutine is invoked 1407 to
create free space in the text object. If this routine is
unsuccessful 1408, an error condition is set 1415, the procedure
terminates and returns to the caller 1416. If the above steps are
successful, the procedure continues at step 1409 and shifts the
"after" string 806 in FIG. 8 by the difference between the old
sub-string 805 and the new replacement sub-string 807. For each
node which refers to components located after the replacement node
1410, add this difference to the node's start address variable
1411. For each node which has the replaced node as a sub-component
1412, add the difference to the node's length variable 1413. Adjust
the text object's "free space" variable by subtracting the
difference 1414 and return to caller 1416.
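The bookkeeping of steps 1409 to 1413 can be illustrated with the following Python sketch. It rests on several simplifying assumptions: nodes are held in a flat list of hypothetical `Span` records rather than a tree, the replaced element is a leaf, and the buffer checks and relocation steps (1402 to 1408) are omitted because a Python string is simply rebuilt. None of these names come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    attr_type: str
    start: int
    length: int


def replace_element(text: str, spans: List[Span], target: Span, new_value: str) -> str:
    """Replace target's sub-string and adjust the other spans to fit."""
    old_end = target.start + target.length
    diff = len(new_value) - target.length
    # Step 1409: shift the "after" string by the length difference.
    new_text = text[:target.start] + new_value + text[old_end:]
    for span in spans:
        if span is target:
            span.length = len(new_value)
        elif span.start >= old_end:
            # Steps 1410/1411: elements located after the replaced one.
            span.start += diff
        elif span.start <= target.start and span.start + span.length >= old_end:
            # Steps 1412/1413: elements that contain the replaced one.
            span.length += diff
    return new_text


text = "12 Pitt Street, North Sydney"
address = Span("Address", 0, 28)
street = Span("Street", 0, 14)
number = Span("Number", 0, 2)
street_name = Span("Street Name", 3, 4)
street_type = Span("Street Type", 8, 6)
town = Span("Town", 16, 12)
spans = [address, street, number, street_name, street_type, town]

new_text = replace_element(text, spans, street_name, "George")
print(new_text)  # 12 George Street, North Sydney
print(new_text[town.start:town.start + town.length])  # North Sydney
```

After the call every span still addresses the same element it did before, even though the underlying string has grown by two characters.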
[0208] Relocate Text Data
[0209] This subroutine is invoked by the "Adjust Node Variables"
subroutine to move the current free-format text into a space large
enough to accommodate the required modification. The ability of this
routine to perform this operation depends on where the text data is
stored.
Typically, free-format data such as "address" information is stored
in fixed length database fields and will not be able to be
relocated. If this is the case, this routine will set an error
condition and return to caller. However, if the text data is stored
within moveable storage such as the computer's memory or within an
object-oriented database as a non-persistent object, this procedure
will relocate the text data and return to the caller with the text
data's new address.
[0210] Relocate Text Object
[0211] This subroutine is invoked by the "Adjust Node Variables"
subroutine to move the current text object into a space large
enough to accommodate the required modification. The ability of this
routine to perform this operation depends on how this invention is
implemented. If the text object is stored within moveable storage
such as the computer's memory or within an object-oriented database
as a non-persistent object, this procedure will relocate the text
object and return to the caller with the text object's new
address.
[0212] For a description of Object-Oriented databases and object
persistence, refer to the book "Object-Oriented Databases" by
Setrag Khoshafian (Wiley Press).
[0213] Get Keys
[0214] This operation is used exclusively by the "Text Object
Index" described below. It provides key information used in
updating and querying of the text object index. It recursively
searches the text object node tree and returns a list of all the
nodes which have been flagged as low level matching components. See
above for a definition of a low level matching component. Refer to
the description of the Text Object Index below for an example of
the output of this function.
[0215] Summary of Text Object Benefits
[0216] Many records of free-format text may be processed in
accordance with this embodiment of the present invention, to
produce text objects in each case. Different text objects may have
different attribute type identifiers, but it is not necessary to
produce a complex database structure having a separate field for
each attribute type. Free-format text is stored basically as it is,
with the associated text object providing all the facility required
to provide all the normal database operations on the free-format
data. This essentially enables a computer to handle information in
much the same way as a human being does.
[0217] Text Object Construction Overview
[0218] The text object is produced by an examination of the
free-format data by applying natural language processing
techniques, such as parsing, which is known in the prior art. Such
language processing techniques have been applied to "clean" or
"scrub" databases and large and complex software systems have been
applied. In each case in the prior art, however, the natural
language processing has been applied to analyse the data to enable
the creation of new database fields. The idea of maintaining the
free-format data as it is and creating a text object as described
is a totally new concept.
[0219] In this embodiment of the present invention, the processing
of each item of free-format text to produce the text object
involves, firstly, lexical analysis in which a regular expression
analyser reads the free-format text and groups the items of the
text into tokens with their associated attribute type identifier
(e.g., word, number, comma, etc). Each token is then checked against
a dictionary for other applicable attribute type identifiers (e.g.,
Street type, State, etc).
[0220] Syntax analysis is then applied and in the present
embodiment, the position of each of the tokens in the free-format
data is also analysed to provide attribute type identifiers. For
example, in the FIG. 3 example, "Pitt" is a plain word not found in
the dictionary and therefore probably a proper noun. By analysing
its position in relation to the other elements of the free-format
data, however, the embodiment can "imply" that it is a
<StreetName>. Therefore, "12 Pitt Street" can be classified
as a <Street> from the relative positioning of the
tokens.
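The lexical analysis and dictionary look-up stages might be sketched as follows using Python's standard `re` module. The token pattern, the dictionary entries and the function name `tokenize` are assumptions chosen for this example; the actual regular expression definitions and dictionary of the embodiment come from the domain's input files and are not reproduced here.

```python
import re

# Hypothetical elementary token types: number, word and comma.
TOKEN_RE = re.compile(r"(?P<number>\d+)|(?P<word>[A-Za-z]+)|(?P<comma>,)")

# Hypothetical dictionary of additional attribute types for known words.
DICTIONARY = {
    "street": "street type",
    "st": "street type",
    "north": "geographic term",
    "nth": "geographic term",
}


def tokenize(text):
    """Return (token text, start offset, attribute types) for each token."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        types = [match.lastgroup]  # elementary type from the regular expression
        extra = DICTIONARY.get(match.group().lower())
        if extra:
            types.append(extra)    # further type found in the dictionary
        tokens.append((match.group(), match.start(), types))
    return tokens


for token in tokenize("12 Pitt Street, North Sydney"):
    print(token)
```

In this sketch "Pitt" and "Sydney" receive only the type `word`, which is the situation in which the syntax analysis stage has to infer a type such as <StreetName> from the token's position.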
[0221] Domain Object
[0222] The main function of the domain object 108 (FIG. 1) is to
create text objects 105. This function is described in detail
below. Other functions the domain object performs relate to
maintaining an attribute type table. This table contains the
information for all the attribute types defined for its domain.
[0223] Structure
[0224] FIG. 15 shows the domain object architecture 108 in more
detail. It comprises a series of "look up" tables, which include
the symbol table (e.g., <Street name> NB the term "symbol" is
equivalent to the term "attribute type identifier") 1502 and the
parse table 1504 (contains rules for applying the grammar). It also
comprises a lexicon 1503, which contains a character definition table
1505, a regular expression analyser 1506 and a dictionary 1507 (e.g.,
NSW, VIC, SA). All of these parts are used by a modified "Tomita
parser" (described below) to process free-format text to produce
text objects.
[0225] Text Object Construction
[0226] FIG. 16 gives an overview of the operation of the domain
object 108 creating a text object 105 of FIG. 1.
[0227] In operation, the domain object 1605 uses the attribute type
1608 to locate the respective parsing rules and then "parses" the
free-format data 1607 and produces a text object 1606.
[0228] Parsing is a known technique for analysing free-format data
and a skilled person would be able to arrange appropriate
parsing.
[0229] Parser Types
[0230] The parser may consist of any non-deterministic parser. The
common parsing techniques are listed as follows:
[0231] Top Down Backtracking Parser
[0232] Bottom Up Backtracking Parser
[0233] Top Down Chart Parser
[0234] Bottom Up Chart Parser
[0235] Augmented Transition Network Parser
[0236] Shift Reduce Parser with Backtracking
[0237] Tomita's Graph stack Shift Reduce parser
[0238] The main reasons for selecting Tomita's Graph-stack
Shift-Reduce parser for the best implementation of the invention
are:
[0239] A detailed description of the algorithm is readily
available.
[0240] The algorithm processes ambiguous text data very well.
[0241] The resulting data structures represent ambiguous text data
in a very efficient form.
[0242] The structure and operation of the parsing process is
described in the book by Tomita, M. "Efficient Parsing for Natural
Language", Kluwer 1986. A summarised copy of this description is
also given in the Appendix to this description.
[0243] Modifications to Tomita's Parser
[0244] In addition to producing the component node tree described
by Tomita, a number of enhancements are required for the text
object. These enhancements allow the text object to provide the
"virtual data" fields.
[0245] Modifications to Tomita's Graph-Stack Shift-Reduce parser
for this invention are as follows:
[0246] Assigning parsing priorities to the tokens returned from the
lexical analyser and to the rules in the parse table. Summing these
priorities to obtain the most suitable component node tree for a
given free-format text. All of these priorities are specified in
the input grammar file 1603 (FIG. 16).
[0247] Classifying the component nodes of the syntax tree as either
visible or invisible. Low level "regular expression" terms such as
<word> are classified as invisible.
[0248] Assigning match weightings to all component nodes. These
values are specified in the grammar data and are used to determine
the relative importance of each of the components when matching two
free-format texts.
[0249] Procedure
[0250] FIG. 16 gives an overview of the operation of the domain
object 108 creating a text object 105 of FIG. 1.
[0251] This procedure takes a free-format text string 1607 and an
attribute type identifier 1608 and creates a text object 1606.
[0252] 1. Using the attribute type identifier 1608, look up the
symbol table 1502 (FIG. 15) to get the corresponding parse
table.
[0253] 2. Call the parser to create a "shared parse forest" as
defined in section 2.4 of Tomita's book. A shared parse forest is
used to represent ambiguous parse trees within the one structure.
It does this by allowing trees to share common sub-trees.
[0254] 3. Recursively accumulate all the "parsing priorities" of
all the sub-component nodes of each node.
[0255] 4. Based on the values in the previous step, select the best
parse tree.
[0256] 5. Create a new Text Object with the selected parse
tree.
[0257] 6. Recursively search the parse tree to locate and flag
specific nodes as "low level matching components". (see above for
definition)
[0258] Refer to FIG. 3 for a simple example of a text object.
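Steps 3 and 4 of the procedure above can be illustrated with a minimal sketch. It assumes a much simplified stand-in for the shared parse forest: each node carries either a token priority (leaf) or a list of alternative expansions, each with its own rule priority. The class name, the function name and the priority values are hypothetical and are not taken from the grammar file.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class ForestNode:
    """Simplified forest node (hypothetical layout, not Tomita's exact structure)."""
    symbol: str
    priority: int = 0          # token priority, used for leaf nodes only
    text: str = ""
    # Each alternative is (rule_priority, [child nodes]); more than one
    # alternative means the text is ambiguous at this node.
    alternatives: list = field(default_factory=list)


def best_tree(node, memo=None):
    """Return (summed_priority, tree) for the highest-priority reading.

    tree is a nested tuple (symbol, text, [child trees]). Results are
    memoised per node because sub-trees may be shared between readings.
    """
    if memo is None:
        memo = {}
    if id(node) in memo:
        return memo[id(node)]
    if not node.alternatives:
        result = (node.priority, (node.symbol, node.text, []))
    else:
        result = None
        for rule_priority, children in node.alternatives:
            scored = [best_tree(child, memo) for child in children]
            total = rule_priority + sum(score for score, _ in scored)
            if result is None or total > result[0]:
                subtrees = [tree for _, tree in scored]
                result = (total, (node.symbol, node.text, subtrees))
    memo[id(node)] = result
    return result


# "PITT ST" read either as name + type, or as a two-word street name.
word_pitt = ForestNode("word", text="PITT")
lit_st = ForestNode('"ST"', priority=5, text="ST")   # dictionary literal
word_st = ForestNode("word", text="ST")
name_short = ForestNode("Street Name", alternatives=[(0, [word_pitt])])
street_type = ForestNode("Street Type", alternatives=[(0, [lit_st])])
name_long = ForestNode("Street Name", alternatives=[(0, [word_pitt, word_st])])
street = ForestNode("Street", alternatives=[
    (1, [name_short, street_type]),
    (0, [name_long]),
])

score, tree = best_tree(street)
```

In this sketch the reading with a separate Street Type wins because the literal "ST" and its rule carry higher illustrative priorities than the generic two-word reading.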
[0259] Construction of Domain Object
[0260] FIG. 16 shows an overview of the domain construction
process.
[0261] The input files for the domain construction process 1604
include the following:
[0262] Character Definition File 1601
[0263] This defines all the valid characters of the domain and
specifies their usage. The range of usage typically includes
alphabetic, numeric, punctuation, space. It also specifies which
characters are similar for matching purposes. It also specifies all
information required to perform the "text string matching"
described below.
[0264] In the best embodiment of the invention, this file contains
one record per character, and each record contains:
[0265] the character in question
[0266] the character's type (alpha, numeric, etc)
[0267] a base character for case and diacritic matching (e.g. "a",
"Å" → "A")
[0268] a flag indicating the significance of the character. (e.g.
vowels are considered insignificant.)
[0269] one or more characters for standard international
transliteration. (see FIG. 17 for example tables)
[0270] This file could also define how character combinations are
translated into phonetic representations (e.g. "PH" → "F").
Phonetics is a known technique and a skilled person would be able
to arrange appropriate translation tables.
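As a rough sketch of the record layout listed above, the character definition data could be held in memory as one record per character. The field names, the sample characters and the transliteration values below are assumptions chosen for illustration; the actual file format is not specified here.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class CharRecord:
    char: str          # the character in question
    kind: str          # "alpha", "numeric", "punctuation" or "space"
    base: str          # base character for case and diacritic matching
    significant: bool  # False for characters such as vowels
    translit: str      # transliteration (illustrative values only)


def build_char_table():
    """Build a small, hypothetical character table keyed by character."""
    table = {}
    for ch in string.ascii_letters:
        is_vowel = ch.upper() in "AEIOU"
        table[ch] = CharRecord(ch, "alpha", ch.upper(), not is_vowel, ch)
    for ch in string.digits:
        table[ch] = CharRecord(ch, "numeric", ch, True, ch)
    # A few diacritic forms mapped to their base characters.
    for ch, base in (("Å", "A"), ("å", "A"), ("é", "E")):
        table[ch] = CharRecord(ch, "alpha", base, False, base)
    table[" "] = CharRecord(" ", "space", " ", True, " ")
    return table


CHAR_TABLE = build_char_table()


def base_form(text):
    """Map every known character to its base character; keep unknown ones."""
    return "".join(CHAR_TABLE[c].base if c in CHAR_TABLE else c for c in text)
```

With such a table, two strings that differ only in case or diacritics reduce to the same base form before any further matching is attempted.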
[0271] Regular Expression Definition 1602
[0272] This defines the structure of the elementary tokens of the
system. For example:
[0273] A word consists of two or more alphabetic characters. These
tokens are represented in the grammar by the term "word".
[0274] A number consists of one or more numeric characters.
Represented in the grammar by the term "nbr".
[0275] The structure of the Regular Expression definition is a
basic "state transition table". This technique is well known within
computer science. A working sample is shown in FIG. 18.
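The state transition table idea can be sketched as follows. This is not the table of FIG. 18; the state names, character classes and the handling of separators are assumptions made for the sake of a small runnable example that recognises the "word", "nbr" and "A" terms.

```python
def char_class(ch):
    """Classify a character (a stand-in for the character definition data)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    return "other"


# (current state, character class) -> next state
TRANSITIONS = {
    ("START", "alpha"): "ALPHA1",
    ("START", "numeric"): "NBR",
    ("ALPHA1", "alpha"): "WORD",
    ("WORD", "alpha"): "WORD",
    ("NBR", "numeric"): "NBR",
}

# Accepting states and the grammar term each one produces.
ACCEPTING = {"ALPHA1": "A", "WORD": "word", "NBR": "nbr"}


def tokenize(text):
    """Return a list of (term, lexeme) pairs driven by TRANSITIONS."""
    tokens = []
    state, start, i = "START", 0, 0
    while i <= len(text):
        cls = char_class(text[i]) if i < len(text) else "end"
        nxt = TRANSITIONS.get((state, cls))
        if nxt is not None:
            if state == "START":
                start = i          # a new token begins here
            state = nxt
            i += 1
        elif state in ACCEPTING:
            # No transition: close the current token and re-examine
            # the same character from the START state.
            tokens.append((ACCEPTING[state], text[start:i]))
            state = "START"
        else:
            i += 1                 # separator or end of input: skip
    return tokens
```

For example, `tokenize("12/34 Pitt St")` yields two "nbr" tokens followed by two "word" tokens, with the slash and spaces treated as separators.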
[0276] Grammar 1603
[0277] The basic premise of the grammar file is to define all
possible tree structures for the text objects created in its
language domain.
[0278] The grammar file consists of a number of grammar rules in
the form "A → B_1 B_2 B_3 ...". Each grammar rule consists of a
LHS symbol <A> and zero, one or many RHS symbols <B_n>. The LHS
symbol <A> is the name of the component type and the RHS symbols
<B_n> define its sub-components. Each of the RHS symbols <B_n>
can be one of the following:
[0279] Another component type name
[0280] A literal (enclosed in quotes)
[0281] A reserved word
[0282] The reserved words represent simple "regular expression"
terms as follows:
[0283] "word"--one or more alphabetic characters
[0284] "nbr"--one or more numeric characters
[0285] "A"--one alphabetic character
[0286] "9"--one numeric character
[0287] Additionally, each attribute type (i.e. LHS symbol) can be
assigned a "match weight adjustment". This is used to vary the
default match weighting. Match weightings are used when comparing
text objects to indicate the relative importance of sub-components
during the calculation of the matching confidence.
[0288] Additionally, each grammar rule can be assigned a "parsing
priority". This is used during the construction of text objects to
assist in selecting the best structure for the text object when two
or more ambiguous structures are available.
[0289] All branches at the lowest levels of the hierarchy of rules
and attribute type names defined by the grammar must end with
literals or reserved words. A simple example grammar is shown in
FIG. 19.
[0290] Procedure
[0291] FIGS. 20 and 21 provide flow charts of the domain object
construction process. Starting 2001, the character definition data
is loaded into memory at step 2002, then the regular expression
definition loaded at step 2003. Processing continues by reading the
grammar definition data and for each rule in the grammar 2004,
process the grammar rule 2005 by creating a new rule in the
temporary rule table 2102; using the LHS symbol of the rule to
create a new symbol/component type in the Symbol table if it does
not exist already, and then for each symbol on the RHS of the rule
(step 2104), if it is a literal 2105, then add it to the dictionary
2106; if it is a recognised "regular expression" term such as
"word" or "nbr" 2107, do nothing 2108; otherwise it is an
attribute/symbol and it is added as a new symbol/attribute type to
the Symbol table if it does not exist already at step 2109. After
all the grammar rules have been processed, processing continues at
step 2006 by checking that each symbol/attribute type added to the
Symbol table has been defined, i.e. has appeared at least once on
the LHS of a grammar rule (step 2007). If any symbols/attribute
types are undefined, an error condition is set at step 2011,
the procedure terminates and returns to the caller 2012. Otherwise
processing continues at step 2008. Again, for each symbol/attribute
type added to the Symbol table, a parse table is created at step
2009, and a reference to this new parse table is recorded in the
corresponding Symbol table entry. After all the required parse
tables have been created, the procedure terminates and returns to
the caller 2012.
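The flow described above can be sketched in a few lines. The sketch assumes that grammar rules arrive as (LHS, RHS list) pairs with literals written in double quotes; the sample rules are hypothetical and are not the grammar of FIG. 19. Parse table generation itself is replaced by a placeholder string, since LR table construction is outside the scope of the example.

```python
RESERVED = {"word", "nbr", "A", "9"}   # the "regular expression" terms


def build_domain(rules):
    """Build the rule table, Symbol table and dictionary from grammar rules."""
    rule_table = []
    symbols = {}        # name -> {"defined": bool, "parse_table": ...}
    dictionary = set()  # literals found on the RHS of rules

    for lhs, rhs in rules:
        rule_table.append((lhs, tuple(rhs)))
        entry = symbols.setdefault(lhs, {"defined": False, "parse_table": None})
        entry["defined"] = True            # appeared on the LHS of a rule
        for sym in rhs:
            if len(sym) >= 2 and sym[0] == sym[-1] == '"':
                dictionary.add(sym[1:-1])  # literal: add to the dictionary
            elif sym in RESERVED:
                continue                   # reserved term: do nothing
            else:
                symbols.setdefault(sym, {"defined": False, "parse_table": None})

    undefined = sorted(n for n, e in symbols.items() if not e["defined"])
    if undefined:
        raise ValueError("undefined symbols: " + ", ".join(undefined))

    for name, entry in symbols.items():
        # Placeholder only: a real implementation would generate a
        # (non-deterministic) LR parse table for this symbol here.
        entry["parse_table"] = "<parse table for " + name + ">"
    return rule_table, symbols, dictionary


EXAMPLE_RULES = [
    ("Street", ["Street Number", "Street Name", "Street Type"]),
    ("Street Number", ["nbr"]),
    ("Street Name", ["word"]),
    ("Street Type", ['"ST"']),
    ("Street Type", ['"RD"']),
]

rule_table, symbols, dictionary = build_domain(EXAMPLE_RULES)
```

A grammar that uses a symbol on the RHS without ever defining it on a LHS raises an error in this sketch, which corresponds to the error condition of step 2011.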
[0292] Building of parse tables is a well known technique within
computer science. Parse tables were originally developed for
programming languages. The algorithm for construction of the "LR
parsing table" can be found in Aho, A. V. and Ullman, J. D.
"Principles of Compiler Design" Addison Wesley 1977. Tomita applied
these techniques to "Natural Language Processing" by building parse
tables which are non-deterministic in that each entry in the tables
can have more than one action.
[0293] Note the domain object 1605 can be saved to memory or loaded
to operate on a record of free-format data.
[0294] Text Object Index
[0295] A "text object index" 109 (FIG. 1) is used as a means to
perform normal database operations on the "virtual data" fields of
a plurality of text objects and their associated free-format
text.
[0296] The basic concept for the text object index is similar to
the concepts published in the book "Human Associative Memory" by
John R. Anderson (Wiley 1973). This work described how the nouns in
a sentence are used to reference a database of named objects, and
then to match the "relationship" links between these objects to the
implied relationships in the original sentence. These relationships
follow the "Actor-Object-Action" model.
[0297] Although similar, the text object index differs from this
method in two major ways. 1) All constituent parts of the
free-format text are classified and used to reference the index.
(i.e. not just the nouns). 2) There are no relationship links
between objects.
[0298] Looking at the text object index with a different
perspective, one could consider the text object index an array with
unlimited dimensions where each dimension is one of the low level
matching attribute types described above. The text object created
from a free-format text string will provide the low level matching
components used to query the text object index, so that all
references to other text objects which are located at the
intersection of the supplied components are returned.
[0299] Performance improvements to this basic concept can be
provided by applying "fuzzy logic" techniques to the process. Fuzzy
logic techniques are well known to those skilled in the art and
many suitable reference books are available.
[0300] In the best embodiment of the invention, the main part of
the text object index is a three column table with the following
fields:
[0301] Attribute Type Identifier
[0302] Representative Value Key
[0303] User Supplied Record Identifier
[0304] This simple structure allows the text object index to be
implemented using the database technology available on the
respective computer.
[0305] The following example demonstrates how the three column
table is used. The basic idea behind the Text Object Index is that
all matching free-format texts have the same low level matching
attribute. For example, assume the following record has been added
to the text object index with a "user reference" of 123.
"Unit 12 34 Pitt Street, Sydney N.S.W., 2000"
[0306] After obtaining the respective text object's low level
matching attributes, the following entries will be added to the
index:
<Unit Number>     "12"      123
<Street Number>   "34"      123
<Street Name>     "PITT"    123
<Street Type>     "ST"      123
<Town Name>       "SYDNEY"  123
<State>           "NSW"     123
<Postcode>        "2000"    123
EXAMPLE 1
[0307] A query is performed to check if the following address
exists in the database.
"12/34PITT ST SYDNEY NSW"
[0308] After creating a text object for this input and generating
the low level matching attributes:
<Unit Number>     "12"
<Street Number>   "34"
<Street Name>     "PITT"
<Street Type>     "ST"
<Town Name>       "SYDNEY"
<State>           "NSW"
[0309] Performing intersection analysis on all index entries
retrieved with the above attribute-type identifiers and values
will yield the record specified at the beginning of this
section.
EXAMPLE 2
[0310] A query is performed to find all addresses which contain the
Street:
"PITT ST"
[0311] After creating a text object for this input and generating
the index key set:
<Street Name>     "PITT"
<Street Type>     "ST"
[0312] Again, performing intersection analysis on all index entries
retrieved with the above attribute-type identifiers and values will
yield the correct subset of records including the record specified
at the beginning of this section.
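The intersection analysis used in Examples 1 and 2 can be sketched with plain sets. The sketch assumes the low level matching components have already been extracted by the parser and are supplied as (attribute type, value) pairs; the class name is hypothetical, and record 456 is an invented second record added only so that Example 2 returns more than one match.

```python
from collections import defaultdict


class TextObjectIndex:
    """In-memory stand-in for the three column table."""

    def __init__(self):
        # (attribute type, representative value key) -> set of record ids
        self._entries = defaultdict(set)

    def insert(self, record_id, components):
        for attr_type, value in components:
            self._entries[(attr_type, value)].add(record_id)

    def select(self, components):
        """Return the ids of records that contain every supplied component."""
        result = None
        for attr_type, value in components:
            ids = self._entries.get((attr_type, value), set())
            result = set(ids) if result is None else result & ids
        return result or set()


index = TextObjectIndex()
index.insert(123, [
    ("Unit Number", "12"), ("Street Number", "34"), ("Street Name", "PITT"),
    ("Street Type", "ST"), ("Town Name", "SYDNEY"), ("State", "NSW"),
    ("Postcode", "2000"),
])
# Hypothetical second record, not taken from the description.
index.insert(456, [
    ("Street Number", "56"), ("Street Name", "PITT"), ("Street Type", "ST"),
    ("Town Name", "SYDNEY"), ("State", "NSW"),
])

example_1 = index.select([
    ("Unit Number", "12"), ("Street Number", "34"), ("Street Name", "PITT"),
    ("Street Type", "ST"), ("Town Name", "SYDNEY"), ("State", "NSW"),
])
example_2 = index.select([("Street Name", "PITT"), ("Street Type", "ST")])
```

Here `example_1` contains only record 123, while `example_2` contains both records, since both addresses are on Pitt St.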
[0313] The above examples have been oversimplified to demonstrate
the concept. In a practical system, once the low level matching key
set has been generated, all the techniques used in "key word
searching" can be applied to each attribute type subset. For more
detailed information on "key word searching" techniques, refer to
the numerous books and journal articles published by Gerald
Salton.
[0314] "Key word search" techniques applicable to this invention
include:
[0315] Storing very common terms in a high speed cache and using
this to avoid doing searches on index with terms that will return
too many entries.
[0316] Using one or more Representative Value Keys that allows for
common misspellings. Typically this is the original value with
vowels and double consonants removed.
[0317] Using one or more Representative Value Keys that encodes the
original value into one or more phonetic representations.
[0318] Using a Representative Value Key that encodes the original
value into an international standard transliteration representation.
(See FIG. 17 for examples of Greek and Japanese Katakana
transliteration tables.)
[0319] Checking the original value against a dictionary of synonyms
to obtain the value which represents the full set of synonyms.
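Two of the techniques listed above lend themselves to a short sketch: a misspelling-tolerant key formed by removing double consonants and vowels, and a synonym lookup that maps a value to one representative value. The function names and the synonym entries are assumptions for illustration; the order of the two reductions (doubles first, then vowels) is also a choice made for this sketch.

```python
import re

VOWELS = set("AEIOU")

# Hypothetical synonym dictionary: each value maps to one representative.
SYNONYMS = {"STREET": "ST", "STR": "ST", "ROAD": "RD"}


def skeleton_key(value):
    """Collapse doubled consonants, then drop vowels (ASCII letters only)."""
    upper = value.upper()
    collapsed = re.sub(r"([B-DF-HJ-NP-TV-Z])\1+", r"\1", upper)
    return "".join(ch for ch in collapsed if ch not in VOWELS)


def canonical_value(value):
    """Return the representative of the value's synonym set, if it has one."""
    upper = value.upper()
    return SYNONYMS.get(upper, upper)


def representative_keys(value):
    """Return the keys under which a value could be stored in the index."""
    canonical = canonical_value(value)
    return [canonical, skeleton_key(canonical)]
```

Under these assumptions "Pitt" and the misspelling "Pit" share the key "PT", and "Street" is indexed under the same representative value as "ST".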
[0320] Interface/Operations
[0321] The following operations can be provided by the text object
index.
[0322] The interface of the text object index is designed to mirror
the standard commands of SQL. SQL is the "Structured Query Language"
of relational databases and is very well known within the computer
industry.
[0323] Insert Text Object
[0324] As shown in the previous examples, this operation makes all
the required changes to the text object index so that the
respective text object reference can be located using any similar
free-format text or subcomponent thereof.
[0325] The steps required by this operation are:
[0326] 1. Call the "Get Keys" function of the respective text object
to obtain all of its low level matching components.
[0327] 2. For each low level matching component, add an entry in
the text object index's three column table.
[0328] 3. Optionally save the respective text object depending on
technical considerations of the current computer system.
[0329] Select Text Objects
[0330] This operation returns all references (normally record
identifiers supplied by the system user) to free-format texts which
contain the supplied free-format text. For example: to locate all
records which contain "Box Rd".
[0331] This operation proceeds with the following steps:
[0332] 1. Build a text object from the query input data.
[0333] 2. Invoke the "Get Keys" function of the text object to
obtain a list of all its low level matching components.
[0334] 3. Use the attribute type identifier and representative
value of each of the component nodes to retrieve all references
with any common low level matching items.
[0335] 4. Perform intersection analysis on the references returned
from the previous step to select the free-format texts which
contain all the important low level matching elements of the query
data.
[0336] 5. Obtain the original text objects.
[0337] 6. Perform a "Text Object" Comparison on each to obtain
confidences.
[0338] 7. Sort according to confidences.
[0339] 8. Return the results to the caller.
[0340] Delete Text Object
[0341] This operation takes the user supplied reference key and
deletes all records with that reference key.
[0342] Update Text Object
[0343] This operation updates the entries for a modified text
object by first deleting all the previous entries and then
reinserting new entries using the "Insert" operation described
above.
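Because the index is a simple three column table, the Insert, Delete and Update operations map directly onto an ordinary relational table. The following sketch uses Python's built-in sqlite3 module with an in-memory database; the table name, column names and function names are assumptions, and the components are supplied directly rather than obtained from a text object.

```python
import sqlite3


def open_index():
    """Create an in-memory three column index table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text_object_index ("
        " attr_type TEXT NOT NULL,"
        " value_key TEXT NOT NULL,"
        " record_id INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX idx_lookup ON text_object_index (attr_type, value_key)"
    )
    return conn


def insert_text_object(conn, record_id, components):
    """Add one row per low level matching component."""
    conn.executemany(
        "INSERT INTO text_object_index VALUES (?, ?, ?)",
        [(attr, value, record_id) for attr, value in components],
    )


def delete_text_object(conn, record_id):
    """Delete all rows carrying the user supplied reference key."""
    conn.execute(
        "DELETE FROM text_object_index WHERE record_id = ?", (record_id,)
    )


def update_text_object(conn, record_id, components):
    """Update by deleting the previous entries and reinserting new ones."""
    delete_text_object(conn, record_id)
    insert_text_object(conn, record_id, components)


def lookup(conn, attr_type, value_key):
    """Return the sorted record ids stored under one attribute/value pair."""
    rows = conn.execute(
        "SELECT record_id FROM text_object_index"
        " WHERE attr_type = ? AND value_key = ?",
        (attr_type, value_key),
    )
    return sorted(row[0] for row in rows)
```

A caller would typically wrap the delete and reinsert of an update in a single transaction so that the index never exposes a half-updated record.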
[0344] Text String Operations
[0345] The techniques used to compare two text strings to obtain a
matching confidence are well known within the computer industry.
This section is provided as a quick overview of what text string
matching normally involves.
[0346] A typical matching procedure could perform the following
steps:
[0347] 1. Check for exact character match without regard to upper
and lower case.
[0348] 2. Check for common spelling mistakes by removing vowels and
double consonants, then comparing the results.
[0349] 3. Check for any spelling mistakes by performing comparison
functions which allow for character deletion, insertion and
transposition.
[0350] 4. Check for similarity after standard international
transliteration. See FIG. 17 for example of transliteration
tables.
[0351] 5. Check for phonetic similarity after translating the
string into a standard phonetic representation.
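Steps 1 and 3 of the procedure above can be sketched with a standard edit distance that allows deletion, insertion, substitution and transposition of adjacent characters (the "optimal string alignment" variant). Steps 2, 4 and 5 depend on tables not reproduced here and are left out. The way the distance is scaled into a confidence value is an assumption of this sketch, not a formula given in the description.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute and transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def match_confidence(a, b):
    """Return a confidence between 0.0 and 1.0 (illustrative scaling)."""
    a, b = a.upper(), b.upper()
    if a == b:                           # step 1: exact match ignoring case
        return 1.0
    longest = max(len(a), len(b))
    return 1.0 - osa_distance(a, b) / longest   # step 3: spelling errors
```

For instance, "Pitt" and "PITT" match with full confidence, while "Sydney" and "Sydny" differ by one deletion and score slightly lower.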
[0352] In the present invention, text string matching is performed
on certain low level matching component nodes. The values used in
steps 1, 2, 4 and 5 of the above procedure may be generated each
time the string comparison is done, or alternatively may be
generated once when the text object is created and stored within
the respective component node. These values could also be used as
the "representative value key" in the text object index described
above.
[0353] Steps 4 and 5 of the above procedure allow the invention to
compare free-format data in foreign language text e.g., Japanese
Kanji. A phonetic value can be stored for the Kanji symbols, and
can be used to compare the Kanji with elements of other free-format
data which may not be in Kanji. In other words, this feature
facilitates the processing of free-format data in foreign
languages. See FIG. 17 and the previous description.
Example Application of Invention
[0354] FIG. 22 gives an example of how this invention could be
implemented within a SQL relational database implementation. A
description of the SQL statements is as follows:
[0355] 1. Create a domain object called "US_ADDRESS"
[0356] 2. Initialise it with a Language definition (which contains
the character definition and regular expression definition
described above and a Grammar definition).
[0357] 3. Create a text object class called "ADDRESS"
[0358] 4. Set its domain to "US_ADDRESS" and its type to "Address"
(the type name must be defined in the grammar.)
[0359] 5. Create a database table called "PERSONS" with one of the
elements being an "ADDRESS" text object called "Home_Addr".
[0360] 6. Insert a record into the table.
[0361] 7. Select all records in the "PERSONS" table with a specific
address.
[0362] 8. Select all records in the "PERSONS" table that have the
data in "Home_Addr" column which contains a sub-component "State"
with a value matching "California"
[0363] 9. Select all records in the "PERSONS" table that have the
data in "Home_Addr" column which contains a sub-component "Street"
that matches "Kathie St" with a confidence level greater than
80%.
Concluding Remarks
[0364] Any free-format data record may be analysed by applying the
present invention and by constructing the appropriate domain using
the appropriate domain construction process and appropriately
designed input files. All data can be analysed by computer in this
way to produce text objects for all free-format descriptions.
[0365] It will be appreciated that there are a number of processing
steps for processing free-format data in accordance with
embodiments of the present invention. It will be appreciated that
each of these steps can be done once during system initialisation
and the results saved, or they can be performed at execution time
only when they are needed (e.g., every time a query is performed).
A summary of these steps is as follows:
[0366] Construction of the domain object.
[0367] Construction of the text object's text node tree.
[0368] Construction of the text object's extra implied sub-fields.
[0369] In addition to this, there are the other related steps of
producing a text object index from a plurality of text objects.
[0370] It will be appreciated by persons skilled in the art that
numerous variations and/or modifications may be made to the
invention as shown in the specific embodiments without departing
from the spirit or scope of the invention as broadly described. The
present embodiments are, therefore, to be considered in all
respects as illustrative and not restrictive.