U.S. patent application number 10/195847, for a database and method of generating same, was filed with the patent office on 2002-07-11 and published on 2003-05-08.
This patent application is currently assigned to SurfControl plc. Invention is credited to Harald Stoiber and Klaus Thurnhofer.
Application Number | 20030088577 10/195847 |
Family ID | 9918880 |
Filed Date | 2002-07-11 |
United States Patent Application | 20030088577 |
Kind Code | A1 |
Thurnhofer, Klaus; et al. | May 8, 2003 |
Database and method of generating same
Abstract
A database comprises a plurality of keys representing respective
data items stored in the database and respective data tags
associated with at least some of the data items. Data tags
represent different identifiers or categories among which the
associated data items are grouped. The database is arranged in the
form of a tree-structured directed graph in which each of the
plurality of keys is represented by a series of nodes and arcs
defining a path between a root node and a terminal node, each node
being linked to at least one other node by a respective arc,
respective arcs for a given one of the plurality of keys
representing a respective character or characters of the given key.
The arcs and the nodes depending from the root node of data items
which represent a sequence of characters shared by different keys
are combined, and the data tags are associated with the arcs.
Inventors: | Thurnhofer, Klaus (Vienna, AT); Stoiber, Harald (Vienna, AT) |
Correspondence Address: | Paul D. Greeley, Esq., Ohlandt, Greeley, Ruggiero & Perle, L.L.P., 10th Floor, One Landmark Square, Stamford, CT 06901-2682, US |
Assignee: | SurfControl plc |
Family ID: | 9918880 |
Appl. No.: | 10/195847 |
Filed: | July 11, 2002 |
Current U.S. Class: | 1/1; 707/999.103; 707/E17.012 |
Current CPC Class: | G06F 16/90344 20190101; G06F 16/2246 20190101 |
Class at Publication: | 707/103.00Y |
International Class: | G06F 007/00 |
Foreign Application Data
Date | Code | Application Number |
Jul 20, 2001 | GB | 0117721.1 |
Claims
1. A database comprising a plurality of keys representing
respective data items stored in the database and respective data
tags associated with at least some of the data items, respective
data tags representing different identifiers or categories among
which the associated data items are grouped, wherein the database
is arranged in the form of a tree data structure in which each of
said plurality of keys is represented by a series of nodes and arcs
defining a path between a root node and a terminal node, each node
being linked to at least one other node by a respective arc,
respective arcs for a given one of said plurality of keys
representing a respective character or characters of said given
key, and wherein the arcs and the nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys are combined, and the data tags are associated
with the arcs.
2. A database comprising a plurality of keys representing
respective data items stored in the database, wherein the database
is arranged in the form of a tree data structure in which each of
said plurality of keys is represented by a series of nodes and arcs
defining a path between a root node and a terminal node, each node
being linked to at least one other node by a respective arc,
respective arcs for a given one of said plurality of keys
representing a respective character or characters of said given
key, and wherein the arcs and the nodes depending from said root
node of data items representing a sequence of characters shared by
different keys are combined, and the arcs and the nodes extending
from a given terminal node of data items representing a sequence of
characters shared by different keys are also combined, said given
terminal node being a sink.
3. A database according to claim 1, wherein the arcs and the nodes
extending from a given terminal node of data items representing a
sequence of characters shared by different keys are also combined,
said given terminal node being a sink.
4. A database according to claim 1 wherein a data tag is associated
with each one of the arcs so that a data tag is read from the
database as said respective character(s) of the key are read from
the database.
5. A database according to claim 3 wherein a data tag is associated
with each one of the arcs so that a data tag is read from the
database as said respective character(s) of the key are read from
the database.
6. A database according to claim 4 wherein the last data tag which
is read before reaching a terminal node defines the category or
identifier of the key.
7. A database according to claim 5 wherein the last data tag which
is read before reaching a terminal node defines the category or
identifier of the key.
8. A database according to claim 1 wherein in cases where
successive arcs within a path have the same data tags associated
with them, only one occurrence of the data tag when reading from
the root node is stored in the database.
9. A database according to claim 3 wherein in cases where
successive arcs within a path have the same data tags associated
with them, only one occurrence of the data tag when reading from
the root node is stored in the database.
10. A database according to claim 8 wherein said only one is the
first occurrence of the data tag.
11. A database according to claim 9 wherein said only one is the
first occurrence of the data tag.
12. A method of generating a database having a plurality of keys
representing respective data items stored in the database and
respective data tags associated with at least some of the data
items, respective data tags representing different identifiers or
categories among which the data items are grouped, wherein the
method comprises: generating a data set represented by tree data
structure in which each of said plurality of keys is represented by
a series of nodes and arcs defining a path between a root node and
a terminal node, each node being linked to at least one other node
by a respective arc, and respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key wherein arcs and nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys and category or identifier are combined; and
associating at least some of the arcs with data tags which
correspond to the category or identifier of the key represented by
the character or characters of the arc.
13. A method of generating a database having a plurality of keys
representing respective data items stored in the database, wherein
the method comprises: generating a data set represented by tree
data structure in which each of said plurality of keys is
represented by a series of nodes and arcs defining a path between a
root node and a terminal node, each node being linked to at least
one other node by a respective arc, and respective arcs for a given
one of said plurality of keys representing a respective character
or characters of said given key, wherein arcs and nodes depending
from said root node of data items which represent a sequence of
characters shared by different keys are combined; and compacting
the data set so that arcs and nodes extending from a given terminal
node towards said root node of data items which represent a
sequence of characters shared by different keys are also combined,
said given terminal node being a sink.
14. A method according to claim 11, the method further including
compacting the data set by removing from a sequence of repeating
identical data tags all but one of said identical data tags.
15. A method according to claim 14, the method further including
further compacting the data set so that arcs and nodes extending
from a given terminal node towards said root node of data items
which represent a sequence of characters and category or identifier
shared by different keys are also combined, wherein said given
terminal node is a sink node.
16. A method according to claim 14, wherein said one of said
identical data tags is the first occurrence thereof in the
sequence.
17. A method according to claim 16, wherein either one or both of
the steps of compacting the data set include a recursive
routine.
18. A method according to claim 15, wherein said one of said
identical data tags is the first occurrence thereof in the
sequence.
19. A method according to claim 18, wherein either one or both of
the steps of compacting the data set include a recursive
routine.
20. A method according to claim 13, wherein either one or both of
the steps of compacting the data set include a recursive
routine.
21. A method according to said compacting step of claim 13
including assigning a weight value to nodes of the data set, the
weight value of a given node being dependent on the characters
between said given node and an associated sink(s), said given node
and associated sink(s) defining a sub-tree of said data set, and
identifying two or more nodes having identical weight values as
potentially having identical sub-trees.
22. A method according to said further compacting step of claim 15
including assigning a weight value to nodes of the data set, the
weight value of a given node being dependent on the characters
between said given node and an associated sink(s), said given node
and associated sink(s) defining a sub-tree of said data set, and
identifying two or more nodes having identical weight values as
potentially having identical sub-trees.
23. A method according to claim 21 wherein the weight value is
based on a checksum value incorporating the category or identifier
of an arc extending from the node to which the weight value is
being applied, in addition to the characters in the sub-tree.
24. A method according to claim 22 wherein the weight value is
based on a checksum value incorporating the category or identifier
of an arc extending from the node to which the weight value is
being applied, in addition to the characters in the sub-tree.
25. A method according to claim 23 wherein the checksum value
further incorporates an indication of the size of the associated
sub-tree of the given node.
26. A method according to claim 24 wherein the checksum value
further incorporates an indication of the size of the associated
sub-tree of the given node.
27. A method according to claim 21, wherein the step of compacting
to reduce identical sub-trees includes comparing with one another
the nodes and sub-trees depending from, and including, nodes having
identical weight values.
28. A method according to claim 22 wherein the step of compacting
to reduce identical sub-trees includes comparing with one another
the nodes and sub-trees depending from, and including, nodes having
identical weight values.
29. A method according to claim 27 wherein nodes having weight
values representative of longer sub-trees are preferably compared
and compacted prior to those representative of shorter ones.
30. A method according to claim 28 wherein nodes having weight
values representative of longer sub-trees are preferably compared
and compacted prior to those representative of shorter ones.
31. A method according to claim 29 wherein nodes and their
respective sub-trees identified as identical are rationalised by
directing the arc(s) leading to one of the nodes to the other node
and removing said one node and its associated sub-tree from the
database.
32. A method according to claim 30 wherein nodes and their
respective sub-trees identified as identical are rationalised by
directing the arc(s) leading to one of the nodes to the other node
and removing said one node and its associated sub-tree from the
database.
33. A method according to claim 31 wherein the identification of
the sub-trees includes use of a recursive routine.
34. A method according to claim 32 wherein the identification of
the sub-trees includes use of a recursive routine.
35. A database according to claim 1 wherein the tree data structure
is in the form of a tree-structured directed graph.
36. A database according to claim 2 wherein the tree data structure
is in the form of a tree-structured directed graph.
37. A method according to claim 12 wherein the tree data structure
is in the form of a tree-structured directed graph.
38. A method according to claim 13 wherein the tree data structure
is in the form of a tree-structured directed graph.
39. A database according to claim 1, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages.
40. A database according to claim 2, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages.
41. A method according to claim 12, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages.
42. A method according to claim 13, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages.
43. A database according to claim 1, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages, the categories corresponding to subject matter
types, respective data tags representing different subject matter
types.
44. A method according to claim 12, wherein the data items
represent Universal Resource Locators (URL'S) for identifying
Internet web pages, the categories corresponding to subject matter
types, respective data tags representing different subject matter
types.
45. A data carrier comprising a database comprising a plurality of
keys representing respective data items stored in the database and
respective data tags associated with at least some of the data
items, respective data tags representing different identifiers or
categories among which the associated data items are grouped,
wherein the database is arranged in the form of a tree data
structure in which each of said plurality of keys is represented by
a series of nodes and arcs defining a path between a root node and
a terminal node, each node being linked to at least one other node
by a respective arc, respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key, and wherein the arcs and the nodes depending
from said root node of data items which represent a sequence of
characters shared by different keys are combined, and the data tags
are associated with the arcs.
46. A data carrier according to claim 45, wherein the data items of
the database are URL's and the data tags are subject matter types
for them.
47. A data carrier according to claim 45, wherein the arcs and the
nodes extending from a given terminal node of data items
representing a sequence of characters shared by different keys are
also combined, said given terminal node being a sink.
48. A data carrier according to claim 47, wherein the data items of
the database are URL's and the data tags are subject matter types
for them.
49. A data carrier comprising a database comprising a plurality of
keys representing respective data items stored in the database,
wherein the database is arranged in the form of a tree data
structure in which each of said plurality of keys is represented by
a series of nodes and arcs defining a path between a root node and
a terminal node, each node being linked to at least one other node
by a respective arc, respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key, and wherein the arcs and the nodes depending
from said root node of data items representing a sequence of
characters shared by different keys are combined, and the arcs and
the nodes extending from a given terminal node of data items
representing a sequence of characters shared by different keys are
also combined, said given terminal node being a sink.
50. A data carrier according to claim 49, wherein the data items of
the database are URL's and the data tags are subject matter types
for them.
51. A computer program containing code, which when run on a
computer, can configure the computer to generate a database
comprising a plurality of keys representing respective data items
stored in the database and respective data tags associated with at
least some of the data items, respective data tags representing
different identifiers or categories among which the associated data
items are grouped, wherein the database is arranged in the form of
a tree data structure in which each of said plurality of keys is
represented by a series of nodes and arcs defining a path between a
root node and a terminal node, each node being linked to at least
one other node by a respective arc, respective arcs for a given one
of said plurality of keys representing a respective character or
characters of said given key, and wherein the arcs and the nodes
depending from said root node of data items which represent a
sequence of characters shared by different keys are combined, and
the data tags are associated with the arcs.
52. A computer program according to claim 51, wherein the arcs and
the nodes extending from a given terminal node of data items
representing a sequence of characters shared by different keys are
also combined, said given terminal node being a sink.
53. A computer program containing code, which when run on a
computer, can configure the computer to generate a database
comprising a plurality of keys representing respective data items
stored in the database, wherein the database is arranged in the
form of a tree data structure in which each of said plurality of
keys is represented by a series of nodes and arcs defining a path
between a root node and a terminal node, each node being linked to
at least one other node by a respective arc, respective arcs for a
given one of said plurality of keys representing a respective
character or characters of said given key, and wherein the arcs and
the nodes depending from said root node of data items representing
a sequence of characters shared by different keys are combined, and
the arcs and the nodes extending from a given terminal node of data
items representing a sequence of characters shared by different
keys are also combined, said given terminal node being a sink.
54. A computer program containing code for configuring a computer
to perform a method of generating a database having a plurality of
keys representing respective data items stored in the database and
respective data tags associated with at least some of the data
items, respective data tags representing different identifiers or
categories among which the data items are grouped, wherein the
method comprises: generating a data set represented by tree data
structure in which each of said plurality of keys is represented by
a series of nodes and arcs defining a path between a root node and
a terminal node, each node being linked to at least one other node
by a respective arc, and respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key wherein arcs and nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys and category or identifier are combined; and
associating at least some of the arcs with data tags which
correspond to the category or identifier of the key represented by
the character or characters of the arc.
55. A computer program according to claim 54, wherein the method
further includes: compacting the data set by removing from a
sequence of repeating identical data tags all but one of said
identical data tags; and further compacting the data set so that
arcs and nodes extending from a given terminal node towards said
root node of data items which represent a sequence of characters
and category or identifier shared by different keys are also
combined, wherein said given terminal node is a sink node.
56. A computer program containing code for configuring a computer
to perform a method of generating a database having a plurality of
keys representing respective data items stored in the database,
wherein the method comprises: generating a data set represented by
tree data structure in which each of said plurality of keys is
represented by a series of nodes and arcs defining a path between a
root node and a terminal node, each node being linked to at least
one other node by a respective arc, and respective arcs for a given
one of said plurality of keys representing a respective character
or characters of said given key, wherein arcs and nodes depending
from said root node of data items which represent a sequence of
characters shared by different keys are combined; and compacting
the data set so that arcs and nodes extending from a given terminal
node towards said root node of data items which represent a
sequence of characters shared by different keys are also combined,
said given terminal node being a sink.
Description
BACKGROUND OF THE INVENTION
[0001] The present invention relates to a database and a method of
generating a database. In particular, the invention relates to a
database which facilitates efficient storage of data and rapid
search and retrieval of data.
DESCRIPTION OF THE PRIOR ART
[0002] Databases are used in computer-based information and
processing systems for the storage of large quantities of
information or data items for subsequent retrieval and processing.
Such databases often require updating from time to time for
redistribution to users who may be situated remotely from the
producer of the database. Logistical difficulties can arise when
databases become large. For example, users of the database might
not have the same storage capacity as that enjoyed by the database
creator and, in cases where users download updated databases via a computer
network such as the Internet, download times can become burdensome
and functionality may be compromised.
[0003] Preferred methods of data storage vary depending on the type
of data to be stored. Opportunities for compression of data exist
particularly when data to be stored contains repetitive elements.
Various schemes exist in the art for increasing the efficiency of
data management. For example, relational databases are adopted in
situations where it is desirable to avoid repetition of data entry.
A relational database might be adopted for customer contact
information having different categories. Such a database might
employ a plurality of separate database tables, one for each
category of information such as: one for customer name and address,
a second for accounting records and a third for product
information. These tables are linked, or related, by a customer ID
so that accounting and/or product information can be retrieved
without the need to store customer name and address data in the
table of each category.
[0004] A difficulty arises when it is desired to store a large
number of data items which are to be classified into a relatively
small number of different categories. In such a case, the database
is likely to be structured as a single table listing the data
items. The category of each data item is then stored against each
data item in the table. The difficulty is that the resulting table
becomes extravagant on storage space because the same category
identification is stored many times within the same table. The
larger the database becomes, the more difficult it is to transfer
between users and the longer it takes to retrieve information from
it.
[0005] This problem of `wasted space` is exacerbated in cases where
the data items contain repetitive elements or components. For
example, in a database for relating Internet web pages identified
by Uniform Resource Locators (URL's) to subject category, it is
expected that there will be millions of URL's and subject
categories numbered in the order of a few tens to hundreds,
possibly a few thousand. URL's are keys containing strings of
alphanumeric and other characters. Not only is there `wasted space`
in the storage of identical subject categories against multiple
data items, but there is `wasted space` in storing elements (i.e.
strings of characters) which repeat themselves among the URL's.
[0006] Although numerous methods of data compression are known in
the art, these techniques are generally applicable to the passive
storage and transport of data. In other words, a database compressed
by such techniques is not designed to facilitate search and retrieval
of data while in a compressed state. It is an aim of the invention to devise a
database structure which provides for greater storage capacity and
searching speed in a decompressed state.
[0007] U.S. Pat. No. 6,219,786 relates to a method and system for
monitoring and controlling computer users' access to network
resources from both inside and outside the network. The system
monitors network traffic and applies access rules to the traffic to
permit or deny access to predetermined network resources. In one
application of this system, a networked computer may be monitored
so that access to predetermined Internet web-sites can be permitted
while access to others is denied. Such a system may include a database of URL's
which are categorised by subject. Given the existence of many tens
or even hundreds of millions of URL's which may be accessed via the
World Wide Web (www), a database of these containing a category
data tag for each can be expected to require a great deal of
storage capacity and be slow to search.
OBJECTIVES AND SUMMARY OF THE INVENTION
[0008] It is therefore an aim of the invention to devise a database
structure and method of generating same which alleviates these
problems. In particular, it is an aim of the invention to devise a
database structure which can contain more data items than prior
art database structures having the same storage capacity. It is
another aim to provide for faster retrieval of data. It is a
further aim of the invention to devise a database structure which
provides for faster confirmation of the absence of a data item.
[0009] It is an aim of the invention to devise a database which can
store many millions of URL's and their respective category data
tags (numbering tens to hundreds) with a reduced storage
requirement. It is a further aim to provide for faster retrieval
and searching of such a database.
[0010] According to a first aspect of the present invention there
is provided a database comprising a plurality of keys representing
respective data items stored in the database and respective data
tags associated with at least some of the data items, respective
data tags representing different identifiers or categories among
which the associated data items are grouped, wherein the database
is arranged in the form of a tree data structure in which each of
said plurality of keys is represented by a series of nodes and arcs
defining a path between a root node and a terminal node, each node
being linked to at least one other node by a respective arc,
respective arcs for a given one of said plurality of keys
representing a respective character or characters of said given
key, and wherein the arcs and the nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys are combined, and the data tags are associated
with the arcs.
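As a rough illustration of the first aspect, the Python sketch below builds a character tree in which keys that begin with the same characters share arcs and nodes, and in which an arc can carry a data tag. The names `Node`, `Arc`, `insert` and `count_nodes`, the example keys and the category labels are all hypothetical. Storing the tag on the arc that completes a key is only one possible way of associating tags with arcs, chosen here for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Arc:
    target: "Node"
    tag: Optional[str] = None  # data tag (category) carried by this arc, if any


@dataclass
class Node:
    arcs: Dict[str, Arc] = field(default_factory=dict)  # one arc per character
    terminal: bool = False  # True if a key ends at this node


def insert(root: Node, key: str, tag: str) -> None:
    """Add a key, reusing arcs and nodes for any prefix already present."""
    if not key:
        raise ValueError("empty key")
    node = root
    arc = None
    for ch in key:
        arc = node.arcs.get(ch)
        if arc is None:
            arc = Arc(target=Node())
            node.arcs[ch] = arc
        node = arc.target
    arc.tag = tag  # assumption: the tag sits on the arc that completes the key
    node.terminal = True


def count_nodes(node: Node) -> int:
    """Count the nodes in the tree rooted at node."""
    return 1 + sum(count_nodes(a.target) for a in node.arcs.values())


root = Node()
insert(root, "news.example/a", "NEWS")
insert(root, "news.example/b", "NEWS")
insert(root, "shop.example", "SHOPPING")
print(count_nodes(root))
```

Because the two `news.example/...` keys share their first thirteen characters, the second key adds only one node, whereas storing each key separately would need a full path per key.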
[0011] In a preferred embodiment of the invention, a data tag is
associated with each one of the arcs so that a data tag is read
from the database as said respective character(s) of the key are
read from the database. The last data tag which is read before
reaching a terminal node defines the category or identifier of the
key. In cases where successive arcs within a path have the same
data tags associated with them, only one, for example the first
occurrence of the data tag when reading from the root node, is
stored in the database to reduce or eliminate redundancy of data
therein.
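The reading rule in this paragraph can be sketched as follows. The nested-dictionary layout, the `END` marker, the `lookup` function and the example data are assumptions made for illustration. The tree is written out by hand, with a tag stored only where it differs from the tag already in force further up the path.

```python
from typing import Optional

END = "$"  # hypothetical marker: a key may end at a node containing END

# Each node maps a character to (tag or None, child node).
# Keys stored: "ab" -> "NEWS", "abc" -> "NEWS", "abd" -> "SPORT".
tree = {
    "a": ("NEWS", {                       # first occurrence of NEWS is kept
        "b": (None, {                     # repeated NEWS tag is not stored
            END: True,
            "c": (None, {END: True}),     # still NEWS, so nothing stored
            "d": ("SPORT", {END: True}),  # the tag changes on this arc
        }),
    }),
}


def lookup(root: dict, key: str) -> Optional[str]:
    """Return the last tag read on the way to a terminal node, else None."""
    node, last_tag = root, None
    for ch in key:
        if ch == END or ch not in node:
            return None  # key absent: stop at the first missing arc
        tag, node = node[ch]
        if tag is not None:
            last_tag = tag  # a later tag overrides the one read earlier
    return last_tag if END in node else None


print(lookup(tree, "abc"))  # NEWS
print(lookup(tree, "abd"))  # SPORT
print(lookup(tree, "abx"))  # None
```

In this sketch a search for a missing key stops at the first character that has no matching arc, so absence is confirmed without reading the rest of the key.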
[0012] According to a second aspect of the present invention there
is provided a database comprising a plurality of keys representing
respective data items stored in the database, wherein the database
is arranged in the form of a tree data structure in which each of
said plurality of keys is represented by a series of nodes and arcs
defining a path between a root node and a terminal node, each node
being linked to at least one other node by a respective arc,
respective arcs for a given one of said plurality of keys
representing a respective character or characters of said given
key, and wherein the arcs and the nodes depending from said root
node of data items representing a sequence of characters shared by
different keys are combined, and the arcs and the nodes extending
from a given terminal node of data items representing a sequence of
characters shared by different keys are also combined, said given
terminal node being a sink.
[0013] A database may incorporate the first and the second aspects
of the invention. In such a database, the data tags are
rationalised to minimise the amount of storage space taken up by
category or identifier information for the keys. Further storage
savings are achieved by combining the arcs and nodes of characters
or character sequences shared by different keys, both when reading
from the root node to the terminal nodes and when reading from the
terminal nodes to the root node, said terminal nodes being sinks.
[0014] According to a further aspect of the present invention there
is provided a method of generating a database having a plurality of
keys representing respective data items stored in the database and
respective data tags associated with at least some of the data
items, respective data tags representing different identifiers or
categories among which the data items are grouped, wherein the
method comprises:
[0015] generating a data set represented by a tree data structure in
which each of said plurality of keys is represented by a series of
nodes and arcs defining a path between a root node and a terminal
node, each node being linked to at least one other node by a
respective arc, and respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key wherein arcs and nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys and category or identifier are combined; and
[0016] associating at least some of the arcs with data tags which
correspond to the category or identifier of the key represented by
the character or characters of the arc.
[0017] In a preferred embodiment, the method further includes
compacting the data set by removing from a sequence of repeating
identical data tags all but one of said identical data tags.
Preferably, successive data tags identical to the first occurrence
thereof in the sequence are removed. This allows redundant data
tags to be removed from the database thereby making space available
for more data items.
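A minimal sketch of this tag compaction, assuming that every arc initially carries the tag of its key and that arcs are shared only when both character and tag match (one reading of the "characters shared by different keys and category or identifier" wording above). The `Arc` class, the `insert`, `compact_tags` and `count_tags` helpers and the example keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Arc:
    char: str
    tag: Optional[str]
    arcs: List["Arc"] = field(default_factory=list)  # arcs leaving the target node
    terminal: bool = False  # True if a key ends after this arc


def insert(arcs: List[Arc], key: str, tag: str) -> None:
    """Add a key; every arc on its path initially carries the key's tag."""
    arc = None
    for ch in key:
        arc = next((a for a in arcs if a.char == ch and a.tag == tag), None)
        if arc is None:
            arc = Arc(ch, tag)
            arcs.append(arc)
        arcs = arc.arcs
    if arc is not None:
        arc.terminal = True


def compact_tags(arcs: List[Arc], inherited: Optional[str] = None) -> None:
    """Recursively drop tags that repeat the tag already in force on the path."""
    for arc in arcs:
        in_force = arc.tag if arc.tag is not None else inherited
        if arc.tag == inherited:
            arc.tag = None  # redundant: same as the first occurrence above
        compact_tags(arc.arcs, in_force)


def count_tags(arcs: List[Arc]) -> int:
    """Count the tags actually stored in the tree."""
    return sum((a.tag is not None) + count_tags(a.arcs) for a in arcs)


root: List[Arc] = []
insert(root, "abc", "NEWS")
insert(root, "abd", "NEWS")
insert(root, "xy", "SPORT")
before = count_tags(root)
compact_tags(root)
print(before, count_tags(root))  # 6 2
```

After compaction only the first arc of each run keeps its tag, so a reader that remembers the last tag seen while following a path still recovers the category of every key.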
[0018] According to a yet further aspect of the present invention,
there is provided a method of generating a database having a
plurality of keys representing respective data items stored in the
database, wherein the method comprises:
[0019] generating a data set represented by a tree data structure in
which each of said plurality of keys is represented by a series of
nodes and arcs defining a path between a root node and a terminal
node, each node being linked to at least one other node by a
respective arc, and respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key, wherein arcs and nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys are combined; and
[0020] compacting the data set so that arcs and nodes extending
from a given terminal node towards said root node of data items
which represent a sequence of characters shared by different keys
are also combined, said given terminal node being a sink.
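One common way to combine shared endings is to canonicalise sub-trees from the leaves upward, so that identical sub-trees collapse into a single copy and all keys finish at one sink. The sketch below takes that approach; it is swapped in here for illustration and is not necessarily the routine used by the invention. The names `build_trie`, `merge_suffixes` and `count_nodes` and the example keys are hypothetical, and for simplicity the sketch assumes that no key is a prefix of another, so every path ends at a leaf.

```python
from typing import Dict, Tuple

Node = Dict[str, "Node"]  # a node maps each character to its child node


def build_trie(keys) -> Node:
    """Prefix-sharing tree: keys with the same leading characters share nodes."""
    root: Node = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
    return root


def merge_suffixes(node: Node, registry: Dict[Tuple, Node]) -> Node:
    """Recursively replace each sub-tree by one shared copy of its shape."""
    for ch in list(node):
        node[ch] = merge_suffixes(node[ch], registry)
    # Children are already canonical, so their identities describe the sub-tree.
    signature = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(signature, node)


def count_nodes(node: Node, seen=None) -> int:
    """Count the distinct node objects reachable from node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen) for child in node.values())


keys = ["news.com", "shop.com", "mail.com"]
trie = build_trie(keys)
before = count_nodes(trie)
dag = merge_suffixes(trie, {})
print(before, count_nodes(dag))  # 25 15
```

The three keys share no prefix, so the plain tree needs 25 nodes. After merging, the common ending `.com` is stored once and all three paths terminate at the same empty node, which plays the role of the sink.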
[0021] In a yet further aspect of the present invention, there is
provided a method of generating a database having a plurality of
keys representing respective data items stored in the database and
respective data tags associated with at least some of the data
items, respective data tags representing different categories or
identifiers among which the data items are grouped, wherein the
method comprises:
[0022] generating a data set represented by a tree data structure
in which each of said plurality of keys is represented by a series
of nodes and arcs defining a path between a root node and a
terminal node, each node being linked to at least one other node by
a respective arc, and respective arcs for a given one of said
plurality of keys representing a respective character or characters
of said given key wherein arcs and nodes depending from said root
node of data items which represent a sequence of characters shared
by different keys and category or identifier are combined;
[0023] associating at least some of the arcs with data tags which
correspond to the category or identifier of the key represented by
the character or characters of the arc;
[0024] compacting the data set by removing from a sequence of
repeating identical data tags all but one of said identical data
tags; and
[0025] further compacting the data set so that arcs and nodes
extending from a given terminal node towards said root node of data
items which represent a sequence of characters and category or
identifier shared by different keys are also combined, wherein said
given terminal node is a sink node.
[0026] The steps of compacting the data set may each include a
recursive routine. Successive data tags identical to the first
occurrence thereof in the sequence may be the ones removed.
[0027] In a preferred embodiment, said compacting step may include
assigning a weight value to nodes of the data set, the weight value
of a given node being dependent on the characters between said
given node and an associated sink(s), said given node and
associated sink(s) defining a sub-tree of said data set, and
identifying two or more nodes having identical weight values as
potentially having identical sub-trees. The weight value may be
based on a checksum value incorporating the category or identifier
of an arc extending from the node to which the weight value is
being applied, in addition to the characters in the sub-tree. The
checksum value may further incorporate an indication of the size of
the associated sub-tree of the given node.
[0028] The step of compacting to reduce identical sub-trees
includes comparing with one another the nodes and sub-trees
depending from, and including, nodes having identical weight
values. Nodes having weight values representative of longer
sub-trees are preferably compared and compacted prior to those
representative of shorter ones. This provides for a faster
compaction operation. Nodes and their respective sub-trees
identified as identical are rationalised by directing the arc(s)
leading to one of the nodes to the other node and removing said one
node and its associated sub-tree from the database. This may be
done using a recursive routine.
[0029] Any node except the root node may be a terminal node,
provided it represents the end of a path defining a key. All nodes
that have no further arcs leading to further nodes are terminal
nodes, sometimes referred to as "sinks". A node may be a terminal
node because it defines the end of a key, but may also have further
arcs leading to other nodes, the further arcs representing
characters of other keys. The tree data structure may be in the
form of a tree-structured directed graph.
[0030] In an embodiment of the present invention, the data items
may represent Universal Resource Locators (URLs) for identifying
Internet web pages, the categories corresponding to subject matter
types, respective data tags representing different subject matter
types.
[0031] According to the present invention, there is yet further
provided a data carrier having stored thereon a database as defined
according to any aspect of the invention hereinabove. The data
items of the database may be URLs and the data tags may be subject
matter types for them. The data carrier may be in the form of any
computer readable medium, such as: CD-ROM; a hard disk of a
personal computer or network server; magnetic tape; or data
stream.
[0032] According to the present invention, there is yet further
provided a computer program containing code, which when run on a
computer can configure the computer to generate a database
according to any aspect of the invention defined hereinabove.
The computer program may contain code for configuring a computer to
perform any of the methods of generating a database as defined
hereinabove.
[0033] The terms used herein are defined in a dictionary published
by the National Institute of Standards and Technology (NIST), see
in particular their Dictionary of Algorithms, Data Structures and
Problems. This may be accessed via the Internet (see URL:
http://www.nist.gov/dads/terms.html).
[0034] It should be noted that variations may be made to
embodiments of the present invention without departing from the
scope thereof. For example, there may be instances within a
tree-structured directed graph in which pairs of nodes are linked
by more than one arc.
[0035] Embodiments of the invention have the advantage that
information in the form of sequences of characters that recur in
many different keys (for example, the sequences "www.", and ".com"
occur in a great many URLs) need only be stored a minimum number of
times in the database. This results in a substantial reduction in
the bit size of the database and the amount of memory required. A
further advantage is that searching is very fast because, once a sequence
of characters occurring in the key being sought has been found,
there is no need to search anywhere else in the database for those
characters. This arises from the tree-structured directed graph in
which there is only one valid next move as a data item to be
searched is looked up in the tree-structure. Also, once it is
determined that the next character in a sequence is not present in
the database, the search can be terminated because the key will not
be present elsewhere.
[0036] Further advantages arise from the optimisation and storage
of the data tag information. By storing the data tags with the
characters of the keys, as a key is read the data tag is also read,
removing the need to retrieve the data tag from an associated data
location. Removal of redundant data tags results in a substantial
reduction in the amount of data that has to be stored. In the case
where the database stores URLs, a tenfold reduction in the size of
the database is contemplated relative to prior art database
structures which may be employed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] The invention will now be further described by way of
example, with reference to the following drawings, in which:
[0038] FIG. 1 is a schematic diagram of a known computer system on
which a database embodying aspects of the present invention may be
implemented;
[0039] FIG. 2 is a flow diagram outlining a method for generating a
database in accordance with the first and second aspects of the
present invention;
[0040] FIG. 3 is an example of data items for use in an
illustration of a database embodying the first and/or second
aspects of the present invention;
[0041] FIG. 4 is a flow diagram with reference to which generation
of a database embodying the first aspect of the present invention
is explained;
[0042] FIGS. 5a to 5e are conceptual representations for explaining
the building up of a tree data structure for the data items of FIG.
3;
[0043] FIG. 6 is a conceptual representation of a tree data
structure in accordance with the first aspect of the present
invention;
[0044] FIG. 7 is a flow diagram with reference to which a process
of data tag optimisation is described;
[0045] FIG. 8 shows the directed graph representation of FIG. 6, in
which redundancy of the data tags in accordance with the process of
FIG. 7 has been reduced;
[0046] FIG. 9 shows the directed graph of FIG. 8 with weight values
assigned to nodes in accordance with creation of the database
embodying the first and second aspects of the present
invention;
[0047] FIG. 10 is a flow diagram with reference to which data
compaction in accordance with a fourth stage of the process of FIG.
2 is described;
[0048] FIG. 11 is a flow diagram showing a recursive procedure
adopted within the flow diagram of FIG. 10;
[0049] FIG. 12 shows the directed graph of FIG. 8 with an example
of how arcs and nodes may be shared to extend from a common
terminal node for a pair of data items having a common string of
characters;
[0050] FIG. 13 shows the directed graph of FIG. 8 with further
examples of how arcs and nodes are shared;
[0051] FIGS. 14a and 14b show examples of paths for two data items
which do not share the same root node or sink node;
[0052] FIG. 15 shows the directed graph of FIG. 8 with yet further
examples of how arcs and nodes are shared;
[0053] FIG. 16 shows the directed graph representation of FIG. 15,
redrawn to illustrate a database structure optimised for redundancy
using the example of FIG. 3;
[0054] FIG. 17 shows how a database embodying the first and second
aspects of the present invention may be represented in a data
stream;
[0055] FIG. 18 is a flow diagram showing a rapid search and
retrieval procedure for use with a database embodying the
invention; and
[0056] FIGS. 19a and 19b show further examples of paths for data
items having weight values assigned to nodes in accordance with
creation of the database embodying the first and second aspects of
the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0057] Referring to FIG. 1, a computer system comprises a user
interface 10, a processor 12, a data storage means 14, and program
memory 16, all of which communicate with each other via a data bus
18. The computer system further comprises an internet interface
device 20 for facilitating communication with the internet 22. A
disk drive and/or CD ROM drive 24 facilitate reading and/or writing
of data to and from portable media such as floppy disks or CDs.
User interface 10 comprises an information display, for example a
monitor, and a user input means such as a keyboard and/or a mouse.
Instructions contained in the program memory 16 control the
processor 12 to process data stored in the data storage means 14 or
read from portable media via the drive 24 or downloaded from the
internet 22. The system shown in FIG. 1 is a single-user
system; however, it will be appreciated that the system is
extendable to link two or more users communicating via the data bus
18 or internet/intranet/extranet links thereto.
[0058] Computer systems such as the one described in FIG. 1 utilise
databases comprising lists of information items and associated
categories. The information items are in the form of keys, each key
comprising a unique character string, for example names of
people/companies/places/products etc. The categories are
represented in the database by a category code, for example a
number, which takes the form of a data tag associated with each key.
When information about the category of an item is required, for
example at the request of a user, or in response to a coded
instruction as part of a software routine or control procedure, the
computer performs a search of the database to locate the key and
retrieve the data tag.
[0059] Databases can be very large, some holding many millions of
keys and their associated data tags. Prior art database structures
tend to be such that the computer has to search sequentially
through the entire list of keys stored in the database to find one
that matches the required key. It then retrieves the data tag to
identify the category. Two problems limit the efficacy of such
systems: firstly, the amount of data stored can be prohibitively
large, using up an excessive amount of data storage capacity;
secondly, the processing time for completing the search can be very
long and use up a large amount of computer memory.
[0060] FIG. 2 shows a process for creating a compact and rapidly
searchable database in accordance with the various aspects of the
present invention. The processes that make up the steps of FIG. 2
will be described for a specific example, using the data items
shown in FIG. 3, with reference to FIGS. 4 to 17. Referring to FIG.
2, the raw data 28 (keys and associated data tags) are read in at
step 30. At step 32 the raw data is processed to produce a data
structure representative of a tree data structure or
tree-structured directed graph 34, as will be described in more
detail below with reference to FIGS. 4 to 6. At step 36 an
algorithm is used to identify and discard superfluous data tags and
produce a data structure representative of an optimised directed
graph 38, as will be described below with reference to FIGS. 7 and
8.
[0061] The optimised directed graph 38 is compacted by the
processes of steps 40 and 44. At step 40 weight values are assigned
as will be described with reference to FIG. 9. At step 44 the
weight values are used to identify and reduce redundant key data to
produce a data structure representative of a compacted directed
graph 46, as will be described with reference to FIGS. 10 to
16.
[0062] At step 48 the optimised and compacted directed graph 46 is
stored as a final database 50 in a data storage format that will be
described with reference to FIG. 17. When the system needs to
know the category (data tag) associated with a key, the key data is
read by the system and the database 50 is searched at step 54 to
rapidly retrieve the required data tag 56.
[0063] FIG. 3 illustrates a data set to be used as an example for
describing the processes that make up an embodiment of the
invention. The data set of FIG. 3 comprises a set of keys
"BABYLON", "BARITONE" etc., to each of which is assigned a data tag
0, 1, 2, or 3 according to which of the four categories: music,
property, city or material entity, the key has been assigned. It
will be appreciated that the data set of FIG. 3 is shown here only
for the purpose of describing the embodiment of the invention, and
is very small compared with most databases in use on computer
systems.
[0064] FIG. 4 shows the process for generating a tree-structured
directed graph, and will be described with reference to FIGS. 5 and
6 to describe generation of a tree-structured directed graph for
the data set of FIG. 3. A directed graph is a way of visualising,
in two dimensions, an arrangement of data. Trees in the context of
data structures, graphs and directed graphs are all known terms in
the art (see for example, the NIST dictionary referred to above).
The data itself remains as a binary encoded bit stream stored
electronically by the computer system. The data in a directed graph
structure is represented by arcs, each arc representing a character
(e.g. a letter or numeral). It is contemplated that a given
arc could represent more than one alphanumeric character of
the data item. The arcs interconnect nodes. A node does not
represent any of the source data, but represents a point or
junction between one character and one or more further characters.
In FIGS. 5 and 6 nodes are represented as circles and arcs are
represented as lines having arrowheads pointing towards the node to
which the arc leads. The root node is represented by a larger
circle having a smaller circle inside it, and terminal nodes are
represented by bold circles. The structure of the directed graph
will become more apparent as the process of generation is
described.
[0065] In the beginning the graph is blank and has only a root node
with no arcs assigned. All the keys are now incorporated
individually into the graph character by character, whereby their
characters are stored along the arcs, and all arcs of a node are
sorted in ascending order according to their key-character
information. Sorting the arcs lends itself to fast search
operations within a node. If a new arc is created, and not merely
traversed, the data tag (or a reference to it) for the current key
must also be filed along this arc so embodying the first aspect of
the present invention. Each node to which the last arc of a key
opens has to be marked as a terminal node and must be equipped with
the current key's data tag. Consequently, following completion of
the process there is a deterministic finite state machine
available, which is the basis of the further steps.
[0066] The process of building a graph from a set of data items is
started at step 60. At step 62 a key and associated data tag are
read from the source data set 64. At step 66 an indexing counter is
set to 0. Thus far no data has been processed and the directed
graph consists only of a single root node and no arcs, as shown by
the "initial state" of FIG. 5a. At step 68 the directed graph
generator is positioned on the root node. At step 70 the process
reads the next character of the key, key[i]. The first time through
the process this is the first character of the key, key[0], as
defined by the indexing counter. FIGS. 5b to 5e show the example
where the first key read is METALLOPHON. Thus the first character
is the letter "M", and this is called the arc name of the next
(first) arc. At step 72 the process interrogates the data structure
as to whether the character "M" already exists as an arc. As no
arcs have yet been generated, the answer must clearly be No, and
the process proceeds to step 74, where the arc is generated. At
step 76 the associated data tag is also added to the arc. In the
example, "metallophon" has been assigned the category 0, "music".
At step 78 the arc is traversed to position the generator on the
next node, i.e. the node at the end of the arc. The directed graph
is now at state 2 as shown in FIG. 5b.
[0067] At step 80, the indexing counter is incremented by 1. At step
82 the process interrogates the data to ask if the end of the key
has been reached. The answer in the example case is No, and the
process returns to step 70 to commence generation of the next arc,
which this time is given the arc name key[1], the letter "E". Again
at step 76, the data tag is added to the arc, and the directed
graph is then at state 3 as shown in FIG. 5c. The process repeats
for each letter of the key until eventually all the letters of
"METALLOPHON" have been assigned to arcs. This time, at step 82 the
answer is Yes and the process proceeds to step 84 where a flag data
bit is added to the data to indicate that the node at the end of
the last arc "N" is a terminal node. The directed graph is then at
state 4, as shown in FIG. 5d.
[0068] At step 85 the process ensures that the data tag associated
with the last arc of the key is that associated with the key. In
most cases the data tag will have been associated with the arc name
at step 76, however it is possible that the key may be made up
entirely of characters already contained in the database and that
step 76 will have been by-passed for every character of the key. In
such circumstances it is necessary to associate the correct data
tag with the last arc in the key. An example of this can be seen in
FIG. 6, which shows the directed graph for the data set of FIG. 3.
The key POLY has all its characters the same as the first four
characters of the key POLYMORPH, but has a data tag of 1 whereas
POLYMORPH has a data tag of 0. Therefore if POLY is entered into
the database after POLYMORPH, all the arcs will already exist and
have associated data tags of 0. Therefore the arc representing the
last character "Y" of POLY must have the correct data tag 1
associated with it by overwriting the previous data tag. Note that
the arc "Y" leads to a terminal node, but the terminal node is not
a sink.
[0069] At step 86 the process interrogates the data to see if the
end of the data set has been reached. If the answer is Yes, the
process is ended. However, in the illustrative example the answer
is No, so the process returns to step 62 to read the next key and
associated data tag. The next key is "MONOPHON". Here, when the
process reaches step 72 for the first time and asks whether the arc
name "M" exists for the current node (in this case the root node),
the answer is Yes because the arc with arc name "M" was generated
for the key "METALLOPHON". The process therefore steps ahead to
step 78, without generating an arc. The next time around, at step
72, the process asks the same question of the arc name "O", but
here the answer is No, and so a new arc must be generated.
Thereafter, for MONOPHON all arcs will be new arcs because there
will be no existing arcs connected to the nodes. State 5, as shown
in FIG. 5e has then been reached.
[0070] Once the process has been undertaken for all of the keys of
the data set, the data will represent the directed graph of FIG. 6.
Note that the directed graph is termed "tree-structured", because
each key is represented by a pathway of arcs commencing at the root
node and terminating at a terminal node. Each arc may only be
traversed once and (at this stage) each node is only arrived at via
one arc, but may have more than one arc departing from it.
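By way of illustration only, the build process of FIG. 4 may be
sketched in Python as follows. The class names Node and Arc, the
functions insert_key and build, and the use of a dictionary to hold
the arcs of a node (rather than a list kept in ascending order) are
assumptions made for this sketch and are not taken from the described
embodiment. The data tag given for MONOPHON is likewise assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Arc:
    tag: Optional[int]   # data tag filed along the arc
    target: "Node"       # node at the end of the arc


@dataclass
class Node:
    arcs: Dict[str, Arc] = field(default_factory=dict)  # arc name -> arc
    terminal: bool = False                              # end of a key?


def insert_key(root: Node, key: str, tag: int) -> None:
    """Add one key character by character (steps 68 to 85 of FIG. 4)."""
    node = root
    arc = None
    for ch in key:
        arc = node.arcs.get(ch)
        if arc is None:              # step 72: no arc with this arc name yet
            arc = Arc(tag, Node())   # steps 74 and 76: new arc with data tag
            node.arcs[ch] = arc
        node = arc.target            # step 78: traverse the arc
    if arc is not None:
        node.terminal = True         # step 84: flag the terminal node
        arc.tag = tag                # step 85: last arc carries the key's tag


def build(items) -> Node:
    """Start from a bare root node and add every (key, data tag) pair."""
    root = Node()
    for key, tag in items:
        insert_key(root, key, tag)
    return root


# POLY is entered after POLYMORPH, so only step 85 gives "Y" the data tag 1.
root = build([("METALLOPHON", 0), ("MONOPHON", 0),
              ("POLYMORPH", 0), ("POLY", 1)])
y_arc = root.arcs["P"].target.arcs["O"].target.arcs["L"].target.arcs["Y"]
print(y_arc.tag, y_arc.target.terminal)  # 1 True
```

In this sketch the arc "Y" ends at a terminal node that is not a sink,
because the arc "M" of POLYMORPH still departs from it.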
[0071] The data structure represented by FIG. 6 is well suited for
searching. Starting at the root node a searching algorithm only
needs to look for an arc with an arc name the same as the first
character of the key being searched, and then to follow the path of
arcs with arc names equivalent to the characters of the key, to
identify the existence of the key in the database when the terminal
node is reached. Reaching any node that has no arc whose arc name
matches the next character of the key identifies the
absence of the key from the database. Furthermore, if the
algorithm reads the data tags of the arcs as it traverses the
pathway, disregarding the previously read data tag each time it
reads a new data tag, then when it reaches a terminal node, the
last data tag to be read will be the one associated with the key
and will correctly identify the category of the key.
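The search just described may be sketched in Python as follows. This
is a minimal illustration only: the names Node, insert and lookup, and
the representation of each arc as a (data tag, target node) pair, are
assumptions of the sketch. Keys are assumed to be non-empty.

```python
class Node:
    def __init__(self):
        self.arcs = {}         # arc name -> (data tag or None, target Node)
        self.terminal = False  # True if a key ends at this node


def insert(root, key, tag):
    """Simplified build step: new arcs, and the last arc, get the key's tag."""
    node = root
    for i, ch in enumerate(key):
        arc_tag, child = node.arcs.get(ch, (tag, None))
        if child is None:
            child = Node()
        if i == len(key) - 1:
            arc_tag = tag      # the last arc always carries the key's own tag
        node.arcs[ch] = (arc_tag, child)
        node = child
    node.terminal = True


def lookup(root, key):
    """Return the data tag of key, or None if the key is not stored."""
    node, tag = root, None
    for ch in key:
        entry = node.arcs.get(ch)
        if entry is None:
            return None        # next character absent: stop searching
        arc_tag, node = entry
        if arc_tag is not None:
            tag = arc_tag      # a newly read tag replaces the previous one
    return tag if node.terminal else None


root = Node()
for key, tag in [("POLYMORPH", 0), ("POLY", 1), ("BABYLON", 2)]:
    insert(root, key, tag)

print(lookup(root, "POLY"))       # 1
print(lookup(root, "POLYMORPH"))  # 0
print(lookup(root, "BABEL"))      # None
```

Because arcs without a data tag are simply skipped, the same lookup
would continue to work on a graph from which repeated data tags have
been removed.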
[0072] Nevertheless, the data structure of FIG. 6 is far from
optimised. Data tags are stored with every arc, but this entails
storing a great many more data tags than necessary to identify the
tag associated with a key. The process shown in FIG. 7 removes
superfluous data tags. The process is recursive, which is to say
that it involves passing through the steps of a procedure that
includes all the steps of the procedure itself as one of the steps.
In other words it involves calling a subroutine, which calls
itself.
[0073] The process illustrated by the flow chart of FIG. 7 is
started at step 100, and at step 102 calls the data tag
optimisation subroutine "data_tag_opt", which operates on the
parameters "current_node" and "data_tag". The directed graph data
structure is optimised by analysing the structure node by node,
recursively, along each branch of the tree. The procedure keeps
track of which node in the structure it is analysing by reference
to a node label called p_node. The subroutine starts at step 104.
At step 106 the node being analysed is labelled p_node and this
becomes the current node. At step 108, the process interrogates the
data as to whether the current node has arcs. If the answer is Yes,
then at step 110 the number "n" of arcs branching from the node is
read and, at step 112, a counter "i" is initialised to 0. At step
114, the data tag stored with the next arc, arc[i] is read (when
i=0, arc[0] is the first arc at the node). At step 116 the data
tag is compared with the previous data tag. If it is the same, then
at step 118 the data tag is removed. If not, then the data tag is
not removed and the routine moves directly to step 120 where it
moves on to the next node (i.e. the node at the end of arc[i]). At
step 122 the subroutine calls itself, i.e. it calls "data_tag_opt",
to perform the analysis for the next node. This can be considered
as performing the analysis at the next level down the tree.
[0074] If at step 108 the answer is No, the node must be a sink,
and the subroutine returns (i.e. goes back up a level to the
previous node) via step 128.
[0075] When the subroutine has been returned back up a level it
continues to step 124 where the counter "i" is incremented by 1 and
at step 126, if the counter has not reached "n", the number of arcs
at the node, the data tag on the next arc is read by looping back
to step 114. Once all the arcs at a node have been analysed (i.e.
i=n) the subroutine moves to step 128 where it is returned back up
to the node at the level above. Eventually, when the entire
database has been analysed, the subroutine will be returned back to
step 102 and the process is ended at step 130.
[0076] Referring back to FIG. 6, if the process is started at the
root node and the first arc to be analysed is "B", then as there is
no previous data tag the arc "B" retains the data tag (2) and the
routine moves down a level to the next node (the node between "B"
and "A"). The arc "A" is the next to be analysed and because this
also has the data tag (2), which is the same as the previous arc,
it is removed. The routine moves down a level to the next node.
Here there are two arcs branching from the node, "B" and "R". The
routine considers first the arc "B" (it could equally consider the arc "R"
first; it would make no difference to the outcome). The routine moves on
down the levels through the arcs "B", "Y", "L", "O", and "N",
removing the data tags (2) from all of these arcs as they are the
same as the first (2) on the first arc "B". When the routine
reaches the sink (the last node) it is returned back up the levels
until it reaches a node where there are further, as yet unanalysed,
arcs branching from it, in this case the node with the arc "R". The
procedure continues for all the arcs of the directed graph, finally
producing the directed graph of FIG. 8, which has been optimised to
contain a minimal number of data tags, thereby reducing redundancy
of data tag information in the database.
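The recursive removal of repeated data tags may be sketched in Python
as follows. The sketch is illustrative only: the function name
data_tag_opt is taken from FIG. 7, but the classes Node and Arc, the
helper functions insert and tags_along, and the data tag 0 used for
BARITONE are assumptions made here. Keys are assumed to be non-empty.

```python
class Node:
    def __init__(self):
        self.arcs = {}         # arc name -> Arc
        self.terminal = False


class Arc:
    def __init__(self, tag, target):
        self.tag = tag         # data tag, or None once it has been removed
        self.target = target   # node at the end of the arc


def insert(root, key, tag):
    """Build step as in FIG. 4: every new arc receives the key's data tag."""
    node, arc = root, None
    for ch in key:
        arc = node.arcs.setdefault(ch, Arc(tag, Node()))
        node = arc.target
    arc.tag = tag              # the last arc carries the key's own tag
    node.terminal = True


def data_tag_opt(node, previous_tag=None):
    """Remove every data tag equal to the tag last seen on the path above."""
    for name in sorted(node.arcs):           # arcs in ascending order
        arc = node.arcs[name]
        if arc.tag == previous_tag:          # step 116: same as previous tag?
            arc.tag = None                   # step 118: remove the tag
        inherited = previous_tag if arc.tag is None else arc.tag
        data_tag_opt(arc.target, inherited)  # step 122: one level down


def tags_along(root, key):
    """List the data tags met while following key from the root."""
    node, tags = root, []
    for ch in key:
        arc = node.arcs[ch]
        tags.append(arc.tag)
        node = arc.target
    return tags


root = Node()
insert(root, "BABYLON", 2)
insert(root, "BARITONE", 0)   # data tag assumed for illustration
data_tag_opt(root)
print(tags_along(root, "BABYLON"))   # [2, None, None, None, None, None, None]
print(tags_along(root, "BARITONE"))  # [2, None, 0, None, None, None, None, None]
```

Only the first arc "B" and the arc "R", where the data tag changes,
keep a data tag in this sketch.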
[0077] The optimised database described above can be further
reduced in size in accordance with an embodiment of the second
aspect of the present invention. To achieve efficient storage of
all keys it is desirable to rid the graph of redundancy. The
nature of a directed graph requires that the path starting at the
root node is the same for all keys that have an equal sequence of
characters up to the point of a difference in one single character.
Although keys might have equal character sequences in subsequent
parts of the string, the path is held separately. Therefore, the
database can be compacted by finding paths in the tree that have
the same sequence of characters and data--i.e. paths that are
equal--and reusing one single path rather than storing the path
multiple times. Paths can be considered as equal only if the
sequence of arcs is identical and the data tags stored along the
arcs are identical.
[0078] The method of creating the database embodying the second
aspect of the invention will be described with reference to FIGS. 9
to 16. FIG. 9 shows the directed graph of FIG. 8 for the example
data set of FIG. 3. In FIG. 9 the nodes have been assigned weight
values (shown as numbers in the node circles). In this example each
character has been assigned a character value, which in this case
is the character's ASCII value. It will be appreciated that any
consistent set of values could be used, which uniquely identifies
every possible character found in the keys. The weight value of a
node may be a checksum which is the sum of the character values of
all the characters in the sub-tree below the node (i.e. between the
node and all sinks that can be reached from the node). Put another
way, the checksum is the sum of the character values of all the
arcs branching from the node plus the weight values of the nodes at
the ends of those arcs (sinks have zero weight value).
[0079] FIG. 19a shows a simple example of assigning checksums which
does not form a part of the example database, but uses the same
method. For the example presented a very simple checksum algorithm
can be used: the checksum of a particular node is the sum of all
character ASCII values of the node's arcs plus the checksum of all
connected nodes.
[0080] Example Calculation:
[0081] A=65, B=66, C=67, D=68, E=69
[0082] Node 3=D=68
[0083] Node 4=E=69
[0084] Node 2=Node 3+Node 4+B+C=68+69+66+67=270
[0085] Node 1=Node 2+65=270+65=335
[0086] This algorithm is sufficient for the sample as it provides a
reasonably distinctive value for a sub-tree and also reflects the
level of the node--the higher the value, the larger the sub-tree.
However, for larger trees it is recommended to use a more complex
calculation to reduce the number of equal checksums and to take
counter overflows into consideration.
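The example calculation may be reproduced with the following Python
sketch. FIG. 19a is not reproduced here, so the layout used (arc A
from node 1 to node 2, arcs B and C from node 2 to nodes 3 and 4, and
arcs D and E from nodes 3 and 4 to sinks) is inferred from the
calculation and is an assumption, as are the function name checksum
and the nested-dictionary representation.

```python
def checksum(arcs):
    """Sum the character values of a node's arcs plus its children's checksums.

    arcs maps each arc character to the child node, itself a dictionary
    of the same form; an empty dictionary stands for a sink (value 0).
    """
    return sum(ord(ch) + checksum(child) for ch, child in arcs.items())


node3 = {"D": {}}
node4 = {"E": {}}
node2 = {"B": node3, "C": node4}
node1 = {"A": node2}

print(checksum(node3), checksum(node4))  # 68 69
print(checksum(node2))                   # 270
print(checksum(node1))                   # 335
```

Python integers do not overflow; an implementation using a fixed-width
counter would have to handle overflow separately, for example by
reducing the sum modulo a chosen value.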
[0087] Other methods of assigning checksums may be used. CRC and
MD5 are two examples of known methods.
[0088] An example for calculating a compound checksum value is
described with reference to FIG. 19b. The checksum is the
concatenation of (1) the length of the longest path of the
sub-tree, (2) the sum of the character values and (3) the sum of
the data tag values. The format is a 9-digit number, padded with
leading zeros in the form lllcccddd, where lll is the level, ccc is
the character sum and ddd is the data sum. The checksum values for
each of the nodes of FIG. 19b are summarised in the table
below.
Node | Characters | Character sum | Level | Data | Data sum | Checksum
6 | T | 84 | 1 | none | 0 | 001084000
5 | T | 84 | 1 | none | 0 | 001084000
4 | S, R, T, T | 83 + 82 + 84 + 84 = 333 | 2 | 9 | 9 | 002333009
3 | E | 69 | 1 | none | 0 | 001069000
2 | R, E, O, S, T, R, T | 82 + 69 + 79 + 83 + 84 + 82 + 84 = 563 | 3 (the longest path) | 5, 9 | 5 + 9 = 14 | 003563014
1 | P, R, E, O, S, T, R, T | 80 + 82 + 69 + 79 + 83 + 84 + 82 + 84 = 643 | 4 (the longest path) | 3, 5, 9 | 3 + 5 + 9 = 17 | 004643017
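The compound checksum may be sketched in Python as follows. The
checksum values in the table determine the characters and the data
sums of each sub-tree, but not which individual arc carries each data
tag; the placement used below (3 on "P", 5 on the arc "R" leaving
node 2, 9 on "S") is therefore an assumption, as are the function
names summarise and compound_checksum.

```python
def summarise(node):
    """Return (level, character sum, data sum) for the sub-tree below node.

    node maps each arc character to a (data tag or None, child node) pair;
    an empty dictionary stands for a sink.
    """
    level = char_sum = data_sum = 0
    for ch, (tag, child) in node.items():
        c_level, c_chars, c_data = summarise(child)
        level = max(level, 1 + c_level)   # length of the longest path
        char_sum += ord(ch) + c_chars
        data_sum += (tag or 0) + c_data
    return level, char_sum, data_sum


def compound_checksum(node):
    """Format the three parts as lllcccddd, padded with leading zeros."""
    level, chars, data = summarise(node)
    # Three digits per part suffice for this small example only.
    return f"{level:03d}{chars:03d}{data:03d}"


node5 = {"T": (None, {})}
node6 = {"T": (None, {})}
node4 = {"S": (9, node5), "R": (None, node6)}
node3 = {"E": (None, {})}
node2 = {"R": (5, node3), "O": (None, node4)}
node1 = {"P": (3, node2)}

for name, node in [("Node 4", node4), ("Node 2", node2), ("Node 1", node1)]:
    print(name, compound_checksum(node))
# Node 4 002333009
# Node 2 003563014
# Node 1 004643017
```

Placing the level first means that sorting the checksums in descending
order lists the deepest sub-trees first.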
[0089] The purpose of assigning checksums to the nodes is to
perform the compaction method outlined in FIG. 10. Checksums
represent a hash of a data set. A hash value is not necessarily
unique to a data set: several different sets of data can have the
same value. Computing time is, however,
saved by comparing only sub-trees with equal checksums. Equal
checksums indicate that sub-trees have a high probability of being
identical. For fast and easy processing the checksums are first
collected into a list, which is then sorted by descending value. As
already indicated the checksum should represent the level
information. The list will, therefore, show the largest sub-trees
first. Each record in the list should additionally store
reference information for the corresponding node as a means of
finding the node again later in the process. The reference, for
example, may be a pointer to the memory location, or anything else
appropriate. Best optimisation can be achieved by reducing large
sub-trees prior to small sub-trees. Special care should be taken on
implementation to ensure that, when reducing sub-trees, references
stored with nodes do not become invalid.
[0090] Starting at step 200, the method reads in the database and
at step 202 compiles a list 204 of all the nodes (identified by
node references) and their associated checksums. At step 206 the
list is sorted into a descending order of checksum values. At step
208 a variable called "last_cs" is set to 0. At step 210 the next
checksum on the list is read and its value assigned to the variable
"current_cs". At step 212 the values of "current_cs" and "last_cs"
are compared. If they are not equal, the sub-trees below the nodes
must be different and the method steps forward via step 213 where
the parameter last_cs is set equal to current_cs (i.e. the checksum
value of the current node) and on to step 224. However, if they are
equal there is a possibility that the two sub-trees are identical.
As will be described in an example later, it is not possible to be
certain that they are identical and so it is necessary to perform a
comparison of the sub-trees. At step 216 the node references,
noderef1 and noderef2, of the nodes having equal checksums are read
and at step 218 the comparison of the sub-trees is performed, as
will be described below with reference to FIG. 11. If the
comparison returns a FALSE flag at step 220, indicating that the
sub-trees are not identical, the method steps forward to
step 224. If the comparison returns a TRUE flag at step 220,
indicating that the sub-trees are identical, then at step 222
the arc leading into the node of noderef2 is redirected to the node
of noderef1 so that the sub-tree below the node of noderef2 can be
removed from the database.
[0091] At step 224 the method determines if there are any more
nodes on the list. If there are, the method loops back to step 210;
if not, the method ends at step 226.
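The loop of FIG. 10 might be sketched in Python as follows. This is a simplified illustration, not a definitive implementation: the function name `compact` and the two callbacks `same_subtree` and `redirect` are hypothetical, the node references are stand-in tuples, and the sketch assumes that noderef1 stays fixed on the first node of each run of equal checksums. It does not skip records whose nodes have already been cut off.

```python
def compact(records, same_subtree, redirect):
    """Walk (checksum, node_ref) records sorted by descending checksum.

    same_subtree(ref1, ref2) -> bool and redirect(duplicate, keeper) are
    caller-supplied callbacks (hypothetical names).
    """
    last_cs, noderef1 = 0, None                      # step 208
    for current_cs, noderef2 in records:             # steps 210 and 224
        if current_cs != last_cs:                    # step 212
            last_cs, noderef1 = current_cs, noderef2  # step 213
            continue
        if same_subtree(noderef1, noderef2):         # steps 216 to 220
            redirect(noderef2, noderef1)             # step 222


# Stand-in data: each "node reference" is a label plus the text of its sub-tree.
records = [(464, ("B", "LOPHON")), (464, ("C", "LOPHON")),
           (234, ("D", "NTH")), (234, ("E", "RPH"))]
redirected = []
compact(records,
        same_subtree=lambda a, b: a[1] == b[1],
        redirect=lambda dup, keep: redirected.append((dup[0], keep[0])))
print(redirected)  # [('C', 'B')]
```

Only the pair with equal checksums and equal content is redirected; the pair with checksum 234 is compared but left alone.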
[0092] Referring to FIG. 11, the method for comparing the sub-trees
is performed recursively. The subroutine "compare_tree" is started
at step 300 to compare the sub-trees of two nodes identified at
step 212 of FIG. 10 as having identical checksums and called here
node1 and node2. At step 302 a comparison is made of the number of
arcs branching from each of the nodes. If these are not equal, the
sub-trees cannot be identical, and so the subroutine returns
with a FALSE flag at step 318. If the number of arcs is equal, then
the subroutine continues at step 304 to set a variable "n" to equal
the number of arcs and at step 305 initialises a counter "i" to 0.
At steps 306 and 308 the subroutine reads the arc names (i.e. the
characters) on the first arc of each node. The characters are read
in the order of ascending character value (the values used to
determine the node checksums). At step 310 a comparison of the arc
names is made. If they are not the same, then the subroutine
immediately returns with a FALSE flag at step 318.
[0093] Even if the arc names are the same, it is important that
they are only considered identical if they carry the same data
tags. Therefore at steps 312 and 314 the data tags of the arcs
being compared are read. At step 316 the data tags are compared and
if they are not the same the subroutine immediately returns with a
FALSE flag at step 318. If they are the same then the subroutine
moves on to compare the next nodes of the two sub-trees (next_node1
and next_node2) at steps 320 and 322. At step 324 the subroutine
calls itself to compare the next nodes and to continue down the
levels of the sub-tree in a recursive manner. If at any stage the
subroutine identifies a disparity between the two sub-trees, it
returns immediately via steps 326 and 318 with a FALSE flag. If
at step 326 the subroutine has returned recursively without a FALSE
flag, it moves to step 328 where the counter "i" is incremented by 1. If
at step 330 it is determined that the entire sub-tree has been
compared without a FALSE flag (i.e. i=n), then the subroutine
returns with a TRUE flag.
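A recursive comparison along the lines of FIG. 11 might be sketched in Python as follows. The `Arc` and `Node` classes and the `chain` helper are assumptions made for this sketch; in particular, an arc whose `child` is `None` is taken here to lead to a terminal node.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Arc:
    tag: Optional[int] = None        # data tag carried by the arc, if any
    child: Optional["Node"] = None   # None marks an arc into a terminal node


@dataclass
class Node:
    arcs: Dict[str, Arc] = field(default_factory=dict)


def compare_tree(node1, node2):
    if len(node1.arcs) != len(node2.arcs):              # step 302
        return False
    # Arc names are visited in ascending character order on both sides.
    for name1, name2 in zip(sorted(node1.arcs), sorted(node2.arcs)):
        if name1 != name2:                              # step 310
            return False
        arc1, arc2 = node1.arcs[name1], node2.arcs[name2]
        if arc1.tag != arc2.tag:                        # step 316
            return False
        if (arc1.child is None) != (arc2.child is None):
            return False
        if arc1.child is not None and not compare_tree(arc1.child, arc2.child):
            return False                                # steps 324 and 326
    return True                                         # step 330, i = n


def chain(text, tag=None):
    """Build a single-path sub-tree; the optional tag goes on the last arc."""
    node = None
    for i, ch in enumerate(reversed(text)):
        node = Node({ch: Arc(tag if i == 0 else None, node)})
    return node


print(compare_tree(chain("LOPHON"), chain("LOPHON")))  # True
print(compare_tree(chain("NTH"), chain("RPH")))        # False
print(compare_tree(chain("ON", tag=1), chain("ON")))   # False
```

The third call shows that equal characters alone are not sufficient: a differing data tag also makes the sub-trees unequal.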
[0094] The compaction method described with reference to FIGS. 10
and 11 can be applied to the example database shown in FIG. 9. To
simplify the task of finding trees that are potentially equal, the
checksum information from the tree is extracted into a sequential
list of "checksum, pointer". The pointer is a reference to the
particular node, and provides a means of finding it again.
6296, Node0; 1303, ...; 0309, ...; 0864, ...; 0861, ...; 0226, ...;
0918, Node1; 0689, ...; 0229, ...; 0788, ...; 0779, ...; 0147, ...;
0853, ...; 0605, ...; 0157, ...; 0699, ...; 0710, ...; 0069, ...;
0322, ...; 0540, ...; 0078, ...; 0313, ...; 0631, ...; 0553, ...;
0233, ...; 0464, ...; 0629, ...; 0234, ...; 0229, ...; 0464, ...;
0157, ...; 0388, ...; 0234, ...; 0152, ...; 0157, ...; 0388, ...;
0078, ...; 0309, ...; 0156, ...; 0072, ...; 0078, ...; 0309, ...;
0383, ...; 0229, ...; 0072, ...; 0229, ...; 0238, ...; 0229, ...;
0310, ...; 0157, ...; 0233, ...; 0157, ...; 0149, ...; 0157, ...;
0226, ...; 0078, ...; 0157, ...; 0078, ...; 0069, ...; 0078, ...;
0147, ...; 0466, ...; 0078, ...; 1014, ...; 0380, ...; 0069, ...;
0388, ...; 0943, ...; 0930, ...; 0308, ...
[0095] This list is then sorted into descending checksum value
order:
6296, Node0; 1303, ...; 1014, ...; 0943, ...; 0930, ...; 0918, Node1;
0864, ...; 0861, ...; 0853, ...; 0788, ...; 0779, ...; 0710, ...;
0699, ...; 0689, ...; 0631, ...; 0629, ...; 0605, ...; 0553, ...;
0540, ...; 0466, ...; 0464, ...; 0464, ...; 0388, ...; 0388, ...;
0388, ...; 0383, ...; 0380, ...; 0322, ...; 0313, ...; 0310, ...;
0309, ...; 0309, ...; 0309, ...; 0308, ...; 0238, ...; 0234, ...;
0234, ...; 0233, ...; 0233, ...; 0229, ...; 0229, ...; 0229, ...;
0229, ...; 0229, ...; 0226, ...; 0226, ...; 0157, ...; 0157, ...;
0157, ...; 0157, ...; 0157, ...; 0157, ...; 0157, ...; 0156, ...;
0152, ...; 0149, ...; 0147, ...; 0147, ...; 0078, ...; 0078, ...;
0078, ...; 0078, ...; 0078, ...; 0078, ...; 0078, ...; 0072, ...;
0072, ...; 0069, ...; 0069, ...; 0069, ...
[0096] The first value found that is equal for two nodes is 464.
Comparing the underlying trees shows that they are equal in
character sequence as well as in data tags (there are no data tags in
this case). Consequently, the second tree can be cut off by
reassigning the arc named "Y" of node A to point to node B. The
storage resources used by the tree starting at node C can now be
freed, because that tree is no longer connected.
[0097] 388 is the next value to look at. Again one tree can be
reduced. Although 388 occurs in the list three times, the third
occurrence had already been cut off in the previous step and can
therefore be ignored.
[0098] There are 3 occurrences of 309. However, after the above
compaction only one is left and so no further action is necessary.
The next value is 234. The two sub-trees have an equal checksum. On
comparing the tree, it can be seen that they differ in character
sequence. No reduction is therefore possible here.
[0099] FIG. 12 illustrates this example. The keys METALLOPHON and
XYLOPHON have both been categorised as music (category 0) and both
end with the sequence of characters LOPHON. The nodes labelled B
and C in FIG. 12 both have the checksum value 464. Comparison of
the sub-trees determines that both contain identical characters and
data tags, so the arc having arc name "Y" that connects the nodes
labelled A and C is redirected to connect node A to node B. All the
arcs that comprise the sub-tree below node C are then removed from
the database.
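The redirection of the arc "Y" can be illustrated with a minimal Python sketch. It assumes a simplified representation in which each node is a dictionary mapping an arc character to the next node and an empty dictionary stands for a terminal node; data tags are omitted. The helper names `insert`, `walk` and `count_nodes` are hypothetical.

```python
def insert(root, key):
    """Add a key to a dictionary-based tree, one arc per character."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})


def walk(root, text):
    """Follow the arcs named by text; return the node reached, or None."""
    node = root
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def count_nodes(root):
    """Count the distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.values())
    return len(seen)


root = {}
for key in ("METALLOPHON", "XYLOPHON"):
    insert(root, key)

node_a, node_c = walk(root, "X"), walk(root, "XY")
node_b = walk(root, "METAL")
before = count_nodes(root)
if node_b == node_c:          # the two LOPHON sub-trees hold the same characters
    node_a["Y"] = node_b      # redirect the arc "Y" from node C to node B
print(before, count_nodes(root))        # 20 13
print(walk(root, "XYLOPHON") == {})     # True: the key is still present
```

Node C and the six nodes below it are no longer reachable after the redirection, which accounts for the drop from 20 nodes to 13.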
[0100] FIG. 13 shows similar compaction of the example database for
other nodes having equal checksum values. The sub-trees shown in
boxes with a shaded background are those that are being removed
from the database.
[0101] FIG. 14a presents an example of two nodes having equal
checksum values, but which are not identical. The character values
of both the sub-trees "NTH" and "RPH" produce checksums totalling
234 (see the nodes in the keys "NINTH" and "POLYMORPH" in FIG. 13).
However, comparison of the individual characters soon indicates
that they are not identical and causes the comparison subroutine of
FIG. 11 to return a FALSE flag.
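The arithmetic behind this collision can be checked with a few lines of Python. The sketch assumes that, for sub-trees without data tags, the checksum is simply the sum of the ASCII values of the characters; the function name `char_sum` is chosen for illustration.

```python
def char_sum(text):
    """Sum of the ASCII values of the characters in text."""
    return sum(ord(c) for c in text)


print(char_sum("NTH"))  # 78 + 84 + 72 = 234
print(char_sum("RPH"))  # 82 + 80 + 72 = 234
print("NTH" == "RPH")   # False: the very first characters, N and R, differ
```

Equal sums therefore only select candidates for comparison; the character-by-character comparison decides whether the sub-trees are actually identical.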
[0102] The more information that can be provided in the form of a
weight value for each node, the more efficient the process of
identifying equivalent sub-trees.
[0103] It might appear that further compaction of the data set is
possible by combining groups of identical characters or character
strings that occur in keys. FIG. 14b illustrates an example of two
keys BARITONE and MARITAL. Both contain the same string of
characters "ARIT". However the subroutine would not identify equal
checksums and so compaction of the database to produce the sub-tree
illustrated in FIG. 14b would not occur. This is important because
compaction in this way would give rise to the possibility of keys
not in the original data set being present in the final compacted
database. In the example, the keys "MARITONE" and "BARITAL" are
present in the compacted tree, even though they were not part of
the original data set.
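The effect of merging the interior string "ARIT" can be demonstrated with a short Python sketch. It assumes a simplified representation in which each node is a dictionary mapping an arc character to the next node and an empty dictionary stands for a terminal node; the function name `keys` and the variable names are hypothetical.

```python
def keys(node, prefix=""):
    """Yield every key spelled by a path from this node to a terminal (empty) node."""
    if not node:
        yield prefix
        return
    for character in sorted(node):
        yield from keys(node[character], prefix + character)


tail = {"O": {"N": {"E": {}}}, "A": {"L": {}}}  # the endings ONE and AL
shared = {"A": {"R": {"I": {"T": tail}}}}       # the common string ARIT
merged = {"B": shared, "M": shared}             # both first arcs lead to the shared node
print(list(keys(merged)))
# ['BARITAL', 'BARITONE', 'MARITAL', 'MARITONE']
```

Two keys went in and four come out, which is why only complete sub-trees, identical down to their terminal nodes, are combined.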
[0104] FIG. 15 illustrates further examples of compacting of the
example data set at lower levels (i.e. at nodes having lower
checksum values). Again, the sub-trees shown in boxes with a shaded
background are those that are being removed from the database. It
should be noted that the most efficient method of compacting the
database is to compare the highest checksum values
first, so as to remove the largest equivalent sub-trees from the
database, and then to proceed by comparing progressively
smaller sub-trees having equal checksum values.
[0105] FIG. 16 illustrates the example database in its final
compacted form, with all the redundant arcs removed, and as such
represents an embodiment of both of the first and second aspects of
the present invention. All the original keys from the data set of
FIG. 3 are present together with their associated data tags. Some
of the nodes in FIG. 16 have numbers appearing in the circles that
represent the nodes. These are not checksum values, but are node
labels which will be used to describe the format in which the data
is stored with reference to FIG. 17.
[0106] Having optimised and compacted the database, the data itself
must be stored. As previously described, the data may be stored
electronically in the format of a one dimensional binary encoded
bit stream. A node is stored as its set of arcs. For the purpose of
fast searching, the arcs are stored in ascending order of their
character value.
[0107] FIG. 17 is a representation of a bit stream. The top line
400 in FIG. 17 comprises 56 bits which are used to store the data
associated with a single arc. The first 8 bits are the character
itself as represented by its ASCII value. The ninth bit is a data
tag flag. If this bit is a 1 it indicates that a data tag is also
stored with the arc, but if it is a 0 there is no data tag. The
tenth bit is a second flag, which indicates whether or not the
arc leads to a terminal node. The next 30 bits (bits 10 to 39)
contain pointer information in the form of an address to the
location in the database of the first arc of the next node. The
last 16 bits (bits 40 to 55) contain the data tag, if the data tag
flag indicates its presence. Otherwise these bits are not
present.
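The arc layout described above might be encoded and decoded as in the following Python sketch. It represents the bit stream as a string of `0` and `1` characters for readability and assumes that bit 0 is the first bit of the record and that each field is written most significant bit first; the function names `pack_arc` and `unpack_arc` and the pointer value in the example are assumptions.

```python
def pack_arc(char, pointer, terminal=False, tag=None):
    """Return one arc as a bit string: 40 bits without a data tag, 56 with one."""
    if not 0 <= ord(char) < 1 << 8:
        raise ValueError("character does not fit in 8 bits")
    if not 0 <= pointer < 1 << 30:
        raise ValueError("pointer does not fit in 30 bits")
    bits = f"{ord(char):08b}"               # bits 0 to 7: character value
    bits += "1" if tag is not None else "0"  # bit 8: data tag flag
    bits += "1" if terminal else "0"         # bit 9: terminal node flag
    bits += f"{pointer:030b}"                # bits 10 to 39: pointer
    if tag is not None:
        if not 0 <= tag < 1 << 16:
            raise ValueError("data tag does not fit in 16 bits")
        bits += f"{tag:016b}"                # bits 40 to 55: data tag
    return bits


def unpack_arc(bits):
    """Decode one arc record produced by pack_arc."""
    has_tag = bits[8] == "1"
    tag = int(bits[40:56], 2) if has_tag else None
    return chr(int(bits[0:8], 2)), int(bits[10:40], 2), bits[9] == "1", tag


record = pack_arc("B", pointer=5, tag=2)
print(len(record), unpack_arc(record))      # 56 ('B', 5, False, 2)
print(len(pack_arc("A", pointer=6)))        # 40
```

Because the data tag field is present only when the flag is set, untagged arcs occupy 16 fewer bits.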
[0108] The next two lines 402 illustrate the data for a set of
nodes corresponding to the nodes labelled 0 to 11 in FIG. 16. The
bits that comprise the arc data in line 400 are shown compressed
into the four fields, character, flags, pointer and data tag. Node
0 is the root node in FIG. 16 and has arcs with the characters B,
M, N, P, S, T and X. The arc representing the character `B` carries
a data tag flag (shown here as `Y` for `Yes`) indicating the
presence of a data tag, but no terminal flag (shown here as `N` for `No`),
so the arc does not lead to a terminal node; the pointer data for
this arc points to the first arc (`A`) of Node 1; and the arc
carries the data tag `2`. Similar data is contained in the fields
representing the other arcs of node 0, all of which carry data
tags, but none of which lead to a terminal node. Finally the
character `&` is used to represent the termination of data for
the node.
[0109] Node 1 has only a single arc representing the character `A`.
This arc carries no data tag and does not lead to a terminal node,
so the flags are both shown as N.
[0110] Similar data appears in the data stream for all the other
nodes. Note, however, that both Node 6 and Node 11 have arcs that
lead to terminal nodes and carry the flag `Y`.
[0111] FIG. 18 illustrates a procedure for rapidly searching the
database to find a key and return its associated data tag. At step
500 a query is read in the form of the key to be sought. At step
502 an indexing counter "i" is set to 0 and at step 504 the search
is started at the root node of the data structure. As yet no data
tags have been read and so at step 506 a parameter result_tag is
set to a null value. At step 508 the parameter arc_name is set to
the next character of the key, key[i]. At step 510 the procedure
determines whether the arc name exists for the current node. If the
arc_name does not exist the procedure steps directly to step 524 to
return a null value. This means that the key is not to be found in
the database.
[0112] If the arc_name of key[i] does exist, at step 512 the
procedure determines if a data tag is stored with the arc. If there
is a data tag, the parameter result_tag is set to the value of the
data tag at step 514. At step 516 the procedure moves on to the
next node and at step 518 the indexing counter "i" is incremented
by 1. At step 520 the procedure determines whether the counter "i"
is less than the length of the key (i.e. the number of characters
in the key), and if it is the procedure returns to step 508 to look
for the next character in the key. If the last character in the key
has been reached (i = key length), the procedure determines at step
522 whether the current node is a terminal node by reading the
terminal node flag associated with the arc (see the data stream
representation of FIG. 17). If the current node is not a terminal
node, the key is not in the database and the procedure moves
directly to step 524 to return a null data tag value. If the node
is a terminal node then the procedure moves to step 526 to return
the result_tag value, which is the data tag associated with the
key, and confirms the presence of the key in the database.
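The search procedure of FIG. 18 might be sketched in Python as follows. The sketch assumes an in-memory representation in which each node is a dictionary mapping an arc character to a tuple of data tag, terminal flag and next node; the function name `lookup` and the small example tree, with its keys and tag values, are hypothetical.

```python
def lookup(root, key):
    """Return the data tag associated with key, or None if key is absent."""
    node = root                       # step 504: start at the root node
    result_tag = None                 # step 506
    terminal = False
    for arc_name in key:              # steps 508, 518 and 520
        arc = node.get(arc_name)      # step 510: does the arc exist?
        if arc is None:
            return None               # step 524
        tag, terminal, node = arc
        if tag is not None:           # step 512
            result_tag = tag          # step 514
    # step 522: the last arc followed must lead to a terminal node
    return result_tag if terminal else None


# Each arc is (data tag, leads to terminal node, next node).
root = {"O": (None, False, {"N": (5, True, {}), "X": (7, True, {})})}
print(lookup(root, "ON"))  # 5
print(lookup(root, "OX"))  # 7
print(lookup(root, "O"))   # None: the arc does not lead to a terminal node
print(lookup(root, "OZ"))  # None: no arc named Z
```

As in the procedure described, a null result does not by itself distinguish a missing key from a key for which no data tag was found along the path.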
* * * * *
References