U.S. patent application number 10/875898 was filed with the patent office on 2004-06-24 and published on 2005-02-03 for symbol dictionary compiling method and symbol dictionary retrieving method.
Invention is credited to Kanno, Yuji.
Application Number | 20050027513 10/875898 |
Document ID | / |
Family ID | 18340104 |
Filed Date | 2004-06-24 |
United States Patent
Application |
20050027513 |
Kind Code |
A1 |
Kanno, Yuji |
February 3, 2005 |
Symbol dictionary compiling method and symbol dictionary retrieving
method
Abstract
Even when a character string is long and the symbols to be retrieved
contain characters or character chains of high frequency of
appearance, high speed retrieval is possible up to infix matching,
and a symbol dictionary of small capacity can be compiled. In the
symbol dictionary compiling method of the invention, each symbol in
symbol data is covered with shorter symbols called "meta-symbols",
and the information showing how each symbol is covered is obtained
by preparing meta-symbol appearance information recorded for each
meta-symbol. Therefore, high speed retrieval including up to infix
matching is possible, and a symbol dictionary of small capacity can
be compiled.
Inventors: |
Kanno, Yuji; (Kanagawa,
JP) |
Correspondence
Address: |
RATNERPRESTIA
P O BOX 980
VALLEY FORGE
PA
19482-0980
US
|
Family ID: |
18340104 |
Appl. No.: |
10/875898 |
Filed: |
June 24, 2004 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10875898 | Jun 24, 2004 | |
09451047 | Nov 30, 1999 | |
Current U.S.
Class: |
704/10 ;
707/E17.039 |
Current CPC
Class: |
Y10S 707/99936 20130101;
G06F 40/242 20200101; G06F 16/90344 20190101; Y10S 707/99933
20130101; Y10S 707/99942 20130101; Y10S 707/99932 20130101 |
Class at
Publication: |
704/010 |
International
Class: |
G06F 017/21 |
Foreign Application Data
Date | Code | Application Number |
Nov 30, 1998 | JP | 10-340765 |
Claims
What is claimed is:
1. A method of retrieving a complete match to an arbitrary
character string query using a symbol dictionary containing
meta-symbol information and meta-symbol appearance information,
comprising the steps of:
(a) retrieving the meta-symbol information in said symbol
dictionary,
(b) searching for a covering of the query character string Q by a
duplicate longest match word extraction method, that is, a set of
covering elements, each a triple (m, s, e) of a meta-symbol m
collating with a partial character string, a collating start
character position s, and a collating end character position e
($1 \le s < e \le |Q|+1$) in the character string to be covered,
the set containing every character of Q in at least one covering
element,
(c) storing the results of the search of step (b) in the working
area/storage/memory,
(d) terminating the retrieval if there is no covering, that is, if
there is no set of covering elements (m, s, e) as defined in step
(b) containing at least one covering element for each character of
Q, and
(e) retrieving the meta-symbol appearance information in said
symbol dictionary and, if there is only one symbol number contained
commonly in all elements in said covering result, issuing it as the
retrieval result and terminating the retrieval process, and if
there is no symbol number contained commonly in all elements in
said covering result, terminating the retrieval process as having
no retrieval result.
2. A method of retrieving by forward coincidence a response to an
arbitrary character string query using a symbol dictionary storing
meta-symbol information and meta-symbol appearance information,
that is, retrieving all symbols having said queried character
string in the beginning portion, comprising the steps of:
a first step of symbol dictionary retrieval in which
(a) a question character string covering means retrieves the
meta-symbol information in said symbol dictionary and searches for
a covering of the query character string Q by an overlapped longest
match word extraction method, that is, a set of covering elements,
each a triple (m, s, e) of a meta-symbol m collating with a partial
character string, a collating start character position s, and a
collating end character position e ($1 \le s < e \le |Q|+1$) in the
character string to be covered, the set containing every character
of Q in at least one covering element,
(b) the retrieval is terminated as having no retrieval result if
there is no such covering, and
(c) if there is a covering, the covering result is recorded,
a second step of symbol dictionary retrieval in which
(d) a right extended meta-symbol assessing means retrieves the
meta-symbol information in said symbol dictionary, retrieves all
right extended meta-symbols x (that is, meta-symbols containing a
character string R in the beginning portion) of the j-th rightmost
portion character string R of said query character string (that is,
the partial character string from the j-th character
($1 \le j \le |Q|$) to the final character of the query character
string), out of the extended meta-symbols (that is, meta-symbols
containing Z) of a meta-symbol Z of covering elements in said
covering result of which said collating start character position
is 1, and adds elements $(x, j, |R|+j)$ to said covering result and
records them, and
a third step of symbol dictionary retrieval in which a symbol
number set assessing means retrieves said meta-symbol appearance
information in said symbol dictionary while systematically
compiling each set C of elements in said covering result that
covers said query character string or an arbitrary right extended
character string, collects a symbol number set SC commonly
contained in all elements of C, records it as part of said
retrieval result, and issues the sum set of all SCs as a final
retrieval result.
3. A method of retrieving, by backward coincidence, a response to
an arbitrary character string query using a symbol dictionary
storing meta-symbol information and meta-symbol appearance
information, that is, retrieving all symbols having said queried
character string in the end portion, comprising the steps of:
a first step of symbol dictionary retrieval in which
(a) a queried character string covering means retrieves the
meta-symbol information in said symbol dictionary and searches for
a covering of the queried character string Q by an overlapped
longest match word extraction method, that is, a set of covering
elements, each a triple (m, s, e) of a meta-symbol m collating with
a partial character string, a collating start character position s,
and a collating end character position e ($1 \le s < e \le |Q|+1$)
in the character string to be covered, the set containing every
character of Q in at least one covering element,
(b) the retrieval process is terminated as having no retrieval
result if there is no such covering, and
(c) if there is a covering, the covering result is recorded,
a second step of symbol dictionary retrieval in which
(d) a left extended meta-symbol assessing means retrieves the
meta-symbol information in said symbol dictionary, retrieves all
left extended meta-symbols x (that is, meta-symbols containing a
character string L in the end portion) of the j-th leftmost portion
character string L of said queried character string (that is, the
partial character string from the first character to the j-th
character ($1 \le j \le |Q|$) of the queried character string), out
of the extended meta-symbols (that is, meta-symbols containing Z)
of a meta-symbol Z of covering elements in said covering result of
which the collating end character position is $|Q|+1$, and adds
elements $(x, j+1-|L|, j+1)$ to said covering result and records
them, and
a third step of symbol dictionary retrieval in which a symbol
number set assessing means retrieves said meta-symbol appearance
information in said symbol dictionary while systematically
compiling each set C of elements in said covering result that
covers said queried character string or an arbitrary left extended
character string, collects a symbol number set SC commonly
contained in all elements of C, records it as part of the retrieval
result, and issues the sum set of all SCs as a final retrieval
result.
4. A method of retrieving, by intermediate coincidence, a response
to an arbitrary character string query using a symbol dictionary
storing meta-symbol information and meta-symbol appearance
information, that is, retrieving all symbols having said queried
character string, comprising:
a first step of symbol dictionary retrieval in which a question
character string covering means retrieves said meta-symbol
information in said symbol dictionary and searches for a covering
of the queried character string Q by an overlapped longest match
word extraction method, that is, a set of covering elements, each a
triple (m, s, e) of a meta-symbol m collating with a partial
character string, a collating start character position s, and a
collating end character position e ($1 \le s < e \le |Q|+1$) in the
character string to be covered, the set containing every character
of Q in at least one covering element, the retrieval process is
terminated with no retrieval result if there is no such covering,
and if there is a covering, the covering result is recorded,
a second step of symbol dictionary retrieval in which a right
extended meta-symbol assessing means retrieves the meta-symbol
information in said symbol dictionary, retrieves all right extended
meta-symbols x (that is, meta-symbols containing a character string
R in the beginning portion) of the j-th rightmost portion character
string R of said queried character string (that is, the partial
character string from the j-th character ($1 \le j \le |Q|$) to the
final character of the queried character string), out of the
extended meta-symbols (that is, meta-symbols containing Z) of a
meta-symbol Z of covering elements in said covering result of which
said collating start character position is 1, and adds elements
$(x, j, |R|+j)$ to said covering result and records them,
a third step of symbol dictionary retrieval in which a left
extended meta-symbol assessing means retrieves the meta-symbol
information in said symbol dictionary, retrieves all left extended
meta-symbols x (that is, meta-symbols containing a character string
L in the end portion) of the j-th leftmost portion character string
L of said queried character string (that is, the partial character
string from the first character to the j-th character
($1 \le j \le |Q|$) of the queried character string), out of the
extended meta-symbols (that is, meta-symbols containing Z) of a
meta-symbol Z of covering elements in said covering result of which
said collating end character position is $|Q|+1$, and adds elements
$(x, j+1-|L|, j+1)$ to said covering result and records them,
a fourth step of symbol dictionary retrieval in which an extended
meta-symbol assessing means retrieves the meta-symbol information,
retrieves all both extended meta-symbols X of Q (that is,
meta-symbols containing the character string Q in the portion from
the j-th character to the $(j+|Q|)$-th character, where $1 < j$),
and adds elements $(X, 1-j, 1-j+|X|)$ to said covering result and
records them, and
a fifth step of symbol dictionary retrieval in which a symbol
number set assessing means retrieves said meta-symbol appearance
information in said symbol dictionary while systematically
compiling each set C of elements in said covering result that
covers said queried character string or an arbitrary extended
character string, collects a symbol number set SC commonly
contained in all elements of C, records it as part of said
retrieval result, and issues the sum set of all SCs as a final
retrieval result.
Description
[0001] This application is a divisional of patent application Ser.
No. 09/451,047, filed Nov. 30, 1999.
FIELD OF THE INVENTION
[0002] The present invention relates to compilation and retrieval
of a symbol dictionary for use in a database device or a document
retrieval device for controlling and retrieving accumulated
electronic symbol information by using a computer.
BACKGROUND OF THE INVENTION
[0003] With the widespread use of word processors and personal
computers, the development of large capacity and low price memory
media such as CD-ROM, and the advancement of Ethernet networking,
database systems such as relational databases and full text
retrieval databases have come to be widely used.
[0004] Databases handle a relatively short character string of
several to hundreds of characters, such as a person's name, place
name, organization name, address, classification code, or part
code, as a symbol. They store a CSV list of symbols (a string of
symbols connected by commas, such as "MorishitaElectricIndustries,
MorishitaCommunicationsIndustries, KyushuMorishitaElectric" in a
field of trading partner company names) in one item (field) of the
database, search for records which contain a complete match, a
prefix match, a postfix match, or an infix match to the query
symbol, and retrieve those records at high speed (for example, in
the case of prefix matching, the retrieval condition is "retrieve
the records containing a symbol starting with 'Morishita' in the
trading partner company names field").
[0005] Of these four matching modes, an efficient retrieval method
for complete matching and prefix matching is realized by using the
data structure called a TRIE (also known as a radix search tree),
as described in publications such as "Algorithm Vol. 2" (R.
Sedgewick, tr. by Kohei Noshita, et al., Kindai Kagakusha, 1992,
ISBN 4-7649-0189-7, pp. 52-72) and "Algorithm and Data Structure
Handbook" (G. H. Gonnet, tr. by Mitsuo Gen, et al., Keigaku
Shuppan, 1987, ISBN 4-7665-0326-0, pp. 111-122). In addition, where
postfix matching is needed, a TRIE may be constructed for data in
which the character sequence of each symbol is reversed, and this
TRIE may be retrieved with the reversed query.
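By way of illustration only, the TRIE-based approach may be sketched in Python as follows. The class names TrieNode and Trie, the helper postfix, and the numbering of symbols from 1 are assumptions introduced for this sketch and do not appear in the cited publications; the symbols are those of the trading partner example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.numbers = set()  # numbers of symbols ending exactly here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, symbol, number):
        node = self.root
        for ch in symbol:
            node = node.children.setdefault(ch, TrieNode())
        node.numbers.add(number)

    def _walk(self, key):
        """Follow `key` from the root; return the node reached or None."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def complete(self, key):
        """Numbers of symbols equal to `key` (complete matching)."""
        node = self._walk(key)
        return set(node.numbers) if node is not None else set()

    def prefix(self, key):
        """Numbers of symbols starting with `key` (prefix matching)."""
        node = self._walk(key)
        found = set()
        stack = [node] if node is not None else []
        while stack:
            current = stack.pop()
            found |= current.numbers
            stack.extend(current.children.values())
        return found


symbols = ["MorishitaElectricIndustries",
           "MorishitaCommunicationsIndustries",
           "KyushuMorishitaElectric"]
forward, backward = Trie(), Trie()
for number, symbol in enumerate(symbols, start=1):
    forward.insert(symbol, number)
    backward.insert(symbol[::-1], number)  # reversed symbol for postfix matching


def postfix(key):
    """Postfix matching: prefix search of the reversed key in the reversed TRIE."""
    return backward.prefix(key[::-1])
```

In this sketch, forward.prefix("Morishita") reaches symbols 1 and 2 and postfix("Electric") reaches symbol 3, whereas neither TRIE can find "Morishita" inside "KyushuMorishitaElectric" without scanning, which is the infix matching difficulty.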
[0006] If infix matching is desired, efficient retrieval processing
is difficult with a TRIE, and conventionally, for example, a method
as disclosed in Japanese Laid-open Patent No. Hei 3-42774 has been
employed.
[0007] In the method disclosed in Japanese Laid-open Patent No. Hei
3-42774, when compiling a symbol dictionary, a symbol character
string is divided character by character, and dictionary
information recording pairs of the symbol number and the appearance
character position of the corresponding character in the symbol is
created for every character. When retrieving the symbol dictionary,
a query character string is decomposed character by character, the
dictionary information corresponding to each character is
retrieved, and the set of symbol numbers for which the symbol
numbers are identical and the appearance character positions are
consecutive is issued as a retrieval result.
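By way of illustration only, a character-by-character dictionary of this kind may be sketched in Python as follows. The function names, the in-memory layout, and the sample serial-number symbols are assumptions made for this sketch, not details taken from the cited publication.

```python
from collections import defaultdict


def compile_char_index(symbols):
    """Map each character to the set of (symbol number, character position)."""
    index = defaultdict(set)
    for number, symbol in enumerate(symbols, start=1):
        for position, ch in enumerate(symbol, start=1):
            index[ch].add((number, position))
    return index


def infix_retrieve(index, query):
    """Numbers of symbols in which the query characters appear consecutively."""
    if not query:
        return set()
    # Each candidate is (symbol number, position of the first query character).
    candidates = set(index.get(query[0], ()))
    for offset, ch in enumerate(query[1:], start=1):
        postings = index.get(ch, set())
        candidates = {(n, p) for (n, p) in candidates
                      if (n, p + offset) in postings}
        if not candidates:
            break
    return {n for n, _ in candidates}


# Hypothetical symbols used only for this sketch.
symbols = ["199800000123A", "199800000124B", "199900000123A"]
index = compile_char_index(symbols)
print(infix_retrieve(index, "0123"))
print(infix_retrieve(index, "98000"))
```

Every character of the query causes one posting set to be read, so a long query made of frequent characters such as "0" reads a large amount of intermediate data before the consecutive-position check narrows it down.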
[0008] In this conventional compiling method of a symbol
dictionary, however, when the types of symbols are more than tens
of thousands, the symbol dictionary file to be compiled is more
than twice as large as the symbol data to be retrieved, and it is
difficult to utilize if the usable capacity of the memory device is
limited.
[0009] Also, in the conventional retrieving method of a symbol
dictionary, when retrieving a symbol which is long and contains
many high-frequency characters, the quantity of intermediate data
to be read out from the symbol dictionary is tremendous, and the
retrieval speed is reduced by such read operations and by the
checking of consecutive positions.
[0010] The disadvantage of the conventional retrieving method of a
symbol dictionary may be somewhat alleviated by creating and
recording the dictionary information for every chain of N
consecutive characters ("N-gram") instead of for every single
character. However, in the case of retrieving a symbol such as
"199800000123A", which begins with the year followed by a
multi-digit integer mostly composed of consecutive zeros, there are
many symbols that incidentally coincide in the first 10 characters
or more, and if N is about 2 to 4, the amount of data to be read
out from the symbol dictionary is still large and the retrieval
speed is reduced.
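By way of illustration only, the effect may be observed with the following Python sketch. The 1000 serial-number symbols and the function name compile_ngram_index are hypothetical and chosen only to mimic symbols of the form "199800000123A".

```python
from collections import defaultdict


def compile_ngram_index(symbols, n):
    """Map each chain of n consecutive characters to (symbol number, position)."""
    index = defaultdict(list)
    for number, symbol in enumerate(symbols, start=1):
        for start in range(len(symbol) - n + 1):
            index[symbol[start:start + n]].append((number, start + 1))
    return index


# Hypothetical data: year, eight zero-padded digits, and a trailing letter.
symbols = [f"1998{i:08d}A" for i in range(1, 1001)]

for n in (1, 2, 4):
    index = compile_ngram_index(symbols, n)
    longest = max(len(postings) for postings in index.values())
    # Number of distinct chains and length of the longest posting list.
    print(n, len(index), longest)
```

Under these assumptions, even with N = 4 the chain "0000" appears in every symbol, so its posting list is at least as long as the symbol list, while the number of distinct chains to be managed grows as N increases.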
[0011] Further, if the number N of characters in the chain is
increased, the number of types of N-character chains that appear
increases abruptly, making it hard to compile a symbol dictionary,
and the capacity of the compiled symbol dictionary increases due to
the housekeeping information. In the conventional retrieval method
of a symbol dictionary, when retrieving a symbol which is long and
contains many high-frequency characters, complete matching takes
the longest processing time among the four matching modes, and in
an application where complete matching occupies the majority of
queries, the average retrieval speed is reduced.
[0012] Thus, in the conventional compiling method of a symbol
dictionary, the symbol dictionary file to be compiled is more than
twice as large as the symbol data to be retrieved, and it is
difficult to utilize if the usable capacity of the memory device is
limited.
[0013] Moreover, in the conventional retrieval method of a symbol
dictionary, if we retrieve a symbol which is long and contains many
high-frequency characters, the amount of data to be read out from
the symbol dictionary is tremendous, and the retrieval speed is
reduced.
[0014] If the number N of characters in the chain is increased, the
number of types of N-character chains that appear increases
abruptly, it is hard to compile a symbol dictionary with small
housekeeping information, and the capacity of the compiled symbol
dictionary increases.
[0015] In a compiling method of a symbol dictionary of the
invention, a meta-symbol dictionary gathering shorter symbols
called "meta-symbols" for covering the symbols in symbol data is
compiled automatically, each symbol in the symbol data is covered
with the meta-symbols in this meta-symbol dictionary, and the
information on how each symbol is covered is compiled as
meta-symbol appearance information recorded for every meta-symbol,
so that high speed retrieval is possible up to infix matching and
the size of the compiled symbol dictionary can be reduced. In a
retrieving method of a symbol dictionary of the invention, a query
string is covered with meta-symbols by retrieving the meta-symbol
dictionary contained in the compiled symbol dictionary file, the
retrieval results of the right and left extension meta-symbols of
the original covering meta-symbols are added to this covering
result, and the symbol number set commonly contained in every
element of each set of covering results that covers the query
string or its right and left extension character strings is sought,
so that high speed retrieval is possible for all matching modes
including infix matching. Moreover, in an application where
complete matching occupies the majority of queries, symbol
retrieval is possible without decreasing the average retrieval
speed.
SUMMARY OF THE INVENTION
[0016] A compiling method of a symbol dictionary according to a
first aspect of the invention comprises: symbol covering means for
taking each symbol in symbol data, searching a meta-symbol
dictionary, and finding the covering result by an extraction method
such as the maximal word extraction method; meta-symbol
accumulating means for accumulating the covering results; a
meta-symbol frequency table for accumulating the total appearance
frequency of each meta-symbol in the symbol data; meta-symbol
dictionary update judging means for adding meta-symbols to or
deleting them from the meta-symbol dictionary, and for deciding
when to stop the meta-symbol accumulation, by referring to the
meta-symbol frequency table and conforming to the predetermined
conditions and parameters; meta-symbol appearance information
compiling means for calculating, from the covering results, the
meta-symbol appearance information recording the number of each
symbol containing each meta-symbol and the appearance character
position; and symbol dictionary compiling means for compiling a
machine-retrievable symbol dictionary from the meta-symbol
dictionary and the meta-symbol appearance information. Accordingly,
even when retrieving a symbol which is long and contains many
high-frequency characters, high speed retrieval is possible up to
infix matching, and the size of the compiled symbol dictionary can
be reduced.
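By way of illustration only, the covering of symbols and the compilation of meta-symbol appearance information may be sketched in Python as follows. The sketch assumes a fixed, hand-written meta-symbol set META and omits the automatic addition and deletion of meta-symbols; the covering rule shown (the longest match at each position, dropping matches that lie inside an earlier element) is a simplification assumed for this example and is not asserted to be the maximal word extraction method itself.

```python
from collections import defaultdict


def cover(text, meta_symbols):
    """Cover `text` with overlapping longest matches.

    Returns covering elements (m, s, e): meta-symbol m matches `text`
    from 1-based position s up to, but not including, position e.
    Returns None if some character cannot be covered.
    """
    elements = []
    covered_to = 1  # every position before covered_to is already covered
    for s in range(1, len(text) + 1):
        matches = [m for m in meta_symbols if m and text.startswith(m, s - 1)]
        if matches:
            m = max(matches, key=len)
            e = s + len(m)
            if e > covered_to:  # skip matches lying inside an earlier element
                elements.append((m, s, e))
                covered_to = e
        if covered_to <= s:  # position s is not covered by any element
            return None
    return elements


def compile_appearance_info(symbols, meta_symbols):
    """Build appearance information and a frequency table from coverings."""
    appearance = defaultdict(set)  # meta-symbol -> {(symbol number, position)}
    frequency = defaultdict(int)   # meta-symbol -> total appearance count
    for number, symbol in enumerate(symbols, start=1):
        covering = cover(symbol, meta_symbols)
        if covering is None:
            raise ValueError(f"symbol {number} cannot be covered")
        for m, s, _e in covering:
            appearance[m].add((number, s))
            frequency[m] += 1
    return appearance, frequency


# Hypothetical meta-symbol dictionary for the trading partner example.
META = {"Morishita", "Electric", "Industries", "Communications", "Kyushu"}
symbols = ["MorishitaElectricIndustries",
           "MorishitaCommunicationsIndustries",
           "KyushuMorishitaElectric"]
appearance, frequency = compile_appearance_info(symbols, META)
```

Here each symbol is recorded as three short entries, one per meta-symbol, rather than one entry per character, which is the source of the size reduction in this simplified setting.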
[0017] A retrieving method of a symbol dictionary for complete
matching according to a second aspect of the invention comprises:
query string covering means for retrieving a meta-symbol dictionary
in a symbol dictionary and finding the covering result of the query
string by the duplicate longest match word extraction method;
meta-symbol appearance information retrieval means for retrieving
meta-symbol appearance information in the symbol dictionary and
finding, for each element in the covering result obtained in the
first step, the set of numbers of symbols containing the desired
meta-symbol at the corresponding position; and symbol number
assessing means for finding the common portion of the corresponding
symbol number sets of all elements in the covering result and, if
the common portion is not empty, issuing the symbol number of its
element as the retrieval result and terminating the retrieval
process, or, if it is empty, terminating the retrieval process with
no retrieval result. Accordingly, even when the character string is
long and the symbol to be retrieved contains characters or
character chains of high frequency, high speed symbol dictionary
retrieval is possible by complete matching.
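By way of illustration only, complete matching by covering the query and intersecting symbol number sets may be sketched in Python as follows. The covering rule, the tables APPEARANCE and LENGTHS, the sample symbols, and in particular the final comparison of symbol length with query length are assumptions of this sketch; the text above does not state how a symbol that merely begins with the query is excluded.

```python
def cover(text, meta_symbols):
    """Overlapping longest-match covering: elements (m, s, e), e exclusive."""
    elements = []
    covered_to = 1  # every position before covered_to is already covered
    for s in range(1, len(text) + 1):
        matches = [m for m in meta_symbols if m and text.startswith(m, s - 1)]
        if matches:
            m = max(matches, key=len)
            if s + len(m) > covered_to:
                elements.append((m, s, s + len(m)))
                covered_to = s + len(m)
        if covered_to <= s:
            return None  # position s cannot be covered
    return elements


def complete_match(query, meta_symbols, appearance, lengths):
    """Numbers of symbols equal to `query`, or an empty set."""
    covering = cover(query, meta_symbols)
    if not covering:
        return set()  # no covering: terminate with no retrieval result
    common = None
    for m, s, _e in covering:
        # Symbols containing meta-symbol m at the same position s.
        numbers = {n for n, pos in appearance.get(m, ()) if pos == s}
        common = numbers if common is None else common & numbers
        if not common:
            return set()
    # Assumed extra check: the symbol must not extend beyond the query.
    return {n for n in common if lengths[n] == len(query)}


# Hypothetical symbol data and meta-symbol dictionary for this sketch.
SYMBOLS = {1: "MorishitaElectricIndustries",
           2: "MorishitaElectric",
           3: "KyushuMorishitaElectric"}
META = {"Morishita", "Electric", "Industries", "Kyushu"}
APPEARANCE = {}
for number, symbol in SYMBOLS.items():
    for m, s, _e in cover(symbol, META):
        APPEARANCE.setdefault(m, set()).add((number, s))
LENGTHS = {number: len(symbol) for number, symbol in SYMBOLS.items()}

print(complete_match("MorishitaElectric", META, APPEARANCE, LENGTHS))
```

Only one posting set per covering element is read, so a long query covered by a few long meta-symbols requires few set intersections regardless of how frequent its individual characters are.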
[0018] A retrieving method of a symbol dictionary for prefix
matching according to a third aspect of the invention comprises:
question character string covering means for retrieving a
meta-symbol dictionary in a symbol dictionary and finding the
covering of the question character string Q in the retrieval
condition by the overlapped longest match word extraction method,
and, if there is no covering, terminating the retrieval processing
with no retrieval result, or, if there is a covering, recording the
covering result; right extended meta-symbol assessing means for
retrieving meta-symbol information in the symbol dictionary,
retrieving all right extended meta-symbols x (that is, meta-symbols
containing character string R in the beginning portion) of the j-th
rightmost portion character string R of the question character
string (that is, the partial character string from the j-th
character ($1 \le j \le |Q|$) to the final character in the
question character string), out of the extended meta-symbols (that
is, meta-symbols containing Z) of a meta-symbol Z of covering
elements in the covering result of which the collating start
character position is 1, and adding elements $(x, j, |R|+j)$ to the
covering result and recording them; and symbol number set assessing
means for retrieving meta-symbol appearance information in the
symbol dictionary while systematically compiling each set C of
elements in the covering result that covers the question character
string or an arbitrary right extended character string, collecting
the symbol number set SC commonly contained in all elements of C,
recording it as part of the retrieval result, and issuing the sum
set of all SCs as the final retrieval result. Accordingly, even
when the character string is long and the symbol to be retrieved
contains characters or character chains of high frequency, high
speed symbol dictionary retrieval is possible by prefix matching.
[0019] A retrieving method of a symbol dictionary for postfix
matching according to a fourth aspect of the invention comprises:
question character string covering means for retrieving a
meta-symbol dictionary in a symbol dictionary and finding the
covering of the question character string Q in the retrieval
condition by the overlapped longest match word extraction method,
and, if there is no covering, terminating the retrieval processing
with no retrieval result, or, if there is a covering, recording the
covering result; left extended meta-symbol assessing means for
retrieving meta-symbol information in the symbol dictionary,
retrieving all left extended meta-symbols x (that is, meta-symbols
containing character string L in the end portion) of the j-th
leftmost portion character string L of the question character
string (that is, the partial character string from the first
character to the j-th character ($1 \le j \le |Q|$) in the question
character string), out of the extended meta-symbols (that is,
meta-symbols containing Z) of a meta-symbol Z of covering elements
in the covering result of which the collating end character
position is $|Q|+1$, and adding elements $(x, j+1-|L|, j+1)$ to the
covering result and recording them; and symbol number set assessing
means for retrieving meta-symbol appearance information in the
symbol dictionary while systematically compiling each set C of
elements in the covering result that covers the question character
string or an arbitrary left extended character string, collecting
the symbol number set SC commonly contained in all elements of C,
recording it as part of the retrieval result, and issuing the sum
set of all SCs as the final retrieval result. Accordingly, even
when the character string is long and the symbol to be retrieved
contains characters or character chains of high frequency, high
speed symbol dictionary retrieval is possible by postfix matching.
[0020] A retrieving method of a symbol dictionary for infix
matching according to a fifth aspect of the invention comprises:
question character string covering means for retrieving a
meta-symbol dictionary in a symbol dictionary and finding the
covering of the question character string Q in the retrieval
condition by the overlapped longest match word extraction method,
and, if there is no covering, terminating the retrieval processing
with no retrieval result, or, if there is a covering, recording the
covering result; right extended meta-symbol assessing means for
retrieving meta-symbol information in the symbol dictionary,
retrieving all right extended meta-symbols x (that is, meta-symbols
containing character string R in the beginning portion) of the j-th
rightmost portion character string R of the question character
string (that is, the partial character string from the j-th
character ($1 \le j \le |Q|$) to the final character in the
question character string), out of the extended meta-symbols (that
is, meta-symbols containing Z) of a meta-symbol Z of covering
elements in the covering result of which the collating start
character position is 1, and adding elements $(x, j, |R|+j)$ to the
covering result and recording them; left extended meta-symbol
assessing means for retrieving meta-symbol information in the
symbol dictionary, retrieving all left extended meta-symbols x
(that is, meta-symbols containing character string L in the end
portion) of the j-th leftmost portion character string L of the
question character string (that is, the partial character string
from the first character to the j-th character
($1 \le j \le |Q|$) in the question character string), out of the
extended meta-symbols (that is, meta-symbols containing Z) of a
meta-symbol Z of covering elements in the covering result of which
the collating end character position is $|Q|+1$, and adding
elements $(x, j+1-|L|, j+1)$ to the covering result and recording
them; both extended meta-symbol assessing means for retrieving the
meta-symbol dictionary, retrieving all both extended meta-symbols x
of Q, and adding elements $(x, 1-j, 1-j+|x|)$ to the covering
result and recording them; and symbol number set assessing means
for retrieving meta-symbol appearance information in the symbol
dictionary while systematically compiling each set C of elements in
the covering result that covers the question character string or an
arbitrary extended character string, collecting the symbol number
set SC commonly contained in all elements of C, recording it as
part of the retrieval result, and issuing the sum set of all SCs as
the final retrieval result. Accordingly, even when the character
string is long and the symbol to be retrieved contains characters
or character chains of high frequency, high speed symbol dictionary
retrieval is possible by infix matching.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] FIG. 1 is a block diagram showing a general constitution of
a symbol dictionary compiling apparatus in a first embodiment of
the invention.
[0022] FIG. 2 is a block diagram showing a general constitution of
a symbol dictionary retrieving apparatus in a second embodiment of
the invention.
[0023] FIG. 3 is a block diagram showing a general constitution of
a symbol dictionary retrieving apparatus in a third embodiment of
the invention.
[0024] FIG. 4 is a block diagram showing a general constitution of
a symbol dictionary retrieving apparatus in a fourth embodiment of
the invention.
[0025] FIG. 5 is a block diagram showing a general constitution of
a symbol dictionary retrieving apparatus in a fifth embodiment of
the invention.
[0026] FIG. 6 is a flowchart describing the procedure of covering
process by symbol covering means in the first embodiment.
[0027] FIG. 7 is a flowchart describing the procedure of covering
process by question character string covering means in the second
to fifth embodiments.
[0028] FIG. 8 is a flowchart describing the procedure of symbol
number assessing process by symbol number assessing means in the
second embodiment.
[0029] FIG. 9 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the third embodiment.
[0030] FIG. 10 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the third embodiment.
[0031] FIG. 11 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the third embodiment.
[0032] FIG. 12 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the third embodiment.
[0033] FIG. 11 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the fourth embodiment.
[0034] FIG. 12 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the fourth embodiment.
[0035] FIG. 13 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the fifth embodiment.
[0036] FIG. 14 is a flowchart describing the procedure of symbol
number set assessing process by symbol number set assessing means
in the fifth embodiment.
[0037] FIG. 15 is an example of symbol data in the first
embodiment.
[0038] FIG. 16 is an example of registered content of meta-symbol
dictionary in initial stage in the first embodiment.
[0039] FIG. 17 is an example of content of meta-symbol frequency
table in the first embodiment.
[0040] FIG. 18 is an example of content of meta-symbol frequency
table in the first embodiment.
[0041] FIG. 19 is an example of meta-symbol dictionary in the first
embodiment.
[0042] FIG. 20 is an example of part of meta-symbol dictionary in
the first embodiment.
[0043] FIG. 21 is an example of content of meta-symbol frequency
table in the first embodiment.
[0044] FIG. 22 is an example of content of meta-symbol frequency
table in the first embodiment.
[0045] FIG. 23 is an example of content of meta-symbol frequency
table in the first embodiment.
[0046] FIG. 24 is an example of content of meta-symbol frequency
table in the first embodiment.
[0047] FIG. 25 is an example of content of meta-symbol frequency
table in the first embodiment.
[0048] FIG. 26 is an example of content of meta-symbol frequency
table in the first embodiment.
[0049] FIG. 27 is an example of content of meta-symbol frequency
table in the first embodiment.
[0050] FIG. 28 is an example of content of meta-symbol frequency
table in the first embodiment.
[0051] FIG. 29 is an example of content of meta-symbol onset
information in the first embodiment.
[0052] FIG. 30 is a conceptual diagram showing an example of data
structure of meta-symbol dictionary in the first embodiment.
[0053] FIG. 31 is a conceptual diagram showing an example of data
structure of meta-symbol dictionary in the first embodiment.
[0054] FIG. 32 is a conceptual diagram showing an example of data
structure of meta-symbol dictionary in the first embodiment.
[0055] FIG. 33 is an example of content of extended information of
meta-symbol in the first embodiment.
[0056] FIG. 34 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the second embodiment.
[0057] FIG. 35 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the third embodiment.
[0058] FIG. 36 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the third embodiment.
[0059] FIG. 37 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the fourth embodiment.
[0060] FIG. 38 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the fifth embodiment.
[0061] FIG. 39 is a conceptual diagram describing principal
intermediate data in a process of symbol dictionary retrieval in
the fifth embodiment.
REFERENCE NUMERALS
[0062] 101 Symbol data
[0063] 102 Meta-symbol dictionary
[0064] 103 Symbol covering means
[0065] 104 Meta-symbol summing means
[0066] 105 Meta-symbol frequency table
[0067] 106 Meta-symbol dictionary update judging means
[0068] 107 Meta-symbol appearance information compiling means
[0069] 108 Meta-symbol appearance information
[0070] 109 Symbol dictionary compiling means
[0071] 110 Symbol dictionary
[0072] 201 Symbol dictionary
[0073] 202 Retrieval condition input means
[0074] 203 Question character string covering means
[0075] 204 Covering result
[0076] 205 Symbol number assessing means
[0077] 206 Retrieval result output means
[0078] 301 Symbol dictionary
[0079] 302 Retrieval condition input means
[0080] 303 Question character string covering means
[0081] 304 Covering result
[0082] 305 Symbol number set assessing means
[0083] 306 Retrieval result output means
[0084] 307 Right extended meta-symbol assessing means
[0085] 401 Symbol dictionary
[0086] 402 Retrieval condition input means
[0087] 403 Question character string covering means
[0088] 404 Covering result
[0089] 405 Symbol number set assessing means
[0090] 406 Retrieval result output means
[0091] 408 Left extended meta-symbol assessing means
[0092] 501 Symbol dictionary
[0093] 502 Retrieval condition input means
[0094] 503 Question character string covering means
[0095] 504 Covering result
[0096] 505 Symbol number set assessing means
[0097] 506 Retrieval result output means
[0098] 507 Right extended meta-symbol assessing means
[0099] 508 Left extended meta-symbol assessing means
[0100] 509 Both extended meta-symbol assessing means
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0101] An exemplary embodiment of the present invention relates to
a compiling method of a symbol dictionary, being a method of
compiling a machine-retrievable symbol dictionary for symbol data,
registering a finite number of symbols mutually different out of an
array of not more than N (a specific number of) characters
contained in a certain determined character set, comprising a first
step of symbol dictionary compilation in which symbol covering
means retrieves each symbol in the symbol data in a prepared
meta-symbol dictionary in initial state, and searches covering
(that is, relating to a pair of a meta-symbol for collating with a
partial character string in the symbols to be covered and collating
start character position, a set containing any character in the
symbols to be covered in at least one of pair of meta-symbols),
meta-symbol summing means sums up covering results, total
appearance frequency of each meta-symbol in the symbol data is
accumulated in a meta-symbol frequency table, meta-symbol
dictionary update judging means refers to the meta-symbol frequency
table, and after deleting the meta-symbol from the meta-symbol
dictionary according to a predetermined standard, adds the
meta-symbol to the meta-symbol dictionary according to a
predetermined standard, a second step of symbol dictionary
compilation in which the symbol covering means retrieves each
symbol in the symbol data in the meta-symbol dictionary at the
present to search the covering, the meta-symbol summing means sums
up the covering results, the total appearance frequency in the
symbol data in each meta-symbol is accumulated in the meta-symbol
frequency table, the meta-symbol dictionary update judging means
refers to the meta-symbol frequency table, adds the meta-symbol to
the meta-symbol dictionary according to a predetermined standard,
judges if the predetermined stopping condition is satisfied or not,
and repeats the second step until satisfying the stopping
condition, a third step of symbol dictionary compilation in which
the meta-symbol dictionary update judging means refers to the
meta-symbol frequency table, and deletes the meta-symbol from the
meta-symbol dictionary according to a predetermined standard, a
fourth step of symbol dictionary compilation in which the symbol
covering means covers the symbol data by using the meta-symbol
dictionary calculated at the third step, and the meta-symbol
appearance information compiling means calculates the meta-symbol
appearance information recording the symbol number for showing each
meta-symbol and the appearance character position from the covering
result, and a fifth step of symbol dictionary compilation in which
the symbol dictionary compiling means compiles a
machine-retrievable symbol dictionary storing meta-symbol
information and meta-symbol appearance information from the
meta-symbol dictionary and meta-symbol appearance information, and
therefore if the character string is long and when retrieving a
symbol containing characters or character chain of high frequency,
high speed symbol dictionary retrieval is possible including up to
infix matching, and a symbol dictionary of small capacity can be
compiled.
[0102] The invention further relates to the compiling method of
symbol dictionary in which covering of symbol is determined by
maximal word extraction method in the symbol covering means.
[0103] The invention further relates to the compiling method of
symbol dictionary in which a symbol composed of one character only,
about each character in a predetermined character set, and zero or
more character string known as part of the symbol in the symbol
data are registered in the prepared meta-symbol dictionary in
initial state.
[0104] The invention further relates to the compiling method of
symbol dictionary in which the deletion of a meta-symbol in the
first step is done on the basis of deleting the meta-symbol of
which frequency in the meta-symbol frequency table is 0, and the
addition of a meta-symbol in the first step is done on the basis of
adding the meta-symbol by adding one arbitrary character in the
meta-symbol dictionary at the end, as for a meta-symbol less than N
characters in the meta-symbol frequency table of which frequency is
frequency C1 or more determined by the symbol data content.
[0105] The invention further relates to the compiling method of
symbol dictionary in which the addition of a meta-symbol in the
second step is done on the basis of adding the meta-symbol by
adding one arbitrary character in the predetermined character set
at the end, as for a meta-symbol less than N characters in the
meta-symbol frequency table of which frequency is frequency Ck or
more determined by the symbol data content and the number of times
of repetition of the second step, and the deletion of a meta-symbol
in the third step is done on the basis of deleting a meta-symbol of
which frequency in the meta-symbol frequency table is less than E
and two characters or more.
[0106] The invention further relates to the compiling method of a
symbol dictionary in which the stopping condition in the second
step is the condition of stopping when there is no addition or
deletion of a meta-symbol in the meta-symbol dictionary update
judging means.
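The addition, deletion, and stopping rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names cover, tally, extensions, and build_meta_symbols are hypothetical, the threshold schedule is passed in as a function c_of(k) with k=1 standing for the first step, and meta-symbols are extended only with the single characters that survive the first-step deletion.

```python
from collections import Counter


def cover(symbol, dictionary):
    """Maximal word extraction: longest match per start, minus contained matches."""
    longest = max(map(len, dictionary))
    elements, reach = [], 0  # reach: largest 0-based exclusive end accepted so far
    for i in range(len(symbol)):
        for j in range(min(longest, len(symbol) - i), 0, -1):
            if symbol[i:i + j] in dictionary:
                if i + j > reach:  # not contained in an earlier match
                    elements.append((symbol[i:i + j], i + 1, i + j + 1))
                    reach = i + j
                break  # only the longest match at this start is considered
    return elements


def tally(symbols, dictionary):
    """Frequency table: how often each meta-symbol occurs as a covering element."""
    return Counter(m for sym in symbols for m, _, _ in cover(sym, dictionary))


def extensions(freq, dictionary, alphabet, threshold, n_max):
    """New candidates m + c for each frequent meta-symbol m shorter than n_max."""
    return {m + c for m in dictionary
            if freq[m] >= threshold and len(m) < n_max
            for c in alphabet} - dictionary


def build_meta_symbols(symbols, charset, n_max, c_of, e_min):
    dictionary = set(charset)  # initial state: one-character meta-symbols
    # First step: delete zero-frequency meta-symbols, then extend frequent ones.
    freq = tally(symbols, dictionary)
    dictionary = {m for m in dictionary if freq[m] > 0}
    alphabet = sorted(dictionary)  # assumption: extend with surviving characters
    dictionary |= extensions(freq, dictionary, alphabet, c_of(1), n_max)
    # Second step: repeat until no meta-symbol is added (stopping condition).
    k = 2
    while True:
        freq = tally(symbols, dictionary)
        added = extensions(freq, dictionary, alphabet, c_of(k), n_max)
        if not added:
            break
        dictionary |= added
        k += 1
    # Third step: delete meta-symbols of two or more characters rarer than E.
    return {m for m in dictionary if len(m) == 1 or freq[m] >= e_min}
```

For instance, with the toy symbols "ABC", "ABD", "ABE", the character set "ABCDEZ", N=2, a constant threshold of 3, and E=2, the sketch keeps the five used characters plus the meta-symbol "AB".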
[0107] The invention further relates to the compiling method of
symbol dictionary, in which the sequence number of the
corresponding symbol in the symbol data is used as the symbol
number in the fourth step.
[0108] The invention further relates to a retrieving method of a
symbol dictionary, being a method of retrieving complete
coincidence (that is, retrieving a same symbol as question
character string) of an arbitrary character string using a symbol
dictionary storing meta-symbol information and meta-symbol
appearance information, comprising a first step of symbol
dictionary retrieval in which question character string covering
means retrieves meta-symbol information in the symbol dictionary,
and searches covering in the question character string Q of
retrieval condition by the overlapped longest match word
extraction method (that is, covering elements of pair (m, s, e) of
meta-symbol m collating with partial character string, collating
character start position s, and collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and a set containing any character of Q in at
least one covering element), if there is no covering (that is,
covering elements of pair (m, s, e) of meta-symbol m collating with
partial character string, collating character start position s, and
collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and there is no set containing any character
of Q in at least one covering element), the retrieval process is
terminated as being no retrieval result, and if there is covering,
the covering result is stored in the working range, and a second
step of symbol dictionary retrieval in which symbol number set
assessing means retrieves the meta-symbol appearance information in the
symbol dictionary, and if there is (only one) symbol number
contained commonly in all elements in the covering result, it is
issued as the retrieval result and the retrieval process is
terminated, and if there is no symbol number contained commonly in
all elements in the covering result, the retrieval process is
terminated as being no retrieval result, and therefore if the
character string is long and when retrieving a symbol containing
characters of high frequency or character chain, high speed symbol
dictionary retrieval is possible by complete matching.
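The second step of complete-match retrieval can be illustrated with a short sketch. It assumes, hypothetically, that the meta-symbol appearance information is held as a mapping from each meta-symbol to a set of (symbol number, start position) pairs, and that a separate table of symbol lengths is available; neither layout is specified in the text, and the name exact_match is illustrative only. The covering of the question character string is taken as given and is assumed to cover every character of it.

```python
def exact_match(covering, appearance, lengths, query_len):
    """Return the symbol numbers that share every covering element of the query.

    covering   -- list of (meta_symbol, start, end) for the query, 1-based
    appearance -- dict: meta_symbol -> set of (symbol_number, start) pairs
    lengths    -- dict: symbol_number -> symbol length (assumed side table)
    """
    if not covering:
        return set()  # no covering: no retrieval result
    candidates = None
    for meta, start, _end in covering:
        # Symbols in which this meta-symbol was extracted at the same position.
        hits = {num for num, pos in appearance.get(meta, ()) if pos == start}
        candidates = hits if candidates is None else candidates & hits
        if not candidates:
            return set()  # no symbol number common to all elements
    # Keep only symbols exactly as long as the query (rules out longer symbols).
    return {num for num in candidates if lengths[num] == query_len}


# Toy data: symbol 1 is "ABC", symbol 2 is "ABD", both covered with "AB" + one letter.
appearance = {"AB": {(1, 1), (2, 1)}, "C": {(1, 3)}, "D": {(2, 3)}}
lengths = {1: 3, 2: 3}
found = exact_match([("AB", 1, 3), ("D", 3, 4)], appearance, lengths, 3)
```

Here found contains only symbol number 2, while the shorter query "AB" with covering [("AB", 1, 3)] yields no result because neither stored symbol has length 2.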
[0109] The invention further relates to a retrieving method of a
symbol dictionary, being a method of retrieving forward coincidence
(that is, retrieving all symbols having a question character string
in the beginning portion) by an arbitrary character string using a
symbol dictionary storing meta-symbol information and meta-symbol
appearance information, comprising a first step of symbol
dictionary retrieval in which a question character string covering
means retrieves meta-symbol information in the symbol dictionary,
and searches covering in the question character string Q of
retrieval condition by the overlapped longest match word
extraction method (that is, covering elements of pair (m, s, e) of
meta-symbol m collating with partial character string, collating
character start position s, and collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and a set containing any character of Q in at
least one covering element), if there is no covering (that is,
covering elements of pair (m, s, e) of meta-symbol m collating with
partial character string, collating character start position s, and
collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and there is no set containing any character
of Q in at least one covering element), the retrieval process is
terminated as being no retrieval result, and if there is covering,
the covering result is recorded, a second step of symbol dictionary
retrieval in which right extended meta-symbol assessing means
retrieves meta-symbol information in the symbol dictionary,
retrieves, in the covering result, all meta-symbols x of right
extended meta-symbols (that is, meta-symbols containing character
string R in the beginning portion) of j-th rightmost portion
character string of the question character string (that is, the
partial character string from the j-th character
(1 ≤ j ≤ |Q|) to the final character in
the question character string) R, out of extended meta-symbols of
meta-symbol Z of covering elements of which collating start
character position is 1 (that is, meta-symbols containing Z), and
adds elements (x, j, |R|+j) to the covering
result and records, and a third step of symbol dictionary retrieval
in which symbol number set assessing means retrieves meta-symbol
appearance information in the symbol dictionary while
systematically compiling a set C of elements in the covering result
covering the question character string or an arbitrary right
extended character string, collects a symbol number set SC commonly
contained in all elements of C, records as part of retrieval
result, and issues the sum set of all SCs as final retrieval
result, and therefore if the character string is long and when
retrieving a symbol containing characters or character chain of
high frequency, high speed symbol dictionary retrieval is possible
by prefix matching.
[0110] The invention further relates to a retrieving method of a
symbol dictionary, being a method of retrieving backward
coincidence (that is, retrieving all symbols having a question
character string in the end portion) by an arbitrary character
string using a symbol dictionary storing meta-symbol information
and meta-symbol appearance information, comprising a first step of
symbol dictionary retrieval in which question character string
covering means retrieves meta-symbol information in the symbol
dictionary, and searches covering in the question character string
Q of retrieval condition by the overlapped longest match
word extraction method (that is, covering elements of pair (m, s,
e) of meta-symbol m collating with partial character string,
collating character start position s, and collating end character
position e (1 ≤ s < e ≤ |Q|+1) in the
character string to be covered, and a set containing any character
of Q in at least one covering element), if there is no covering
(that is, covering elements of pair (m, s, e) of meta-symbol m
collating with partial character string, collating character start
position s, and collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and there is no set containing any character
of Q in at least one covering element), the retrieval process is
terminated as being no retrieval result, and if there is covering,
the covering result is recorded, a second step of symbol dictionary
retrieval in which left extended meta-symbol assessing means
retrieves meta-symbol information in the symbol dictionary,
retrieves, in the covering result, all meta-symbols x of left
extended meta-symbols (that is, meta-symbols containing character
string L in the end portion) of j-th leftmost portion character
string of the question character string (that is, the partial
character string from the first character to the j-th character
(1 ≤ j ≤ |Q|) in the question character
string) L, out of extended meta-symbols of meta-symbol Z of
covering elements of which collating end character position is
|Q|+1 (that is, meta-symbols containing Z), and
adds elements (x, j+1-|L|, j+1) to the covering
result and records, and a third step of symbol dictionary retrieval
in which symbol number set assessing means retrieves meta-symbol
appearance information in the symbol dictionary while
systematically compiling a set C of elements in the covering result
covering the question character string or an arbitrary left
extended character string, collects a symbol number set SC commonly
contained in all elements of C, records as part of retrieval
result, and issues the sum set of all SCs as final retrieval
result, and therefore if the character string is long and when
retrieving a symbol containing characters or character chain of
high frequency, high speed symbol dictionary retrieval is possible
by postfix matching.
[0111] The invention further relates to a retrieving method of a
symbol dictionary, being a method of retrieving intermediate
coincidence (that is, retrieving all symbols having a question
character string) by an arbitrary character string using a symbol
dictionary storing meta-symbol information and meta-symbol
appearance information, comprising a first step of symbol
dictionary retrieval in which question character string covering
means retrieves meta-symbol information in the symbol dictionary,
and searches covering in the question character string Q of
retrieval condition by the overlapped longest match word
extraction method (that is, covering elements of pair (m, s, e) of
meta-symbol m collating with partial character string, collating
character start position s, and collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and a set containing any character of Q in at
least one covering element), if there is no covering (that is,
covering elements of pair (m, s, e) of meta-symbol m collating with
partial character string, collating character start position s, and
collating end character position e
(1 ≤ s < e ≤ |Q|+1) in the character
string to be covered, and there is no set containing any character
of Q in at least one covering element), the retrieval process is
terminated as being no retrieval result, and if there is covering,
the covering result is recorded, a second step of symbol dictionary
retrieval in which right extended meta-symbol assessing means
retrieves meta-symbol information in the symbol dictionary,
retrieves, in the covering result, all meta-symbols x of right
extended meta-symbols (that is, meta-symbols containing character
string R in the beginning portion) of j-th rightmost portion
character string of the question character string (that is, the
partial character string from the j-th character
(1 ≤ j ≤ |Q|) to the final character
in the question character string) R, out of extended meta-symbols
of meta-symbol Z of covering elements of which collating start
character position is 1 (that is, meta-symbols containing Z), and
adds elements (x, j, |R|+j) to the covering
result and records, a third step of symbol dictionary retrieval in
which left extended meta-symbol assessing means retrieves
meta-symbol information in the symbol dictionary, retrieves, in the
covering result, all meta-symbols x of left extended meta-symbols
(that is, meta-symbols containing character string L in the end
portion) of j-th leftmost portion character string of the question
character string (that is, the partial character string from the
first character to the j-th character
(1 ≤ j ≤ |Q|) in the question character
string) L, out of extended meta-symbols of meta-symbol Z of
covering elements of which collating end character position is
|Q|+1 (that is, meta-symbols containing Z), and
adds elements (x, j+1-|L|, j+1) to the covering
result and records, a fourth step of symbol dictionary retrieval in
which both extended meta-symbol assessing means retrieves the
meta-symbol information, retrieves all of both extended
meta-symbols of Q (that is, meta-symbols containing character
string Q in the portion from the j-th character to the
(j+|Q|)-th character, where 1 < j) X, adds
elements (X, 1-j, 1-j+|X|) to the covering result
and records, and a fifth step of symbol dictionary retrieval in
which symbol number set assessing means retrieves meta-symbol
appearance information in the symbol dictionary while
systematically compiling a set C of elements in the covering result
covering the question character string or an arbitrary extended
character string, collects a symbol number set SC commonly
contained in all elements of C, records as part of retrieval
result, and issues the sum set of all SCs as final retrieval
result, and therefore if the character string is long and when
retrieving a symbol containing characters or character chain of
high frequency, high speed symbol dictionary retrieval is possible
by infix matching.
[0112] (Embodiment 1)
[0113] A first embodiment of the invention is described below while
referring to the drawings. FIG. 1 is a block diagram showing a
general constitution of an embodiment of a symbol dictionary
compiling apparatus. In FIG. 1, reference numeral 101 is symbol
data as the object of compilation of dictionary, 102 is a
meta-symbol dictionary, 103 is symbol covering means for retrieving
the meta-symbol dictionary 102 and seeking the covering result of
each symbol in the symbol data by maximal word extraction method,
104 is meta-symbol summing means for receiving the covering result
issued by the symbol covering means 103 and summing up meta-symbols
extracted as covering elements, 105 is a meta-symbol frequency
table for storing the summing result of the meta-symbol summing
means 104, 106 is meta-symbol dictionary update judging means for
judging addition of meta-symbol to the meta-symbol dictionary 102,
deletion of meta-symbol from the meta-symbol dictionary 102, and
stopping condition of meta-symbol dictionary update process on the
basis of the summing result of the meta-symbol frequency table 105,
107 is meta-symbol appearance information compiling means for
receiving the covering result issued from the symbol covering means
103, and compiling meta-symbol appearance information recording the
symbol number of the meta-symbol extracted as covering element and
extracted character position in every meta-symbol, 108 is
meta-symbol appearance information compiled by the meta-symbol
appearance information compiling means 107, 109 is symbol
dictionary compiling means for compiling a retrievable symbol
dictionary from the meta-symbol dictionary 102 and meta-symbol
appearance information 108, and 110 is a retrievable symbol
dictionary compiled by the symbol dictionary compiling means 109.
FIG. 6 is a flowchart showing the procedure of the process of
finding the covering result by the maximal word extraction method,
by using the meta-symbol dictionary 102, from each symbol to be
covered in the symbol data 101 in the symbol covering means
103.
[0114] In thus constituted symbol dictionary compiling apparatus,
its operation is explained below by referring to an example of
compiling a symbol dictionary that can be retrieved from symbol
data linking the date, 15-minute time increments, and surnames in
Roman alphabet. FIG. 15 shows an example of symbol data. In this
diagram, certain symbols are omitted, but actually a total of 1000
different symbols are stored as symbol data in the sequence of date
and time. The maximum number of characters of each symbol is 1000
characters. At the first step, the addition threshold C1 of
meta-symbols is 1/20 of the number of symbols. In this
example, since the number of symbols is 1000, C1=50. At the second
step, the addition threshold Ck of meta-symbols in the k-th repetition
is 1/10 of the number of symbols as long as k<10,
and 1/5 of the number of symbols if 10 ≤ k. In this example,
Ck=100 if k<10, and Ck=200 if 10 ≤ k. At the third step, the
value of threshold E is determined as 5. Herein, in the symbol
data, only two symbols "-" and "/", numerals "0" to "9", and
letters "A" to "Z" of alphabet are used, and other characters are
not used. As the meta-symbol dictionary for covering such symbols,
prior to compilation of symbol dictionary, a meta-symbol dictionary
consisting of one character only possible to appear in the symbol
is prepared as shown in FIG. 16. In order to cover efficiently in
the symbol covering means, the meta-symbol dictionary has a digital
tree data structure (that is, TRIE) as shown in FIG. 30. The
meta-symbol of the longest match with a certain character string
can be retrieved efficiently by using TRIE, that is, from the root
of TRIE (the left end bullet), tracing the tree structure by
referring to the first character, second character and so forth of
the character string as the clues, and issuing a partial character
string up to the double circle node remotest from the root. In the
first step of compiling the symbol dictionary, the symbol covering
means 103 reads the symbol data in FIG. 15 stored in the symbol
data 101 sequentially, retrieves the meta-symbol dictionary of the
content as shown in FIG. 30 stored in the meta-symbol dictionary
102, and finds the covering result by the maximal word extraction
method. The maximal word extraction method is a method of word
extraction of taking out only the collation of collating character
intervals not contained completely in any other collating character
string, out of the meta-symbols in the meta-symbol dictionary for
collating character string with various partial character strings
with a certain symbol S. For example, supposing a symbol "TOKYO
METROPOLITAN COUNCIL," when the maximal word extraction is
performed by retrieving the meta-symbol dictionary containing six
meta-symbols (TO, TOKYO, TOKYO METRO, TOKYO METROPOLITAN,
METROPOLITAN COUNCIL, COUNCIL), by covering process as shown in the
flowchart in FIG. 6, the covering result is obtained as
follows.
[0115] {(TOKYO METROPOLITAN, 1, 19), (METROPOLITAN COUNCIL, 7,
27)}
[0116] In (X, s, e), X denotes the meta-symbol, s shows the
collating start character position, and e is the collating end
character position (that is, the character position of the
character immediately at the right side of the collated portion),
and this set of three pieces is called the covering element.
Meta-symbols, such as "TO", "TOKYO", and "TOKYO METRO" are to be
collated with the partial character string of "TOKYO METROPOLITAN
COUNCIL", but although they are contained in the meta-symbol
dictionary, since they are completely included in the collating
portion of "TOKYO METROPOLITAN" (in the first two words), they are
not included in the covering result. The process of "removing the
collation completely contained in other collating portion" is
judged at step 603 in FIG. 6, and for the meta-symbol M to be
collated from the i-th character to the (i+j-1)-th character of
symbol S, of the already collated results from the first to the
i-1-th character, if the collating end position (=character
position at right side of collating portion) e is i+j or more, the
collating character interval [s, e-1] of the collation (X, s, e)
includes the character interval [i, i+j-1] completely, and in this
case the judgement at step 603 is [yes], and the collated result is
not added to the covering result set C.
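The longest-match lookup in a TRIE and the containment check of step 603 can be sketched together as follows. This is an illustrative reading of the procedure, not the flowchart of FIG. 6 itself: the function names and the nested-dictionary TRIE layout are assumptions, and positions inside the code are 0-based, converted to the 1-based (X, s, e) form only when a covering element is emitted.

```python
END = None  # key marking a node where a meta-symbol ends (a "double circle" node)


def build_trie(meta_symbols):
    """Build a nested-dict TRIE from an iterable of meta-symbol strings."""
    root = {}
    for word in meta_symbols:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Length of the longest meta-symbol matching text at 0-based start, or 0."""
    node, best = trie, 0
    for offset in range(start, len(text)):
        node = node.get(text[offset])
        if node is None:
            break
        if END in node:
            best = offset - start + 1  # remember the end node farthest from the root
    return best


def maximal_cover(trie, symbol):
    """Covering elements (X, s, e): 1-based start s, end e just past the match."""
    cover, reach = [], 0  # reach: largest 0-based exclusive end accepted so far
    for i in range(len(symbol)):
        j = longest_match(trie, symbol, i)
        # Containment check: an earlier match ending at or beyond i + j
        # contains the interval of this match, so the match is dropped.
        if j > 0 and i + j > reach:
            cover.append((symbol[i:i + j], i + 1, i + j + 1))
            reach = i + j
    return cover


meta_symbols = ["TO", "TOKYO", "TOKYO METRO", "TOKYO METROPOLITAN",
                "METROPOLITAN COUNCIL", "COUNCIL"]
result = maximal_cover(build_trie(meta_symbols), "TOKYO METROPOLITAN COUNCIL")
# result == [("TOKYO METROPOLITAN", 1, 19), ("METROPOLITAN COUNCIL", 7, 27)]
```

Tracking only the largest end position accepted so far is sufficient here, because every rejected match lies inside an accepted one and start positions are visited in increasing order.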
[0117] Incidentally, since all meta-symbols in the meta-symbol
dictionary in FIG. 16 are composed of one character only, in the
covering process of symbol covering means at the first step, the
extracted meta-symbol is always one character, and the judgement at
step 603 in FIG. 6 is not [yes], and, for example, the symbol
[0118] 1998-JAN-01/AM0200/KAWAYASU
[0119] is covered as follows
[0120] {(1, 1, 2), (9, 2, 3), . . . , (S, 26, 27), (U, 27, 28)}
[0121] in one character each. Since all characters possibly
appearing in the symbol are included in the meta-symbol dictionary,
a non-empty covering result is always obtained. The covering
result of each symbol is transferred to the meta-symbol summing
means 104, and the number of times of each meta-symbol extracted as
covering element is recorded in the meta-symbol frequency table
105. For example, after processing of the first symbol
"1998-JAN-01/AM0200/KAWAYASU" in the symbol data, the content of
the meta-symbol frequency table contains "-" twice, "/" twice, "0"
four times, "1" twice, "2" once, "8" once, "9" twice, "A" five
times, "J" once, "K" once, "M" once, "N" once, "S" once, "U" once,
"W" once, and "Y" once. The frequency of other characters is zero.
After processing 1000 symbols in the symbol data 101, the content
of the meta-symbol frequency table is as shown in FIG. 17. In this
example, since all symbols are in the format of
"yyyy-mmm-dd/XXhhmm/name," the frequency of "-" and "/" is 2000
times each (twice in every symbol × 1000 symbols). It is also clear
that meta-symbols "H", "I", "Q", "X", "Z" do not appear at all in
the symbol data in FIG. 15. At this moment, the five meta-symbols
whose frequency in the meta-symbol frequency table is zero are
deleted from the meta-symbol dictionary. FIG. 18 is a frequency
table concerning each meta-symbol after deletion. The threshold C1
is 50, and all meta-symbols in FIG. 18 are in the number of
characters of 1 and the maximum number of characters is less than
1000, and therefore the meta-symbol dictionary update judging means
106 adds two-character meta-symbols such as "--", "-/", and "-0"
having each meta-symbol in FIG. 18 added to the end concerning all
meta-symbols in FIG. 18, and updates to the meta-symbol dictionary
including the meta-symbols as shown in FIG. 19. The actual
structure of the meta-symbol dictionary is built and held as a
digital tree structure (that is, TRIE) as shown in FIG. 31. This
ends the first step of compiling the symbol dictionary.
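The one-character covering of the first symbol, its frequency tally, and the thresholds of this example can be reproduced with a few lines. The function names are illustrative assumptions; in particular, treating the first step as k=1 in addition_threshold is a labelling chosen here for convenience, not notation from the text.

```python
from collections import Counter


def one_char_cover(symbol):
    """Cover a symbol with one-character meta-symbols as (X, s, e), 1-based."""
    return [(ch, pos, pos + 1) for pos, ch in enumerate(symbol, start=1)]


def addition_threshold(num_symbols, k):
    """Thresholds of the example: 1/20 of the symbol count at the first step
    (labelled k == 1 here), then 1/10 while k < 10 and 1/5 once 10 <= k."""
    if k == 1:
        return num_symbols // 20
    return num_symbols // 10 if k < 10 else num_symbols // 5


symbol = "1998-JAN-01/AM0200/KAWAYASU"
elements = one_char_cover(symbol)
frequency = Counter(m for m, _, _ in elements)

# The tally agrees with the counts listed for the first symbol, for example
# "A" five times, "0" four times, and "-" and "/" twice each.
for meta_symbol in sorted(frequency):
    print(meta_symbol, frequency[meta_symbol])
```

A Counter returns 0 for characters that never occur, which matches the statement that the frequency of all other characters is zero.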
[0122] In the second step of compiling the symbol dictionary,
covering of each symbol at the first step and summing of frequency
of extracted meta-symbols are executed again by using the
meta-symbol dictionary in FIG. 19. For example, the symbol
[0123] 1998-JAN-01/AM0200/KAWAYASU
[0124] is covered as follows
[0125] {(19, 1, 3), (99, 2, 4), . . . , (AS, 25, 27), (SU, 26,
28)}
[0126] in two characters each. Since all characters possibly
appearing in the symbol are included in the meta-symbol dictionary,
a non-empty covering result is always obtained. In the covering
process at the second step, meanwhile, since meta-symbols different
in the number of characters are mixed in the meta-symbol dictionary
102, unlike the first step, the judgement at step 603 may be
possibly [yes] in the flowchart in FIG. 6, and all meta-symbols
extracted in the longest match may not be always contained in the
covering result (in the above example, since there is (SU, 26, 28),
the end (U, 27, 28) is not included in the covering result). Of the
meta-symbol frequency table after processing 1000 symbols in the
symbol data 101, the portion of meta-symbols of which frequency is
not zero is as shown in FIG. 20. Comparing it with FIG. 18, for
example, the meta-symbol "-" which is 2000 times of frequency in
FIG. 18 is known to be dispersed in FIG. 20, that is, the frequency
is dispersed into a total of 23 types of meta-symbols, consisting
of 12 types of two-character meta-symbols starting with such as
"-0" and "-S", and 11 types of meta-symbols ending with "-" such as
"8-", "B-", "C-", "G-", "L-", "N-", "P-", "R-", "T-", "V-", "Y-".
The total of frequency of 12 types of two-character meta-symbols
starting with "-" and the total of frequency of 11 types of
meta-symbols ending with "-" are both 2000, and it is known that
the character "-" in the symbols is contained in two meta-symbols
sharing this character "-" of the meta-symbol starting with "-" and
the meta-symbol ending with "-". Since the threshold C2 is 100, and
all meta-symbols in FIG. 20 are 2 in the number of characters and
the maximum number of characters is less than 1000, the meta-symbol
dictionary update judging means 106 updates the meta-symbol
dictionary by adding three-character meta-symbols such as "-0-",
"-0/", "-00", adding each meta-symbol in FIG. 18 (that is,
one-character meta-symbol) to the end, concerning 62 types of
meta-symbols of which frequency is 100 or more, such as "-0", "-1",
"-2", "-A" in FIG. 20. At the same time, the meta-symbol dictionary
update judging means 106 does not terminate the second step because
addition of meta-symbol has occurred once or more as shown above,
but judges to continue similar covering, summing, and updating
process by using the updated meta-symbol dictionary successively.
Of the meta-symbol frequency table after similarly covering and
summing by using the updated meta-symbol dictionary 102, the
portion of meta-symbols of which frequency is 1 or more is as shown
in FIG. 21. For example, the symbol
[0127] 1998-JAN-01/AM0200/KAWAYASU
[0128] is covered as follows
[0129] {(199, 1, 4), (998, 2, 5), (98-, 3, 6), (8-J, 4, 7), (-JA,
5, 8), (AN-, 7, 10), (N-0, 8, 11), . . . , (YAS, 24, 27), (ASU, 25,
28)}
[0130] and it is known that the longest match meta-symbol "JA" from
the character "J" is not included in the covering result. In FIG.
21, two-character meta-symbols and three-character meta-symbols are
coexisting, and when covering, it is known that the symbols are
covered by using the three-character meta-symbol in the portion
large in the frequency of appearance, and by using two-character
meta-symbol in the portion relatively small in frequency of
appearance. Since the threshold C3 is 100, and all meta-symbols in
FIG. 21 are 3 or less in the number of characters and the maximum
number of characters is less than 1000, the meta-symbol dictionary
update judging means 106 updates the meta-symbol dictionary by
adding meta-symbols such as "-01-", "-01/", "-010", adding each
meta-symbol in FIG. 18 (that is, one-character meta-symbol) to the
end, concerning 42 types of meta-symbols of which frequency is 100
or more, such as "-01", "-02", "-03", "-04" in FIG. 21. At the same
time, the meta-symbol dictionary update judging means 106 does not
terminate the second step because addition of meta-symbol has
occurred once or more as shown above, but judges to continue
similar covering, summing, and updating process by using the
updated meta-symbol dictionary successively. Of the meta-symbol
frequency table after similarly covering and summing by using the
updated meta-symbol dictionary 102, the portion of meta-symbols of
which frequency is 1 or more is as shown in FIG. 22. Since the
threshold C4 is 100, and all meta-symbols in FIG. 22 are 4 or less
in the number of characters and the maximum number of characters is
less than 1000, the meta-symbol dictionary update judging means 106
updates the meta-symbol dictionary by adding meta-symbols such as
"-NOV-", "-NOV/", "-NOV0", adding each meta-symbol in FIG. 18 (that
is, one-character meta-symbol) to the end, concerning 31 types of
meta-symbols of which frequency is 100 or more, such as "-NOV",
"/AM0", "/AM1", "/KAW" in FIG. 22.
[0131] At the same time, the meta-symbol dictionary update judging
means 106 does not terminate the second step because addition of
meta-symbol has occurred once or more as shown above, but judges to
continue similar covering, summing, and updating process by using
the updated meta-symbol dictionary successively. Of the meta-symbol
frequency table after similarly covering and summing by using the
updated meta-symbol dictionary 102, the portion of meta-symbols of
which frequency is 1 or more is as shown in FIG. 23. Comparing FIG.
23 and FIG. 22, the number of types of meta-symbols with frequency
of 1 or more is decreased by two types in spite of addition of
meta-symbols, and it is confirmed that the meta-symbols smaller in
the number of characters are "being shut out" from the extraction
result by the maximal word extraction by meta-symbols with a large
number of characters. Since the threshold C5 is 100, and all
meta-symbols in FIG. 23 are 5 or less in the number of characters
and the maximum number of characters is less than 1000, the
meta-symbol dictionary update judging means 106 updates the
meta-symbol dictionary by adding meta-symbols such as "-NOV--",
"-NOV-/", "-NOV-0", adding each meta-symbol in FIG. 18 (that is,
one-character meta-symbol) to the end, concerning 20 types of
meta-symbols of which frequency is 100 or more, such as "-NOV-",
"/KAWA", "/SUDA", "/SUKA" in FIG. 23.
[0132] At the same time, the meta-symbol dictionary update judging
means 106 does not terminate the second step because addition of
meta-symbol has occurred once or more as shown above, but judges to
continue similar covering, summing, and updating process by using
the updated meta-symbol dictionary successively. Of the meta-symbol
frequency table after similarly covering and summing by using the
updated meta-symbol dictionary 102, the portion of meta-symbols of
which frequency is 1 or more is as shown in FIG. 24. Comparing FIG.
24 and FIG. 22, the number of types of meta-symbols with frequency
of 1 or more is decreased further, and it is confirmed that the
meta-symbols smaller in the number of characters are "being shut
out" from the extraction result by the maximal word extraction by
meta-symbols with a large number of characters. Since the threshold
C6 is 100, and all meta-symbols in FIG. 24 are 6 or less in the
number of characters and the maximum number of characters is less
than 1000, the meta-symbol dictionary update judging means 106
updates the meta-symbol dictionary by adding meta-symbols, adding
each meta-symbol in FIG. 18 (that is, one-character meta-symbol) to
the end, concerning 16 types of meta-symbols of which frequency is
100 or more in FIG. 24.
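The covering used in the second step, including the way shorter
meta-symbols are "shut out", can be sketched as follows. This is a
minimal sketch under an assumed rule: at each character position the
longest meta-symbol in the dictionary is taken, and it is dropped
when its interval lies inside an interval already recorded. The
function name cover is hypothetical, and the flowchart of FIG. 6 may
differ in detail.

```python
def cover(text, dictionary):
    """Cover `text` from left to right with longest-match meta-symbols.

    Returns (meta_symbol, start, end) triples with a 1-based start and an
    exclusive end, as in the examples of the text. Returns an empty list
    when some character cannot be matched at all.
    Assumes `dictionary` is a non-empty set of strings.
    """
    max_len = max(map(len, dictionary))
    result = []
    furthest_end = 0  # 0-based exclusive end of the text covered so far
    for i in range(len(text)):
        # Try the longest candidate starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                # Keep the match only if it reaches beyond what is covered.
                if i + length > furthest_end:
                    result.append((candidate, i + 1, i + length + 1))
                    furthest_end = i + length
                break
        else:
            return []  # no meta-symbol starts here: no covering exists
    return result


symbol = "1998-JAN-01/AM0200/KAWAYASU"
chars = set(symbol)

# Dictionary of all one- and two-character meta-symbols over these characters.
two_char_dictionary = chars | {a + b for a in chars for b in chars}
two_char_cover = cover(symbol, two_char_dictionary)

# Mixed lengths: "JA" is the longest match from "J" but is shut out by "-JA".
mixed_cover = cover("-JAN-", set("-JAN") | {"-JA", "JA", "AN-"})
```

With the two-character dictionary, the final element is (SU, 26, 28)
and the one-character match (U, 27, 28) is not recorded, as in the
example of the text.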
[0133] At the same time, the meta-symbol dictionary update judging
means 106 does not terminate the second step because addition of
meta-symbol has occurred once or more as shown above, but judges to
continue similar covering, summing, and updating process by using
the updated meta-symbol dictionary successively. Of the meta-symbol
frequency table after similarly covering and summing by using the
updated meta-symbol dictionary 102, the portion of meta-symbols of
which frequency is 1 or more is as shown in FIG. 25. Since the
threshold C7 is 100, and all meta-symbols in FIG. 25 are 7 or less
in the number of characters and the maximum number of characters is
less than 1000, the meta-symbol dictionary update judging means 106
attempts to add meta-symbols, adding each meta-symbol in FIG. 18
(that is, one-character meta-symbol) to the end, concerning nine
types of meta-symbols of which frequency is 100 or more in FIG. 25,
but, as for five types of meta-symbols "/SUDA", "/SUKAWA", "/SUWA",
"98-NOV", "WADA", since the meta-symbol adding one character at the
end is already included all in the meta-symbol dictionary, it
updates the meta-symbol dictionary by adding to the remaining four
types of meta-symbols. At the same time, the meta-symbol dictionary
update judging means 106 does not terminate the second step because
addition of meta-symbol has occurred once or more as shown above,
but judges to continue similar covering, summing, and updating
process by using the updated meta-symbol dictionary successively.
Comparing FIG. 25 and FIG. 24, the frequency of meta-symbol
"/SUKAWA" is decreased from 187 to 81. It is confirmed, due to the
presence of meta-symbol "/SUKAWA" in FIG. 25, that "SUKAWA" of the
symbol "1998 . . . /SUKAWA" is deleted from the covering result,
and that only the frequency of "SUKAWA" of the symbol "1998 . . .
YASUKAWA" is left over. Of the meta-symbol frequency table after
similar covering and summing process by using the updated
meta-symbol dictionary 102, the portion of meta-symbols of which
frequency is 1 or more is as shown in FIG. 26. Since the threshold
C8 is 100, and all meta-symbols in FIG. 26 are 8 or less in the
number of characters and the maximum number of characters is less
than 1000, the meta-symbol dictionary update judging means 106
attempts to add meta-symbols, adding each meta-symbol in FIG. 18
(that is, one-character meta-symbol) to the end, concerning six
types of meta-symbols of which frequency is 100 or more in FIG. 26,
but, as for five types of meta-symbols "/SUDA", "/SUKAWA", "/SUWA",
"98-NOV", "WADA", since the meta-symbol adding one character at the
end is already included all in the meta-symbol dictionary, it
updates the meta-symbol dictionary by adding to the remaining
meta-symbol "1998-NOV".
[0134] At the same time, the meta-symbol dictionary update judging
means 106 does not terminate the second step because addition of
meta-symbol has occurred once or more as shown above, but judges to
continue similar covering, summing, and updating process by using
the updated meta-symbol dictionary successively. Of the meta-symbol
frequency table after similarly covering and summing by using the
updated meta-symbol dictionary 102, the portion of meta-symbols of
which frequency is 1 or more is as shown in FIG. 27. Since the
threshold C9 is 100, and all meta-symbols in FIG. 27 are 9 or less
in the number of characters and the maximum number of characters is
less than 1000, the meta-symbol dictionary update judging means 106
attempts to add meta-symbols, adding each meta-symbol in FIG. 18
(that is, one-character meta-symbol) to the end, concerning five
types of meta-symbols of which frequency is 100 or more in FIG. 27,
but, as for four types of meta-symbols "/SUDA", "/SUKAWA", "/SUWA",
"WADA", since the meta-symbol adding one character at the end is
already included all in the meta-symbol dictionary, it updates the
meta-symbol dictionary by adding to the remaining meta-symbol
"1998-NOV-". At the same time, the meta-symbol dictionary update
judging means 106 does not terminate the second step because
addition of meta-symbol has occurred once or more as shown above,
but judges to continue similar covering, summing, and updating
process by using the updated meta-symbol dictionary successively.
Of the meta-symbol frequency table after similar covering and
summing process by using the updated meta-symbol dictionary 102,
the portion of meta-symbols of which frequency is 1 or more is as
shown in FIG. 28. Since the threshold C10 is 100, and all
meta-symbols in FIG. 28 are 10 or less in the number of characters
and the maximum number of characters is less than 1000, the
meta-symbol dictionary update judging means 106 attempts to add
meta-symbols, adding each meta-symbol in FIG. 18 (that is,
one-character meta-symbol) to the end, concerning four types of
meta-symbols of which frequency is 100 or more in FIG. 28, but, as
for these four types of meta-symbols "/SUDA", "/SUKAWA", "/SUWA",
"WADA", since the meta-symbol adding one character at the end is
already included all in the meta-symbol dictionary, no additional
processing is done and the meta-symbol dictionary is not updated.
Thus, since addition of meta-symbol does not occur, the meta-symbol
dictionary update judging means 106 terminates the second step.
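The repetition of covering, summing, and updating in the second step
can be sketched as the loop below. It is a simplified sketch, not the
procedure of the flowcharts: one threshold is assumed for every round
(the text uses C1 to C10), the covering rule is the assumed
longest-match rule described in the docstring, and the names cover
and grow_dictionary and the toy symbols are hypothetical.

```python
from collections import Counter


def cover(text, dictionary):
    """Longest match at each position; matches inside a recorded interval are dropped."""
    max_len = max(map(len, dictionary))
    result, furthest_end = [], 0
    for i in range(len(text)):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                if i + length > furthest_end:
                    result.append((candidate, i + 1, i + length + 1))
                    furthest_end = i + length
                break
        else:
            return []
    return result


def grow_dictionary(symbols, threshold, max_chars=1000):
    """Repeat cover -> count -> extend until a round adds no meta-symbol."""
    alphabet = sorted({c for s in symbols for c in s})
    dictionary = set(alphabet)
    while True:
        # Frequency table of the meta-symbols that appear in the coverings.
        freq = Counter(m for s in symbols for m, _, _ in cover(s, dictionary))
        added = False
        for m, count in freq.items():
            if count >= threshold and len(m) < max_chars:
                # Append each one-character meta-symbol to the end of m.
                for c in alphabet:
                    if m + c not in dictionary:
                        dictionary.add(m + c)
                        added = True
        if not added:
            return dictionary, freq


dictionary, freq = grow_dictionary(["AB1", "AB2", "AB3", "CD1"], threshold=3)
```

For these toy symbols the loop stops after three rounds. "AB" is in
the dictionary but its final frequency is 0, because "AB1", "AB2",
and "AB3" shut it out of the covering.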
[0135] At a third step of compiling symbol dictionary, the
meta-symbol dictionary update judging means 106 refers to the
meta-symbol frequency table 105, and deletes meta-symbols of two
characters or more having frequency of less than the threshold E
(that is, 5) from the meta-symbol dictionary 102. In FIG. 28, none
of the meta-symbols with a frequency of 1 or more has a frequency
below 5, so in this case only the meta-symbols of two or more
characters whose frequency is 0 are deleted, and the content of the
meta-symbol dictionary 102 is consequently the union of the
meta-symbols in FIG. 28 and the meta-symbols in FIG. 18. The
actual structure of meta-symbol dictionary is built and held as
digital tree data structure as shown in FIG. 32 (that is, TRIE).
This ends the third step of compiling symbol dictionary.
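The deletion performed in the third step can be sketched as a simple
filter. The function name prune and the sample values are
hypothetical; the rule itself (keep every one-character meta-symbol,
and keep longer meta-symbols only when their frequency reaches the
threshold E) follows the description above.

```python
def prune(dictionary, freq, threshold_e):
    """Delete meta-symbols of two or more characters whose frequency is below threshold_e."""
    return {
        m for m in dictionary
        if len(m) == 1 or freq.get(m, 0) >= threshold_e
    }


# Illustrative values only: "AB" and "ABX" never appear in a covering.
sample_dictionary = {"A", "B", "AB", "AB1", "ABX"}
sample_freq = {"AB1": 7, "AB": 0, "A": 0}
pruned = prune(sample_dictionary, sample_freq, threshold_e=5)
```

One-character meta-symbols are kept even at frequency 0, so that
every character appearing in the symbol data can still be covered.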
[0136] At a fourth step of compiling symbol dictionary, the symbol
covering means 103 finds the covering result by covering each
symbol data in the symbol data 101 by using the meta-symbol
dictionary 102 in FIG. 32 calculated at the third step, and the
meta-symbol appearance information compiling means 108 compiles
meta-symbol appearance information 108 recording the symbol number
of appearance of each meta-symbol from the covering result and the
appearance character position. In this case, the meta-symbol
appearance information as shown in FIG. 29 is compiled. In FIG. 29,
however, for the ease of interpretation, the symbol character
string is used instead of the symbol number. This compilation
process is so-called inversion by nature, and it can be done
efficiently by the technique generally employed in information
retrieval system. The collating character position is expressed by
the number of characters at the left side of the collating portion
(the left character count in FIG. 29) and the number of characters
at the right side of the collating portion (right character count
in FIG. 29). The content as shown in FIG. 29 is recorded as a
summary table in each meta-symbol, and by retrieving by using the
meta-symbol and the collating character position as the clues, the
string (set) of numbers of symbols including the designated
meta-symbol at the designated character position can be obtained
efficiently. This ends the fourth step of compiling symbol
dictionary.
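The inversion of the fourth step can be sketched as follows: each
covering element is turned into an entry that records the symbol
number together with the left and right character counts. The names
cover and build_appearance_info, the assumed covering rule, and the
toy data are hypothetical and serve only to show the shape of the
table of FIG. 29.

```python
from collections import defaultdict


def cover(text, dictionary):
    """Longest match at each position; matches inside a recorded interval are dropped."""
    max_len = max(map(len, dictionary))
    result, furthest_end = [], 0
    for i in range(len(text)):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                if i + length > furthest_end:
                    result.append((candidate, i + 1, i + length + 1))
                    furthest_end = i + length
                break
        else:
            return []
    return result


def build_appearance_info(symbols, dictionary):
    """Map each meta-symbol to a list of (symbol number, left count, right count)."""
    info = defaultdict(list)
    for number, symbol in enumerate(symbols, start=1):
        for m, start, end in cover(symbol, dictionary):
            left = start - 1                 # characters to the left of the match
            right = len(symbol) - end + 1    # characters to the right of the match
            info[m].append((number, left, right))
    return info


info = build_appearance_info(["AB1", "AB2"], {"A", "B", "1", "2", "AB"})
```

Looking up a meta-symbol together with a left and right count then
yields the numbers of the symbols that contain it at that position.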
[0137] At a fifth step of compiling a symbol dictionary, the symbol
dictionary compiling means 109 compiles a machine-retrievable
symbol dictionary 110 from the meta-symbol dictionary 102 and
meta-symbol appearance information 108. At this time, as for the
meta-symbol appearance information 108, the table as shown in
FIG. 29 is stored directly in the symbol dictionary 110, but as for the
meta-symbol dictionary 102, aside from the information in TRIE
structure as shown in FIG. 32, the meta-symbol extension table
adding extended information of meta-symbol as shown in FIG. 33 is
also stored in the symbol dictionary 110 as meta-symbol
information. The meta-symbol extension table in FIG. 33 records,
for every meta-symbol M in the meta-symbol dictionary, triples each
consisting of a meta-symbol in the meta-symbol dictionary that
contains M as a character string and the numbers of characters of
its left and right extended portions. For example, the extended
information of meta-symbol "-" is expressed as follows:
[0138] {(-, 0, 0), (-01, 0, 2), . . . , (-29, 0, 2), (-3, 0, 1), .
. . , (R-0, 1, 1), (R-3, 1, 1)}
[0139] This meta-symbol extension table can be compiled in the same
way as in the compiling process of the meta-symbol appearance information
108 shown above. This ends the fifth step of compiling symbol
dictionary, and the symbol dictionary 110 is compiled, and the
symbol dictionary compilation is over.
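The meta-symbol extension table can be sketched as below. The
function name extension_table and the small dictionary are
hypothetical; each entry is a triple of a containing meta-symbol and
the numbers of characters of its left and right extended portions.
The quadratic scan is chosen for brevity only.

```python
def extension_table(dictionary):
    """For each meta-symbol M, list (X, left, right) for every X in the dictionary containing M."""
    table = {}
    for m in dictionary:
        entries = []
        for x in dictionary:
            pos = x.find(m)
            while pos != -1:
                # left = characters of x before m, right = characters of x after m
                entries.append((x, pos, len(x) - pos - len(m)))
                pos = x.find(m, pos + 1)
        table[m] = sorted(entries)
    return table


table = extension_table({"-", "-01", "R-0", "0", "1", "R"})
```

For this small dictionary, the entries of "-" are (-, 0, 0),
(-01, 0, 2), and (R-0, 1, 1), which has the same form as the
extended information shown above.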
[0140] As explained herein, according to the compiling method of
symbol dictionary in the first embodiment of the invention, as for
the partial character string appearing at high frequency in the
symbol data, by compiling a meta-symbol dictionary having
meta-symbols with more number of characters, since the covering
information of symbol is recorded by using this meta-symbol
dictionary, the symbol dictionary can be compiled by a smaller
quantity of information, and when retrieving the symbol dictionary,
the symbol including the partial character string appearing at high
frequency can be retrieved at high speed as compared with the
conventional symbol dictionary retrieval. Moreover, this
meta-symbol dictionary compilation can be executed mechanically by
setting the threshold, and an appropriate symbol dictionary suited
to deviation of character string distribution of symbol data can be
compiled without requiring manual operation.
[0141] (Embodiment 2)
[0142] A second embodiment of the invention is described below
while referring to the drawings. FIG. 2 is a block diagram showing
a general constitution of a symbol dictionary retrieving apparatus.
In FIG. 2, reference numeral 201 is a symbol dictionary storing
meta-symbol information and meta-symbol appearance information, 202
is retrieval condition input means for entering character string as
retrieval condition, 203 is question character string covering
means for finding the covering result by covering the question
character string which is the retrieval condition entered from the
retrieval condition input means 202 by the overlapped longest
match word extraction method by using the symbol dictionary
201, 204 is the covering result determined by the question
character string covering means 203, 205 is symbol number assessing
means for assessing the symbol number completely coinciding with
the question character string, that is, identical with the question
character string, from the covering result 204 and the meta-symbol
appearance information of symbol dictionary 201, and 206 is
retrieval result output means for issuing the symbol number
assessed by the symbol number assessing means 205 and others.
[0143] In thus constituted symbol dictionary retrieving apparatus,
the operation is explained below by referring to the drawings,
relating to the example of symbol dictionary presented in the first
embodiment and an example of simple retrieval condition. FIG. 7 is
a flowchart describing the procedure of process for finding the
covering result in the question character string covering means
203, FIG. 8 is a flowchart describing the procedure of assessing
process of symbol number in the symbol number assessing means 205,
and FIG. 34 is a conceptual diagram describing principal
intermediate data in the process of symbol dictionary retrieval in
the case of giving the condition of "Find the symbol number
completely coinciding with the question character string
1998-NOV-01/PM1030/KAWAYASU" as the retrieval condition.
[0144] To begin with, at a first step of retrieving symbol
dictionary, the question character string covering means 203
retrieves the meta-symbol information in the symbol dictionary 201,
and finds the covering of the question character string
1998-NOV-01/PM1030/KAWAYASU by the overlapped longest match word
extraction method, and obtains the covering result C of *STEP1 in
FIG. 34. The overlapped longest match word extraction method is a
covering method in which the
meta-symbol of longest match is searched from the left side of the
covering object character string, while permitting partial
duplication of meta-symbols, and if the collating character
interval of a certain meta-symbol A is completely contained in the
interval of sum of the collating character intervals of one or more
other meta-symbol groups B, . . . , X, such meta-symbol A is not
recorded as covering element. More specifically, at step 702 in
FIG. 7, among the meta-symbols whose collating character interval
extends further toward the end than the immediately preceding
extraction result and follows it without a gap, a set H of the
meta-symbols that cover furthest toward the end is first found.
From H, the meta-symbol whose collating start position is closest
to the beginning, that is, the one with the largest number of
characters, is then found and used as a covering element. On the
basis of the collating character interval of this covering element,
the next covering element toward the end is determined in the same
way. By this series of extraction processes from the beginning to
the end, this covering method obtains a subset of the covering
result obtained by the maximal word extraction method. In the case
of this question character string, since the covering result 204 is
not empty, processing at the symbol number assessing means 205
starts; if the covering result were empty, the process would stop
immediately with no retrieval result. The subsequent
process conforms to FIG. 8. First, at step 801, the element (at most
one exists in C) whose collating start character position s is 1 is
searched for in the covering result. In this example,
(1998-NOV-0, 1, 11) is found. Next, in the meta-symbol appearance
information in the symbol dictionary, all entries of the form
(X, 0, n-e+1) (where e is the collating end character position of
the element and n is the number of characters in the question
character string; n is 27 in this example) in the appearance symbol
information of M=1998-NOV-0 are searched, and the set of symbol
numbers of these symbols X is recorded as A. In FIG. 34, for the ease
of reading, sets A and B are described by using symbol character
string instead of the symbol number. In this example, the symbol
number of symbol such as 1998-NOV-01/AM0830/NODA is determined.
Once A is determined, the element selected herein (1998-NOV-0,1,11)
is deleted from C. As a result, C becomes as shown in C at *STEP2
in FIG. 34, and the condition judging of "Is C an empty set?" at
step 802 in FIG. 8 is No, and the process advances to step 803. At
step 803, in this example, the beginning element of C (1/P, 11, 14)
is selected, and as B, the symbol (the number corresponding to the
symbol) such as B at *STEP3 in FIG. 34 including
1998-JAN-01/PM065/NODA or the like is obtained. Then, finding the
common portion of A and B, it is stored in A. That is, the content
of A is reduced only to the portion contained in B. In this
example, the content of A is reduced to four symbols (their
numbers). In succession, judging at step 804 in FIG. 8, since A is
not empty, the process advances to step 805, and the element (1/P,
11, 14) selected at step 803 is deleted from C, and the process
returns to step 802. Thereafter, up to *STEP4 to *STEP6, similarly
selecting the element from C successively, B is determined from the
meta-symbol appearance information, and the intermediate result A
is reduced. In this period, neither A nor C is empty, and the
process is not terminated on the way.
[0145] Finally, after the process of *STEP7, C is empty at the end
of step 805 in FIG. 8, so it is judged Yes at step 802, and the
process in the symbol number assessing means 205 is terminated. The
element of A, that is, the number of the symbol
"1998-NOV-01/PM1030/KAWAYASU", is issued to the retrieval result
output means 206, and the symbol retrieval process is
terminated.
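The complete-match retrieval described above can be sketched as an
intersection of symbol number sets. This is a minimal sketch under
stated assumptions: the covering uses a simplified longest-match
rule rather than the exact procedure of FIG. 7, and the names cover,
build_index, and exact_match and the small dictionary are
hypothetical. For each covering element (M, s, e) of the question
character string, the symbols having M with left count s-1 and right
count n-e+1 are looked up and intersected.

```python
from collections import defaultdict


def cover(text, dictionary):
    """Longest match at each position; matches inside a recorded interval are dropped."""
    max_len = max(map(len, dictionary))
    result, furthest_end = [], 0
    for i in range(len(text)):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                if i + length > furthest_end:
                    result.append((candidate, i + 1, i + length + 1))
                    furthest_end = i + length
                break
        else:
            return []  # uncoverable character: empty covering result
    return result


def build_index(symbols, dictionary):
    """meta-symbol -> (left count, right count) -> set of symbol numbers."""
    index = defaultdict(lambda: defaultdict(set))
    for number, symbol in enumerate(symbols, start=1):
        for m, start, end in cover(symbol, dictionary):
            index[m][(start - 1, len(symbol) - end + 1)].add(number)
    return index


def exact_match(query, dictionary, index):
    """Return the set of symbol numbers identical to `query`."""
    n = len(query)
    result = None
    for m, start, end in cover(query, dictionary):
        hits = index.get(m, {}).get((start - 1, n - end + 1), set())
        result = set(hits) if result is None else result & hits
        if not result:
            return set()  # intermediate result A became empty: stop early
    return result if result is not None else set()


symbols = [
    "1998-NOV-01/PM1030/KAWAYASU",
    "1998-NOV-01/AM0830/NODA",
    "1998-JAN-01/AM0200/KAWAYASU",
]
dictionary = {c for s in symbols for c in s} | {"1998-", "NOV-0", "/KAWA"}
index = build_index(symbols, dictionary)
found = exact_match("1998-NOV-01/PM1030/KAWAYASU", dictionary, index)
```

Because the left count, the length of M, and the right count add up
to n, every symbol that survives the intersection has the same
length as the question character string and agrees with it at every
covered position.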
[0146] As explained herein, according to the retrieving method of
symbol dictionary in the second embodiment of the invention, as for
the partial character string appearing at high frequency among
symbol data, meta-symbol information having meta-symbols with
greater number of characters is compiled, and by using this
meta-symbol information, once the covering result is composed from
the question character string, and the retrieval is processed by
using this covering result and the meta-symbol appearance
information, therefore even the retrieval of symbol containing
partial character string appearing at high frequency can be done
faster than in the conventional retrieval of symbol dictionary.
[0147] (Embodiment 3)
[0148] A third embodiment of the invention is described below while
referring to the drawings. FIG. 3 is a block diagram showing a
general constitution of a symbol dictionary retrieving apparatus.
In FIG. 3, reference numeral 301 is a symbol dictionary storing
meta-symbol information and meta-symbol appearance information, 302
is retrieval condition input means for entering character string as
retrieval condition, 303 is question character string covering
means for finding the covering result by covering the question
character string which is the retrieval condition entered from the
retrieval condition input means 302 by the maximal word extraction
method by using the symbol dictionary 301, 304 is the covering
result determined by the question character string covering means
303, 305 is symbol number set assessing means for assessing the set
of symbol numbers coinciding forward with the question character
string, that is, containing the question character string in the
beginning portion, from the covering result 304 and the meta-symbol
appearance information of symbol dictionary 301, 307 is right
extended meta-symbol assessing means for retrieving the meta-symbol
information in the symbol dictionary 301, finding all of the sets
of the number of the meta-symbol and the collating position of the
right extended meta-symbol (that is, the meta-symbol containing R
in the beginning portion) of the rightmost partial character string
R of the question character string, out of the extended
meta-symbols of meta-symbol Z (that is, meta-symbols containing Z)
of covering elements largest in the collating start character
position among the covering result 304, and adding and storing to
the covering result 304, and 306 is retrieval result output means
for issuing the set of symbol numbers assessed by the symbol number
set assessing means 305 and others. The constituent elements 301 to 304
in FIG. 3 correspond to the constituent elements 201 to 204 in FIG.
2 which is the block diagram of the second embodiment.
[0149] In thus constituted symbol dictionary retrieving apparatus,
the operation is explained below by referring to the drawings,
relating to the example of symbol dictionary presented in the first
embodiment and an example of simple retrieval condition. FIG. 7 is
a flowchart describing the procedure of process for finding the
covering result in the question character string covering means
303, FIGS. 9 and 10 are flowcharts describing the procedure of
assessing process of symbol number set in the symbol number set
assessing means 305, and FIGS. 35 and 36 are conceptual diagrams
describing principal intermediate data in the process of symbol
dictionary retrieval in the case of giving the condition of "Find
the set of symbol numbers coinciding forward with the question
character string 1998-NOV-01/PM" as the retrieval condition.
[0150] To begin with, at a first step of retrieving symbol
dictionary, the question character string covering means 303
retrieves the meta-symbol information in the symbol dictionary 301,
and finds the covering of the question character string
1998-NOV-01/PM by the overlapped longest match word extraction
method, and obtains the covering result C of *STEP1 in FIG. 35. The
procedure of the covering process is the same as the procedure of
the covering process in embodiment 2. In the case of this question
character string, since the covering result 304 is not empty,
processing at the right extended meta-symbol assessing means 307
starts; if the covering result were empty, the process would stop
immediately with no retrieval result. Next, the right extended
meta-symbol assessing
means 307 retrieves the meta-symbol information in the symbol
dictionary 301, and finds the extended meta-symbols in the
meta-symbol Z of the covering element largest in the collating
start position (that is, meta-symbols of character string
containing Z) among the covering result 304. Of the obtained
extended meta-symbols, only the meta-symbol X of the right extended
meta-symbol (that is, the meta-symbol containing the character
string R in the beginning portion) of the j-th rightmost partial
character string R of the question character string (that is, the
partial character string from the j-th character to the final
character in the question character string) is selected, and
[0151] (X, j, |R|+j)
[0152] is added to the covering result 304. In this example, Z=M,
and as its extended meta-symbols, 26 types are determined, that
is,
[0153] [/AM01], . . . , [/AM12], [/PM01], [/PM02], . . . , [/PM12],
[1998-MAR], [1998-MAY]
[0154] Out of them, 12 types of meta-symbols
[0155] [/PM01], [/PM02], . . . , [/PM12]
[0156] as the right extended meta-symbols of the rightmost partial
string of the question character string 1998-NOV-01/PM are added to
the covering result 304 by the right extended meta-symbol assessing
means 307. This mode is shown in *STEP2 in FIG. 35. Thus, after
covering up to the right extended meta-symbols, the symbol number
set assessing means 305 determines the symbol number set. The
subsequent process conforms to FIG. 9 and FIG. 10. First, at step
901, the set D composed of elements of which collating start
character position s is 1 is determined from the covering result.
In this example, D={(1998-NOV-0,1,11)}. The set SC of the final
result is initialized to be empty. Since D is not empty, the
process advances to step 903, and only one element
(1998-NOV-0,1,11) is selected from D, and in the meta-symbol
appearance information in the symbol dictionary 301, all formats of
(X, 0, *) in the appearance symbol information of M=1998-NOV-0 are
searched, and the set of the symbol number of the symbol X is
recorded as A. Herein, * denotes an arbitrary value (don't care). In
FIG. 35 and FIG. 36, sets such as A, C, D are described by using
the symbol character string, instead of symbol number, for the ease
of reading. In this example, at *STEP4, the symbol number of the
symbol such as 1998-NOV-01/AM0830/NODA is determined. Once A is
determined, the element selected herein (1998-NOV-0,1,11) is
deleted from D. As a result, the condition judging of "Is A an
empty set?" at step 904 is No, and the process advances to step
905. At step 905, it is judged whether q is greater than the number
of characters n of the question character string, where p and q are
the collating start and end character positions of the selected
element. If q is greater, the elements of the set A at this moment
are added to the set SC of the final result; if not, the procedure
select_cover1 (A, p, q) in FIG. 10 is called. In this case, n=14 and
q=11, and therefore q<n, so the process moves to step 907, where
the procedure select_cover1 (A, p, q) in FIG. 10 is called, and the
process is advanced. At step 908 in FIG. 10, the set Dp is
determined, composed of the elements in the covering result C whose
collating start character position s is larger than the procedure
argument p and not larger than the procedure argument q.
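The addition of right extended meta-symbols performed by the right
extended meta-symbol assessing means 307, described earlier in this
embodiment, can be sketched as follows. This is a simplification: it
scans every rightmost partial character string R of the question
character string directly instead of first looking up the extended
meta-symbols of Z in the extension table, and the function name
right_extended and the small dictionary are hypothetical.

```python
def right_extended(query, dictionary):
    """Return (X, j, j + |R|) for every meta-symbol X that begins with a
    rightmost partial string R of `query` (starting at 1-based position j)
    and continues beyond the end of `query`."""
    n = len(query)
    found = []
    for j in range(1, n + 1):
        r = query[j - 1:]  # partial string from the j-th character to the end
        for x in sorted(dictionary):
            if len(x) > len(r) and x.startswith(r):
                found.append((x, j, j + len(r)))
    return found


sample_dictionary = {
    "M", "/", "P",
    "/AM01", "/PM01", "/PM02", "/PM12",
    "1998-MAR",
}
extended = right_extended("1998-NOV-01/PM", sample_dictionary)
```

Here only the meta-symbols beginning with "/PM" are returned; "/AM01"
and "1998-MAR" contain "M" but do not begin with any rightmost
partial string of the question character string, so they are not
added to the covering result.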
[0157] In this example, p=1 and q=11, and when the elements
satisfying 1 < s ≤ 11 are determined from C, D1={(1/P, 11,
14)} is obtained as shown in *STEP5 in FIG. 35. Since D1 is not
empty, it is No at step 909, and the process advances to step 910.
From D1, the first element (1/P, 11, 14) is selected, and from the
meta-symbol appearance information of meta-symbol 1/P, all elements
in the format of (X, 10, *) are searched. The intermediate result A
is reduced to the portion it has in common with these elements, and
the result is stored in A1. Herein, * denotes an arbitrary value.
In this example, as shown in *STEP6 in FIG. 35, A is reduced to
three elements. Further, from D1, the element of D1 selected herein
(1/P, 11, 14) is deleted. Since A1 is not empty, it is No at step
911, and the process advances to step 912. Since u=14 is not larger
than n=14, it is judged No at step 912.
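The selection made at step 908 can be illustrated by a short sketch. The following Python fragment is only an illustration under stated assumptions: the function name next_elements and the two-element covering result are hypothetical, and the condition "s larger than p and not larger than q" is taken from the worked example (1/P starts at position 11 with q=11), not from the flowchart itself.

```python
def next_elements(cover, p, q):
    """Return the covering elements (M, s, e) whose start s satisfies p < s <= q."""
    return [(m, s, e) for (m, s, e) in cover if p < s <= q]


# Hypothetical covering result C holding only the two elements named in the text.
cover = [("1998-NOV-0", 1, 11), ("1/P", 11, 14)]
d1 = next_elements(cover, 1, 11)
print(d1)
```

Each element is a triple of meta-symbol, collating start position, and collating end position, with the end position pointing one character past the matched portion.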
[0158] At step 914, with A1, t=11, u=14 as arguments, the procedure
select_cover1 in FIG. 10 is fetched recursively, and the process is
continued, and the intermediate result of Ap is reduced gradually
as shown in FIG. 35 and FIG. 36. At *STEP18 in FIG. 36, since u=17
is larger than n=14, All is recorded as part of the final result SC,
and the retrieval process is further continued in order to search
other result. Thus, while generating the combination of covering
elements systematically from the covering result 304, the
meta-symbol appearance information in the symbol dictionary 301 is
retrieved, and the set of the symbol numbers commonly contained in
the generated sets of covering elements is determined, and recorded
in the set SC of the final result. After processing all
combinations of covering elements, the process is terminated at
*STEP20, and the SC at this time is the retrieval result.
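The narrowing of the intermediate result along one combination of covering elements may be sketched as follows. This is a simplified illustration, not the procedure of FIG. 9 and FIG. 10: the names build_appearance and prefix_match are hypothetical, the toy index records every occurrence of a meta-symbol (whereas the symbol dictionary records only the occurrences used to cover each symbol), and only a single chain of elements is followed instead of all combinations. Two of the three sample symbols are invented for the sketch.

```python
def build_appearance(symbols, meta_symbols):
    """Map each meta-symbol M to a set of (symbol number, L, R) triples,
    where L and R are the character counts to the left and right of M."""
    info = {m: set() for m in meta_symbols}
    for number, sym in enumerate(symbols):
        for m in meta_symbols:
            start = sym.find(m)
            while start != -1:
                info[m].add((number, start, len(sym) - start - len(m)))
                start = sym.find(m, start + 1)
    return info


def prefix_match(info, chain):
    """Intersect the symbols in which every element (M, s, e) of the chain
    appears with exactly s-1 characters to its left."""
    result = None
    for m, s, _e in chain:
        hits = {x for (x, left, _right) in info.get(m, ()) if left == s - 1}
        result = hits if result is None else result & hits
        if not result:
            return set()
    return result if result is not None else set()


symbols = [
    "1998-NOV-01/PM0830/NODA",  # hypothetical sample symbol
    "1998-NOV-01/AM0830/NODA",  # symbol quoted in the text
    "1998-NOV-02/PM0930/WADA",  # hypothetical sample symbol
]
info = build_appearance(symbols, ["1998-NOV-0", "1/P"])
found = prefix_match(info, [("1998-NOV-0", 1, 11), ("1/P", 11, 14)])
print(found)
```

The first element keeps all three symbols, since each starts with 1998-NOV-0, and the second element leaves only the symbol that has 1/P after ten characters.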
[0159] As explained herein, according to the retrieving method of
symbol dictionary in the third embodiment of the invention, as for
the partial character string appearing at high frequency among
symbol data, meta-symbol information having meta-symbols with
greater number of characters is compiled, and by using this
meta-symbol information, once the covering result is composed from
the question character string, and the retrieval is processed by
using this covering result, the covering result containing the
elements added by the right extended meta-symbol assessing means,
and the meta-symbol appearance information, and therefore even the
forward coincidence retrieval of symbol containing partial
character string appearing at high frequency can be done faster
than in the conventional retrieval of symbol dictionary.
[0160] (Embodiment 4)
[0161] A fourth embodiment of the invention is described below
while referring to the drawings. FIG. 4 is a block diagram showing
a general constitution of a symbol dictionary retrieving apparatus.
In FIG. 4, reference numeral 401 is a symbol dictionary storing
meta-symbol information and meta-symbol appearance information, 402
is retrieval condition input means for entering character string as
retrieval condition, 403 is question character string covering
means for finding the covering result by covering the question
character string which is the retrieval condition entered from the
retrieval condition input means 402 by the longest matchoverlapped
longest match word extraction method by using the symbol dictionary
401, 404 is the covering result determined by the question
character string covering means 403, 408 is left extended
meta-symbol assessing means for retrieving the meta-symbol
information in the symbol dictionary 401, finding all of the sets
of the number of the meta-symbol and the collating position of the
left extended meta-symbol (that is, the meta-symbol containing L in
the end portion) of the leftmost partial character string L of the
question character string, out of the extended meta-symbols of
meta-symbol Z (that is, meta-symbols containing Z) of covering
elements of which the collating start character position is 1 among
the covering result 404, and adding and storing to the covering
result 404, 405 is symbol number set assessing means for assessing
the set of symbol numbers coinciding backward with the question
character string, that is, containing the question character string
in the end portion, from the covering result 404 and the
meta-symbol appearance information of symbol dictionary 401, and
406 is retrieval result output means for issuing the symbol number
assessed by the symbol number assessing means 405 and others. The
constituent elements 401 to 404 in FIG. 4 correspond to the
constituent elements 301 to 304 in FIG. 3 which is the block
diagram of the third embodiment. In thus constituted symbol
dictionary retrieving apparatus, the operation is explained below
by referring to the drawings, relating to the example of symbol
dictionary presented in the first embodiment and an example of
simple retrieval condition.
[0162] FIG. 7 is a flowchart describing the procedure of process
for finding the covering result in the question character string
covering means 403, FIGS. 11 and 12 are flowcharts describing the
procedure of assessing process of symbol number set in the symbol
number set assessing means 405, and FIG. 37 is a conceptual diagram
describing principal intermediate data in the process of symbol
dictionary retrieval in the case of giving the condition of "Find
the set of symbol numbers coinciding backward with the question
character string KAWA" as the retrieval condition. To begin with,
at a first step of retrieving symbol dictionary, the question
character string covering means 403 retrieves the meta-symbol
information in the symbol dictionary 401, and finds the covering of
the question character string KAWA by the longest matchoverlapped
longest match word extraction method, and obtains the covering
result of C of *STEP1 in FIG. 37. The procedure of the covering
process is same as the procedure of the covering process in
embodiment 3. In the case of this question character string, since
the covering result 404 is not empty, processing at the left
extended meta-symbol assessing means 408 starts, but in the case of
absence of covering result, the process is stopped immediately, and
there is no retrieval result. Subsequently, the left extended
meta-symbol assessing means 408 retrieves the meta-symbol
information in the symbol dictionary 401, and finds the extended
meta-symbols in the meta-symbol Z of the covering element of which
collating start position is 1 (that is, meta-symbols of character
string containing Z) among the covering result 404. Of the obtained
extended meta-symbols, only the meta-symbol X of the left extended
meta-symbol (that is, the meta-symbol containing the character
string L in the end portion) of the j-th leftmost partial character
string L of the question character string (that is, the partial
character string from the first character to the j-th character in
the question character string) is selected, and
[0163] (X, j+1-|X|, j+1)
[0164] is added to the covering result 404. In this example,
Z=KAWA, and as its extended meta-symbols, nine types are
determined, that is,
[0165] [/SUKAWA], [0/KAWAD], [0/KAWAN], [0/KAWAY], [5/KAWAD],
[5/KAWAN], [5/KAWAY],[KAWA], [SUKAWA]
[0166] Out of them, two types of meta-symbols
[0167] [/SUKAWA], [SUKAWA]
[0168] as the left extended meta-symbols of the leftmost partial
string of the question character string KAWA are added to the
covering result 404 by the left extended meta-symbol assessing
means 408. This mode is shown in *STEP2 in FIG. 37. Thus, after
covering up to the left extended meta-symbols, the symbol number
set assessing means 405 determines the symbol number set. The
subsequent process conforms to FIG. 11 and FIG. 12. First, at step
1001, the set D composed of elements of which collating end
character position e is n is determined from the covering result.
In this example,
[0169] D={(KAWA, 1, 5), (/SUKAWA, -2, 5), (SUKAWA, -1, 5)}
[0170] The set SC of the final result is initialized to be empty.
Since D is not empty, the process advances to step 1003, and the
element (KAWA, 1, 5) is selected from D, and in the meta-symbol
appearance information in the symbol dictionary 401, all formats of
(X, *, 0) in the appearance symbol information of M=KAWA are
searched, and the set of the symbol number of the symbol X is
recorded as A. In FIG. 37, sets such as A, C, D are described by
using the symbol character string, instead of symbol number, for
the ease of reading. In this example, at *STEP4, the symbol number
of the symbol such as 1998-JAN-17/PM0930/NOKAWA is determined.
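How the left extended meta-symbols yield the elements of D can be sketched as below. This is an illustrative sketch only: the function name left_extended is hypothetical, and the start position is computed as j+1 minus the number of characters of X, which is the reading that reproduces the positions listed in D for this example.

```python
def left_extended(question, meta_symbols, z):
    """Collect covering elements (X, start, end) for every meta-symbol X that
    contains Z and ends with the leftmost j characters of the question."""
    out = set()
    for x in meta_symbols:
        if z not in x:
            continue  # X must be an extended meta-symbol of Z
        for j in range(1, len(question) + 1):
            if x.endswith(question[:j]):
                # X ends just after the j-th question character.
                out.add((x, j + 1 - len(x), j + 1))
    return out


# The nine extended meta-symbols of KAWA listed in the text.
metas = ["/SUKAWA", "0/KAWAD", "0/KAWAN", "0/KAWAY",
         "5/KAWAD", "5/KAWAN", "5/KAWAY", "KAWA", "SUKAWA"]
d = left_extended("KAWA", metas, "KAWA")
print(sorted(d))
```

A start position of 0 or less means that the meta-symbol begins before the first character of the question character string.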
[0171] Once A is determined, the element selected herein (KAWA, 1,
5) is deleted from D. As a result, the condition judging of "Is A
an empty set?" at step 1004 is No, and the process advances to step
1005. At step 1005, it is judged if the collating start position t
of the selected covering element is 1 or less, and if 1 or less,
the element of the set A at this moment is added to the set SC of
the final result, and if 2 or more, the procedure select_cover2 (A,
p, q) in FIG. 12 is fetched. In this case, since t=1, the element
of the set A at this moment is added to the set SC of the final
result, and to search other result, the retrieval process is
continued again. Thus, while generating the combination of covering
elements systematically from the covering result 404, the
meta-symbol appearance information in the symbol dictionary 401 is
retrieved, and the set of the symbol numbers commonly contained in
the generated sets of covering elements is determined, and recorded
in the set SC of the final result. After processing all
combinations of covering elements, the process is terminated at
*STEP6, and the SC at this time is the retrieval result. As
explained herein, according to the retrieving method of symbol
dictionary in the fourth embodiment of the invention, as for the
partial character string appearing at high frequency among symbol
data, meta-symbol information having meta-symbols with greater
number of characters is compiled, and by using this meta-symbol
information, once the covering result is composed from the question
character string, and the retrieval is processed by using this
covering result, the covering result containing the elements added
by the left extended meta-symbol assessing means, and the
meta-symbol appearance information, and therefore even the backward
coincidence retrieval of symbol containing partial character string
appearing at high frequency can be done faster than in the
conventional retrieval of symbol dictionary.
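The backward coincidence retrieval for this example may be sketched as follows. The sketch is a simplification under stated assumptions: the name suffix_search is hypothetical, a direct scan of the symbol strings stands in for the lookup of entries of the form (X, *, 0) in the meta-symbol appearance information, and elements whose collating start position is 2 or more, which would require the procedure select_cover2, are simply skipped. Two of the three sample symbols are invented for the sketch.

```python
def suffix_search(symbols, d):
    """Return the numbers of symbols that end with a covering element whose
    collating start position is 1 or less (it spans the whole question)."""
    sc = set()
    for m, t, _e in d:
        # Stand-in for collecting all (X, *, 0) entries of meta-symbol M.
        a = {number for number, sym in enumerate(symbols) if sym.endswith(m)}
        if a and t <= 1:
            sc |= a
    return sc


symbols = [
    "1998-JAN-17/PM0930/NOKAWA",  # symbol quoted in the text
    "1998-JAN-10/AM0830/KAWADA",  # hypothetical sample symbol
    "1998-FEB-02/AM1000/SUKAWA",  # hypothetical sample symbol
]
d = [("KAWA", 1, 5), ("/SUKAWA", -2, 5), ("SUKAWA", -1, 5)]
sc = suffix_search(symbols, d)
print(sc)
```

In this sketch every element of D starts at position 1 or less, so each set A is added to SC directly, as in the walk-through above.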
[0172] (Embodiment 5)
[0173] A fifth embodiment of the invention is described below while
referring to the drawings. FIG. 5 is a block diagram showing a
general constitution of a symbol dictionary retrieving apparatus.
In FIG. 5, reference numeral 501 is a symbol dictionary storing
meta-symbol information and meta-symbol appearance information, 502
is retrieval condition input means for entering character string as
retrieval condition, 503 is question character string covering
means for finding the covering result by covering the question
character string which is the retrieval condition entered from the
retrieval condition input means 502 by the longest matchoverlapped
longest match word extraction method by using the symbol dictionary
501, 504 is the covering result determined by the question
character string covering means 503, 507 is right extended
meta-symbol assessing means for retrieving the meta-symbol
information in the symbol dictionary 501, finding all of the sets
of the number of the meta-symbol and the collating position of the
right extended meta-symbol (that is, the meta-symbol containing R
in the beginning portion) of the rightmost partial character string
R of the question character string, out of the extended
meta-symbols of meta-symbol Z (that is, meta-symbols containing Z)
of covering elements largest in the collating start character
position among the covering result 504, and adding and storing to
the covering result 504, 508 is left extended meta-symbol assessing
means for retrieving the meta-symbol information in the symbol
dictionary 501, finding all of the sets of the number of the
meta-symbol and the collating position of the left extended
meta-symbol (that is, the meta-symbol containing L in the end
portion) of the leftmost partial character string L of the question
character string, out of the extended meta-symbols of meta-symbol Z
(that is, meta-symbols containing Z) of covering elements of which
the collating start character position is 1 among the covering
result 504, and adding and storing to the covering result 504, 509
is both extended meta-symbol assessing means for retrieving the
meta-symbol information in the symbol dictionary 501, retrieving
all of both extended meta-symbols of question character string
(that is, the meta-symbols containing the question character string
Q in the portion from the j-th character to the
j+|Q|-th character, where 1<j) X, adding and
storing elements (X, 1-j, 1-j+|X|) to the
covering result 504, 505 is symbol number set assessing means for
assessing the set of symbol numbers coinciding intermediately with
the question character string, that is, containing the question
character string, from the covering result 504 and the meta-symbol
appearance information of symbol dictionary 501, and 506 is
retrieval result output means for issuing the symbol number
assessed by the symbol number assessing means 505 and others. The
constituent elements 501 to 504 and 506 in FIG. 5 correspond to the
constituent elements 201 to 204 and 206 in FIG. 2 which is the
block diagram of the second embodiment, the constituent element 507
in FIG. 5 corresponds to the constituent element 307 in FIG. 3 of
the block diagram of the third embodiment, and the constituent
element 508 in FIG. 5 corresponds to the constituent element 408 in
FIG. 4 of the block diagram of the fourth embodiment. In thus
constituted symbol dictionary retrieving apparatus, the operation
is explained below by referring to the drawings, relating to the
example of symbol dictionary presented in the first embodiment and
an example of simple retrieval condition.
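The elements added by the both extended meta-symbol assessing means 509 may be sketched as follows. This is an illustration under assumptions: the function name both_extended is hypothetical, j is read as the number of characters of X that precede the question character string (a reading consistent with the positions used for the left extended meta-symbols), and the meta-symbol /KAWADA/ is invented for the sketch.

```python
def both_extended(question, meta_symbols):
    """Collect (X, 1-j, 1-j+|X|) for every meta-symbol X that contains the
    question with at least one extra character on each side, where j is the
    number of characters of X preceding the question (an assumption)."""
    out = set()
    for x in meta_symbols:
        j = x.find(question)
        while j != -1:
            if j >= 1 and j + len(question) < len(x):
                out.add((x, 1 - j, 1 - j + len(x)))
            j = x.find(question, j + 1)
    return out


# "/KAWADA/" is a hypothetical meta-symbol; the other two appear in the text.
extra = both_extended("KAWADA", ["/KAWADA/", "KAWA", "0/KAWAD"])
print(extra)
```

When no meta-symbol contains the whole question character string, the function returns an empty set and the covering result is left unchanged.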
[0174] FIG. 7 is a flowchart describing the procedure of process
for finding the covering result in the question character string
covering means 503, FIGS. 13 and 14 are flowcharts describing the
procedure of assessing process of symbol number set in the symbol
number set assessing means 505, and FIGS. 38 and 39 are conceptual
diagrams describing principal intermediate data in the process of
symbol dictionary retrieval in the case of giving the condition of
"Find the set of symbol numbers coinciding intermediately with the
question character string KAWADA" as the retrieval condition. To
begin with, at a first step of retrieving symbol dictionary, the
question character string covering means 503 retrieves the
meta-symbol information in the symbol dictionary 501, and finds the
covering of the question character string KAWADA by the longest
matchoverlapped longest match word extraction method, and obtains
the covering result of C of *STEP1 in FIG. 38. The procedure of the
covering process is same as the procedure of the covering process
in embodiment 3. In the case of this question character string,
since the covering result 504 is not empty, processing at the right
extended meta-symbol assessing means 507 starts, but in the case of
absence of covering result, the process is stopped immediately, and
there is no retrieval result. Subsequently, the right extended
meta-symbol assessing means 507 retrieves the meta-symbol
information in the symbol dictionary 501, and finds the extended
meta-symbols in the meta-symbol Z of the covering element largest
in the collating start position (that is, meta-symbols of character
string containing Z) among the covering result 504. Of the obtained
extended meta-symbols, only the meta-symbol X of the right extended
meta-symbol (that is, the meta-symbol containing the character
string R in the beginning portion) of the j-th rightmost partial
character string R of the question character string (that is, the
partial character string from the j-th character to the final
character in the question character string) is selected, and
[0175] (X, j, j+|X|)
[0176] is added to the covering result 504. In this example,
Z=WADA, and as its extended meta-symbols, only one type is
determined, that is, WADA. Since it is also the right extended
meta-symbol of the rightmost partial string of the question
character string KAWADA, the right extended meta-symbol assessing
means 507 adds it to the covering result 504, but since the same
covering element is already contained in the covering result 504,
the covering result 504 is not changed. This mode is shown in
*STEP2 in FIG. 38. Subsequently, the left extended meta-symbol
assessing means 508 retrieves the meta-symbol information in the
symbol dictionary 501, and finds the extended meta-symbols in the
meta-symbol Z of the covering element of which collating start
position is 1 (that is, meta-symbols of character string containing
Z) among the covering result 504. Of the obtained extended
meta-symbols, only the meta-symbol X of the left extended
meta-symbol (that is, the meta-symbol containing the character
string L in the end portion) of the j-th leftmost partial character
string L of the question character string (that is, the partial
character string from the first character to the j-th character in
the question character string) is selected, and
[0177] (X, j+1-|X|, j+1)
[0178] is added to the covering result 504. In this example,
Z=KAWA, and as its extended meta-symbols, nine types are
determined, that is,
[0179] [/SUKAWA], [0/KAWAD], [0/KAWAN], [0/KAWAY],
[0180] [5/KAWAD], [5/KAWAN], [5/KAWAY],[KAWA], [SUKAWA]
[0181] Out of them, five types of meta-symbols
[0182] [/SUKAWA], [0/KAWAD], [5/KAWAD],
[0183] [KAWA], [SUKAWA]
[0184] as the left extended meta-symbols of the leftmost partial
string of the question character string KAWADA are added to the
covering result 504 by the left extended meta-symbol assessing
means 508. This mode is shown in *STEP3 in FIG. 38. Next, the both
extended meta-symbol assessing means 509 retrieves the meta-symbol
information in the symbol dictionary 501, retrieves all of both
extended meta-symbols of question character string KAWADA (that is,
the meta-symbols containing the question character string KAWADA in
the portion of j+6 characters from the j-th character, where
1<j) X, and adds elements (X, 1-j, 1-j+|X|) to
the covering result 504. In the case of this example, meta-symbol
containing KAWADA is not present in the meta-symbol information in
the symbol dictionary 501, and nothing is added to the covering
result 504. This mode is shown in *STEP4 in FIG. 38. Thus, after
covering up to the right extended meta-symbols, left extended
meta-symbols, and both extended meta-symbols, the symbol number set
assessing means 505 determines the symbol number set. The
subsequent process conforms to FIG. 13 and FIG. 14. First, at step
1101, the set D composed of elements of which collating start
character position s is 1 or less is determined from the covering
result. In this example,
[0185] D={(KAWA, 1, 5), (/SUKAWA, -2, 5), (0/KAWAD, -1, 6),
[0186] (5/KAWAD, -1, 6), (SUKAWA, -1, 5)}
[0187] The set SC of the final result is initialized to be
empty.
[0188] This mode is shown in *STEP5 in FIG. 38. Since D is not
empty, the process advances to step 1103, and the element (KAWA,
1, 5) is selected from D, and in the meta-symbol appearance
information in the symbol dictionary 501, each appearance symbol
information (X, L, R) of M=KAWA is recorded as A by collecting the
sets of elements in the format of (X, L). In FIG. 38 and FIG. 39,
sets such as A, C, D are described by using the symbol character
string, instead of symbol number, for the ease of reading. In this
example, at *STEP6, elements such as 1998-JAN-17/PM0930/NOKAWA are
determined. Once A is determined, the element selected herein
(KAWA, 1, 5) is deleted from D. As a result, the condition judging
of "Is A an empty set?" at step 1104 is No, and the process
advances to step 1105. At step 1105, it is judged whether the
collating end position q of the selected covering element is greater
than n, and if greater than n, the element of the set A at this moment
is added to the set SC of the final result, and if not greater than n,
the procedure select_cover3 (A, p, q) in FIG. 14 is fetched. In this
case, since q=5, the procedure select_cover3 (A, 1, 5) is fetched.
At step 1108 in FIG. 14, all elements of which collating position
(s, e) is in the relation of 1<s≤5<e are selected from
C, and D1 is obtained. In this example, D1={(WADA, 3, 7)}. Since D1
is not empty, judgement at step 1109 is No, and the process
advances to step 1110. At step 1110, selecting the only one element
(WADA, 3, 7) from D1, (X, L-2) is recorded in B for each appearance
symbol information (X, L, R) of M=WADA, and the selected element is
deleted from D1. Further, A.andgate.B is calculated, but there is
no common part, and A1 is empty, and the judging result at step
1111 is Yes, and the process returns to step 1109. However, since
D1 is empty, the judging result at step 1109 is Yes, and it is
returned from select_cover3. This mode is shown in *STEP8 in FIG.
38. At step 1102 in FIG. 13, since D is not empty, the process
further advances to step 1103. At step 1103, selecting the element
(/SUKAWA, -2, 5) from D, A is calculated as in *STEP9, and advancing
to step 1107, the procedure select_cover3 (A, 1, 5) in FIG. 14 is
fetched again. At step 1108 in FIG. 14, all elements of which
collating position (s, e) is in the relation of
1<s≤5<e are selected from C, and D1 is obtained. In
this example, D1={(WADA, 3, 7)}. This mode is shown in *STEP10 in
FIG. 38. Since D1 is not empty, judgement at step 1109 is No, and
the process advances to step 1110. At step 1110, selecting the only
one element (WADA, 3, 7) from D1, (X, L-2) is recorded in B for
each appearance symbol information (X, L, R) of M=WADA, and the
selected element is deleted from D1. Further, A.andgate.B is
calculated, but there is no common part, and A1 is empty, and the
judging result at step 1111 is Yes, and the process returns to step
1109. However, since D1 is empty, the judging result at step 1109
is Yes, and it is returned from select_cover3. This mode is shown
in *STEP11 in FIG. 38. At step 1102 in FIG. 13, since D is not
empty, the process further advances to step 1103 again. At step
1103, selecting the element (0/KAWAD, -1, 6) from D, A is
calculated as in *STEP12, and advancing to step 1107, the
procedure select_cover3 (A, 1, 6) in FIG. 14 is fetched once more.
At step 1108 in FIG. 14, all elements of which collating position
(s, e) is in the relation of 1<s≤6<e are selected from
C, and D1 is obtained. In this example, D1={(WADA, 3, 7)}. This
mode is shown in *STEP13 in FIG. 39. Since D1 is not empty,
judgement at step 1109 is No, and the process advances to step
1110.
[0189] At step 1110, selecting the only one element (WADA, 3, 7) from
D1, (X, L-2) is recorded in B for each appearance symbol
information (X, L, R) of M=WADA, and the selected element is
deleted from D1. Further, A.andgate.B is calculated. This common
part A1 is not empty, and u=7 is larger than n=6, and the process
advances to step 1113, and A1 is added to SC as part of the final
result. Back to step 1109, since D1 is empty, it is returned from
select_cover3. This mode is shown in *STEP14 in FIG. 39. At step
1102 in FIG. 13, since D is not empty, the process further advances
to step 1103 again. At step 1103, selecting the element (5/KAWAD,
-1, 6) from D, A is calculated as in *STEP15, and advancing to step
1107, the procedure select_cover3 (A, 1, 6) in FIG. 14 is fetched
again. At step 1108 in FIG. 14, all elements of which collating
position (s, e) is in the relation of 1<s≤6<e are
selected from C, and D1 is obtained. In this example, D1={(WADA, 3,
7)}. This mode is shown in *STEP16 in FIG. 39. Since D1 is not
empty, judgement at step 1109 is No, and the process advances to
step 1110. At step 1110, selecting the only one element (WADA, 3,
7) from D1, (X, L-2) is recorded in B for each appearance symbol
information (X, L, R) of M=WADA, and the selected element is
deleted from D1. Further, A.andgate.B is calculated. This common
part A1 is not empty, and u=7 is larger than n=6, and the process
advances to step 1113, and A1 is added to SC as part of the final
result. Back to step 1109, since D1 is empty, it is returned from
select_cover3. This mode is shown in *STEP17 in FIG. 39. At step
1102 in FIG. 13, since D is not empty, the process further advances
to step 1103 again. At step 1103, selecting the element (SUKAWA,
-1, 5) from D, A is calculated as in *STEP18, and advancing to step
1107, the procedure select_cover3 (A, 1, 5) in FIG. 14 is fetched
once more. At step 1108 in FIG. 14, all elements of which collating
position (s, e) is in the relation of 1<s≤5<e are
selected from C, and D1 is obtained. In this example, D1={(WADA, 3,
7)}. This mode is shown in *STEP19 in FIG. 39. Since D1 is not
empty, judgement at step 1109 is No, and the process advances to
step 1110. At step 1110, selecting the only one element (WADA, 3,
7) from D1, (X, L-2) is recorded in B for each appearance symbol
information (X, L, R) of M=WADA, and the selected element is
deleted from D1. Further, A.andgate.B is calculated, but there is
no common part, and A1 is empty, and the judging result at step
1111 is Yes, and the process returns to step 1109. However, since
D1 is empty, the judging result at step 1109 is Yes, and it is
returned from select_cover3. This mode is shown in *STEP20 in FIG.
39. At step 1102 in FIG. 13, since D is empty, the assessing
process of symbol number set is terminated. At this moment, since
SC is holding all of the sets of the combinations of the infix
matching symbols (their numbers) and the number of characters of
the left side of the collating portion (the beginning side of
symbol), by picking up only a first element of each set, the
intermediate coincidence retrieval result is obtained. Thus, by
retrieving the meta-symbol appearance information in the symbol
dictionary 501 while generating the combinations of the covering
elements systematically from the covering result 504, the set of
the symbol numbers commonly contained in the set of the generated
covering element can be determined.
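The alignment of offsets used in the intermediate coincidence retrieval may be sketched as follows. The sketch is illustrative and rests on assumptions: the names build_appearance, aligned and the sample symbols other than 1998-JAN-17/PM0930/NOKAWA are hypothetical, and the toy index records every occurrence of a meta-symbol, so its contents differ from the appearance information of the actual symbol dictionary. For a covering element (M, s, e), each entry (X, L, R) of M is converted into the pair (X, L-(s-1)), that is, the symbol number and the number of characters to the left of the question character string; this matches the (X, L-2) recorded for (WADA, 3, 7).

```python
def build_appearance(symbols, meta_symbols):
    """Map each meta-symbol M to a set of (symbol number, L, R) triples."""
    info = {m: set() for m in meta_symbols}
    for number, sym in enumerate(symbols):
        for m in meta_symbols:
            start = sym.find(m)
            while start != -1:
                info[m].add((number, start, len(sym) - start - len(m)))
                start = sym.find(m, start + 1)
    return info


def aligned(info, m, s):
    """Pairs (X, characters to the left of the question) for element (M, s, e)."""
    return {(x, left - (s - 1)) for (x, left, _right) in info.get(m, ())}


symbols = [
    "1998-JAN-10/KAWADA",         # hypothetical sample symbol
    "1998-JAN-17/PM0930/NOKAWA",  # symbol quoted in the text
    "1998-MAR-03/WADA",           # hypothetical sample symbol
]
info = build_appearance(symbols, ["0/KAWAD", "WADA"])

a = aligned(info, "0/KAWAD", -1)   # from element (0/KAWAD, -1, 6)
b = aligned(info, "WADA", 3)       # from element (WADA, 3, 7)
pairs = a & b                      # common part A1
matched = {x for (x, _left) in pairs}
print(pairs, matched)
```

Taking only the first component of each pair corresponds to picking up the first element of each set held in SC to obtain the retrieval result.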
[0190] As explained herein, according to the retrieving method of
symbol dictionary in the fifth embodiment of the invention, as for
the partial character string appearing at high frequency among
symbol data, meta-symbol information having meta-symbols with
greater number of characters is compiled, and by using this
meta-symbol information, once the covering result is composed from
the question character string, and the retrieval is processed by
using this covering result, and the covering result containing the
elements added by three means, that is, the right extended
meta-symbol assessing means, the left extended meta-symbol
assessing means, and the both extended meta-symbol assessing means,
and therefore even the intermediate coincidence retrieval of symbol
containing partial character string appearing at high frequency can
be done faster than in the conventional retrieval of symbol
dictionary.
[0191] As explained in the five embodiments of the invention
herein, according to the symbol dictionary compiling method and
symbol dictionary retrieving method of the invention,
[0192] (1) by compiling automatically a meta-symbol dictionary
collecting shorter symbols called "meta-symbols" for covering
symbols in symbol data, covering each symbol in the symbol data by
the meta-symbols in this meta-symbol dictionary, and compiling
meta-symbol appearance information recording the information
showing how each symbol is covered in every meta-symbol, and
[0193] (2) retrieving the question character string by using the
meta-symbol dictionary contained in the symbol dictionary, covering
with the meta-symbols, adding the retrieval results of left, right
and both extended meta-symbols to the covering result, and
determining the symbol number set contained commonly in every
element set in the covering result covering the question character
string or its left, right and both extended character strings, the
following problems in the conventional symbol dictionary compiling
method and symbol dictionary retrieving method can be solved, that
is:
[0194] 1) The symbol dictionary file to be created is more than
twice as much as the symbol data to be retrieved, and it is hard to
realize if the usable capacity of the memory device is limited.
[0195] 2) If the character string is long, and when retrieving
symbols containing characters or character chain of high frequency
of appearance, the quantity of data to be retrieved from the symbol
dictionary is large, and the retrieval speed is lowered.
[0196] 3) In the method of using character chain, if the number of
character chains N is increased, the types of N character chains to
appear increase suddenly, and it is hard to compile symbol
dictionary, and the capacity of the compiled symbol dictionary is
increased.
[0197] Thus, although difficult in the conventional symbol
dictionary compiling and retrieving technology, high speed
retrieval is possible including up to infix matching, and a symbol
dictionary of small capacity can be compiled, and even in the
application where the complete matching occupies the majority of
questions, the symbol retrieval is possible without lowering the
average retrieval speed, so that tremendous effects are obtained
practically.
[0198] In the foregoing five embodiments, as character sets,
alpha-numerics and special symbols are used, but the same effects
are obtained in the character sets adding Chinese characters and
Greek alphabet, too. In the first embodiment, prior to compilation
of symbol dictionary, a meta-symbol dictionary composed of one
character only as shown in FIG. 16 is prepared, but in addition to
the content in FIG. 16, a meta-symbol dictionary containing
meta-symbols of two or more characters of which appearance can be
predicted, such as "1998-" and "AM" can be prepared, and in this
case, too, the symbol dictionary can be compiled in the same
procedure as explained above. As the storing data structure of
meta-symbol information, TRIE structure and table structure are
shown, but if using other data structure, such as finite state
machine, PATRICIA tree, or hash table, it is possible to execute in
the same procedure as explained above. The storing format of the
meta-symbol appearance information is not limited to the table, but
by using TRIE, hash table or other data structure, it is possible
to execute in the same procedure as explained above.
[0199] In symbol dictionary retrieval, for the convenience of
explanation, the covering result is expressed by using the set, but
by using linked list, heap, tree structure, hash table or other
data structure, it is possible to execute in the same procedure as
explained above.
[0200] Thus, in the symbol dictionary compiling method of the
invention, for compiling a machine-retrievable symbol dictionary of
symbol data by complete matching, prefix matching, postfix matching
or infix matching, a meta-symbol dictionary collecting shorter
symbols called "meta-symbols" for covering the symbol in the symbol
data is compiled automatically, and each symbol in the symbol data
is covered with the meta-symbol in this meta-symbol dictionary, and
the information showing how each symbol is covered is obtained by
preparing meta-symbol appearance information recorded in each
meta-symbol, and therefore high speed retrieval including up to
infix matching is achieved, and the size of compiled symbol
dictionary can be reduced, thereby bringing about outstanding
effects.
[0201] Also, in the symbol dictionary retrieving method of the
invention, for machine-retrieving of a symbol dictionary by complete
matching, prefix matching, postfix matching or infix matching, a
question character string is covered with meta-symbols by
retrieving the meta-symbol dictionary contained in the symbol
dictionary, and retrieval results of left, right and both extended
meta-symbols are added to this covering result. By seeking the
symbol number set commonly contained in every element set of the
covering results that cover the question character string or its
left, right and both extended character strings, high speed
retrieval is possible up to infix matching by using a symbol
dictionary of small capacity. Moreover, in an application where
complete matching occupies the majority of questions, symbol
retrieval is possible without lowering the average retrieval speed,
thereby bringing about outstanding effects.
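The step of seeking the commonly contained symbol number set can be
sketched in Python for the simplest case, complete matching. This is a
minimal sketch under stated assumptions: the appearance table, the
symbol lengths and the function name retrieve_complete are
hypothetical, the covering is a greedy longest-match partition, and
the left, right and both extended meta-symbols needed for prefix,
postfix and infix matching are not handled.

```python
# Hypothetical appearance information for the symbols "1998-01" (number 0)
# and "1998-02" (number 1): meta-symbol -> list of (symbol number, position).
APPEARANCE = {
    "1998-": [(0, 0), (1, 0)],
    "0": [(0, 5), (1, 5)],
    "1": [(0, 6)],
    "2": [(1, 6)],
}
LENGTHS = [7, 7]  # length of each symbol, indexed by symbol number


def retrieve_complete(question, appearance, lengths):
    """Return the set of symbol numbers whose symbol equals `question`."""
    if not question:
        return set()
    max_len = max(len(m) for m in appearance)
    pos, candidates = 0, None
    while pos < len(question):
        # Cover the question with the longest meta-symbol at this position.
        for length in range(min(max_len, len(question) - pos), 0, -1):
            piece = question[pos:pos + length]
            if piece in appearance:
                hits = {num for num, p in appearance[piece] if p == pos}
                # Keep only symbol numbers common to every covering element.
                candidates = hits if candidates is None else candidates & hits
                pos += length
                break
        else:
            return set()  # some part of the question cannot be covered
        if not candidates:
            return set()
    # A complete match must also have the same length as the question.
    return {n for n in candidates if lengths[n] == len(question)}
```

Only the appearance information of the meta-symbols that cover the
question is consulted, so the work does not grow with the number of
symbols that merely share a common substring with the question.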
[0202] The effects of the invention appear very clearly when
compiling and retrieving a symbol dictionary from large-scale symbol
data having a deviated distribution, in which the symbol data to be
retrieved contain more than tens of thousands of kinds of symbols,
each symbol has a great number of characters, and there are partial
character strings commonly included in many symbols.
For example, in an experiment of compiling a symbol dictionary from
symbol data containing 1 million symbols in which each symbol is a
100-digit numeral, all symbols are equal in the upper 90 digits and
all symbols are different in the lower 10 digits, in the
conventional symbol dictionary of n-gram system, at least 100
million symbol numbers, appearance character position information,
and information for character linking are needed, and the size is
more than 400 megabytes, but in the symbol dictionary compiled by
the symbol dictionary compiling method of the invention, only about
50000 kinds of meta-symbol information and about 4 million pieces of
meta-symbol appearance information need to be recorded, the required
size is smaller than 40 megabytes, and the capacity is less than 1/10
of that of the conventional system. Moreover, in the case of
retrieval by complete matching, in
the symbol dictionary of the conventional n-gram system, unless the
number of links n of character linking is 41 or more, the
intermediate result of the higher 40 digits is always 1 million
symbols long, and the retrieval speed is substantially lowered, but
in the retrieval of symbol dictionary of the invention, a
meta-symbol dictionary containing meta-symbols of 40+α digits
suited to deviation of distribution of symbol data is created
automatically, and the symbol number is searched only by referring
to the appearance information of the meta-symbol relating to the
question character string, so that retrieval of an extremely high
speed is realized. Thus, symbol data having a deviated distribution,
which was conventionally hard to handle, can be retrieved at high
speed including up to infix matching, and outstanding effects are
obtained in practice.
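The figures quoted in the experiment above can be checked with simple
arithmetic, sketched here in Python. All quantities are taken from the
text; the only assumption is that one megabyte is counted as 10**6
bytes, and the sizes are the stated bounds (more than 400 megabytes,
smaller than 40 megabytes) rather than measured values.

```python
SYMBOLS = 1_000_000            # symbols in the symbol data
DIGITS = 100                   # characters per symbol
APPEARANCE_RECORDS = 4_000_000 # pieces of meta-symbol appearance information

MEGABYTE = 10**6               # assumption: decimal megabyte
NGRAM_BYTES = 400 * MEGABYTE   # stated lower bound for the n-gram dictionary
INVENTION_BYTES = 40 * MEGABYTE  # stated upper bound for the invention

# One entry per character position of every symbol in the n-gram system.
ngram_entries = SYMBOLS * DIGITS

# Implied storage per n-gram entry at the stated 400 megabyte bound.
bytes_per_ngram_entry = NGRAM_BYTES / ngram_entries

# Appearance records per symbol under the invention, versus DIGITS entries
# per symbol in the n-gram system.
records_per_symbol = APPEARANCE_RECORDS / SYMBOLS

# Ratio of the two stated size bounds.
size_ratio = INVENTION_BYTES / NGRAM_BYTES
```

Under these figures the n-gram system holds 100 entries per symbol
while the invention holds about 4 appearance records per symbol, which
is consistent with the stated capacity of less than 1/10.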
* * * * *