U.S. patent number 10,922,347 [Application Number 15/395,778] was granted by the patent office on 2021-02-16 for hierarchical dictionary with statistical filtering based on word frequency.
This patent grant is currently assigned to HYLAND SWITZERLAND SARL. The grantee listed for this patent is HYLAND SWITZERLAND SARL. Invention is credited to Johannes Hausmann, Ralph Meier, Harry Urbschat, Thorsten Wanschura.
United States Patent 10,922,347
Meier, et al.
February 16, 2021
(Please see images for: Certificate of Correction)
Hierarchical dictionary with statistical filtering based on word frequency
Abstract
A hierarchical dictionary having methods of storing words based
on frequency thereof in one or more documents which includes the
steps of identifying a hash value corresponding to an inputted
word; storing the word in a first hash map and in a second hash map
having a substantially larger word storage capacity than the first
hash map based on the identified hash value; clearing the first
hash map at every predetermined period or triggering event;
determining whether a frequency of the word as stored in the second
hash map exceeds a predetermined value; and if so, promoting the
word from the second hash map to a third hash map having a
substantially larger word storage capacity than the second hash map
for long-term storage and later retrieval.
Inventors: Meier; Ralph (Rastede, DE), Hausmann; Johannes (Corcelles, CH), Urbschat; Harry (Oldenburg, DE), Wanschura; Thorsten (Oldenburg, DE)
Applicant:
Name | City | State | Country | Type
HYLAND SWITZERLAND SARL | Geneva | N/A | CH |
Assignee: HYLAND SWITZERLAND SARL (Geneva, CH)
Family ID: 59386822
Appl. No.: 15/395,778
Filed: December 30, 2016
Prior Publication Data
Document Identifier | Publication Date
US 20170220679 A1 | Aug 3, 2017
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number | Issue Date
62288032 | Jan 28, 2016 | |
Current U.S. Class: 1/1
Current CPC Class: G06F 16/335 (20190101); G06F 16/36 (20190101); G06F 16/313 (20190101)
Current International Class: G06F 16/36 (20190101); G06F 16/31 (20190101); G06F 16/335 (20190101)
Field of Search: ;707/750
References Cited
Other References
Teng, Wei-Guang, Ming-Syan Chen, and S. Yu Philip. "A
regression-based temporal pattern mining scheme for data streams."
Proceedings 2003 VLDB Conference. Morgan Kaufmann, 2003. (Year:
2003). cited by examiner.
Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream
statistics over sliding windows. SIAM journal on computing.
2002;31(6): 1794-813. (Year: 2002). cited by examiner.
Primary Examiner: Almani; Mohsen
Attorney, Agent or Firm: Medley, Behrens & Lewis, LLC
Parent Case Text
CROSS REFERENCE TO RELATED APPLICATIONS
This patent application claims the benefit of the earlier filing
date of U.S. Patent Application Ser. No. 62/288,032, entitled
"Hierarchical Dictionary with Statistical Filtering Used for
Automatic Online Extraction Value Validation", filed Jan. 28, 2016,
the content of which is hereby incorporated by reference herein in
its entirety.
Claims
What is claimed is:
1. A method for organizing a plurality of words associated with a
document, comprising: inputting, by a processor of a computing
device, each of the plurality of words associated with the document
to a memory coupled to the computing device, the memory including
instructions to perform: for each of the plurality of words:
identifying a hash value corresponding to a word and determining
whether a bucket associated with the hash value in a first hash map
is available, the first hash map having a first word capacity;
based upon the determination, storing the word and updating a
frequency of the word in the first hash map; and in a second hash
map, storing the word and ranking the word relative to each other
of the plurality of words based on the frequency, the second hash
map having a second word capacity, the second word capacity being
greater than the first word capacity; and after a predetermined
period of time has elapsed, transferring a portion of the plurality
of words based upon the ranking to a third hash map, the third hash
map having a third word capacity, the third word capacity being
greater than the second word capacity.
2. The method of claim 1, further comprising: clearing a set of
words stored in the first hash map following a triggering
event.
3. The method of claim 1, wherein the plurality of words associated
with the document is found in one of the first hash map or the
third hash map, and wherein the second hash map includes a data
structure hidden from a user of the computing device, the data
structure having a statistical filter based upon the inputted
plurality of words.
4. The method of claim 1, wherein the transferring includes
determining a predetermined limit for promoting words from the
second hash map to the third hash map, and wherein the portion of
the plurality of words transferred to the third hash map are a set
of words greater than the predetermined limit.
5. The method of claim 1, wherein the first hash map, the second
hash map, and the third hash map store the portion of the plurality
of words for a first time period, a second time period, and a third
time period, respectively, and further wherein the third time
period is greater than the second time period, and still further
wherein the second time period is greater than the first time
period.
6. A method of storing words from one or more documents based on
frequency, comprising: by at least one processor of a computing
device, inputting a set of words from the one or more documents to
a first hash map and a second hash map, wherein the first hash map
has a first word capacity and the second hash map has a second word
capacity, wherein the second word capacity is greater than the
first word capacity; for each of the set of words: identifying a
hash value corresponding to a word; in the first hash map and the
second hash map, determining whether a bucket associated with the
hash value is empty; upon a determination that the bucket is empty,
setting a frequency of the word to a first value and storing the
word and the frequency of the word in the bucket; and upon a
determination that the bucket is not empty, updating the frequency
of the word in the bucket to a second value; and determining, in
the second hash map and after a predetermined period of time has
elapsed, whether the frequency of at least one of the set of words
is equal to or greater than a predetermined limit; and upon a
positive determination that the frequency of the at least one of
the set of words is equal to or greater than the predetermined
limit, promoting each of the at least one of the set of words to a
third hash map, wherein the third hash map has a third word
capacity, wherein the third word capacity is greater than the
second word capacity, wherein a first word in the set of words has
a frequency in the one or more documents that is less than the
predetermined limit is stored in the second hash map and a second
word in the set of words has a frequency in the one or more
documents that is equal to or greater than the
predetermined limit is stored in the third hash map, the second
hash map keeping respective frequencies of the inputted set of
words and including program instructions for performing the
promoting from the first hash map to the third hash map.
7. The method of claim 6, wherein the first hash map, the second
hash map, and the third hash map occupy a memory that is at least
one of internal or external to the computing device.
8. The method of claim 6, wherein upon a negative determination
that the frequency of the at least one of the set of words is equal
to or greater than the predetermined limit, retaining storage
of the at least one of the set of words in the first hash map or
the second hash map.
9. The method of claim 6, wherein the second hash map includes a
data structure hidden from a user of the computing device, the data
structure having a statistical filter based upon the inputted set
of words.
10. The method of claim 6, wherein the determining whether the
frequency of the at least one of the set of words is equal to or
greater than the predetermined limit is performed following the
updating the frequency of the word in the bucket to the second
value.
11. The method of claim 6, wherein the promoting each of the at
least one of the set of words to the third hash map includes
copying the word to the third hash map.
12. The method of claim 6, further comprising: clearing a word
stored in the second hash map following promotion of the word to
the third hash map.
13. The method of claim 6, further comprising: ranking in the
second hash map the set of words based upon respective frequencies
in the one or more documents.
14. The method of claim 6, further comprising: initializing the
first hash map, the second hash map, and the third hash map to a
first predetermined size, a second predetermined size, and a third
predetermined size, respectively, based upon a size of the one or
more documents.
15. The method of claim 6, further comprising: receiving a search
request for the word; identifying a hash value corresponding to the
word; using the hash value, determining whether the hash value has
an associated entry in the third hash map; upon a positive
determination, responding to the search request with the word; and
upon a negative determination, determining whether the hash value
has an associated entry in the first hash map and sending a
notification based on the determination.
16. The method of claim 1, wherein each of the first hash map, the
second hash map, and the third hash map are independent.
17. A non-transitory computer-readable storage medium having stored
therein a first data structure for storing a first set of words for
a predetermined period of time, a second data structure for storing
the first set of words and for creating a statistical filter based
on the first set of words, and a third data structure for storing a
second set of words based on the first set of words as filtered by
the second data structure, the medium further including
instructions for performing acts comprising: receiving a plurality
of words associated with a document; determining a unique
identifier corresponding to each of the plurality of words; in the
first data structure and the second data structure, storing the
plurality of words to corresponding buckets associated with the
respective unique identifiers; using the second data structure,
ranking the plurality of words based on frequency of usage in the
document; and after the predetermined period of time has elapsed,
transferring a portion of the plurality of words greater than a
predetermined frequency limit from the first data structure to the
third data structure, wherein the third data structure stores words
for a period of time, the period of time being longer than the
predetermined period of time, and wherein the second data structure
is hidden from user access.
18. The non-transitory computer-readable storage medium of claim
17, wherein the document includes a combination of text characters
and non-text characters, and wherein the receiving the plurality of
words includes determining numeric values corresponding to the
non-text characters.
19. The non-transitory computer-readable storage medium of claim
17, further comprising a fourth data structure and a fifth data
structure extending from the first data structure and the third
data structure, respectively, wherein the fourth data structure
stores words for a second period of time, the second period of time
being shorter than the period of time, and the fifth data structure
stores words for a third period of time, the third period of time
being greater than the period of time.
20. The non-transitory computer-readable storage medium of claim
17, wherein the second data structure includes one or more
additional filtering characteristics for promoting words from the
first data structure to the third data structure.
Description
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
None.
REFERENCE TO SEQUENTIAL LISTING, ETC
None.
BACKGROUND
1. Technical Field
The present disclosure pertains to a dictionary having methods for
storing words, and more particularly, to a hierarchical dictionary
generally having short, medium, and long-term storage layers as
filtered based on frequency.
2. Description of the Related Art
Humans have an implicit ability to spot errors, i.e., misspellings,
within text despite the fact that they do not explicitly know all
words possible within specific documents or might read a word or a
phrase for the first time. For example, within the phrase "PHYSICS
EDU POLE VLT" a human reader can spot the mixture of two words:
"Physics Education" and "Pole Vault". A well-grounded understanding
of words is typically formed by learning and exposure.
In creating dictionaries, words are often assigned to a particular
unique identifier. These types of dictionaries, however, not only
take up a substantial amount of memory as more words are added
over time but also lack meaning, as they are incapable of giving
users a view of how words are used in processed documents.
Accordingly, there is a need for a system and methods of storing
words into a dictionary which mimics a human brain's capability of
storing words on a short- or long-term basis depending on the number
of times a word has been used.
SUMMARY
A system and methods for organizing a set of words associated with
one or more documents based on frequency are disclosed.
A hierarchical dictionary stored in a memory and communicatively
coupled to one or more applications in a computing device may
include a first layer of data structure for storing a first set of
words associated with a portion of a document, a second layer of
data structure for storing a second set of words including the
first set of words and corresponding frequencies thereof in the
document, and a third layer of data structure for storing a third
set of words from the second set of words exceeding a predetermined
frequency limit. All of the first, second, and third layers of data
structure may be implemented as hash maps and may be treated as
independent dictionaries.
The first set of words stored in the first data structure may be
wiped clean following a predetermined period or a triggering
event. The second data structure acts as a filter for promoting a
set of words from the first data structure exceeding a
predetermined frequency limit to the third data structure or for
retaining the set of words therein. The third data structure, when
receiving words from the second data structure, may store words for
a substantially longer period of time in the memory coupled to or
integral with the computing device than the first and second data
structures do.
In one example embodiment, a method for storing words associated
with a document includes: identifying a hash value associated with
each word; storing in the first and second hash maps the word to a
bucket position associated with the identified hash value;
following a predetermined period of time, determining whether a
frequency of the word exceeded a predetermined frequency limit; and
promoting the word to a next layer of data structure upon a
positive determination that the predetermined frequency limit for
the word has been exceeded.
Other embodiments, objects, features and advantages of the
disclosure will become apparent to those skilled in the art from
the detailed description, the accompanying drawings and the
appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The above-mentioned and other features and advantages of the
present disclosure, and the manner of attaining them, will become
more apparent and will be better understood by reference to the
following description of example embodiments taken in conjunction
with the accompanying drawings. Like reference numerals are used to
indicate the same element throughout the specification.
FIG. 1 is a system including a hierarchical dictionary for storing
a set of words from one or more documents, according to an example
embodiment.
FIG. 2 is a schematic diagram showing a generic set of steps for
inserting or searching a word in the hierarchical dictionary in
FIG. 1.
FIG. 3 is a flowchart detailing the steps of inserting a word to
the hierarchical dictionary of FIG. 1, according to an example
embodiment.
FIG. 4 is a flowchart detailing the steps of searching a word
within the hierarchical dictionary of FIG. 1, according to an
example embodiment.
DETAILED DESCRIPTION OF THE DRAWINGS
It is to be understood that the disclosure is not limited to the
details of construction and the arrangement of components set forth
in the following description or illustrated in the drawings. The
disclosure is capable of other example embodiments and of being
practiced or of being carried out in various ways. For example,
other example embodiments may incorporate structural,
chronological, process, and other changes. Examples merely typify
possible variations. Individual components and functions are
optional unless explicitly required, and the sequence of operations
may vary. Portions and features of some example embodiments may be
included in or substituted for those of others. The scope of the
disclosure encompasses the appended claims and all available
equivalents. The following description is, therefore, not to be
taken in a limiting sense, and the scope of the present disclosure
is defined by the appended claims.
Also, it is to be understood that the phraseology and terminology
used herein is for the purpose of description and should not be
regarded as limiting. The use herein of "including", "comprising",
or "having" and variations thereof is meant to encompass the items
listed thereafter and equivalents thereof as well as additional
items. Further, the use of the terms "a" and "an" herein does not
denote a limitation of quantity but rather denote the presence of
at least one of the referenced item.
In addition, it should be understood that example embodiments of
the disclosure include both hardware and electronic components or
modules that, for purposes of discussion, may be illustrated and
described as if the majority of the components were implemented
solely in hardware.
It will be further understood that each block of the diagrams, and
combinations of blocks in the diagrams, respectively, may be
implemented by computer program instructions. These computer
program instructions may be loaded onto a general purpose computer,
special purpose computer, or other programmable data processing
apparatus to produce a machine, such that the instructions which
execute on the computer or other data processing apparatus may
create means for implementing the functionality of each block or
combinations of blocks in the diagrams discussed in detail in the
description below.
These computer program instructions may also be stored in a
non-transitory computer-readable medium that may direct a computer
or other programmable data processing apparatus to function in a
particular manner, such that the instructions stored in the
computer-readable medium may produce an article of manufacture,
including an instruction means that implements the function
specified in the block or blocks. The computer program instructions
may also be loaded onto a computer or other programmable data
processing apparatus to cause a series of operational steps to be
performed on the computer or other programmable apparatus to
produce a computer implemented process such that the instructions
that execute on the computer or other programmable apparatus
implement the functions specified in the block or blocks.
Accordingly, blocks of the diagrams support combinations of means
for performing the specified functions, combinations of steps for
performing the specified functions and program instruction means
for performing the specified functions. It will also be understood
that each block of the diagrams, and combinations of blocks in the
diagrams, can be implemented by special purpose hardware-based
computer systems that perform the specified functions or steps, or
combinations of special purpose hardware and computer
instructions.
Disclosed are a hierarchical dictionary and methods for organizing
a set of words based upon a frequency thereof in a document. The
hierarchical dictionary includes short term, medium term, and long
term dictionaries and includes instructions for performing methods
where the propagation of words as inputted from the short term
dictionary towards the long term dictionary via the medium term
dictionary is controlled by word frequency and insertion over time,
as will be discussed in greater detail below.
It is to be noted that the terms "dictionary" and "word" do not
limit the content that can be inserted and searched for to text
content. The "dictionary" referred to herein includes functions
that are the same as that of normal dictionaries, such as, for
example, insertion and removal of words, getting the relative
frequencies of stored words, word lookup, and the like. Also, a
"word" may refer to other forms of data, such as, but not limited
to phrases, images, sounds, and other forms which can be
represented in a data type that is implemented within the
dictionary. Other types of data format in a document besides text
which can be stored and searched for in a dictionary may be
apparent in the art.
FIG. 1 shows one example embodiment of a system 100 including a
hierarchical dictionary 105 for storing a word 110 from one or more
documents 115. System 100 further includes a computing device 120
including at least one processor 125 and a program interface 130.
While shown as a separate entity, hierarchical dictionary 105 may
be stored in a computer-readable storage medium 135 remotely
located from computing device 120, in a memory of computing device
120 (not shown), or a combination of both, provided that it is
communicatively coupled to processor 125. Hierarchical dictionary
105 includes a short term layer 142, a medium term layer 144, and a
long term layer 146. Respective word storage capacities of short
term, medium term, and long term layers 142, 144, 146 vary based upon
a size of data to be processed, i.e., one or more documents 115. In
FIG. 1, when any word 110 is entered by a user via program
interface 130 for storage or lookup, program interface 130
communicates with processor 125 for the processor to communicate
with hierarchical dictionary 105. The number of layers in
hierarchical dictionary 105 is not limited to three, as shown. In
other example embodiments, additional intermediate filtering layers
with different sizes and parameters besides medium term layer 144
may be desired. Also, while word 110 is shown as being tied to
one or more documents 115, it will be apparent in the art that word
110 may be standalone and need not necessarily be related to any
document 115. Combinations and permutations for the elements in
system 100 and other components of computing device 120 may be
apparent in the art.
Connections between the aforementioned elements in FIG. 1 depicted
by the arrows may be performed in a shared data bus of computing
device 120. Alternatively, the connections may be through a network
that is capable of allowing communications between two or more
remote computing systems, as discussed herein, and/or available or
known at the time of the filing, and/or as developed after the time
of filing. The network may be, for example, a communications
network or network/communications network system such as, but not
limited to, a peer-to-peer network, a Local Area Network (LAN), a
Wide Area Network (WAN), a public network such as the Internet, a
private network, a cellular network, and/or a combination of the
foregoing. The network may further be a wireless, a wired, and/or a
wireless and wired combination network.
In FIG. 1, hierarchical dictionary 105 may be stored on
computer-readable storage medium 135 and include a set of
instructions from processor 125 for receiving and performing
methods using word 110. In particular, hierarchical dictionary 105
includes program instructions for performing a method for
organizing a set of words 110 based upon relative frequencies
thereof (insert method, FIG. 3) and a method for searching words
(lookup method, FIG. 4). While independent in structure and
operation, short term, medium term, and long term layers 142, 144,
and 146 (collectively referred to as SML layers herein) of
hierarchical dictionary 105 are communicatively connected to one
another via medium term layer 144. Specifically, short term layer 142
is communicatively connected to long term layer 146 and vice-versa
via medium term layer 144. In this manner, medium term layer 144
acts as a filter.
Hierarchical dictionary 105 may be a module or a functional unit
for installation onto a computing device and/or for integration to
an application such as program interface 130. Short term
layer 142, medium term layer 144, and long term layer 146, which
are also referred to herein as S-layer 142, M-layer 144, and
L-layer 146, respectively, may each be implemented as a fixed size
hash map, with L-layer 146 having the largest word storage
capacity, as will be detailed below with respect to FIG. 2.
Other types of data structures besides hash maps may be apparent in
the art.
S-layer 142 includes instructions for storing a relatively small
chunk of data within and/or relating to document 115 (e.g., on the
order of the number of words in the text of one page, paragraph, or
document). M-layer 144 includes instructions for storing a set of
words that are relatively more frequent. In the present disclosure,
M-layer 144 further includes instructions for gathering statistics
which may be associated, for example, with the usage frequency of
word 110 in document 115. Being a statistical filter, M-layer 144
further includes instructions for propagating or transferring word
110 from S-layer 142 to L-layer 146 and for removing words stored
therein, as will be discussed in greater detail below. L-layer 146
includes instructions for receiving words from M-layer 144 and for
storing word 110 for a relatively longer period of time.
In S-layer 142, word 110 and/or other data relating to document 115
may be stored temporarily. In one aspect, words 110 that are stored
in S-layer 142 may be wiped clean by a triggering event, such as,
for example, when a new document, paragraph, or page is being
processed. A hash map for M-layer 144 may be augmented with a
predecessor and a successor in the sense of a doubly linked list
for keeping track of the youngest and oldest words that it stores.
The data structure in L-layer 146 may include a tree. For purposes of
illustration and not by limitation, the general steps for the
insertion and lookup method are shown in FIG. 2.
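The ordering idea for M-layer 144 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the class name MediumTermLayer, its methods, and the policy of evicting the oldest word when the fixed capacity is reached are hypothetical and are not specified in the disclosure. Python's OrderedDict stands in for a hash map augmented with predecessor and successor links, since it preserves insertion order.

```python
from collections import OrderedDict


class MediumTermLayer:
    """Hypothetical fixed-capacity M-layer that remembers insertion order."""

    def __init__(self, capacity):
        self.capacity = capacity
        # word -> frequency; iteration order runs from oldest to youngest.
        self.entries = OrderedDict()

    def insert(self, word):
        if word in self.entries:
            # Updating a frequency does not change the word's age.
            self.entries[word] += 1
            return
        if len(self.entries) >= self.capacity:
            # Assumption: make room by dropping the oldest stored word.
            self.entries.popitem(last=False)
        self.entries[word] = 1

    def oldest(self):
        return next(iter(self.entries), None)

    def youngest(self):
        return next(reversed(self.entries), None)
```

With a capacity of two, inserting "a", "b", "a", and then "c" would drop "a" (the oldest entry) and leave "b" as the oldest and "c" as the youngest word.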
FIG. 2 is a schematic diagram showing a generic set of steps for
inserting or searching one of word 110 in hierarchical dictionary
105. As shown in FIG. 2, a capacity of S-layer 142 may be set to
about a single document 115. To this end, S-layer 142 may be
cleared every time a single document 115 is being processed. A
capacity of M-layer 144 may be set to about 10 to 100 documents 115
whereas a capacity limit may not be defined for L-layer 146.
In FIG. 2, when inserting a word 110 for storage to hierarchical
dictionary 105 and as represented by step 1, a single word 110 is
first inserted or stored in S-layer 142. Frequency limits may be
predefined within hierarchical dictionary 105 for every one of
S-layer 142, M-layer 144, and L-layer 146. In one example
embodiment, hierarchical dictionary 105 may include instructions to
determine whether a frequency of word 110 has exceeded a first
predetermined limit and a second predetermined limit for word 110
to be promoted to M-layer 144 and L-layer 146, respectively. Thus,
word 110 may be promoted from S-layer 142 to M-layer 144 when the
first predetermined limit has been exceeded (step 2). Following a
period of time that the same word 110 has been repeatedly inserted
or stored to hierarchical dictionary 105 and when a frequency of
word 110 has exceeded the second predetermined limit, word 110 may
then be promoted from M-layer 144 to L-layer 146 for relatively
longer term storage. In setting frequency limits prior to promoting
word 110 to the higher layers within hierarchical dictionary 105,
an input and recall ability of humans may be mimicked.
Alternatively, hierarchical dictionary 105 may include instructions
for M-layer 144 to copy word 110 stored in S-layer 142, to track a
frequency of each word 110 inserted, and to promote word 110 to
L-layer 146 only once a predetermined frequency limit has been
exceeded, so that word 110 is transferred from relatively short term
to long term storage in a single step.
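The two-limit promotion described for FIG. 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name, the default limit values, the use of plain Python dicts in place of fixed size hash maps, and the choice to carry a word's count along when it is promoted are all assumptions made for the example.

```python
class HierarchicalDictionary:
    """Hypothetical three-layer dictionary with two promotion limits."""

    def __init__(self, first_limit=2, second_limit=5):
        self.first_limit = first_limit    # S-layer -> M-layer threshold
        self.second_limit = second_limit  # M-layer -> L-layer threshold
        self.s_layer = {}
        self.m_layer = {}
        self.l_layer = {}

    def insert(self, word):
        """Insert a word and return the layer ("S", "M", "L") holding it."""
        if word in self.l_layer:
            self.l_layer[word] += 1
            return "L"
        if word in self.m_layer:
            self.m_layer[word] += 1
            if self.m_layer[word] > self.second_limit:
                # Promote: move the entry and remove it from the M-layer.
                self.l_layer[word] = self.m_layer.pop(word)
                return "L"
            return "M"
        self.s_layer[word] = self.s_layer.get(word, 0) + 1
        if self.s_layer[word] > self.first_limit:
            # Promote: move the entry and remove it from the S-layer.
            self.m_layer[word] = self.s_layer.pop(word)
            return "M"
        return "S"
```

In this sketch a word stays in the short term layer until its count exceeds the first limit, then stays in the medium term layer until its count exceeds the second, larger limit.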
Still referring to FIG. 2 and in one example embodiment, when
searching for a word 110 within hierarchical dictionary 105,
L-layer 146 may be initially searched (step A). When word 110 has
not been found in L-layer 146, S-layer 142 may then be searched
(step B). Alternatively, word 110 may be searched for
simultaneously in both S- and L-layers 142, 146.
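The sequential search order (step A, then step B) can be sketched as a small function. The function name, the return labels, and the use of plain dicts for the layers are hypothetical; the sketch also assumes that the medium term layer is not consulted during lookup, since the passage above names only the L- and S-layers.

```python
def lookup(word, l_layer, s_layer):
    """Search the long term layer first, then the short term layer."""
    if word in l_layer:      # step A
        return "long-term"
    if word in s_layer:      # step B
        return "short-term"
    return None              # not found in either searched layer


# Example usage with hypothetical contents.
long_term = {"physics": 7}
short_term = {"vault": 1}
found_in = lookup("vault", long_term, short_term)
```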
FIG. 3 is a flowchart detailing the steps of inserting word 110 to
hierarchical dictionary 105. Program interface 130 may include
program instructions to receive a request from a user of computing
device 120 indicating word 110 to be inserted onto hierarchical
dictionary 105. At block 305, each word 110 may either be retrieved
from document 115 or received from processor 125. In one example
embodiment, word 110 may be a portion of the content extracted from
document 115. In another example embodiment, word 110 may be part
of an input received from a user of program interface 130 not
necessarily in relation to any document 115. In yet another example
embodiment, word 110 may be automatically received or retrieved for
insertion to hierarchical dictionary 105 when a controller of
computing device 120 (not shown) has determined that word 110 is
not included in hierarchical dictionary 105, as a result of a lookup
process detailed in the steps of FIG. 4.
Blocks 310 to 325 recite steps typically performed for inserting a
value into a hash map, as will be known in the art. For example, at
block 310, a hash value corresponding to word 110 in block 305 may
be identified. Identifying the hash value corresponding to word 110
may include determining, using a hash function with word 110 as the
input value, a unique integer corresponding to word 110. The
determined hash value is indicative of a unique index identifier
for a position in a bucket of the hash map to which a pair of
values is operative to be stored. In the present disclosure, each
pair of values in the bucket comprises word 110 as well as a
frequency thereof. At block 315, it is then determined whether the
bucket position associated with the identified hash value contains
an entry, in order to check whether word 110 is already within
hierarchical dictionary 105. At block 320, upon a determination
that the bucket position associated with the determined hash value
is empty or that hierarchical dictionary 105 does not contain word
110, word 110 is stored into said bucket position. In storing word
110 into the bucket, a frequency thereof may be initialized. At
block 325, upon a determination that the bucket position associated with
the determined hash value contains a pair of values, such that word
110 is already stored in the hierarchical dictionary, a frequency
thereof also stored in the bucket is updated. Updating a frequency
may include incrementing a frequency of word 110 stored in the
bucket position.
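Blocks 310 to 325 can be illustrated with a small fixed size hash map that stores (word, frequency) pairs. This is a sketch under assumptions, not the disclosed implementation: the class name is hypothetical, zlib.crc32 is used only as a convenient deterministic hash function, and linear probing is an assumed way of handling two different words that map to the same bucket, which the description does not address.

```python
import zlib


class FixedSizeHashMap:
    """Hypothetical fixed size hash map of [word, frequency] pairs."""

    def __init__(self, capacity):
        self.buckets = [None] * capacity

    def _index(self, word):
        # Block 310: derive a bucket index from the word's hash value.
        return zlib.crc32(word.encode("utf-8")) % len(self.buckets)

    def insert(self, word):
        """Store the word or update its frequency; return the frequency."""
        start = self._index(word)
        for step in range(len(self.buckets)):
            i = (start + step) % len(self.buckets)
            entry = self.buckets[i]
            if entry is None:
                # Block 320: empty bucket, store word with initial frequency.
                self.buckets[i] = [word, 1]
                return 1
            if entry[0] == word:
                # Block 325: word already stored, increment its frequency.
                entry[1] += 1
                return entry[1]
            # Assumption: a different word occupies the bucket, probe onward.
        raise OverflowError("hash map is full")

    def frequency(self, word):
        start = self._index(word)
        for step in range(len(self.buckets)):
            entry = self.buckets[(start + step) % len(self.buckets)]
            if entry is None:
                return 0
            if entry[0] == word:
                return entry[1]
        return 0
```

Inserting "pole", "vault", and "pole" again into a map of eight buckets would leave "pole" with a frequency of 2 and "vault" with a frequency of 1.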
In one example embodiment, steps in blocks 315 to 325 may be
performed at both hash maps associated with S-layer 142 and M-layer
144. In another example embodiment, steps in blocks 315 to 325 may
be initially performed in S-layer 142 and words 110 may be promoted
or transferred to M-layer 144 following a predetermined period
(e.g., when a new document 115 is being processed) or when a word
110 has reached a predetermined frequency limit for it to be
promoted to M-layer 144 for storage over a longer period of time
than in S-layer 142.
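The first of these embodiments, in which each word is counted in both S-layer 142 and M-layer 144, might look like the sketch below. It assumes, following the Abstract, that the first (S-layer) hash map is cleared at a triggering event, taken here to be the start of each new document. The function name process_documents, the whitespace tokenization, and the use of collections.Counter in place of a hash map are illustrative assumptions.

```python
from collections import Counter


def process_documents(documents):
    """Count words in both layers, clearing the S-layer per document."""
    s_layer = Counter()  # short-term counts, reset for each document
    m_layer = Counter()  # medium-term counts, kept across documents
    for doc in documents:
        # Assumed triggering event: a new document is being processed.
        s_layer.clear()
        for word in doc.split():  # whitespace tokenization is an assumption
            s_layer[word] += 1
            m_layer[word] += 1
    return s_layer, m_layer
```

After processing, the S-layer reflects only the most recent document, while the M-layer holds the frequencies accumulated over all documents.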
At block 330, following updating of word frequency, the controller
then determines whether the frequency of word 110 stored therein
exceeds a predetermined limit, particularly, a limit for promotion
to the next layer in hierarchical dictionary 105, and if so, at
block 335, promotes word 110 to the next layer. Promoting word 110
to another layer includes transferring word 110 to a hash map
associated with the next layer in the hierarchy and removing
entries in the current layer associated with word 110. For example,
where word 110 is stored in S-layer 142 and the controller has
determined that the frequency of word 110 has exceeded the
predetermined frequency limit for words stored in S-layer 142,
word 110 is promoted to the next layer, M-layer 144. Similar
steps will be apparent for promoting words from M-layer 144 to
L-layer 146; however, word 110 has to exceed a second predetermined
frequency limit substantially greater than the predetermined
frequency limit in S-layer 142 for promotion from M-layer 144 to
L-layer 146. Otherwise, at block 340, word 110 is retained in the
current layer in which it is stored.
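The promotion logic of blocks 330 to 340 can be illustrated with the sketch below. The function name insert_and_promote, the representation of each layer as a Python dict, and the carrying over of the accumulated frequency on promotion are assumptions. The limits are passed in by the caller; the text requires only that the second limit be substantially greater than the first.

```python
def insert_and_promote(word, layers, limits):
    """Update a word's frequency and promote it when a limit is exceeded.

    layers: list of dicts ordered [S, M, L], mapping word -> frequency.
    limits: promotion limit for each layer except the last, e.g. [3, 30].
    Returns the index of the layer holding the word afterwards.
    """
    # Find the layer currently holding the word; new words start in S.
    current = 0
    for i, layer in enumerate(layers):
        if word in layer:
            current = i
            break
    # Blocks 320/325: initialize or increment the frequency.
    layers[current][word] = layers[current].get(word, 0) + 1
    # Block 330: compare the frequency with this layer's promotion limit.
    if current < len(limits) and layers[current][word] > limits[current]:
        # Block 335: transfer to the next layer and remove the old entry.
        # Keeping the accumulated frequency is an assumption.
        layers[current + 1][word] = layers[current].pop(word)
        return current + 1
    # Block 340: the word is retained in its current layer.
    return current
```

With hypothetical limits of 3 and 30, a word moves from the S layer to the M layer on its fourth insertion and from the M layer to the L layer on its thirty-first.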
FIG. 4 is a flowchart detailing the steps of searching words 110
within hierarchical dictionary 105. Program interface 130 may
include program instructions to receive a request from a user of
computing device 120 indicating word 110 to be searched. At blocks
405 and 410, respectively, word 110 is received and a hash value
corresponding to word 110 is determined, similar to blocks 305 and
310 in FIG. 3.
At block 415, since the hash value is a unique identifier of a
bucket position in the hash map of any of SML layers 142,
144, 146, the hash value determined at block 410 is used to
determine whether the hash map in L-layer 146 associated with the
hash value includes word 110.
At block 420, upon a determination that word 110 is stored at the
specific bucket position in L-layer 146 corresponding to the hash
value, one or more program instructions in hierarchical dictionary
105 may send a notification to computing device 120 indicating
presence of word 110 in L-layer 146. In one example embodiment,
hierarchical dictionary 105 may send word 110 and a frequency
thereof indicated in the corresponding bucket to program interface
130 based upon a search request received therefrom. Otherwise, upon
a determination that the bucket position in L-layer 146
corresponding to the hash value determined at block 410 does not
include word 110, then at block 425, the controller may determine
whether the hash map in S-layer 142 associated with the hash value
includes word 110.
At block 425, upon a determination that word 110 is stored at the
specific bucket position in S-layer 142 corresponding to the hash
value determined at block 410, then, similar to block 420,
hierarchical dictionary 105 may send word 110 and a frequency
thereof to program interface 130 based upon a search request
received therefrom. However, upon a determination that the bucket
position in S-layer 142 corresponding to the hash value determined
at block 410 does not include word 110, then at block 430, the
controller may send a notification to computing device 120
indicating absence of word 110 in hierarchical dictionary 105. In
addition, word 110, when found neither in S-layer 142 nor L-layer
146, may be inserted into hierarchical dictionary 105. Steps for
inserting words to hierarchical dictionary 105, as detailed in FIG.
3, may be automatically performed for word 110 following
determination of an absence thereof in the hierarchical dictionary
of the present disclosure.
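The search order described for FIG. 4 (L-layer 146 first, then S-layer 142, then a report of absence with optional insertion) can be sketched as follows. The function name lookup, the dict-based layers, the returned tuple, and the insert_if_missing flag are illustrative assumptions. The sketch consults only the two layers named in the description of FIG. 4.

```python
def lookup(word, s_layer, l_layer, insert_if_missing=True):
    """Search the L layer, then the S layer; optionally insert on a miss.

    Returns (layer_name, frequency); layer_name is None when not found.
    """
    # Blocks 415/420: check the long-term layer first.
    if word in l_layer:
        return ("L", l_layer[word])
    # Block 425: fall back to the short-term layer.
    if word in s_layer:
        return ("S", s_layer[word])
    # Block 430: the word is absent. Inserting it into the S layer with a
    # frequency of 1 stands in for the FIG. 3 insertion steps (assumption).
    if insert_if_missing:
        s_layer[word] = 1
    return (None, 0)
```

A first search for an unknown word reports absence and, when insertion is enabled, stores the word so that a second search finds it in the S layer.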
It will be appreciated that the actions described and shown in the
example flowcharts may be carried out or performed in any suitable
order. It will also be appreciated that not all of the actions
described in FIGS. 3 and 4 need to be performed in accordance with
the example embodiments and/or additional actions may be performed
in accordance with other example embodiments of the disclosure.
Many modifications and other embodiments of the disclosure set
forth herein will come to mind to one skilled in the art to which
this disclosure pertains having the benefit of the teachings
presented in the foregoing descriptions and the associated
drawings. Therefore, it is to be understood that the disclosure is
not to be limited to the specific embodiments disclosed and that
modifications and other embodiments are intended to be included
within the scope of the appended claims. Although specific terms
are employed herein, they are used in a generic and descriptive
sense only and not for purposes of limitation.
* * * * *