U.S. patent application number 13/620364 was filed with the patent office on 2013-03-21 for methods and systems to fingerprint textual information using word runs.
The applicant listed for this patent is Ilya Beyer, Scott MORE. Invention is credited to Ilya Beyer, Scott MORE.
Application Number | 20130074198 13/620364 |
Document ID | / |
Family ID | 41531435 |
Filed Date | 2013-03-21 |
United States Patent
Application |
20130074198 |
Kind Code |
A1 |
MORE; Scott; et al. |
March 21, 2013 |
METHODS AND SYSTEMS TO FINGERPRINT TEXTUAL INFORMATION USING WORD
RUNS
Abstract
The present invention provides methods and systems to enable
fast, efficient, and scalable means for fingerprinting textual
information using word runs. The present system receives textual
information and provides algorithms to convert the information into
representative fingerprints. In one embodiment, the fingerprints
are recorded in a repository to maintain a database of an
organization's secure data. In another embodiment, textual
information entered by a user is verified against the repository of
fingerprints to prevent unauthorized disclosure of secure data.
This invention provides approaches to allow derivative works (e.g.,
different ordering of words, substitution of words with synonyms,
etc.) of the original information to be detected at the sentence
level or even at the paragraph level. This invention also provides
methods and systems for enhancing storage and resource efficiencies
by providing approaches to optimize the number of fingerprints
generated for the textual information.
Inventors: |
MORE; Scott; (San Francisco,
CA) ; Beyer; Ilya; (San Mateo, CA) |
|
Applicant: |
Name | City | State | Country | Type |
MORE; Scott | San Francisco | CA | US | |
Beyer; Ilya | San Mateo | CA | US | |
Family ID: |
41531435 |
Appl. No.: |
13/620364 |
Filed: |
September 14, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12177043 | Jul 21, 2008 | 8286171 |
13620364 | | |
Current U.S.
Class: |
726/30 |
Current CPC
Class: |
H04L 63/1408 20130101;
G06F 21/608 20130101; G06F 21/6218 20130101; H04L 63/0245 20130101;
H04L 63/04 20130101; H04L 63/08 20130101; G06F 21/554 20130101;
H04L 63/12 20130101; G06F 21/62 20130101 |
Class at
Publication: |
726/30 |
International
Class: |
G06F 21/00 20060101
G06F021/00 |
Claims
1. A system to prevent unauthorized disclosure of secure
information, the system comprising: a processor; a memory; a
processing component configured to: receive information including a
first text, wherein the first text includes a plurality of words;
normalize the first text into a first canonical text expression,
the first canonical text expression including a plurality of
normalized words; generate a first word hash list for the first
canonical text expression, wherein the first word hash list is
generated at a word level; and generate one or more fingerprints
for the first word hash list, wherein the generation of one or more
fingerprints includes: assigning a sliding window of size W,
wherein W specifies a number of word-value hashes to read from the
first word hash list; using the sliding window to read the W
word-value hashes from the first word hash list; designating an
anchor word-value hash for the sliding window by selecting a
distinct-valued word-value hash among the W word-value hashes; and
applying a fingerprint hash function to all words starting from a
first word-value hash to the anchor word-value hash, wherein
applying the fingerprint hash function generates the one or more
fingerprints.
2. The system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein each of the plurality of
words being a combination of one or more text characters not
separated by a predefined character, and each of the plurality of
words being separated from a previous word and a subsequent word by
at least one of said predefined character.
3. A system to prevent unauthorized disclosure of secure
information as recited in claim 2, wherein said predefined
character includes more than one character type, including a space,
a period, a comma, a semi-colon, a colon, an exclamation point, a
dash, a parenthesis, and a quotation mark.
4. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein the processing component
is further configured to detect among the plurality of words a
separation of one word from a following or preceding word using a
word boundary detector.
5. A system to prevent unauthorized disclosure of secure
information as recited in claim 4, wherein the word boundary
detector is language independent.
6. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein said receiving
information includes receiving secure information from a local
database.
7. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein said receiving
information includes receiving information entered by a user.
8. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein said generating the
first word hash list includes converting the plurality of
normalized words into a plurality of word-value hashes, wherein
each word-value hash of the plurality of word-value hashes
represents a specific normalized word.
9. A system to prevent unauthorized disclosure of secure
information as recited in claim 8, wherein said converting the
plurality of normalized words includes one or more of the
following: performing case-folding; removing stop words; mapping a
predefined set of common words to a unique word-value hash; mapping
a predefined set of synonyms to a unique word-value hash; mapping a
predefined set of common words to a unique word-value hash; and
mapping a predefined set of words in a particular category to a
unique word-value hash.
10. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein the fingerprint hash
function includes any hash function that allows the one or more
fingerprints to be independent of the order of words in the first
word hash list.
11. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein the processing component
is further configured to record the one or more fingerprints from
all sources in a fingerprint database.
12. A system to prevent unauthorized disclosure of secure
information as recited in claim 1, wherein the processing component
is further configured to monitor and detect any unauthorized
disclosure of said secure information by a user, by generating the
one or more fingerprints for information entered by the user.
13. A system to prevent unauthorized disclosure of secure
information as recited in claim 11, wherein the processing
component is further configured to monitor and detect any
unauthorized disclosure of said secure information by a user, by
generating the one or more fingerprints for information entered by
the user and matching the one or more fingerprints against
fingerprints stored in said fingerprint database.
14. A computer implemented method for preventing unauthorized
disclosure of secure information, the computer implemented method
comprising: storing a plurality of secure text fingerprints for a
given organization, wherein each of the plurality of secure text
fingerprints is generated using a fixed window word run hashing;
receiving a first text that a user desires to transmit outside of
the given organization; generating a first set of fingerprints for
the first text using the fixed window word run hashing, wherein
generating a first set of fingerprints includes: converting a
plurality of normalized words into a plurality of word-value hashes
to create an original word hash list, wherein each word-value hash
represents a specific normalized word; assigning a sliding window
of size W, wherein W specifies a number of word-value hashes to
read from the original word hash list; using said sliding window to
read said W word-value hashes from the original word hash list;
designating an anchor word-value hash for the sliding window by
selecting a distinct-valued word-value hash among said W word-value
hashes; and applying a fingerprint hash function to all words
starting from a first word-value hash to the anchor word-value
hash, wherein applying the fingerprint hash function generates the
first set of fingerprints; determining whether any of the first set
of fingerprints is identical to any of the plurality of secure text
fingerprints; and taking a security action when any of the first
set of fingerprints is identical to any of the plurality of secure
text fingerprints.
15. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, wherein the
fixed window word run hashing comprises: receiving information
including an original text, the original text including a plurality
of words; normalizing said original text into an original canonical
text expression, the original canonical text expression including a
plurality of normalized words; generating an original word hash
list for the original canonical text expression, wherein the
original word hash list is generated at a word level, wherein the
original word hash list includes a plurality of word-value hashes;
and generating an original set of fingerprints for the original
word hash list.
16. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, wherein
said first text includes at least one of: text contained in an
electronic mail; text contained in a file attached to an electronic
mail; and text that is transferred using a computer's output
device.
17. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, wherein
said security action includes at least one of: preventing said
first text from being transferred; logging the event as a security
violation; requiring a password from said user to allow said first
text to be transferred; blocking said user's access to said first
text; and sending out a security alert.
18. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, the
computer implemented method further comprising generating the
plurality of secure text fingerprints for the given
organization.
19. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, the
computer implemented method further comprising creating a
fingerprint database for the given organization, wherein the
fingerprint database comprises the plurality of secure text
fingerprints for the given organization.
20. A computer implemented method for preventing unauthorized
disclosure of secure information as recited in claim 14, wherein
the fingerprint hash function includes any hash function that
allows the first set of fingerprints to be independent of the order
of the words in the original word hash list.
Description
CROSS REFERENCES
[0001] This application claims the benefit of U.S. application Ser.
No. 12/177,043, entitled "METHODS AND SYSTEMS TO FINGERPRINT
TEXTUAL INFORMATION USING WORD RUNS," filed Jul. 21, 2008, and is
hereby incorporated by reference.
BACKGROUND OF THE INVENTION
[0002] With the rapid increase and advances in digital
documentation capabilities and document management systems,
organizations are increasingly storing important, confidential, and
secure information in the form of digital documents. Unauthorized
dissemination of this information, either by accident or by wanton
means, presents serious security risks to these organizations.
Therefore, it is imperative for the organizations to protect such
secure information and to detect and react to any disclosure of
secure information (or derivatives thereof) beyond the perimeters
of the organization.
[0003] Additionally, the organizations face the challenge of
categorizing and maintaining the large corpus of digital
information across potentially thousands of data stores, content
management systems, end-user desktops, etc. It is therefore
valuable to the organization to be able to identify and disregard
redundant information from this vast database. At the same time, it
is critical to the organization's security to be able to identify
derivative forms of the secure data (e.g., changes to the sentence
structure or word ordering at the sentence/paragraph level, use of
comparable words in the form of synonyms/hypernyms, varied usage of
punctuations, etc.) and identify any unauthorized disclosure of
even such derivative forms. Therefore, any system or method built
to accomplish the task of preventing unauthorized disclosure would
have to address these two conflicting challenges.
[0004] One method to detect similar data is by examining the
database at the file level. This can be done by comparing the file
names, or by comparing the file sizes, or by doing a checksum of
the contents of the file. However, even minor differences between
two files will cause them to evade such a detection method.
[0005] Other prior art solutions teach partial text matching
methods using various k-gram approaches. In such approaches,
text-characters of a fixed length, called k-grams, are selected
from the secure text. These k-grams are hashed into a number called
a fingerprint. In order to increase storage and resource
efficiency, the various prior art approaches propose different
means by which the k-grams can be sampled so as to store only a
representative subset of the k-grams. However, these prior art
approaches suffer a number of disadvantages. For example, these
prior systems are not robust against derivative works of the secure
text. Additionally, the k-gram approaches are not suitable for use
in multi-language environments (e.g., a document containing a
mixture of Mandarin and English words). Also, using a
character-based approach as opposed to a word-based approach does
not allow for the exclusion of common or repeated words, thus
resulting in overall memory and resource inefficiencies.
SUMMARY OF THE INVENTION
[0006] Methods and systems to provide fast, efficient, and scalable
means to fingerprint textual information using word runs are
presented. In one embodiment, the present invention provides
methods and systems to efficiently fingerprint vast amounts of textual
information using word runs and allows these fingerprints to be
recorded in a repository. This embodiment comprises a receiving
module to receive textual information from a plurality of input
sources. It further includes a normalization module to convert the
textual information to a standardized canonical format. It then
includes a word boundary detection module that detects the
boundaries of words in a language-independent manner. It
additionally includes a word hash list generator, where each word
of the textual information is converted to a representative hash
value. Several means are provided by which the word hash list can
be post-processed to significantly improve memory and resource
efficiencies. Examples of such post-processing include eliminating
certain stop words, grouping certain categories of words and
mapping them to one hash value, etc. This embodiment also includes
a fingerprint generator, which generates fingerprints by applying
hash functions over the elements of the word hash list. The
fingerprint generator uses algorithms to generate only a
representative subset of the entire word hash list, thus further
enhancing the memory and resource efficiencies of the system. A
repository, which can include any database or storage medium, is
then used to record the fingerprints generated for the vast amounts
of textual information received at the receiver module.
[0007] In another embodiment, the present invention provides
methods and systems to receive any textual information entered
by a user and to match such information against a fingerprint
database. This embodiment includes a receiving module to receive
the user-entered information, a normalization module to convert the
textual information to a standardized canonical format, a language
independent word boundary detector to detect the start and end of
each word, a word hash list generator to generate representative
hash values for every word, and a fingerprint generator that uses a
sliding window to efficiently generate a representative subset of
fingerprints for the received user information. This embodiment
finally matches the generated fingerprints against a previously
developed fingerprint database, and provides alerts to the user in
the event that any secure or protected information is indeed being
disclosed.
[0008] Other embodiments of the present invention allow the
fingerprints to be generated without any preference for language,
and without any linguistic understanding of the underlying text,
thereby allowing the invention to be applied to most languages. The
present invention also provides embodiments where the fingerprints
are made independent of the presence of punctuations, the ordering
of words within sentences or paragraphs, and/or the presence of
upper and lower case characters in the words. By doing this, the
present invention allows word runs to be matched and detected both
at sentence and paragraph level. Additionally, this invention
allows even derivative works of the original text (e.g., changes to
the sentence structure or word ordering at the sentence/paragraph
level, use of comparable words in the form of synonyms/hypernyms,
varied usage of punctuations, removal or addition of certain stop
words, etc.) to be matched and detected.
BRIEF DESCRIPTION OF DRAWINGS
[0009] These and other objects, features and characteristics of the
present invention will become more apparent to those skilled in the
art from a study of the following detailed description in
conjunction with the appended claims and drawings, all of which
form a part of this specification. In the drawings:
[0010] FIG. 1 illustrates an overall embodiment of a method for
fingerprinting textual information using word runs;
[0011] FIG. 2 is a flowchart depicting an embodiment of a method
for generating a word hash list;
[0012] FIG. 3 is a block diagram providing the various methods by
which post-processing can be performed on the word hash list to
improve efficiency;
[0013] FIG. 4 is a flowchart depicting a preferred embodiment of a
method to generate a first fingerprint for the received textual
information;
[0014] FIG. 5 is a block diagram providing examples of methods by
which the fingerprints can be made word-order independent;
[0015] FIG. 6 is a flowchart depicting a preferred embodiment of a
method to generate a set of fingerprints for the entire textual
information;
[0016] FIG. 7 illustrates an embodiment for generating the
fingerprints for the secure and protected information of an
organization and then recording the fingerprints in a
repository;
[0017] FIG. 8 illustrates an embodiment for generating the
fingerprints for user-entered information and then matching that
fingerprint against fingerprints stored in a repository;
[0018] FIG. 9 provides an overall embodiment of a system for
fingerprinting textual information using word runs; and
[0019] FIG. 10 is a block diagram depicting various embodiments of
systems by which fingerprints can be either recorded or used for
matching and detecting an unauthorized disclosure.
DETAILED DESCRIPTION OF THE INVENTION
[0020] The present invention may be embodied in several forms and
manners. The description provided below and the drawings show
exemplary embodiments of the invention. Those of skill in the art
will appreciate that the invention may be embodied in other forms
and manners not shown below. It is understood that relational
terms, if any, such as first, second, top and bottom,
and the like are used solely for distinguishing one entity or
action from another, without necessarily requiring or implying any
such actual relationship or order between such entities or
actions.
[0021] FIG. 1 shows one embodiment of an overall method to
fingerprint textual information using word runs. In this
embodiment, the information that needs to be fingerprinted is
received from a plurality of sources 110. This information is then
normalized 120 to a standardized or canonical text format. The
boundaries of each word are then detected 125 in a language
independent manner. The words from the normalized text are then
used to generate a word run based hash list, called the word hash
list 130. This word hash list is then used to generate the final
fingerprints 140. Each of these steps is discussed in detail
below.
[0022] Information may be received from several sources. In one
embodiment, the source could include confidential, important, or
secure information maintained by an organization, where such
information needs to be recorded or registered into a database. In
another embodiment, the source could include any information
entered by a user having access to an organization's secure
information, where such information would need to be matched and
inspected against an existing database of secure information. The
textual information received from either of these sources includes
a plurality of words. Such words may be present as a plurality
of text-characters, with one word distinguished from another by the
presence of at least one space-character. The words may also be
present as a plurality of text-characters, with one word separated
from another by the use of punctuation marks.
[0023] The received information is first normalized to a canonical
text representation 120. This can be done by converting the
computer files containing the textual information into one of
several raw text formats. One example of such normalization is to
convert a PDF (Portable Document Format) file into a Unicode
transformation format file. An example of a Unicode transformation
format is UTF-16.
[0024] In one embodiment, the present invention uses a word
boundary detector 125 to detect the separation of one word from a
preceding or following word. The word boundary detector 125 uses a
state machine and employs character-classes that dictate boundary
analysis across languages. In this embodiment, the state machine
utilizes mapping tables to determine what character-class a
particular character belongs to. By mapping the current character
and comparing that against the mapping of the previous character,
the detector determines whether a word has just started or ended.
Because the character-classes include generic word separators or
delimiters common to most languages, this word boundary detector
can be used in a language independent manner. Additionally, the
characters within the words may be case-folded, such that the
word-value hash assigned to a particular word does not depend upon
whether the word has any upper or lower-case characters. Note that
the case folding can be done at any time prior to the generation of
a word hash list.
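The boundary detection described in paragraph [0024] can be sketched as a small two-state machine. This is only an illustration under stated assumptions: the character-class "mapping table" is approximated here with Python's `unicodedata.category`, only two classes (`word` and `sep`) are used, and the names `char_class` and `split_words` are hypothetical. Scripts that do not separate words with delimiters would need finer character classes than this sketch provides.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse class (assumed two-class table)."""
    category = unicodedata.category(ch)
    # Letters (L*), marks (M*), and numbers (N*) count as word characters;
    # spaces, punctuation, symbols, and control characters are separators.
    return "word" if category[0] in ("L", "M", "N") else "sep"


def split_words(text):
    """Emit case-folded words by comparing each character's class
    with the class of the previous character."""
    words, current, previous = [], [], "sep"
    for ch in text:
        cls = char_class(ch)
        if cls == "word":
            current.append(ch)          # inside a word (or a word just started)
        elif previous == "word":
            words.append("".join(current).casefold())  # a word just ended
            current = []
        previous = cls
    if current:                         # flush a word that runs to end of input
        words.append("".join(current).casefold())
    return words
```

For instance, `split_words("Top-Secret plan")` yields `["top", "secret", "plan"]`, because the dash is classified as a separator and case folding is applied before the words are returned.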
[0025] FIG. 2 depicts one method of generating a word hash list
200. Here, the normalized textual information is read as input 210.
Each of the words present in this normalized input is then
converted to a word-value hash 220. One example of generating a
word-value hash is to compute a hash-based function over every
character of a word and generate an integer value corresponding
to that word. Such word-value hashes are generated for every word
of the received normalized information. In this embodiment, only
words are processed, and punctuations are not assigned any
word-value hashes. This allows the method to remain impervious to
changes in punctuation. The resulting word-value hashes from all
the words are compiled together to obtain a word hash list 230.
This word hash list may then be subject to post-processing steps
240 (explained below in detail in FIG. 3) to generate fingerprints
that are robust and remain impervious to edits in derivative works
of the original text. The word hash list received after such
post-processing steps is designated as the final word hash list
250.
[0026] In one embodiment, the word-value hashes are computed as
32-bit unsigned integers. This is advantageous because the
computation of the word-value hashes could then use 32-bit
arithmetic, which would be much faster than performing 64-bit
arithmetic on 32-bit architectures.
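A minimal sketch of the word hash list generation in paragraphs [0025] and [0026] follows. The text does not name a specific hash function, so 32-bit FNV-1a is swapped in purely as an assumption, and it is applied to the UTF-8 bytes of each word rather than to characters directly. The function names are hypothetical.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def word_value_hash(word):
    """Hash one word to a 32-bit unsigned integer (FNV-1a, assumed choice)."""
    h = FNV_OFFSET_BASIS
    for byte in word.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK32   # keep the arithmetic within 32 bits
    return h


def word_hash_list(words):
    """Convert a sequence of words into a list of word-value hashes.
    Punctuation is assumed to have been dropped by the tokenizer."""
    return [word_value_hash(w.casefold()) for w in words]
```

Because each word is case-folded before hashing, `word_hash_list(["Secret"])` and `word_hash_list(["SECRET"])` produce the same single-element list.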
[0027] FIG. 3 is a block diagram 240 providing information on
various methods to achieve post processing of the word hash lists.
In one method, word-value hashes corresponding to certain
stop-words are excluded 320 from the final word hash list.
Stop-words include those words of any language that occur
frequently in the usage of the language, but do not add any
substantive content to meaningful understanding of the language.
Examples of stop-words include prepositions (e.g., beside, to,
until), gender denoting terms (e.g., she, he, her), etc. In yet
another method, certain predefined sets of words are mapped to a
distinct word-value hash 330. Examples include mapping all stems of
a frequently used word to the same root, mapping nouns to common
synonyms or hypernyms, etc. In one embodiment, the word-value
hashes 220 are generated as integers such that words of the textual
information are represented by unique integer values. Operating the
post processing steps with integer values results in increased
computational efficiencies as compared to operating on character or
string values.
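The post-processing steps of FIG. 3 might look like the following sketch, which operates on integer word-value hashes rather than strings, as the paragraph suggests. The stop-word list, the synonym table, and the use of `zlib.crc32` as the word-value hash are all illustrative assumptions, not values taken from the specification.

```python
import zlib


def word_value_hash(word):
    """32-bit unsigned word-value hash (CRC-32 used as a stand-in)."""
    return zlib.crc32(word.encode("utf-8"))


# Hypothetical tables, converted to integers once so that the
# post-processing loop never touches strings.
STOP_WORDS = {"the", "a", "to", "until", "beside", "she", "he", "her"}
CANONICAL_WORDS = {"cars": "car", "automobile": "car", "automobiles": "car"}

STOP_HASHES = {word_value_hash(w) for w in STOP_WORDS}
CANONICAL_HASHES = {
    word_value_hash(variant): word_value_hash(root)
    for variant, root in CANONICAL_WORDS.items()
}


def post_process(hash_list):
    """Drop stop-word hashes and map grouped words to one shared hash."""
    final = []
    for h in hash_list:
        if h in STOP_HASHES:
            continue                              # exclusion step 320
        final.append(CANONICAL_HASHES.get(h, h))  # mapping step 330
    return final
```

With these tables, the word hash lists for "she drove the automobile" and "he drove cars" reduce to the same final word hash list, which is the property that lets later fingerprints tolerate such edits.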
[0028] The post processing steps of FIG. 3 ensure that the final
fingerprints remain robust and impervious to any changes or edits
in derivative works of the original information. Specifically,
these steps allow even derivative works of the original work to be
matched and detected at a later inspection stage. Derivative works
of the original information may include changes in word ordering,
removal or addition of stop-words, changes in punctuations, and
usage of different stems for a particular word. Additionally, the
post-processing steps also improve the efficiency of the process by
reducing the number of word-value hashes that will need further
processing.
[0029] FIG. 4 is a flowchart 400 depicting a method of generating
one fingerprint from the final word hash list 250. The method
comprises receiving the final word hash list 410 and assigning a
sliding window of fixed-size W (where W is an integer greater than
or equal to 1) to read the first W word-value hashes from the word
hash list 420. An anchor 430 is then determined for this first
window, by selecting a distinct-valued word-value hash from the W
number of word-value hashes currently read in by the sliding
window. Examples of distinct-valued word-value hashes include those
word-value hashes that have the highest integer value, or those
word-value hashes with the lowest integer value. After selecting an
anchor, a new hash H.sub.f 440 is computed by applying a hash
function over all the words starting from the first word-value hash
within the window, up until the word-value hash that is designated
as the anchor. This new hash is effectively a hash of one or more
word-value hashes, and this new hash is designated as the first
fingerprint.
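The steps of FIG. 4 can be sketched as follows. This is an illustration under assumptions: the anchor is taken to be the highest-valued hash in the window (one of the two examples the text gives), ties are broken in favor of the earliest position, and H.sub.f is stood in for by CRC-32 over the packed run, which is order dependent. The function names are hypothetical.

```python
import struct
import zlib


def fingerprint_hash(run):
    """Stand-in for H_f: CRC-32 over the run packed as little-endian
    32-bit unsigned integers (an order-dependent choice)."""
    return zlib.crc32(struct.pack(f"<{len(run)}I", *run))


def first_fingerprint(word_hashes, window_size):
    """Return (anchor_index, fingerprint) for the first window."""
    if not word_hashes or window_size < 1:
        raise ValueError("need a non-empty hash list and window_size >= 1")
    window = word_hashes[:window_size]
    # Anchor: the highest-valued word-value hash within the window.
    anchor_index = max(range(len(window)), key=window.__getitem__)
    # Hash everything from the start of the window through the anchor.
    run = window[:anchor_index + 1]
    return anchor_index, fingerprint_hash(run)
```

For the toy list `[17, 42, 8, 99, 23]` with W = 3, the window is `[17, 42, 8]`, the anchor is 42 at index 1, and the fingerprint is computed over the run `[17, 42]`.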
[0030] The present invention also discloses methods by which the
hash function can optionally be made word-order independent. FIG. 5
is a block diagram illustrating several possible embodiments of the
hash function H.sub.f. These embodiments represent different ways
by which H.sub.f can be made word order independent 500. In one
embodiment, H.sub.f can be implemented as an addition hash function
520. In another embodiment, H.sub.f can be implemented as a
multiplication hash function 530. In yet another embodiment,
H.sub.f can be implemented as an exclusive-or hash function 540.
These hash functions are examples of symmetric hash functions, and
would therefore allow the fingerprints to be word order
independent. To make H.sub.f more robust, another embodiment of
H.sub.f can be developed by combining the symmetric hash functions
540. One method of realizing such an embodiment would be by
splitting a large word-value hash into two parts and performing a
different symmetric operation on the two parts. Word-order
independence of H.sub.f allows for a much larger range of
modifications to the original text to be detected at the inspection
level, than is possible with prior art approaches. The combination
of this word-order independence 500 and the various post-processing
methods 300 disclosed in FIG. 3 makes it possible to detect similar
text at the inspection stage, even when such text is modified from
the original text at the sentence or paragraph level.
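The symmetric hash functions of FIG. 5 can be illustrated with the sketch below. It assumes 32-bit word-value hashes and reduces every result modulo 2^32; the particular split used in `combined_hash` (add the high 16-bit halves, exclusive-or the low halves) is one hypothetical way to realize the "two parts" idea, not a construction given in the text.

```python
MASK32 = 0xFFFFFFFF


def add_hash(hashes):
    """Addition hash: sum of the word-value hashes, modulo 2**32."""
    return sum(hashes) & MASK32


def mul_hash(hashes):
    """Multiplication hash: product of the word-value hashes, modulo 2**32."""
    product = 1
    for h in hashes:
        product = (product * h) & MASK32
    return product


def xor_hash(hashes):
    """Exclusive-or hash over the word-value hashes."""
    result = 0
    for h in hashes:
        result ^= h
    return result


def combined_hash(hashes):
    """Split each hash into two 16-bit halves and apply a different
    symmetric operation to each half (assumed split)."""
    high = sum(h >> 16 for h in hashes) & 0xFFFF   # add the high halves
    low = 0
    for h in hashes:
        low ^= h & 0xFFFF                          # xor the low halves
    return (high << 16) | low
```

Each function returns the same value for `[17, 42, 8, 99]` and for any reordering such as `[99, 8, 42, 17]`, since addition, multiplication, and exclusive-or are commutative and associative.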
[0031] FIG. 6 is a flowchart illustrating one method for generating
a complete set of fingerprints 600 for the entire word hash list
250. In one embodiment, a first fingerprint 450 is generated using
the method explained previously in FIG. 4. After this, the sliding
window of size W 420 is moved one position to the right 620,
thereby reading W word-value hashes 220 starting from the second
word-value hash in the word hash list 250. From this new set of W
word-value hashes, a new anchor 630 is designated for this new
window by selecting a new distinct-valued word-value hash, similar
to the anchor selection method 430 for the first fingerprint as
explained in FIG. 4. This new anchor 630 is then compared against
the anchor that was generated for the immediately preceding window.
If the new anchor 630 is identical to the immediately preceding
anchor, no new fingerprint is generated 640. However, if the new
anchor 630 is not identical to the immediately preceding anchor, a
new fingerprint is generated 650 using the hash function H.sub.f
440 explained in FIG. 4. After the completion of this step, the
sliding window is moved another position to the right, reading a
new set of W word-value hashes. This process is repeated until all
the word-value hashes in the word hash list are completely scanned
by the sliding window.
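The loop of FIG. 6 can be sketched as below. Several points are assumptions rather than statements from the text: the anchor is the highest-valued hash in each window, two anchors are treated as "identical" when they sit at the same position in the word hash list (comparing by value is another possible reading), and H.sub.f is the simple addition hash. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fingerprint_hash(run):
    """Addition hash over a run of word-value hashes (order independent)."""
    return sum(run) & MASK32


def generate_fingerprints(word_hashes, window_size):
    """Slide a window of size W over the list and emit a fingerprint
    only when the anchor differs from the previous window's anchor."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    fingerprints = []
    previous_anchor = None
    last_start = max(len(word_hashes) - window_size, 0)
    for start in range(last_start + 1):
        window = word_hashes[start:start + window_size]
        if not window:
            break                                   # empty input
        offset = max(range(len(window)), key=window.__getitem__)
        anchor = start + offset                     # absolute anchor position
        if anchor != previous_anchor:
            # Hash from the first hash in the window through the anchor.
            fingerprints.append(fingerprint_hash(window[:offset + 1]))
            previous_anchor = anchor
    return fingerprints
```

For the toy list `[17, 42, 8, 99, 23, 5]` with W = 3, the first window anchors on 42 and yields 17 + 42 = 59; the second window anchors on 99 and yields 42 + 8 + 99 = 149; the remaining two windows keep the same anchor, so no further fingerprints are produced. Four windows therefore reduce to two stored fingerprints.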
[0032] FIG. 7 presents one embodiment of registering the
fingerprints. In this embodiment, the fingerprints generated for
each word hash list 250 using the methods explained in FIGS. 2-6
are stored in a repository 700. This repository would then serve as
a database 730, containing fingerprint data for all confidential,
important, or secure information of an organization.
[0033] FIG. 8 depicts another embodiment of generating
fingerprints, where the embodiment can be used for the purpose of
inspecting any user-entered information. This can be done by
matching the fingerprint generated for the user-entered information
820 against fingerprints stored in a central fingerprint database
830. This central fingerprint database contains a plurality of
fingerprints of an organization's secure information, as explained
in FIG. 7. A new set of fingerprints is then generated for text
that a user desires to transmit outside of the organization 810.
Examples of such transmitted text include text contained in an
email that a user desires to send out from his computer, text
contained in any files that a user attaches to an email, text
contained in any files that a user transfers outside of his
computer using any of the computer's output devices, etc. Examples
of transfers using a computer's output devices include data transferred to a floppy
disc in a floppy drive, data transferred to a flash memory device,
data transferred to a disc in a CD/DVD drive, data transferred to
another computer using the computer's network connectivity, data
transferred over the internet using a file transfer protocol, etc.
Here, the new set of fingerprints is compared against the
fingerprints stored in the central fingerprint database 830. In one
embodiment, a security action is performed if any fingerprint in
the new set matches any of the fingerprints in the central
database. Examples of such security actions include sending out an
email alert to a person responsible for the organization's
information security, denying the user's access to the information,
logging the event as a potential security violation, requiring the
user to enter a password to allow such information to be
transferred, preventing the secure information from being
transferred out, etc.
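The comparison and security action described above can be sketched
as follows. This is a minimal Python illustration under stated
assumptions, not the claimed implementation: the function name
`inspect_outgoing`, the logger name, and the representation of
fingerprints as strings are hypothetical, and only two of the listed
security actions (logging the event and preventing the transfer) are
shown.

```python
import logging

logger = logging.getLogger("fingerprint_inspection")  # hypothetical name


def inspect_outgoing(new_fingerprints, secure_fingerprints):
    """Return (allowed, matches) for text a user wants to transmit.

    Both arguments are iterables of fingerprint strings. The string
    representation is an assumption made for this sketch.
    """
    matches = set(new_fingerprints) & set(secure_fingerprints)
    if matches:
        # Security action: log the event and deny the transfer.
        logger.warning(
            "potential security violation: %d match(es)", len(matches)
        )
        return False, matches
    return True, matches
```

A caller would pass the fingerprints of the outgoing text together
with the contents of the central fingerprint database and proceed
with the transfer only when the first returned value is true.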
[0034] The following description of FIGS. 9-11 includes an overview
of computer hardware and other operating components suitable for
implementing the systems of the invention described here. The
invention can be practiced with other computer system
configurations, including hand-held devices, multiprocessor
systems, microprocessor-based or programmable consumer electronics,
network PCs, minicomputers, mainframe computers, and the like. The
invention can also be practiced in distributed computing
environments where tasks are performed by remote processing devices
that are linked through a communications network.
[0035] FIG. 9 shows one embodiment of an overall system that can be
used to generate fingerprints for word runs. Here, the system has a
first receiver module 910. In this embodiment, the receiver module
910 is a computer, which can receive textual information from
several sources. In one embodiment, the textual information can be
entered into the computer by a user, using any I/O device attached
to the computer. Such I/O devices could include any device used for
entering information into a computer, including a keyboard,
pointing device (e.g., a mouse), microphone, joystick, game pad,
scanner, digital camera, etc. In another embodiment, the textual
information could be in the form of data files, including an
organization's secure or confidential information, stored in the
memory of the computer. Such memory may include but is not limited
to RAM, ROM, and/or any combination of volatile and non-volatile
memory. In yet another embodiment, the information could be
available in the form of a database in a computer's memory. In
other embodiments, the information could be stored in a network
server, or could be received from an external source via a network
router.
[0036] The received information is converted to a normalized text
format within the text normalization module 920. In one embodiment,
this text normalization module is any computer implemented software
application that can be used to convert the data file from a
non-Unicode format to a Unicode text format. A person of skill in
the art can immediately appreciate the wealth of third-party
software applications that are readily available to perform this
normalization.
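As one possible illustration of this normalization step, the Python
sketch below decodes bytes from a non-Unicode encoding into Unicode
text. The function name `normalize_text`, the default source
encoding, and the additional NFC composition step are assumptions
made for the example; the description above only requires conversion
to a Unicode text format.

```python
import unicodedata


def normalize_text(raw, encoding="latin-1"):
    """Decode raw bytes into Unicode text.

    `encoding` names the assumed source encoding; undecodable bytes
    are replaced rather than raising an error.
    """
    text = raw.decode(encoding, errors="replace")
    # Assumption: also compose characters (NFC) so that equivalent
    # sequences such as "e" + combining accent compare equal to "é".
    return unicodedata.normalize("NFC", text)
```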
[0037] The received normalized information is then transmitted to a
word detector 930. In one embodiment, the word detector could be
computer-implemented software that runs an algorithm to detect
the boundaries of each word. In this embodiment, the word boundary
detector uses a state machine and employs character-classes that
dictate boundary analysis across languages. Here, the state machine
utilizes mapping tables to determine what character-class a
particular character belongs to. By mapping the current character
and comparing that against the mapping of the previous character,
the detector determines whether a word has just started or ended.
Because the character-classes include generic word separators or
delimiters common to most languages, this word boundary detector
can be used in a language independent manner. Thus, various
embodiments of this system can be developed for different
languages. Additionally, a case-folding operation may be done on
the words to remove any distinction between words containing upper
case and lower case characters. This ensures that separate
fingerprints are not generated for upper case and lower case forms
of the same word. Note that the case folding can be done at any time
prior to the operation of the word hash list generation module.
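The state-machine approach described above can be sketched in Python
as follows. This is a simplified illustration: it uses only two
character-classes ("word" and "sep") derived from Unicode general
categories in place of the mapping tables, and the names
`char_class` and `detect_words` are hypothetical. Case folding is
applied as each word is emitted.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse character-class.

    Assumption: letters (L*), marks (M*), and numbers (N*) are word
    characters; everything else is treated as a separator.
    """
    return "word" if unicodedata.category(ch)[0] in ("L", "M", "N") else "sep"


def detect_words(text):
    """Return the case-folded words found in `text`."""
    words = []
    start = None
    prev = "sep"  # class of the previous character
    for i, ch in enumerate(text):
        cur = char_class(ch)
        if prev == "sep" and cur == "word":
            start = i  # a word has just started
        elif prev == "word" and cur == "sep":
            words.append(text[start:i].casefold())  # a word has just ended
        prev = cur
    if prev == "word":
        words.append(text[start:].casefold())  # word runs to end of text
    return words
```

Comparing the class of the current character with the class of the
previous one is what marks each boundary, so the same loop applies to
any language whose separators fall into the "sep" class.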
[0038] The received normalized information is then used to generate
a word hash list using the word hash list generation module 940. In
one embodiment, this word hash list generation module is a computer
implemented software that operates on every word of the received
normalized textual information. In this embodiment, the module
further comprises a computer implemented software to compute a hash
function over all the characters of each word, resulting in a
word-value hash for every word. These word-value hashes are
compiled together in a list, and this list is designated as the
word hash list. The word hash list can further be post-processed to
exclude some word-value hashes in order to generate fingerprints
that are robust and remain impervious to edits in derivative works
of the original text. Examples of this include removing certain
stop words that occur frequently in a language and grouping certain
categories of words and mapping them to one common word-value hash.
These post-processing steps can also be achieved by means of a
computer implemented software.
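A minimal Python sketch of the word hash list generation and its
post-processing is given below. The choice of SHA-256 truncated to
64 bits, the tiny stop-word set, and the grouping of all-digit words
under a single placeholder are assumptions made for illustration; the
description does not fix a hash function, a stop-word list, or the
word categories.

```python
import hashlib

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative only
NUMBER_PLACEHOLDER = "<number>"               # hypothetical category word


def word_value_hash(word):
    """Hash all characters of a word into a 64-bit integer."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def word_hash_list(words, stop_words=STOP_WORDS):
    """Build the word hash list from already case-folded words."""
    hashes = []
    for word in words:
        if word in stop_words:
            continue  # exclude frequently occurring stop words
        if word.isdigit():
            word = NUMBER_PLACEHOLDER  # group a category under one hash
        hashes.append(word_value_hash(word))
    return hashes
```

With this grouping, two texts that differ only in a number (for
example a year) produce identical word hash lists, which is one way
the resulting fingerprints can tolerate small edits.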
[0039] The word hash list is finally used to generate a set of
fingerprints by operation of the fingerprint generation module 950.
In one embodiment, the fingerprint generation module is a computer
implemented software capable of performing arithmetic and logic
operations. Here, the software reads word-value hashes using a
sliding window of size W, reading W number of word-value hashes at
a given time. At each window instant, the software designates a
distinct-valued word-value hash as an anchor, and generates a new
fingerprint every time the anchor of the current window is not
identical to the anchor from the immediately preceding window. The
software computes the fingerprint by computing a new hash function
over all word-value hashes starting from the first word-value hash
of the current window up until the word-value hash corresponding to
the anchor of the current window. This method of fingerprinting
using word runs is advantageous over other methods because it
improves memory and resource efficiency by reducing the total
number of fingerprints that need to be stored in a fingerprint
database.
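The sliding-window procedure of this paragraph can be sketched in
Python as follows. Several details are assumptions made for the
example rather than statements of the claimed method: the anchor is
taken to be the smallest word-value hash in the window (leftmost on
ties), two anchors are considered identical when they sit at the same
position in the word hash list, and the fingerprint hash is SHA-256
over the 64-bit big-endian encoding of each word-value hash. The
function name `fingerprints` is hypothetical.

```python
import hashlib
import struct


def fingerprints(word_hashes, window_size):
    """Return fingerprints for a list of 64-bit word-value hashes."""
    result = []
    prev_anchor = None  # absolute position of the previous anchor
    last_start = max(len(word_hashes) - window_size + 1, 1)
    for start in range(last_start):
        window = word_hashes[start:start + window_size]
        if not window:
            break  # empty input
        # Assumption: anchor = minimum value, leftmost on ties.
        offset = min(range(len(window)), key=lambda i: window[i])
        anchor = start + offset
        if anchor != prev_anchor:
            # Hash from the first word-value hash of the window up to
            # and including the anchor.
            run = word_hashes[start:anchor + 1]
            data = b"".join(struct.pack(">Q", h) for h in run)
            result.append(hashlib.sha256(data).hexdigest())
            prev_anchor = anchor
    return result
```

For the word hash list [5, 3, 8, 1, 9, 7] and W = 3 there are four
window positions, but the last three share the same anchor (the value
1), so under these assumptions only two fingerprints are produced.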
[0040] FIG. 10 depicts an embodiment where the fingerprints
generated using the system explained in FIG. 9 can be stored in a
repository. Here, a receiver 1010, identical to the receiver 910
explained in FIG. 9, can be used to receive textual information.
Examples of such information include an organization's
confidential, secure, or any other important information that needs
to be protected from unauthorized disclosure. The fingerprints for
this information are generated using the word run based fingerprint
generation module 1020, which uses the steps described in the
fingerprint generation system explained in FIG. 9. The resulting
fingerprints are stored in a repository 1030 for later use.
Examples of a repository include recording the fingerprints in a
database, a network server, a local computer, or any other magnetic
or optical storage media.
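One way to record fingerprints in a database-backed repository is
sketched below using Python's built-in sqlite3 module. The table
layout, the `source` column, and the function names are assumptions
made for illustration; any of the storage options listed above could
serve instead.

```python
import sqlite3


def open_repository(path=":memory:"):
    """Open (or create) a fingerprint repository at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints "
        "(fp TEXT PRIMARY KEY, source TEXT)"
    )
    return conn


def register(conn, fps, source):
    """Store fingerprints, ignoring ones already registered."""
    conn.executemany(
        "INSERT OR IGNORE INTO fingerprints (fp, source) VALUES (?, ?)",
        [(fp, source) for fp in fps],
    )
    conn.commit()


def contains(conn, fp):
    """Return True if the fingerprint is in the repository."""
    row = conn.execute(
        "SELECT 1 FROM fingerprints WHERE fp = ?", (fp,)
    ).fetchone()
    return row is not None
```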
[0041] FIG. 10 provides another embodiment where the fingerprints
generated using the system of FIG. 9 can be matched and inspected
against a repository 1030 of fingerprints. In one
embodiment, the system receives textual information entered into a
computer by a user 1050, wherein the information may be entered
using one of several input devices. Examples of such input devices
include keyboards, microphones, scanners, pointing devices (e.g.,
mouse), etc. The fingerprint generating module 1060 generates
fingerprints for this information using the fingerprint generation
system explained in FIG. 9. The inspection module 1070 then accepts
the resulting fingerprints, which are then compared against the
bank of fingerprints stored in the repository 1030. In one
embodiment, a computer implemented software can be used to build
the inspection module, wherein the software code enables the module
to match the current fingerprint with a fingerprint in the
repository 1030 and report any successful matches.
[0042] The systems explained in FIGS. 9-11 and all their embodiments
relate to an apparatus for performing the operations herein. This
apparatus may be specially constructed for the required purposes,
or it may comprise a general purpose computer selectively activated
or reconfigured by a computer program stored in the computer. Such
a computer program may be stored in a computer readable storage
medium, such as, but not limited to, any type of disk including
floppy disks, optical disks, CD-ROMs, and magnetic-optical disks,
read-only memories (ROMs), random access memories (RAMs), EPROMs,
EEPROMs, magnetic or optical cards, or any type of media suitable
for storing electronic instructions, each coupled to a computer
system.
[0043] The algorithms and software presented herein are not
inherently related to any particular computer or other apparatus.
Various general purpose systems may be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct more specialized apparatus to perform the required method
steps. The required structure for a variety of these systems will
appear from other portions of this description. In addition, the
present invention is not described with reference to any particular
programming language, and various embodiments may thus be
implemented using a variety of programming languages.
[0044] In addition to the above mentioned examples, various other
modifications and alterations of the invention may be made without
departing from the invention. Accordingly, the above disclosure is
not to be considered as limiting and the appended claims are to be
interpreted as encompassing the true spirit and the entire scope of
the invention.
* * * * *