U.S. patent application number 13/172998 was filed with the patent office on 2011-06-30 and published on 2011-10-27 for method and a system for information identification.
This patent application is currently assigned to PortAuthority Technologies Inc. Invention is credited to Ofir CARNY, Ariel PELED, Lidror TROYANSKY.
Application Number | 13/172998 |
Publication Number | 20110264637 |
Family ID | 33101332 |
Filed Date | 2011-06-30 |
United States Patent Application | 20110264637 |
Kind Code | A1 |
PELED; Ariel; et al. | October 27, 2011 |
METHOD AND A SYSTEM FOR INFORMATION IDENTIFICATION
Abstract
A method for detecting an information item within an information
sequence obtained from a digital medium, said information item
comprising any one of a specified set of prestored information
items, comprising: transforming each of the set of prestored
information items into a respective representation, in accordance
with a predetermined transformation format; transforming the
information sequence obtained from the digital medium, in
accordance with the transformation format; and determining the
presence of one or more of the prestored information items within
the transformed information sequence, utilizing the respective
representation, wherein the information items are divided into
sets, and applying a security policy that depends on the number of
detected information items that belong to the same set.
Inventors: | PELED; Ariel; (Even-Yehuda, IL); CARNY; Ofir; (Kochav-Yair, IL); TROYANSKY; Lidror; (Givataim, IL) |
Assignee: | PortAuthority Technologies Inc., San Diego, CA |
Family ID: | 33101332 |
Appl. No.: | 13/172998 |
Filed: | June 30, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10815764 | Apr 2, 2004 | 7991751 |
13172998 | | |
60459372 | Apr 2, 2003 | |
Current U.S. Class: | 707/698; 707/E17.009; 726/1 |
Current CPC Class: | G06F 7/02 20130101 |
Class at Publication: | 707/698; 726/1; 707/E17.009 |
International Class: | G06F 7/00 20060101 G06F007/00; G06F 21/00 20060101 G06F021/00 |
Claims
1. A method for detecting an information item within an information
sequence obtained from a digital medium, said information item
comprising any one of a specified set of prestored information
items, comprising: transforming each of said set of prestored
information items into a respective representation, in accordance
with a predetermined transformation format; transforming said
information sequence obtained from said digital medium, in
accordance with said transformation format; determining the
presence of one or more of said prestored information items within
said transformed information sequence, utilizing said respective
representation, wherein said information items are divided into
sets, applying a security policy upon the detection of said
information item in said information sequence, and wherein said
security policy depends on the number of detected information items
that belong to the same set.
2. A method according to claim 1 wherein each of said sets
comprises information items associated with a single
individual.
3. A method according to claim 1, wherein a type of said
information item comprises one of a group of types comprising: a
word, a phrase, a number, a credit-card number, a social security
number, a name, an address, an email address, and an account
number.
4. A method according to claim 1, wherein said information sequence
is provided over a digital traffic channel.
5. A method according to claim 4, wherein said digital traffic
channel comprises one of a group of channels comprising: email,
instant messaging, peer-to-peer network, fax, and a local area
network.
6. A method according to claim 1, wherein said information sequence
comprises the body of an email, or wherein said information
sequence comprises an email attachment.
7. A method according to claim 1, further comprising retrieving
said information sequence from a digital storage medium.
8. A method for detecting an information item within an information
sequence obtained from a digital medium, said information item
comprising any one of a specified set of prestored information
items, comprising: transforming each of said set of prestored
information items into a respective representation, in accordance
with a predetermined transformation format; transforming said
information sequence obtained from said digital medium, in
accordance with said transformation format; determining the
presence of one or more of said prestored information items within
said transformed information sequence, utilizing said respective
representation, wherein said information item comprises a sequence
of sub-items, wherein said sub-items are separated by delimiters,
the delimiters being resilient to reordering of the sub-items,
wherein said transforming comprises: applying a first hashing
function to assign a respective preliminary hash value to each
sub-item within said information item; and applying a second
hashing function to assign a global hash value to said
preliminary hash values of said sub-items, wherein said information
sequence comprises at least two sub-sequences, and wherein said
determining comprises: applying said first hashing function to
assign a respective preliminary hash value to each of said
sub-sequences; applying said second hashing function to at least
one of said preliminary hash values to assign a global hash value
to at least one of said sub-sequences; and comparing said global
hash value to hash values of said sub-sequences, wherein said
second hash function is invariant to reordering of at least two of
said sub-sequences within a respective sub-item.
9. A method according to claim 8, wherein a sub-item comprises one
of a group comprising: a word, a number, and a character
string.
10. A method according to claim 8, wherein said determining
comprises using a state machine operable to detect said sequence of
delimited sub-items within said information sequence.
11. A method according to claim 8, wherein said determining
comprises: applying said first hashing function to assign a
respective preliminary hash value to each of said sub-sequences;
applying said second hashing function to at least one of said
preliminary hash values to assign a global hash value to at least
one of said sub-sequences; and comparing said global hash value to
hash values of said sub-sequences.
12. A method according to claim 11, wherein said sub-sequences
comprise one of a group comprising: a word, a number, and a
character string.
13. A method according to claim 11, wherein said plurality of
sub-sequences comprises a plurality of ordered combinations of
sub-sequences within said data sequence.
14. A method according to claim 12, wherein said plurality of
series comprises a plurality of combinations of sub-sequences
within said data sequence.
15. A method according to claim 8, further comprising checking
whether said delimited segment was previously stored, and
continuing said detection process only if the current delimited
segment was previously stored.
16. A method according to claim 8, wherein a type of said
information item comprises one of a group of types comprising: a
word, a phrase, a number, a credit-card number, a social security
number, a name, an address, an email address, and an account
number.
17. A method according to claim 8, wherein said information
sequence is provided over a digital traffic channel.
18. A method according to claim 17, wherein said digital traffic
channel comprises one of a group of channels comprising: email,
instant messaging, peer-to-peer network, fax, and a local area
network.
19. A method according to claim 8, wherein said information
sequence comprises the body of an email, or wherein said
information sequence comprises an email attachment.
20. A method according to claim 8, further comprising retrieving
said information sequence from a digital storage medium.
21. A method for detecting an information item within an
information sequence obtained from a digital medium, said
information item comprising any one of a specified set of prestored
information items, comprising: transforming each of said set of
prestored information items into a respective representation, in
accordance with a predetermined transformation format; transforming
said information sequence obtained from said digital medium, in
accordance with said transformation format; determining the
presence of one or more of said prestored information items within
said transformed information sequence, utilizing said respective
representation, further comprising applying a policy upon the
detection of said information item in said information sequence,
wherein said policy is a security policy, said security policy
comprises at least one of the following group of security policies:
blocking transmission, logging a record of said detection and
detection details, and reporting said detection and detection
details.
22. An apparatus for detecting an information item within an
information sequence, said information item being any one of a
specified set of data items, comprising: a preprocessor, for
transforming said information item into a representation, in
accordance with a transformation format; a scanner, for scanning
said information sequence to identify sub-sequences; and a
comparator associated with said preprocessor and said scanner, for
comparing said representation to said sub-sequences to determine
the presence of said specified information item within said
information sequence; and a non-existence module comprising: an
encoder, for encoding said sub-sequences and said data item with an
encoding function to respective integers, each of said integers
being no greater than the size of said array; and an array setter
associated with said encoder, for setting indicators in an array of
indicators in accordance with said encoded sub-sequences; and a
status checker associated with said encoder and said array setter,
for determining the status of an indicator corresponding to said
data item.
23. The apparatus of claim 22, wherein said array of indicators is
associated with a respective encoding function for encoding a data
item into an integer no greater than the size of said respective
array.
24. The apparatus of claim 23, wherein each member of said list is
encodable with said respective encoding function; and wherein a
corresponding indicator is settable for each of said encoded
members; said specified data item is encodable with each of said
encoding functions; and the apparatus being configured to determine
for each of said encoded data items, the status of the
corresponding indicator in said array.
25. A method for determining the absence of a specified data item
from a list of data items, comprising: providing a plurality of
initialized arrays of indicators, each of said arrays being
associated with a respective encoding function for encoding a data
item into an integer no greater than the size of said respective
array; for each of said arrays, performing: encoding each member of
said list with said respective encoding function; and setting a
corresponding indicator for each of said encoded members; encoding
said specified data item with each of said encoding functions; and
for each of said encoded data items, determining the status of the
corresponding indicator in said respective array.
26. A method according to claim 25, wherein the size of each of
said arrays is greater than the number of items in said list.
27. A method according to claim 25, wherein at least one of said
encoding functions comprises a hashing function.
28. A method according to claim 25, wherein a data item comprises a
string of alphanumeric characters.
Description
RELATED APPLICATION/S
[0001] This application is a divisional of U.S. patent application
Ser. No. 10/815,764 filed on Apr. 2, 2004, which claims the benefit
of priority under 35 USC 119(e) of U.S. Provisional Patent
Application No. 60/459,372 filed on Apr. 2, 2003. The contents of
all of the above applications are incorporated by reference as if
fully set forth herein.
FIELD OF THE INVENTION
[0002] The present invention relates generally to the field of
analysis of digital information. More specifically, the present
invention deals with methods for fast identification of information
items within electronic traffic and digital media.
BACKGROUND OF THE INVENTION
[0003] The information and knowledge created and accumulated by
organizations and businesses are among their most valuable assets. As
such, managing and keeping the information and the knowledge inside
the organization and restricting its distribution outside is of
paramount importance for almost any organization, government entity
or business, and provides significant leverage of its value. Most
of the information in modern organizations and businesses is
represented in a digital format. Digital content can be easily
copied and distributed (e.g., via e-mail, instant messaging,
peer-to-peer networks, FTP and web-sites), which greatly increases
hazards such as business espionage and data leakage. It is
therefore essential to monitor the information traffic in order to
keep the information unavailable to unauthorized persons.
[0004] Various bills and regulations within the United States of
America and other countries add another level of importance to
the problem of confidential information management and control.
Regulations within the United States of America, such as the Health
Insurance Portability and Accountability Act (HIPAA), the
Gramm-Leach-Bliley Act (GLBA) and the Sarbanes-Oxley Act (SOXA),
imply that the information assets within organizations should be
monitored and subjected to an information management policy, in
order to protect clients' privacy and to mitigate the risks of
potential misuse and fraud. In particular, the existence of covert
channels of information, which can serve conspiracies to commit
fraud or other illegal activities, poses a severe risk from both legal
and business perspectives.
[0005] Another aspect of the information management problem is to
make the information explicitly available to authorized persons
whenever needed, so that it can be utilized in order to create
value for the organization. This aspect also requires tracking the
information along its life cycle.
[0006] Methods that attempt to track digital information and manage
information and knowledge exist. One of the most prevalent methods
is based on key-words and key-phrases filtering: in this case, the
system attempts to recognize a pre-defined set of previously stored
information items, such as key-words, numbers and key-phrases,
within the content, utilizing string comparison algorithms. Such
methods are in wide use, e.g., for email filtering utilizing
string matching. However, the use of such methods may become
prohibitively slow when the number of stored information items is
large.
[0007] There is thus a recognized need for, and it would be highly
advantageous to have, a method and system that allow fast and
efficient recognition of a large number of keywords and key phrases
within electronic traffic, which will overcome the drawbacks of
current methods as described above.
SUMMARY OF THE INVENTION
[0008] It is an object of the present invention to provide a method
and a system that facilitates fast and efficient detection and
identification of a large number of previously stored information
and data items, such as words, key-phrases, credit-card numbers,
social security numbers, names, addresses, email address, account
numbers, and other strings within electronic traffic.
[0009] According to a first aspect of the present invention, there
is provided a method for detecting an information item within an
information sequence obtained from a digital medium, said
information item comprising any one of a specified set of prestored
information items, comprising:
[0010] transforming each of said set of prestored information items
into a respective representation, in accordance with a
predetermined transformation format;
[0011] transforming said information sequence obtained from said
digital medium, in accordance with said transformation format;
[0012] determining the presence of one or more of said prestored
information items within said transformed information sequence,
utilizing said respective representation, wherein said information
items are divided into sets, applying a security policy upon the
detection of said information item in said information sequence,
and wherein said security policy depends on the number of detected
information items that belong to the same set.
[0013] In a preferred embodiment of the present invention the
method further comprises storing the representations in a
database.
[0014] In another preferred embodiment of the present invention the
method further comprises sorting the representations into a sorted
list.
[0015] In another preferred embodiment of the present invention the
sorting is in accordance with a tree-sorting algorithm.
[0016] In another preferred embodiment of the present invention,
the information item comprises a single word.
[0017] In another preferred embodiment of the present invention the
information item comprises a sequence of words.
[0018] In another preferred embodiment of the present invention the
information item comprises a delimited sequence of sub-items.
[0019] In a preferred embodiment of the present invention each of
the sub-items comprises a sequence of alphanumeric characters.
[0020] In another preferred embodiment of the present invention, a
type of the information item comprises one of a group of types
comprising: a word, a phrase, a number, a credit-card number, a
social security number, a name, an address, an email address, and
an account number.
[0021] In another preferred embodiment of the present invention the
information sequence is provided over a digital traffic
channel.
[0022] In another preferred embodiment of the present invention the
digital traffic channel comprises one of a group of channels
comprising: email, instant messaging, peer-to-peer network, fax,
and a local area network.
[0023] In another preferred embodiment of the present invention,
the information sequence comprises the body of an email.
[0024] In another preferred embodiment of the present invention,
the information sequence comprises an email attachment.
[0025] In another preferred embodiment of the present invention the
method further comprises retrieving the information sequence from
a digital storage medium.
[0026] In another preferred embodiment of the present invention the
digital storage medium comprises a digital cache memory.
[0027] In another preferred embodiment of the present invention the
representation depends only on the textual and numeric content of
the information item.
[0028] In another preferred embodiment of the present invention the
transforming comprises Unicode encoding.
[0029] In another preferred embodiment of the present invention the
transforming comprises converting all characters to upper-case
characters or to lower-case characters.
[0030] In another preferred embodiment of the present invention the
transforming comprises encoding an information item into a numeric
representation.
[0031] In another preferred embodiment of the present invention the
method further comprises applying a first hashing function to the
representations.
[0032] In another preferred embodiment of the present invention the
information sequence comprises sub-sequences.
[0033] In another preferred embodiment of the present invention the
sub-sequences are separated by delimiters.
[0034] In another preferred embodiment of the present invention the
sub-sequences separated by delimiters are any of: words, names, and
numbers.
[0035] In another preferred embodiment of the present invention the
method further comprises scanning the information sequence to
identify the sub-sequences.
[0036] In another preferred embodiment of the present invention the
determining is performed by matching the information item to an
ordered series of the sub-sequences.
[0037] In another preferred embodiment of the present invention the
method further comprises applying a policy upon the detection of
the information item in the information sequence.
[0038] In another preferred embodiment of the present invention the
policy is a security policy, and the security policy comprises at least
one of the following group of security policies: blocking the
transmission, logging a record of the detection and detection
details, and reporting the detection and detection details.
[0039] In another preferred embodiment of the present invention the
information items are divided into sets, and wherein the security
policy depends on the number of detected information items that
belong to the same set.
[0040] In another preferred embodiment of the present invention
each of the sets comprises information items associated with a
single individual.
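The set-dependent policy of the two preceding embodiments can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration only: the mapping `ITEM_TO_SET`, the threshold parameter, and the action labels "block", "log" and "allow" are hypothetical, and the text does not prescribe how the count is mapped to an action.

```python
from collections import Counter

# Hypothetical mapping from a stored item to the set it belongs to
# (here each set stands for one individual). All values are made up.
ITEM_TO_SET = {
    "jane doe": "person-17",
    "123456789": "person-17",
    "4111111111111111": "person-17",
    "john roe": "person-42",
}


def choose_policy(detected_items, block_threshold=2):
    """Pick an action from the largest number of hits sharing one set."""
    # Count each distinct detected item once, grouped by its set.
    hits_per_set = Counter(
        ITEM_TO_SET[item] for item in set(detected_items) if item in ITEM_TO_SET
    )
    worst = max(hits_per_set.values(), default=0)
    if worst >= block_threshold:
        return "block"   # several items of the same individual detected
    if worst >= 1:
        return "log"     # isolated items only
    return "allow"       # nothing detected
```

Under these assumptions, a name appearing together with the same person's account number is treated more severely than two unrelated names, which is one plausible reading of a policy that depends on the number of detected items belonging to the same set.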
[0041] In another preferred embodiment of the present invention the
information item comprises a sequence of sub-items.
[0042] In another preferred embodiment of the present invention the
sub-items are separated by delimiters.
[0043] In another preferred embodiment of the present invention a
sub-item comprises one of a group comprising: a word, a number, and
a character string.
[0044] In another preferred embodiment of the present invention the
determining comprises using a state machine operable to detect the
sequence of delimited sub-items within the information
sequence.
[0045] In another preferred embodiment of the present invention the
transforming comprises:
[0046] applying a first hashing function to assign a respective
preliminary hash value to each sub-item within the information
item; and
[0047] applying a second hashing function to assign a global
hash value to the information item based on the preliminary hash
values of the sub-items.
[0048] In another preferred embodiment of the present invention the
information sequence comprises sub-sequences, and wherein the
determining comprises:
[0049] applying the first hashing function to assign a respective
preliminary hash value to each of the sub-sequences;
[0050] applying the second hashing function to at least one of the
preliminary hash values to assign a global hash value to the at
least one of the sub-sequences; and
[0051] comparing the global hash value to the hash values of the
series.
[0052] In another preferred embodiment of the present invention the
sub-sequences comprise one of a group comprising: a word, a number,
and a character string.
[0053] In another preferred embodiment of the present invention the
plurality of series comprises a plurality of ordered combinations
of sub-sequences within the information or data sequence.
[0054] In another preferred embodiment of the present invention the
plurality of series comprises a plurality of combinations of
sub-sequences within the information or data sequence.
[0055] In another preferred embodiment of the present invention the
second hash function is invariant to reordering of at least two of
the sub-sequences.
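A minimal sketch of the two-stage hashing described above, in which the second hash is invariant to reordering. The choice of SHA-256, the 64-bit truncation, the sorting step used to obtain order invariance, and the names `first_hash`, `global_hash`, `stored` and `contains_item` are all assumptions made for illustration; the text does not name particular hashing functions.

```python
import hashlib


def first_hash(sub_item: str) -> int:
    """Preliminary hash of one sub-item (64 bits taken from SHA-256)."""
    digest = hashlib.sha256(sub_item.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def global_hash(sub_items) -> int:
    """Second hash over the preliminary values.

    Sorting the preliminary values first makes the result independent
    of the order in which the sub-items appear.
    """
    prelim = sorted(first_hash(s) for s in sub_items)
    joined = b"".join(p.to_bytes(8, "big") for p in prelim)
    return int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")


# Stored representation of one hypothetical three-word information item.
stored = {global_hash("john q smith".split())}


def contains_item(text: str, n_words: int) -> bool:
    """Slide a window of n_words delimited sub-sequences over the text."""
    words = text.split()
    return any(
        global_hash(words[i:i + n_words]) in stored
        for i in range(len(words) - n_words + 1)
    )
```

With this construction, "Smith John Q" and "john q smith" yield the same global hash value, so a reordered name would still be matched against the stored representation.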
[0056] In another preferred embodiment of the present invention the
method further comprises checking whether the delimited segment
was previously stored, and continuing the detection process only if
the current delimited segment was previously stored.
According to a second aspect of the present invention, a method for
determining the absence of a specified information or data item from
a list of information or data items is presented. The method
comprises:
[0057] (a) providing an initialized array of indicators;
[0058] (b) for each member of the list, performing:
[0059] (i) encoding the member with an encoding function to an
integer no greater than the size of the array; and
[0060] (ii) setting a corresponding indicator;
[0061] (c) encoding the specified information or data item with the
encoding function; and
[0062] (d) determining the status of an indicator corresponding to
the encoded information or data item.
[0063] In another preferred embodiment of the present invention a
size of the array is greater than the number of items in the
list.
[0064] In another preferred embodiment of the present invention the
encoding function comprises a hashing function.
[0065] In another preferred embodiment of the present invention the
information item comprises a string of alphanumeric characters.
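The single-array method of the second aspect can be sketched as follows. The use of CRC-32 as the encoding function and the names `build_indicators` and `definitely_absent` are assumptions for illustration only. An unset indicator proves that the item is not in the list, whereas a set indicator proves nothing, because two different items may be encoded to the same integer.

```python
import zlib


def build_indicators(items, size):
    """Set one indicator per list member in an initialized array."""
    indicators = bytearray(size)  # all indicators start cleared
    for item in items:
        # Assumed encoding function: CRC-32 reduced to an array index.
        indicators[zlib.crc32(item.encode("utf-8")) % size] = 1
    return indicators


def definitely_absent(item, indicators):
    """True only when the item's indicator was never set."""
    index = zlib.crc32(item.encode("utf-8")) % len(indicators)
    return indicators[index] == 0
```

Choosing an array larger than the list, as in the embodiment above, keeps most indicators cleared, so most items that are not in the list can be rejected with a single lookup.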
[0066] According to a third aspect of the present invention, a
method for determining the absence of a specified information or
data item from a list of information or data items is presented.
The method comprises:
[0067] (a) providing a plurality of initialized arrays of indicators,
each of the arrays being associated with a respective encoding
function for encoding an information or data item into an integer no
greater than the size of the respective array;
[0068] (b) for each of the arrays, performing:
[0069] (i) encoding each member of the list with the respective
encoding function; and
[0070] (ii) setting a corresponding indicator for each of the
encoded members;
[0071] (c) encoding the specified information or data item with
each of the encoding functions; and, for each of the encoded
information or data items, determining the status of the
corresponding indicator in the respective array.
[0072] In another preferred embodiment of the present invention the
size of each of the arrays is greater than the number of items in
the list.
[0073] In another preferred embodiment of the present invention at
least one of the encoding functions comprises a hashing
function.
[0074] In another preferred embodiment of the present invention the
information or data item comprises a string of alphanumeric
characters.
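The multi-array variant of the third aspect can be sketched in the same spirit. The class name `NonExistenceFilter`, the salted SHA-256 encoding, and the default sizes are hypothetical choices; the structure resembles what is commonly called a partitioned Bloom filter, although the text does not use that term.

```python
import hashlib


def encode(item: str, k: int, size: int) -> int:
    """Assumed k-th encoding function: SHA-256 salted with the array index."""
    digest = hashlib.sha256(f"{k}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


class NonExistenceFilter:
    """Several indicator arrays, each with its own encoding function."""

    def __init__(self, members, num_arrays=3, size=1024):
        self.size = size
        self.arrays = [bytearray(size) for _ in range(num_arrays)]
        for k, array in enumerate(self.arrays):
            for member in members:
                array[encode(member, k, size)] = 1

    def definitely_absent(self, item: str) -> bool:
        # One cleared indicator in any array proves the item is not listed.
        return any(
            array[encode(item, k, self.size)] == 0
            for k, array in enumerate(self.arrays)
        )
```

Using several independent encoding functions lowers the chance that an unlisted item finds every one of its indicators set by coincidence, at the cost of one extra lookup per array.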
[0075] In a preferred embodiment of the present invention an
apparatus for detecting an information item within an information
sequence, the information item being any one of a specified set of
information or data items, is presented. The apparatus
comprises:
[0076] a preprocessor, for transforming the information item into a
representation, in accordance with a transformation format;
[0077] a scanner, for scanning the information sequence to identify
sub-sequences; and
[0078] a comparator associated with the preprocessor and the
scanner, for comparing the representation to the sub-sequences to
determine the presence of the specified information item within the
information sequence.
[0079] In a preferred embodiment of the present invention the
apparatus for detecting a specified information item within an
information sequence further comprises a user interface for
inputting the information items.
[0080] In a preferred embodiment of the present invention the
scanner is further operable to transform the
information sequence in accordance with the transformation
format.
[0081] In a preferred embodiment of the present invention the
scanner is further operable to transform the sub-sequences in
accordance with the transformation format.
[0082] In a preferred embodiment of the present invention the
apparatus further comprises an information storage or a database
for storing a representation of each information or data item of
the set.
[0083] In a preferred embodiment of the present invention the
information sequence is obtained from a digital medium.
[0084] In a preferred embodiment of the present invention the
apparatus further comprises a sorter, for forming a sorted list of
the respective representations of the set of information or data
items.
[0085] In a preferred embodiment of the present invention the type
of the information item comprises one of a group of types
comprising: a word, a phrase, a number, a credit-card number, a
social security number, a name, an address, an email address, and
an account number.
[0086] In a preferred embodiment of the present invention the
information sequence is provided over a digital traffic
channel.
[0087] In a preferred embodiment of the present invention the
apparatus is further operable to retrieve the information sequence
from a digital storage medium.
[0088] In a preferred embodiment of the present invention the
digital storage medium comprises a digital storage medium within a
proxy server.
[0089] In a preferred embodiment of the present invention the
apparatus further comprises a non-existence module comprising:
[0090] an encoder, for encoding the sub-sequences and the
information or data item with an encoding function to respective
integers, each of the integers being no greater than the size of
the array;
[0091] an array setter associated with the encoder, for setting
indicators in an array of indicators in accordance with the encoded
sub-sequences; and
[0092] a status checker associated with the encoder and the array
setter, for determining the status of an indicator corresponding to
the information or data item.
[0093] In a preferred embodiment of the present invention the
encoding function comprises a hashing function.
[0094] The present invention successfully addresses the
shortcomings of the presently known configurations by providing a
method and system that facilitates fast and efficient detection and
identification of a large number of previously stored information
and data items, which can efficiently serve digital privacy and
confidentiality enforcement as well as knowledge management.
BRIEF DESCRIPTION OF THE DRAWINGS
[0095] The invention is herein described, by way of example only,
with reference to the accompanying drawings. In the drawings:
[0096] FIG. 1 illustrates a system for fast detection of keywords,
constructed and operative according to a preferred embodiment of
the present invention.
[0097] FIG. 2 illustrates a system, substantially similar to the
one described in FIG. 1, which also includes a fast-proof of
non-existence module.
[0098] FIG. 3 illustrates a method for fast-proof of non-existence
of items in a database, operative according to a preferred
embodiment of the present invention.
[0099] FIG. 4 illustrates a system, substantially similar to the
one described in FIG. 2, which also includes a cache filter,
operable to filter out a short list of items.
[0100] FIG. 5 contains some examples of a tree-based
data structure that facilitates detection of multi-word
key-phrases.
[0101] FIG. 6 is a flowchart illustrating an algorithm for fast
detection of key-phrases, according to a preferred embodiment of the
present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0102] The present invention describes a method and a system for
detection of a large number of previously stored information items,
such as words, phrases, numbers, credit-card numbers, social
security numbers, names, addresses, email addresses, account
numbers and other pre-defined strings of characters, within
an information sequence (such as textual documents) in digital media
and electronic traffic (e.g., emails), particularly but not
exclusively.
[0103] According to a first aspect of the present invention, the
method comprises pre-processing of the information items; storing
them in a manner that facilitates fast comparison, and then
performing sequential analysis of the inspected information
sequence, preferably utilizing the delimiters within the
information sequence (such as spaces between words) and comparing
each of the delimited segments (e.g. each word or sequence of words
within a textual document) with the pre-processed information
items.
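The pre-process, store, and scan flow of this paragraph can be sketched for the single-word case. The function names `canonize`, `preprocess` and `scan`, the whitespace-only delimiter rule, and the use of a hash-based set as the fast-comparison store are assumptions for illustration; the canonical form follows the lower-case, alphanumeric-only embodiment and is only one of the pre-processing options the text mentions.

```python
import re


def canonize(text: str) -> str:
    """Lower-case the text and drop non-alphanumeric characters."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def preprocess(items):
    """Store the canonical forms in a set for constant-time lookup."""
    return {canonize(item) for item in items}


def scan(sequence: str, stored):
    """Return the delimited segments whose canonical form was stored."""
    found = []
    for segment in re.split(r"\s+", sequence):  # whitespace as delimiter
        if segment and canonize(segment) in stored:
            found.append(segment)
    return found
```

Because each segment costs one lookup, the scanning time in this sketch does not grow with the number of stored items, which is the property the method is after. Multi-word items would need additional handling, such as the key-phrase detection of FIG. 5 and FIG. 6.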
[0104] With specific reference now to the drawings in detail, it is
stressed that the particulars shown are by way of example and for
purposes of illustrative discussion of the preferred embodiments of
the present invention only, and are presented in the cause of
providing what is believed to be the most useful and readily
understood description of the principles and conceptual aspects of
the invention. In this regard, no attempt is made to show
structural details of the invention in more detail than is
necessary for a fundamental understanding of the invention, the
description taken with the drawings making apparent to those
skilled in the art how the several forms of the invention may be
embodied in practice.
[0105] Reference is now made to FIG. 1, which illustrates a system
for fast detection of previously stored information items, such as
keywords, numbers and key-phrases, within a digital medium,
constructed and operative according to a preferred embodiment of
the present invention. The information items insertion module 110
allows a user to insert keywords, key-phrases, numbers and character
strings, preferably using a graphical user interface (GUI). The
items are first pre-processed by the pre-processor 120. In a
preferred embodiment of the present invention, the pre-processing
comprises transforming the stored information items into a
"canonized" form, in which they are represented in a lowercase (or
uppercase) Unicode representation. In a preferred embodiment of the
present invention, all non-alpha-numeric characters are omitted. In
another preferred embodiment of the present invention the
pre-processing comprises transforming the stored information items
to their base form, whenever possible (e.g., by transforming verbs
to "present simple" form, removing suffixes such as "'s" and "ly",
reducing to phonetic representation etc.). In another preferred
embodiment of the present invention the pre-processing comprises
encoding the information items into a numeric representation in a
manner that facilitates fast detection, as explained below. In
another preferred embodiment of the present invention, the numeric
representation depends only on the textual and numeric content of
the information item. The pre-processed items are thereafter
preferably sorted and stored at the storage 130. A digital content
to be analyzed 140 is thereafter preferably pre-processed, as
explained below, by the pre-processor 145, and then scanned by the
content scanner 150, preferably utilizing the existing delimiters
(e.g., spaces between words) in order to facilitate faster scanning.
After each delimiter, the comparator 160 efficiently compares the
sequence which started at that delimiter (usually a word, a number
or a sequence of words and numbers) with the sorted items in the
storage 130, preferably using one of the methods and algorithms
described below. In a preferred embodiment of the present
invention, the storage 130 is a database that facilitates efficient
queries.
[0106] In a preferred embodiment of the present invention, the
information within the digital medium is first pre-processed and
transformed into a representation that facilitates fast comparison
with the stored information items. In a preferred embodiment of the
present invention, the pre-processing comprises transforms that are
applied on the to-be-detected information items, such as:
[0107] transforming the stored information items into a "canonized"
form, in which they are represented in a lowercase (or uppercase)
Unicode representation;
[0108] omitting all non-alpha-numeric characters;
[0109] transforming the stored information items to their base
form, whenever possible (e.g., by transforming verbs to "present
simple" form, removing suffixes such as "s" and "ly", reducing to
phonetic representation etc.);
[0110] encoding the information items into a numeric representation
in a manner that facilitates fast detection, as explained below.
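The canonization and numeric-encoding transforms listed above can be sketched as follows. This is a minimal illustration only: the function names `canonize` and `encode`, the NFKC normalization step and the choice of CRC32 as the numeric encoding are assumptions made for the example (CRC32 is mentioned later in the text only as one possible hash), and reduction to base form is omitted.

```python
import unicodedata
import zlib


def canonize(item: str) -> str:
    """Lowercase the item and omit every non-alphanumeric character."""
    # Assumption: NFKC normalization is applied first so that visually
    # equivalent Unicode forms canonize to the same string.
    normalized = unicodedata.normalize("NFKC", item)
    return "".join(ch for ch in normalized.lower() if ch.isalnum())


def encode(item: str) -> int:
    """Numeric representation that depends only on the canonized content."""
    # Assumption: CRC32 of the UTF-8 bytes serves as the numeric encoding.
    return zlib.crc32(canonize(item).encode("utf-8"))


print(canonize("Credit-Card No. 1234 5678"))
print(encode("John  DOE") == encode("john doe"))
```

Because the same transforms are applied to the stored items and to the scanned content, two strings that differ only in case, spacing or punctuation receive the same numeric representation.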
[0111] In a preferred embodiment of the present invention, the
digital medium comprises a digital traffic channel, and information
items that were found within the digital traffic are then used in
order to apply a policy with respect to the traffic within the
channel. In a preferred embodiment of the present invention, the
digital traffic channel comprises email (both email body messages
and attachments), and a policy, such as security policy, is applied
with respect to the emails, as described, e.g., in PCT patent
application number IL02/00037, in U.S. Patent Application No.
20020129140, filed Dec. 6, 2001, and in US provisional patent
application 60/475,492, filed Jun. 4, 2003, the contents of which
are hereby incorporated herein by reference in their entirety. In a
preferred embodiment of the present invention the security policy
comprises actions such as blocking the transmission, logging a
record of the detection and detection details, and reporting the
detection and its details.
[0112] In a preferred embodiment of the present invention, the
information items are divided into sets, and the security policy
depends on the number of detected information items that belong to
the same set. These sets may comprise, e.g., information items
associated with a single individual, such as her or his name, her
or his social security number, her or his address, her or his
bank-account number, etc. and the policy may preclude dissemination
of any two or more of these items via email.
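A set-dependent policy of this kind might be expressed as in the sketch below. The function name `policy_action`, the mapping `item_to_set`, the threshold parameter and the two action labels are all hypothetical and serve only to illustrate counting detections per set.

```python
from collections import Counter


def policy_action(detected, item_to_set, threshold=2):
    """Return "block" when `threshold` or more distinct detected items
    belong to the same set, and "allow" otherwise."""
    # Count each distinct detected item once, grouped by the set it belongs to.
    per_set = Counter(
        item_to_set[item] for item in set(detected) if item in item_to_set
    )
    return "block" if any(n >= threshold for n in per_set.values()) else "allow"


# Hypothetical sets: each set groups the items of one individual.
item_to_set = {
    "john doe": "person-1",
    "123-45-6789": "person-1",
    "jane smith": "person-2",
}
print(policy_action(["john doe", "123-45-6789"], item_to_set))
```

With the default threshold of two, a name alone passes, while a name together with the social security number of the same individual is blocked.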
[0113] In a preferred embodiment of the present invention, the
digital medium comprises digital storage, and the system is
operable to detect information items within the information stored
in the storage, e.g., in order to detect keywords and keyphrases
within a file system or within a proxy server. Such detection can
be of importance both for applying a security policy and for
information and knowledge management within the organization.
[0114] Reference is now made to FIG. 2, which illustrates a system,
substantially similar to the one described in FIG. 1, where a fast
proof of non-existence module 155 is introduced between the scanner
150 and the comparator 160. The proof of non-existence module is
operable to prove, with a probability P, that a certain item does
not exist in the list in the storage 130, thereby significantly
reducing the number of queries to the storage 130.
[0115] Reference is now made to FIG. 3, which illustrates a method
for proof of non-existence of items in a database, operative
according to a preferred embodiment of the present invention. The
input item 310 is preferably transformed into a numeric
representation, by numeric encoding 320. The numeric representation
is then subjected to one or more hash functions $h_i$ 330 that
transform the numeric representation $x$ of stage 320 into an L-bit
long number $h_i(x)$, where the distribution of the numbers is
preferably close to uniform over the range 1 to $2^L$. Array set 340
contains a corresponding array $\alpha_i$, of length $2^L$, for
each of the hash functions $h_i$. The elements of the arrays are
bits, which are all initialized to zero. The element
of the array $\alpha_i$ at the address $h_i(x)$ is then set
to 1, indicating the existence of the element $x$.
[0116] Since the mapping of elements to addresses in the array is
quasi-random, there is always the possibility of collisions between
two different items, i.e., that $h_i(x_1) = h_i(x_2)$
while $x_1 \neq x_2$. The probability that at least one
event of that kind will happen becomes close to one when the number
of items becomes substantially greater than the square root of the
number of addresses (i.e., $2^{L/2}$), a phenomenon known as the
birthday problem. It is therefore not possible to positively
indicate the existence of a certain item. However, if there is a 0
in at least one of the corresponding arrays $\alpha_i$, then one
can tell for sure that the item does not exist. The method is
therefore able to determine the absence of the item from the
sequence, but cannot determine the presence of the item in the
sequence with 100% certainty. In a preferred embodiment of the
present invention, the search is stopped after the first 0 is
encountered. Each of the arrays can therefore be considered a
filter.
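The proof of non-existence described above can be sketched as a small class. The class name `NonExistenceFilter`, the use of salted SHA-256 as the family of hash functions and the storage of one byte per bit cell (chosen for readability rather than memory efficiency) are assumptions of this illustration, not details taken from the description.

```python
import hashlib


class NonExistenceFilter:
    """One bit array of 2**bits_per_array cells per hash function."""

    def __init__(self, bits_per_array: int, num_arrays: int):
        self.bits_per_array = bits_per_array
        # Assumption: one byte per cell keeps the sketch simple.
        self.arrays = [bytearray(2 ** bits_per_array) for _ in range(num_arrays)]

    def _address(self, index: int, item: str) -> int:
        # Assumption: hash function number `index` is SHA-256 salted with it.
        digest = hashlib.sha256(f"{index}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % (2 ** self.bits_per_array)

    def add(self, item: str) -> None:
        for index, array in enumerate(self.arrays):
            array[self._address(index, item)] = 1

    def may_contain(self, item: str) -> bool:
        for index, array in enumerate(self.arrays):
            if array[self._address(index, item)] == 0:
                return False  # a single 0 proves absence; stop at the first 0
        return True  # possibly present; must be confirmed against the storage


stored = ["john doe", "jane smith", "4111111111111111"]
item_filter = NonExistenceFilter(bits_per_array=10, num_arrays=3)
for entry in stored:
    item_filter.add(entry)
print(all(item_filter.may_contain(entry) for entry in stored))
```

A `False` answer is definitive, so the storage is queried only when `may_contain` returns `True`.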
[0117] The array's optimal length (and the number of bits in the
output of the hash function) is computed based on occupancy, the
optimum being 50% (see discussion below), which requires an array
size of approximately 1.42 times the number of items for a single
array, and a total size of approximately 1.42 times the number of
items in the list times the number of hash functions, for a set of
arrays with different hash functions. Each bit for which a
respective hashed item exists is given a 1 value (in the first case
this is done in the respective array).
[0118] Searching for an item is based on the fact that an item can
only exist if all the corresponding bits are 1. A process of
computing hash functions and checking the respective bits therefore
takes place; if any bit is 0, the item is not in the list of data
items. When the item's hash value contains more bits than L times the
number of arrays, and the different bits are statistically
independent, one can simply use "bit masks" as the hash functions
(i.e., selecting disjoint groups of bits from the item's hash
value). However, if the hash value does not contain enough bits, a
more substantially independent scheme, such as a hash function of
the basic hash function, is needed (although it might be slightly
less efficient).
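The "bit mask" approach can be illustrated as follows. The function name `masked_addresses` and the use of SHA-256 as the wide basic hash are assumptions for the sketch; any hash with enough statistically independent bits would serve.

```python
import hashlib


def masked_addresses(item: str, bits_per_array: int, num_arrays: int) -> list:
    """Split one wide hash value into disjoint bit groups, one per array."""
    needed = bits_per_array * num_arrays
    if needed > 256:
        # Not enough bits: a hash of the basic hash would be needed instead.
        raise ValueError("SHA-256 supplies only 256 bits")
    value = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")
    mask = (1 << bits_per_array) - 1
    # Group i takes bits [i * bits_per_array, (i + 1) * bits_per_array).
    return [(value >> (i * bits_per_array)) & mask for i in range(num_arrays)]


addresses = masked_addresses("john doe", bits_per_array=16, num_arrays=4)
print(len(addresses))
```

Only one hash evaluation per item is required, which is why this variant is attractive when the basic hash is wide enough.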
[0119] Following is an analysis illustrating why 50% occupancy is
optimal, along with some implementation considerations:
Defining the Following Parameters:
[0120] N: number of items in the database (DB)
[0121] L: length of filter arrays (1 bit is assigned to each
location in the array)
[0122] $D = \log_2 L$: the number of bits required to define a
location in the array.
$P = \frac{N}{L}$: "density of arrays".
[0123] X: number of arrays.
[0124] $K = XL$: the total number of assigned bits.
$C = \frac{K}{N}$: the number of total bits per item (the item "cost").
[0125] Now, assuming a uniform distribution, the probability of
collision with a specific item is $\frac{1}{L}$, and since $N = PL$,
the probability of no collisions in an array filter is:
$$\left(1 - \frac{1}{L}\right)^{N} = \left(1 - \frac{1}{L}\right)^{LP} \rightarrow \left(\frac{1}{e}\right)^{P} = e^{-P}$$
which is the probability of a negative result for a negative input
from one array filter.
[0126] The failure probability of a single array filter is
therefore $1 - e^{-P}$.
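[0127] And since:
$$L = \frac{K}{X}, \qquad P = \frac{N}{L} = \frac{NX}{K} = \frac{X}{C}$$
[0128] The total failure probability for X filters is:
$$\left(1 - e^{-P}\right)^{X} = \left(1 - e^{-\frac{NX}{K}}\right)^{X} = \left(1 - e^{-\frac{NX}{K}}\right)^{\frac{NX}{K} \cdot \frac{K}{N}} = \left[\left(1 - e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}}\right]^{\frac{K}{N}}$$
[0129] Assuming N, K are constant, the minimum of
$$\left[\left(1 - e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}}\right]^{\frac{K}{N}}$$
is the minimum of
$$\left(1 - e^{-\frac{NX}{K}}\right)^{\frac{NX}{K}} \quad \text{at} \quad \frac{NX}{K} = \frac{X}{C} \approx 0.7.$$
[0130] And the failure rate is
$$\left(1 - e^{-\frac{X}{C}}\right)^{X} = \left[\left(1 - e^{-\frac{X}{C}}\right)^{\frac{X}{C}}\right]^{C},$$
which, at 0.7, is about $0.6^{C} \approx 0.5^{X}$:
$$\frac{X}{C} \approx 0.7 \;\Rightarrow\; \frac{C}{X} \approx 1.42, \qquad \frac{C}{X} = \frac{K}{NX} = \frac{LX}{NX} = \frac{L}{N} = \frac{1}{P}$$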
[0131] Which means that the optimal length of each filter array is
about 42% longer (in bits) than the number of items in the
list.
[0132] The probability of a certain bit to be zero is
$$\left(1 - \frac{1}{L}\right)^{N} \approx e^{-P},$$
so $P = 0.7$ corresponds to about 50% occupancy for each filter,
which again results in a failure rate of about $0.5^{X}$.
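The optimum derived above can be checked numerically. The sketch below evaluates the failure rate $(1 - e^{-X/C})^X$ for a fixed cost of C bits per item and searches over integer X; the function names and the example value C = 10 are chosen only for illustration.

```python
import math


def failure_rate(bits_per_item: float, num_arrays: int) -> float:
    """(1 - e^{-X/C})^X for C total bits per item spread over X arrays."""
    density = num_arrays / bits_per_item  # P = X / C
    return (1.0 - math.exp(-density)) ** num_arrays


def best_num_arrays(bits_per_item: float) -> int:
    """Integer X that minimizes the failure rate for a given C."""
    candidates = range(1, int(bits_per_item) + 1)
    return min(candidates, key=lambda x: failure_rate(bits_per_item, x))


cost = 10
optimum = best_num_arrays(cost)
print(optimum, optimum / cost, failure_rate(cost, optimum))
```

For C = 10 the search selects X = 7, i.e. X/C = 0.7, and the resulting failure rate is close to $0.5^7$, in agreement with the analysis.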
[0133] There are two possible cases in which no 0 is encountered
during the search process, and a direct query regarding the
existence of the item in the storage should be made: the first is
the case in which the item does exist in the list, and the second is
a "false alarm" due to collisions. In order to minimize the
probability of false alarms, X should be increased, at the cost of
a larger memory footprint. The optimal X is a tradeoff between the
memory cost and the cost of accessing the storage.
[0134] Because the number of arrays, X, and the number of bits
required to define a location in the array, D, are both integers,
the optimal values should be found by evaluating the formulas at the
neighboring integer values and choosing the best result (rather than
by rounding the real-valued optimum to the nearest integer, because
the formulas are not linear).
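One way to carry out this integer selection is to enumerate the feasible combinations directly, as in the sketch below. The function name `choose_parameters` and the notion of a fixed memory budget in bits are assumptions introduced for the example; the failure-rate formula is the one derived above, with array length $2^D$.

```python
import math


def choose_parameters(num_items: int, max_total_bits: int):
    """Return (failure_rate, D, X) with the lowest failure rate such that
    X arrays of 2**D bits each fit within max_total_bits."""
    best = None
    d = 1
    while 2 ** d <= max_total_bits:
        length = 2 ** d
        single = 1.0 - math.exp(-num_items / length)  # 1 - e^{-P}, P = N / L
        for x in range(1, max_total_bits // length + 1):
            candidate = (single ** x, d, x)
            if best is None or candidate < best:
                best = candidate
        d += 1
    return best


failure, d, x = choose_parameters(num_items=1000, max_total_bits=16384)
print(d, x, failure)
```

For 1000 items and a budget of 16384 bits, the enumeration selects eight arrays of 2048 bits, which is slightly better than sixteen arrays of 1024 bits even though both use the whole budget.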
[0135] Note that because of performance issues (cache thrashing) a
small first filter might be a good thing regardless, but obviously
not small enough to be saturated.
[0136] In a preferred embodiment of the present invention, the
system also utilizes a cache memory that includes a short list of
common words that are not keywords or an essential part of a
key-phrase. Reference is now made to FIG. 4, which illustrates a
system, substantially similar to the one described in FIG. 2, which
also includes a cache filter 157, operable to filter out the short
list described above.
[0137] In a preferred embodiment of the present invention, the list
of information items is sub-divided into several lists, according to
the frequencies of occurrence of the items in the list, such that
items that are anticipated to appear frequently in the scanned
content would appear in a separate list than less frequent items,
and a separate non-existence filter is implemented for each of the
lists, thereby facilitating optimized resource allocation.
[0138] In many cases, the items that need to be detected are
sequences of delimited segments, e.g., a sequence of words
delimited by spaces (a "key-phrase"). The detection problem in this
case is, in general, more involved than single word detection,
since a search must be performed for a plurality of sequences of
words with a variable length, and can no longer be conducted for
each word separately. In the following discussion, for sake of
brevity and clarity, we will use the term "word" with respect to
any delimited segment of the stored sequence of delimited
segments.
[0139] According to a preferred embodiment of the present
invention, the first word in each key-phrase is a root of a tree,
and the last words are the leaves of the tree (see examples in FIG.
5). Whenever a root word is found, the corresponding tree is
traversed in order to detect key-phrases.
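A tree of this kind can be sketched with nested dictionaries keyed by word. FIG. 5 is not reproduced here, so the exact layout is an assumption: in this illustration the marker key `None` stores the complete phrase at the node where a key-phrase ends, and the function names `build_trees` and `find_phrases` are hypothetical.

```python
def build_trees(key_phrases):
    """Nested dicts keyed by word; the key None marks the end of a phrase."""
    roots = {}
    for phrase in key_phrases:
        node = roots
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[None] = phrase
    return roots


def find_phrases(words, roots):
    """Traverse the tree of every root word encountered in `words`."""
    found = []
    for start, word in enumerate(words):
        node = roots.get(word)
        position = start + 1
        while node is not None:
            if None in node:
                found.append(node[None])
            if position >= len(words):
                break
            node = node.get(words[position])
            position += 1
    return found


roots = build_trees(["john doe", "john doe junior", "wire funds"])
print(find_phrases("please wire funds to john doe today".split(), roots))
```

Words that are not a root of any tree are dismissed with a single dictionary lookup, and a traversal ends as soon as the next word has no matching child.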
[0140] In a preferred embodiment of the present invention,
identification of key-phrases is based on the following scheme,
dubbed Word-Based Hash-List (WBHL). Basically, the algorithm
comprises two phases:
[0141] Pre-Processing: Each word (or other delimited segment of
interest) is represented by its hash value. Each key-phrase is
represented by the list or the set of the hashes of its single
words. (See more detailed description below.)
[0142] Scanning and filtering: The algorithm scans the words,
evaluates their hash values and utilizes a hash-table for an
immediate rule-out of words that are not contained in the
key-phrases. If the scanned word belongs to one or more of the
key-phrases, the algorithm efficiently checks all possible
candidates according to the hash values of the successive words. In
case of a match, the original key-phrase is retrieved and compared
with the scanned item. (See more detailed description below.)
[0143] This method allows for commutativity, if required (i.e. "John
Doe"="Doe, John"), and for rapid clearance in cases where words
from the key-phrases are not very common in the analyzed text (a
probable scenario). It utilizes the fact that the basic units are
words, and not characters, in order to achieve better performance
than classical string-matching algorithms such as Boyer-Moore
or Rabin-Karp, as described, e.g., in sections 6.5-6.6 of R.
A. Vowels: Algorithms and Data Structures in F and Fortran, Unicomp
(1999), ISBN 0-9640135-4-1, the contents of which are hereby
incorporated herein by reference in their entirety. Furthermore,
the performance does not depend on the number of key-phrases (as long
as their constituent words are not common in the analyzed
text).
[0144] Disadvantages of the above scheme may be:
[0145] The non-commutative version may be slow if the first word in
one or more key-phrases is common (e.g. "the" or "that").
[0146] The commutative version may be slow if any word in one or
more key-phrases is common.
[0147] These speed issues may be avoided by removing common
words in the canonization process. The removal may require exact
textual matching for avoidance of false positives.
[0148] A more detailed description of the algorithm follows:
Key-Phrases Pre-Processing Phase:
[0149] Compute hash value for each word in key phrases.
[0150] Build oneWordsPhrases--a hash table for the hash values of
each one-word phrase.
[0151] Build mutiWordsPhrases--a hash table for the hash values of
each starting word in multi word phrases.
[0152] Build mutiWordsWords--a hash table for the hash values of
each word in multi word phrases.
[0153] For each word in mutiWordsPhrases, add a hash set for each
key-phrase containing that word. The hash set contains hashes of all
other words in the phrase.
[0154] Associate the set with the text of the key-phrase in
oneWordsPhrases and mutiWordsPhrases.
Scanning & Analysis Phase:
[0155] Initialization:
[0156] "Canonize" Text
[0157] candidates: an empty set
[0158] i=0
[0159] Analysis:
TABLE-US-00001
While i < number of words in the text
    Read Word W(i)
    Evaluate Hash: H(W(i)) (e.g., using CRC32)
    Locate H(W(i)) in oneWordsPhrases
        (if exists, do textual matching - compare with the actual verbatim)
    if exists:
        For each hash_set in candidates:
            If H(W(i)) not in hash_set, delete hash_set
            Else if size(hash_set) = 1: delete hash_set, do textual match
            Else delete H(W(i)) from hash_set
        Append to candidates all hash_sets associated with H(W(i)) in
            multiWordsPhrases (They should not contain H(W(i)))
    i = i+1
end
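The commutative scheme above might be rendered in runnable form roughly as follows. This is a sketch under stated assumptions rather than the patented procedure itself: the function names are hypothetical, CRC32 is used as the word hash, every word of a key-phrase is assumed to be distinct, and candidates whose next word does not belong to the phrase are simply dropped.

```python
import zlib


def h(word: str) -> int:
    return zlib.crc32(word.encode("utf-8"))


def preprocess(key_phrases):
    one_word = {}  # hash -> text of a one-word phrase
    multi = {}     # hash of any word -> [(hashes of the other words, phrase)]
    for phrase in key_phrases:
        hashes = {h(w) for w in phrase.split()}
        if len(hashes) == 1:
            one_word[next(iter(hashes))] = phrase
            continue
        for first in hashes:
            multi.setdefault(first, []).append((frozenset(hashes - {first}), phrase))
    return one_word, multi


def scan(text_words, one_word, multi):
    matches, candidates = [], []  # candidate: (remaining hashes, phrase, words seen)
    for word in text_words:
        hw = h(word)
        if one_word.get(hw) == word:  # textual match for one-word phrases
            matches.append(word)
        survivors = []
        for remaining, phrase, seen in candidates:
            if hw not in remaining:
                continue  # delete hash_set
            seen = seen + [word]
            if len(remaining) == 1:
                if sorted(seen) == sorted(phrase.split()):  # textual match
                    matches.append(phrase)
            else:
                survivors.append((remaining - {hw}, phrase, seen))
        for others, phrase in multi.get(hw, []):
            survivors.append((others, phrase, [word]))
        candidates = survivors
    return matches


tables = preprocess(["confidential", "john doe", "top secret plan"])
text = "the plan top secret by doe john is confidential".split()
print(scan(text, *tables))
```

In the example, "plan top secret" and "doe john" are both reported because the order of the words within a phrase is ignored.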
[0160] The non-commutative version of the algorithm is
substantially similar:
Key-Phrases Pre-Processing:
[0161] Compute hash value for each word in key phrases.
[0162] Build oneWordsPhrases--a hash table for the hash values of
each one-word phrase.
[0163] Build mutiWordsPhrases--a hash table for the hash values of
each starting word in multi word phrases.
[0164] Build mutiWordsWords--a hash table for the hash values of
each word in multi word phrases.
[0165] For each word in mutiWordsPhrases, add a hash set for each
key-phrase starting with that word. The hash set contains ordered
hashes of all other words in the phrase.
[0166] Associate the set with the text of the key-phrase in
oneWordsPhrases and mutiWordsPhrases.
Scanning & Analysis Phase:
[0167] Initialization:
[0168] "Canonize" Text
[0169] candidates: an empty set
[0170] i=0
[0171] Analysis:
TABLE-US-00002
While i < number of words in the text
    Read Word W(i)
    Evaluate Hash: H(W(i)) (e.g., using CRC32)
    Locate H(W(i)) in oneWordsPhrases
        (if exists, do textual matching - compare with the actual verbatim)
    if exists:
        For each hash_set in candidates:
            If H(W(i)) not first of hash_set, delete hash_set
            Else if size(hash_set) = 1: delete hash_set, do textual match
            Else delete H(W(i)) from hash_set
        Append to candidates all hash_sets associated with H(W(i)) in
            multiWordsPhrases (They should not contain H(W(i)))
    i = i+1
end
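The non-commutative variant differs only in that the remaining hashes are kept in order and the scanned word must equal the first of them. A minimal sketch follows, again with hypothetical function names and CRC32 assumed as the word hash.

```python
import zlib


def h(word: str) -> int:
    return zlib.crc32(word.encode("utf-8"))


def preprocess_ordered(key_phrases):
    one_word, multi = {}, {}
    for phrase in key_phrases:
        hashes = [h(w) for w in phrase.split()]
        if len(hashes) == 1:
            one_word[hashes[0]] = phrase
        else:
            # Keyed by the starting word; the value keeps the rest in order.
            multi.setdefault(hashes[0], []).append((tuple(hashes[1:]), phrase))
    return one_word, multi


def scan_ordered(text_words, one_word, multi):
    matches, candidates = [], []  # candidate: (remaining hashes, phrase, start)
    for i, word in enumerate(text_words):
        hw = h(word)
        if one_word.get(hw) == word:
            matches.append(word)
        survivors = []
        for remaining, phrase, start in candidates:
            if remaining[0] != hw:
                continue  # not first of hash_set: delete it
            if len(remaining) == 1:
                if " ".join(text_words[start:i + 1]) == phrase:  # textual match
                    matches.append(phrase)
            else:
                survivors.append((remaining[1:], phrase, start))
        survivors.extend((rest, phrase, i) for rest, phrase in multi.get(hw, []))
        candidates = survivors
    return matches


ordered_tables = preprocess_ordered(["confidential", "john doe", "top secret plan"])
print(scan_ordered("doe john doe saw the top secret plan".split(), *ordered_tables))
```

Here "doe john" is not reported, whereas the subsequent "john doe" is.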
[0172] In a preferred embodiment of the present invention, the
algorithm used for key-phrase identification comprises:
[0173] Pre-Processing phase: Each word is represented by its hash
value. Each key-phrase is represented by a commutative (or
non-commutative) hash of the hashes of keywords that comprise that
key-phrase. The commutative hash is simply the XOR of all the
hashes of the words that constitute the phrase.
[0174] Scanning and filtering phase: The algorithm scans the words,
evaluates the hash values of each word and utilizes a hash-table
for an immediate rule-out of words that are not contained in the
key-phrases. If the scanned word belongs to one or more key-phrases,
the algorithm evaluates and checks the commutative hashes of bi-grams
(two consecutive words), three-grams etc.--until the maximum
possible number of words in the key-phrases. In case of a match,
the original key-phrase is retrieved and compared against the
scanned text.
[0175] This scheme also allows for commutativity and fast
clearance, and has a better worst-case behavior than the word-based
hash-list. It is also easy to implement and to verify, though it
may be slightly slower than the word-based hash-list in some cases.
Reference is now made to FIG. 6, which is a flowchart illustrating
the algorithm for fast detection of key-phrases, according to a
preferred embodiment of the present invention.
[0176] The key-phrases pre-processing phase, 610, comprises:
Input: Key-Phrases and the Maximal Length of Phrase
("maxPhraseLength")
[0177] Pre-Processing:
[0178] Compute hash value for each word in key phrases.
[0179] Build oneWordsPhrases--a hash table for the hash values of
each one-word phrase.
[0180] Build mutiWordsWord--a hash table for the hash values of each
word in multi word phrases.
[0181] Evaluate commutativeHash by XORing all the hash values of
the words in mutiWordsWord
[0182] Build mutiWordsPhrases--a hash table for the hash values of
each multi word phrase.
[0183] Associate the hash values with the text of the key-phrase in
oneWordsPhrases and mutiWordsPhrases.
[0184] Initialization:
[0185] chainLength=0
[0186] hashBuffer=[ ] //Empty set
[0187] i=0
[0188] Analysis:
TABLE-US-00003
While i < number of words in the text
    Read Word W(i)
    Evaluate the hash of W(i): H(W(i)) (e.g., using CRC32)
    Locate H(W(i)) in oneWordsPhrases (if exists, do textual match)
    if exists:
        hashBuffer += [hashWord]  // insert H(W(i)) to buffer
        chainLength += 1
        while chainLength <= maxPhraseLength:
            evaluate the commutative/non-commutative hash for hashBuffer
            check if exists in hash-table mutiWordsPhrases
            if exists, do textual match
            else check possible matching with other initials of
                mutiWordsPhrases in the buffer
                if there is a match do textual match
    i = i+1
end
Input: Key-Phrases and the Maximal Length of Phrase
("maxPhraseLength")
[0189] Pre-Processing:
[0190] Compute hash value for each word in key phrases.
[0191] Build oneWordsPhrases--a hash table for the hash values of
each one-word phrase.
[0192] Build mutiWordsWord--a hash table for the hash values of each
word in multi word phrases.
[0193] Evaluate commutativeHash by XORing all the hash values of
the words in mutiWordsWord
[0194] Evaluate nonCommutativeHash, if required, by first adding the
numerical value of the index wordLocationInPhrase (which can be just
the order of the word in the phrase--"1" for the first word in the
phrase, "2" for the second, etc.) to the hash values of the words in
mutiWordsWord, and then XORing all the resulting values.
[0195] Build mutiWordsPhrases--a hash table for the hash values of
each multi word phrase.
[0196] Associate the hash values with the text of the key-phrase in
oneWordsPhrases and mutiWordsPhrases.
[0197] The scanning and analysis phase, 620, comprises:
[0198] Initialization:
[0199] chainLength=0
[0200] hashBuffer=[ ] //Empty set
[0201] i=0
[0202] Analysis:
TABLE-US-00004
While i < number of words in the text
    Read Word W(i)
    Evaluate Hash: H(W(i)) (e.g., using CRC32)
    Locate H(W(i)) in oneWordsPhrases (if exists, do textual match)
    if exists:
        hashBuffer += [hashWord]  // insert H(W(i)) to buffer
        chainLength += 1
        while chainLength <= maxPhraseLength:
            evaluate the commutative/non-commutative hash for hashBuffer
            check if exists in hash-table mutiWordsPhrases
            if exists, do textual match
            else check possible matching with other initials of
                mutiWordsPhrases in the buffer
                if there is a match do textual match
    i = i+1
end
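The XOR-based scheme can be sketched as below. The sketch is an interpretation, not a transcription of the pseudocode: instead of a shared hash buffer it keeps a running XOR for each starting position, which evaluates the same n-gram hashes. The function names, CRC32 as the word hash and the 32-bit truncation after adding the word position are assumptions.

```python
import zlib

MASK32 = 0xFFFFFFFF


def h(word: str) -> int:
    return zlib.crc32(word.encode("utf-8"))


def term(word: str, position: int, commutative: bool) -> int:
    # The non-commutative variant adds the 1-based position before XORing.
    return h(word) if commutative else (h(word) + position) & MASK32


def phrase_hash(words, commutative=True) -> int:
    value = 0
    for position, word in enumerate(words, start=1):
        value ^= term(word, position, commutative)
    return value


def preprocess(key_phrases, commutative=True):
    table = {}
    for phrase in key_phrases:
        table.setdefault(phrase_hash(phrase.split(), commutative), []).append(phrase)
    known = {h(w) for p in key_phrases for w in p.split()}
    max_len = max(len(p.split()) for p in key_phrases)
    return table, known, max_len


def scan(text_words, table, known, max_len, commutative=True):
    matches = []
    for start in range(len(text_words)):
        running = 0
        for position, word in enumerate(text_words[start:start + max_len], start=1):
            if h(word) not in known:
                break  # immediate rule-out ends the chain
            running ^= term(word, position, commutative)
            window = text_words[start:start + position]
            for phrase in table.get(running, []):
                target = phrase.split()
                if (sorted(window) == sorted(target)) if commutative else (window == target):
                    matches.append(phrase)  # confirmed by textual match
    return matches


phrases = ["confidential", "john doe", "top secret plan"]
text = "doe john holds the secret top plan marked confidential".split()
print(scan(text, *preprocess(phrases)))
```

Running the same text through the non-commutative tables reports only the one-word phrase, since the multi-word phrases appear in permuted order.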
[0203] In a preferred embodiment of the present invention, a
state-machine (described, e.g., in David J. Comer, "Digital Logic
and State Machine Design", International Thomson Publishing; 3rd
edition (June 1997), ISBN: 0030949041, the contents of which are
hereby incorporated herein by reference in their entirety) is
compiled such that each keyword or key-phrase becomes a regular
expression that leaves the state-machine in an "accepting state",
thereby providing an efficient method to detect both keywords and
key-phrases that contain more than one word.
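As a rough illustration of this approach, the keywords and key-phrases can be compiled into a single regular expression. Python's `re` module is used here as a stand-in; it is a backtracking matcher rather than an explicitly compiled state machine, and the function name `compile_detector` and the sample items are hypothetical.

```python
import re


def compile_detector(items):
    """Compile all keywords and key-phrases into one alternation."""
    # Longest alternatives first, so "john doe junior" wins over "john doe".
    ordered = sorted(items, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(item) for item in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


detector = compile_detector(["confidential", "john doe", "john doe junior"])
found = [m.group(0).lower()
         for m in detector.finditer("Confidential memo from John Doe Junior.")]
print(found)
```

A single pass over the text then detects one-word and multi-word items alike.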
[0204] In a preferred embodiment of the present invention, both the
items in the inspected documents and the items in the list are
sorted, and the comparison is performed between two sorted
lists.
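A comparison between two sorted lists can be performed with a single merge-style pass, as in the following sketch. The function name `sorted_intersection` is hypothetical, and duplicate entries are removed before sorting as a simplifying assumption.

```python
def sorted_intersection(document_items, stored_items):
    """Merge-style pass over two sorted lists, returning the common items."""
    doc = sorted(set(document_items))
    stored = sorted(set(stored_items))
    i = j = 0
    common = []
    while i < len(doc) and j < len(stored):
        if doc[i] == stored[j]:
            common.append(doc[i])
            i += 1
            j += 1
        elif doc[i] < stored[j]:
            i += 1  # document item is not in the stored list
        else:
            j += 1  # stored item does not occur in the document
    return common


print(sorted_intersection(["wire", "doe", "funds", "doe"], ["doe", "smith", "wire"]))
```

Once both lists are sorted, the pass itself is linear in their combined length.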
[0205] In a preferred embodiment of the present invention, the
system includes a module that facilitates the automatic insertion
of keywords and key-phrases into a keywords list, by comparing close
documents with a different policy, and regarding the differences
between the documents as a collection of "key-phrases". For
example, if in one standard contract the name of one of the sides is
"John Doe" and in another contract the name is "Jane Smith" then
both "John Doe" and "Jane Smith" can be regarded as key-phrases. A
method for comparing documents and obtaining their differences is
described, e.g., in provisional patent application number
60/422,128.
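The idea of treating the differences between two close documents as candidate key-phrases can be illustrated with a word-level comparison. The method of the referenced provisional application is not reproduced here; the sketch below substitutes Python's standard `difflib.SequenceMatcher`, and the function name and sample contract text are hypothetical.

```python
import difflib


def candidate_key_phrases(doc_a: str, doc_b: str):
    """Return the word runs on which two otherwise similar documents differ."""
    a, b = doc_a.split(), doc_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    phrases = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if a2 > a1:
            phrases.append(" ".join(a[a1:a2]))
        if b2 > b1:
            phrases.append(" ".join(b[b1:b2]))
    return phrases


contract_a = "This agreement is made between Acme Corp and John Doe on this day"
contract_b = "This agreement is made between Acme Corp and Jane Smith on this day"
print(candidate_key_phrases(contract_a, contract_b))
```

The resulting candidates would then be passed on for insertion into the keywords list.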
[0206] In a preferred embodiment of the present invention, the list
of automatically detected keywords is further subjected to manual
approval.
[0207] The present invention successfully addresses the
shortcomings of the presently known configurations by providing a
method and system for fast identification of keywords and
key-phrases, which can efficiently serve current needs.
[0208] It is appreciated that one or more steps of any of the
methods described herein may be implemented in a different order
than that shown, while not departing from the spirit and scope of
the invention.
[0209] While the present invention may or may not have been
described with reference to specific hardware or software, the
present invention has been described in a manner sufficient to
enable persons having ordinary skill in the art to readily adapt
commercially available hardware and software as may be needed to
reduce any of the embodiments of the present invention to practice
without undue experimentation and using conventional
techniques.
[0210] Although the invention has been described in conjunction
with specific embodiments thereof, it is evident that many
alternatives, modifications and variations will be apparent to
those skilled in the art. Accordingly, it is intended to embrace
all such alternatives, modifications and variations that fall
within the spirit and broad scope of the appended claims. All
publications, patents and patent applications mentioned in this
specification are herein incorporated in their entirety by
reference into the specification, to the same extent as if each
individual publication, patent, or patent application was
specifically and individually indicated to be incorporated herein
by reference. In addition, citation or identification of any
reference in this application shall not be construed as an
admission that such reference is available as prior art to the
present invention.
* * * * *