U.S. patent application number 15/170758 was filed with the patent office on 2016-06-01 and published on 2016-09-22 for message distribution control.
The applicant listed for this patent is Dell Software Inc. Invention is credited to Gleb Budman, Christine Drake, Eugene Koontz, Andrew F. Oliver, Jonathan J. Oliver.
Application Number | 20160277365 15/170758 |
Document ID | / |
Family ID | 39332007 |
Filed Date | 2016-06-01 |
United States Patent Application | 20160277365 |
Kind Code | A1 |
Oliver; Jonathan J.; et al. | September 22, 2016 |
MESSAGE DISTRIBUTION CONTROL
Abstract
A method of controlling distribution of content in a message
sent by a message sender comprises receiving an indication from the
message sender that the message is to be protected, identifying
content in the message to be protected, adding the identified
content to a database of protected content, and determining whether
subsequently received content in a subsequently received message is
associated with the identified content. A system for controlling
distribution of content in a message sent by a message sender
comprises a processor configured to receive an indication from the
message sender that the message is to be protected, identify
content in the message to be protected, add the identified content
to a database of protected content, and determine whether
subsequently received content in a subsequently received message is
associated with the identified content.
Inventors: | Oliver, Jonathan J. (San Carlos, CA); Budman, Gleb (Palo Alto, CA); Oliver, Andrew F. (Glenhuntly, AU); Koontz, Eugene (Mountain View, CA); Drake, Christine (San Mateo, CA) |
|
Applicant: |
Name | City | State | Country | Type |
Dell Software Inc. | Round Rock | TX | US | |
Family ID: | 39332007 |
Appl. No.: | 15/170758 |
Filed: | June 1, 2016 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14491829 | Sep 19, 2014 | |
15170758 | | |
11036603 | Jan 14, 2005 | 8886727 |
14491829 | | |
60642266 | Jan 5, 2005 | |
60578135 | Jun 8, 2004 | |
60543300 | Feb 9, 2004 | |
60539615 | Jan 27, 2004 | |
Current U.S. Class: | 1/1 |
Current CPC Class: | H04L 51/12 20130101; H04L 63/0428 20130101; G06F 21/6227 20130101; G06F 21/6245 20130101; H04L 51/14 20130101; G06F 21/6218 20130101; G06F 16/3331 20190101 |
International Class: | H04L 29/06 20060101 H04L029/06; G06F 21/62 20060101 G06F021/62; H04L 12/58 20060101 H04L012/58 |
Claims
1. A method of controlling distribution of protected content in an
electronic message, the method comprising: receiving information,
wherein at least a portion of the received information includes the
protected content; receiving a selection identifying the protected
content included in the received information; storing information
corresponding to the protected content in a database; receiving the
electronic message; comparing information in the received
electronic message with the stored information corresponding to the
protected content; identifying that information in the received
electronic message is similar to the stored information
corresponding to the protected content when the information in the
received electronic message is consistent with the at least portion
of the received information; and sending an electronic message to
an external computing device based on identifying that the received
electronic message is similar to the stored information
corresponding to the protected content.
2. The method of claim 1, wherein one or more indicators are added
to the identified protected content before the information
corresponding to the protected content is stored in the database,
wherein the one or more indicators specifically identify the
protected content.
3. The method of claim 2, wherein the one or more indicators are
headers.
4. The method of claim 1, further comprising: receiving over the
user interface a distribution control option, wherein the
distribution control option identifies a recipient that is authorized
to receive the protected content; identifying that an addressee of
the received electronic message corresponds to the authorized
recipient, wherein the message sent to the external computing
device includes the protected content.
5. The method of claim 1, further comprising: receiving over the
user interface a distribution control option, wherein the
distribution control option identifies a recipient that is authorized
to receive the protected content; and identifying that an addressee
of the received electronic message does not correspond to the
authorized recipient, wherein the message sent to the external
computing device includes an alert identifying that the received
electronic message includes the protected content.
6. The method of claim 1, wherein the external computing device is
inside a corporate network.
7. The method of claim 1, further comprising: generating one or
more variations of the protected content; and storing the one or
more variations of the protected content in the database.
8. The method of claim 7, wherein the one or more variations of the
protected content include at least one of: a first word or phrase
that is a synonym of a primary word or phrase associated with the
protected content; one or more words associated with the protected
content are organized in a different order than one or more primary
words associated with the protected content; one or more letters
that have been substituted for one or more letters associated with
the protected content; a token that replaces a portion of the
protected content; and a special character added to the protected
content.
9. The method of claim 8, wherein the special character added to
the protected content or the token that replaces a portion of the
protected content includes at least one of a punctuation mark, a
space, a non-standard letter, and a non-standard number.
10. The method of claim 1, wherein the received information is
received over a user interface.
11. A non-transitory computer readable storage medium having
embodied thereon a program executable by a processor for performing
a method of controlling distribution of protected content in an
electronic message, the method comprising: receiving information,
wherein at least a portion of the received information includes the
protected content; receiving a selection identifying the protected
content included in the received information; storing information
corresponding to the protected content in a database; receiving the
electronic message; comparing information in the received
electronic message with the stored information corresponding to the
protected content; identifying that information in the received
electronic message is similar to the stored information
corresponding to the protected content when the information in the
received electronic message is consistent with the at least portion
of the received information; and sending an electronic message to
an external computing device based on identifying that the received
electronic message is similar to the stored information
corresponding to the protected content.
12. The non-transitory computer readable storage medium of claim
11, wherein one or more indicators are added to the identified
protected content before the information corresponding to the
protected content is stored in the database, wherein the one or
more indicators specifically identify the protected content.
13. The non-transitory computer readable storage medium of claim
12, wherein the one or more indicators are headers.
14. The non-transitory computer readable storage medium of claim
11, the program further executable to: receive over the user
interface a distribution control option, wherein the distribution
control option identifies a recipient that is authorized to receive
the protected content; and identify that an addressee of the
received electronic message corresponds to the authorized
recipient, wherein the message sent to the external computing
device includes the protected content.
15. The non-transitory computer readable storage medium of claim
11, the program further executable to: receive over the user
interface a distribution control option, wherein the distribution
control option identifies a recipient that is authorized to receive
the protected content; and identify that an addressee of the
received electronic message does not correspond to the authorized
recipient, wherein the message sent to the external computing
device includes an alert identifying that the received electronic
message includes the protected content.
16. The non-transitory computer readable storage medium of claim
11, wherein the external computing device is inside a corporate
network.
17. The method of claim 11, further comprising: generating one or
more variations of the protected content; and storing the one or
more variations of the protected content in the database.
18. The method of claim 17, wherein the one or more variations of
the protected content include at least one of: a first word or
phrase that is a synonym of a primary word or phrase associated
with the protected content; one or more words associated with the
protected content are organized in a different order than one or
more primary words associated with the protected content; one or
more letters that have been substituted for one or more letters
associated with the protected content; a token that replaces a
portion of the protected content; and a special character added to
the protected content.
19. The method of claim 18, wherein the special character added to
the protected content or the token that replaces a portion of the
protected content includes at least one of a punctuation mark, a
space, a non-standard letter, and a non-standard number.
20. An apparatus for controlling distribution of protected content
in an electronic message, the apparatus comprising: a network
interface that receives information and that receives a selection
identifying the protected content in the received information,
wherein at least a portion of the received information includes the
protected content; a database for storing information corresponding
to the protected content, wherein the received information is
stored in the database after it is received, wherein the network
interface receives the electronic message; a memory; and a
processor, wherein the processor: compares information in the
received electronic message with the stored information
corresponding to the protected content; identifies that information
in the received electronic message is similar to the stored
information corresponding to the protected content when the
information in the received electronic message is consistent with
the at least portion of the received information; and prepares an
electronic message to be sent to an external computing device based
on identifying that the received electronic message is similar to
the stored information corresponding to the protected content,
wherein the electronic message is sent to the external computing
device over the network interface.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is a continuation and claims the
priority benefit of U.S. patent application Ser. No. 14/491,829
filed Sep. 19, 2014, which is a continuation and claims the
priority benefit of U.S. patent application Ser. No. 11/036,603
filed Jan. 14, 2005, now U.S. Pat. No. 8,886,727. The present
application also claims the priority benefit of U.S. provisional
application 60/642,266 filed Jan. 5, 2005, U.S. provisional
application 60/578,135 filed Jun. 8, 2004, U.S. provisional
application 60/543,300 filed Feb. 9, 2004, and U.S. provisional
application 60/539,615 filed Jan. 27, 2004, the disclosures of
which are incorporated herein by reference.
FIELD OF THE INVENTION
[0002] The present invention relates generally to electronic
communications. More specifically, message distribution is
disclosed.
BACKGROUND OF THE INVENTION
[0003] Businesses and organizations today are becoming increasingly
dependent on various forms of electronic communication such as
email, instant messaging, etc. The same characteristics that make
electronic messages popular--speed and convenience--also make them
prone to misuse. Confidential or inappropriate information can be
easily leaked from within an organization. A breach of confidential
information may be caused inadvertently or purposefully.
Unauthorized information transmission can lead to direct harm such
as lost revenue, theft of intellectual property, additional legal
cost, as well as indirect harm such as damage to the company's
reputation and image.
[0004] Although some studies show that over half of information
security incidents are initiated from within organizations,
current security products for preventing internal security
breaches tend to be less sophisticated and less effective than
products designed to prevent external break-ins such as spam
filters, intrusion detection systems, firewalls, etc. There are a
number of issues associated with the typical internal security
products that are currently available. Some of the existing
products that prevent inappropriate email from being sent use
filters to match keywords or regular expressions. Since system
administrators typically configure the filters to block specific
keywords or expressions manually, the configuration process is
often labor intensive and error-prone.
[0005] Other disadvantages of the keyword and regular expression
identification techniques include a high rate of false positives
(i.e., legitimate email messages being identified as inappropriate
for distribution). Additionally, someone intent on circumventing
the filters can generally obfuscate the information using tricks
such as word scrambling or letter substitution. In existing
systems, the sender of a message is in a good position to judge how
widely certain information can be circulated. However, the sender
often has little control over the redistribution of the
information.
[0006] It would be desirable to have a product that could more
accurately and efficiently detect protected information in
electronic messages and prevent inappropriate distribution of such
information. It would also be useful if the product could give
message senders greater degrees of control over information
redistribution, as well as identify messages that are sent between
different parts of an organization.
SUMMARY OF THE PRESENTLY CLAIMED INVENTION
[0007] A method of controlling redistribution of content in a
message sent by a message sender includes a step of receiving over
a network from a mail client a first message created by a first
message sender. The first message includes a text string manually
marked by the first message sender as confidential content upon
which a distribution limit should be placed. The first message also
includes a distribution limit manually set by the first message
sender identifying one or more users authorized to recirculate the
confidential content. The method further includes a step of
receiving over the network from a mail client a second message
subsequent to the first message. The second message is distinct
from the first message and created by a second message sender who
is distinct from the first message sender. The second message
includes an identification of the second message sender.
[0008] The method further includes a step of executing instructions
stored in memory that, when executed, add the text string manually
marked by the first message sender and the distribution limit
manually set by the first message sender to a database stored in
memory. The method further includes examining the second message
for the stored text string manually marked by the first message
sender. Examining the second message includes extracting a
suspicious text string from the second message and comparing the
suspicious text string to the stored text string manually marked by
the first message sender. The method also includes examining the
second message for the one or more users identified in the stored
distribution limit manually set by the first message sender as
authorized to recirculate the confidential content. The examination
occurs when the suspicious text string matches the stored text
string manually marked by the first message sender.
[0009] Examining the second message includes extracting the
identification of the second message sender and comparing the
identification of the second message sender to the one or more
users identified in the stored distribution limit manually set by
the first message sender as authorized to recirculate the
confidential content. The method further includes transmitting the
second message over the network when the identification of the
second message sender matches one of the one or more users
identified in the stored distribution limit manually set by the
first message sender as users authorized to recirculate the
confidential content.
[0010] A computing device that controls redistribution of content
in a message sent by a message sender includes a processor, a
network interface communicatively coupled to a communications
network, memory storing a database, and executable instructions,
whereby execution of the instructions by the processor causes the
processor to perform the foregoing method of controlling
redistribution of content in a message sent by a message sender. A
non-transitory computer-readable storage medium also includes a
program executable by a computer processor to perform the foregoing
method.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Various embodiments of the invention are disclosed in the
following detailed description and the accompanying drawings.
[0012] FIG. 1 is a system diagram illustrating a message
distribution control system embodiment.
[0013] FIG. 2 is a diagram illustrating the user interface of a
mail client embodiment.
[0014] FIG. 3 is a flowchart illustrating a message processing
operation according to some embodiments.
[0015] FIG. 4 is a flowchart illustrating the examination of a
message before it is transmitted to its designated recipient,
according to some embodiments.
[0016] FIG. 5 is a flowchart illustrating a process for determining
whether a message is associated with particular protected
content.
[0017] FIG. 6 is a flowchart illustrating a lexigraphical
distancing process according to some embodiments.
[0018] FIG. 7 is a flowchart illustrating a process for generating
a database of protected content according to some embodiments.
DETAILED DESCRIPTION
[0019] The invention can be implemented in numerous ways, including
as a process, an apparatus, a system, a composition of matter, a
computer readable medium such as a computer readable storage medium
or a computer network wherein program instructions are sent over
optical or electronic communication links. In this specification,
these implementations, or any other form that the invention may
take, may be referred to as techniques. In general, the order of
the steps of disclosed processes may be altered within the scope of
the invention.
[0020] A detailed description of one or more embodiments of the
invention is provided below along with accompanying figures that
illustrate the principles of the invention. The invention is
described in connection with such embodiments, but the invention is
not limited to any embodiment. The scope of the invention is
limited only by the claims and the invention encompasses numerous
alternatives, modifications and equivalents. Numerous specific
details are set forth in the following description in order to
provide a thorough understanding of the invention. These details
are provided for the purpose of example and the invention may be
practiced according to the claims without some or all of these
specific details. For the purpose of clarity, technical material
that is known in the technical fields related to the invention has
not been described in detail so that the invention is not
unnecessarily obscured.
[0021] A method and system for controlling distribution of
protected content is disclosed. In some embodiments, the message
sender sends an indication that a message is to be protected. The
message sender may identify a portion of the message as protected
content. The protected content is added to a database. If a
subsequently received message is found to include content that is
associated with any protected content in the database, the system
takes actions to prevent protected content from being distributed
to users who are not authorized to view such content. Content in a
message that is similar but not necessarily identical to the
protected content is detected using techniques such as computing a
content signature or a hash, identifying a distinguishing property
in the message, summarizing the message, using finite state
automata, applying the Dynamic Programming Algorithm or a genetic
programming algorithm, etc.
[0022] FIG. 1 is a system diagram illustrating a message
distribution control system embodiment. For purposes of
illustration, distribution control of email messages is described
throughout this specification. The techniques are also applicable
to instant messages, wireless text messages or any other
appropriate electronic messages. In this example, mail clients such
as 102 and 104 cooperate with server 106. A user sending a message
via a mail client can indicate whether the message or a selected
portion of the message is to be protected. As used herein, a piece
of protected content may include a word, a phrase, a sentence, a
section of text, or any other appropriate string. Besides the
intended recipients, the user can also specify a set of users who
are authorized to recirculate the protected content. The authorized
users and the recipients may overlap but are not necessarily the
same. In this example, the mail server cooperates with a user
directory 108 to facilitate the specification of authorized users.
Mail server 106 extracts the protected content information and
recipient information, and stores the information in a database
114.
[0023] Received messages are tested by message identifier 110 based
on data stored in database 114, using identification techniques
which are described in more detail below. A message identified as
containing protected content is prevented from being sent to any
user besides the set of authorized users associated with the
protected content. In some embodiments, mail server 106 or gateway
112, or both, also automatically prevent restricted information
from being sent to users outside the organization's network.
Components of backend system 120 may reside on the same physical
device or on separate devices.
[0024] FIG. 2 is a diagram illustrating the user interface of a
mail client embodiment. In this example, mail client interface 200
includes areas 202 and 204 used for entering the standard message
header and message body. Additionally, the user interface allows
the user to selectively protect the entire message or portions of
the message. For instance, by checking checkbox 208, the sender can
indicate that the distribution of the entire message is to be
restricted. Alternatively, the user may select a portion or
portions of the message for protection. In the example shown, the
sender has highlighted section 206, which contains sensitive
information about an employee. The highlighted portion is marked
for protection. In some embodiments, special marks are inserted in
the message to define the protected portions. Special headers that
describe the start and end positions of the protected text may also
be used.
[0025] Configuration area 210 offers distribution control options.
In this example, five options are presented: if selected, the
"internal" option allows the message to be redistributed inside the
corporate network, "recipient" allows the message to be
redistributed among the recipients, "human resources", "sales", and
"engineering" options allow redistribution only among users within
the respective departments. In some embodiments, the mail client
queries a user directory to obtain hierarchical information about
the user accounts on the system, and presents the information in
the distribution control options. In some embodiments, the mail
client allows the user to configure custom distribution lists and
includes the custom distribution lists in the control options. Some
embodiments allow permission information to be set. The permission
information is used to specify the destinations and/or groups of
recipients who are allowed to receive the information. For example,
a sender may permit a message only to be sent to specific
destinations, such as recipients with a certain domain, subscribers
who have paid to receive the message, registered users of a certain
age group, etc.
[0026] FIG. 3 is a flowchart illustrating a message processing
operation according to some embodiments. Process 300 shown in this
example may be performed on a mail client, on a mail server, on a
message identification server, on any other appropriate device or
combinations thereof. At the beginning, an indication that a
message is to be protected is received 302. The indication may be
sent along with the message or separately. The content in the
message to be protected is then identified 304. The protected
content is then added to a database 306. In some embodiments, the
protected content is processed and the result is added to the
database. For example, spell check and punctuation removal are
performed in some embodiments. The process can be repeated multiple
times for different messages with different protected content.
Optionally, permission information may also be added to the
database.
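The three steps of process 300 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list `protected_db`, the `normalize` helper, and the `protect` function are hypothetical names, and punctuation stripping stands in for whatever preprocessing a given embodiment performs.

```python
import string

# Hypothetical in-memory stand-in for the database of protected content.
protected_db = []

def normalize(text):
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def protect(message_body, selection=None, authorized=()):
    """Identify the content to protect (304) and add it to the database (306).

    Calling this function plays the role of the indication received at 302.
    If no portion is selected, the entire message body is protected.
    """
    content = selection if selection is not None else message_body
    entry = {"content": normalize(content), "authorized": set(authorized)}
    protected_db.append(entry)
    return entry

entry = protect(
    "Hi all. John Doe's SSN: 123-45-6789.",
    selection="SSN: 123-45-6789",
    authorized={"human resources"},
)
```

Storing the permission set alongside the processed content keeps the later authorization check a simple lookup on the matched entry.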
[0027] When subsequent messages are to be sent by the mail server,
they are examined for protected content. FIG. 4 is a flowchart
illustrating the examination of a message before it is transmitted
to its designated recipient, according to some embodiments. Process
400 may be performed on a mail client, on a mail server, on a
message identification server, on any other appropriate device or
combinations thereof. The message identifier component may be an
integral part of the mail server or a separate component that
cooperates with the mail server. In this example, a message becomes
available for transmission 402. It is determined whether the
message is associated with any protected content in the database
404. A message is associated with protected content if it includes
one or more sections that convey the same information as some of
the protected content. A user intent on distributing unauthorized
information can sometimes mutate the text to avoid detection.
Letter substitution (e.g. replacing letter "a" with "@", letter "O"
with number "0", letter "v" with a backward slash and a forward
slash "\/"), word scrambling, intentional misspelling and
punctuation insertion are some of the tricks used to mutate text
into a form that will escape many keyword/regular expression
filters but is still readable by a human reader. For example,
"social security number: 123-45-6789" can be mutated as "sOcial
sceurity #: 123*45*6789" (where letter "l" is replaced with number
"1" and vice versa), "CEO John Doe resigned" can be mutated as "CEO
J0hn Doe res!ngěd". By using appropriate content
identification techniques (such as lexigraphical distancing
described below), text that is not identical to the protected
content but conveys similar information can be identified.
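One simple way to blunt letter substitution is to map known look-alike tokens back to plain letters before comparing text. The sketch below assumes a small hypothetical table, `SUBSTITUTIONS`, built from the examples in the paragraph above; it is not the identification technique of the specification, only an illustration of why exact keyword filters miss mutated text.

```python
# Hypothetical substitution table drawn from the examples in the text.
# Multi-character tokens come first so they are replaced before single ones.
SUBSTITUTIONS = [("\\/", "v"), ("@", "a"), ("0", "o"), ("!", "i"), ("1", "l")]

def undo_substitutions(text):
    """Map look-alike tokens back to the letters they imitate."""
    out = text.lower()
    for mutated, plain in SUBSTITUTIONS:
        out = out.replace(mutated, plain)
    return out

mutated = "CEO J0hn Doe res!gned"
print("resigned" in mutated.lower())              # exact keyword filter misses it
print("resigned" in undo_substitutions(mutated))  # canonical form catches it
```

This naive mapping also rewrites genuine digits (every "1" becomes "l"), so it suits alphabetic keywords only. Scrambled words and deliberate misspellings are untouched by it, which is why an approximate comparison is still needed.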
[0028] If the message is not associated with any protected content
in the database, it is deemed safe and is sent to its intended
recipient 408. If, however, the received message is associated with
a piece of protected content, it is determined whether each of the
recipients is authorized to view the protected content by the
content's original author 406. Optionally, it is determined whether
the sender of the message under examination is authorized by the
original sender of the protected content to send such content to
others. The message is sent to the recipient if the recipient is
authorized to view the protected content and if the sender is
authorized to send the message. If, however, a recipient (or the
sender) is not authorized, certain actions are taken 410. Examples
of such actions include blocking the message from the unauthorized
recipient, quarantining the message, sending a notification to the
sender or a system administrator indicating the reason for
blocking, etc. For instance, a new message that contains
information about John Doe's social security number and address
will be identified as being associated with protected content. If
one of the recipients of this message is in the human resources
department, he will be allowed to receive this message since the
original sender of the confidential information had indicated that
users from human resources department are authorized to send and
receive this information. If, however, another recipient is in the
sales department, he will be blocked from receiving the new
message. Furthermore, if someone in the sales department obtains
John Doe's social security number through other means and then
attempts to email the information to others, the message will be
blocked because the original sender only permitted users in the
human resources department to send and receive this information.
Alerts may be sent to the message sender and/or system
administrator as appropriate. In some embodiments, the system
optionally performs additional checks before the message is
sent.
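The authorization decision described above can be sketched as follows, assuming the message has already been matched to a protected entry. The `ProtectedEntry` class, the `DIRECTORY` mapping, and the `route` function are hypothetical names introduced for illustration; a real deployment would query the user directory instead of a hard-coded dictionary.

```python
from dataclasses import dataclass, field

@dataclass
class ProtectedEntry:
    content: str
    authorized_groups: set = field(default_factory=set)

# Hypothetical user directory mapping addresses to departments.
DIRECTORY = {
    "alice@example.com": "human resources",
    "carol@example.com": "human resources",
    "bob@example.com": "sales",
}

def route(entry, sender, recipients):
    """Return (deliver_to, blocked) for a message matched to `entry`."""
    # Optional sender check: an unauthorized sender blocks the whole message.
    if DIRECTORY.get(sender) not in entry.authorized_groups:
        return [], list(recipients)
    # Per-recipient check against the groups the original author allowed.
    deliver = [r for r in recipients
               if DIRECTORY.get(r) in entry.authorized_groups]
    blocked = [r for r in recipients if r not in deliver]
    return deliver, blocked

entry = ProtectedEntry("ssn 123456789", {"human resources"})
deliver, blocked = route(entry, "alice@example.com",
                         ["carol@example.com", "bob@example.com"])
```

The `blocked` list is where actions such as quarantining the message or notifying an administrator would attach.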
[0029] FIG. 5 is a flowchart illustrating a process for determining
whether a message is associated with particular protected content.
In this example, a text string is extracted from a message 501. The
extraction process varies across
implementations. In some embodiments, the text string includes
plaintext extracted from the "text/plain" and "text/html" text
parts of a received message. In some embodiments, it is a line
delimited by special characters such as carriage return, linefeed,
ASCII null, end-of-message, etc. The string is sometimes
preprocessed to eliminate special characters such as blank spaces
and punctuation. A substring is obtained from the text string 502.
The substring is examined to determine whether it includes any
suspicious substring that may be the protected content in a mutated
form 504. Different embodiments may employ different techniques for
detecting a suspicious substring. For example, in some embodiments
if the first and last letters of a substring match the first and
the last letters of the protected content, and if the substring has
approximately the same length as the protected content, the
substring is deemed suspicious. If the substring is not suspicious,
the next substring in the text string, if available, is obtained
502 and the process is repeated.
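The first-and-last-letter heuristic mentioned above might look like the sketch below. The 25 percent length tolerance and the word-window scheme used to enumerate substrings are assumptions made for the example; the text only requires that the lengths be approximately equal.

```python
def is_suspicious(substring, protected, tolerance=0.25):
    """Flag a substring whose end letters and length resemble the protected content."""
    if not substring or not protected:
        return False
    same_ends = (substring[0].lower() == protected[0].lower()
                 and substring[-1].lower() == protected[-1].lower())
    close_length = abs(len(substring) - len(protected)) <= tolerance * len(protected)
    return same_ends and close_length

def candidate_substrings(text, protected):
    """Yield word windows with the same number of words as the protected content."""
    words = text.split()
    n = len(protected.split())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

protected = "CEO resigned"
text = "Heard that CE0 res!gned today"
suspicious = [s for s in candidate_substrings(text, protected)
              if is_suspicious(s, protected)]
```

The check is deliberately cheap: it only narrows the text down to a few candidates, which are then passed to the more expensive evaluation.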
[0030] If the substring is found to be suspicious, it is determined
whether the suspicious substring is a safe string 506. A safe
string is a word, a phrase, or an expression that may be present in
the message for legitimate reasons. Greetings and salutations are
some examples of safe strings. If the suspicious string is a safe
string, the next available substring in the text is obtained 502
and the process is repeated. If, however, the suspicious string is
not a safe string, it is evaluated against the protected content
(508). In some embodiments, the evaluation yields a score that
indicates whether the substring and the protected content
approximately match. The evaluation is sometimes performed on
multiple substrings and/or multiple protected content to derive a
cumulative score. An approximate match is found if the score
reaches a certain preset threshold value, indicating that the
suspicious string approximately matches the protected content.
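The safe-string filter and threshold test can be sketched as below. The score here comes from Python's standard `difflib.SequenceMatcher`, which is a stand-in and not the scoring of the specification; the `SAFE_STRINGS` set and the `THRESHOLD` value of 0.8 are likewise hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical safe strings and threshold, chosen for illustration only.
SAFE_STRINGS = {"dear colleagues", "best regards"}
THRESHOLD = 0.8

def score(substring, protected):
    """Similarity in [0, 1]; a stand-in for the evaluation at step 508."""
    return SequenceMatcher(None, substring.lower(), protected.lower()).ratio()

def approximately_matches(substring, protected):
    """Skip safe strings (506), then compare the score to the threshold."""
    if substring.lower() in SAFE_STRINGS:
        return False
    return score(substring, protected) >= THRESHOLD

hit = approximately_matches("CE0 res!gned", "CEO resigned")
miss = approximately_matches("Best regards", "CEO resigned")
```

A cumulative variant would sum or average `score` over several substrings or several protected entries before applying the threshold.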
[0031] Protected content may be mutated by inserting, deleting or
substituting one or more characters or symbols (sometimes
collectively referred to as tokens) in the string of the protected
content, scrambling locations of tokens, etc. The resulting string
conveys the same information to the human reader as the protected
content. To detect protected content that has been mutated, a
lexigraphical distancing process is used in some embodiments to
evaluate the similarity between a suspicious string and the
protected content. FIG. 6 is a flowchart illustrating a
lexigraphical distancing process according to some embodiments. The
technique is applicable to email messages as well as other forms of
textual documents that include delimiters such as spaces, new
lines, carriage returns, etc. In this example, the potential start
position of the protected content (or its mutated form) is located
602. In some embodiments, the potential start position is located
by finding the first character of the protected content or by
finding an equivalent token to the first character. If possible, a
potential end position is located by finding the last character of
the protected content or an equivalent token 604. As used herein,
an equivalent token includes one or more characters or symbols that
can be used to represent a commonly used character. For example,
the equivalent tokens for "c" include "c", "C", " ", " ", " ", "C",
"c", etc., and the equivalent tokens for "d" include "d", "D",
"Ď", "", "", etc. Thus, if "CEO resigned" is the
protected content under examination, the start position for a
suspicious string is where "c", "C", " ", " ", "C", or "c" is found
and the end position is where "d", "D", "Ď", "", or
"" is found. The length between the potential start position and
the potential end position is optionally checked to ensure that the
length is not greatly different from the length of the protected
content. Sometimes the potential start and end positions are
expanded to include some extra tokens such as spaces and
punctuation.
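Steps 602 and 604 can be illustrated with a short Python sketch that
scans a text for potential start and end positions and applies the
optional length check. The EQUIVALENTS table below is hypothetical
(its entries are illustrative single-character tokens, not the
tokens listed in the disclosure), as are the function names and the
tolerance parameter.

```python
# Hypothetical single-character equivalence table for illustration only.
EQUIVALENTS = {
    "c": {"c", "C", "¢", "("},
    "d": {"d", "D", "Ď"},
}

def equivalents(ch):
    # Fall back to the upper- and lower-case forms of the character.
    return EQUIVALENTS.get(ch.lower(), {ch.lower(), ch.upper()})

def candidate_spans(text, protected, tolerance=3):
    """Return (start, end) slices whose ends match equivalent tokens."""
    starts = equivalents(protected[0])    # potential start tokens (602)
    ends = equivalents(protected[-1])     # potential end tokens (604)
    spans = []
    for i, ch in enumerate(text):
        if ch not in starts:
            continue
        for j in range(i + 1, len(text)):
            length = j - i + 1
            # Optional check: length must be close to the protected content's.
            if text[j] in ends and abs(length - len(protected)) <= tolerance:
                spans.append((i, j + 1))
    return spans
```

Each returned pair can be used directly as a slice, for example
text[start:end], to extract the string for further processing.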
[0032] The string between the potential start and end position is
then extracted (606). In some embodiments, if a character, a symbol
or other standard token is obfuscated by using an equivalent token,
the equivalent token is identified before the string is further
processed. The equivalent token is replaced by the standard token
before further processing. For example, "\/" (a backslash and a
forward slash) is replaced by "v" and "|-|" (a vertical bar, a dash and
another vertical bar) is replaced by "H". An edit distance that
indicates the similarity between the suspicious string and the
protected content is then computed 608. In this example, the edit
distance is represented as a score that measures the amount of
mutation required for transforming the protected content to the
suspicious string by inserting, deleting, changing or otherwise
mutating characters. The score may be generated using a variety of
techniques, such as applying the Dynamic Programming Algorithm
(DPA), a genetic programming algorithm or any other appropriate
methods to the protected content and the suspicious string. For the
purpose of illustration, computing the score using DPA is discussed
in further detail, although other algorithms may also be
applicable.
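The replacement of equivalent tokens by standard tokens, performed
before the edit distance is computed, might look like the following
Python sketch. The first two table entries come from the example
above; the remaining entries, the table name and the function name
are assumptions for illustration.

```python
# (equivalent token, standard token) pairs.
EQUIVALENT_TOKENS = [
    ("|-|", "H"),   # vertical bar, dash, vertical bar
    ("\\/", "v"),   # backslash followed by forward slash
    ("0", "O"),     # hypothetical entry
    ("1", "i"),     # hypothetical entry
]

def normalize(suspicious: str) -> str:
    """Replace each equivalent token with its standard token."""
    # Replace longer tokens first so multi-character tokens are not
    # broken up by shorter replacements.
    for token, standard in sorted(EQUIVALENT_TOKENS, key=lambda p: -len(p[0])):
        suspicious = suspicious.replace(token, standard)
    return suspicious
```

A table of this kind treats every listed token as obfuscation, so a
practical system would likely need to restrict replacements such as
"0" to contexts where a letter is expected.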
[0033] In some embodiments, the Dynamic Programming Algorithm (DPA)
is used for computing the similarity score. In one example, the DPA
estimates the edit distance between two strings by setting up a
dynamic programming matrix. The matrix has as many rows as the
number of tokens in the protected content, and as many columns as
the length of the suspicious string. An entry of the matrix, Matrix
(I, J), reflects the similarity score of the first I tokens in the
protected content against the first J tokens of the suspicious
string. Each entry in the matrix is iteratively evaluated by taking
the minimum of V1, V2 and V3, which are computed as the following:
[0034] V1 = Matrix(I-1, J-1) + TokenSimilarity(ProtectedContent(I), SuspiciousString(J))
[0035] V2 = Matrix(I-1, J) + CostInsertion(ProtectedContent(I))
[0036] V3 = Matrix(I, J-1) + CostDeletion(SuspiciousString(J))
[0037] The similarity of the protected content and the suspicious
string is the matrix entry value at
Matrix(length(ProtectedContent), length(SuspiciousString)). In this
example, the TokenSimilarity function returns a low value (close to
0) if the tokens are similar, and a high value if the tokens
are dissimilar. The CostInsertion function returns a high cost for
inserting an unexpected token and a low cost for inserting an
expected token. The CostDeletion function returns a high cost for
deleting an unexpected token and a low cost for deleting an
expected token.
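The recurrence above can be sketched in Python as follows. This is
an illustrative sketch, not the disclosed implementation: the
uniform unit costs, the case-insensitive token comparison, and the
extra row and column used for the empty-prefix boundary conditions
are all assumptions, since the text does not specify them.

```python
def token_similarity(a: str, b: str) -> float:
    # Low value (0) for similar tokens, high value (1) otherwise.
    return 0.0 if a.lower() == b.lower() else 1.0

def cost_insertion(token: str) -> float:
    return 1.0  # assumed uniform cost

def cost_deletion(token: str) -> float:
    return 1.0  # assumed uniform cost

def dpa_score(protected: str, suspicious: str) -> float:
    rows, cols = len(protected), len(suspicious)
    # matrix[i][j] scores the first i protected tokens against the
    # first j suspicious tokens; row 0 and column 0 are empty prefixes.
    matrix = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        matrix[i][0] = matrix[i - 1][0] + cost_insertion(protected[i - 1])
    for j in range(1, cols + 1):
        matrix[0][j] = matrix[0][j - 1] + cost_deletion(suspicious[j - 1])
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            v1 = matrix[i - 1][j - 1] + token_similarity(protected[i - 1], suspicious[j - 1])
            v2 = matrix[i - 1][j] + cost_insertion(protected[i - 1])
            v3 = matrix[i][j - 1] + cost_deletion(suspicious[j - 1])
            matrix[i][j] = min(v1, v2, v3)
    return matrix[rows][cols]
```

With these unit costs the score reduces to the classic edit distance,
so a lower score indicates a closer match.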
[0038] Prior probabilities of tokens, which affect similarity
measurements and expectations, are factored into one or more of the
above functions in some embodiments. The TokenSimilarity,
CostInsertion and CostDeletion functions may be adjusted as a
result. In some embodiments, the prior probabilities of the tokens
correspond to the frequencies of characters' occurrence in natural
language or in a cryptographic letter frequency table. In some
embodiments, the prior probabilities of the tokens in the protected
content correspond to the actual frequencies of the letters in all
the protected content, and the prior probabilities of the tokens in
the message correspond to the common frequencies of letters in
natural language. In some embodiments, the prior probabilities of
tokens in the protected content correspond to the actual
frequencies of the tokens in the protected content, and the prior
probabilities of the different tokens in the message correspond to
the common frequencies of such tokens in sample messages previously
collected by the system.
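One way to factor prior probabilities into the cost functions is to
estimate token frequencies from sample messages and charge less for
frequent (expected) tokens. The Python sketch below is an assumption
made for illustration: the negative-log mapping from probability to
cost, the floor value for unseen tokens, and the function names are
not taken from the disclosure.

```python
import math
from collections import Counter

def token_priors(samples):
    """Estimate per-token prior probabilities from sample texts."""
    counts = Counter(ch for text in samples for ch in text.lower())
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

def cost_from_prior(token, priors, floor=1e-6):
    # Expected (frequent) tokens get a low cost; unexpected tokens a
    # high cost. Unseen tokens fall back to a small floor probability.
    return -math.log(priors.get(token.lower(), floor))
```

A function such as cost_from_prior could then serve as the insertion
or deletion cost, with separate prior tables for the protected
content and for incoming messages.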
[0039] In some embodiments, the context of the mutation is taken
into account during the computation. A mutation due to substitution
of special characters (punctuation, spaces, non-standard letters
or numbers) is more likely to be caused by intentional obfuscation
rather than unintentional typographical error, and is therefore
penalized more severely than a substitution of regular characters.
For example, "esigned" is penalized to a greater degree than
"resighed". Special characters immediately preceding a string,
following a string, and/or interspersed within a string also
indicate that the string is likely to have been obfuscated,
so an approximate match of protected content, if found, is
likely to be correct. For example, "C*E*O re*sighned*" leads to an
increase in the dynamic programming score because of the placements
of the special characters.
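A context-sensitive substitution penalty of this kind can be
sketched in Python as follows. The specific penalty values, the
function name, and the choice to treat only ASCII letters as regular
characters are assumptions for illustration.

```python
def substitution_cost(expected: str, observed: str,
                      regular_penalty: float = 1.0,
                      special_penalty: float = 2.0) -> float:
    """Penalize substitution by a special character more heavily."""
    if expected.lower() == observed.lower():
        return 0.0
    # Regular characters are taken to be ASCII letters; punctuation,
    # spaces, digits and non-standard letters count as special.
    if observed.isascii() and observed.isalpha():
        return regular_penalty
    return special_penalty
```

Under these assumed values, substituting "h" for "n" costs 1.0, while
substituting "$" or "0" for a letter costs 2.0.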
[0040] In some embodiments, the edit distance is measured as the
probability that the suspicious content being examined is an
"edited" version of the protected content. The probability of
insertions, deletions, substitutions, etc. is estimated based on
the suspicious content and compared to a predetermined threshold.
If the probability exceeds the threshold, the suspicious content is
deemed to be a variant of the protected content.
[0041] Sometimes the protected content is mutated by substituting
synonymous words or phrases. The evaluation process used in some
embodiments includes detecting whether a substring is semantically
similar (i.e. whether it conveys the same meaning using different
words or phrases) to the protected content. For example, a message
includes a substring "CEO left". The examination process generates
semantically similar substrings, including "CEO quit", "CEO
resigned", etc., which are compared with the protected content in
the database. If "CEO resigned" is included in the database as
protected content, the substring will be found to be semantically
similar with respect to the protected content.
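The generation of semantically similar substrings can be sketched
with a small synonym table. In the Python sketch below the SYNONYMS
table, its entries and the function names are hypothetical; a
practical system would draw on a much larger thesaurus or another
source of semantic similarity.

```python
from itertools import product

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "left": {"left", "quit", "resigned"},
    "quit": {"left", "quit", "resigned"},
}

def semantic_variants(substring: str) -> set:
    """Generate substrings that convey the same meaning."""
    words = substring.split()
    # Each word is replaced by each of its synonyms in turn; words
    # without an entry are kept unchanged.
    options = [sorted(SYNONYMS.get(w.lower(), {w})) for w in words]
    return {" ".join(choice) for choice in product(*options)}

def matches_protected(substring: str, database: set) -> bool:
    protected = {p.lower() for p in database}
    return any(v.lower() in protected for v in semantic_variants(substring))
```

With "CEO resigned" in the database, matches_protected("CEO left",
database) returns True because "CEO resigned" is among the generated
variants.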
[0042] In some embodiments, the database of protected content
includes variations of special terms of interest. The variations
may be lexigraphically similar and/or semantically similar with
respect to the special terms. FIG. 7 is a flowchart illustrating a
process for generating a database of protected content according to
some embodiments. In the example shown, variations of an original
term of interest are generated 702. For example, if the original
term is "CEO resigned", then variations such as "CEO resigns", "CE0
resigns", "CEO quit", "CEO qu!ts", "C*E*O left" and other possible
mutations are generated. These variations may be generated using
combinatorial techniques to generate permutations of the original
term, using genetic programming techniques to generate mutations of
the original term, or using any other appropriate techniques. For
each of the variations, the similarity between the variation and
the original term is evaluated 704. The similarity may be measured
as an edit distance between the variation and the original term,
and evaluated using techniques such as DPA, a genetic programming
algorithm, or any other appropriate technique. If the variation
meets a certain criterion (e.g. if the similarity score is above a
certain threshold) 706, it is then included in the protected
content database 708. Otherwise, the variation is discarded 710. In
some embodiments, the process also includes an optional check to
eliminate any safe words. Thus, although "designed" may be
lexigraphically similar to "resigned" in terms of edit distance,
"designed" is deemed to be a safe word and is not included in the
protected content database. Process 700 may be repeated for various
special terms of interest. The resulting database includes
variations that can be used to represent the original term. During
operation, portions of the message are compared with terms in the
collection to determine whether there is a match. In some
embodiments, a score is then computed based on how similar the
matching term is with respect to the original term.
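Process 700 can be sketched in Python as shown below. The sketch
makes several assumptions for illustration: variations are generated
by single-character substitutions from a hypothetical SUBSTITUTIONS
table rather than by full combinatorial or genetic techniques,
difflib.SequenceMatcher stands in for the similarity measure, and
the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher

# Hypothetical look-alike substitutions for illustration only.
SUBSTITUTIONS = {"o": ["0"], "i": ["1", "!"], "e": ["3"]}

def generate_variations(term: str) -> set:
    """Generate single-substitution variations of a term (702)."""
    variations = set()
    for idx, ch in enumerate(term):
        for repl in SUBSTITUTIONS.get(ch.lower(), []):
            variations.add(term[:idx] + repl + term[idx + 1:])
    return variations

def similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; higher means more similar (704).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_database(terms, safe_words, threshold=0.8) -> set:
    database = set()
    for term in terms:
        for variation in generate_variations(term) | {term}:
            if variation.lower() in safe_words:
                continue                      # optional safe-word check
            if similarity(variation, term) >= threshold:
                database.add(variation)       # include (706, 708)
            # otherwise the variation is discarded (710)
    return database
```

For the term "CEO resigned", this sketch yields entries such as
"CE0 resigned" and "CEO res!gned" alongside the original term.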
[0043] A content distribution control technique has been disclosed.
In addition to dynamic programming and genetic programming
algorithms, content in a message that is similar to certain
protected content can be detected by calculating a signature of the
content under examination and comparing the signature to signatures
of the protected content, identifying one or more distinguishing
properties in the message and comparing the distinguishing
properties (or their signatures) to the protected content (or its
signatures), summarizing the message and comparing the summary with
the summary of the protected content, applying a finite state
automata algorithm, or using any other appropriate technique.
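The signature comparison mentioned above can be illustrated with a
hash over normalized content. In this Python sketch the
normalization rule (keeping only lower-cased alphanumeric
characters), the use of SHA-256 and the function names are
assumptions; the text does not specify how signatures are computed.

```python
import hashlib

def signature(content: str) -> str:
    """Compute a signature over normalized content."""
    # Drop spaces and punctuation and fold case so that trivially
    # reformatted copies produce the same signature.
    normalized = "".join(ch.lower() for ch in content if ch.isalnum())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_protected(content: str, protected_signatures: set) -> bool:
    return signature(content) in protected_signatures
```

An exact hash of this kind only detects copies that normalize to the
same string; substitutions such as "0" for "O" would still require
one of the approximate techniques.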
[0044] Although the foregoing embodiments have been described in
some detail for purposes of clarity of understanding, the invention
is not limited to the details provided. There are many alternative
ways of implementing the invention. The disclosed embodiments are
illustrative and not restrictive.
* * * * *