U.S. patent application number 11/480,677 was filed with the patent office on July 3, 2006 and published on January 3, 2008 as publication number 20080005778 for a system and method for privacy protection using identifiability risk assessment. Invention is credited to Weifeng Chen, Zhen Liu, Anton Riabov, and Angela Marie Schuett.

United States Patent Application 20080005778
Kind Code: A1
Chen; Weifeng; et al.
January 3, 2008

System and method for privacy protection using identifiability risk assessment
Abstract

A risk assessment system and method includes an information system configured to disclose information to a third party. A risk determination model is configured to compute identifiability risk for one or more records in storage. The identifiability risk is compared to a threshold prior to disclosure, and the information system is informed if the identifiability risk exceeds the threshold before the information is disclosed to the third party.
Inventors: Chen; Weifeng (Amherst, MA); Liu; Zhen (Tarrytown, NY); Riabov; Anton (Ossining, NY); Schuett; Angela Marie (Columbia, MD)
Correspondence Address: KEUSEY, TUTUNJIAN & BITETTO, P.C., 20 Crossways Park North, Suite 210, Woodbury, NY 11797, US
Family ID: 38878420
Appl. No.: 11/480677
Filed: July 3, 2006
Current U.S. Class: 726/1
Current CPC Class: G06Q 10/10 20130101
Class at Publication: 726/1
International Class: H04L 9/00 20060101 H04L009/00
Government Interests
GOVERNMENT RIGHTS
[0001] This invention was made with Government support under
Contract No.: H98230-05-3-001 awarded by the U.S. Department of
Defense. The Government has certain rights in this invention.
Claims
1. A risk assessment system, comprising: an information system configured to disclose information to a third party; and a risk determination model configured to compute identifiability risk for one or more records in storage, the identifiability risk being compared to a threshold prior to the information being disclosed, wherein the information system is informed of the identifiability risk exceeding the threshold prior to disclosure to the third party.
2. The system as recited in claim 1, wherein the information system
builds the risk determination model based on the one or more
records of input data.
3. The system as recited in claim 1, wherein the information system
builds the risk determination model using records having a highest
identifiability for a combination of attributes.
4. The system as recited in claim 1, wherein the risk determination model computes identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
5. The system as recited in claim 4, wherein the cell size is
approximated.
6. The system as recited in claim 1, wherein the risk determination model computes identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
7. The system as recited in claim 1, wherein the risk determination
model predicts identifiability risk for records input to the
information system.
8. The system as recited in claim 1, wherein the risk determination
model predicts a single record identifiability risk for a given
record, knowing a set of key attributes, by summing over a
population: a probability that an individual is the one appearing
in a sample table divided by a cell size for each record
corresponding to the individual.
9. The system as recited in claim 8, wherein the cell size is
approximated.
10. The system as recited in claim 8, wherein the risk determination model predicts identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
11. The system as recited in claim 1, wherein the risk
determination model includes a combined privacy risk model
configured to combine risk assessment for a plurality of risk
factors.
12. A risk monitoring system, comprising: an information system
configured to disclose information to an entity; a risk
determination model configured to compute identifiability risk for
one or more records in storage; a privacy monitor configured to
receive the one or more records being released, and prior to
disclosing the one or more records to the entity, the privacy
monitor being configured to detect whether the identifiability risk
exceeds a threshold and perform a function to mitigate unauthorized
disclosure of the records to the entity.
13. The system as recited in claim 12, wherein the information
system builds the risk determination model based on the one or more
records of input data.
14. The system as recited in claim 12, wherein the information
system builds the risk determination model using records having a
highest identifiability for a combination of attributes.
15. The system as recited in claim 12, wherein the risk determination model computes identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
16. The system as recited in claim 15, wherein the cell size is
approximated.
17. The system as recited in claim 12, wherein the risk determination model computes identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
18. The system as recited in claim 12, wherein the risk
determination model predicts identifiability risk for records input
to the information system.
19. The system as recited in claim 12, wherein the risk
determination model predicts a single record identifiability risk
for a given record, knowing a set of key attributes, by summing
over a population: a probability that an individual is the one
appearing in a sample table divided by a cell size for each record
corresponding to the individual.
20. The system as recited in claim 19, wherein the cell size is
approximated.
21. The system as recited in claim 19, wherein the risk determination model predicts identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
22. The system as recited in claim 12, wherein the risk
determination model includes a combined privacy risk model
configured to combine risk assessment for a plurality of risk
factors.
23. A method for privacy protection, comprising: evaluating
identifiability risk for one or more records in storage using a
risk assessment model; comparing the identifiability risk with a
threshold to determine whether the one or more records can be
disclosed to a third party without violation of privacy criteria;
and disclosing the one or more records if disclosure is acceptable
based on the comparing step.
24. The method as recited in claim 23, further comprising, prior to
disclosing the one or more records to the third party, detecting
whether the identifiability risk exceeds a threshold and performing
a function to mitigate unauthorized disclosure of the records to
the third party.
25. The method as recited in claim 23, further comprising building
the risk determination model using records having a highest
identifiability for a combination of attributes.
26. The method as recited in claim 23, wherein evaluating includes computing identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
27. The method as recited in claim 26, further comprising
approximating the cell size.
28. The method as recited in claim 23, wherein evaluating includes computing identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
29. The method as recited in claim 23, wherein evaluating includes
predicting identifiability risk for records input to an information
system.
30. The method as recited in claim 23, wherein evaluating includes
predicting a single record identifiability risk for a given record,
knowing a set of key attributes, by summing over a population: a
probability that an individual is the one appearing in a sample
table divided by a cell size for each record corresponding to the
individual.
31. The method as recited in claim 30, further comprising
approximating the cell size.
32. The method as recited in claim 30, wherein evaluating includes predicting identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
33. The method as recited in claim 23, further comprising combining
risks using a combined privacy risk model configured to combine
risk assessment for a plurality of risk factors.
34. A computer program product for privacy protection comprising a computer usable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to: evaluate identifiability risk for one or more records in storage using a risk assessment model; compare the identifiability risk with a threshold to determine whether the one or more records can be disclosed to a third party without violation of privacy criteria; and disclose the one or more records if disclosure is acceptable based on the comparing step.
35. The computer program product as recited in claim 34, wherein evaluating includes predicting identifiability risk for records input to an information system.
Description
BACKGROUND
[0002] 1. Technical Field
[0003] The present invention relates to privacy protection and more
particularly to systems and methods employing an identifiability
risk assessment to protect against disclosure of protected
information.
[0004] 2. Description of the Related Art
[0005] Information systems deployed and maintained by businesses of
different sizes often store personal information collected from
their clients. Similarly, information systems of government
organizations, such as the Internal Revenue Service, store personal
data about citizens, as well as other private data collected from
businesses. In recent years, these information systems have
increasingly become computerized.
[0006] Computer-based information systems enable anytime instant
access to the data, and on-the-fly cross-referencing of large
volumes of data. Computerization is also critical for the
implementation of data mining methods and other automated and
highly efficient data analysis techniques, which in turn help
reduce costs and improve the agility and the efficiency of
businesses and governments.
[0007] Recent advances in communication technologies, wireless and
cellular networks and widespread availability of the Internet
enable instant access to the information systems and computational
resources from virtually anywhere, via easily obtainable equipment
such as a laptop with a wireless card. Together with the benefits they have brought, the now ubiquitous digital technologies have also brought new dangers.
[0008] Currently, it is becoming increasingly easy to gain
unauthorized access to personal data. Identity theft is a very
serious and real threat to anyone who is sharing personal
information with companies in exchange for services, credit, etc.
As a consequence, safeguarding of personal information becomes a
highly important objective for businesses and governments, and many
aspects of privacy protection, such as the collection and use of
information on minors or strong encryption of Internet
communications, are mandated by laws or best business
practices.
[0009] Two previously unsolved important problems of privacy
protection arising in business and government information systems
need solutions. These problems include evaluating and managing a
tradeoff between privacy protection and business efficiency, and
quantifying the privacy risks associated with various business
operations. Solving these problems will contribute to improvements
in both business transparency and business efficiency. The
solutions will help streamline privacy protection processes,
simplify the work of employees responsible for privacy protection,
and increase the accountability of individual employees and entire
organizations.
[0010] Managing the tradeoff between privacy and business
efficiency: Arguably, any information collected and stored by an
organization is collected with the intent of using this
information, for one purpose or another. Privacy policies enacted
by organizations restrict the purposes for which the information
can be used. In this policy-controlled mode, privacy protection is
equivalent to the enforcement of policy compliance. For all but the most permissive policies, there can be practical situations where the potential benefits resulting from a particular business operation are not realized, because the operation requires the use of information in conflict with the privacy policy. The privacy policies therefore control the tradeoff between the needs of business efficiency and the needs of privacy protection.
[0011] Developing privacy policies is an extremely difficult task
that any organization must perform when it establishes a system for
storing personal or private information. It is especially difficult
in information systems where automated policy enforcement is
implemented, because for such systems, the policy is specified as a
set of formal automatically verifiable rules. In some respects the
task of writing policies can be compared to the process of
developing legislation. This analogy can be literal if the
policy is stipulated by law.
[0012] For these reasons, the policy writers often prefer to err on
the side of caution, and prohibit the use of information when in
doubt. Also as a result, the policies mandated internally by an
organization are typically more restrictive than the published
policies, and reflect not only the law and the published
commitments of the organization, but also the best practices
requirements as seen by policy writers.
[0013] The employees responsible for the use of information are
often required to determine whether a particular use of information
is in violation of the policy. For example, in some situations, the
internal policy can be violated in order to allow an extremely
critical operation, as long as the externally published policy and
laws are not violated. If the internal policy is enforced
automatically, such an operation may require the intervention of
high-level management to circumvent the normal enforcement
controls. This problem can be partially addressed by policies that
permit wider access in exceptional circumstances, e.g., if
sufficient justification is provided. However, the currently
existing approaches do not provide sufficient assistance to the
organization's employees responsible for making decisions, and
often they are forced to violate internal policies in order to get
their job done.
SUMMARY
[0014] A risk assessment system and method includes an information
system configured to disclose information to a third party. A risk
determination model is configured to compute identifiability risk
for one or more records in storage. The identifiability risk is compared to a threshold prior to disclosure, and the information system is informed if the identifiability risk exceeds the threshold before the information is disclosed to the third party.
[0015] Another risk monitoring system includes an information
system configured to disclose information to an entity, and a risk
determination model configured to compute identifiability risk for
one or more records in storage. A privacy monitor is configured
to receive the one or more records being released, and prior to
disclosing the one or more records to the entity, the privacy
monitor is configured to detect whether the identifiability risk
exceeds a threshold and perform a function to mitigate unauthorized
disclosure of the records to the entity.
[0016] A method for privacy protection includes evaluating
identifiability risk for one or more records in storage using a
risk assessment model, comparing the identifiability risk with a
threshold to determine whether the one or more records can be
disclosed to a third party without violation of privacy criteria,
and disclosing the one or more records if disclosure is acceptable
based on the comparing step.
[0017] These and other objects, features and advantages will become
apparent from the following detailed description of illustrative
embodiments thereof, which is to be read in connection with the
accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
[0018] The disclosure will provide details in the following
description of preferred embodiments with reference to the
following figures wherein:
[0019] FIG. 1 is a block diagram of a stream processing model for
privacy protection in accordance with one illustrative
embodiment;
[0020] FIG. 2 is a chart showing identifiability approximation for
different combinations of attributes from US Census data in
accordance with an illustrative embodiment;
[0021] FIG. 3 is a block diagram showing a risk assessment system
in accordance with one illustrative embodiment;
[0022] FIG. 4 is a block diagram showing a risk processing and
monitoring system in accordance with another illustrative
embodiment;
[0023] FIG. 5 is a block diagram showing exemplary risk factors
affecting privacy risk in accordance with one embodiment; and
[0024] FIG. 6 is a block/flow diagram showing a system/method for
risk assessment and privacy protection in accordance with an
illustrative embodiment.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0025] Embodiments in accordance with the present principles
address at least two previously unsolved important problems of
privacy protection arising in business and government information
systems. These problems include evaluating and managing a tradeoff
between privacy protection and business efficiency, and quantifying
the privacy risks associated with various business operations. The
present embodiments provide systems and methods for evaluating and
managing the tradeoffs between business efficiency and privacy
protection.
[0026] The systems and methods are preferably based on a privacy
risk quantification method for the estimation of the numeric value
of privacy risk associated with a particular use of protected
information. An organization's employees cannot be relieved of the duty of making their own, frequently highly subjective, estimates of the risks, but they can be provided with a system which will automatically compute the risk value and keep records of the value, the employee's decision, and the employee's justification for the decision.
[0027] Such a system may include enforcement controls for limiting
the risk that an employee can take without consulting with others,
for example based on work experience--more experienced employees
may be trusted to take higher risks. Justification requirements and
risk mitigation procedures of varying degrees can be established,
such that the requirements and the procedures are enforced
automatically depending on the risk estimate. Possible risk
mitigation measures may include an audit that takes place after the
employee decides to perform a particular action leading to use of
information, post-action management review, or pre-action
management approval requirement.
[0028] A noted approach to evaluation of security risks is
described in commonly assigned U.S. patent application Ser. No. 11/123,998, to
Cheng et al. entitled "SYSTEM AND METHOD FOR FUZZY MULTI-LEVEL
SECURITY", filed May 6, 2005, and incorporated herein by reference.
Many of the ideas described in Cheng et al. can be applied in
privacy protection scenarios. Access control decisions in privacy
protection context translate into the decisions about allowing the
use of information for specific purposes.
[0029] Assessment of the privacy risks associated with business
operations: People generally accept the necessity of providing
personal information to businesses or governments. When asked to
provide such information, people usually know the entity they are
providing the information to, and can decide whether the
information should be provided. However, in practice it is very
hard to ensure that the information indeed is provided to one
single principal constituting a business or a government. The
information is often shared with other principals: e.g., the
employees of the business or government, business partners, and so
on. In this situation it is especially difficult to limit the
distribution of private information due to the possibility of misuse by
trusted insiders or trusted partners.
[0030] For these reasons, it can be concluded that any use of
personal or other types of private information that leads to
disclosure of private information, e.g., providing the information
to principals who did not have access to the information before, is
associated with risk. The privacy risk can be defined as the risk
that private information will be misused, or shared
inappropriately.
[0031] A system for evaluation and management of privacy risks is
provided and may be based on a method for risk estimation.
Embodiments in accordance with present principles build upon the
methods developed for the measurement of security risks, and
introduce a novel quantitative measure for assessment of
privacy-specific identifiability risk.
[0032] The present embodiments have significantly extended the
ideas of disclosure risk estimation described in the prior art to
develop methods for providing consistent global privacy risk
assessment both before and after the data have been disclosed to a
third party. Risk estimation before actual data is disclosed can be
performed based on the forecast of future data distribution.
Obtaining such an estimate is important when the agreements on the
conditions of sharing the data are established before the data is
actually transferred from the information system to a third party,
and even before the data enters the information system.
[0033] Third party or third party principal as employed herein
refers to any consumer of data. The consumer may be an internal or external component or entity of a given system, or may be a
different part of the same information system. Third party will be
employed herein to refer to an entity which at least initially does
not have access to sensitive information which the third party has
requested or needs for an application or function.
[0034] Embodiments of the present invention can take the form of an
entirely hardware embodiment, an entirely software embodiment or an
embodiment including both hardware and software elements. In a
preferred embodiment, the present invention is implemented in
software, which includes but is not limited to firmware, resident
software, microcode, etc.
[0035] Furthermore, the invention can take the form of a computer
program product accessible from a computer-usable or
computer-readable medium providing program code for use by or in
connection with a computer or any instruction execution system. For
the purposes of this description, a computer-usable or computer
readable medium can be any apparatus that may include, store,
communicate, propagate, or transport the program for use by or in
connection with the instruction execution system, apparatus, or
device. The medium can be an electronic, magnetic, optical,
electromagnetic, infrared, or semiconductor system (or apparatus or
device) or a propagation medium. Examples of a computer-readable
medium include a semiconductor or solid state memory, magnetic
tape, a removable computer diskette, a random access memory (RAM),
a read-only memory (ROM), a rigid magnetic disk and an optical
disk. Current examples of optical disks include compact disk-read
only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
[0036] A data processing system suitable for storing and/or
executing program code may include at least one processor coupled
directly or indirectly to memory elements through a system bus. The
memory elements can include local memory employed during actual
execution of the program code, bulk storage, and cache memories
which provide temporary storage of at least some program code to
reduce the number of times code is retrieved from bulk storage
during execution. Input/output or I/O devices (including but not
limited to keyboards, displays, pointing devices, etc.) may be
coupled to the system either directly or through intervening I/O
controllers.
[0037] Network adapters may also be coupled to the system to enable
the data processing system to become coupled to other data
processing systems or remote printers or storage devices through
intervening private or public networks. Modems, cable modems and
Ethernet cards are just a few of the currently available types of
network adapters.
[0038] Referring now to the drawings in which like numerals
represent the same or similar elements and initially to FIG. 1, a
block diagram showing a flow of information is illustratively
shown. An information system 10 receives data through a data input
facility 12, processes the received data, and releases the results
of the processing to a third party principal 14 (e.g., to an
insider or to a partner, for example an employee of the
organization that collects private data, or another information
system). The release of information to third party 14 can be
initiated either by a specific request by the third party 14, or
automatically. The automatic release may happen on a predefined
schedule, for example immediately when new input data arrives. No
restrictions are placed on the operations that the information
system 10 can perform on data. It is assumed that communication
channels are secure, and all communicating parties have been
properly identified.
[0039] In accordance with present embodiments, a method of risk assessment provides only a risk estimate 16 associated with data disclosure to the third party principal 14, and in general the separate data input facility component 12 is not required for the application of the method. Further, any business operation that requires the use of private information by employees or other third parties can be represented as a combination of operations outlined
in FIG. 1. How the overall risk estimates associated with such an
operation can be computed will be described. Although there can be
risks associated with storing and processing the private
information in the information system, this risk is not considered
here, so that the focus is on the disclosure risk.
[0040] To make the distinction between private and public
information, public sources may be defined as sources of
information that are accessible to the third-party principal. In
general settings, these sources can be those that are available to
very large groups of people, such as all citizens of a country, all
government employees, all clients of a multinational company, etc.
Publicly available information is information obtained from public
sources.
[0041] The following factors can contribute to overall risk associated with the disclosure of private information:
[0042] The type of private information being disclosed. For example, disclosing a person's taxpayer identification number is considered a much more serious violation of privacy than disclosing the phone number of the same person.
[0043] The amount of information being disclosed. The more data of the same type is released, the higher the risk that one of the records will be highly sensitive.
[0044] The availability of the disclosed private information through public sources. For example, disclosing the income of the current President of the U.S. involves no privacy risk, since by tradition since the 1970s U.S. Presidents mostly choose to make their tax returns public.
[0045] The degree to which the disclosed private information can be associated with a particular individual via cross-referencing with publicly available information. For example, disclosing social security number (SSN) 123-45-6789 and age 24, without any other information, has negligible privacy risk: although the SSN is valuable information, knowing only the age of a person leaves too many candidates, e.g., all US residents 24 years of age, and each of these candidates is equally likely to have the disclosed SSN.
[0046] The context of the disclosure, which may supply additional information for cross-referencing. For example, if in the previous example the SSN and age are provided as part of a list of employees of a company, it may significantly limit the number of candidates, and potentially make it possible to single out the identity associated with these attributes by cross-referencing the data available on the employees.
[0047] The likelihood of information misuse by the third party principal. The issue here is that the principal receiving the information will not necessarily abide by the same privacy policy that allowed the principal who is sharing the information to make the disclosure. Access control policies capture this by using credentials to reflect the trustworthiness of the principal.
[0048] The security risk evaluation model described in Cheng et al., cited above, addresses the evaluation of the risk of misuse of the information by the principal receiving the information. This type of risk is also relevant and important in the privacy context, and the model proposed in Cheng et al. can be used as part of a privacy risk evaluation formula. However, the needs of privacy protection
add another component of risk that is associated with certainty of
determining the identity of the individual associated with the
information being disclosed. This identification is not always
trivial to deduce, but it can in many cases be achieved by using
the knowledge of the context of information disclosure, and
cross-referencing the information obtained from the information
system with the information obtained from publicly available
sources.
[0049] Existing work on the analysis of disclosure risks in
statistical databases includes: T. M. Truta, F. Fotouhi and D.
Barth-Jones, "Assessing global disclosure risk in masked metadata",
in Proceedings of the 2004 ACM Workshop on Privacy in The
Electronic Society, Washington D.C., USA, pp. 85-93 (hereinafter
Truta et al.) and references therein. Truta et al. describes
methods of risk estimation based on ideas closely related to
identifiability risk. One difference between that work and the present
embodiments is that the present embodiments do not rely on the
availability of the disclosed data, and can be applied for risk
estimation before any data is presented to the third party.
[0050] The assessment of identifiability risk is performed based on the estimate of the probability that the disclosed information can be used by the third party principal to identify a person. For example, according to US Census 2000 data, in 2000 there was only 1 male person of Asian race in the age group of 30 to 34 years old who was living in the area of postal Zip code 10532. Therefore, if the third party principal receives information that the average income of Asian male persons in Zip code 10532 with ages 30 to 34 is $54,000,
the released data in fact includes very precise income information
on one single person, and this disclosure is most likely
privacy-violating. On the other hand, if the information about
income of a male person in zip code 10504 is disclosed, that income
figure can correspond to any of 3,494 males living in that zip
code, and therefore has much smaller identifiability risk
associated with it.
[0051] Census data is employed herein, as well as other similar
data, to compute a number reflecting the identifiability risks for
different combinations of attributes (e.g., zip code and age group
combination). This number can be used, for example, to compare
different combinations and determine the combination that is less
risky to disclose. In addition to the set of attributes, the
identifiability risk will also depend on public sources, on
distribution of the data being disclosed, and on the information
available to the third party.
[0052] Once the identifiability risk is computed, it can be
combined with other types of risk to determine overall risk
associated with the disclosure. How to compute the combined risk is
also described.
[0053] The following definitions will be employed herein:
[0054] Adversary--the third party principal receiving the disclosed
data from the information system; the use of the term adversary
stresses the possibility of the malicious intent on the part of the
third party principal.
[0055] Individual--a person or other entity (e.g., a company) who
is the subject of privacy-sensitive information, and whose privacy
is protected by privacy policy.
[0056] Attribute--a category of information, such as address, zip
code, telephone number, medical history, etc. Attributes may or may
not be categorical (e.g., discrete). However, it will be assumed
that attributes are categorical, and it is later described how
non-categorical attributes can be handled.
[0057] Population table--the set of records containing exactly one
record for each individual, where for every individual the values
of all applicable attributes are specified. This set reflects
"global knowledge". It is possible that none of the parties in the
information disclosure has the complete global knowledge
information.
[0058] Sample table--the data released to the adversary, a subset
of the population table; not all records and not all attributes
from the population table are released in sample data. If the data
processing performed by the information system (10, FIG. 1)
combines attributes or records using aggregation methods, the
sample data should include all of the attributes that were
combined, and all of the records that were combined. Sample data
may include at most one record for one individual listed in the
population table.
[0059] Key attributes--those attributes in the sample table that
can be used by the adversary to identify a person based on
information accessible to the adversary, including public
sources.
[0060] Protected attributes--those attributes in the sample table
that are not known to the adversary or for which the adversary
cannot establish the association with person identity. In practice,
all non-key attributes included in the sample table are considered
protected attributes.
[0061] Single Record Identifiability: The identifiability risk I(r) of a sample table including a single record r is

    I(r) = \frac{1}{D_K(r)}

where D_K(r) is the size of the set C_K(r), i.e., D_K(r) = |C_K(r)|. The set C_K(r) is the subset of records from the population table that includes only the records that have values of all attributes in the set of key attributes K exactly equal to the corresponding values of the same attributes in r. D_K(r), referred to as the cell size, is equal to the number of records in C_K(r). This definition reflects the fact that the adversary will see the protected attributes of r as equally likely belonging to any individual in C_K(r).
[0062] Multiple Record Identifiability: If the sample table includes records (r_1, r_2, . . . , r_m), then the identifiability risk for this sample table is:

    I(r_1, r_2, \ldots, r_m) = 1 - \prod_{i=1}^{m} (1 - I(r_i))
[0063] Predicted Single Record Identifiability: Both previous computations rely on the availability of the sample table. If the sample table is not available, the prediction of identifiability can be made based on the knowledge of the distribution of the sample records. Assume that the sample table includes 1 record, and for each individual i from the total population table, the probability that this individual is the one appearing in the sample table is p_i. The probabilities {p_i}_{i=1}^{n} define a probability density function over the population of n individuals. Further assume that the set of key attributes K is known at the time of computation.
[0064] Let r_i be a record corresponding to individual i in the total population table. Then, the predicted identifiability is:

    I_p = \sum_{i=1}^{n} \frac{p_i}{D_K(r_i)}

[0065] where, as before, D_K(r_i) is the cell size, e.g., the number of records in the population table that have values of all attributes in the set of key attributes K equal to the values of corresponding attributes in r_i.
[0066] Predicted Multiple Record Identifiability: The identifiability risk can be assessed before m records are released. As in the one record prediction case described above, it is assumed that the values of the probabilities {p_i}_{i=1}^{n} and the names of all attributes in the set of key attributes are provided as input parameters. The identifiability associated with releasing m records is calculated as:

    I_p^{(m)} = 1 - (1 - I_p)^m

where I_p is the predicted single record identifiability, computed as described above.
[0067] Approximate Cell Size Computation: The computation of predicted identifiability requires that the values of D_K(r) = |C_K(r)| can be computed. In a practical implementation of the method, it may be difficult to store these values for all possible attribute combinations K, since for k attributes up to 2^k different combinations K can be constructed.
[0068] One possible solution to this problem is to store the complete global knowledge table. This table includes a record for every individual in the population. The records stored in the global knowledge table include values of all possible attributes. Once this table is available, the computation of the set C_K(r) can be performed trivially, for example by using a known SQL SELECT operation. By definition, C_K(r) is the set of records that includes those and only those records from the population table that have the values of key attributes equal to the values assigned to the corresponding attributes in record r.
[0069] In some implementations, however, the value of identifiability risk, and therefore the value of D_K(r), should be computed quickly, and the full selection operation on a large population table as described above may be too time consuming and therefore unacceptable. Further, obtaining the population table itself can be a difficult task in practice, because it needs complete information about all individuals. To address these issues, an alternative method for estimating identifiability risk has been developed using approximation of the values of D_K(r).
[0070] The approximation is based on the following formula. Let K = K_1 ⊕ K_2 ⊕ . . . ⊕ K_w, i.e., K = K_1 ∪ K_2 ∪ . . . ∪ K_w and K_i ∩ K_j = ∅ for i ≠ j. As before, let n be the size of the population (e.g., the total number of individuals). Then, D_K(r) can be approximately calculated as:

    D_K(r) \approx \tilde{D}_{K_1 \ldots K_w}(r) = \frac{D_{K_1}(r) \, D_{K_2}(r) \cdots D_{K_w}(r)}{n^{w-1}}

This approximation is based on the computation of a joint probability density function under the assumption of independent distribution of the population within each of the attribute groups K_1, K_2, . . . , K_w. The approximation formula permits the computation of identifiability even in extreme cases when only the values of D_{K_i}(r) for elementary sets K_i (e.g., sets including only one attribute) are available.
[0071] Storing the data only for the elementary attribute sets uses
exponentially smaller storage space than in the case of exact
computation, and the space needed for approximate computation grows
at most linearly in the number of attributes and categories of
attribute values. However, the use of very small sets K_i leads
to higher w, and the increase in w may cause an increase of
approximation error. Thus, there exists a natural tradeoff between
the accuracy of approximation and the efficiency of computation.
Practical implementations may choose any level of detail for which
to store the cell size data, and achieve the best practical balance
between efficiency and accuracy by varying the level of detail.
[0072] To show that the approximation approach is practical,
experiments have been performed with census data published by the
US Census Bureau. For the purpose of the following illustrative
experiment, the set of attributes includes four attributes: zip
code, race, sex and age group, which for brevity will be denoted by
the first letter of the attribute name (e.g., Z, R, S, A). The
published data provides values of D_ZRSA(r), i.e., the most
detailed information about cell sizes possible.
[0073] Referring to FIG. 2, an Identifiability (I) approximation is
shown for different combinations of attributes from US National
Census data. The value of one record identifiability (I) computed
based on the data is shown with a first bar 32 in the diagram. The
other bars 34 show approximations computed using different
decompositions of the set ZRSA, where dash ("-") separates
different subsets of the attribute set used in the decomposition.
In FIG. 2, it is easy to see that the identifiability value
approximation is the worst when the set of attributes is decomposed
into elementary components (Z-A-R-S). However, if attributes Z and
R are in the same subset, the accuracy of approximation is very
high. This can be explained by the fact that Zip code and Race are
not independent in this data set.
[0074] Furthermore, in all cases observed in this experiment, the
approximate value of the identifiability risk exceeds the true
value of the risk, and therefore the use of approximation leads to
a conservative risk estimation. Conservative estimation is very
important for consistent protection of privacy. It means that overestimating the risk causes the system to release less information, and therefore under incomplete information or time constraints the system errs on the safe side. Note that
although in this experiment risk overestimation was observed, this
behavior of the approximation method is not guaranteed to hold for
all populations.
[0075] This experiment also illustrates an embodiment, in which
publicly available Census data is used for the computation of
identifiability risk associated with a particular attribute set.
The Census data has a limited number of attributes, but using the
approximation formula the attributes can be combined with other
attributes available in the information system. Therefore, it is
possible to use Census data for computation of identifiability
risks associated with more general attribute sets that include
attributes other than those included in Census data.
[0076] Prediction and Enforcement (Data Analysis Cycle): One
possible embodiment of predicted identifiability may combine risk
prediction and risk monitoring to support the complete cycle of
data analysis.
[0077] Referring to FIGS. 3 and 4, a risk assessment stage 102 and
a processing and risk management stage 104 are illustratively
shown. Stages 102 and 104 may be employed together or separately as
needed. At stage 102, the results of data processing by the
information system 10 are not yet released to the third party
principal 14. This may happen, for example, while the information
system 10 is waiting for a request to initiate data analysis and
data release. At this point, the information system 10 may study
the input data and build a model 106 of the data that can be used to compute the values of {p_i}_{i=1}^{n} needed for initial assessment of identifiability risk I_p. The initial assessment is used to evaluate risk before any data is released to the third party principal 14.
[0078] Any method for modeling the input data and computing probabilities {p_i}_{i=1}^{n} may be used, for example sampling and averaging. Before any data is received by the information system 10, the model 106 can be initialized in the initial state in which the probability mass is concentrated in records that belong to cells with highest identifiability (i.e., r with the smallest D_K(r)) for a predefined combination of attributes K, and all other records are assigned zero or near-zero probability. K can be chosen, for example, to be the largest set of key attributes.
[0079] At stage 104, when the processing and the release of the processed data to the third party principal 14 have been initiated and are in progress, the observed identifiability risk I(r_1, r_2, . . . , r_m) that can be computed based on the released set of records (r_1, r_2, . . . , r_m) may exceed the initial risk assessment value I_p. To detect this situation, a privacy monitor component 108 is added to the chain of processing between the information system 10 and the third party principal 14. This component can perform any of the following functions:
[0080] observe the data received by the third party principal 14, and measure the true risk based on that data;
[0081] terminate data processing if the observed risk exceeds a specified threshold (e.g., 110% of the initial risk assessment I_p);
[0082] suppress (e.g., filter out) the records whose release will cause the observed risk to exceed a specified threshold, as sketched below.
[0083] The continuous update of the input data model 106 may be performed at stage 104 as well as at stage 102. The set of {p_i}_{i=1}^{n} computed based on the updated model may later be used again for initial risk assessment.
[0084] Combined Privacy Risk Computation: One of the many possible
embodiments using the identifiability model 106 may use the
identifiability risk value to compute the overall combined privacy
risk associated with the release of information. As described, the
privacy risk depends on many factors. FIG. 5 shows the main factors
contributing to the overall privacy risk.
[0085] Referring to FIG. 5, illustrative factors affecting privacy
risk 200 are shown. These risks may include values 202, rates 204,
identifiability risk 206, occurrence risk 208 and misuse risk 210.
Other factors may also have an effect. To simplify risk computation
and exclude other factors contributing to the overall risk, the
following assumptions may be made:
[0086] A sample table is sent to the adversary through a secure channel; therefore, there is no risk associated with the data transmission itself (although such a risk can be taken into account, if needed, through a straightforward extension).
[0087] Communicating parties are properly identified; there is no risk associated with releasing information to an untrusted principal posing as a trusted third-party principal. Based on this and the previous assumptions, the identity of the adversary is known, and the trustworthiness of the adversary can be evaluated, if needed.
[0088] Under these assumptions, the remaining factors include:
[0089] Value 202. The value of the information.
[0090] Rate 204. The rate at which records are released.
[0091] Identifiability risk 206. The risk is defined as the probability that the released data can be associated with a particular individual.
[0092] Occurrence risk 208. The risk is defined as the probability that the released data includes private information.
[0093] Misuse risk 210. The risk is defined as the probability that the released data will be misused. The higher the sensitivity of the information, and the lower the trust level of the third party principal, the higher this risk is.
[0094] The overall privacy risk is monotone non-decreasing in all
of these parameters. Below an example is given of a formula based
on joint probability of events that can be used to provide an
assessment of privacy risk.
[0095] Let V(i) be the value of attribute i in the released dataset. Let the rate r be the average number of records released per fixed time interval, e.g., per minute. Let I_p^{(r)} be the predicted identifiability of r records. Let R_o be the occurrence risk. Let R_m(i) be the misuse risk for attribute i. Then, the overall privacy risk R can be defined as the expected lost value, which is equal to the product of the loss risks and the value:

    R = \sum_i V(i) \, R_m(i) \left(1 - (1 - R_o)^r\right) I_p^{(r)}

Note that the misuse risk R_m(i) depends on the difference between the level of information for which the third party principal is trusted (principal credentials) and the information present within the output presented to the third party principal. A sigmoid function described in fuzzy logic can be used to compute R_m(i), as discussed in Cheng et al., incorporated by reference above.
[0096] An example of the use of a sigmoid function is as follows. Sensitivity levels may be viewed as one dimension and each category of risk as one dimension. One choice for determining the misuse risk R_m(i) is the sigmoid function. Let RI be the risk index, with RI ∈ (0, +∞); then

    R_m(RI) = \frac{1}{1 + \exp(-k \, (RI - mid))}

[0097] The value of this function increases very slowly when RI is much smaller than mid, it increases much faster when RI is closer to mid, and it saturates as RI becomes much larger than mid. The value mid is the risk index value where the probability is deemed to be 0.5; it is a tunable parameter. The value k is also a tunable parameter that controls the slope of the function. A dimension may have its own values for mid and k.
[0098] The choice of mid has a significant effect on the probabilities computed, and the probabilities become 1 (or very close to 1) when the value of an object is at least two orders of magnitude, or a hundred times, larger than the trustworthiness of the subject. This observation is consistent with our pessimistic view of human nature. It should be noted that by choosing this formula, the first requirement for R_m(i) discussed above is changed to:

    \lim_{RI \to 0^+} R_m(RI) \approx 0

[0099] This is acceptable since the risk at such a low level is usually well within the acceptable range. If it is desirable to take risk mitigation into consideration, the formula becomes:

    R_m(RI) = \frac{1}{1 + \exp(-k \, (e_m(RI) - mid))}

where e_m(RI) is the residual risk after mitigation.
[0100] A further assumption may be made that the R.sub.m(i) for
sensitivity levels and the R.sub.m(i) for a category are
independent of each other. The rationale behind this assumption
includes the following. View the risk computed from sensitivity
levels as the "risk of being tempted", in other words, the risk of
a subject disclosing sensitive information intentionally for its
own gain. The more sensitive the information or the less
trustworthy the subject, the higher the risk is. The risk computed
from a category may be viewed as the risk of "inadvertent
disclosure or use". It is generally very hard to divide a piece of
information into the "need-to-know" and "no-need-to-know"
partitions while still maintaining the original context of the
information. Therefore, once a subject, even a very trusted one,
absorbs some information, which it has no (strong) need-to-know,
there is a chance the subject will inadvertently disclose or use
the information.
[0101] Referring to FIG. 6, a system/method for privacy protection and/or risk assessment is illustratively shown in accordance with one exemplary embodiment. In block 302, build a risk determination model, preferably using records having a highest identifiability for a combination of attributes.
[0102] In block 304, identifiability risk is evaluated for one or more records in storage using a risk assessment model. In block 308, this evaluation may include computing identifiability risk for a given record, e.g., based on a reciprocal of the size of a subset of records (the cell size), the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record. In block 306, the cell size may be approximated if the actual cell size information is not available. In block 309, the evaluating may include computing identifiability risk for multiple records, e.g., by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
[0103] Alternately, the identifiability risk may be predicted for
records input to an information system in block 310. In block 312,
the prediction may be for a single record identifiability risk for
a given record, e.g., knowing a set of key attributes, summing over
a population: a probability that an individual is the one appearing
in a sample table divided by a cell size for each record
corresponding to the individual. In block 311, the cell size may be
approximated. In block 314, the identifiability risk may be predicted for each of multiple records by, e.g., subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
[0104] In block 320, the identifiability risk (or combined risk) is
compared with a threshold to determine whether the one or more
records can be disclosed to a third party without violation of
privacy criteria. This may be performed prior to disclosing the one
or more records to the third party. A determination of whether the
identifiability risk exceeds a threshold is made in block 322. If
the threshold is exceeded, a function is performed in block 323 to
mitigate unauthorized disclosure of the records to the third party.
This may include preventing the record from being disclosed, preventing all records from being disclosed, and/or observing the data received by the third party principal and determining the true risk. Otherwise, a check as to whether the last record to be checked has been reached is made in block 325. If the last record has not been reached, return to block 320. Otherwise disclose the
record in block 324. A determination for each record can be made
and each record may be disclosed one at a time or in blocks of
records.
[0105] In block 324, the one or more records are disclosed if
disclosure is acceptable based on the comparing step in block 320.
Risks may be combined into a combined privacy risk model configured
to combine risk assessment for a plurality of risk factors in block
326. A combined risk threshold can be established and employed as
the threshold in block 322.
[0106] Automated planning methods can be used to compose and
rearrange the components of the data processing application
automatically in order to manage privacy risk while satisfying
requirements on produced data specified by the end users. In
particular, the components may include privacy filters that filter
out or modify (e.g., anonymize) the sensitive data. Including these
components in the composition may reduce privacy risk associated
with data disclosure. It is possible to use the risk assessment
method described herein within the automated planning framework.
The assessed risk value will be different for different sets of
produced output. The value will also depend on other factors, e.g.,
as described above. All of these factors can be changed by changing
the composition of processing components, and therefore the planner
may be able to create a composition that has expected privacy risk
below a given threshold, or has minimal possible risk of all
possible compositions. This can be achieved, for example, by
creating all possible compositions and computing risk assessment
for each composition.
[0107] Having described preferred embodiments of a system and
method for privacy protection using identifiability risk assessment
(which are intended to be illustrative and not limiting), it is
noted that modifications and variations can be made by persons
skilled in the art in light of the above teachings. It is therefore
to be understood that changes may be made in the particular
embodiments disclosed which are within the scope and spirit of the
invention as outlined by the appended claims. Having thus described
aspects of the invention, with the details and particularity
required by the patent laws, what is claimed and desired protected
by Letters Patent is set forth in the appended claims.
* * * * *