U.S. patent application number 11/480,677 was filed with the patent office on July 3, 2006 and published on January 3, 2008 as publication number 20080005778 for a system and method for privacy protection using identifiability risk assessment. Invention is credited to Weifeng Chen, Zhen Liu, Anton Riabov, and Angela Marie Schuett.

United States Patent Application 20080005778
Kind Code: A1
Chen; Weifeng; et al.
January 3, 2008

System and method for privacy protection using identifiability risk assessment
Abstract

A risk assessment system and method includes an information system configured to disclose information to a third party. A risk determination model is configured to compute identifiability risk for one or more records in storage. The identifiability risk is compared to a threshold prior to disclosure, and the information system is informed if the identifiability risk exceeds the threshold before the information is disclosed to the third party.
Inventors: Chen; Weifeng (Amherst, MA); Liu; Zhen (Tarrytown, NY); Riabov; Anton (Ossining, NY); Schuett; Angela Marie (Columbia, MD)
Correspondence Address: KEUSEY, TUTUNJIAN & BITETTO, P.C., 20 Crossways Park North, Suite 210, Woodbury, NY 11797, US
Family ID: 38878420
Appl. No.: 11/480677
Filed: July 3, 2006
Current U.S. Class: 726/1
Current CPC Class: G06Q 10/10 20130101
Class at Publication: 726/1
International Class: H04L 9/00 20060101 H04L009/00
Government Interests
GOVERNMENT RIGHTS
[0001] This invention was made with Government support under
Contract No.: H98230-05-3-001 awarded by the U.S. Department of
Defense. The Government has certain rights in this invention.
Claims
1. A risk assessment system, comprising: an information system configured to disclose information to a third party; and a risk determination model configured to compute identifiability risk for one or more records in storage, the identifiability risk being compared to a threshold prior to the information being disclosed, wherein the information system is informed of the identifiability risk exceeding the threshold prior to disclosure to the third party.
2. The system as recited in claim 1, wherein the information system
builds the risk determination model based on the one or more
records of input data.
3. The system as recited in claim 1, wherein the information system
builds the risk determination model using records having a highest
identifiability for a combination of attributes.
4. The system as recited in claim 1, wherein the risk determination model computes identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
5. The system as recited in claim 4, wherein the cell size is
approximated.
6. The system as recited in claim 1, wherein the risk determination model computes identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
7. The system as recited in claim 1, wherein the risk determination
model predicts identifiability risk for records input to the
information system.
8. The system as recited in claim 1, wherein the risk determination
model predicts a single record identifiability risk for a given
record, knowing a set of key attributes, by summing over a
population: a probability that an individual is the one appearing
in a sample table divided by a cell size for each record
corresponding to the individual.
9. The system as recited in claim 8, wherein the cell size is
approximated.
10. The system as recited in claim 8, wherein the risk determination model predicts identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
11. The system as recited in claim 1, wherein the risk
determination model includes a combined privacy risk model
configured to combine risk assessment for a plurality of risk
factors.
12. A risk monitoring system, comprising: an information system
configured to disclose information to an entity; a risk
determination model configured to compute identifiability risk for
one or more records in storage; a privacy monitor configured to
receive the one or more records being released, and prior to
disclosing the one or more records to the entity, the privacy
monitor being configured to detect whether the identifiability risk
exceeds a threshold and perform a function to mitigate unauthorized
disclosure of the records to the entity.
13. The system as recited in claim 12, wherein the information
system builds the risk determination model based on the one or more
records of input data.
14. The system as recited in claim 12, wherein the information
system builds the risk determination model using records having a
highest identifiability for a combination of attributes.
15. The system as recited in claim 12, wherein the risk determination model computes identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
16. The system as recited in claim 15, wherein the cell size is
approximated.
17. The system as recited in claim 12, wherein the risk determination model computes identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
18. The system as recited in claim 12, wherein the risk
determination model predicts identifiability risk for records input
to the information system.
19. The system as recited in claim 12, wherein the risk
determination model predicts a single record identifiability risk
for a given record, knowing a set of key attributes, by summing
over a population: a probability that an individual is the one
appearing in a sample table divided by a cell size for each record
corresponding to the individual.
20. The system as recited in claim 19, wherein the cell size is
approximated.
21. The system as recited in claim 19, wherein the risk determination model predicts identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
22. The system as recited in claim 12, wherein the risk
determination model includes a combined privacy risk model
configured to combine risk assessment for a plurality of risk
factors.
23. A method for privacy protection, comprising: evaluating
identifiability risk for one or more records in storage using a
risk assessment model; comparing the identifiability risk with a
threshold to determine whether the one or more records can be
disclosed to a third party without violation of privacy criteria;
and disclosing the one or more records if disclosure is acceptable
based on the comparing step.
24. The method as recited in claim 23, further comprising, prior to
disclosing the one or more records to the third party, detecting
whether the identifiability risk exceeds a threshold and performing
a function to mitigate unauthorized disclosure of the records to
the third party.
25. The method as recited in claim 23, further comprising building
the risk determination model using records having a highest
identifiability for a combination of attributes.
26. The method as recited in claim 23, wherein evaluating includes computing identifiability risk for a given record based on a reciprocal of a size of a subset of records, or cell size, the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record.
27. The method as recited in claim 26, further comprising
approximating the cell size.
28. The method as recited in claim 23, wherein evaluating includes computing identifiability risk for multiple records by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
29. The method as recited in claim 23, wherein evaluating includes
predicting identifiability risk for records input to an information
system.
30. The method as recited in claim 23, wherein evaluating includes
predicting a single record identifiability risk for a given record,
knowing a set of key attributes, by summing over a population: a
probability that an individual is the one appearing in a sample
table divided by a cell size for each record corresponding to the
individual.
31. The method as recited in claim 30, further comprising
approximating the cell size.
32. The method as recited in claim 30, wherein evaluating includes predicting identifiability risk for each of multiple records by subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
33. The method as recited in claim 23, further comprising combining
risks using a combined privacy risk model configured to combine
risk assessment for a plurality of risk factors.
34. A computer program product for privacy protection comprising a computer usable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to: evaluate identifiability risk for one or more records in storage using a risk assessment model; compare the identifiability risk with a threshold to determine whether the one or more records can be disclosed to a third party without violation of privacy criteria; and disclose the one or more records if disclosure is acceptable based on the comparing step.
35. The computer program product as recited in claim 34, wherein evaluating includes predicting identifiability risk for records input to an information system.
Description
BACKGROUND
[0002] 1. Technical Field
[0003] The present invention relates to privacy protection and more
particularly to systems and methods employing an identifiability
risk assessment to protect against disclosure of protected
information.
[0004] 2. Description of the Related Art
[0005] Information systems deployed and maintained by businesses of
different sizes often store personal information collected from
their clients. Similarly, information systems of government
organizations, such as the Internal Revenue Service, store personal
data about citizens, as well as other private data collected from
businesses. In recent years, these information systems have
increasingly become computerized.
[0006] Computer-based information systems enable anytime instant
access to the data, and on-the-fly cross-referencing of large
volumes of data. Computerization is also critical for the
implementation of data mining methods and other automated and
highly efficient data analysis techniques, which in turn help
reduce costs and improve the agility and the efficiency of
businesses and governments.
[0007] Recent advances in communication technologies, wireless and
cellular networks and widespread availability of the Internet
enable instant access to the information systems and computational
resources from virtually anywhere, via easily obtainable equipment
such as a laptop with a wireless card. Together with the benefits they have brought, the now ubiquitous digital technologies have also brought new dangers.
[0008] Currently, it is becoming increasingly easy to gain
unauthorized access to personal data. Identity theft is a very
serious and real threat to anyone who is sharing personal
information with companies in exchange for services, credit, etc.
As a consequence, safeguarding of personal information becomes a
highly important objective for businesses and governments, and many
aspects of privacy protection, such as the collection and use of
information on minors or strong encryption of Internet
communications, are mandated by laws or best business
practices.
[0009] Two previously unsolved important problems of privacy
protection arising in business and government information systems
need solutions. These problems include evaluating and managing a
tradeoff between privacy protection and business efficiency, and
quantifying the privacy risks associated with various business
operations. Solving these problems will contribute to improvements
in both business transparency and business efficiency. The
solutions will help streamline privacy protection processes,
simplify the work of employees responsible for privacy protection,
and increase the accountability of individual employees and entire
organizations.
[0010] Managing the tradeoff between privacy and business
efficiency: Arguably, any information collected and stored by an
organization is collected with the intent of using this
information, for one purpose or another. Privacy policies enacted
by organizations restrict the purposes for which the information
can be used. In this policy-controlled mode, privacy protection is
equivalent to the enforcement of policy compliance. For all but the most permissive policies, there can be practical situations where the potential benefits resulting from a particular business operation are not realized, because the operation requires the use of information in conflict with the privacy policy. The privacy policies therefore control the tradeoff between the needs of business efficiency and the needs of privacy protection.
[0011] Developing privacy policies is an extremely difficult task
that any organization must perform when it establishes a system for
storing personal or private information. It is especially difficult
in information systems where automated policy enforcement is
implemented, because for such systems, the policy is specified as a
set of formal automatically verifiable rules. In some respects the
task of writing policies can be compared to the process of
developing legislation. This analogy can be literal if the
policy is stipulated by law.
[0012] For these reasons, the policy writers often prefer to err on
the side of caution, and prohibit the use of information when in
doubt. Also as a result, the policies mandated internally by an
organization are typically more restrictive than the published
policies, and reflect not only the law and the published
commitments of the organization, but also the best practices
requirements as seen by policy writers.
[0013] The employees responsible for the use of information are
often required to determine whether a particular use of information
is in violation of the policy. For example, in some situations, the
internal policy can be violated in order to allow an extremely
critical operation, as long as the externally published policy and
laws are not violated. If the internal policy is enforced
automatically, such an operation may require the intervention of
high-level management to circumvent the normal enforcement
controls. This problem can be partially addressed by policies that
permit wider access in exceptional circumstances, e.g., if
sufficient justification is provided. However, the currently
existing approaches do not provide sufficient assistance to the
organization's employees responsible for making decisions, and
often they are forced to violate internal policies in order to get
their job done.
SUMMARY
[0014] A risk assessment system and method includes an information
system configured to disclose information to a third party. A risk
determination model is configured to compute identifiability risk
for one or more records in storage. The identifiability risk is compared to a threshold prior to disclosure, and the information system is informed if the identifiability risk exceeds the threshold before the information is disclosed to the third party.
[0015] Another risk monitoring system includes an information
system configured to disclose information to an entity, and a risk
determination model configured to compute identifiability risk for
one or more records in storage. A privacy monitor is configured
to receive the one or more records being released, and prior to
disclosing the one or more records to the entity, the privacy
monitor is configured to detect whether the identifiability risk
exceeds a threshold and perform a function to mitigate unauthorized
disclosure of the records to the entity.
[0016] A method for privacy protection includes evaluating
identifiability risk for one or more records in storage using a
risk assessment model, comparing the identifiability risk with a
threshold to determine whether the one or more records can be
disclosed to a third party without violation of privacy criteria,
and disclosing the one or more records if disclosure is acceptable
based on the comparing step.
[0017] These and other objects, features and advantages will become
apparent from the following detailed description of illustrative
embodiments thereof, which is to be read in connection with the
accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
[0018] The disclosure will provide details in the following
description of preferred embodiments with reference to the
following figures wherein:
[0019] FIG. 1 is a block diagram of a stream processing model for
privacy protection in accordance with one illustrative
embodiment;
[0020] FIG. 2 is a chart showing identifiability approximation for
different combinations of attributes from US Census data in
accordance with an illustrative embodiment;
[0021] FIG. 3 is a block diagram showing a risk assessment system
in accordance with one illustrative embodiment;
[0022] FIG. 4 is a block diagram showing a risk processing and
monitoring system in accordance with another illustrative
embodiment;
[0023] FIG. 5 is a block diagram showing exemplary risk factors
affecting privacy risk in accordance with one embodiment; and
[0024] FIG. 6 is a block/flow diagram showing a system/method for
risk assessment and privacy protection in accordance with an
illustrative embodiment.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0025] Embodiments in accordance with the present principles
address at least two previously unsolved important problems of
privacy protection arising in business and government information
systems. These problems include evaluating and managing a tradeoff
between privacy protection and business efficiency, and quantifying
the privacy risks associated with various business operations. The
present embodiments provide systems and methods for evaluating and
managing the tradeoffs between business efficiency and privacy
protection.
[0026] The systems and methods are preferably based on a privacy
risk quantification method for the estimation of the numeric value
of privacy risk associated with a particular use of protected
information. An organization's employees cannot be relieved of the duty of making their own, frequently highly subjective, estimates of the risks, but they can be provided with a system which will automatically compute the risk value and keep records of the value, the employee's decision, and the employee's justification for the decision.
[0027] Such a system may include enforcement controls for limiting
the risk that an employee can take without consulting with others,
for example based on work experience--more experienced employees
may be trusted to take higher risks. Justification requirements and
risk mitigation procedures of varying degrees can be established,
such that the requirements and the procedures are enforced
automatically depending on the risk estimate. Possible risk
mitigation measures may include an audit that takes place after the
employee decides to perform a particular action leading to use of
information, post-action management review, or pre-action
management approval requirement.
[0028] A noted approach to evaluation of security risks is
described in commonly assigned U.S. patent application Ser. No. 11/123,998, to
Cheng et al. entitled "SYSTEM AND METHOD FOR FUZZY MULTI-LEVEL
SECURITY", filed May 6, 2005, and incorporated herein by reference.
Many of the ideas described in Cheng et al. can be applied in
privacy protection scenarios. Access control decisions in privacy
protection context translate into the decisions about allowing the
use of information for specific purposes.
[0029] Assessment of the privacy risks associated with business
operations: People generally accept the necessity of providing
personal information to businesses or governments. When asked to
provide such information, people usually know the entity they are
providing the information to, and can decide whether the
information should be provided. However, in practice it is very
hard to ensure that the information indeed is provided to one
single principal constituting a business or a government. The
information is often shared with other principals: e.g., the
employees of the business or government, business partners, and so
on. In this situation it is especially difficult to limit the
distribution of private information due to the possibility of misuse by
trusted insiders or trusted partners.
[0030] For these reasons, it can be concluded that any use of
personal or other types of private information that leads to
disclosure of private information, e.g., providing the information
to principals who did not have access to the information before, is
associated with risk. The privacy risk can be defined as the risk
that private information will be misused, or shared
inappropriately.
[0031] A system for evaluation and management of privacy risks is
provided and may be based on a method for risk estimation.
Embodiments in accordance with present principles build upon the
methods developed for the measurement of security risks, and
introduce a novel quantitative measure for assessment of
privacy-specific identifiability risk.
[0032] The present embodiments have significantly extended the
ideas of disclosure risk estimation described in the prior art to
develop methods for providing consistent global privacy risk
assessment both before and after the data have been disclosed to a
third party. Risk estimation before actual data is disclosed can be
performed based on the forecast of future data distribution.
Obtaining such an estimate is important when the agreements on the
conditions of sharing the data are established before the data is
actually transferred from the information system to a third party,
and even before the data enters the information system.
[0033] Third party or third party principal as employed herein
refers to any consumer of data. The consumer may be an internal or external component or entity of a given system, or may be a
different part of the same information system. Third party will be
employed herein to refer to an entity which at least initially does
not have access to sensitive information which the third party has
requested or needs for an application or function.
[0034] Embodiments of the present invention can take the form of an
entirely hardware embodiment, an entirely software embodiment or an
embodiment including both hardware and software elements. In a
preferred embodiment, the present invention is implemented in
software, which includes but is not limited to firmware, resident
software, microcode, etc.
[0035] Furthermore, the invention can take the form of a computer
program product accessible from a computer-usable or
computer-readable medium providing program code for use by or in
connection with a computer or any instruction execution system. For
the purposes of this description, a computer-usable or computer
readable medium can be any apparatus that may include, store,
communicate, propagate, or transport the program for use by or in
connection with the instruction execution system, apparatus, or
device. The medium can be an electronic, magnetic, optical,
electromagnetic, infrared, or semiconductor system (or apparatus or
device) or a propagation medium. Examples of a computer-readable
medium include a semiconductor or solid state memory, magnetic
tape, a removable computer diskette, a random access memory (RAM),
a read-only memory (ROM), a rigid magnetic disk and an optical
disk. Current examples of optical disks include compact disk-read
only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
[0036] A data processing system suitable for storing and/or
executing program code may include at least one processor coupled
directly or indirectly to memory elements through a system bus. The
memory elements can include local memory employed during actual
execution of the program code, bulk storage, and cache memories
which provide temporary storage of at least some program code to
reduce the number of times code is retrieved from bulk storage
during execution. Input/output or I/O devices (including but not
limited to keyboards, displays, pointing devices, etc.) may be
coupled to the system either directly or through intervening I/O
controllers.
[0037] Network adapters may also be coupled to the system to enable
the data processing system to become coupled to other data
processing systems or remote printers or storage devices through
intervening private or public networks. Modems, cable modems and
Ethernet cards are just a few of the currently available types of
network adapters.
[0038] Referring now to the drawings in which like numerals
represent the same or similar elements and initially to FIG. 1, a
block diagram showing a flow of information is illustratively
shown. An information system 10 receives data through a data input
facility 12, processes the received data, and releases the results
of the processing to a third party principal 14 (e.g., to an
insider or to a partner, for example an employee of the
organization that collects private data, or another information
system). The release of information to third party 14 can be
initiated either by a specific request by the third party 14, or
automatically. The automatic release may happen on a predefined
schedule, for example immediately when new input data arrives. No
restrictions are placed on the operations that the information
system 10 can perform on data. It is assumed that communication
channels are secure, and all communicating parties have been
properly identified.
[0039] In accordance with present embodiments, a method of risk assessment provides only a risk estimate 16 associated with data disclosure to the third party principal 14, and in general the separate data input facility component 12 is not required for the application of the method. Further, any business operation that requires the use of private information by employees or other third parties can be represented as a combination of operations outlined
in FIG. 1. How the overall risk estimates associated with such an
operation can be computed will be described. Although there can be
risks associated with storing and processing the private
information in the information system, this risk is not considered
here, so that the focus is on the disclosure risk.
[0040] To make the distinction between private and public
information, public sources may be defined as sources of
information that are accessible to the third-party principal. In
general settings, these sources can be those that are available to
very large groups of people, such as all citizens of a country, all
government employees, all clients of a multinational company, etc.
Publicly available information is information obtained from public
sources.
[0041] The following factors can contribute to overall risk associated with the disclosure of private information:
[0042] The type of private information being disclosed. For example, disclosing a person's taxpayer identification number is considered a much more serious violation of privacy than disclosing the phone number of the same person.
[0043] The amount of information being disclosed. The more data of the same type is released, the higher the risk that one of the records will be highly sensitive.
[0044] The availability of the disclosed private information through public sources. For example, disclosing the income of the current President of the U.S. involves no privacy risk, since by tradition since the 1970s U.S. Presidents mostly choose to make their tax returns public.
[0045] The degree to which the disclosed private information can be associated with a particular individual via cross-referencing with publicly available information. For example, disclosing social security number (SSN) 123-45-6789 and age 24, without any other information, has negligible privacy risk: although the SSN is valuable information, knowing only the age of a person leaves too many candidates, e.g., all US residents 24 years of age, and each of these candidates is equally likely to have the disclosed SSN.
[0046] The context of the disclosure, which may supply additional information for cross-referencing. For example, if in the previous example the SSN and age are provided as part of a list of employees of a company, it may significantly limit the number of candidates, and potentially make it possible to single out the identity associated with these attributes by cross-referencing the data available on the employees.
[0047] The likelihood of information misuse by the third party principal. The issue here is that the principal receiving the information will not necessarily abide by the same privacy policy that allowed the principal who is sharing the information to make the disclosure. Access control policies capture this by using credentials to reflect the trustworthiness of the principal.
[0048] The security risk evaluation model described in Cheng et al., cited above, addresses the evaluation of the risk of misuse of the information by the principal receiving the information. This type of risk is also relevant and important in the privacy context, and the model proposed in Cheng et al. can be used as part of a privacy risk evaluation formula. However, the needs of privacy protection
add another component of risk that is associated with certainty of
determining the identity of the individual associated with the
information being disclosed. This identification is not always
trivial to deduce, but it can in many cases be achieved by using
the knowledge of the context of information disclosure, and
cross-referencing the information obtained from the information
system with the information obtained from publicly available
sources.
[0049] Existing work on the analysis of disclosure risks in
statistical databases includes: T. M. Truta, F. Fotouhi and D.
Barth-Jones, "Assessing global disclosure risk in masked metadata",
in Proceedings of the 2004 ACM Workshop on Privacy in The
Electronic Society, Washington D.C., USA, pp. 85-93 (hereinafter
Truta et al.) and references therein. Truta et al. describes
methods of risk estimation based on ideas closely related to
identifiability risk. One difference between that work and the present
embodiments is that the present embodiments do not rely on the
availability of the disclosed data, and can be applied for risk
estimation before any data is presented to the third party.
[0050] The assessment of identifiability risk is performed based on the estimate of the probability that the disclosed information can be used by the third party principal to identify a person. For example, according to US Census 2000 data, in 2000 there was only 1 male person of Asian race in the age group of 30 to 34 years old who was living in the area of postal Zip code 10532. Therefore, if the third party principal receives information that the average income of Asian male persons in Zip code 10532 with ages 30 to 34 is $54,000,
the released data in fact includes very precise income information
on one single person, and this disclosure is most likely
privacy-violating. On the other hand, if the information about
income of a male person in zip code 10504 is disclosed, that income
figure can correspond to any of 3,494 males living in that zip
code, and therefore has much smaller identifiability risk
associated with it.
[0051] Census data is employed herein, as well as other similar
data, to compute a number reflecting the identifiability risks for
different combinations of attributes (e.g., zip code and age group
combination). This number can be used, for example, to compare
different combinations and determine the combination that is less
risky to disclose. In addition to the set of attributes, the
identifiability risk will also depend on public sources, on
distribution of the data being disclosed, and on the information
available to the third party.
[0052] Once the identifiability risk is computed, it can be
combined with other types of risk to determine overall risk
associated with the disclosure. How to compute the combined risk is
also described.
[0053] The following definitions will be employed herein:
[0054] Adversary--the third party principal receiving the disclosed
data from the information system; the use of the term adversary
stresses the possibility of the malicious intent on the part of the
third party principal.
[0055] Individual--a person or other entity (e.g., a company) who
is the subject of privacy-sensitive information, and whose privacy
is protected by privacy policy.
[0056] Attribute--a category of information, such as address, zip
code, telephone number, medical history, etc. Attributes may or may
not be categorical (e.g., discrete). However, it will be assumed
that attributes are categorical, and it is later described how
non-categorical attributes can be handled.
[0057] Population table--the set of records containing exactly one
record for each individual, where for every individual the values
of all applicable attributes are specified. This set reflects
"global knowledge". It is possible that none of the parties in the
information disclosure has the complete global knowledge
information.
[0058] Sample table--the data released to the adversary, a subset
of the population table; not all records and not all attributes
from the population table are released in sample data. If the data
processing performed by the information system (10, FIG. 1)
combines attributes or records using aggregation methods, the
sample data should include all of the attributes that were
combined, and all of the records that were combined. Sample data
may include at most one record for one individual listed in the
population table.
[0059] Key attributes--those attributes in the sample table that
can be used by the adversary to identify a person based on
information accessible to the adversary, including public
sources.
[0060] Protected attributes--those attributes in the sample table
that are not known to the adversary or for which the adversary
cannot establish the association with person identity. In practice,
all non-key attributes included in the sample table are considered
protected attributes.
[0061] Single Record Identifiability: The identifiability risk I(r) of a sample table including a single record r is

    I(r) = \frac{1}{D_K(r)}

where D_K(r) is the size of the set C_K(r), i.e., D_K(r) = |C_K(r)|. The set C_K(r) is the subset of records from the population table that includes only the records that have values of all attributes in the set of key attributes K exactly equal to the corresponding values of the same attributes in r. D_K(r), referred to as the cell size, is equal to the number of records in C_K(r). This definition reflects the fact that the adversary will see the protected attributes of r as equally likely belonging to any individual in C_K(r).
[0062] Multiple Record Identifiability: If the sample table includes records (r_1, r_2, . . . , r_m), then the identifiability risk for this sample table is:

    I(r_1, r_2, \ldots, r_m) = 1 - \prod_{i=1}^{m} (1 - I(r_i))
[0063] Predicted Single Record Identifiability: Both previous computations rely on the availability of the sample table. If the sample table is not available, the prediction of identifiability can be made based on the knowledge of the distribution of the sample records. Assume that the sample table includes 1 record, and for each individual i from the total population table, the probability that this individual is the one appearing in the sample table is p_i. The probabilities {p_i}_{i=1}^{n} define a probability density function over the population of n individuals. Further assume that the set of key attributes K is known at the time of computation.
[0064] Let r_i be a record corresponding to individual i in the total population table. Then, the predicted identifiability is:

    I_p = \sum_{i=1}^{n} \frac{p_i}{D_K(r_i)}

[0065] where, as before, D_K(r_i) is the cell size, e.g., the number of records in the population table that have values of all attributes in the set of key attributes K equal to the values of corresponding attributes in r_i.
[0066] Predicted Multiple Record Identifiability: The identifiability risk can be assessed before m records are released. As in the one record prediction case described above, it is assumed that the values of the probabilities {p_i}_{i=1}^{n} and the names of all attributes in the set of key attributes are provided as input parameters. The identifiability associated with releasing m records is calculated as:

    I_p^{(m)} = 1 - (1 - I_p)^m

where I_p is the predicted single record identifiability, computed as described above.
[0067] Approximate Cell Size Computation: The computation of predicted identifiability requires that the values of D_K(r) = |C_K(r)| can be computed. In a practical implementation of the method, it may be difficult to store these values for all possible attribute combinations K, since for k attributes up to 2^k different combinations K can be constructed.
[0068] One possible solution to this problem is to store the complete global knowledge table. This table includes a record for every individual in the population. The records stored in the global knowledge table include values of all possible attributes. Once this table is available, the computation of the set C_K(r) can be performed trivially, for example by using a known SQL SELECT operation. By definition, C_K(r) is the set of records that includes those and only those records from the population table that have the values of key attributes equal to the values assigned to the corresponding attributes in record r.
[0069] In some implementations, however, the value of identifiability risk, and therefore the value of D_K(r), should be computed quickly, and the full selection operation on a large population table as described above may be too time consuming and therefore unacceptable. Further, obtaining the population table itself can be a difficult task in practice, because it needs complete information about all individuals. To address these issues, an alternative method for estimating identifiability risk has been developed using approximation of the values of D_K(r).
[0070] The approximation is based on the following formula. Let K = K_1 ⊕ K_2 ⊕ . . . ⊕ K_w, i.e., K = K_1 ∪ K_2 ∪ . . . ∪ K_w and K_i ∩ K_j = ∅ for i ≠ j. As before, let n be the size of the population (e.g., the total number of individuals). Then, D_K(r) can be approximately calculated as:

    D_K(r) \approx \tilde{D}_{K_1 \ldots K_w}(r) = \frac{D_{K_1}(r) \, D_{K_2}(r) \cdots D_{K_w}(r)}{n^{w-1}}

This approximation is based on the computation of a joint probability density function under the assumption of independent distribution of the population within each of the attribute groups K_1, K_2, . . . , K_w. The approximation formula permits the computation of identifiability even in extreme cases when only the values of D_{K_i}(r) for elementary sets K_i (e.g., sets including only one attribute) are available.
[0071] Storing the data only for the elementary attribute sets uses
exponentially smaller storage space than in the case of exact
computation, and the space needed for approximate computation grows
at most linearly in the number of attributes and categories of
attribute values. However, the use of very small sets K_i leads
to higher w, and the increase in w may cause an increase of
approximation error. Thus, there exists a natural tradeoff between
the accuracy of approximation and the efficiency of computation.
Practical implementations may choose any level of detail for which
to store the cell size data, and achieve the best practical balance
between efficiency and accuracy by varying the level of detail.
[0072] To show that the approximation approach is practical,
experiments have been performed with census data published by the
US Census Bureau. For the purpose of the following illustrative
experiment, the set of attributes includes four attributes: zip
code, race, sex and age group, which for brevity will be denoted by
the first letter of the attribute name (e.g., Z, R, S, A). The
published data provides values of D_ZRSA(r), i.e., the most
detailed information about cell sizes possible.
[0073] Referring to FIG. 2, an Identifiability (I) approximation is
shown for different combinations of attributes from US National
Census data. The value of one record identifiability (I) computed
based on the data is shown with a first bar 32 in the diagram. The
other bars 34 show approximations computed using different
decompositions of the set ZRSA, where dash ("-") separates
different subsets of the attribute set used in the decomposition.
In FIG. 2, it is easy to see that the identifiability value
approximation is the worst when the set of attributes is decomposed
into elementary components (Z-A-R-S). However, if attributes Z and
R are in the same subset, the accuracy of approximation is very
high. This can be explained by the fact that Zip code and Race are
not independent in this data set.
[0074] Furthermore, in all cases observed in this experiment, the
approximate value of the identifiability risk exceeds the true
value of the risk, and therefore the use of approximation leads to
a conservative risk estimation. Conservative estimation is very
important for consistent protection of privacy. It means that overestimating the risk causes the system to release less information, and therefore under incomplete information or time constraints the system errs on the safe side. Note that
although in this experiment risk overestimation was observed, this
behavior of the approximation method is not guaranteed to hold for
all populations.
[0075] This experiment also illustrates an embodiment, in which
publicly available Census data is used for the computation of
identifiability risk associated with a particular attribute set.
The Census data has a limited number of attributes, but using the
approximation formula the attributes can be combined with other
attributes available in the information system. Therefore, it is
possible to use Census data for computation of identifiability
risks associated with more general attribute sets that include
attributes other than those included in Census data.
[0076] Prediction and Enforcement (Data Analysis Cycle): One
possible embodiment of predicted identifiability may combine risk
prediction and risk monitoring to support the complete cycle of
data analysis.
[0077] Referring to FIGS. 3 and 4, a risk assessment stage 102 and
a processing and risk management stage 104 are illustratively
shown. Stages 102 and 104 may be employed together or separately as
needed. At stage 102, the results of data processing by the
information system 10 are not yet released to the third party
principal 14. This may happen, for example, while the information
system 10 is waiting for a request to initiate data analysis and
data release. At this point, the information system 10 may study
the input data and build a model 106 of the data that can be used to compute the values of {p_i}_{i=1}^{n} needed for initial assessment of identifiability risk I_p. The initial assessment is used to evaluate risk before any data is released to the third party principal 14.
[0078] Any method for modeling the input data and computing probabilities {p_i}_{i=1}^{n} may be used, for example sampling and averaging. Before any data is received by the information system 10, the model 106 can be initialized in the initial state in which the probability mass is concentrated in records that belong to cells with highest identifiability (i.e., r with the smallest D_K(r)) for a predefined combination of attributes K, and all other records are assigned zero or near-zero probability. K can be chosen, for example, to be the largest set of key attributes.
[0079] At stage 104, when the processing and the release of the processed data to the third party principal 14 have been initiated and are in progress, the observed identifiability risk I(r_1, r_2, . . . , r_m) that can be computed based on the released set of records (r_1, r_2, . . . , r_m) may exceed the initial risk assessment value I_p. To detect this situation, a privacy monitor component 108 is added to the chain of processing between the information system 10 and the third party principal 14. This component can perform any of the following functions:
[0080] observe the data received by the third party principal 14, and measure the true risk based on that data;
[0081] terminate data processing if the observed risk exceeds a specified threshold (e.g., 110% of the initial risk assessment I_p);
[0082] suppress (e.g., filter out) the records whose release will cause the observed risk to exceed a specified threshold, as sketched below.
[0083] The continuous update of the input data model 106 may be performed at stage 104 as well as at stage 102. The set of {p_i}_{i=1}^{n} computed based on the updated model may later be used again for initial risk assessment.
[0084] Combined Privacy Risk Computation: One of the many possible
embodiments using the identifiability model 106 may use the
identifiability risk value to compute the overall combined privacy
risk associated with the release of information. As described, the
privacy risk depends on many factors. FIG. 5 shows the main factors
contributing to the overall privacy risk.
[0085] Referring to FIG. 5, illustrative factors affecting privacy
risk 200 are shown. These risks may include values 202, rates 204,
identifiability risk 206, occurrence risk 208 and misuse risk 210.
Other factors may also have an effect. To simplify risk computation
and exclude other factors contributing to the overall risk, the
following assumptions may be made:
[0086] A sample table is sent to the adversary through a secure channel; therefore, there is no risk associated with the data transmission itself (although such a risk can be taken into account, if needed, through a straightforward extension).
[0087] Communicating parties are properly identified; there is no risk associated with releasing information to an untrusted principal posing as a trusted third-party principal. Based on this and the previous assumptions, the identity of the adversary is known, and the trustworthiness of the adversary can be evaluated, if needed.
[0088] Under these assumptions, the remaining factors include:
[0089] Value 202. The value of the information.
[0090] Rate 204. The rate at which records are released.
[0091] Identifiability risk 206. The risk is defined as the probability that the released data can be associated with a particular individual.
[0092] Occurrence risk 208. The risk is defined as the probability that the released data includes private information.
[0093] Misuse risk 210. The risk is defined as the probability that the released data will be misused. The higher the sensitivity of the information, and the lower the trust level of the third party principal, the higher this risk is.
[0094] The overall privacy risk is monotone non-decreasing in all
of these parameters. Below an example is given of a formula based
on joint probability of events that can be used to provide an
assessment of privacy risk.
[0095] Let V(i) be the value of attribute i in the released dataset. Let the rate r be the average number of records released per fixed time interval, e.g., per minute. Let I_p^{(r)} be the predicted identifiability of r records. Let R_o be the occurrence risk. Let R_m(i) be the misuse risk for attribute i. Then, the overall privacy risk R can be defined as the expected lost value, which is equal to the product of the loss risks and the value:

    R = \sum_i V(i) \, R_m(i) \left(1 - (1 - R_o)^r\right) I_p^{(r)}

Note that the misuse risk R_m(i) depends on the difference between the level of information for which the third party principal is trusted (principal credentials) and the information present within the output presented to the third party principal. A sigmoid function described in fuzzy logic can be used to compute R_m(i), as discussed in Cheng et al., incorporated by reference above.
[0096] An example of the use of a sigmoid function is as follows. Sensitivity levels may be viewed as one dimension and each category of risk as one dimension. One choice for determining the misuse risk R_m(i) is the sigmoid function. Let RI be the risk index, with RI ∈ (0, +∞); then

    R_m(RI) = \frac{1}{1 + \exp(-k \, (RI - mid))}

[0097] The value of this function increases very slowly when RI is much smaller than mid, it increases much faster when RI is closer to mid, and it saturates as RI becomes much larger than mid. The value mid is the risk index value where the probability is deemed to be 0.5; it is a tunable parameter. The value k is also a tunable parameter that controls the slope of the function. A dimension may have its own values for mid and k.
[0098] The choice of mid has a significant effect on the probabilities computed, and the probabilities become 1 (or very close to 1) when the value of an object is at least two orders of magnitude, or a hundred times, larger than the trustworthiness of the subject. This observation is consistent with our pessimistic view of human nature. It should be noted that by choosing this formula, the first requirement for R_m(i) discussed above is changed to:

    \lim_{RI \to 0^+} R_m(RI) \approx 0

[0099] This is acceptable since the risk at such a low level is usually well within the acceptable range. If it is desirable to take risk mitigation into consideration, the formula becomes:

    R_m(RI) = \frac{1}{1 + \exp(-k \, (e_m(RI) - mid))}

where e_m(RI) is the residual risk after mitigation.
[0100] A further assumption may be made that the R.sub.m(i) for
sensitivity levels and the R.sub.m(i) for a category are
independent of each other. The rationale behind this assumption
includes the following. View the risk computed from sensitivity
levels as the "risk of being tempted", in other words, the risk of
a subject disclosing sensitive information intentionally for its
own gain. The more sensitive the information or the less
trustworthy the subject, the higher the risk is. The risk computed
from a category may be viewed as the risk of "inadvertent
disclosure or use". It is generally very hard to divide a piece of
information into the "need-to-know" and "no-need-to-know"
partitions while still maintaining the original context of the
information. Therefore, once a subject, even a very trusted one,
absorbs some information, which it has no (strong) need-to-know,
there is a chance the subject will inadvertently disclose or use
the information.
[0101] Referring to FIG. 6, a system/method for privacy protection and/or risk assessment is illustratively shown in accordance with one exemplary embodiment. In block 302, build a risk determination model, preferably using records having a highest identifiability for a combination of attributes.
[0102] In block 304, identifiability risk is evaluated for one or more records in storage using a risk assessment model. In block 308, this evaluation may include computing identifiability risk for a given record, e.g., based on a reciprocal of the size of a subset of records (the cell size), the subset including only the records from a population table that have values of all attributes in a set of key attributes included in the given record. In block 306, the cell size may be approximated if the actual cell size information is not available. In block 309, the evaluating may include computing identifiability risk for multiple records, e.g., by subtracting from 1 a product, over all records in a sample table, of 1 minus the identifiability risk.
[0103] Alternately, the identifiability risk may be predicted for
records input to an information system in block 310. In block 312,
the prediction may be for a single record identifiability risk for
a given record, e.g., knowing a set of key attributes, summing over
a population: a probability that an individual is the one appearing
in a sample table divided by a cell size for each record
corresponding to the individual. In block 311, the cell size may be
approximated. In block 314, the identifiability risk may be predicted for each of multiple records by, e.g., subtracting from 1 the quantity 1 minus the single record identifiability risk raised to the exponent m, where m is the number of records.
[0104] In block 320, the identifiability risk (or combined risk) is
compared with a threshold to determine whether the one or more
records can be disclosed to a third party without violation of
privacy criteria. This may be performed prior to disclosing the one
or more records to the third party. A determination of whether the
identifiability risk exceeds a threshold is made in block 322. If
the threshold is exceeded, a function is performed in block 323 to
mitigate unauthorized disclosure of the records to the third party.
This may include preventing the record from being disclosed, preventing all records from being disclosed, and/or observing the data received by the third party principal and determining the true risk. Otherwise, a check as to whether the last record to be checked has been reached is made in block 325. If the last record has not been reached, return to block 320. Otherwise disclose the
record in block 324. A determination for each record can be made
and each record may be disclosed one at a time or in blocks of
records.
[0105] In block 324, the one or more records are disclosed if
disclosure is acceptable based on the comparing step in block 320.
Risks may be combined into a combined privacy risk model configured
to combine risk assessment for a plurality of risk factors in block
326. A combined risk threshold can be established and employed as
the threshold in block 322.
[0106] Automated planning methods can be used to compose and
rearrange the components of the data processing application
automatically in order to manage privacy risk while satisfying
requirements on produced data specified by the end users. In
particular, the components may include privacy filters that filter
out or modify (e.g., anonymize) the sensitive data. Including these
components in the composition may reduce privacy risk associated
with data disclosure. It is possible to use the risk assessment
method described herein within the automated planning framework.
The assessed risk value will be different for different sets of
produced output. The value will also depend on other factors, e.g.,
as described above. All of these factors can be changed by changing
the composition of processing components, and therefore the planner
may be able to create a composition that has expected privacy risk
below a given threshold, or has minimal possible risk of all
possible compositions. This can be achieved, for example, by
creating all possible compositions and computing risk assessment
for each composition.
[0107] Having described preferred embodiments of a system and
method for privacy protection using identifiability risk assessment
(which are intended to be illustrative and not limiting), it is
noted that modifications and variations can be made by persons
skilled in the art in light of the above teachings. It is therefore
to be understood that changes may be made in the particular
embodiments disclosed which are within the scope and spirit of the
invention as outlined by the appended claims. Having thus described
aspects of the invention, with the details and particularity
required by the patent laws, what is claimed and desired protected
by Letters Patent is set forth in the appended claims.
* * * * *