U.S. patent application number 13/676528 was filed with the patent office on 2012-11-14 and published on 2014-05-15 for privacy preserving statistical analysis for distributed databases.
This patent application is currently assigned to Mitsubishi Electric Research Laboratories, Inc. The applicant listed for this patent is MITSUBISHI ELECTRIC RESEARCH LABORATORIES, INC. Invention is credited to Bing-Rong Lin, Shantanu Rane, Ye Wang.
Application Number | 20140137260 13/676528 |
Document ID | / |
Family ID | 50683106 |
Filed Date | 2014-05-15 |
United States Patent Application | 20140137260 |
Kind Code | A1 |
Wang; Ye; et al. | May 15, 2014 |
Privacy Preserving Statistical Analysis for Distributed Databases
Abstract
Aggregate statistics are determined by first randomizing
independently data X and Y to obtain randomized data {circumflex
over (X)} and {circumflex over (Y)}. The first randomizing preserves
the privacy of the data X and Y. Then, the randomized data
{circumflex over (X)} and {circumflex over (Y)} are randomized
secondly to obtain randomized data {tilde over (X)} and {tilde over
(Y)} for a server, and helper information T.sub.{tilde over
(X)}|{circumflex over (X)} and T.sub.{tilde over (Y)}|{circumflex
over (Y)} for a client, wherein T represents an empirical
distribution, and wherein the randomizing secondly preserves the
privacy of the aggregate statistics of the data X and Y. The server
then determines T.sub.{tilde over (X)},{tilde over (Y)}. Last, the
client applies the side information T.sub.{tilde over
(X)}|{circumflex over (X)} and T.sub.{tilde over (Y)}|{circumflex
over (Y)} to T.sub.{tilde over (X)},{tilde over (Y)} to obtain an
estimated {dot over (T)}.sub.X,Y, where "|" and "," between X and Y
represent a conditional and joint distribution, respectively.
Inventors: | Wang; Ye (Andover, MA); Lin; Bing-Rong (State College, PA); Rane; Shantanu (Cambridge, MA) |
Applicant: | MITSUBISHI ELECTRIC RESEARCH LABORATORIES, INC. (Cambridge, MA, US) |
Assignee: | Mitsubishi Electric Research Laboratories, Inc. (Cambridge, MA) |
Family ID: | 50683106 |
Appl. No.: | 13/676528 |
Filed: | November 14, 2012 |
Current U.S. Class: | 726/26 |
Current CPC Class: | G06F 21/6245 20130101; G06F 21/60 20130101; G06F 21/6254 20130101 |
Class at Publication: | 726/26 |
International Class: | G06F 21/60 20060101 G06F021/60 |
Claims
1. A method for securely determining aggregate statistics on
private data, comprising the steps of: randomizing firstly and
independently data X and Y to obtain randomized data {circumflex
over (X)} and {circumflex over (Y)}, respectively, wherein the
randomizing firstly preserves a privacy of the data X and Y;
randomizing secondly independently the randomized data {circumflex
over (X)} and {circumflex over (Y)} to obtain randomized data {tilde
over (X)} and {tilde over (Y)} for a server, and helper information
T.sub.{tilde over (X)}|{circumflex over (X)} and T.sub.{tilde over
(Y)}|{circumflex over (Y)} for a client, respectively, wherein T
represents an empirical distribution, and wherein the randomizing
secondly preserves the privacy of the aggregate statistics of the
data X and Y; determining, at the server, T.sub.{tilde over
(X)},{tilde over (Y)}; applying, by the client, the helper
information T.sub.{tilde over (X)}|{circumflex over (X)} and
T.sub.{tilde over (Y)}|{circumflex over (Y)} to T.sub.{tilde over
(X)},{tilde over (Y)} to obtain an estimated {dot over
(T)}.sub.X,Y, wherein "|" and "," between X and Y represent a
conditional and joint distribution, respectively.
2. The method of claim 1, wherein the data X are produced by a
first data source, and the data Y are produced by a second data
source, and the data X and Y are produced independently in a
distributed manner.
3. The method of claim 1, wherein the randomizing uses a Post
RAndomisation Method (PRAM).
4. The method of claim 1, wherein the randomizing firstly and
secondly are different.
5. The method of claim 1, wherein the helper information is small
compared to the data X and Y.
6. The method of claim 1, wherein data X and Y are random
sequences, and data pairs (X.sub.i,Y.sub.i) are independently and
identically distributed.
7. The method of claim 1, wherein the randomizing preserves
differential and distributional privacy of the data X and Y.
8. The method of claim 1, wherein the randomizing secondly provides
distributional privacy that is stronger than the differential
privacy provided by the randomizing firstly.
Description
FIELD OF THE INVENTION
[0001] This invention relates generally to secure computing by
third parties, and more particularly to performing secure
statistical analysis on a private distributed database.
BACKGROUND OF THE INVENTION
[0002] Big Data
[0003] It is estimated that 2.5 quintillion (10.sup.18) bytes of
data are created each day. This means that 90% of all the data in
the world today has been created in the last two years. These "big"
data come from everywhere: social media, pictures and videos,
financial transactions, telephones, governments, medical, academic,
and financial institutions, and private companies. Needless to say,
the data are highly distributed in what has become known as the
"cloud."
[0004] There is a need to statistically analyze this data. For many
applications, the data are private and require the analysis to be
secure. As used herein, secure means that privacy of the data is
preserved, such as the identity of the sources for the data, and
the detailed content of the raw data. Randomized response is one
prior art way to do this. Randomized response does not unambiguously
reveal the response of a particular respondent, but aggregate
statistical measures, such as the mean or variance, can still be
determined.
[0005] Differential privacy (DP) is another way to preserve privacy
by using a randomizing function, such as Laplacian noise.
Informally, differential privacy means that the result of a
function determined on a database of respondents is almost
insensitive to the presence or absence of a particular respondent.
Formally, if the function is evaluated on adjacent databases
differing in only one respondent, then the probability of
outputting the same result is almost unchanged.
[0006] Conventional mechanisms for privacy, such as k-anonymization,
are not differentially private, because an adversary can link an
arbitrary amount of helper (side) information to the anonymized
data to defeat the anonymization.
[0007] Other mechanisms used to provide differential privacy
typically involve output perturbation, e.g., noise is added to a
function of the data. Nevertheless, it can be shown that the
randomized response mechanism, where noise is added to the data
itself, provides DP.
[0008] Unfortunately, while DP provides a rigorous and worst-case
characterization for the privacy of the respondents, it is not
enough to formulate privacy of an empirical probability
distribution or "type" of the data. In particular, if an adversary
has accessed anonymized adjacent databases, then the DP mechanism
ensures that the adversary cannot de-anonymize any respondent.
However, by construction, possessing an anonymized database reveals
the distribution of the data.
[0009] Therefore, there is a need to preserve privacy of the
respondents, while also protecting an empirical probability
distribution from adversaries.
[0010] In U.S. application Ser. No. 13/032,521, Applicants disclose
a method for processing data by an untrusted third party server.
The server can determine aggregate statistics on the data, and a
client can retrieve the outsourced data exactly. In the process,
individual entries in the database are not revealed to the server
because the data are encoded. The method uses a combination of
error correcting codes, and a randomization response, which enables
responses to be sensitive while maintaining confidentiality of the
responses.
[0011] In U.S. application Ser. No. 13/032,552, Applicants disclose
a method for processing data securely by an untrusted third party.
The method uses a cryptographically secure pseudorandom number
generator that enables client data to be outsourced to an untrusted
server to produce results. The results can include exact aggregate
statistics on the data, and an audit report on the data. In both
cases, the server processes modified data to produce exact results,
while the underlying data and results are not revealed to the
server.
SUMMARY OF THE INVENTION
[0012] The embodiments of the invention provide a method for
statistically analyzing data while preserving privacy of the
data.
[0013] For example, Alice and Bob are mutually untrusting sources
of separate databases containing information related to
respondents. It is desired to sanitize and publish the data to
enable accurate statistical analysis of the data by an authorized
entity, while retaining the privacy of the respondents in the
databases. Furthermore, an adversary must not be able to analyze
the data.
[0014] The embodiments provide a theoretical formulation of privacy
and utility for problems of this type. Privacy of the individual
respondents is formulated using .epsilon.-differential privacy.
Privacy of the statistics on the distributed databases is
formulated using .delta.-distributional .epsilon.-differential
privacy.
[0015] Specifically, aggregate statistics are determined by first
randomizing independently data X and Y to obtain randomized data
{circumflex over (X)} and {circumflex over (Y)}. The first
randomizing preserves a privacy of the data X and Y.
[0016] Then, the randomized data {circumflex over (X)} and
{circumflex over (Y)} are randomized secondly to obtain randomized
data {tilde over (X)} and {tilde over (Y)} for a server, and helper
information T.sub.{tilde over (X)}|{circumflex over (X)} and
T.sub.{tilde over (Y)}|{circumflex over (Y)} for a client, wherein T
represents an empirical distribution, and wherein the randomizing
secondly preserves the privacy of the aggregate statistics of the
data X and Y.
[0017] The server then determines T.sub.{tilde over (X)},{tilde
over (Y)}. Last, the client applies the side information
T.sub.{tilde over (X)}|{circumflex over (X)} and T.sub.{tilde over
(Y)}|{circumflex over (Y)} to T.sub.{tilde over (X)},{tilde over
(Y)} to obtain an estimated {dot over (T)}.sub.X,Y, wherein "|" and
"," between X and Y represent a conditional and joint distribution,
respectively.
BRIEF DESCRIPTION OF THE DRAWING
[0018] FIG. 1 is a flow diagram of a method for securely
determining statistics on private data according to embodiments of
the invention;
[0019] FIG. 2 is a block diagram of private data from two sources
operated on according to embodiments of the invention;
[0020] FIG. 3 is a schematic of a method according to embodiments
of the invention for deriving statistics from the data of FIG. 2 by
a third party without compromising privacy of the data; and
[0021] FIG. 4 is a schematic of an application of the method
according to embodiments of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0022] Method Overview
[0023] As shown in FIG. 1, the embodiments of our invention provide
a method for securely performing statistical analysis on private
data. This means the actual raw data is not revealed to anyone,
other than sources of the data.
[0024] In security, privacy and randomization applications, "weak"
and "strong" are terms of art that are well understood and
documented. Weak means that underlying data (e.g., password, user
identification, etc.) could be recovered with known "cracking"
methods. Strong means that the data are very difficult to recover
given a reasonable amount of time and reasonable computing
resources.
[0025] In addition, randomization means randomizing the data
according to a particular distribution. The term encompasses the
following concepts. First, the data are anonymized to protect
privacy. Second, data are sanitized to reinforce the notion that
the operation serves the purpose of making the data safe for
release.
[0026] Data X 101 and Y 102 are first randomized (RAM1)
independently to obtain randomized data {circumflex over (X)} and
{circumflex over (Y)}, respectively. The randomizations 110 and 115
can be the same or different. In the preferred embodiment, we use a
Post RAndomisation Method (PRAM). The security provided by 110 and
115 is relatively "weak." This means that the identities of data
sources are hidden and individual data privacy is preserved, but
aggregate statistics on the data could perhaps be determined with
some effort.
[0027] The randomized data {circumflex over (X)} and {circumflex
over (Y)} are randomized again (RAM2) to obtain randomized data
{tilde over (X)} and {tilde over (Y)} for a server, and helper
information T.sub.{tilde over (X)}|{circumflex over (X)} and
T.sub.{tilde over (Y)}|{circumflex over (Y)} for a client,
respectively. The second randomizations can be the same as or
different from the first randomizations. In the helper information,
T represents a true empirical distribution.
[0028] In statistics, an empirical distribution is the normalized
histogram of the data. Each of n data points contributes by 1/n to
the empirical distribution. The empirical distribution is
representative of the underlying data. The empirical distribution
is sufficient to determine a large number of different types of
statistics, including mean, median, mode, skewness, quantiles,
and the like.
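As a minimal sketch of this idea (not part of the disclosed method), the following Python computes the empirical distribution of a small categorical sample and then derives a mean and a mode from that distribution alone. The function names and the sample values are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(data):
    """Normalized histogram: each of the n data points contributes 1/n."""
    n = len(data)
    return {value: Fraction(count, n) for value, count in Counter(data).items()}

def mean_from_distribution(dist):
    """Mean computed from the distribution only, without the raw data."""
    return sum(value * prob for value, prob in dist.items())

def mode_from_distribution(dist):
    """Most probable value (ties broken by the smallest value)."""
    return min(dist, key=lambda v: (-dist[v], v))

# Hypothetical sample, chosen only for illustration.
sample = [1, 2, 2, 3, 2, 1, 4, 2]
T = empirical_distribution(sample)
```

Here `T[2]` is 1/2 because four of the eight points equal 2, and the mean recovered from `T` is 17/8, the same value the raw data would give.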
[0029] The security provided by 120 and 125 is relatively "strong."
That is, the privacy of aggregate statistics on the data X and Y is
preserved.
[0030] The server 130 determines T.sub.{tilde over (X)},{tilde over
(Y)} after {tilde over (X)} and {tilde over (Y)} are combined.
[0031] The client can now apply the side information T.sub.{tilde
over (X)}|{circumflex over (X)} and T.sub.{tilde over
(Y)}|{circumflex over (Y)} to T.sub.{tilde over (X)},{tilde over
(Y)} to "undo" the second randomization, and obtain an estimated
{dot over (T)}.sub.X,Y. The estimated distribution of the data X and
Y, indicated by the dot above T, is sufficient to obtain first,
second, etc. order statistics. Although the client can determine
statistics, the client cannot recover the exact data X and Y because
of the weak security.
[0032] Method Details
[0033] For ease of this description as shown in FIG. 2, we present
our problem formulation and results with two data sources Alice and
Bob. However, our method can easily be generalized to more than two
sources. Other levels of security with fewer or more
randomizations can also be used.
[0034] Alice and Bob independently sanitize 210 data 201-202 to
protect the privacy of respondents 205. As used herein, sanitized
means that it is not possible to recover exact private information
from the sanitized data. A number of techniques are known for
sanitizing data, e.g., adding random noise.
[0035] The sanitized data 211-212 are combined 220 into a database
230 at a "cloud" server. The server can be connected to a public
network (Internet). These are the data available for statistical
analysis by an authorized user of a client.
[0036] As shown in FIG. 3, Alice and Bob store the sanitized data
at the server to facilitate transmission and computation
required on these potentially large databases. An entrusted
authorized client 301 can now perform statistical analysis on the
data with the assistance of low-rate helper-information 303. The
helper information is low-rate in that it is relatively small in
comparison to the original database and/or the randomized data. The
helper information 303 allows the authorized client to essentially
undo the second randomization.
[0037] The analysis is subject to the following requirements. The
private data of the sources should not be revealed to the server or
the client. The statistics of the data provided by Alice and Bob
should not be revealed to the server. The client should be able to
determine joint, marginal and conditional distributions of the data
provided by Alice and Bob. The distributions are sufficient to
determine first, second, etc. order statistics of the data.
[0038] Problem Framework and Notation
[0039] Alice's data are modeled as a sequence of random variables
X:=(X.sub.1,X.sub.2, . . . , X.sub.n), with each variable X.sub.i
taking values from a finite-alphabet X. Likewise, Bob's data are
modeled as a sequence of random variables Y:=(Y.sub.1,Y.sub.2, . .
. , Y.sub.n), with each Y.sub.i taking values from the
finite-alphabet Y. The length of the sequences, n, represents the
total number of respondents in the database, and each
(X.sub.i,Y.sub.i) pair represents the data of the respondent i
collectively held by Alice and Bob, with the alphabet X.times.Y
representing the domain of each respondent's data.
[0040] The data pairs (X.sub.i,Y.sub.i) are independently and
identically distributed (i.i.d.) according to a joint distribution
P.sub.X,Y over X.times.Y, such that for
$$x := (x_1, \ldots, x_n) \in X^n \quad \text{and} \quad y := (y_1, \ldots, y_n) \in Y^n,$$
$$P_{X,Y}(x, y) = \prod_{i=1}^{n} P_{X,Y}(x_i, y_i).$$
[0041] A privacy mechanism randomly maps 310 input to output, M:
I.fwdarw.O, according to a conditional distribution P.sub.O|I. A
Post RAndomisation Method (PRAM) is a class of privacy mechanisms
where the input and output are both sequences, i.e., I=O=D.sup.n
for an alphabet D, and each element of the input sequence is
randomized independently and identically according to an
element-wise conditional distribution.
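A PRAM mechanism of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: the binary alphabet, the transition probabilities, and the function name `pram` are hypothetical and are not taken from the application.

```python
import random

def pram(sequence, transition, rng):
    """Apply a PRAM mechanism: each element is replaced independently,
    drawing the output symbol from transition[input_symbol], a dict that
    maps output symbols to probabilities P(output | input)."""
    out = []
    for symbol in sequence:
        outputs = list(transition[symbol].keys())
        weights = list(transition[symbol].values())
        out.append(rng.choices(outputs, weights=weights, k=1)[0])
    return out

# Hypothetical binary alphabet D = {0, 1}; keep a symbol with probability 0.8.
P = {0: {0: 0.8, 1: 0.2},
     1: {0: 0.2, 1: 0.8}}

rng = random.Random(0)          # fixed seed so the sketch is repeatable
x = [0, 1, 1, 0, 1, 0, 0, 1]
x_randomized = pram(x, P, rng)
```

The output has the same length and alphabet as the input, which is what distinguishes PRAM from output-perturbation mechanisms that release only a noisy function of the data.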
[0042] Alice and Bob each independently apply PRAM to their data as
R.sub.A:X.sup.n.fwdarw.X.sup.n and R.sub.B:Y.sup.n.fwdarw.Y.sup.n.
The respective outputs are
{tilde over (X)}:=({tilde over (X)}.sub.1, . . . , {tilde over
(X)}.sub.n):=R.sub.A(X)
and
{tilde over (Y)}:=({tilde over (Y)}.sub.1, . . . , {tilde over
(Y)}.sub.n):=R.sub.B(Y),
and the governing distributions are
[0043] P.sub.{tilde over (X)}|X and P.sub.{tilde over (Y)}|Y,
so we have that
$$P_{\tilde{X},\tilde{Y}|X,Y}(\tilde{x},\tilde{y}|x,y) = P_{\tilde{X}|X}(\tilde{x}|x)\,P_{\tilde{Y}|Y}(\tilde{y}|y) = \prod_{i=1}^{n} P_{\tilde{X}|X}(\tilde{x}_i|x_i)\,P_{\tilde{Y}|Y}(\tilde{y}_i|y_i).$$
[0044] We also use R.sub.AB:
X.sup.n.times.Y.sup.n.fwdarw.X.sup.n.times.Y.sup.n, defined by
R.sub.AB(X, Y):=({tilde over (X)}, {tilde over (Y)}):=(R.sub.A(X),
R.sub.B(Y))
to denote a mechanism that arises from a concatenation of each
individual mechanism. R.sub.AB is also a PRAM mechanism and is
governed by the conditional distribution P.sub.{tilde over
(X)}|XP.sub.{tilde over (Y)}|Y.
[0045] Type Notation
[0046] The type or empirical distribution of the sequence of the
random variables X=(X.sub.1, . . . , X.sub.n) is the mapping
T.sub.X:X.fwdarw.[0,1] defined by
$$T_X(x) := \frac{|\{i : X_i = x\}|}{n}, \quad \forall x \in X.$$
[0047] A joint type of two sequences X=(X.sub.1, . . . , X.sub.n)
and Y=(Y.sub.1, . . . , Y.sub.n) is the mapping
T.sub.X,Y:X.times.Y.fwdarw.[0,1] defined by
$$T_{X,Y}(x,y) := \frac{|\{i : (X_i, Y_i) = (x,y)\}|}{n}, \quad \forall (x,y) \in X \times Y.$$
[0048] A conditional type of a sequence Y=(Y.sub.1, . . . ,
Y.sub.n) given another X=(X.sub.1, . . . , X.sub.n) is the mapping
T.sub.Y|X:Y.times.X.fwdarw.[0,1] defined by
$$T_{Y|X}(y|x) := \frac{T_{Y,X}(y,x)}{T_X(x)} = \frac{|\{i : (Y_i, X_i) = (y,x)\}|}{|\{i : X_i = x\}|}.$$
[0049] The conditional distribution is the joint distribution
divided by the marginal distribution.
[0050] Values of these type mappings are determined, given the
underlying sequences, and are random when the sequences are
random.
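The three type mappings can be illustrated with a short Python sketch. The sequences below are made-up values for six hypothetical respondents, and the function names are assumptions chosen for readability.

```python
from collections import Counter
from fractions import Fraction

def marginal_type(xs):
    """T_X(x): fraction of positions i with X_i = x."""
    n = len(xs)
    return {x: Fraction(c, n) for x, c in Counter(xs).items()}

def joint_type(xs, ys):
    """T_{X,Y}(x, y): fraction of positions i with (X_i, Y_i) = (x, y)."""
    n = len(xs)
    return {pair: Fraction(c, n) for pair, c in Counter(zip(xs, ys)).items()}

def conditional_type(ys, xs):
    """T_{Y|X}(y|x) = T_{Y,X}(y, x) / T_X(x), keyed by (y, x)."""
    t_x = marginal_type(xs)
    t_yx = joint_type(ys, xs)
    return {(y, x): p / t_x[x] for (y, x), p in t_yx.items()}

# Hypothetical sequences for n = 6 respondents.
X = ['a', 'a', 'b', 'b', 'b', 'a']
Y = [0, 1, 1, 1, 0, 0]

T_X = marginal_type(X)
T_XY = joint_type(X, Y)
T_Y_given_X = conditional_type(Y, X)
```

For these sequences, `T_XY[('a', 0)]` is 1/3 and `T_X['a']` is 1/2, so `T_Y_given_X[(0, 'a')]` is 2/3, which is the joint type divided by the marginal type.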
[0051] Matrix Notation for Distributions and Types
[0052] The various distributions, and types of finite-alphabet
random variables can be represented as vectors or matrices. By
fixing a consistent ordering on their finite domains, these
mappings can be vectors or matrices indexed by their domains. The
distribution P.sub.X:X.fwdarw.[0,1] can be written as an
|X|.times.1 column-vector P.sub.X, whose x.sup.th element, for x
.di-elect cons. X, is given by P.sub.X[x]:=P.sub.X(x).
[0053] A conditional distribution P.sub.Y|X:Y.times.X.fwdarw.[0,1]
can be written as a |Y|.times.|X| matrix P.sub.Y|X, defined by
P.sub.Y|X[y,x]:=P.sub.Y|X(y|x). A joint distribution
P.sub.X,Y:X.times.Y.fwdarw.[0,1] can be written as a |X|.times.|Y|
matrix P.sub.X,Y, defined by P.sub.X,Y[x,y]:=P.sub.X,Y(x,y), or as
a |X||Y|.times.1 column-vector P.sub.X,Y, formed by
stacking the columns of P.sub.X,Y.
[0054] We can similarly develop the matrix notation for types, with
T.sub.X, T.sub.Y|X, T.sub.X,Y (in matrix form) and T.sub.X,Y (in
stacked column-vector form) similarly defined for sequences X and Y
with respect to the corresponding type mappings. These type vectors
or matrices are random quantities.
[0055] Privacy and Utility Conditions
[0056] We now formulate the privacy and utility requirements for
this problem of computing statistics on independently sanitized
data. According to the privacy requirements described above, the
formulation considers privacy of the respondents, privacy of the
distribution, and finally the utility for the client.
[0057] Privacy of the Respondents
[0058] The data related to a respondent must be kept private from
all other parties, including any authorized, and perhaps untrusted
clients. We formalize this notion using .epsilon.-differential
privacy for the respondents.
[0059] Definition: For .epsilon..gtoreq.0, a randomized mechanism
M:D.sup.n.fwdarw.O gives .epsilon.-differential privacy if for all
data sets d,d'.di-elect cons.D.sup.n within Hamming distance
d.sub.H(d,d').ltoreq.1, and all S.OR right.O,
Pr[M(d).di-elect cons.S].ltoreq.e.sup..epsilon.Pr[M(d').di-elect
cons.S].
[0060] Under the assumption that the respondents are sampled
i.i.d., a privacy mechanism that satisfies DP results in a strong
privacy guarantee. Adversaries with knowledge of all respondents
except one, cannot discover the data of the sole missing
respondent. This notion of privacy is rigorous and widely accepted,
and satisfies privacy axioms.
[0061] Privacy of the Distribution
[0062] Alice and Bob do not want to reveal the statistics of the
data to adversaries, or to the server. Hence, the sources and
server must ensure that the empirical distribution, i.e., the
marginal and joint types cannot be recovered from {tilde over (X)}
and {tilde over (Y)}. As described above, .epsilon.-DP cannot be
used to characterize privacy in this case. To formulate a privacy
notion for the empirical probability distribution, we extend
.epsilon.-differential privacy as follows.
[0063] Definition: (.delta.-distributional .epsilon.-differential
privacy) Let d(·,·) be a distance metric on the space of
distributions. For .epsilon.,.delta..gtoreq.0, a randomized
mechanism M:D.sup.n.fwdarw.O gives .delta.-distributional
.epsilon.-differential privacy if for all data sets d,d'.di-elect
cons.D.sup.n, with d(T.sub.d, T.sub.d').ltoreq..delta., and all
S.OR right.O,
Pr[M(d).di-elect cons.S].ltoreq.e.sup..epsilon.Pr[M(d').di-elect
cons.S].
[0064] A larger .delta. and a smaller .epsilon. provide better
protection of the distribution. Our definition also satisfies
privacy axioms.
[0065] Utility for Authorized Clients
[0066] The authorized client extracts statistics from the
randomized database 230. We model this problem as the
reconstruction of the joint and marginal type functions
T.sub.X,Y(x,y), T.sub.X(x), and T.sub.Y(y), or (equivalently) the
matrices T.sub.X,Y, T.sub.X and T.sub.Y. The server facilitates
this reconstruction by providing computation based on the sanitized
data ({tilde over (X)}, {tilde over (Y)}).
[0067] Alice and Bob provide low-rate, independently generated
helper-information 303. With the server's computation and the
helper-information, the client produces the estimates {dot over
(T)}.sub.X,Y, {dot over (T)}.sub.X, and {dot over (T)}.sub.Y.
[0068] For a distance metric d(·,·) over the space of
distributions, we define the expected utility of the estimates
as
$$\mu_{T_{X,Y}} := E[-d(\dot{T}_{X,Y}, T_{X,Y})],$$
$$\mu_{T_X} := E[-d(\dot{T}_X, T_X)], \quad \text{and}$$
$$\mu_{T_Y} := E[-d(\dot{T}_Y, T_Y)].$$
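The utility definition can be made concrete with a small sketch. The application does not fix a particular distance d, so the L1 distance used here is an assumption, as are the function names and the numeric values; the expectation is approximated by averaging over a list of estimates.

```python
def l1_distance(p, q):
    """L1 distance between two distributions given as dicts (assumed choice of d)."""
    support = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

def expected_utility(estimates, true_type, distance=l1_distance):
    """Average of -d(estimate, true_type) over repeated estimates."""
    return sum(-distance(est, true_type) for est in estimates) / len(estimates)

# Hypothetical true type and two hypothetical estimates of it.
T_X = {0: 0.5, 1: 0.5}
estimates = [{0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}]
mu = expected_utility(estimates, T_X)
```

The utility is never positive and equals zero only when every estimate matches the true type; for the two estimates above the distances are 0 and 0.5, so `mu` is -0.25.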
[0069] Analysis of Privacy Requirements
[0070] The privacy protection of the marginal types of the database
implies privacy protection for the joint type because the distance
function d satisfies a general property shared by common
distribution distance measures.
[0071] Lemma 1: Let d(·,·) be a distance function such that
$$d(T_{X,Y}, T_{X',Y'}) \geq \max\big(d(T_X, T_{X'}),\, d(T_Y, T_{Y'})\big). \quad (1)$$
[0072] Let M.sub.AB be the privacy mechanism defined by
M.sub.AB(X,Y):=(M.sub.A(X), M.sub.B(Y)). If M.sub.A satisfies
.delta.-distributional .epsilon..sub.1-differential privacy and
M.sub.B satisfies .delta.-distributional
.epsilon..sub.2-differential privacy, then M.sub.AB satisfies
.delta.-distributional
(.epsilon..sub.1+.epsilon..sub.2)-differential privacy.
[0073] If vertically partitioned data are sanitized independently
and we want to recover joint distribution from the sanitized table,
the choice of privacy mechanisms is restricted to the class of PRAM
procedures. We analyze the constraints that should be placed on the
PRAM algorithms so that they satisfy the privacy constraints.
First, consider the privacy requirement of the respondents in Alice
and Bob's databases.
[0074] Lemma 2: Let R: X.sup.n.fwdarw.X.sup.n be a PRAM mechanism
governed by conditional distribution P.sub.{tilde over (X)}|X. R
satisfies .epsilon.-DP if
$$\epsilon = \max_{x_1, x_2, \tilde{x} \in X} \ln\big(P_{\tilde{X}|X}(\tilde{x}|x_1)\big) - \ln\big(P_{\tilde{X}|X}(\tilde{x}|x_2)\big). \quad (2)$$
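Equation (2) is straightforward to evaluate for a given PRAM matrix. The sketch below is an illustration only: the function name `pram_epsilon` and the example matrix are assumptions, and the matrix is stored as `P[output][input]` to match the matrix notation introduced earlier.

```python
import math

def pram_epsilon(P):
    """Value of epsilon from equation (2) for a PRAM matrix P, where
    P[out][inp] = P(out | inp) and each column sums to 1."""
    eps = 0.0
    for row in P:                      # one output symbol at a time
        if min(row) == 0.0 and max(row) > 0.0:
            return math.inf            # this output rules out some input
        if max(row) > 0.0:
            eps = max(eps, math.log(max(row)) - math.log(min(row)))
    return eps

# Hypothetical binary PRAM matrix: keep the symbol with probability 0.75.
P = [[0.75, 0.25],
     [0.25, 0.75]]
eps = pram_epsilon(P)
```

For this matrix the result is ln 3, about 1.10. A matrix with all entries equal gives 0 (the output is independent of the input), and a matrix containing a zero next to a nonzero entry in the same row gives an unbounded value.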
[0075] Lemma 3: Define M.sub.AB(x,y)=(M.sub.A(x), M.sub.B(y)). If
M.sub.A satisfies .epsilon..sub.1-DP and M.sub.B satisfies
.epsilon..sub.2-DP, then M.sub.AB satisfies
(.epsilon..sub.1+.epsilon..sub.2)-DP.
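The reasoning behind Lemma 3 can be sketched as follows. The sketch assumes discrete outputs, independent randomness in M.sub.A and M.sub.B, and adjacent databases (x,y) and (x',y') that differ in the data of a single respondent, so that x,x' and y,y' are each within Hamming distance 1. For any output pair (a,b):

```latex
\begin{aligned}
\Pr[M_{AB}(x,y) = (a,b)]
  &= \Pr[M_A(x) = a]\,\Pr[M_B(y) = b] \\
  &\le e^{\epsilon_1}\Pr[M_A(x') = a]\; e^{\epsilon_2}\Pr[M_B(y') = b] \\
  &= e^{\epsilon_1+\epsilon_2}\,\Pr[M_{AB}(x',y') = (a,b)].
\end{aligned}
```

Summing both sides over all (a,b) in a set S gives the same bound for Pr[M.sub.AB(x,y).di-elect cons.S].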
[0076] The lemma can be extended to k sources: if the i.sup.th
source's sanitized data satisfy .epsilon..sub.i-DP, then the
joint system provides (.SIGMA..sub.i=1.sup.k.epsilon..sub.i)-DP.
Next, we consider the privacy requirement for the joint and
marginal types.
[0077] Lemma 4: Let d(·,·) be the distance metric on the space of
distributions. Let R: X.sup.n.fwdarw.X.sup.n be a PRAM mechanism
governed by conditional distribution P.sub.{tilde over (X)}|X.
[0078] Necessary Condition: If R satisfies .delta.-distributional
.epsilon.-DP, then R must satisfy $\frac{\epsilon}{n\delta/2}$-DP
for the respondents.
[0079] Sufficient Condition: If R satisfies $\frac{\epsilon}{n}$-DP
for the respondents, then R satisfies .delta.-distributional
.epsilon.-DP.
[0080] Example Implementation
[0081] We now describe an example realization of the system
framework given above, where the privacy mechanisms are selected to
satisfy our privacy and utility requirements. The key requirements
of this system can be summarized as follows:
[0082] (I). R.sub.AB is a .delta.-distributional
.epsilon.-differentially private mechanism;
[0083] (II). Helper information is generated by an .epsilon.-DP
algorithm; and
[0084] (III). R.sub.A and R.sub.B are PRAM mechanisms.
[0085] Because the sanitized data are generated by a
.delta.-distributional .epsilon.-differentially private mechanism,
helper information is necessary to accurately estimate the marginal
and joint type. To generate outputs that preserve different levels
of privacy, the sources use a multilevel privacy approach.
[0086] As shown in FIG. 4, the databases are sanitized by a
two-pass randomization process 410, see FIG. 1. The first pass
R.sub.AB,1 takes the raw source data X,Y as input and guarantees
the respondent privacy, while the second pass R.sub.AB,2 takes the
sanitized output ({circumflex over (X)}, {circumflex over (Y)}) of
the first pass as input and guarantees distributional privacy. The
helper information 303 is extracted during the second pass to
preserve respondent privacy. The mechanisms are constructed with the
following constraints:
(i). R.sub.A,2 and R.sub.B,2 are $\frac{\epsilon}{2n}$-DP;
(ii). R.sub.A,1 and R.sub.B,1 are $\frac{\epsilon}{2}$-DP; and
[0087] (iii). R.sub.A,1, R.sub.A,2, R.sub.B,1 and R.sub.B,2 are
PRAM mechanisms.
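To give a feel for these constraints, the sketch below reads constraints (i) and (ii) as per-source budgets of ε/(2n) for the second pass and ε/2 for the first pass, and turns each budget into a concrete mechanism. The binary alphabet and the symmetric PRAM matrix are assumptions made only for this illustration: for such a matrix, a keep-probability of e^ε/(1+e^ε) gives a log-ratio of exactly ε in the sense of Lemma 2. All names and numbers are hypothetical.

```python
import math

def keep_probability(eps):
    """Keep-probability of a symmetric binary PRAM whose log-ratio is eps."""
    return math.exp(eps) / (1.0 + math.exp(eps))

def two_pass_budgets(eps_total, n):
    """Per-source budgets: first pass eps/2, second pass eps/(2n)."""
    return eps_total / 2.0, eps_total / (2.0 * n)

eps_total, n = 1.0, 1000          # hypothetical privacy budget and database size
eps_first, eps_second = two_pass_budgets(eps_total, n)
p_first = keep_probability(eps_first)     # roughly 0.62
p_second = keep_probability(eps_second)   # barely above 0.5
```

Under these assumptions the second pass is very close to a fair coin flip per element, so the type of the server's data alone says little about the original distribution. This is consistent with the statement that helper information is needed for accurate estimation.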
[0088] By Lemma 3, constraint (ii) implies R.sub.AB,1 is
.epsilon.-DP and hence implies requirement (II). Note that
R.sub.A(X) can be viewed as R.sub.A,2(R.sub.A,1(X)) and is
governed by the conditional distribution (in matrix notation)
P.sub.{tilde over (X)}|X=P.sub.{tilde over (X)}|{circumflex over
(X)}P.sub.{circumflex over (X)}|X.
[0089] Hence, constraint (iii) implies that requirement (III) is
satisfied. By Lemmas 1 and 4, constraint (i) implies that
requirement (I) is satisfied. Now, all the privacy requirements are
satisfied. In the following, we describe how the client can
determine the estimated types.
[0090] Recall that without the helper information, the client
cannot accurately estimate exact types due to requirement (I). In
this example, the helper information includes the conditional types
T.sub.{tilde over (X)}|{circumflex over (X)} and T.sub.{tilde over
(Y)}|{circumflex over (Y)} determined during the second pass. An
unbiased estimate of T.sub.X determined from {tilde over (X)} is
given by P.sub.{tilde over (X)}|X.sup.-1T.sub.{tilde over (X)}, and
the exact types can be recovered by T.sub.{tilde over
(X)}|X.sup.-1T.sub.{tilde over (X)}. Thus, we have the following
identities and estimators:
$$T_{\hat{X}} = T_{\tilde{X}|\hat{X}}^{-1}\, T_{\tilde{X}}, \quad (4)$$
$$\dot{T}_X = P_{\hat{X}|X}^{-1}\, T_{\hat{X}} = P_{\hat{X}|X}^{-1}\, T_{\tilde{X}|\hat{X}}^{-1}\, T_{\tilde{X}},$$
$$T_{\hat{Y}} = T_{\tilde{Y}|\hat{Y}}^{-1}\, T_{\tilde{Y}}, \quad (5)$$
$$\dot{T}_Y = P_{\hat{Y}|Y}^{-1}\, T_{\hat{Y}} = P_{\hat{Y}|Y}^{-1}\, T_{\tilde{Y}|\hat{Y}}^{-1}\, T_{\tilde{Y}}.$$
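The marginal identities and estimators can be checked numerically. The following is a minimal sketch for one source with a binary alphabet and symmetric PRAM passes; the keep-probabilities, sequence length, seed, and all function names are assumptions. It builds the helper information as the conditional type of the second pass, recovers the type of the first-pass output exactly, and then estimates the original type by inverting the first-pass matrix.

```python
import random
from fractions import Fraction

def type_vector(seq):
    """Empirical distribution of a binary sequence as [T(0), T(1)]."""
    n = len(seq)
    return [Fraction(seq.count(s), n) for s in (0, 1)]

def conditional_type_matrix(out_seq, in_seq):
    """2x2 matrix M[o][i] = T_{out|in}(o|i), counted from paired sequences."""
    M = [[Fraction(0), Fraction(0)], [Fraction(0), Fraction(0)]]
    for i in (0, 1):
        total = in_seq.count(i)
        for o in (0, 1):
            hits = sum(1 for a, b in zip(out_seq, in_seq) if (a, b) == (o, i))
            M[o][i] = Fraction(hits, total)
    return M

def solve2(M, v):
    """Return M^{-1} v for a 2x2 matrix M (ZeroDivisionError if singular)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * v[0] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]

def flip(seq, keep, rng):
    """Symmetric binary PRAM: keep each bit with probability `keep`."""
    return [s if rng.random() < keep else 1 - s for s in seq]

rng = random.Random(7)
n = 2000
keep1, keep2 = Fraction(3, 4), Fraction(3, 5)    # hypothetical pass parameters
X = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
X_hat = flip(X, keep1, rng)        # first pass
X_tilde = flip(X_hat, keep2, rng)  # second pass, sent to the server

helper = conditional_type_matrix(X_tilde, X_hat)   # helper information for the client
T_hat = solve2(helper, type_vector(X_tilde))       # identity (4): exact recovery
P1 = [[keep1, 1 - keep1], [1 - keep1, keep1]]      # first-pass PRAM matrix
T_dot = solve2(P1, T_hat)                          # estimate of T_X
```

`T_hat` equals the type of `X_hat` exactly, because the helper matrix is the conditional type of the realized sequences. `T_dot` is only an estimate of the type of `X`, since the first pass is inverted through its governing distribution rather than its realized conditional type.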
[0091] Extending the results to determine the joint type presents
some challenges. The matrix form of the conditional distribution of
the collective mechanism R.sub.AB is given by P.sub.{tilde over
(X)},{tilde over (Y)}|X,Y=P.sub.{tilde over (X)}|X⊗P.sub.{tilde
over (Y)}|Y, where ⊗ is the Kronecker product. An unbiased estimate
of the joint type is given by
$$\begin{aligned}
\dot{T}_{X,Y} &= P_{\tilde{X},\tilde{Y}|X,Y}^{-1}\, T_{\tilde{X},\tilde{Y}} \\
&= \big((P_{\tilde{X}|\hat{X}} P_{\hat{X}|X}) \otimes (P_{\tilde{Y}|\hat{Y}} P_{\hat{Y}|Y})\big)^{-1}\, T_{\tilde{X},\tilde{Y}} \\
&= \big((P_{\tilde{X}|\hat{X}} P_{\hat{X}|X})^{-1} \otimes (P_{\tilde{Y}|\hat{Y}} P_{\hat{Y}|Y})^{-1}\big)\, T_{\tilde{X},\tilde{Y}} \\
&= \big(P_{\hat{X}|X}^{-1} \otimes P_{\hat{Y}|Y}^{-1}\big)\big(P_{\tilde{X}|\hat{X}}^{-1} \otimes P_{\tilde{Y}|\hat{Y}}^{-1}\big)\, T_{\tilde{X},\tilde{Y}} \\
&= \big(P_{\hat{X}|X}^{-1} \otimes P_{\hat{Y}|Y}^{-1}\big)\, \dot{T}_{\hat{X},\hat{Y}}.
\end{aligned}$$
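The Kronecker-product step can be verified on a small example. The sketch below uses hypothetical 2x2 PRAM matrices and a made-up joint type, with exact rational arithmetic. One assumption about conventions: the joint type is stacked row by row, with index x·|Y|+y, so that the factor order matches the product of the X matrix with the Y matrix as written; a column-stacked vector would need the two factors swapped.

```python
from fractions import Fraction as F

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical overall PRAM matrices for Alice and Bob (columns sum to 1).
P_A = [[F(7, 10), F(3, 10)], [F(3, 10), F(7, 10)]]
P_B = [[F(3, 5), F(2, 5)], [F(2, 5), F(3, 5)]]

# Hypothetical joint type, stacked row by row: (x,y) = (0,0), (0,1), (1,0), (1,1).
T_XY = [F(1, 2), F(1, 10), F(1, 10), F(3, 10)]

# Expected joint type of the randomized data under the collective mechanism.
expected_T_tilde = matvec(kron(P_A, P_B), T_XY)
# Inverting each factor separately undoes the collective mechanism.
recovered = matvec(kron(inv2(P_A), inv2(P_B)), expected_T_tilde)
```

Here `recovered` equals `T_XY` exactly, which illustrates why the inverse of the collective matrix never has to be formed directly: inverting the two small matrices and taking their Kronecker product is enough.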
[0092] Effect of the Invention
[0093] The embodiments of the invention provide a method for
statistically analyzing sanitized private data stored at a server
by an authorized, but perhaps untrusted, client in a distributed
environment.
[0094] The client can determine empirical joint statistics on
distributed databases without compromising the privacy of the data
sources. Additionally, a differential privacy guarantee is provided
against unauthorized parties accessing the sanitized data.
[0095] Although the invention has been described by way of examples
of preferred embodiments, it is to be understood that various other
adaptations and modifications can be made within the spirit and
scope of the invention.
[0096] Therefore, it is the object of the appended claims to cover
all such variations and modifications as come within the true spirit
and scope of the invention.
* * * * *