U.S. patent application number 14/151474 was filed with the patent office on 2014-05-08 for automated determination of quasi-identifiers using program analysis.
This patent application is currently assigned to Telcordia Technologies, Inc. The applicant listed for this patent is Telcordia Technologies, Inc. Invention is credited to Hiralal Agrawal, Munir Cochinwala, Joseph R. Horgan.
Application Number: 20140130178 / 14/151474
Family ID: 43032797
Filed Date: 2014-05-08
United States Patent Application 20140130178
Kind Code: A1
Agrawal; Hiralal; et al.
May 8, 2014

Automated Determination of Quasi-Identifiers Using Program Analysis
Abstract
A system and method for automated determination of quasi-identifiers for sensitive data fields in a dataset are provided. In one aspect, the system and method identify quasi-identifier fields in the dataset based upon a static analysis of program statements in a computer program having access to sensitive data fields in the dataset. In another aspect, the system and method identify quasi-identifier fields based upon a dynamic analysis of program statements in a computer program having access to sensitive data fields in the dataset. Once such quasi-identifiers have been identified, the data stored in such fields may be anonymized using techniques such as k-anonymity. As a result, the data in the anonymized quasi-identifier fields cannot be used to infer a value stored in a sensitive data field in the dataset.
Inventors: Agrawal; Hiralal; (Bridgewater, NJ); Cochinwala; Munir; (Basking Ridge, NJ); Horgan; Joseph R.; (Somerville, NJ)
Applicant: Telcordia Technologies, Inc.; Piscataway, NJ, US
Assignee: Telcordia Technologies, Inc.; Piscataway, NJ
Family ID: 43032797
Appl. No.: 14/151474
Filed: January 9, 2014
Related U.S. Patent Documents

Application Number   Filing Date    Patent Number
12771130             Apr 30, 2010   8661423
14151474
Current U.S. Class: 726/26
Current CPC Class: G06F 21/6254 20130101; G06F 21/6209 20130101; G06F 21/566 20130101
Class at Publication: 726/26
International Class: G06F 21/62 20060101 G06F021/62
Claims
1. A method for automatically identifying one or more
quasi-identifier data fields in a dataset, the method comprising:
identifying a program having access to the dataset, the program
including one or more program statements for reading or writing a
value in one or more fields in the dataset; determining a first
output program statement in the program, where the first program
output statement is a program statement for writing a first value
into a sensitive data field in the dataset; determining, with a
processor, a first set of program statements in the program, where
the first set of program statements includes one or more program
statements that contribute to the computation of the first value
written into the sensitive data field; and, analyzing, with the
processor, the first set of program statements, and determining,
based on the analysis of the first set of program statements, one
or more quasi-identifier data fields associated with the sensitive
data field in the dataset.
2. The method of claim 1, further comprising: anonymizing, in the
dataset, data stored in at least one of the quasi-identifier data
fields.
3. The method of claim 1, further comprising anonymizing, in the
dataset, data stored in at least one of the quasi-identifier data
fields using K-anonymity.
4. The method of claim 1, wherein determining the first set of
program statements includes determining the first set of program
statements using static program analysis.
5. The method of claim 1, wherein determining the first set of
program statements includes determining the first set of program
statements using dynamic program analysis.
6.-7. (canceled)
8. The method of claim 1, wherein analyzing, with the processor,
the first set of program statements includes recursively analyzing,
with the processor, the first set of program statements.
9. A system for automatically identifying one or more data fields
in a dataset, the system comprising: a memory storing instructions
and data, the data comprising a set of programs and a dataset
having one or more data fields; a processor to execute the
instructions and to process the data, wherein the instructions
comprise: identifying a program in the set of programs, the program
having one or more program statements for reading or writing a
value in one or more fields in the dataset; determining a first
output program statement in the program, where the first program
output statement is a program statement for writing a first value
into a sensitive data field in the dataset; determining a first set
of program statements in the program, where the first set of
program statements includes one or more program statements that
contribute to the computation of the first value written into the
sensitive data field; and, analyzing the first set of program
statements, and determining, based on the analysis of the first set
of program statements, one or more data fields associated with the
sensitive data field in the dataset.
10. The system of claim 9, wherein the instructions further
comprise: anonymizing, in the dataset, data stored in at least one
of the data fields associated with the sensitive data field.
11. The system of claim 9, wherein the instructions further
comprise: anonymizing, in the dataset, data stored in at least one
of the data fields associated with the sensitive data field using
K-anonymity.
12. The system of claim 9, wherein determining the first set of
program statements includes determining the first set of program
statements using static program analysis.
13. The system of claim 9, wherein determining the first set of
program statements includes determining the first set of program
statements using dynamic program analysis.
14.-15. (canceled)
16. The system of claim 9, wherein analyzing the first set of
program statements includes recursively analyzing the first set of
program statements.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of the filing date of
U.S. Provisional Patent Application No. 61/174,690, filed May 1,
2009, the disclosure of which is hereby incorporated herein by
reference.
FIELD OF INVENTION
[0002] The present invention generally relates to a system and
method for managing data, and more particularly to a system and
method for identifying sensitive data so it can be anonymized in a
manner that increases privacy.
BACKGROUND
[0003] Databases or datasets containing personal information, such
as databases containing healthcare records or mobile subscribers'
location records, are increasingly being used for secondary
purposes, such as medical research, public policy analysis, and
marketing studies. Such use makes it increasingly possible for
third parties to identify individuals associated with the data and
to learn personal, sensitive information about those
individuals.
[0004] Undesirable invasion of an individual's privacy may occur
even after the data has been anonymized, for example, by removing
or masking explicit sensitive fields such as those that contain an
individual's name, social security number, or other such explicit
information that directly identifies a person.
[0005] One way this may occur is, for example, by analyzing less
explicit and so called "quasi-identifier" fields in a dataset. In
this regard, a set of quasi-identifier fields may be any subset of
fields of a given dataset which can either be matched with other,
external datasets to infer the identities of the individuals
involved, or used to determine a value of another sensitive field
in the dataset based upon the values contained in such fields.
[0006] For example, quasi-identifier fields may be data containing
an individual's ZIP code, gender, or date of birth, which, while
not explicit, may be matched with corresponding fields in external,
publicly available datasets such as census data, birth-death
records, and voter registration lists to explicitly identify an
individual. Similarly, it may also be possible to infer values of
otherwise hidden fields containing sensitive information such as,
for example, disease diagnoses, if the values in such hidden,
sensitive fields are dependent upon values of other
quasi-identifier fields in the dataset, such as fields containing
clinical symptoms and/or medications prescribed for example, from
which information in an otherwise hidden field may be independently
determined.
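The linkage risk described above can be made concrete with a short sketch. The records, names, and field values below are hypothetical, invented purely for illustration; the point is that an exact match on a few quasi-identifier fields can re-identify a record from which all explicit identifiers have been removed.

```python
# Hypothetical illustration of a linkage attack: an "anonymized" medical
# dataset (names removed) is joined with a public voter list on the
# quasi-identifier fields ZIP code, gender, and date of birth.

medical = [  # released dataset; explicit identifiers already removed
    {"zip": "07921", "gender": "F", "dob": "1965-03-12", "diagnosis": "HIV"},
    {"zip": "08854", "gender": "M", "dob": "1971-08-02", "diagnosis": "flu"},
]
voters = [  # publicly available dataset that still carries names
    {"name": "Alice Smith", "zip": "07921", "gender": "F", "dob": "1965-03-12"},
    {"name": "Bob Jones", "zip": "08876", "gender": "M", "dob": "1971-08-02"},
]

QUASI_IDS = ("zip", "gender", "dob")

def link(released, public):
    """Re-identify released records whose quasi-identifier values
    match exactly one record in the public dataset."""
    index = {}
    for rec in public:
        index.setdefault(tuple(rec[q] for q in QUASI_IDS), []).append(rec)
    matches = []
    for rec in released:
        candidates = index.get(tuple(rec[q] for q in QUASI_IDS), [])
        if len(candidates) == 1:  # unique match => identity inferred
            matches.append((candidates[0]["name"], rec["diagnosis"]))
    return matches

print(link(medical, voters))  # [('Alice Smith', 'HIV')]
```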
[0007] Typical systems and methods that seek to protect information
contained in a dataset include several shortcomings. For example,
many conventional methods depend upon a central tenet that all
fields that qualify as either explicit or quasi-identifier fields
can be easily identified in a dataset, which is not always the
case. In addition, typical conventional techniques primarily focus
on preventing identities of individuals to be revealed and do not
adequately address the situation where values in other sensitive
fields, such as an HIV diagnosis, may need to be hidden.
Furthermore, conventional techniques that rely upon statistical
analysis or machine learning approaches to determine
quasi-identifiers in a dataset, while useful, are also prone to
producing many false positives (fields are falsely identified as
being quasi-identifiers when they are not) as well as many false
negatives (fields are falsely identified as not being
quasi-identifiers when they are).
[0008] Therefore, improved methods and systems are desired for
identifying and anonymizing quasi-identifiers fields in a data set
whose values may be used to infer the values in other sensitive
fields.
SUMMARY OF THE INVENTION
[0009] In one aspect, a method for identifying quasi-identifier
data fields in a dataset is provided. The method includes
identifying a program having access to the dataset, the program
including one or more program statements for reading or writing a
value in one or more fields in the dataset; determining a first
output program statement in the program, where the first program
output statement is a program statement for writing a first value
into a sensitive data field in the dataset; determining, with a
processor, a first set of program statements in the program, where
the first set of program statements includes one or more program
statements that contribute to the computation of the first value
written into the sensitive data field; and, analyzing, with the
processor, the first set of program statements, and determining,
based on the analysis of the first set of program statements, one
or more quasi-identifier data fields associated with the sensitive
data field in the dataset.
[0010] In another aspect, a system for identifying data fields in a
dataset is provided, where the system includes a memory storing
instructions and data, and a processor for executing the
instructions and processing the data. The data includes a set of
programs and a dataset having one or more data fields, and the
instructions include identifying a program in the set of programs,
the program having one or more program statements for reading or
writing a value in one or more fields in the dataset; determining a
first output program statement in the program, where the first
program output statement is a program statement for writing a first
value into a sensitive data field in the dataset; determining a
first set of program statements in the program, where the first set
of program statements includes one or more program statements that
contribute to the computation of the first value written into the
sensitive data field; and, analyzing the first set of program
statements, and determining, based on the analysis of the first set
of program statements, one or more data fields associated with the
sensitive data field in the dataset.
DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 illustrates a system in accordance with an aspect of
the invention.
[0012] FIG. 2 illustrates a sample dataset in accordance with one
aspect of the invention.
[0013] FIG. 3 illustrates an example of a pseudo-code program in
accordance with one aspect of the invention.
[0014] FIG. 4 illustrates an example of the operation of the system in FIG. 1 in accordance with an aspect of the system and method.
[0015] FIG. 5 illustrates a flow diagram in accordance with various
aspects of the invention.
[0016] FIG. 6 illustrates a block diagram of a computing system in
accordance with various aspects of the invention.
DETAILED DESCRIPTION
[0017] A system and method for automated determination of
quasi-identifier fields for one or more sensitive fields in a
given dataset are provided. Instead of identifying
quasi-identifiers based on the conventional approach of analyzing
the contents of a given dataset, the system and method disclosed
herein identifies quasi-identifier fields based upon an analysis of
computer programs that are used to create and manipulate the
dataset. Once such quasi-identifier fields have been found, they
may be anonymized using existing anonymization techniques such as
k-anonymity or L-diversity. As a result, the anonymized
quasi-identifiers cannot be used to identify an individual
associated with the data in the dataset, or to infer a value of one
or more other sensitive fields contained in the dataset.
[0018] FIG. 1 illustrates a system 10 in accordance with various
aspects of the invention disclosed herein. System 10 includes a
database 12, one or more programs 14, an analysis module 16, and a
list of quasi-identifier fields 18.
[0019] Database 12 may include one or more datasets 20, which may
contain a set of data, including sensitive data that may need to be
anonymized before the data set is provided to an external third
party for further use, e.g., research or analysis. Dataset 20 may
be organized by conventional columns and rows, where each column
may be a field such as name, age, address, medical symptom, medical
diagnosis, etc. Likewise, each row may be a record that includes
related data in one or more fields that is associated with an
individual. While the system and method disclosed herein is
advantageous when the data set is organized by fields and records,
it will be appreciated that the invention is not limited to any
particular organization of data, and is equally applicable to any
set of data that includes sensitive data that needs to be
protected.
[0020] Program 14 may be one or more computer programs containing
program statements that include instructions, executable by a
processor, for reading, writing, storing, modifying, or otherwise
manipulating the data contained in dataset 20. Program statements
in program 14 may include program instructions written in any
programming language, such as Java, C, C++, C#, Javascript, SQL,
Visual Basic, Perl, PHP, pseudo-code, assembly, machine language or
any combination of languages. Further, it will be understood that
the system and invention disclosed herein is not limited to any
particular type of program or programming language.
[0021] Analysis module 16 in system 10 may be implemented in
hardware, software, or a combination of both. In one aspect,
analysis module 16 may itself be a software program, executable by
a processor, where the analysis module 16 has access to both
program 14 and database 12 and contains instructions for analyzing
program 14 for identifying quasi-identifier fields for one or more
sensitive fields in dataset 20. Alternatively, the functionality of
analysis module 16 may also be implemented in hardware, such as on
a custom application specific integrated circuit ("ASIC").
[0022] Upon identification of the quasi-identifiers 18 based upon
an analysis of program 14, analysis module may also contain
instructions for anonymizing data in dataset 20 associated with
such quasi-identifier fields using one or more anonymization
techniques such as, for example, k-anonymity or other conventional
anonymization techniques. Thus, quasi-identifiers 18 may be
considered a list of quasi-identifier fields determined by
analysis module 16 to contain information based upon which values
in other sensitive fields in dataset 20 may be ascertained.
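The anonymization step mentioned above can be sketched minimally. Full k-anonymity uses generalization hierarchies; the sketch below substitutes simple suppression (replacing rare quasi-identifier combinations with "*") so that no combination of quasi-identifier values isolates fewer than k records. The field names and rows are hypothetical.

```python
from collections import Counter

def k_anonymize(records, quasi_ids, k=2):
    """Suppress quasi-identifier values whose combination occurs fewer
    than k times, a simplified stand-in for k-anonymity generalization."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    out = []
    for r in records:
        r = dict(r)
        if counts[tuple(r[q] for q in quasi_ids)] < k:
            for q in quasi_ids:
                r[q] = "*"  # suppress the rare combination
        out.append(r)
    return out

rows = [
    {"f2": True, "f3": True, "d3": True},
    {"f2": True, "f3": True, "d3": False},
    {"f2": False, "f3": True, "d3": True},  # unique combination: suppressed
]
print(k_anonymize(rows, ("f2", "f3"), k=2))
```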
[0023] Operation of the system and method in accordance with one
aspect is described below. FIG. 2 shows a database 212 containing a
dataset 220 that includes data organized in a generic row and
column format. In one embodiment, the data contained in dataset 220
may be a collection of medical records of patients treated in a
hospital. In accordance with this embodiment, each row 222 in
dataset 220 may contain a record of related medical data associated
with a particular patient, including, for example, one or more
medical symptoms (factors) and medical diagnoses, where the medical diagnoses are determined based on the medical symptoms associated with the patient.
[0024] Each medical symptom or medical factor used to determine a
diagnosis may be represented by a different field in the dataset.
For example, each of the fields f1, f2, f3, f4, f5, f6, f7 and f8
may respectively represent factors such as takes_heart_medicine,
has_chest_pain, has_diabetes, exercises_regularly,
has_trouble_breathing, has_high_fat_diet, has_high_cholesterol,
takes_tranquilizers, etc. Thus, each symptom field may contain a
Boolean, true or false value (not shown), that indicates whether
the particular medical factor represented by the field applies to
the patient.
[0025] Likewise, each medical diagnosis may also be represented by
a different field in the dataset. For example, fields d1, d2, d3,
d4, d5 and d6 may also contain Boolean, true or false data (not
shown), where each field respectively represents whether the
patient has been diagnosed with a medical diagnosis such as
has_heart_disease, has_heart_burn, has_heart_murmur, has_COPD,
needs_surgery, etc. based on the one or more factors associated
with the patient.
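For concreteness, one record of dataset 220 as just described might be represented as the following mapping. The values shown are arbitrary, hypothetical illustrations.

```python
# One row of dataset 220: Boolean factor fields f1-f8 and Boolean
# diagnosis fields d1-d6 (all values hypothetical).
patient_record = {
    # factors / symptoms
    "f1": True,   # takes_heart_medicine
    "f2": False,  # has_chest_pain
    "f3": True,   # has_diabetes
    "f4": False,  # exercises_regularly
    "f5": True,   # has_trouble_breathing
    "f6": False,  # has_high_fat_diet
    "f7": False,  # has_high_cholesterol
    "f8": True,   # takes_tranquilizers
    # diagnoses (see text for the illustrative diagnosis names)
    "d1": False, "d2": False, "d3": True,
    "d4": True, "d5": False, "d6": False,
}
```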
[0026] While particular types of fields containing Boolean or true
or false data have been described for simplicity and ease of
understanding, it will be understood that the system and method
disclosed herein is not limited to any type of field or data, and
is equally applicable to alpha-numeric data, image data, sound
data, audio/visual data, documents, or any other compressed or
uncompressed data that may be accessed and processed by a
computer.
[0027] The values contained in one or more diagnoses fields d1-d6
in dataset 220 may be considered sensitive fields that need to be
masked or anonymized for protecting the privacy of a patient before
providing other data in dataset 220 to an external third party for
marketing, research, or other purposes. For example, dataset 220
may contain many other fields (not shown) such as the patient's
name, address, date of birth, social security information,
insurance information, etc., which may need to be provided to an
insurance company to determine how subscribers of its plans are
using the provisions provided by the company. In such cases,
instead of hiding or protecting information related to an
individual's identity, it may be desirable to instead protect
specific medical conditions or diagnoses associated with the
patient. In addition to masking or anonymizing the values in
explicitly sensitive fields (such as a patient's diagnoses), it is
also desirable to be able to identify and anonymize the values in
other, quasi-identifier fields, which may otherwise be used by a
knowledgeable third party to infer the values contained in the
sensitive fields. Thus, in one aspect, the analysis module may
identify a list of quasi-identifier fields associated with the
sensitive fields d1-d6 by analyzing one or more programs that are
used to create or modify the data contained in dataset 220.
[0028] In computer programs, program slicing can be used to
automatically identify parts of a program that affect the value of
a given variable at a given point in that program. A computer
program may compute many values during its execution. Program
slicing can be used to determine the specific program statements
that contribute towards the computation of specific values in that
program. In one aspect, the analysis module may analyze one or more
program slices in a program to automatically determine
quasi-identifier fields for one or more sensitive fields in a
dataset.
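Computing a backward slice amounts to a reachability problem over a dependency graph. The sketch below uses hypothetical statement labels (loosely modeled on the d3 example discussed later) mapped to the statements they directly depend on, and collects everything reachable from the slicing criterion.

```python
# A minimal backward program slice, computed as reachability over a
# statement-level dependency graph. Each statement maps to the set of
# statements it directly depends on (data or control dependence).
# Statement labels and dependencies are hypothetical.

deps = {
    "s6_assign_d3": {"s5_cond_f3_f5"},
    "s13_assign_d3": {"s12_cond_d2"},
    "s12_cond_d2": {"s4_assign_d2"},
    "s4_assign_d2": {"s3_cond_f2_f3_f4"},
    "s5_cond_f3_f5": set(),
    "s3_cond_f2_f3_f4": set(),
}

def backward_slice(criteria):
    """All statements that directly or indirectly contribute to the
    statements in `criteria` (the slicing criterion), criteria included."""
    sliced, work = set(), list(criteria)
    while work:
        stmt = work.pop()
        if stmt not in sliced:
            sliced.add(stmt)
            work.extend(deps.get(stmt, ()))
    return sliced

# Slice on every statement that writes sensitive field d3:
print(sorted(backward_slice({"s6_assign_d3", "s13_assign_d3"})))
```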
[0029] When a dataset is manipulated by a program, its fields may
be considered as input or output fields (or both) from the
perspective of that program. For each output field that is believed
to contain sensitive data (i.e., a sensitive field), the analysis
module may determine the corresponding program slice (a set of program statements), which yields all statements that directly or indirectly contribute towards computation of its value. The
analysis module may then identify or extract quasi-identifiers
associated with that output field from the program slice. If the
output field is deemed to be a sensitive field for privacy reasons,
then not only should the data in that field be masked, but one or
more of the identified quasi-identifier fields may also be masked
or anonymized. Otherwise, there is a risk that the value of the
data in the sensitive field may be inferred based on the values of
the quasi-identifier fields.
[0030] There are two aspects to computing and analyzing a program
slice, referred to here as static analysis and dynamic analysis, by
which the analysis module 16 may automatically determine a set of
quasi-identifiers for a sensitive data field in a given dataset.
The operation of the analysis module in accordance with each aspect
is explained below with reference to an exemplary pseudo-code
program having access to dataset 220.
FIG. 3 shows exemplary pseudo-code logic of a program 314
having access to dataset 220 that may be analyzed by the analysis
module 16 using either static or dynamic program analysis. As seen
therein, program 314 may contain instructions that may access
dataset 220 and read, write, or modify the contents of the dataset.
In one embodiment, program 314 may be an executable program slice
of a larger medical program used by medical personnel to diagnose a
patient with one or more medical diagnoses (d1, d2, d3, d4, d5, d6)
based on one or more medical factors (f1, f2, f3, f4, f5, f6, f7
and f8) contained in a patient's record in the dataset 220.
[0032] While line numbers 1-18 have been depicted to indicate
certain program statements for ease of understanding, they are not
necessary to the operation of the program. As indicated by
reference numeral 316, program 314 may include one or more program
statements for reading the values of one or more medical symptoms
f1-f8 contained in a patient's record in the dataset. In addition,
program 314 may also include program output statements (indicated
by reference numeral 318) for writing one or more medical diagnoses
d1-d6 into a patient's record in the dataset. Program 314 may
execute according to the program statements in lines 1-18 to
determine whether the value of one or more diagnoses d1-d6 is true
or false based upon particular factors f1-f8 exhibited by the
patient.
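The pseudo-code of FIG. 3 is not reproduced in this text. The following Python sketch is a hypothetical reconstruction, written so that it is consistent with every dependency the description attributes to lines 1-18 (lines 6 and 13 assigning d3, lines 7-8 and 11 as opposing branches, line 17 as the only assignment to d5, and so on); the actual statements in the patent's figure may differ.

```python
def program_314(record):
    """Hypothetical reconstruction of the FIG. 3 pseudo-code; the line
    numbers in the comments match those cited in the description."""
    f1, f2, f3, f4, f5, f6, f7, f8 = (             # line 1: read factors
        record["f1"], record["f2"], record["f3"], record["f4"],
        record["f5"], record["f6"], record["f7"], record["f8"])
    d1 = d2 = d3 = d4 = d5 = d6 = t = False        # line 2: initialize
    if f2 and f3 and f4:                           # line 3
        d2 = True                                  # line 4
    if f3 or f5:                                   # line 5
        d3 = True                                  # line 6
    if f5 and f8:                                  # line 7
        d4 = True                                  # line 8
    else:                                          # line 9
        d6 = True                                  # line 10
        t = f3 and f4 and f7                       # line 11
    if d2:                                         # line 12
        d3 = True                                  # line 13
    if d3:                                         # line 14
        d2 = True                                  # line 15
    if d4:                                         # line 16
        d5 = t or (f1 and f2 and f6)               # line 17
    return {"d1": d1, "d2": d2, "d3": d3,          # line 18: write diagnoses
            "d4": d4, "d5": d5, "d6": d6}
```

In this reconstruction, line 17 statically depends on all eight factors (via t from line 11), but line 11 can never have executed when line 17 runs, which is exactly the false positive the dynamic analysis discussion later removes.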
[0033] FIG. 4 illustrates an example of the analysis module 16
using static analysis for determining one or more quasi-identifier
fields associated with the sensitive diagnosis field d3.
[0034] The analysis module may begin by analyzing the logic
contained in the program statements in program 314, and identifying
a program output statement (indicated by block 410) that writes a
value into the sensitive data field d3 in dataset 220 (indicated by
arrow 412).
[0035] The analysis module may then recursively determine and
analyze a set of program statements (or program slice) that
indirectly or directly contribute to computing the value that is
written into the sensitive data field d3.
[0036] Thus, the analysis module may first identify the program
statements in program 314 that may have last assigned a value
(directly contributed) to the value of d3 which was written into
the dataset 220. As indicated by the arrows 414 and 416, the
analysis module may determine, based upon an examination of the
logic in the program, that program statements on both line 6 and
line 13 assign values to d3, and that either statement may thus
have assigned a value that was ultimately written into the
sensitive data field d3 in the dataset 220.
[0037] Having identified that the write statement in block 410 may
be dependent on the program statements in lines 6 and 13, the
analysis module may now recursively continue to analyze program 314
to determine any further program statements upon which the program
statements in lines 6 and 13 may further depend, which may
indirectly contribute to the value assigned to d3.
[0038] As the program statement on line 6 is executed only if the
condition on line 5 is true (arrow 418), the analysis module may
analyze the condition on line 5 and determine that the program
statement on line 6 depends upon the values of factors f3 and f5.
Upon determining that the value in sensitive field d3 may depend on
factors f3 and f5 (circled), the analysis module may recursively
look for other statements that assign a value to these factors
which may lead to yet further dependencies. As seen in FIG. 4,
however, factors f3 and f5 are not assigned any values and do not
have any dependencies on any other fields in the dataset 220. Thus,
the analysis module may stop recursively looking for further
dependencies for factors f3 and f5 and identify both as possible
quasi-identifier fields for the sensitive data field d3.
[0039] Applying a similar analysis to the program statement on line
13, the analysis module may determine that the program statement on
line 13 depends on the condition on line 12 (arrow 420). The
analysis module may thus analyze the program statement on line 12
and also identify diagnosis d2 (circled) as a possible
quasi-identifier field upon which the value assigned to sensitive
field d3 may depend (a diagnosis field may be a quasi-identifier
for another diagnosis field).
[0040] Upon determining that diagnosis field d3 may be dependent on
diagnosis field d2 in dataset 220, the analysis module may continue
to recursively analyze the program to determine any further
dependencies of d2 (which may indirectly contribute to the value
written into sensitive data field d3) by analyzing the program for
one or more statements that may have last assigned a value to d2,
which, as shown by arrow 422, is the program statement on line
4.
The assignment in the program statement on line 4 is dependent
on the conditional statement in line 3 (arrow 424). Thus, the
analysis module may examine the statement in line 3 and determine
that the value of d2 may be dependent on the value of factors f2,
f3, and f4 (circled). Continuing to examine the program statements
associated with factors f2, f3, and f4 recursively as described
above, the analysis module may determine that these factors do not
have any other dependencies, and thus conclude that all backward
dependencies (all potential quasi-identifiers) for sensitive data
field d3 have now been found. As factors f3 and f5 were already
identified as quasi-identifiers previously, the analysis module may
thus simply identify factors f2 and f4 as additional
quasi-identifiers for the data field d3.
[0042] As indicated previously, the analysis module may now collect
all quasi-identifier fields identified above into a
quasi-identifier list 430 for sensitive data field d3, which, in
this case, include factors f2, f3, f4, and f5 and the diagnosis d2,
and anonymize or mask the quasi-identifier fields in the dataset
using one or more conventional anonymization techniques.
[0043] The recursive program analysis method disclosed above for
identifying the quasi-identifier fields for sensitive field d3 in program 314 may be similarly applied to other sensitive fields in the
dataset 220.
[0044] For example, applying the same static program analysis
technique to diagnosis field d2 in program 314 reveals that it
reciprocally depends on diagnosis d3 and the values of the same
four factors, f2, f3, f4, and f5. Thus, if fields d2 and d3
represent heart disease and heart burn, respectively, and f2, f3,
f4, and f5 represent factors related to blood pressure, chest pain,
exercise level, and type of diet, respectively, then, according to
the above program, these two diagnoses not only depend on the same
set of symptoms and risk factors but they also depend on each
other. Thus, if either d2 or d3 is considered a sensitive field for
privacy reasons, then it is desirable to anonymize the other as
well. Otherwise, there is a risk that the hidden diagnosis in one
of these fields may be inferred based on the value of the diagnosis
in the other field.
[0045] To provide yet another example, applying the same static program analysis technique to diagnosis field d4 in program 314 reveals that the program statement on line 8 is the last statement that assigns a value to variable d4. In addition, the program
statement on line 8 is based upon the condition in the program
statement on line 7, which, as seen, is based upon factors f5 and
f8. As neither of these factors has any further dependencies (is
not modified by the program 314), the analysis module may stop its
backward trace for further dependencies and determine that factors
f5 and f8 are the only possible quasi-identifier fields for
diagnosis field d4.
[0046] As illustrated by the foregoing examples, statically
analyzing a program slice (set of program statements) is often
helpful in identifying quasi-identifiers for one or more sensitive
data fields in a data set. However, static analysis of a program
may sometimes also yield false positives (i.e., identify one or
more fields as quasi-identifiers when they are not).
[0047] For example, this can be seen in program 314 with respect to
diagnosis field d5. In this case, if static program analysis is
applied in the manner described above to diagnosis field d5, the
result indicates that sensitive field d5 may depend on all eight
factors, f1 through f8. However, if instead of just statically identifying all potential dependencies as quasi-identifiers in the manner described above, the feasibility of paths during the operation (e.g., actual execution) of program 314 is also considered, it can be determined that, in fact, there are only five factors that may be quasi-identifier fields for sensitive field d5, as described below.
[0048] As seen in program 314, the only place where d5 is assigned
a value is on the program statement on line 17. However, line 17
can execute only if the program statement on line 16 evaluates to
true, i.e., if d4 has the value true. But, d4 can be true only if
line 8 executes. However, if line 8 executes, line 11 cannot
execute, as lines 8 and 11 belong to the two opposing branches of
the same conditional statement on line 7. Therefore, as line 8 must
execute if d5 is assigned a value, and line 11 cannot execute
(i.e., is infeasible) if line 8 executes, the factors f3, f4, and
f7 evaluated in line 11 are not valid quasi-identifiers for field
d5. Thus, field d5 only has five valid quasi-identifiers, which
include f1, f2, f5, f6, and f8.
[0049] Based on the foregoing, it can be seen that diagnosis field
d5 has only five valid quasi-identifiers, and not eight as
indicated by static analysis. Such false positives may also arise
if a program in question makes heavy use of indirect references via
memory addresses. False positives like these, however, may be
avoided if the analysis module also uses dynamic analysis, which is
described next.
[0050] The static program analysis technique described in the
previous sections included traversing all program paths (whether
feasible or infeasible) when looking for statements that may have
assigned or used a value in a given field. Instead, the analysis
module may dynamically analyze a program to determine the program
paths that are actually taken (and/or not taken) during execution
of the program under different or all possible input conditions,
and thereby identify paths that are feasible and/or infeasible.
While, as
demonstrated above, a program such as program 314 may contain
infeasible paths, such paths will not be executed, and hence, no
quasi-identifiers based on an analysis of an infeasible path would
be considered by the analysis module during dynamic analysis.
[0051] In one embodiment, the analysis module may dynamically
analyze the program statements in program 314 by tracing or
observing all paths taken by program 314 during its execution by a
processor in determining one or more diagnoses for a record of a
particular patient. As program 314 executes, the analysis module
may trace or record the actual path(s) traversed based on the
inputs provided to the program. As a result, the analysis module,
when determining a program statement where a given field was
assigned a value, may recursively analyze paths (program
statements) that were actually taken during the execution of the
program when identifying quasi-identifiers with respect to a given
sensitive data field, and ignore paths (program statements) that
were not taken by the program.
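As a sketch of such path tracing, the following uses Python's built-in tracing hook to record which lines of a function actually execute; the `sample` function and its offset bookkeeping are illustrative assumptions, not part of the disclosed system.

```python
import sys

def trace_lines(func, *args):
    """Run func(*args) and record which of its line offsets execute."""
    executed = set()

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            # Offset of the executing line within the traced function.
            executed.add(frame.f_lineno - func.__code__.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, executed

def sample(flag):   # offset 0
    if flag:        # offset 1
        x = 1       # offset 2
    else:           # offset 3
        x = 2       # offset 4
    return x        # offset 5
```

For `trace_lines(sample, True)` the recorded offsets are {1, 2, 5}; a field read only on the untaken branch (offset 4) would not appear in the recorded path and so would be ignored, as described above.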
[0052] Furthermore, this holds true even if a program makes heavy
use of indirect assignments or references via memory addresses.
When the execution path used to compute dynamic slices (a set of
program statements that are executed in a path taken by the
program) is recorded, the actual memory addresses of the variables
that are assigned values and used in all statements along that
path can also be recorded by the analysis module, such that the
analysis module may decide whether a particular program statement
assigning or using a value via a memory address refers to a
potential quasi-identifier field or not.
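A minimal sketch of this bookkeeping, using Python object identity as a stand-in for a recorded memory address (the field names and record layout are illustrative assumptions):

```python
def resolve_indirect_use(record, pick_first):
    """Map an indirectly referenced value back to the field it aliases."""
    # Build an address -> field-name table as the record's values are loaded;
    # id() stands in for the memory address recorded along the execution path.
    addr_to_field = {id(v): name for name, v in record.items()}
    # An indirect reference: statically, which field 'ref' denotes is unknown.
    ref = record["f1"] if pick_first else record["f2"]
    # Dynamically, the recorded address resolves the reference to a concrete
    # field, which can then be flagged as a potential quasi-identifier.
    return addr_to_field[id(ref)]

record = {"f1": ["fever"], "f2": ["cough"]}   # distinct objects, distinct ids
```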
[0053] Thus, in one aspect, the analysis module may trace program
paths that are executed by program 314 based on one or more
possible combinations of inputs. As one of ordinary skill in the art would
appreciate, programs will often operate upon a finite combination
of inputs. For example, analysis module may trace the execution of
program 314 to dynamically identify quasi-identifier fields for the
sensitive data field d5, based on true or false combinations of the
finite factors f1-f8 and data fields d1-d4, and d6. While such a
"brute force" approach may be computationally burdensome, it will
eliminate any chance of generating false positives. In another
aspect, the analysis module may consider additional information
that may be used to determine only valid inputs. For example, the
analysis module may be programmed with information regarding
specific symptoms and diagnoses, such that it can generate and
analyze the program based on valid combinations of certain symptoms
and/or diagnoses, while ignoring other invalid ones.
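The exhaustive approach can be sketched as follows. The `diagnose` function is an illustrative stand-in for program 314 (its conditions are assumptions); the point of the example is the enumeration over every true/false combination of the factors and the collection of the factors read on executions where d5 is actually assigned.

```python
from itertools import product

def diagnose(f):
    read = {"f1", "f2"}               # factors read by the first conditional
    if f["f1"] and f["f2"]:           # analogous to line 7
        read |= {"f5", "f6"}
        d4 = f["f5"] or f["f6"]       # analogous to line 8
    else:
        read |= {"f3", "f4", "f7"}    # analogous to line 11; d5 never assigned here
        d4 = False
    if d4:                            # analogous to line 16
        read |= {"f8"}
        return f["f8"], True, read    # analogous to line 17: d5 assigned
    return None, False, read

names = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"]
quasi = set()
for combo in product([False, True], repeat=8):
    _, assigned, read = diagnose(dict(zip(names, combo)))
    if assigned:
        quasi |= read                 # factors influencing an actual d5 value
# quasi == {"f1", "f2", "f5", "f6", "f8"}: the five factors identified above.
```

Here the enumeration covers 2^8 = 256 executions; for realistic programs with many inputs, this is exactly the computational burden the text notes.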
[0054] As most programs are normally tested to ensure that they
function (execute) as desired with respect to their features,
in another aspect, the same test data sets (input conditions) that
are used to validate the program during testing may also be used to
dynamically analyze the corresponding program slices and identify
quasi-identifier fields for one or more sensitive fields in a
database.
[0055] In this regard, the dataset that contains the sensitive
fields and the quasi-identifier fields in question may, before
masking or anonymization of such fields, itself serve as a test set
of data.
[0056] Thus, dynamic analysis of a given program in the manner
described above may dramatically reduce or even eliminate false
positives in many cases.
[0057] There is a tradeoff involved between using static and
dynamic analysis. While computing and analyzing a static slice may
be much more efficient (e.g., faster), it may lead to false
positives. Dynamic analysis, by contrast, may be much more
expensive, both in terms of computation time and space required,
and it may miss detection of some genuine dependencies (i.e., may
allow false negatives if feasible paths of a program under certain
input conditions are not evaluated), but it substantially reduces
or eliminates false positives.
[0058] Thus, in yet another embodiment, the analysis module may
also adaptively determine whether static analysis or dynamic
analysis is more appropriate for a given program. A determination
that dynamic analysis is more appropriate for a particular program
may be adaptively made, for example, if the program contains many
indirect variable references, such that static analysis of such a
program is likely to contain many infeasible paths and result in
many false positives. Thus, in one aspect the analysis module may
compare the number of indirect references in all or a portion of
the program to a threshold and determine a likelihood of generating
an unacceptable number of false positives. If the number of
indirect references exceeds the threshold or the likelihood of
generating false positives is unacceptably high, then the analysis
module may analyze the program using dynamic analysis. In other
cases, the analysis module may adaptively determine that the number
of indirect references or the likelihood of generating false
positives is low (based on a comparison with a threshold), and
analyze the program using static analysis, based upon the
determination that static analysis is likely to result in few
false positives or an acceptable number of them.
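A crude sketch of such an adaptive check follows. Counting `*` and `->` tokens in the program text as a proxy for indirect references, and the particular threshold value, are both assumptions made for illustration.

```python
def choose_analysis(source: str, threshold: int = 5) -> str:
    """Pick dynamic analysis when indirect references look numerous."""
    # Approximate the number of indirect references by counting
    # pointer-dereference tokens in the program text (a naive proxy).
    indirect = source.count("*") + source.count("->")
    return "dynamic" if indirect > threshold else "static"
```

A richer implementation might instead count indirect references in a parsed representation of the program, or estimate the likelihood of false positives directly, as the text suggests.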
[0059] FIG. 5 is a flow chart of a process 500 in accordance with
various aspects of the system and method disclosed herein. The
process begins in block 515. In block 520, the system and method
identifies a set of programs containing one or more programs having
access to a given dataset, where each program in the set of
programs include one or more program statements for reading,
writing, modifying, or otherwise manipulating the data in the
dataset.
[0060] In block 525, the system and method determines whether all
programs in the set of programs identified in block 520 have been
analyzed to determine one or more quasi-identifier fields for one
or more sensitive data fields contained in the dataset.
[0061] If the result of the check in block 525 is false, that is,
not all of the programs in the set of programs have been
analyzed, then in block 530 the system and method selects a program
from the set of programs for analysis.
[0062] In block 535, the system and method identifies a set of
output statements in the program selected in block 530, where the
set of output statements includes one or more output statements
that write or update a value in one or more sensitive data fields
in the dataset.
[0063] In block 540, the system and method determines if all output
statements in the set of output statements identified in block 535
have been analyzed.
[0064] If the result of the check in block 540 is false, that is,
not all output statements in the set of output statements have
been analyzed, then in block 545 the system and method selects an
output statement that remains to be analyzed from the set of output
statements, where the output statement writes or updates a value of
a given sensitive data field in the dataset.
[0065] In block 550, the system and method recursively identifies,
using, for example, static and/or dynamic analysis, a set of one or
more program statements (e.g., a program slice) that indirectly or
directly contribute to the value that is written by the output
statement into the given sensitive data field in the dataset.
[0066] In block 555, the system and method identifies one or more
data fields in the dataset, which are indirectly or directly
referenced in the set of program statements identified in block
550, as quasi-identifier fields for the given sensitive data field.
The system and method then returns to block 540 to check if all
output statements in the selected program have been analyzed.
[0067] If the result of the check in block 540 is true, that is,
all statements in the set of output statements for the selected
program have been analyzed, the system and method returns to block
525 to check if all programs in the set of one or more programs
have been analyzed.
[0068] If the result of the check in block 525 is true, i.e., each
program in the set of one or more programs has been analyzed, then
the system and method proceeds to block 560.
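The loop structure of blocks 520 through 555 can be sketched as follows. The list-of-statements program representation and the simple transitive-closure "slice" are illustrative assumptions standing in for the static or dynamic slicing described above; the masking of block 560 is then applied to the collected fields.

```python
def backward_slice_fields(program, stmt):
    """Transitively collect the fields contributing to stmt's written value."""
    contributing = set(stmt["reads"])
    changed = True
    while changed:
        changed = False
        for s in program:
            if s["writes"] in contributing and not set(s["reads"]) <= contributing:
                contributing |= set(s["reads"])
                changed = True
    return contributing

def process_500(programs, sensitive_fields):
    quasi = {f: set() for f in sensitive_fields}
    for program in programs:                          # blocks 525 / 530
        for stmt in program:                          # blocks 540 / 545
            if stmt["writes"] in sensitive_fields:    # block 535: output statements
                # blocks 550 / 555: slice and record quasi-identifier fields
                quasi[stmt["writes"]] |= backward_slice_fields(program, stmt)
    return quasi                                      # masking (block 560) follows

program = [
    {"writes": "d1", "reads": ["f1", "f2"]},
    {"writes": "d5", "reads": ["d1", "f8"]},
]
# process_500([program], {"d5"}) yields {"d5": {"d1", "f1", "f2", "f8"}}
```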
[0069] In block 560, the system and method uses conventional
anonymization techniques such as k-anonymity or l-diversity to
partially or completely mask the values of one or more fields in
the dataset that have been determined to be quasi-identifier fields
for one or more sensitive data fields. The system and method then
ends in block 565.
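As one concrete flavor of the masking in block 560, the following sketch generalizes a single quasi-identifier column by truncation until every generalized value occurs at least k times. The ZIP-code field and the truncation scheme are illustrative assumptions, not the patent's method; full k-anonymity considers combinations of quasi-identifier fields rather than one column in isolation.

```python
def generalize_zip(zips, k=2):
    """Truncate ZIP codes until each generalized value occurs >= k times."""
    width = len(zips[0])
    while width > 0:
        # Replace trailing digits with '*' at the current truncation width.
        trunc = [z[:width] + "*" * (len(z) - width) for z in zips]
        counts = {}
        for t in trunc:
            counts[t] = counts.get(t, 0) + 1
        if min(counts.values()) >= k:
            return trunc
        width -= 1
    # Fall back to fully masked values if no width satisfies k.
    return ["*" * len(z) for z in zips]
```

For example, `generalize_zip(["07921", "07920", "07922"], k=2)` keeps the shared prefix and masks the distinguishing final digit, so no individual value remains unique.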
[0070] FIG. 6 is a block diagram illustrating a computer system
upon which various aspects of the system and method as disclosed
herein can be implemented. FIG. 6 shows a computing device 600
having one or more input devices 612, such as a keyboard, mouse,
and/or various other types of input devices such as pen-inputs,
joysticks, buttons, touch screens, etc. Computing device 600 also
contains a display 614, which could include, for instance, a CRT,
LCD, plasma screen monitor, TV, projector, etc. In one embodiment,
the computing device 600 may be a personal computer, server or
mainframe, mobile phone, PDA, laptop etc. In addition, computing
device 600 also contains a processor 610, memory 620, and other
components typically present in a computer.
[0071] Memory 620 stores information accessible by processor 610,
including instructions 624 that may be executed by the processor
610 and data 622 that may be retrieved, executed, manipulated or
stored by the processor. The memory may be of any type capable of
storing information accessible by the processor, such as a
hard-drive, ROM, RAM, CD-ROM, DVD, Blu-Ray disk, flash memories,
write-capable or read-only memories. The processor 610 may comprise
any number of well-known processors, such as processors from Intel
Corporation. Alternatively, the processor may be a dedicated
controller for executing operations, such as an ASIC.
[0072] Data 622 may include dataset 20, program 14, and
quasi-identifiers 18 as described above with respect to FIGS. 1-3.
Data 622 may be retrieved, stored, modified, or processed by
processor 610 in accordance with the instructions 624. The data may
be stored as a collection of data. For instance, although the
invention is not limited by any particular data structure, the data
may be stored in computer registers, in a relational database as a
table having a plurality of different fields and records, XML
documents, or flat files. Data may also be stored in one or more
relational databases.
[0073] Additionally, the data may also be formatted in any computer
readable format such as, but not limited to, binary values, ASCII
etc. Moreover, the data may include any information sufficient to
identify the relevant information, such as descriptive text,
proprietary codes, pointers, references to data stored in other
memories (including other network locations) or information which
is used by a function to calculate the relevant data.
[0074] Instructions 624 may implement the functionality described
with respect to the analysis module and in accordance with the
process disclosed above. The instructions 624 may comprise any set
of instructions to be executed directly (such as machine code) or
indirectly (such as scripts) by the processor. In that regard, the
terms "instructions," "steps" and "programs" may be used
interchangeably herein. The instructions may be stored in any
computer language or format, such as in object code or modules of
source code. In one embodiment, instructions 624 may include
analysis module 16, and the processor may execute instructions
contained in analysis module 16 in accordance with the
functionality described above.
[0075] Although the processor 610 and memory 620 are functionally
illustrated in FIG. 6 as being within the same block, it will be
understood that the processor and memory may actually comprise
multiple processors and memories that may or may not be stored
within the same physical housing or location. Some or all of the
instructions and data, such as the dataset 20 or the program 14,
for example, may be stored on a removable recording medium such as
a CD-ROM, DVD or Blu-Ray disk. Alternatively, such information may
be stored within a read-only computer chip. Some or all of the
instructions and data may be stored in a location physically remote
from, yet still accessible by, the processor. Similarly, the
processor may actually comprise a collection of processors which
may or may not operate in parallel. Data may be distributed and
stored across multiple memories 620 such as hard drives, data
centers, server farms or the like.
[0076] In one aspect, computing device 600 may communicate with one
or more other computing devices (not shown). Each of such other
computing devices may be configured with a processor, memory and
instructions, as well as one or more user input devices and
displays. Each computing device may be a general purpose computer,
intended for use by a person, having all the components normally
found in a personal computer such as a central processing unit
("CPU"), display, CD-ROM, DVD or Blu-Ray drive, hard-drive, mouse,
keyboard, touch-sensitive screen, speakers, microphone, modem
and/or router (telephone, cable or otherwise) and all of the
components used for connecting these elements to one another. In
one aspect, for example, the one or more other computing devices
may include a third party computer (not shown) to which the
computing device 600 transmits a dataset for further use or
analysis, where the dataset that the computing device 600 transmits
to the third party computer may be a dataset that has been
anonymized in accordance with various aspects of the system and
method disclosed herein.
[0077] In addition, computing device 600 may be capable of direct
and indirect communication with such other computing devices over a
network (not shown). It should be appreciated that a typical
networking system can include a large number of connected devices,
with different devices being at different nodes of the network. The
network, including any intervening nodes, may comprise various
configurations and protocols including the Internet, intranets,
virtual private networks, wide area networks, local networks,
private networks using communication protocols proprietary to one
or more companies, Ethernet, WiFi, Bluetooth and HTTP.
Communication across the network, including any intervening nodes,
may be facilitated by any device capable of transmitting data to
and from other computers, such as modems (e.g., dial-up or cable),
network interfaces and wireless interfaces.
[0078] Although the invention herein has been described with
reference to particular embodiments, it is to be understood that
these embodiments are merely illustrative of the principles and
applications of the present invention. It is therefore to be
understood that numerous modifications may be made to the
illustrative embodiments and that other arrangements may be devised
without departing from the spirit and scope of the present
invention as defined by the appended claims.
* * * * *