U.S. patent application number 14/234365 was published by the patent office on 2014-11-27 as publication number 20140351943, titled "Anonymization and Filtering Data". The application is currently assigned to VODAFONE IP LICENSING LIMITED. The invention is credited to Stephen Babbage, Adam Gianniotis and Gerald Mcquaid, who are also the listed applicants.
United States Patent Application 20140351943
Kind Code: A1
Inventors: Gianniotis, Adam; et al.
Publication Date: November 27, 2014
Application Number: 14/234365
Family ID: 44652192
ANONYMIZATION AND FILTERING DATA
Abstract
A system and method of anonymising data comprising the steps of
receiving data to be anonymised and applying one or more
transformations to the received data according to a transformation
configuration resource, wherein the one or more transformations
include transforming at least an original portion of the received
data into a transformed portion, wherein the original portion of
the received data is recoverable from the transformed portion using
stored information.
Inventors: Gianniotis, Adam (Newbury, GB); Mcquaid, Gerald (Newbury, GB); Babbage, Stephen (Newbury, GB)

Applicants: Gianniotis, Adam (Newbury, GB); Mcquaid, Gerald (Newbury, GB); Babbage, Stephen (Newbury, GB)

Assignee: VODAFONE IP LICENSING LIMITED (Newbury, Berkshire, GB)
Family ID: 44652192
Appl. No.: 14/234365
Filed: July 20, 2012
PCT Filed: July 20, 2012
PCT No.: PCT/GB2012/051751
371 Date: June 11, 2014
Current U.S. Class: 726/26
Current CPC Class: H04L 63/0407 20130101; H04L 63/0435 20130101; G06F 21/60 20130101; G06F 21/602 20130101; G06F 21/6254 20130101
Class at Publication: 726/26
International Class: G06F 21/62 20060101 G06F021/62; G06F 21/60 20060101 G06F021/60

Foreign Application Data

Date: Jul 22, 2011; Code: GB; Application Number: 1112665.3
Claims
1. A method of anonymizing data comprising the steps of: receiving
data to be anonymized; applying one or more transformations to the
received data according to a transformation configuration resource,
wherein the one or more transformations include transforming at
least an original portion of the received data into a transformed
portion, wherein the original portion of the received data is
recoverable from the transformed portion using stored
information.
2. The method of claim 1, wherein the stored information comprises
the transformed portion stored with the original portion of
received data.
3. The method of claim 1, wherein the stored information is
cryptographic material for decrypting the transformed portion into
the original portion of received data.
4. The method of claim 1, wherein the transformation configuration
resource defines the transformation to be applied.
5. The method according to claim 1, wherein an anonymization
configuration resource defines how the received data provides an
output containing the transformed portion, the method further
comprising the step of operating according to the anonymization
configuration resource to produce an output.
6. The method of claim 5, wherein the anonymization configuration
resource defines any one or more of: an interface for providing the
received data; how the received data is read; the transformation
configuration resource; an output format; the source of the
received data; the destination of the output; and a maximum number
of processing threads.
7. The method according to claim 1, wherein the received data is in
a data format defined by a data description configuration
resource.
8. The method according to claim 1 further comprising the step of
generating an output comprising the transformed portion with or
without an untransformed portion of the received data.
9. The method of claim 5, wherein the output is formatted according
to an output configuration resource.
10. The method according to claim 1 further comprising the steps
of: receiving an input comprising the transformed portion and a new
portion; and using the stored information to recover the original
portion from the transformed portion.
11. The method according to claim 1, wherein the transformation is
encryption.
12. The method of claim 11, wherein the encryption is selected from
the group consisting of: format preserving encryption; and
ephemeral encryption.
13. The method according to claim 1 further comprising applying
further transformations to further original portions of the received
data, the further transformations selected from the group consisting
of: hashing; redacting; filtering; find and replacing; replacement
with random values; validation; and masking.
14. The method according to claim 1 wherein any one or more of the
configuration resources are encrypted.
15. The method according to claim 1, wherein the received data is
selected from one or more of the group consisting of: XML;
delimited; fixed width; YAML; SOAP; SMPP; and UCP/EMI.
16. A computer program comprising program instructions that, when
executed on a computer, cause the computer to perform the method of
claim 1.
17. A computer-readable medium carrying a computer program
according to claim 16.
18. A computer programmed to perform the method of claim 1.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to a system and method for
anonymising data and in particular for selectively anonymising data
in a configurable way.
BACKGROUND OF THE INVENTION
[0002] Many businesses are faced with an apparent conflict between
the need to respect their clear obligation to protect the security
and privacy of their customers in their use of data, and a business
imperative to maximise revenue opportunity by either exploiting or
enriching the data. These opportunities increasingly involve
exposing data to partners and third parties and the movement of
data outside the protected network of the business. Protecting the
data while still retaining sufficient usable data to enable the
opportunity is a key challenge. In addition, in this increasingly
global economy, data crosses borders increasingly and organisations
need to ensure that they continue to comply with all the relevant
regulatory requirements.
[0003] The ability to share data between repositories is an
essential requirement for many businesses. Sharing data between
repositories can become problematic when the data being shared is
at least in part confidential, secret or otherwise sensitive.
[0004] There are many examples of systems which are arranged to
securely share data between repositories, including systems for
securing the repositories themselves, and securing the
communication channels between repositories.
[0005] An additional problem arises when the source repository
wishes to only share part of a data set with a destination
repository.
[0006] Therefore, there is required a system and method that
overcomes these problems.
SUMMARY OF THE INVENTION
[0007] The present invention relates to a system for anonymising
and filtering data sets which are leaving a data repository within
a secure environment to another environment that may or may not be
secure, and a system for de-anonymising the data sets as they are
returned back to the secure data repository.
[0008] The present invention provides a system and a method of
transforming data in real-time, or near real-time from the original
data set to an at least partially anonymised, filtered and masked
data set suitable for transmission to a third party outside of a
secure environment. The invention has the important additional
feature of being able to receive an at least partially anonymised
data set that has previously been transmitted outside of the secure
environment and deanonymise the previously anonymised data, for
storing the deanonymised data back in the source repository, or
other location within the secure environment. The returning data
set does not have to be identical to the original data set,
provided that at least one identifier data item remains unaltered.
This allows third parties to add to, alter, or in other ways enrich
the transmitted data set prior to returning the data set to the
secure environment. Additionally, the present invention provides
the capability, using easily modified configuration data, to
transform multiple data sets of differing structure and to apply
different transformation techniques (for example anonymisation,
masking, filtering) to each according to its type.
[0009] An anonymisation system and method filters, anonymises
and/or otherwise transforms sensitive data before it is sent
onwards, for example to an external third party. Furthermore, the
anonymisation system is also able to de-anonymise data as it is
sent back to the originating party after analysis or
enrichment.
[0010] The anonymisation system supports a number of interfaces to
route data and can apply a variety of transform and data quality
rules to the data.
[0011] According to a first aspect there is provided a method of
anonymising data comprising the steps of: [0012] receiving data to
be anonymised; [0013] applying one or more transformations to the
received data according to a transformation configuration resource,
wherein the one or more transformations include transforming at
least an original portion of the received data into a transformed
portion, wherein the original portion of the received data is
recoverable from the transformed portion using stored
information.
[0014] Therefore, data may be safely and securely released to third
parties as personal, private or other sensitive data may be
anonymised, tokenised or protected first and then recovered and
processed on return from the third party. For example, this may
allow external processing of data to take place outside of a secure
boundary or organisation. Upon return, additional information may
be utilised as the sensitive and identifying information may be
recovered by the originating party so that the external processing
and any additional data may be used. The original data may be data
that can be used to identify users or their personal information
(e.g. telephone number, name, address, date of birth, etc.). The
transformation configuration resource may be configurable,
customisable or specific to particular received data types and data
structures/formats, for example. The stored information may be
configuration information, for example.
[0015] Advantageously, this provides a faster, in-line, real-time,
highly configurable and reversible method of anonymising data.
[0016] Advantageously, the method may consistently anonymise data
to the same value when required. This provides referential
integrity within the data.
[0017] The original portion may be replaced with a token as the
transformed portion. A token may be a representation or a reference
to the original portion in anonymised form so that the original
portion may not be inferred or generated from the token without
additional information. A token store may be a repository or
database of tokens. Tokens that have been used or are in use may be
associated with the original data or portion or linked to these in
other ways. A lookup or call may be made to the token store to
determine the original portion or data that it represents. Access
to the token store may be restricted or secured to prevent
unauthorised interpretation of the transformed (tokenised)
portion.
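The token-store behaviour described above can be sketched as a minimal in-memory model. This is illustrative only (the class and method names are hypothetical, and a real deployment would use a secured database rather than process memory):

```python
import secrets

class TokenStore:
    """Minimal in-memory token store: maps random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenise(self, value):
        # Re-use the existing token so the same input always maps to the
        # same token (preserving referential integrity across records).
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # random, reveals nothing about the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenise(self, token):
        # Recovery requires access to the store, so the original value
        # cannot be inferred from the token alone.
        return self._token_to_value[token]
```

Restricting access to the store is what makes the token safe to release: the mapping, not the token itself, carries the sensitive information.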
[0018] The transformation configuration resource may be
configurable. This makes the method and system easier to update
when new or amended data types and structures/formats are
received.
[0019] Optionally, the stored information may comprise the
transformed portion stored with the original portion of received
data. For example, the transformed or tokenised portion may have
the transformed data stored together with the original portion in a
database or token store.
[0020] Optionally, the stored information may be cryptographic
material for decrypting the transformed portion into the original
portion of received data. The transformed data may be an encrypted
form of the original data. Therefore, the original data may be
recovered by a decryption procedure involving a stored key or other
cryptographic material.
[0021] Optionally, the original portion may be replaced by a
unique, alternative value called a token. The token is typically
stored in a database and may be re-used to recover the original
value upon return.
[0022] Preferably, the transformation configuration resource
defines the transformation to be applied. This may be a
configuration file or database or repository describing how to
transform the original data and other options and procedures that
may be carried out, for example.
[0023] Optionally, an anonymisation configuration resource may
define how the received data provides an output containing the
transformed portion, the method further comprising the step of
operating or processing according to the anonymisation
configuration resource to produce an output. Therefore, a workflow
may be pre-defined for the particular received data (i.e.
preconfigured for different data types and formats).
[0024] Preferably, the anonymisation configuration resource may
define any one or more of: an interface for providing the received
data; how the received data is read; the transformation
configuration resource; an output format; the source of the
received data; the destination of the output; and a maximum number
of processing threads. The anonymisation configuration resource may
define other parameters and procedures to be carried out.
[0025] Optionally, the received data may be in a data format
defined by a data description configuration resource. Therefore,
the received data may be read according to the data description
configuration resource. The data description configuration resource
may for example, describe where in the received data any or all
data items may be located including those data items or portions
that are to be transformed.
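For a fixed-width format, such a data description resource might amount to a list of field positions; the sketch below is illustrative only, with hypothetical field names and offsets:

```python
# Hypothetical data description configuration: (field name, start, end).
FIELD_SPEC = [
    ("account", 0, 8),
    ("name", 8, 12),
    ("msisdn", 12, 24),
]

def read_fixed_width(line, spec):
    # Locate each data item by its configured position, so the engine
    # knows which portions of the record are candidates for transformation.
    return {name: line[start:end].strip() for name, start, end in spec}
```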
[0026] Optionally the method may further comprise the step of
generating an output comprising the transformed portion with or
without an untransformed portion of the received data. In other
words, an output may be generated from the received data with the
original data replaced by the transformed data but with other
fields or data in their original form.
[0027] Preferably, the output may be formatted according to an
output configuration resource. This may include details of an
interface used to describe the required output and/or the form of
the output file, data, stream or database table.
[0028] Optionally, the method may further comprise the steps
of:
[0029] receiving an input comprising the transformed portion and a
new portion; and
[0030] using the stored information to recover the original portion
from the transformed portion. In other words, these steps describe
the receipt of previously transformed data once further processing
has been carried out to create or modify data preferably associated
or derived from the original data. Upon receipt, the original or
identifying portion or portions of the data may be recovered so
that the data is deanonymised.
[0031] Optionally, the transformation may be encryption.
[0032] Preferably, the encryption may be selected from the group
consisting of: format preserving encryption; and ephemeral
encryption. Other encryption types may be used. Format preserving
encryption may allow correct processing of the transformed data.
Ephemeral encryption may be used to create different outputs each
time for the same input. This can help to prevent third parties who
receive the transformed data, from building up user profiles or
user specific information. For example, even though they cannot
identify the actual user, they may be able to associate multiple
items of received data with the same user if the transformation (or
token) is identical for each item. Such analysis may be frustrated
by using ephemeral encryption.
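The ephemeral property can be illustrated with a toy stream construction: a fresh random nonce makes repeated encryptions of the same value differ, while a stored key still allows recovery. This is an assumption-laden sketch for illustration only, not the patent's cipher and not production-grade cryptography:

```python
import hashlib
import hmac
import os

KEY = b"example-stored-key"  # hypothetical stored information enabling recovery

def ephemeral_encrypt(plaintext: bytes, key: bytes = KEY) -> bytes:
    # A fresh random nonce makes each ciphertext different, even for
    # identical plaintexts, frustrating cross-record correlation.
    nonce = os.urandom(16)
    stream = hmac.new(key, nonce, hashlib.sha256).digest()
    assert len(plaintext) <= len(stream)  # toy limit: one keystream block
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body

def ephemeral_decrypt(ciphertext: bytes, key: bytes = KEY) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = hmac.new(key, nonce, hashlib.sha256).digest()
    return bytes(c ^ s for c, s in zip(body, stream))
```

Encrypting the same MSISDN twice yields two unrelated-looking ciphertexts, yet both decrypt to the original once they return inside the secure boundary.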
[0033] Optionally, the method may further comprise applying
transformations to further original portions of the received data,
the further transformations selected from the group consisting of:
hashing; redacting; filtering; find and replacing; replacement with
random values; validation; and masking. Therefore, the transformed
data may contain data fields transformed in different ways. These
transformations may be preconfigured or based on the type of the
original data, for example.
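Several of these field-level transforms are simple enough to sketch directly; the function names below are illustrative, not the product's API. Masking follows the "12****78" style used in the worked example, and hashing is one-way but deterministic:

```python
import hashlib

def mask(value: str, keep_start: int = 2, keep_end: int = 2) -> str:
    # Keep the outer characters and mask the middle: "12345678" -> "12****78".
    middle = "*" * (len(value) - keep_start - keep_end)
    return value[:keep_start] + middle + value[-keep_end:]

def hash_field(value: str) -> str:
    # One-way: the original cannot be recovered, but equal inputs hash to
    # the same output, so joins across records still work.
    return hashlib.sha256(value.encode()).hexdigest()

def redact(value: str) -> str:
    # Remove the content entirely.
    return ""

def find_and_replace(value: str, find: str, replace: str) -> str:
    return value.replace(find, replace)
```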
[0034] Preferably, any one or more of the configuration resources
may be encrypted. For example, any or all of the transformation
configuration resource, the anonymisation configuration resource,
the data description configuration resource, the output
configuration resource, an input configuration resource, or an
interface configuration resource may be encrypted to increase
security.
[0035] Optionally, the received data may be selected from one or
more of the group consisting of: XML; delimited; fixed width; YAML;
SOAP; SMPP; and UCP/EMI. Other data types may be used.
[0036] According to a second aspect there is provided an
anonymisation system comprising: [0037] an interface configured to
receive data to be anonymised; [0038] a data store; and [0039]
logic configured to: [0040] apply one or more transformations to
the received data according to a transformation configuration
resource, wherein the one or more transformations include
transforming at least an original portion of the received data into
a transformed portion, wherein the original portion of the received
data is recoverable from the transformed portion using information
stored within the data store.
[0041] Preferably, the interface is further configured to transmit
the transformed portion, or the transformed portion together with
unchanged or untransformed portions of the received data, outside
of the anonymisation system.
[0042] The methods described above may be implemented as a computer
program comprising program instructions to operate a computer. The
computer program may be stored on a computer-readable medium.
[0043] The methods described above may be implemented as a complete
anonymisation system.
[0044] It should be noted that any feature described above may be
used with any particular aspect or embodiment of the invention.
BRIEF DESCRIPTION OF THE FIGURES
[0045] The present invention may be put into practice in a number
of ways and embodiments will now be described by way of example
only and with reference to the accompanying drawings, in which:
[0046] FIG. 1 shows a flow diagram of a method for anonymising
data, given by way of example only;
[0047] FIG. 2 shows a flow diagram of a method for deanonymising
data;
[0048] FIG. 3 shows a schematic diagram of a system for performing
the methods of FIGS. 1 and 2;
[0049] FIG. 4 shows a flow diagram of a workflow for performing the
method of FIG. 1;
[0050] FIG. 5 shows a class diagram of classes used within a system
performing the methods of FIGS. 1 and 2;
[0051] FIG. 6 shows a schematic high level architecture diagram of
a system for performing the methods of FIGS. 1 and 2;
[0052] FIG. 7 shows example input data and example output data
following application of the method of FIG. 1;
[0053] FIG. 8 shows example input data and example output data
following application of the method of FIG. 1;
[0054] FIG. 9 shows functional and non-functional requirements of a
system for implementing the methods of FIGS. 1 and 2; and
[0055] FIG. 10 shows a table of use cases that may be performed by
the method of FIG. 1.
[0056] It should be noted that the figures are illustrated for
simplicity and are not necessarily drawn to scale.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0057] FIG. 1 is a simplified flow diagram of a first embodiment of
the anonymisation system, illustrating the process of anonymising a
data file/set from a source repository, suitable for transferring
to a third party repository.
[0058] FIG. 2 is a simplified flow diagram of a second embodiment
of the anonymising system, illustrating the process of
deanonymising a data file received from a third party repository,
suitable for uploading back into the source repository.
[0059] Example--Source Repository being a Mobile Network Operator.
A customer data set for a mobile network customer is stored on a
CDR Repository within a secure environment. The customer data set
comprises sensitive data items, as shown in use case of FIG. 10,
and non-sensitive data items. When the data set is to be sent to a
third party outside the secure environment, one or more of the
sensitive data items or portions in the customer data set is
transformed or anonymised by the anonymisation system according to
the rules predefined for that sensitive data set (use case 19, as
illustrated in FIG. 10). The anonymised data set is transmitted to
the third party. The transmitted anonymised data set comprises both
anonymised data items and non-anonymised data items.
[0060] The third party then performs processing on the anonymised
data set and adds at least one item of data enrichment to the data
set. This data enrichment item can be any additional data which is
dependent on at least one data item from the anonymised data set.
For example, the cell-ID, which in use case 19 has not been
anonymised, could be used by the third party as an approximate
location identifier for the customer. Using this location
identifier the third party adds a contextual text message data item
to the data set.
[0061] The enriched or amended anonymised data set is then
transmitted back to the secure environment. The de-anonymisation
system then reads the incoming data set and de-anonymises the
anonymised data items.
[0062] The de-anonymised data set and the contextual text message
are transmitted within the secure environment to an SMSC (Short
Message Service Center) which uses the de-anonymised data to send
the contextual text message to the customer via SMS (Short Message
Service).
[0063] In the above example, the third party was provided with
enough information to allow them to send a targeted message based
on location to a customer without having any direct access to the
customer, and importantly, without any private and personal
information about the customer being transmitted outside of the
secure environment. By de-anonymising the sensitive data items when
the anonymised data set is returned to the secure environment, the
enriched data set can be associated back to the customer and the
enriched data can be utilised.
[0064] The configuration files used to specify which data items
should be anonymised, filtered and/or masked, and the configuration
files defining the layout of the transformed data set, may be
varied. The inbound transformations need not be the same as the
outbound transformations.
[0065] The described invention is a configurable approach to
addressing data security (for example, by anonymising outgoing
data) and data privacy (for example, by masking and/or filtering
outgoing data).
[0066] FIG. 3 illustrates schematically the basic internal
components of, and data flows within, an anonymisation system 10,
together with key external interfaces. The "wall" at the top of the diagram
represents a security boundary between existing input and output
systems which the anonymisation system 10 creates.
[0067] An example anonymisation system 10 consists of three logical
layers: [0068] Data Interfaces--This layer is responsible for
reading and writing data from various raw sources. The data
interface passes the data to the Data Reader/Writer layer for
processing. The supported interfaces are:
[0069] File system
[0070] HTTP/HTTPS
[0071] TCP/IP
[0072] Database
[0073] Messaging
[0074] Data Readers/Writers--This layer is responsible for parsing
a variety of data formats, transforming individual data fields by
using the transforms within the Transform Engine, and repackaging
the result into the same output format for onward transmission. The
supported data formats are:
[0075] Delimited
[0076] Fixed Width
[0077] XML
[0078] HTML
[0079] YAML
[0080] SMPP
[0081] UCP
[0082] HTTP
[0083] SOAP
[0084] Transform Engine--This is responsible for transforming
individual data fields in a variety of ways, in order to anonymise
and de-anonymise them. The supported transforms are as follows:
[0085] Filtering
[0086] Masking
[0087] Ephemeral Encryption/Decryption
[0088] Format Preserving Encryption/Decryption *
[0089] Hashing *
[0090] Find and Replace
[0091] Redaction
[0092] Validation
[0093] Random Number Generation *
[0094] Detokenisation *
[0095] * Starred transforms are "tokenisable transforms", which
means tokenisation can be turned on for them. Tokenising is
explained in detail later in the description. The detokenisation
transform is used to reverse tokenisable transforms.
[0096] The following is a summary of the method carried out by the
anonymisation system 10:
[0097] The anonymisation system 10 ingests data from an
interface;
[0098] The data is interpreted into records/fields by a
reader/writer;
[0099] Fields may be modified by one or more transforms defined in
a transformset or transformation configuration resource;
[0100] The transformed data is returned into its original or
similar format by a reader/writer; and
[0101] The anonymisation system 10 transmits the data to its
destination via an interface.
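The summary above can be sketched as a single hypothetical "route" over a delimited reader/writer, where the transform set maps field positions to transform functions (all names here are illustrative, not the product's classes):

```python
def run_route(lines, transform_set):
    """One pass through a route: read records, transform fields, write records."""
    output = []
    for line in lines:                                # ingested via an interface
        fields = line.split(",")                      # reader: record -> fields
        for index, transform in transform_set.items():
            fields[index] = transform(fields[index])  # apply the transform set
        output.append(",".join(fields))               # writer: repackage
    return output                                     # transmitted via an interface
```

For example, `run_route(["12345678,Test,447777123456"], {0: lambda v: v[:2] + "****" + v[-2:]})` masks only the first field and leaves the rest of the record untouched.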
[0102] A number of transforms which have complex properties,
including encryption, are defined via "Transform Schemas". These
schemas allow a complex transform to be specified once and then
consistently used, possibly many times.
[0103] For example, a schema to encrypt a common field, e.g. an
MSISDN, could be used consistently across a number of routes and
interfaces to allow consistent encryption and decryption.
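The "same schema, same result" idea can be sketched with a toy keyed digit shift that keeps the value numeric and the same length. This is an illustrative assumption, not the patent's scheme: a real deployment would use a standardised format-preserving mode such as NIST FF1, which this is not:

```python
import hashlib
import hmac

def _digit_stream(key: bytes, length: int):
    # Derive a repeatable pseudo-random digit stream from the schema key.
    digest = hmac.new(key, b"msisdn-schema", hashlib.sha256).digest()
    return [digest[i % len(digest)] % 10 for i in range(length)]

def schema_encrypt(msisdn: str, key: bytes) -> str:
    # Shift each digit by the keyed stream, preserving length and digit-ness.
    stream = _digit_stream(key, len(msisdn))
    return "".join(str((int(d) + s) % 10) for d, s in zip(msisdn, stream))

def schema_decrypt(ciphertext: str, key: bytes) -> str:
    stream = _digit_stream(key, len(ciphertext))
    return "".join(str((int(d) - s) % 10) for d, s in zip(ciphertext, stream))
```

Because the stream depends only on the key, every route that shares the schema produces the same ciphertext for the same MSISDN, which is what makes consistent decryption on return possible.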
[0104] A glossary is provided, which explains the technical terms
used by this description.
[0105] The anonymisation system 10 is preferably multithreaded and
can perform many of these actions, at high speed, concurrently.
[0106] The anonymisation system 10 is stateless and maintains no
history or status of activities performed or in progress.
Furthermore, with the exception of tokenisation, it does not store
any data. Therefore transactions are atomic. Once an action is
complete, the system disregards the action and commences a new one.
Should the anonymisation system 10 be interrupted, for example by
system failure, then on restart (or by another resilient instance),
the entire transaction would need to be repeated. If the system
fails while processing data on a file based interface, the file
data would remain on the server in an unprocessed state, allowing a
system administrator to attempt to reprocess the data later. When
using a TCP/IP interface, if the system fails the TCP/IP connection
will be terminated and no further data will be processed. Data
could then be sent through the system again once it has been
restarted.
[0107] Variations in format and protocol between input and output
may be made. For example, this may include reading from a database
and writing to a file.
[0108] In one implementation, the anonymisation system 10 is a Java
application which can be run on any operating system with a Java
Virtual Machine (JVM). The minimum suggested version of Java is
1.6. For production environments, the following Operating Systems
are recommended:
[0109] Redhat Enterprise Linux 5; and
[0110] Debian Squeeze
[0111] Example suitable versions are:
[0112] Linux RHEL version 5.x
[0113] Debian Squeeze version 6
[0114] Java JRE Version 1.6
[0115] Tomcat Version 7
[0116] Jpam (if using the GUI) Version 1.1
[0117] Other environments may be used.
[0118] An example execution of the anonymisation system 10 may be
as follows:
[0119] Navigate to the "input" directory and open the "input.csv"
file using a text editor. Example input to the system may be as
follows:
[0120] 12345678,Test,447777123456
[0121] To submit the input data to the system, rename the
input.csv file to "input.csv.ready". The system picks it up,
processes it and writes the output to a new file in an output
directory. As shown in the example below, the first field has been
masked, the second filtered and the third partially encrypted,
i.e.:
[0122] "12****78"," ","448555422322"
[0123] Data Interfaces, Data Readers/Writers and Transform Engine
provide a flexible framework to receive, transform and output any
type of data. These may be configured via a configuration file in
XML format. The format for each component within the configuration
file is described below.
[0124] Configuration files are preferably stored securely in an
encrypted and digitally signed form.
[0125] XML Configuration Format
[0126] The data flow through the application may be defined in XML.
The high level structure recommended for the XML file is as
follows:
[0127] Interfaces
[0128] Reader/Writers
[0129] Transform Sets
[0130] Routes
[0131] A "Route" defines a data flow or an anonymisation procedure
through the system, linking together a Data Interface, a Data
Reader/Writer and the relevant set of transforms. The route or
anonymisation procedure may be defined by a configurable
anonymisation procedure resource. An example configurable
anonymisation procedure resource is shown below in XML format:
<transform>
  <type>validation</type>
  <field>msisdn</field>
  <properties>
    <regularExpression>[0-9]{15}</regularExpression>
    <actionOnFailure>REPLACE</actionOnFailure>
    <replacementValue>NOT A MSISDN</replacementValue>
  </properties>
</transform>
<transform>
  <type>validation</type>
  <field>msisdn</field>
  <properties>
    <regularExpression>[0-9]{15}</regularExpression>
    <actionOnFailure>REPLACE</actionOnFailure>
    <replacementValue>NOT A MSISDN</replacementValue>
    <logWarningFlag>false</logWarningFlag>
  </properties>
</transform>
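A validation transform of this shape can be exercised with a short sketch that parses the XML and applies the rule to a record; the parsing approach here is illustrative, not the product's actual configuration loader:

```python
import re
import xml.etree.ElementTree as ET

# A validation transform in the same shape as the configuration above.
TRANSFORM_XML = """
<transform>
  <type>validation</type>
  <field>msisdn</field>
  <properties>
    <regularExpression>[0-9]{15}</regularExpression>
    <actionOnFailure>REPLACE</actionOnFailure>
    <replacementValue>NOT A MSISDN</replacementValue>
  </properties>
</transform>
"""

def apply_validation(record, transform_xml):
    # Validate the named field against the configured regular expression;
    # on failure, act per actionOnFailure (only REPLACE is sketched here).
    cfg = ET.fromstring(transform_xml)
    field = cfg.findtext("field")
    props = cfg.find("properties")
    pattern = props.findtext("regularExpression")
    if not re.fullmatch(pattern, record[field]):
        if props.findtext("actionOnFailure") == "REPLACE":
            record[field] = props.findtext("replacementValue")
    return record
```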
[0132] Failed Routes
[0133] If a failure occurs on an interface which means the route
either cannot start correctly, or continue to run, it will be moved
to a pool of failed routes. A RouteManager thread runs in the
background all the time that the anonymisation system 10 is
running, and periodically attempts to restart the failed routes. By
default, this period is set to every 30 seconds, but this is
configurable. FIG. 4 shows schematically the process 100 carried
out by a Route Manager.
[0134] If a route is successfully restarted, it will be removed
from the failed routes pool. If a route fails to restart, it will
remain in the failed routes pool until the next time the Route
Manager attempts to start the failed routes.
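A single retry pass of such a Route Manager might look like the sketch below; in the real system this would run on a background thread at the configured period, and the function names are hypothetical:

```python
def retry_failed_routes(failed_routes, start_route):
    """Attempt to restart each failed route: routes that restart leave the
    failed pool, the rest remain until the next scheduled attempt."""
    still_failed = []
    for route in failed_routes:
        if start_route(route):      # successfully restarted: drop from pool
            continue
        still_failed.append(route)  # stays in the failed pool
    return still_failed
```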
[0135] Interface types may be defined in the configuration file and
loaded when the anonymisation system 10 is started. New interfaces
can be defined using product Extension API.
The following interfaces may be supported.
[0136] File Based: The anonymisation system 10 will read data files
from a configurable input directory. Transformed files are written
back out to an output directory.
[0137] HTTP(S): The anonymisation system 10 listens for HTTP
messages on a configurable address. It then forwards transformed
messages to an output port.
[0138] TCP/IP: The anonymisation system 10 can listen for any
generic communications over a TCP/IP socket.
[0139] Database: The anonymisation system 10 can read raw data from
a database table and write back to another table.
[0140] Messaging: The anonymisation system 10 can process messages
from a JMS queue and put the result back on another queue or
topic.
[0141] Every interface may have two generic configuration
parameters: [0142] name--The name given to the interface being
defined. This is used in the Route to reference the interface
[0143] type--The type of interface being configured. Possible
values (case sensitive) are:
[0144] filesystem [0145] httpinterface [0146] tcpipinterface [0147]
databaseinterface [0148] jms
[0149] For example:
TABLE-US-00003 <interface>
<name>interfaceName</name>
<type>interfaceType</type> <properties> ...
Specific properties go here ... </properties>
</interface>
[0150] File System Interface
[0151] The file system interface has the following properties
available for configuration. [0152] inputDirectory--The path of the
directory to scan for new files [0153] outputDirectory--The path of
the directory to write output files to [0154] inputSuffix--Optional
filter to only process files ending in a certain suffix [0155]
removeInputSuffix--Whether or not to remove the suffix from the
incoming file name when it is written to the output [0156]
outputSuffix--Optional suffix to append to the outgoing file once
it is fully written. Defaults to .ready [0157]
finishedSuffix--Optional suffix to add to the incoming file once it
is fully processed. Defaults to .done [0158]
processingSuffix--Suffix to append to the input and output files
while the data is being processed. Defaults to .processing [0159]
pollingFrequency--How often to check the input directory for new
files in milliseconds. Defaults to 10000 (10 seconds)
[0160] Example Configuration File Section
[0161] The following is an example of the section of XML required
to define the file system interface.
TABLE-US-00004 <interface>
<name>FileInterface</name>
<type>filesystem</type> <properties>
<inputDirectory>/Data/in</inputDirectory>
<outputDirectory>/Data/out</outputDirectory>
<inputSuffix>.xml</inputSuffix>
<removeInputSuffix>false</removeInputSuffix>
<processingSuffix>.proc</processingSuffix>
<outputSuffix>.pickup</outputSuffix>
<finishedSuffix>.finished</finishedSuffix>
<pollingFrequency>20000</pollingFrequency>
</properties> </interface>
[0162] This interface will poll every 20 seconds for files in the
"/Data/in" directory (relative paths from the location where the
anonymisation system 10 was started are allowed, but it is
recommended that absolute paths be used to avoid confusion). The
interface will pick up any files with the ".xml" suffix, and the
resulting output files in "/Data/out" will end with .xml.pickup
(since the input suffix is not being removed).
[0163] If multiple files with the same file name are inserted into
the input directory for processing by the anonymisation system 10
(for example, a second file is inserted after the first file has
been processed) there may be collisions when the anonymisation
system 10 attempts to rename files.
[0164] In order to avoid this, the anonymisation system 10 may
attempt to identify filenames that have previously been processed
and for which the processed files are still present in the input or
output directories.
[0165] A unique file name may be assigned to the input file which
does not clash with any of the processing or processed files in the
input and output directories. Where a collision is found, a number
will be appended onto the end of the base file name. For example:
[0166] Suppose that, using the above configuration, a file test.xml
is inserted into the input directory. [0167] This file will be
processed by the anonymisation system 10 and will result in a
test.xml.finished file in the input directory, and a
test.xml.pickup file in the output directory. [0168] Now if another
file called test.xml is dropped into the input directory, the
anonymisation system 10 will notice the existing processed files
and will rename the file to "test.xml1" before processing. [0169]
The resulting processed files would then be test.xml1.finished and
test.xml1.pickup in the input and output directories
respectively.
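The collision-avoidance renaming described above may be sketched as follows. The suffixes used here match the example configuration; in the product they are configurable, and the function name is illustrative.

```python
def unique_input_name(base_name, existing, suffixes=(".finished", ".pickup")):
    """Sketch of the collision-avoidance rename described above: if files
    derived from base_name are still present in the input or output
    directories, append an increasing number to the base name until no
    derived file name clashes."""
    def clashes(name):
        # A name clashes if any processed/output variant of it exists.
        return any(name + s in existing for s in suffixes)

    if not clashes(base_name):
        return base_name
    n = 1
    while clashes(base_name + str(n)):
        n += 1
    return base_name + str(n)
```

With the example above, a second test.xml arriving while test.xml.finished and test.xml.pickup still exist would be assigned the name test.xml1.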
[0170] Files appearing in the input directory are created by virtue
of an "atomic operation" when ready. This means that a partially
written file cannot be picked up by the system.
[0171] Error Handling
[0172] If an I/O error occurs on the interface (reading and writing
files to disk) the route may fail and be moved to the failed routes
pool.
[0173] HTTP/HTTPS Interface
[0174] The HTTP(S) interface listens on a TCP/IP connection for
HTTP or HTTPS protocol requests on a configured address. The
content portion of the request is considered the data which is
interpreted by the Data Readers and transformed accordingly. The
interface has the following configurable properties: [0175]
listeningHostname--The interface will listen on this network
address. Defaults to "localhost" [0176] listeningPort--The
interface will listen on this network port [0177] listeningTLS
(Transport Layer Security)--whether the anonymisation system 10
server is using HTTPS for this route [0178] outgoingHostname--The
interface will create an outgoing connection to this network
address [0179] outgoingPort--The interface will create an outgoing
connection to this network port [0180] outgoingTLS (Transport Layer
Security)--Whether the downstream server is using HTTPS. [0181]
transformType--Specifies which direction the data is to be
transformed in. Data can be transformed in the HTTP Request Body,
the HTTP Response body, or both. The value of this field must be
REQUEST, RESPONSE or REQUESTRESPONSE respectively [0182]
keyProvider--the keyprovider class used for https connections.
[0183] For HTTPS, appropriate certificates may be installed in the
Java HTTPS keystore.
[0184] The following is an example of the section of XML required
to define the HTTPS interface.
TABLE-US-00005 <interface>
<name>HTTPInterface</name>
<type>httpinterface</type> <properties>
<outgoingHostname>10.20.0.221</outgoingHostname>
<outgoingPort>6051</outgoingPort>
<outgoingTLS>true</outgoingTLS>
<transformType>REQUESTRESPONSE</transformType>
<listeningHostname>localhost</listeningHostname>
<listeningPort>6050</listeningPort>
<listeningTLS>true</listeningTLS>
<keyProvider>keyProvider</keyProvider>
</properties> </interface>
[0185] TCP/IP Interface
[0186] The TCP/IP interface listens on a configured address for
TCP/IP connections. Once connected, data can be passed and
transformed in either direction on the socket. The raw data
arriving is passed directly to the Data Reader/Writer for
transformation. The interface has the following configurable
properties: [0187] listeningHostname [0188] listeningPort [0189]
outgoingHostname [0190] outgoingPort
[0191] When a connection is established on the specified incoming
port, a new Socket will be opened, a new connection will be
established to the outgoing address and the corresponding input and
output data streams for both directions will be passed down to the
Data Reader/Writers. The application will then continue to listen
on the specified port. A Reader/Writer of the same data type will
be created in each direction. Transforms can be configured to act
in either direction.
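The per-connection behaviour described above may be sketched as a small relay. The Data Reader/Writer layer is reduced to a byte-level transform function, error handling and continued listening are omitted, and all names are assumptions for this sketch.

```python
import socket
import threading

def relay(listen_port, target_host, target_port, transform=lambda b: b):
    """Sketch of the TCP/IP interface's connection handling described
    above: accept an incoming connection, open an outgoing connection,
    and pump data in both directions, applying a transform on the
    outbound path."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", listen_port))
    srv.listen(1)

    def pump(src, dst, fn):
        # Copy data from src to dst, transforming each chunk, until EOF.
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(fn(data))

    def serve():
        conn, _ = srv.accept()  # incoming connection on the listening port
        out = socket.create_connection((target_host, target_port))
        # One pump per direction; transforms can act in either direction.
        threading.Thread(target=pump, args=(conn, out, transform),
                         daemon=True).start()
        threading.Thread(target=pump, args=(out, conn, lambda b: b),
                         daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return srv  # caller can query srv.getsockname() for the bound port
```

Passing the streams for both directions down to the Reader/Writers corresponds here to the two pump threads per connection.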
[0192] The following is an example of the section of XML required
to define the TCP/IP interface.
TABLE-US-00006 <interface>
<name>TCPIPInterface</name>
<type>tcpipinterface</type> <properties>
<outgoingHostname>1.2.3.4</outgoingHostname>
<outgoingPort>8080</outgoingPort>
<listeningHostname>localhost</listeningHostname>
<listeningPort>9201</listeningPort> </properties>
</interface>
[0193] In this case, this interface will listen on port 9201 and
make a connection to port 8080 on IP address 1.2.3.4. "localhost"
may be used for the outgoing hostname if the destination
application is hosted on the same server as the anonymisation
system 10.
[0194] Database Interface
[0195] The database interface reads raw data from a database table
and inserts transformed data into another table. The input database
table must consist of a primary key column and a data column. The
interface has the following configurable properties: [0196]
inputDriver--The Java driver class for the input database. (e.g.
"com.mysql.jdbc.Driver" for MySql,
"oracle.jdbc.driver.OracleDriver" for Oracle). Various database
drivers are available for each database implementation. [0197]
inputURL: The JDBC URL of the input database server. (e.g.
"jdbc:mysql://1.2.3.4"). [0198] inputUser: The user name for the
input database. [0199] inputPassword: The password for the input
database. [0200] inputDBName: The name of the input database
schema. [0201] tableName: The database table name to poll for new
rows. This must be the same for the input and output databases.
[0202] primaryKey: the primary key column of the database. [0203]
dataColumn: the data column to transform. [0204] outputDriver: The
driver for the output database. (e.g. "com.mysql.jdbc.Driver").
[0205] outputURL: The JDBC URL of the output database server. (e.g.
"jdbc:mysql://1.2.3.4"). [0206] outputUser: The user name for the
output database. [0207] outputPassword: The password for the output
database.
[0208] outputDBName: The name of the output database schema.
[0209] The database interface will read all rows in the input
table, passing the data from the data column to the reader writer
layer for each row. Once the data has been successfully
transformed, the transformed data will be written to the output
database and the original row from the input database will be
deleted.
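The read-transform-write-delete cycle described above may be sketched as follows, with sqlite3 standing in for the configurable JDBC drivers the product uses; table and column names are taken from the example configuration.

```python
import sqlite3

def process_rows(in_db, out_db, table, key_col, data_col, transform):
    """Sketch of the database interface's cycle described above: read
    every row from the input table, transform the data column, insert
    the result into the output table, then delete the source row."""
    rows = in_db.execute(
        "SELECT {k}, {d} FROM {t}".format(k=key_col, d=data_col, t=table)
    ).fetchall()
    for key, data in rows:
        out_db.execute(
            "INSERT INTO {t} ({k}, {d}) VALUES (?, ?)".format(
                t=table, k=key_col, d=data_col),
            (key, transform(data)))
        # Delete only after the transformed row has been written out.
        in_db.execute(
            "DELETE FROM {t} WHERE {k} = ?".format(t=table, k=key_col),
            (key,))
    in_db.commit()
    out_db.commit()
```

Deleting each input row only after the transformed row is written means a failure mid-cycle leaves the unprocessed rows in place for the next poll.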
[0210] The following is an example of the section of XML required
to define a database interface:
TABLE-US-00007 <interface>
<name>databaseInterface</name>
<type>databaseinterface</type> <properties>
<inputDriver>com.mysql.jdbc.Driver</inputDriver>
<inputUrl>jdbc:mysql://1.2.3.4:3306</inputUrl>
<inputUser>user1</inputUser>
<inputPassword>password</inputPassword>
<inputDBName>inputSchema</inputDBName>
<tableName>dataTable</tableName>
<primaryKey>key</primaryKey>
<dataColumn>data</dataColumn>
<outputDriver>com.mysql.jdbc.Driver</outputDriver>
<outputUrl>jdbc:mysql://1.2.3.4:3306</outputUrl>
<outputUser>user1</outputUser>
<outputPassword>password</outputPassword>
<outputDBName>inputSchema</outputDBName>
</properties> </interface>
[0211] Error Handling
[0212] If a database connectivity issue occurs, the route may fail
and be moved to the failed routes pool (see FIG. 2). If there is a
parsing error with any of the data in a row, an error will be
logged and the offending row will remain in the input table. When
the database is polled again, the anonymisation system 10 will
attempt to process the row again.
[0213] Messaging Interface
The messaging interface is used for reading messages from a JMS
queue and writing them to another queue or topic.
[0214] The configuration parameters are: [0215] brokerUrl--the url
of the JMS broker to connect to. [0216] username
(optional)--username to use if authentication is required. [0217]
password (optional)--password to use if authentication is required.
[0218] inputQueue--the name of the queue to listen for messages.
[0219] outputDestination--the name of the queue/topic to send
messages after transformations have been applied. [0220] outputType
(queue/topic)--whether the output destination is a queue or a
topic. [0221] errorQueue (optional)--queue to send messages that
can't be processed due to an error occurring.
[0222] An example XML configuration section for the Messaging
Interface is as follows:
TABLE-US-00008 <interface>
<name>JMSInterface</name> <type>jms</type>
<properties>
<brokerUrl>tcp://localhost:61616</brokerUrl>
<username>secureserve</username>
<password>password</password>
<outputType>topic</outputType>
<inputQueue>input</inputQueue>
<outputDestination>output</outputDestination>
<errorQueue>error</errorQueue> </properties>
</interface>
[0223] Reader and Writer Configuration
[0224] The data reader/writer configuration consists of a specified
data type and a set of fields which are available to be
transformed. A field represents a particular piece of information
in a specified location in the incoming data stream. For example,
if the data type is HTML, a field could be a particular element,
defined by its XPath location. The configuration to define where a
field is located in the input data is called the "Field
Definition". The format of this parameter is described for each
reader in this section. The supported data types are listed
below.
TABLE-US-00009

  Feature      Description
  -----------  -------------------------------------------------------
  Fixed Width  The anonymisation system will read standard fixed width
               format data.
  Delimited    The anonymisation system will read standard delimited
               format data, including CSV files.
  XML          The anonymisation system will interpret simple XML
               data, where each field is encapsulated within a single
               tag.
  HTML         The anonymisation system will interpret simple HTML
               data, where each field is encapsulated within a single
               tag.
  SOAP         The anonymisation system will interpret simple SOAP
               data, where each field is encapsulated within a single
               tag.
  HTTP         The anonymisation system will interpret fields within
               an HTTP request.
  YAML         The anonymisation system will read YAML object data.
  SMPP         The anonymisation system will interpret the source and
               destination address fields of SMPP v3.4 protocol
               messages.
  UCP/EMI      The anonymisation system will interpret the address
               code fields of EMI-UCP v4.3c protocol messages.
[0225] Every reader/writer has two generic configuration
parameters: [0226] name--The name given to the reader/writer being
defined. This is used in the Route to reference the reader/writer
[0227] type--The type of reader/writer being configured. Valid values
(case sensitive) are: fixedwidth, delimited, xml, html, soap,
httpreaderwriter, yaml, smpp and ucp
[0228] The rest of this section describes the specific configurable
properties for each reader/writer.
[0229] Delimited Reader
[0230] The delimited reader will read a stream of delimited data,
split it into individual rows and fields, pass fields to the
configured transforms and repackage the resulting delimited data
stream. The configurable properties for the delimited reader are as
follows: [0231] separatorCharacter--The character used to delimit
the fields in a row [0232] quoteCharacter--The character used to
surround each field; fields need not be quoted in the input. Defaults to ''
[0233] escapeCharacter--The escape character, used to allow quote
characters within fields. Defaults to \ [0234] linesToSkip--The
number of lines in the header of incoming data. These will be
skipped for processing and can be configured to be appended without
change to the output. Defaults to 0 [0235] copySkipLines--Whether
to include skipped header lines in the output. Defaults to true
[0236] footerLines--The number of lines in the footer of the
incoming data. These will be skipped and configured to be appended
without change to the output. Defaults to 0 [0237]
copyFooter--Whether to include the skipped footer data in the
output. Defaults to true [0238] newline--The newline string to use
in the output. Defaults to the standard new line for the operating
system on which the anonymisation system 10 is running. For UNIX
based systems this is usually a single line feed character and for
Windows it is a carriage return followed by a line feed. [0239]
filterField--When performing filter transforms on delimited data,
it may be desirable to include a blank field in the output instead
of removing the field completely, in order to preserve the number
of columns in the output data. This parameter specifies whether
filtered fields will be completely removed from the outgoing data
or whether blank fields will be included in their place. Defaults
to false, meaning that a filtered field will be included as a blank
value in the output.
[0240] The "Field Definition" for delimited data is the 0-based
index which corresponds to the field in the incoming data.
Optionally, the fields may be reordered, in which case the field
definition should be a comma separated pair of the initial index
and the desired output index.
[0241] An example of the XML section to configure the delimited
reader is as follows:
TABLE-US-00010 <reader>
<name>ThreeXFormReader</name>
<type>delimited</type> <properties>
<separatorCharacter>,</separatorCharacter>
<quoteCharacter>"</quoteCharacter>
<filterField>true</filterField> </properties>
<fields> <field> <name>field1</name>
<definition>0</definition> </field> <field>
<name>field2</name>
<definition>1</definition> </field> <field>
<name>field3</name>
<definition>2</definition> </field>
</fields> </reader>
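The 0-based field addressing described above may be sketched as follows, with Python's csv module standing in for the product's delimited parser; header/footer handling and field reordering are omitted, and the function name is illustrative.

```python
import csv
import io

def transform_delimited(text, transforms, sep=",", quote='"'):
    """Sketch of the delimited reader/writer round trip described above:
    split the stream into rows and fields, apply transforms keyed by the
    0-based field index (the "Field Definition"), and re-emit delimited
    data."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=sep, quotechar=quote,
                        lineterminator="\n")
    for row in csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote):
        # Fields without a configured transform pass through unchanged.
        writer.writerow([transforms.get(i, lambda v: v)(v)
                         for i, v in enumerate(row)])
    return out.getvalue()
```

For example, a transform mapped to field definition 1 would affect only the second column of every row.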
[0242] Fixed Width Reader
[0243] The Fixed Width Reader is responsible for reading lines of
fixed width data. The configurable properties for the fixed width
reader are: [0244] linesToSkip--The number of lines in the header
of incoming data. These will be skipped for processing and can be
configured to be appended without change to the output. Defaults to
0 [0245] copySkipLines--Whether to include skipped header lines in
the output. Defaults to true. Must be true or false [0246]
footerLines--The number of lines in the footer of the incoming
data. These will be skipped and configured to be appended without
change to the output. Defaults to 0 [0247] copyFooter--Whether to
include the skipped footer data in the output. Defaults to true.
Must be true or false [0248] newline--The newline string to use in
the output. Defaults to the standard new line for the operating
system on which the anonymisation system 10 is running. [0249]
fixedFormat--Whether to enforce the same width fields on the
outgoing data stream as the input. This means that any fields
shorter than the input field after transformation will be padded
with trailing spaces. Defaults to true. It is invalid to configure
a transform which will produce a field of a greater length than the
input, e.g. Ephemeral Encryption.
[0250] The Field Definition for Fixed Width data is a comma
separated pair of the field's start position within the line (0
based) and its length. Only the fields that are to be transformed
need to be specified; the reader will copy all unspecified fields
through untransformed.
[0251] An example of the XML section to configure the fixed width
reader is as follows:
TABLE-US-00011 <reader>
<name>FixedWidthReader</name>
<type>fixedwidth</type> <properties>
<linesToSkip>0</linesToSkip>
<fixedFormat>true</fixedFormat> </properties>
<fields> <field> <name>field1</name>
<definition>0,5</definition> </field>
<field> <name>field2</name>
<definition>5,4</definition> </field>
<field> <name>field3</name>
<definition>9,10</definition> </field>
</fields> </reader>
[0252] If fixedFormat is specified, and the transformed length is
less than the length of the original field, the transformed field
will be padded with spaces.
[0253] If fixedFormat is specified, and the transformed length is
greater than the length of the original field, an error will be
thrown.
[0254] If fixedFormat is set to false, the output will be a
concatenation of all the fields after they have been
transformed.
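The fixedFormat rules in the three paragraphs above may be sketched as a single field-emission function; the name is illustrative.

```python
def emit_fixed_field(value, width, fixed_format=True):
    """Sketch of the fixedFormat behaviour described above: with
    fixedFormat enabled, short transformed fields are padded with
    trailing spaces and over-long fields are an error; with it disabled,
    the value is emitted as-is and fields are simply concatenated."""
    if not fixed_format:
        return value
    if len(value) > width:
        raise ValueError("transformed field is longer than the fixed width")
    return value.ljust(width)  # pad with trailing spaces
```

This is why a length-increasing transform such as Ephemeral Encryption is invalid on a fixedFormat field.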
[0255] XML
[0256] An XML reader is responsible for reading XML data and
extracting fields to transform based on XPath expressions. XPath
can be used to define specific elements or attributes to be
transformed; these are collectively known as nodes. A configurable
property for the XML reader is: [0257] filterNode--Where a field is
filtered using the filter transform, this field defines whether to
completely remove the filtered node from the output XML or just to
set the value of the node to be blank.
[0258] Fields are configured by an XPath expression. All nodes
matching the expression belong to the same field. The text content
of the node is the field value which will be transformed.
[0259] An example of the XML configuration for the XML reader is as
follows (See section 2.9 for a full configuration file
example):
TABLE-US-00012 <reader> <name>XMLReader</name>
<type>xml</type> <fields> <field>
<name>title</name>
<definition>/book/title</definition> </field>
<field> <name>author</name>
<definition>/book/author</definition> </field>
<field> <name>description</name>
<definition>/book/descr</definition> </field>
</fields> </reader>
[0260] For example, the following XML data could be used as input
to this reader:
TABLE-US-00013 <book> <title>Title</title>
<author>Author</author>
<descr>Description</descr> </book>
[0261] In this case, the values "Title", "Author", "Description"
would be picked up for transformation by the fields "title",
"author", "description" in the configuration file. For example, if
the destination system is dependent on the value of a specific
element, the transform should not be configured to set the value of
this element to an invalid value.
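The XML reader behaviour above may be sketched with the standard library; ElementTree supports only a limited, relative-path XPath subset, which here stands in for the product's full XPath support, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

def transform_field(xml_text, xpath, fn):
    """Sketch of the XML reader described above: all nodes matching a
    field's XPath expression belong to the same field, and the text
    content of each matching node is transformed."""
    root = ET.fromstring(xml_text)
    for node in root.findall(xpath):  # every match is the same field
        node.text = fn(node.text)
    return ET.tostring(root, encoding="unicode")
```

In the book example above, the field definition /book/title would correspond to the relative path "title" from the document root in this sketch.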
[0262] HTML
[0263] An HTML reader is responsible for reading HTML data and
extracting fields to transform based on XPath expressions. The
configurable property for the HTML reader is: [0264]
filterNode--Where a field is filtered using the filter transform,
this field defines whether to completely remove the filtered node
from the output HTML or just to set the value of the node to be
blank
[0265] Fields are configured by an XPath expression. All nodes
matching the expression belong to the same field. The text content
of the node is the field value which will be transformed. An
example of the XML configuration for the HTML reader is as
follows:
TABLE-US-00014 <reader> <name>HTMLReader</name>
<type>html</type> <fields> <field>
<name>field1</name>
<definition>/html/body/h1</definition> </field>
<field> <name>field2</name>
<definition>/html/body/h2</definition> </field>
<field> <name>field3</name>
<definition>/html/body/h3</definition> </field>
</fields> </reader>
[0266] For example, the following HTML data could be used as input
to this reader:
TABLE-US-00015 <html> <body> <h1>Heading
1</h1> <h2>Heading 2</h2> <h3>Heading
3</h3> </body> </html>
[0267] In this case, the values "Heading 1", "Heading 2", "Heading
3" would be picked up for transformation by the fields "field1",
"field2", "field3" in the configuration file.
[0268] SOAP
[0269] The SOAP reader is responsible for reading SOAP data and
extracting fields to transform based on XPath expressions. XPath
can be used to define specific elements or attributes to be
transformed; these are collectively known as nodes. The
configurable property for the SOAP reader is: [0270]
filterNode--Where a field is filtered using the filter transform,
this field defines whether to completely remove the filtered node
from the output SOAP data or just to set the value of the node to
be blank
[0271] Fields are configured by an XPath expression. All nodes
matching the expression belong to the same field. The text content
of the node is the field value which will be transformed.
[0272] An example of the XML configuration for the SOAP reader is
as follows:
TABLE-US-00016 <reader> <name>SOAPReader</name>
<type>soap</type> <fields> <field>
<name>title</name>
<definition>/book/title</definition> </field>
<field> <name>author</name>
<definition>/book/author</definition> </field>
<field> <name>description</name>
<definition>/book/descr</definition> </field>
</fields> </reader>
[0273] HTTP
[0274] The HTTP reader/writer is responsible for extracting and
transforming data from within an HTTP request body, and extracting
and transforming HTML elements using XPath in the HTTP response.
There are no configurable properties for the HTTP reader.
[0275] An example XML configuration for the HTTP reader is as
follows:
TABLE-US-00017 <reader> <name>HTTPReader</name>
<type>httpReaderWriter</type> <fields>
<field> <name>msisdn</name>
<definition>msisdn</definition> </field>
<field> <name>HTMLHeader1</name>
<definition>/html/body/div/span[@id='original']</definition>
</field> </fields> </reader>
[0276] This data reader/writer is effectively a composite reader
which processes HTTP request data on the outbound path, and
delegates to the HTML reader to transform HTML data on the HTTP
response. The field definition consists of the name of the field in
the case of a request, and an XPath expression in the case of the
response. In order to define which direction a transform is
applicable to, a property <direction> must be set within the
transform configuration. This value must be set to either OUTBOUND
or INBOUND, for request and response respectively.
[0277] The following is an example of the transform set
configuration for use with the reader definition above
TABLE-US-00018 <transformSet>
<name>HTTPTransform</name> <transforms>
<transform> <type>encrypt</type> <field>
msisdn </field> <direction>OUTBOUND</direction>
<properties> <schema>smokeencrypt</schema>
</properties> </transform> <transform>
<type>decrypt</type>
<field>HTMLHeader1</field>
<direction>INBOUND</direction> <properties>
<schema>smokeencrypt</schema> </properties>
</transform> </transforms> </transformSet>
[0278] YAML
[0279] The YAML reader is responsible for extracting and
transforming data from a YAML data stream. There are no
configurable properties for the YAML reader.
[0280] An example XML configuration for the YAML reader is as
follows:
TABLE-US-00019 <reader>
<name>tgwyamlreader1</name>
<type>yaml</type> <fields> <field>
<name>field1</name>
<definition>receipt</definition> </field>
<field> <name>field2</name>
<definition>customer.name</definition> </field>
<field> <name>field3</name>
<definition>items.{part_no}</definition> </field>
</fields> </reader>
[0281] Object-Graph Navigation Language (OGNL) is used as the
expression language to choose fields of data from a YAML object
map.
[0282] It is possible to specify a particular field in a list using
square brackets e.g. items[1].descrip. This would correspond to the
descrip field of the object at index 1 (zero-based) in the items
list. If the specified indexed item does not exist, a warning will
be logged stating that the system was unable to transform the field
definition, and the application will continue.
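The dotted-path and indexed selection described above may be sketched as follows. The product uses OGNL as its expression language; this sketch supports only dict keys and zero-based [n] list indexing, and the function name is illustrative.

```python
import re

def lookup(obj, path):
    """Sketch of field selection over a parsed YAML object map, as
    described above: follow dot-separated keys, with optional [n] list
    indexing; return None when an indexed item does not exist, mirroring
    the warn-and-continue behaviour."""
    cur = obj
    for part in path.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        key, index = m.group(1), m.group(2)
        cur = cur[key]
        if index is not None:
            i = int(index)
            if i >= len(cur):
                return None  # real system logs a warning and continues
            cur = cur[i]
    return cur
```

For example, items[1].descrip selects the descrip field of the object at index 1 of the items list.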
[0283] SMPP Protocol
[0284] SMPP Protocol v3.4 messages can be accepted as a data type
as per the specification [5]. This includes long SMS messages. The
following messages are available for transformation: [0285]
submit_sm [0286] deliver_sm [0287] query_sm [0288] cancel_sm [0289]
replace_sm [0290] data_sm [0291] submit_multi
[0292] In these messages only the following fields are available
for transformation (where available): [0293] source_addr [0294]
destination_addr
[0295] All other messages will be sent through the filter
unaffected.
[0296] An example SMPP reader configuration XML fragment is as
follows:
TABLE-US-00020 <reader> <name>smpp-data</name>
<type>smpp</type> <fields> <field>
<name>submit_dest</name>
<definition>submit_sm/destination_addr</definition>
</field> </fields> </reader>
[0297] Fields are configured by a slash separated pair of message
type and field name, both according to the SMPP specification.
[0298] UCP Reader
[0299] The UCP Reader will read messages according to the UCP-EMI
specification v4.3c [6]. The following Message Types are supported:
[0300] Call Input Operation [0301] SMT Alert Operation
[0302] All other Message Types will pass through the filter
unaffected.
[0303] The following fields are available for transform in the
outgoing message and response: [0304] AdC [0305] OAdC (Call Input
Operation only)
[0306] There are no configurable properties for the UCP reader.
[0307] An example XML configuration fragment for the UCP Reader is
as follows:
TABLE-US-00021 <reader> <name>ucp-data</name>
<type>ucp</type> <fields> <field>
<name>call_input_adc</name>
<definition>CallInputOperation/Operation/adc</definition>
</field> </fields> </reader>
[0308] Fields to transform are configured by a slash separated list
of message type, message operation and field name. Message type
must be one of CallInputOperation or SMTAlert. Please see the UCP
specification [6] for details of each message type. Message
direction must be either Operation or Result and field name must be
either adc or oadc.
[0309] Transform Configuration
[0310] Transforms are configured in the XML file by mapping
specific fields (defined by the reader/writers) to a transform
type, and specifying any required properties for the transform.
Multiple transforms can be applied to a single field.
[0311] The available transforms are described in the following
table, and this section details the functionality and configuration
parameters for each transform. Some transforms are "tokenisable",
meaning that the generated values will be stored against the input
values in a token store, for future lookup. See below in table 2
for more information on tokenisation.
TABLE-US-00022 TABLE 2

  Feature         Description                                Tokenisable
  --------------  -----------------------------------------  -----------
  Masking         All or part of the field value can be
                  masked with a chosen masking character.
  Encryption      A field value can be encrypted using a     Yes
                  configurable encryption algorithm
                  including industry standard AES
                  encryption.
  Decryption      An encrypted field value can be decrypted
                  to plain text with a configurable
                  algorithm including industry standard AES
                  encryption.
  Filtering       Fields can be completely removed from the
                  output, so they cannot be reconstructed
                  or retrieved in any way by the
                  destination system.
  Hashing         A field can be hashed by way of a keyed    Yes
                  hash function using a secret key located
                  in the application key store.
  Find and        Part of a field can be replaced with
  Replace         another value. Several pairs of values to
                  find and replace can be specified.
  Redaction       Part of a field can be removed from the
                  output (effectively find and replace,
                  replacing with nothing). The part which
                  is removed will be unrecoverable by the
                  destination system, in a similar way to
                  filtering.
  Validation      A field can be checked against a regular
                  expression, with various options for what
                  to do if the field does not match.
  Random Number   Generates a random number, irrespective    Yes
  Generation      of the value of the input field. Intended
                  to be used only with tokenisation
                  enabled.
  Detokenisation  Original input values can be restored by   Yes
                  looking up a token in a token store.
[0312] Tokenisation
[0313] Tokenisation enables the output of certain transforms to be
stored in a token store along with the input value which generated
them. In other words, transformed fields are recoverable. The token
value may be derivable from the input or original value (e.g. by an
encryption or other function) or may be unconnected. The
tokenisation process follows these steps: [0314] Check whether the
input value exists in the token store. [0315] If so, return the
corresponding token [0316] If not, run the underlying transform
(any described in table 2 as being tokenisable, for example) and
add the result to the token store.
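The three tokenisation steps above may be sketched as follows; an in-memory dict stands in for the product's database token store, and all names are illustrative.

```python
class TokenStore:
    """Sketch of the tokenisation flow described above: look up the
    input value first, and run the underlying transform only when no
    token exists yet."""

    def __init__(self, transform):
        self.transform = transform  # the underlying tokenisable transform
        self.tokens = {}            # input value -> token

    def tokenise(self, value):
        if value in self.tokens:           # step 1: existing token?
            return self.tokens[value]      # step 2: return it
        token = self.transform(value)      # step 3: run underlying transform
        self.tokens[value] = token         # ... and store the result
        return token

    def detokenise(self, token):
        """Reverse lookup; assumes a token is present for every value
        received, as the detokenisation transform does."""
        return next(v for v, t in self.tokens.items() if t == token)
```

Because the store is consulted first, the same input value always yields the same token, which is what makes the transformed fields recoverable.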
[0317] The anonymisation system 10 application comes with a
Database token store. The transforms that support tokenisation are:
[0318] Encryption [0319] Hashing [0320] Random number
generation
[0321] To reverse the tokenisation process, a detokenisation
transform can be used. This is effectively a tokenisable transform
with no underlying transform of its own; it assumes that a token is
present for every value received.
[0322] Transform Configuration Format
[0323] Transforms are configured as entries in a transform set,
which is defined in XML using the <transformSet> tag. A
transformSet is defined by the following configuration parameter:
[0324] name--The name to use for this transformSet, which the Route
will use to reference it
[0325] The following is an example configuration of a transform
set:
TABLE-US-00023 <transformSet>
<name>anonymise</name> <transforms>
<transform> <type>encrypt</type>
<field>field1</field> <properties>
<schema>fpe</schema> </properties>
</transform> </transforms> </transformSet>
[0326] The following configuration parameters may be present for
every transform: [0327] field--The name given to the field in the
reader/writer configuration to apply the transform to [0328]
type--The type of transform being configured. Valid values (case
sensitive) are filter, mask, encrypt, decrypt, hash,
findAndReplace, redaction, randomNumberGen, validation,
detokenisation
[0329] Additionally, the following two properties are optional for
every transform: [0330] sensitiveInput--Whether the input value
must be masked in log files. Defaults to true [0331]
sensitiveOutput--Whether the output value must be masked in log
files. Defaults to false
[0332] The rest of this section defines the configurable properties
for each transform type. Some of the properties may refer to
transform schemas, which are more complicated sections of XML,
rather than just a plain value. The use of properties to refer to
schemas is documented specifically for each transform type. See the
full configuration file at the end of the section for a full
example.
[0333] Filter Transform
[0334] The filter transform removes a field from the data. This may
mean removing the field entirely, or just removing the field's
value, depending on the data type. Example behaviour is defined in
the following paragraph.
[0335] The exact process for filtering is dependent on the specific
data reader/writer, as follows: [0336] Delimited data--The reader
can be configured to either completely remove the field or set the
field to be a blank value [0337] Fixed width data--The field will
be set to a blank value [0338] XML/HTML--The reader can be
configured to set filtered nodes values to be blank, or to remove
the entire node [0339] UCP--The field will be set to be a blank
value [0340] SMPP--The field will be set to be a blank value
[0341] Note: This transform is one-way and not reversible. A
filtered value cannot be reinstated.
[0342] An example of the XML required to configure the filter
transform is as follows:
TABLE-US-00024
<transform>
  <type>filter</type>
  <field>field1</field>
</transform>
[0343] Masking Transform
[0344] This transform replaces a subset of characters within a
field with a specified character.
[0345] The configurable properties available for the masking
transform are: [0346] anchor--Used to define a substring to mask.
Whether to work from the beginning or end of the input value when
applying the offset and numberMasked properties. If specified, this
must be START or END. Defaults to START. [0347] offset--The number
of characters from the anchor to skip before masking starts.
Defaults to 0. For example, if masking using the # character with
an anchor of START, an offset of 1 and a numberMasked of 4, "Hello"
would become "H####". [0348] numberMasked--the number of characters
to mask from the offset [0349] character--the character to use as a
mask. Defaults to *
[0350] For example, if character=*, anchor=START, offset=2 and
numberMasked=4: [0351] "Hello" would become "He***". [0352]
"SecureServe" would become "Se****Serve".
[0353] An example of the XML required to configure this transform
is as follows:
TABLE-US-00025
<transform>
  <type>mask</type>
  <field>MsisdnA</field>
  <properties>
    <anchor>START</anchor>
    <numberMasked>4</numberMasked>
    <offset>2</offset>
    <character>*</character>
  </properties>
</transform>
[0354] This example will mask the 3rd--6th characters in the input
(if present) with a series of * characters.
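A minimal sketch of the anchor/offset/numberMasked logic follows, assuming the out-of-range behaviour shown in the "Hello" example above (masking simply stops at the end of the value). The class name is hypothetical:

```java
// Illustrative sketch of the masking transform: replace a substring,
// located via anchor/offset/numberMasked, with a mask character.
public class MaskingSketch {
    public static String mask(String input, String anchor, int offset,
                              int numberMasked, char character) {
        int len = input.length();
        int start, end;
        if ("END".equals(anchor)) {
            // Work backwards from the end of the value.
            end = Math.max(0, len - offset);
            start = Math.max(0, end - numberMasked);
        } else {
            // The default anchor is START.
            start = Math.min(len, offset);
            end = Math.min(len, start + numberMasked);
        }
        StringBuilder sb = new StringBuilder(input);
        for (int i = start; i < end; i++) {
            sb.setCharAt(i, character);
        }
        return sb.toString();
    }
}
```

With the documented settings this reproduces both worked examples: "Hello" masks to "He***" and "SecureServe" to "Se****Serve".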
[0355] Encryption and Decryption
[0356] The encryption transform will encrypt the data using a
defined encryption schema. The available encryption schemas are
loaded at system start up from configuration. Encryption keys to be
used by these transforms need to be added to the application
keystore using the Configuration Management Utility. Without a
valid encryption key defined in the application keystore, these
transforms cannot be used.
[0357] Two example types of encryption are described: [0358]
Ephemeral--The same input value will produce different encrypted
values when encrypted twice with the same encryption key, however,
any result can be decrypted to the original value. For example:
[0359] "12345" could encrypt to "X13f9s3gGsGh25DB" on the first
attempt and "IR3d2xSggs9DssH3" on the second time. Both of these
values would decrypt to "12345". [0360] Format Preserving--An input
value will always transform to the same ciphertext when encrypted
using the same encryption key. The ciphertext will be of the same
length and the same alphabet as the input value, specified by the
encryption schema configuration. For example:
[0361] "12345" could encrypt to "98627". "67890" could encrypt to
"46602". Then "98627" would decrypt back to "12345" and "46602"
would decrypt to "67890".
[0362] Optionally, only a substring can be encrypted using an
anchor/offset mechanism in a similar way to the masking
transform.
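Ephemeral behaviour can be sketched with a fresh random IV per encryption, so that equal plaintexts produce different ciphertexts which all decrypt to the original value. AES/GCM and the class name are illustrative assumptions; the document does not name a cipher:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

// Illustrative sketch of "ephemeral" encryption: a fresh random IV per
// call means the same plaintext yields different ciphertexts, yet every
// ciphertext decrypts back to the original value.
public class EphemeralSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            return kg.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static String encrypt(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[12];
            RANDOM.nextBytes(iv); // fresh IV -> different ciphertext each call
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static String decrypt(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(128, Arrays.copyOfRange(in, 0, 12)));
            return new String(cipher.doFinal(in, 12, in.length - 12),
                    StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Format preserving encryption, by contrast, is deterministic and length/alphabet preserving, which is why it needs no stored IV.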
[0363] An encryption schema may be specified in the XML
configuration file in order for an encryption or decryption
transform to be configured.
[0364] An encryption schema is defined by the following parameters:
[0365] name--a user defined name for the schema [0366] key--the
cryptographic key alias to use. This must have been defined in the
application keystore using the configuration management utility
[0367] type--the type of encryption. EPHEMERAL or FPE (Format
Preserving Encryption) [0368] alphabet (Format preserving
only)--The valid range of input/output characters as a string. E.g.
"0123456789" for numerical values. Special characters can be
defined using XML escape sequences, e.g. &amp; for &.
[0369] anchor (Format preserving only)--Used to define a substring
to encrypt. Whether to work from the beginning or end of the input
value when applying the offset and encryptionLength properties. If
specified, must be START or END. Defaults to START. [0370] offset
(Format preserving only)--The number of characters from the anchor
to skip before encryption starts. Defaults to 0. [0371]
encryptionLength (Format preserving only)--The number of characters
from the offset to encrypt
[0372] Examples of configuration for both types of schema are as
follows:
TABLE-US-00026
<encryptionSchema>
  <name>ephemeral</name>
  <key>mykey</key>
  <type>EPHEMERAL</type>
</encryptionSchema>
<encryptionSchema>
  <name>fpe</name>
  <key>fpekey</key>
  <type>FPE</type>
  <alphabet>0123456789</alphabet>
  <anchor>START</anchor>
  <offset>2</offset>
  <encryptionLength>20</encryptionLength>
</encryptionSchema>
[0373] In the above Format Preserving Encryption example,
characters 3-22 will be encrypted (if present).
[0374] Transform Configuration
[0375] The encryption and decryption transforms are configured by
the following property: [0376] schema--a reference by name to an
"Encryption schema", which must be defined elsewhere in the
configuration file. [0377] tokenisationSchema (encryption
only)--The tokenisation schema to use, if tokenisation is to be
enabled. If this parameter is left out, no tokenisation will be
used.
[0378] Example Transform Configuration
[0379] An example of the XML configuration for encryption and
decryption transforms are as follows:
TABLE-US-00027
<transform>
  <type>encrypt</type>
  <field>field1</field>
  <properties>
    <schema>fpe</schema>
  </properties>
</transform>
<transform>
  <type>decrypt</type>
  <field>field1</field>
  <properties>
    <schema>fpe</schema>
  </properties>
</transform>
[0380] Hashing Transform
[0381] The hashing transform uses a keyed hashing algorithm to
create a hash of the supplied value. The secret key may be kept in
the application key store and referred to in the same way as an
encryption key. This key needs to be added to the application
keystore using a Configuration Management Utility in the same way
as encryption keys. Without a valid key defined in the application
keystore, this transform cannot be used.
[0382] The configuration parameters for the hashing transform are:
[0383] keyProvider--This defines the key store to use. This should
be set to "keyProvider" to use the application key store. This has
been included as a configuration parameter for extra
configurability in future, but for this release should always be
set to "keyProvider" [0384] keyAlias--The alias of the key in the
application key store to use [0385] tokenisationSchema--The
tokenisation schema to use, if tokenisation is to be enabled. If
this parameter is left out, no tokenisation will be used.
[0386] An example configuration XML segment for the hashing
transform is as follows:
TABLE-US-00028
<transform>
  <type>hash</type>
  <field>field1</field>
  <properties>
    <keyProvider>keyProvider</keyProvider>
    <keyAlias>hashKey</keyAlias>
  </properties>
</transform>
[0387] Find and Replace Transform
[0388] The Find and Replace Transform will replace any instances of
defined strings within a field with another value. The value to
find may optionally be a regular expression. The configuration
parameters for the find and replace transform are as follows:
[0389] schema--a reference by name to a "Find and replace schema"
which must be defined elsewhere in the configuration file
[0390] Find and Replace Schema Definition
[0391] A Find and Replace schema is defined by a name and a list of
pairs of find and replace values. Each entry may have the following
configuration parameters: [0392] find--the value to find [0393]
replace--the value to replace matching values with [0394]
regex--whether the value to find is a regular expression (defaults
to false). The example below uses the regular expression [a-z]{5},
which means it will match 5 consecutive lowercase characters.
Please see the glossary entry on regular expressions for more
details. [0395] casesensitive--whether the value to find should be
case sensitive (defaults to false)
[0396] An example find and replace schema is as follows:
TABLE-US-00029
<findAndReplaceSchema>
  <name>mySchema</name>
  <propertyList>
    <entry>
      <find>a</find>
      <replace>b</replace>
    </entry>
    <entry>
      <find>b</find>
      <replace>c</replace>
      <casesensitive>true</casesensitive>
    </entry>
    <entry>
      <find>[a-z]{5}</find>
      <replace>###</replace>
      <regex>true</regex>
      <casesensitive>true</casesensitive>
    </entry>
  </propertyList>
</findAndReplaceSchema>
[0397] Example Transform Configuration
[0398] An example for the configuration of a find and replace
transform using a defined schema is as follows:
TABLE-US-00030
<transform>
  <type>findAndReplace</type>
  <field>field1</field>
  <properties>
    <schema>mySchema</schema>
  </properties>
</transform>
[0399] The list of values to find and replace is applied in the
order defined in the configuration file, the output of each being
used as the input of the next.
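The chained application described above can be sketched as follows, using the three mySchema entries from the earlier example. The Entry class is a hypothetical stand-in for the configured schema:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the find-and-replace transform: entries are
// applied in configured order, the output of each feeding the next.
public class FindAndReplaceSketch {
    public static class Entry {
        final String find;
        final String replace;
        final boolean regex;
        final boolean caseSensitive;

        public Entry(String find, String replace, boolean regex, boolean caseSensitive) {
            this.find = find;
            this.replace = replace;
            this.regex = regex;
            this.caseSensitive = caseSensitive;
        }
    }

    public static String apply(String input, List<Entry> entries) {
        String value = input;
        for (Entry e : entries) {
            int flags = e.caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
            // Non-regex values are matched literally.
            String pattern = e.regex ? e.find : Pattern.quote(e.find);
            value = Pattern.compile(pattern, flags)
                    .matcher(value)
                    .replaceAll(Matcher.quoteReplacement(e.replace));
        }
        return value;
    }
}
```

For example, "banana" becomes "bbnbnb" after the first entry, "ccncnc" after the second, and "###c" after the [a-z]{5} entry.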
[0400] Redaction Transform
[0401] The Redaction Transform will remove any instances of defined
strings. No padding with spaces or replacing with "black blocks"
will be performed. If any form of padding is required, the Find and
Replace transform can be used, setting the replace value to a
string of spaces or another appropriate character. The
configuration parameters for the redaction transform are as
follows: [0402] schema--a reference by name to a "Redaction schema"
which must be defined elsewhere in the configuration file
[0403] Redaction Schema Definition
[0404] A Redaction schema is defined by a name and a list of values
to remove. Each entry may have the following configuration
parameter: [0405] redact--the value to remove
[0406] An example redaction schema is as follows:
TABLE-US-00031
<redactionSchema>
  <name>mySchema</name>
  <propertyList>
    <entry>
      <redact>a</redact>
    </entry>
    <entry>
      <redact>b</redact>
    </entry>
  </propertyList>
</redactionSchema>
[0407] Example Transform Configuration
[0408] An example for the configuration of a redaction transform
using a defined schema is as follows:
TABLE-US-00032
<transform>
  <type>redaction</type>
  <field>field1</field>
  <properties>
    <schema>mySchema</schema>
  </properties>
</transform>
[0409] Random Number Generation Transform
[0410] The Random Number Generation Transform takes a String value
as input and returns a random number, generated using a
randomising algorithm, between specified lower and upper bounds. The
application's built-in secure random number generator will be used
to generate the random numbers. Note that this transform is not
dependent on the input value and is intended for use only with
tokenisation enabled. The following configuration parameters are
available for this transform: [0411] lowerBound--The inclusive
lower limit for the random number generator, i.e. the value
generated will be greater than or equal to this value [0412]
upperBound--The exclusive upper limit for the random number
generator, i.e. the value generated will be less than this
value. [0413] tokenisationSchema--The tokenisation schema to
use, if tokenisation is to be enabled. If this parameter is left
out, no tokenisation will be used.
[0414] An example configuration XML segment for this transform is
as follows:
TABLE-US-00033
<transform>
  <type>randomNumberGen</type>
  <field>field9</field>
  <properties>
    <lowerBound>100</lowerBound>
    <upperBound>200</upperBound>
    <tokenisationSchema>mySchema</tokenisationSchema>
  </properties>
</transform>
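The bounded draw can be sketched as follows, with lowerBound inclusive and upperBound exclusive as described. java.security.SecureRandom plays the role of the application's built-in secure generator; the class name is hypothetical:

```java
import java.security.SecureRandom;

// Illustrative sketch of the random number generation transform: the
// output is drawn from [lowerBound, upperBound) and does not depend on
// the input value, which is why tokenisation is needed for consistency.
public class RandomNumberGenSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static long generate(long lowerBound, long upperBound) {
        // Scale a uniform double in [0, 1) onto the configured range.
        return lowerBound + (long) (RANDOM.nextDouble() * (upperBound - lowerBound));
    }
}
```

With the example configuration (lowerBound 100, upperBound 200), every generated value is at least 100 and strictly below 200.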
[0415] Validation Transform
[0416] The validation transform checks the input value against a
regular expression. If it matches, the value will pass through the
transform unchanged. If it does not match, it can either be removed,
replaced with another value, or passed through anyway (with a warning
logged). The action to perform on validation failure is defined by
the configuration parameters. The configuration parameters for this
transform are as follows:
[0417] regularExpression--the regular expression to check the input
value against
[0418] actionOnFailure--the action to take if validation is
unsuccessful. Must be one of:
[0419] DONOTHING--the value will pass through the transform
anyway
[0420] FILTER--the value will be filtered (using the same rules as
the filter transform)
[0421] REPLACE--the value will be replaced by the value defined in
the "replacementValue" property
[0422] replacementValue--the value to be used as a replacement, if
the actionOnFailure parameter is set to REPLACE
[0423] logWarningFlag--whether a warning message should be logged
when a field fails validation. Defaults to false
[0424] An example of the XML configuration required for this
transform is as follows:
TABLE-US-00034
<transform>
  <type>validation</type>
  <field>msisdn</field>
  <properties>
    <regularExpression>[0-9]{15}</regularExpression>
    <actionOnFailure>REPLACE</actionOnFailure>
    <replacementValue>NOT A MSISDN</replacementValue>
    <logWarningFlag>true</logWarningFlag>
  </properties>
</transform>
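The three failure actions can be sketched as follows, with FILTER modelled as blanking the value (as the filter transform does for most data types). The class name is hypothetical:

```java
// Illustrative sketch of the validation transform: matching values pass
// through unchanged; non-matching values are handled per actionOnFailure.
public class ValidationSketch {
    public static String validate(String input, String regularExpression,
                                  String actionOnFailure, String replacementValue) {
        if (input.matches(regularExpression)) {
            return input; // matching values pass through unchanged
        }
        switch (actionOnFailure) {
            case "FILTER":
                return ""; // blanked, as per the filter transform
            case "REPLACE":
                return replacementValue; // substitute the configured value
            default:
                return input; // DONOTHING: pass through anyway
        }
    }
}
```

With the example configuration above, a 15-digit value passes through, while anything else is replaced by "NOT A MSISDN".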
[0425] Detokenisation Transform
[0426] The detokenisation transform is used to look up previously
defined values in a token store. It is intended to be used as the
reverse of one of the other tokenisable transforms (encryption,
hashing, random number generation) with tokenisation enabled. It
does not have any functionality as a standalone transform. The only
configuration parameter is: [0427] tokenisationSchema--The
tokenisation schema to use. This is mandatory for detokenisation.
If not present the transform will fail to start. Please note that
the "keyColumn" and "tokenColumn" of the tokenisation schema should
be reversed for the detokenisation transform. i.e. the "keyColumn"
should be the column containing previously generated tokens, and
the "tokenColumn" should be the column containing the original
input values.
[0428] An example of the configuration for this transform is as
follows:
TABLE-US-00035
<transform>
  <type>detokenisation</type>
  <field>field1</field>
  <properties>
    <tokenisationSchema>myTokenisationSchema</tokenisationSchema>
  </properties>
</transform>
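Detokenisation can be sketched as a pure lookup over a token store whose keyColumn/tokenColumn roles are reversed, as described above. The in-memory map and class name are illustrative stand-ins:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of detokenisation: tokens map back to original
// values, and a missing token is an error rather than a value to transform.
public class DetokenisationSketch {
    private final Map<String, String> tokenToOriginal = new HashMap<>();

    public DetokenisationSketch(Map<String, String> originalToToken) {
        // Reverse the mapping built up by a tokenisable transform.
        originalToToken.forEach((original, token) -> tokenToOriginal.put(token, original));
    }

    public String detokenise(String token) {
        String original = tokenToOriginal.get(token);
        if (original == null) {
            // The transform assumes a token is present for every value received.
            throw new IllegalStateException("No token store entry for: " + token);
        }
        return original;
    }
}
```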
[0429] Tokenisable Transform Configuration
[0430] Any transform for which tokenisation is available
(encryption, hashing, random number generation) must specify a
tokenisation schema to use. This schema specifies the type of
tokenisation and the configuration parameters for that specific
type of tokenisation. The anonymisation system 10 comes with
database tokenisation built in, and provides a Java API for a
custom token store to be written.
[0431] The existence of the tokenisation database is a prerequisite
when turning on tokenisation for any tokenisable transform.
[0432] If tokenisation is enabled on a transform, the token store
may be checked every time the transform is invoked using the
following process: [0433] The input value may be looked up in the
token store. [0434] If the input value is already present in the
token store (i.e. it has been through the system already and a
token generated) then the token from the token store will be used
without running the transform logic. [0435] If the input value is
not present in the token store, the transform will be run and the
resulting value stored in the token store.
[0436] A tokenisation schema is specified by a name, the type of
tokenisation and a list of tokenisation properties, specific to the
type of tokenisation being used. This section describes how to use
the built-in database token store.
[0437] To use the built-in database token store, a database table
may be used, which has two String-based columns (e.g. VARCHAR),
each of which must have unique constraints. There may be other
columns in the table, but they must all have default values.
[0438] Tokenisation configuration parameters for the database token
store may be: [0439] driver--The class of the JDBC driver to use
[0440] url--the fully qualified JDBC url to the database, including
the database schema name [0441] username--the username to connect
to the database [0442] password--the corresponding password [0443]
table--the name of the table to use to store tokens [0444]
keyColumn--the column to use to store input values [0445]
tokenColumn--the column to use to store tokens
[0446] An example of the XML required to configure a tokenisation
schema is as follows:
TABLE-US-00036
<tokenisationSchema>
  <name>myTokenisationSchema</name>
  <type>DATABASE</type>
  <tokenisationProperties>
    <driver>com.mysql.jdbc.Driver</driver>
    <url>jdbc:mysql://localhost:3306/test</url>
    <username>root</username>
    <password>password</password>
    <table>tokens</table>
    <keyColumn>input</keyColumn>
    <tokenColumn>token</tokenColumn>
  </tokenisationProperties>
</tokenisationSchema>
[0447] Please note: it may be desirable to populate the token store
manually before starting the anonymisation system 10, for example
so that the tokens do not have to be generated but are already
present when the system is started.
[0448] Validation Rules and Standardisation
[0449] Standardisation and simple format fixing can be achieved by
using a combination of validation, find and replace and redaction
transforms. For example, the following specific fields could be
standardised as follows:
[0450] MSISDN [0451] A Validation transform to check character
range, type and MSISDN length [0452] A Find and Replace transform
configured to replace +44 with 0 [0453] A Redaction transform to
remove whitespace
[0454] IMEI [0455] A Validation transform to check character range,
type and IMEI length (15 or 16 digits) [0456] A Redaction transform
to remove "-", and whitespace.
[0457] ICCID [0458] A Validation transform to check character
range, type and ICCID length (19 or 20 digits) [0459] A Redaction
transform to remove whitespace
[0460] IMSI [0461] A Validation transform to check character range,
type and IMSI length (14 or 15 digits) [0462] A Redaction transform
to remove whitespace
[0463] IP Address [0464] A Validation transform to check IP address
format i.e.
[0465] IPV4: nnn.nnn.nnn.nnn
[0466] IPV6: hhhh:hhhh:hhhh:hhhh:hhhh:hhhh:hhhh:hhhh
[0467] IPV6: hhhh-hhhh-hhhh-hhhh-hhhh-hhhh-hhhh-hhhh
[0468] A Redaction transform to remove whitespace
[0469] Route Configuration
[0470] The flow of data through the system may be configured as
workflows. These are known as routes, and are preferably configured
in the XML file using the following parameters: [0471]
interface--The data interface for this route, identified by the
name field in the interface configuration [0472] reader--The data
reader/writer for this route, identified by the name field in the
reader/writer configuration [0473] transformSet--Identified by the
name field within the transform set configuration. Exactly one
transform set must be applied to a single route, but the same
transform set can be shared across multiple routes. [0474]
maxConcurrentReaders (optional)--The maximum number of threads to
use to launch Data Reader/Writers within this route. Specifically,
each interface uses the maxConcurrentReaders property as
follows:
[0475] File Interface--The number of threads which can process
files concurrently
[0476] HTTP--The maximum number of HTTP requests which can be
processed simultaneously. Optimally, this should be set to the
maximum number of expected concurrent requests.
[0477] TCP/IP--The number of threads which can process data from
TCP/IP connections at once. Note that one thread per TCP/IP
connection will be used, so this should be set to the maximum
number of expected connections via this interface.
[0478] Messaging--The number of threads which will concurrently
listen to the input queue.
[0479] Database--The number of threads which can process database
data at once.
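The effect of maxConcurrentReaders can be sketched as a fixed-size thread pool bounding how many reader/writers run at once on a route. The sleep and counters exist only to observe the concurrency bound and are not part of the described system:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: maxConcurrentReaders threads process data units,
// so observed concurrency never exceeds the configured maximum.
public class RouteConcurrencySketch {
    public static int maxObservedConcurrency(int maxConcurrentReaders, int units) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentReaders);
        AtomicInteger current = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        for (int i = 0; i < units; i++) {
            pool.submit(() -> {
                int now = current.incrementAndGet();
                max.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // stand-in for processing one data unit
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                current.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return max.get();
    }
}
```

With maxConcurrentReaders set to 4 (as in the route example below in the document), no more than four data units are ever in flight at once.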
[0480] The following is an example of the XML configuration
required for a Route:
TABLE-US-00037
<route>
  <interface>file-feed</interface>
  <reader>delimited-data</reader>
  <transformSet>anonymise</transformSet>
  <maxConcurrentReaders>4</maxConcurrentReaders>
</route>
[0481] Example Configuration File
[0482] The following is an example of a complete configuration
file, specifying the following components: [0483] A startup section
informing the application of the namespaces in use by the XML file.
This should be set to the value given in the below example. [0484]
a file-based interface, reading files with the .ready suffix from
the /input directory, writing the result to the /output directory
with no suffix and renaming the processed file in the input
directory to end with a .done suffix. [0485] a delimited reader,
using a comma as a delimiter and specifying 3 fields. [0486] the
following transforms
[0487] Format preserving encryption, encrypting up to 20 characters
from an offset of 2 from the beginning, using the alphabet
0123456789
[0488] Filter--the second field is removed
[0489] Mask, masking up to 4 characters with a #, with an offset of
2 from the start
[0490] It is possible to split the configuration across multiple
configuration files, for example all interfaces could be defined in
one file, all readers in another, and so on. Alternatively, all
components related to each route could be defined in separate
files.
[0491] Please note that the reference to URLs at the top of the
configuration file specifies various XML namespaces required by
some of the application libraries. No internet connection is
required to run the anonymisation system 10.
TABLE-US-00038
<?xml version="1.0" encoding="UTF-8"?>
<beans:beans
    xmlns:beans="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://www.detica.com/ddsf/configuration"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
  <interface>
    <name>file-feed</name>
    <type>filesystem</type>
    <properties>
      <inputDirectory>/input</inputDirectory>
      <outputDirectory>/output</outputDirectory>
      <pollingFrequency>5000</pollingFrequency>
      <inputSuffix>.ready</inputSuffix>
      <removeInputSuffix>true</removeInputSuffix>
      <processingSuffix>.working</processingSuffix>
      <finishedSuffix>.done</finishedSuffix>
    </properties>
  </interface>
  <reader>
    <name>delimited-data</name>
    <type>delimited</type>
    <properties>
      <separatorCharacter>,</separatorCharacter>
    </properties>
    <fields>
      <field>
        <name>field1</name>
        <definition>0</definition>
      </field>
      <field>
        <name>field2</name>
        <definition>1</definition>
      </field>
      <field>
        <name>field3</name>
        <definition>2</definition>
      </field>
    </fields>
  </reader>
  <transformSet>
    <name>anonymise</name>
    <transforms>
      <transform>
        <type>encrypt</type>
        <field>field1</field>
        <properties>
          <schema>fpe</schema>
        </properties>
      </transform>
      <transform>
        <type>filter</type>
        <field>field2</field>
      </transform>
      <transform>
        <type>mask</type>
        <field>field3</field>
        <properties>
          <anchor>START</anchor>
          <numberMasked>4</numberMasked>
          <offset>2</offset>
          <character>#</character>
        </properties>
      </transform>
    </transforms>
  </transformSet>
  <route>
    <interface>file-feed</interface>
    <reader>delimited-data</reader>
    <transformSet>anonymise</transformSet>
    <maxConcurrentReaders>4</maxConcurrentReaders>
  </route>
  <encryptionSchema>
    <name>fpe</name>
    <key>fpekey</key>
    <type>FPE</type>
    <alphabet>0123456789</alphabet>
    <anchor>START</anchor>
    <offset>2</offset>
    <encryptionLength>20</encryptionLength>
  </encryptionSchema>
</beans:beans>
[0492] Graphical User Interface
[0493] A GUI (graphical user interface) application provides a
facility to edit and manipulate commonly changed features of any of
the described configuration files. These include the list of
transforms in use by a particular route, the properties of those
transforms and the schemas that they need to function
correctly.
[0494] Typical Use Case
[0495] This section outlines an example use case for the GUI. These
are the steps required to modify and save changes to a
configuration file: [0496] User launches the GUI [0497] User enters
username and password [0498] User selects which configuration file
they wish to edit from the file browser [0499] Application uses
keys specified in the GUI configuration file to decrypt and open
the configuration file. If the keys are password protected, the
user will be prompted for the passwords. [0500] User browses
through the available transforms in the configuration file, and
selects one to edit. [0501] User selects "Edit Transform" [0502]
User makes necessary updates, and presses the Submit button. The
application makes these changes in memory, but nothing has been
saved to disk yet [0503] User presses the Save button. The
application uses the keys specified in the GUI configuration to
encrypt the configuration file and overwrite the previous
configuration file on disk.
[0504] The anonymisation system 10 application groups transforms
together into Transform Sets based on the list of transforms
defined within each <transformSet> element in the
configuration file. These are ordered lists of transforms which are
applied, as a whole, to routes. Each route will have exactly one
Transform Set applied to it; however, a single Transform Set may be
used by several different routes. This relationship is defined in
each configuration file.
[0505] The default naming scheme will be the transform type
followed by its position in the transform set relative to other
transforms of the same type. For example in a transform set
containing the following transforms (in order): [0506] mask, mask,
hash, mask, hash
[0507] The generated names would be [0508] mask-1, mask-2, hash-1,
mask-3, hash-2
[0509] These names can be edited by the user using the Edit
Transform feature.
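The default naming scheme above can be sketched as a single pass that counts occurrences per type; the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the default naming scheme: each transform is
// named by its type, suffixed with its position relative to other
// transforms of the same type within the set.
public class TransformNaming {
    public static List<String> defaultNames(List<String> types) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> names = new ArrayList<>();
        for (String type : types) {
            int n = counts.merge(type, 1, Integer::sum);
            names.add(type + "-" + n);
        }
        return names;
    }
}
```

Applied to the ordered set (mask, mask, hash, mask, hash) this yields the names mask-1, mask-2, hash-1, mask-3, hash-2 given in the example above.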
[0510] Changing the Order of Transforms
[0511] A transform may have its position changed, relative to other
transforms in a set.
[0512] Schemas
[0513] Most transform types may be simple and have a basic set of
properties that can be edited; however, some have a more complicated
structure which requires the use of a "Schema" to define their
properties. For example, the Find and Replace transform maintains a
list of things to look for and what to replace each item with. This
list can get quite long so it makes sense to group it together into
a Schema which can then be shared by several instances of the same
transform.
[0514] Several Schemas may be defined within a configuration file,
each applicable to different types of transform and each of these
schemas may be editable by the GUI application.
[0515] Extension API
[0516] Overview
[0517] This section explains the API, which may be used to
implement new modules for the anonymisation system 10. The
module types which can be created are: [0518] Data Interfaces
[0519] Reader/Writers [0520] Transforms [0521] Tokenisers (token
stores)
[0522] Creating new modules involves writing a Java class in a
package com.detica.*, adding a DDSFModule annotation to the class
and updating the anonymisation system 10 configuration file
appropriately. For the system to be able to use the new module(s),
the Java classes should be compiled into a JAR and included on the
Java classpath when starting the anonymisation system 10. Here is a
simple example, applicable for Data Interfaces, Reader/Writers and
Transforms.
TABLE-US-00039
Class file:

package com.detica.newmodules;

@DDSFModule("mynewmodule")
public class NewModule extends (Polling)DataInterface/AbstractReaderWriter/AbstractTransform {
    ...
    @Override
    ...
    @Override
    ...
}

Configuration file:

...
<interface/reader/transform>
    ...
    <type>mynewmodule</type>
    ...
</interface/reader/transform>
...
[0523] Class Structure for Extensions
[0524] FIG. 5 shows the structure of the classes which can be
extended to create new anonymisation system 10 modules.
[0525] The DDSFComponent interface is a root level class for all
system components and defines the following two methods:
[0526] void initialise( )--This method has the purpose of
validating properties and initialising any external resources
required by a component, for example database connections.
[0527] void clean( )--This method has the purpose of clearing down
any external resources which were started up in the initialise
method, for example closing down a database connection created in
the initialise( ) method.
[0528] Every component should preferably implement these methods.
Where a superclass already defines these methods, the call
"super.initialise( )/super.clean( )" should be used as the first
line in the method.
[0529] Data Interfaces
[0530] Data Interfaces are responsible for processing incoming data
from a source and writing it to an output interface. An
anonymisation system Framework provides a class called
AbstractDataInterface which should preferably be extended to
implement data interfaces.
[0531] Another class, PollingDataInterface, is defined which
extends AbstractDataInterface and defines extra logic for the
polling of a source at a specified interval.
[0532] The following sections explain which methods need to be
overridden when implementing a new custom Data Interface of each
type.
[0533] Every Data Interface is responsible for creating a
SimplexDataStream object for each data unit to process. The
SimplexDataStream contains an input channel and an output channel
which define where the data is being read from and written to
respectively.
[0534] AbstractDataInterface
[0535] The AbstractDataInterface class contains three methods, which
must be overridden by implementing classes. They are described in
the following tables 3, 4 and 5.
TABLE-US-00040
TABLE 3: AbstractDataInterface.start( )
Method Name: start( )
Method Function: Starts the interface. Note that this is different from loading external resources, which should be done in the initialise( ) method.
Return Type: void
TABLE-US-00041
TABLE 4: AbstractDataInterface.stop( )
Method Name: stop( )
Method Function: Stops the interface. Should not clear down external resources, which should be done in clean( ).
Return Type: void
TABLE-US-00042
TABLE 5: AbstractDataInterface.isRunning( )
Method Name: isRunning( )
Method Function: Whether or not the interface is running.
Return Type: boolean
[0536] PollingDataInterface
[0537] The PollingDataInterface class can be extended to create a
Data Interface which polls an input source for content at a
specified interval. For example, the FileSystemInterface within
anonymisation system 10 is an extension class of
PollingDataInterface. The PollingDataInterface class itself handles
all the polling code, and the main method which needs to be
implemented is described in the following table 6:
TABLE-US-00043 TABLE 6 PollingDataInterface.getData( )
Method Name: getData( )
Method Function: Find the next available data unit from the input
source and produce the appropriate SimplexDataStream.
Return Type: SimplexDataStream. The next available data stream
should be returned, or null if there is no available incoming
data.
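A hedged sketch of a PollingDataInterface extension follows. SimplexDataStream and the polling base class are simplified stand-ins assumed for illustration; only the getData( ) contract from table 6 is modelled: return the next available stream, or null when there is no incoming data.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in for the framework's SimplexDataStream: pairs an input
// channel with an output channel (modelled here as plain strings).
class SimplexDataStream {
    final String inputChannel;
    final String outputChannel;
    SimplexDataStream(String in, String out) {
        this.inputChannel = in;
        this.outputChannel = out;
    }
}

// Stand-in for the framework's PollingDataInterface; the real class
// also owns the polling loop and interval handling.
abstract class PollingDataInterface {
    public abstract SimplexDataStream getData();
}

// Hypothetical queue-backed source: each queued name is one data unit.
class QueuePollingInterface extends PollingDataInterface {
    private final Queue<String> pending = new ArrayDeque<>();

    public void enqueue(String name) {
        pending.add(name);
    }

    @Override
    public SimplexDataStream getData() {
        String next = pending.poll();
        // Per table 6: null signals that no incoming data is available.
        return next == null ? null
                : new SimplexDataStream(next + ".in", next + ".out");
    }
}
```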
[0538] Reader/Writers
[0539] Reader/Writers are responsible for reading data from a data
interface, splitting it up into individual records and fields,
sending the fields off to the transform engine for processing and
packaging the resulting data back up into the same form for writing
back to the data interface.
[0540] The anonymisation system framework provides the
AbstractReaderWriter class for extension in order to define new
Reader/Writers. The "initialise" and "clean" methods of the
DDSFComponent interface are also applicable to the Reader/Writers
and should be overridden.
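The split-transform-rejoin flow described above can be sketched as follows. The AbstractReaderWriter signature is not given in the text, so this generic helper is an assumption that only illustrates the flow for one delimited record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Illustrative split/transform/rejoin flow for one delimited record,
// mirroring a Reader/Writer's role: split into fields, transform each
// field, and package the results back into the same delimited form.
class DelimitedRecordProcessor {
    static String process(String record, String delimiter,
                          UnaryOperator<String> transform) {
        List<String> out = new ArrayList<>();
        // The -1 limit preserves trailing empty fields.
        for (String field : record.split(Pattern.quote(delimiter), -1)) {
            out.add(transform.apply(field));
        }
        return String.join(delimiter, out);
    }
}
```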
[0541] Transforms
[0542] A Transform class is responsible for performing a
transformation on a piece of data and returning the result. To
create a custom transform, the anonymisation system Framework
provides the AbstractTransform class which should be extended. The
"initialise" and "clean" methods of the DDSFComponent interface are
also applicable to the Transform and should be overridden.
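As a hedged example, a custom transform might look like the following. AbstractTransform is a stand-in here (its exact signature is not given in the text); the sketch shows a simple redaction-style transform that removes a configured substring from a field value.

```java
// Stand-in for the framework's AbstractTransform (assumed shape).
abstract class AbstractTransform {
    public abstract String transform(String value);
}

// Hypothetical redaction transform: removes every occurrence of a
// configured sensitive substring from the field value.
class RedactionTransform extends AbstractTransform {
    private final String toRedact;

    RedactionTransform(String toRedact) {
        this.toRedact = toRedact;
    }

    @Override
    public String transform(String value) {
        return value.replace(toRedact, "");
    }
}
```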
[0543] Tokenisers (Token Stores)
[0544] The anonymisation system 10 includes a database
implementation of a token store, for use when using a tokenisable
transform with tokenisation turned on. It is also possible to
create a custom token store.
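A custom token store could be as simple as the following in-memory sketch (an assumption for illustration; the shipped implementation is database-backed). It provides the two properties a token store needs: the same value always maps to the same token, and every issued token maps back to its original value.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory token store (illustrative only; not the
// database-backed implementation shipped with the system).
class InMemoryTokenStore {
    private final Map<String, String> valueToToken = new HashMap<>();
    private final Map<String, String> tokenToValue = new HashMap<>();
    private long counter = 0;

    // Return a stable token for the value, creating one if needed.
    public synchronized String tokenise(String value) {
        String token = valueToToken.get(value);
        if (token == null) {
            token = "TOK" + (counter++);
            valueToToken.put(value, token);
            tokenToValue.put(token, value);
        }
        return token;
    }

    // Recover the original value for a previously issued token.
    public synchronized String detokenise(String token) {
        return tokenToValue.get(token);
    }
}
```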
[0545] The anonymisation system 10 is mostly stateless and
multithreaded and can scale both horizontally and vertically
easily.
[0546] The anonymisation system 10 utilises encryption for various
purposes:
[0547] Encryption of configuration files
[0548] Encryption of data fields within a transform
[0549] Encryption of the Keystore, which holds the keys used to
perform the data field encryption
[0550] Encryption of startup keys, which are used to access the
keystore and encrypt and decrypt configuration files
[0551] There are several types of "Key" used by the anonymisation
system 10:
[0552] Storage Master Key (SMK)--This is the key used to encrypt
the anonymisation system 10 application configuration files and
encryption Keystore files. There is only one of these per
anonymisation system 10 instance.
[0553] Startup keys--The SMK should not be stored in clear text.
Instead, one or more Startup keys may be required to "unlock" the
SMK whenever it is needed. Startup keys may be password protected.
In the case of having a single startup key, it should be password
protected. All startup keys will be required to unlock the
configuration file for anonymisation system 10 startup, as well as
when opening a configuration file in the GUI. It is not recommended
that all startup keys are kept on the production server. At least
one should be stored remotely, e.g. on a USB drive, and inserted as
necessary.
[0554] Transformation Keys--These are the encryption keys used to
perform encryption of the data fields within the Transform Engine.
They are stored in the Keystore, which in turn is encrypted using
the SMK.
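The text does not specify how the SMK is divided among Startup keys; one common scheme with the stated property (all shares are required to recover the key) is XOR secret sharing, sketched below purely as an illustrative assumption.

```java
import java.security.SecureRandom;

// Illustrative XOR secret sharing: splits a master key into n shares
// that must ALL be combined to recover it. The patent text does not
// specify the actual splitting scheme; this is an assumption.
class XorKeySplitter {
    static byte[][] split(byte[] key, int shares, SecureRandom rng) {
        byte[][] out = new byte[shares][key.length];
        byte[] last = key.clone();
        for (int i = 0; i < shares - 1; i++) {
            rng.nextBytes(out[i]);            // random share
            for (int j = 0; j < key.length; j++) {
                last[j] ^= out[i][j];         // fold into the final share
            }
        }
        out[shares - 1] = last;               // final share completes the XOR
        return out;
    }

    static byte[] combine(byte[][] shares) {
        byte[] key = new byte[shares[0].length];
        for (byte[] share : shares) {
            for (int j = 0; j < key.length; j++) {
                key[j] ^= share[j];
            }
        }
        return key;
    }
}
```

Because every share XORs into the result, withholding any single startup key leaves the SMK unrecoverable, matching the "all startup keys must be provided" requirement.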
[0555] The Configuration Management utility is a command line tool
which provides the following functionality to manage the encryption
aspects of the system:
[0556] Generate new encryption keys (including the storage master
key) using a cryptographically strong random number generator. The
random number generator will be initialised with a cryptographically
strong seed. The source of the seed may be operating system
dependent.
[0557] Encrypt a storage master key using any number of Startup
keys
[0558] Encrypt/decrypt any configuration files with the storage
master key
[0559] Generate a Key Store, encrypted with the storage master key
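Key generation with a cryptographically strong random number generator, as described above, can be sketched with the JDK's SecureRandom, which by default draws its seed from an operating-system entropy source (matching the note that the seed source may be operating system dependent). This is a generic sketch, not the utility's actual code.

```java
import java.security.SecureRandom;

// Sketch of cryptographically strong key generation. SecureRandom is
// seeded from an operating-system entropy source by default.
class StrongKeyGenerator {
    static byte[] generateKey(int bits) {
        byte[] key = new byte[bits / 8];
        new SecureRandom().nextBytes(key);
        return key;
    }
}
```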
[0560] A storage master key may be required, and all associated
Startup keys should be provided on application start-up.
[0561] The following describes an example of an anonymisation
system 10 deployment. This deployment will utilise the file, HTTP
and TCP/IP interfaces, processing a variety of data formats. A high
level architecture is shown in FIG. 6.
[0562] The deployment consists of 3 main data flows:
[0563] Web application access over HTTP--A web service is used to
return customer data to a browser. The web application is a 3rd
party application which contains a database of encrypted MSISDNs
and unencrypted customer names. A user of the web service knows the
real MSISDN and enters this onto a web form to search for customer
details. The anonymisation system 10 intercepts the request and
encrypts the MSISDN in the POST data of the HTTP request, and
decrypts the MSISDN in the HTML page returned by the web
application.
[0564] SMPP/UCP message processing to an SMSC via TCP/IP--The
anonymisation system 10 acts as a proxy to an SMSC, anonymising
destination MSISDNs on the way out.
[0565] Customer data record transformation of delimited files via
the file interface--Customer data files are dropped into the input
directory and these are anonymised and placed in the output
directory. These directories are then accessed by external systems
via SFTP.
[0566] Example Input/Output Data
[0567] The Configuration file is set up with multiple interfaces
and this section gives examples of input and output values for each
interface defined above.
[0568] File Based Interface
[0569] The file based interface is set up to read CSV files
consisting of Name, MSISDN and IMEI. An example input file would
be:
[0570] John Smith,447789123456,123456789012345
[0571] Joe Bloggs,447982794563,320247543723897
[0572] The name field is set to be partially masked, the MSISDN set
to be encrypted, and the IMEI left untransformed, so the output
might be as follows:
[0573] John #####,985572987352,123456789012345
[0574] Joe B#####,952953756154,320247543723897
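The partially masked names above are consistent with a rule that keeps the first five characters and replaces the remainder with `#`. The exact masking rule is not stated in the text, so the following is a hedged sketch of one rule that reproduces the example outputs.

```java
// Illustrative partial mask: keep the first `keep` characters, replace
// the rest with '#'. Matches the example outputs above for keep = 5.
class PartialMasker {
    static String mask(String value, int keep) {
        if (value.length() <= keep) {
            return value; // nothing to mask
        }
        StringBuilder sb = new StringBuilder(value.substring(0, keep));
        for (int i = keep; i < value.length(); i++) {
            sb.append('#');
        }
        return sb.toString();
    }
}
```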
[0575] HTTP Data Interface
[0576] The HTTP Data Interface is set up to transform HTTP request
data, encrypting the MSISDN and decrypting the MSISDN in the
resulting HTML page on the response.
[0577] The input would be created by submitting a form on a web
page, but the resulting underlying HTTP request body could be:
msisdn=447789123456&submit=Submit+Query
This msisdn will be encrypted, so the output value could be:
msisdn=671968471158&submit=Submit+Query
[0578] The web application then returns an HTML page containing the
encrypted MSISDN:
TABLE-US-00044 <html> <head> <title>MSISDN Lookup
Results</title> </head> <body> <div>
MSISDN: <span id=\"msisdn\">671968471158</span>
</div> <div> Name: <span id=\"name\">Ariel
Rineer</span> </div> </body> </html>
This is intercepted and the MSISDN field decrypted by the
anonymisation system 10 to the output:
TABLE-US-00045 <html> <head> <title>MSISDN Lookup
Results</title> </head> <body> <div>
MSISDN: <span id=\"msisdn\">447789123456</span>
</div> <div> Name: <span id=\"name\">Ariel
Rineer</span> </div> </body> </html>
[0579] TCP/IP Interface
[0580] Two TCP/IP interfaces are set up, one for SMPP and one for
UCP. Each of them is set up to encrypt an MSISDN field. Example
input and output values are shown in FIGS. 7 and 8. (The values
here are as viewed with a text editor, and contain unprintable
values.)
[0581] SMPP (the first record shown in FIG. 7 is the value pre
encryption; the second record shown in FIG. 7 is the value after
encryption).
[0582] UCP (the first line shown in FIG. 8 is the value pre
encryption; the second line shown in FIG. 8 is the value after
encryption).
[0583] FIG. 9 shows a list of the functional and non-functional
requirements for the anonymisation system.
[0584] FIG. 10 shows a non-exhaustive table listing 31 different
combinations of sensitive customer data items that may be
associated with a customer data record. For each combination of
data items in a data set, the table identifies which sensitive data
items do not need to be anonymised, and which data items do need to
be anonymised, filtered or masked (i.e. transformed) to meet
current security requirements.
[0585] Masking may include removing detail and granularity from
data items, for example location data for cell-IDs could be masked
to generalise the location information to a particular town, county
or country.
[0586] Of course, the anonymisation system can be configured to
anonymise any type of data item and any combination of these data
items in a data item set. The invention is not limited to use in
anonymising and filtering mobile network data or Customer Data
Record (CDR) data, and can be applied to any data having a
predefined data structure.
[0587] As will be appreciated by the skilled person, details of the
above embodiment may be varied without departing from the scope of
the present invention, as defined by the appended claims.
[0588] Many combinations, modifications, or alterations to the
features of the above embodiments will be readily apparent to the
skilled person and are intended to form part of the invention. Any
of the features described specifically relating to one embodiment
or example may be used in any other embodiment by making the
appropriate changes.
TABLE-US-00046 Glossary
AES: Advanced Encryption Standard--an industry standard of
encryption. An example encryption used in the anonymisation system
is based on AES-256 (the 256-bit version).
Application keystore: A serialised Java class file, encrypted using
the Storage Master Key, which holds: Transformation keys used for
encryption, decryption and hashing within the anonymisation system;
and HTTPS Keystore Passwords, used to read the contents of a
password protected HTTPS Keystore File.
Atomic operation: An operation which acts as a single unit.
Traditionally refers to a transaction within a database; in the
case of the anonymisation system this is used to indicate that a
file should not be placed in the input directory with the
configured input suffix in a part-written state. It should be
written with a different suffix and then renamed, in order that the
system does not start to read a partial file.
Configuration file: The XML file which contains the configuration
of Routes and other system components required to start an instance
of the anonymisation system.
Configuration Management Utility: A command line utility provided
with the anonymisation system to enable management of the
application keystore, storage master key, startup keys,
transformation keys, and the encryption and decryption of relevant
files with these keys.
Data Interface: The application layer responsible for creating
input and output data channels from various raw sources.
Data Unit: A single piece of data read by one of the interfaces, as
follows: File system interface--a single file; TCP/IP interface--a
single socket connection on the relevant port; HTTP(S) interface--a
single HTTP(S) request; Database--a single database row;
Messaging--a single message.
Encryption Key: A key used for the encryption transform. This may
be a 256-bit value for Ephemeral encryption or a 2048-bit value for
format preserving encryption. See also Transformation Key.
Encryption Schema: A section of the application XML configuration
which defines which form of encryption to use in an encryption
transform, along with the specific properties for the encryption
type.
Ephemeral Encryption: A type of encryption where every time a value
is encrypted, it encrypts to a different value, but every output
can still be decrypted back to the correct original value.
Format Preserving Encryption: A type of encryption where an
alphabet is specified, and every encrypted value is of the same
alphabet and the same length as the input value.
Java HTTPS Keystore: A serialised Java class containing a
collection of certificates used by the HTTPS protocol. If HTTPS is
to be used, a Java HTTPS keystore must be generated containing the
appropriate certificates. Java comes with a utility for creating
such a store, called "keytool". Optionally, when creating the
store, a Key Password and a Store Password can be specified.
Keystore file: The file containing the application keystore.
keytool: A utility provided with Java for the creation of HTTPS
keystores. May be protected with a password, which can be added to
the application key store as an aliased key.
Redaction: Removing specific text from a field.
Regular Expression: An expression for defining patterns within
text. See www.regular-expressions.info for a reference guide.
Route: A combination of Data Interface, Reader/Writer and
TransformSet which defines a "channel" through the system.
Schema: A fragment of XML which defines complex properties for
particular transforms. Encryption, Tokenisation, Find and Replace
and Redaction all have their own Schemas. These can be edited via
the GUI.
Storage Master Key (SMK): A randomly generated String which is used
to encrypt sensitive configuration files used by the system. The
Storage Master Key is never stored in clear text. Instead it is
split up into startup keys which can be stored separately.
Startup Key: One of a number of keys which when combined together
will form the Storage Master Key. Whenever the Storage Master Key
is required, all startup keys must be provided. A startup key may
optionally be password protected.
Tokenisation Schema: A section of the application XML configuration
which defines which form of tokenisation to use in a tokenisable
transform, along with the specific configuration properties for the
token store.
Transform: The application layer responsible for transforming
individual data fields in a variety of ways, in order to anonymise
and de-anonymise them.
Transform Set: A collection of transforms, grouped together to be
applied to several fields within a single data record.
Transformation Key: A key used within certain transforms. Most
commonly this will be used for encryption; however keys are also
required by the hashing transform. This term is an overarching term
for any such key used by any relevant transform. This is generally
a 256-bit value, with the exception of format preserving encryption
when it is a 2048-bit value. The transformation keys are stored in
the application keystore.
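The glossary's ephemeral encryption entry can be illustrated with the JDK's own crypto API: encrypting the same plaintext twice under a fresh random IV yields different ciphertexts, yet both decrypt to the original value. This is a generic sketch using AES-256/GCM, not the system's actual encryption schema.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of ephemeral encryption: AES-256/GCM with a fresh random IV
// per call, so equal plaintexts produce different ciphertexts but
// remain decryptable. Illustrative only; not the system's schema.
class EphemeralCrypto {
    private static final SecureRandom RNG = new SecureRandom();

    // Returns the 12-byte IV prepended to the ciphertext.
    static byte[] encrypt(byte[] key, byte[] plaintext) {
        try {
            byte[] iv = new byte[12];
            RNG.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(plaintext);
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static byte[] decrypt(byte[] key, byte[] ivAndCiphertext) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(128, ivAndCiphertext, 0, 12));
            byte[] ct = Arrays.copyOfRange(ivAndCiphertext, 12,
                                           ivAndCiphertext.length);
            return c.doFinal(ct);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```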
* * * * *