U.S. patent application number 16/878201 was filed on 2020-05-19 and published on 2021-11-25 for systems and methods for generating tokens using secure multiparty computation engines.
The applicant listed for this patent is Acronis International GmbH. The invention is credited to Serguei Beloussov, Stanislav Protasov, Kailash Sivanesan, Sanjeev Solanki, and Mark A. Will.
United States Patent Application 20210367774
Kind Code: A1
Will; Mark A; et al.
November 25, 2021
SYSTEMS AND METHODS FOR GENERATING TOKENS USING SECURE MULTIPARTY
COMPUTATION ENGINES
Abstract
Disclosed herein are systems and methods for generating tokens
using SMPC compute engines. In one aspect, a method may hash, by a
node, a data input with a salt value. The method may split, by the
node, the hashed data input into a plurality of secret shares,
wherein each respective secret share of the plurality of secret
shares is assigned to a respective SMPC compute engine of a
plurality of SMPC compute engines. The respective SMPC compute
engines may be configured to collectively hash the respective
secret share with a secret salt value that is unknown to the plurality of
SMPC compute engines. The respective SMPC compute engine may
further receive a plurality of hashed secret shares from remaining
SMPC compute engines of the plurality of SMPC compute engines, and
generate a token, wherein the token is a combination of the hashed
respective secret share and the plurality of hashed secret
shares.
Inventors: Will; Mark A (Singapore, SG); Solanki; Sanjeev (Singapore, SG); Sivanesan; Kailash (Singapore, SG); Beloussov; Serguei (Costa Del Sol, SG); Protasov; Stanislav (Moscow, RU)

Applicant:
Name: Acronis International GmbH
City: Schaffhausen
Country: CH
Family ID: 1000004896059
Appl. No.: 16/878201
Filed: May 19, 2020
Current U.S. Class: 1/1
Current CPC Class: G06F 21/552 20130101; H04L 9/088 20130101; H04L 2209/46 20130101; H04L 63/0428 20130101; G06F 21/6245 20130101; G06F 2221/034 20130101; H04L 9/085 20130101; H04L 9/0863 20130101; H04L 9/0643 20130101
International Class: H04L 9/08 20060101 H04L009/08; H04L 9/06 20060101 H04L009/06; H04L 29/06 20060101 H04L029/06; G06F 21/55 20060101 G06F021/55; G06F 21/62 20060101 G06F021/62
Claims
1. A method for token generation using SMPC compute engines, the
method comprising: hashing, by a node, a data input with a salt
value; splitting, by the node, the hashed data input into a
plurality of secret shares, wherein each respective secret share of
the plurality of secret shares is assigned to a respective SMPC
compute engine of a plurality of SMPC compute engines, wherein the
respective SMPC compute engine is configured to: securely hash the
respective secret share with a secret salt value unique to the
respective SMPC compute engine; transmit the respective hashed
secret share to the remaining SMPC compute engines of the plurality
of SMPC compute engines; receive a plurality of hashed secret
shares from remaining SMPC compute engines of the plurality of SMPC
compute engines; and generate a token, wherein the token is a
combination of the hashed respective secret share and the plurality
of hashed secret shares.
2. The method of claim 1, wherein the node is one data source of
multiple data sources, wherein another token corresponding to data
inputs from both the node and at least one other node of the
multiple data sources should be generated, further comprising:
transmitting the salt value to the at least one other node, wherein
the at least one other node is configured to: hash at least one
other data input with the salt value; and split the hashed at least
one other data input into at least one other plurality of secret
shares, wherein each respective secret share of the at least one
other plurality of secret shares is assigned to a respective SMPC
compute engine of the plurality of SMPC compute engines.
3. The method of claim 2, wherein the respective SMPC compute
engine is further configured to: jointly, with the plurality of
SMPC compute engines, hash the respective secret share from the
plurality of secret shares with a secret salt value β and hash the
respective secret share from the at least one other plurality of
secret shares with the secret salt value β, where β is unknown to
any SMPC compute engine of the plurality of SMPC compute engines;
transmit the respective hashed secret share to the remaining SMPC
compute engines of the plurality of SMPC compute engines; receive
another plurality of hashed secret shares from the remaining SMPC
compute engines of the plurality of SMPC compute engines; and
generate additional tokens, wherein each additional token is a
combination of the plurality of hashed secret shares, irrespective
of data source.
4. The method of claim 3, wherein the node and the at least one
other node all receive the additional tokens.
5. The method of claim 1, wherein the data input is an identifier
of a respective row or column in a dataset being uploaded by the
node to the plurality of SMPC compute engines.
6. The method of claim 1, further comprising: prior to splitting
the hashed data input, combining the hashed data input with a
passcode provided by a user of the node; and splitting the hashed
data input combined with the passcode into the plurality of secret
shares.
7. The method of claim 1, wherein the respective SMPC compute
engine is further configured to: securely encrypt the respective
secret share using an encryption scheme, wherein an initialization
vector and key of the encryption scheme are in a secret share and
are unknown to any SMPC compute engine of the plurality of SMPC
compute engines; transmit the respective encrypted secret share to
the remaining SMPC compute engines of the plurality of SMPC compute
engines; receive a plurality of encrypted secret shares from the
remaining SMPC compute engines of the plurality of SMPC compute
engines; and generate the token, wherein the token is a combination
of the encrypted respective secret share and the plurality of
encrypted secret shares.
8. The method of claim 1, wherein the respective SMPC compute
engine is further configured to: store a plurality of generated
tokens with corresponding data inputs; in response to detecting
that the stored plurality of generated tokens is being overwritten,
generate an alert indicating malicious behaviour.
9. A system for token generation using SMPC compute engines, the
system comprising: a first hardware processor of a node; a second
hardware processor of an SMPC compute engine; the first hardware
processor configured to: hash a data input with a salt value; split
the hashed data input into a plurality of secret shares, wherein
each respective secret share of the plurality of secret shares is
assigned to a respective SMPC compute engine of a plurality of SMPC
compute engines, wherein the second hardware processor of the
respective SMPC compute engine is configured to: securely hash the
respective secret share with a secret salt value unique to the
respective SMPC compute engine; transmit the respective hashed
secret share to the remaining SMPC compute engines of the plurality
of SMPC compute engines; receive a plurality of hashed secret
shares from remaining SMPC compute engines of the plurality of SMPC
compute engines; and generate a token, wherein the token is a
combination of the hashed respective secret share and the plurality
of hashed secret shares.
10. The system of claim 9, wherein the node is one data source of
multiple data sources, wherein another token corresponding to data
inputs from both the node and at least one other node of the
multiple data sources should be generated, wherein the first
hardware processor is further configured to: transmit the salt
value to the at least one other node, wherein a third hardware
processor of the at least one other node is configured to: hash at
least one other data input with the salt value; and split the
hashed at least one other data input into at least one other
plurality of secret shares, wherein each respective secret share of
the at least one other plurality of secret shares is assigned to a
respective SMPC compute engine of the plurality of SMPC compute
engines.
11. The system of claim 10, wherein the second hardware processor
of the respective SMPC compute engine is further configured to:
jointly, with the plurality of SMPC compute engines, hash the
respective secret share from the plurality of secret shares with a
secret salt value β and hash the respective secret share from
the at least one other plurality of secret shares with the secret
salt value β, where β is unknown to any SMPC compute
engine of the plurality of SMPC compute engines; transmit the
respective hashed secret share to the remaining SMPC compute
engines of the plurality of SMPC compute engines; receive another
plurality of hashed secret shares from the remaining SMPC compute
engines of the plurality of SMPC compute engines; and generate
additional tokens, wherein each additional token is a combination
of the plurality of hashed secret shares, irrespective of data
source.
12. The system of claim 11, wherein the node and the at least one
other node all receive the additional tokens.
13. The system of claim 9, wherein the data input is an identifier
of a respective row or column in a dataset being uploaded by the
node to the plurality of SMPC compute engines.
14. The system of claim 9, wherein the first hardware processor is
further configured to: prior to splitting the hashed data input,
combine the hashed data input with a passcode provided by a user of
the node; and split the hashed data input combined with the
passcode into the plurality of secret shares.
15. The system of claim 9, wherein the second hardware processor of
the respective SMPC compute engine is further configured to:
securely encrypt the respective secret share using an encryption
scheme, wherein an initialization vector and key of the encryption
scheme are in a secret share and are unknown to any SMPC compute
engine of the plurality of SMPC compute engines; transmit the
respective encrypted secret share to the remaining SMPC compute
engines of the plurality of SMPC compute engines; receive a
plurality of encrypted secret shares from the remaining SMPC
compute engines of the plurality of SMPC compute engines; and
generate the token, wherein the token is a combination of the
encrypted respective secret share and the plurality of encrypted
secret shares.
16. The system of claim 9, wherein the respective SMPC compute
engine is further configured to: store a plurality of generated
tokens with corresponding data inputs; in response to detecting
that the stored plurality of generated tokens is being overwritten,
generate an alert indicating malicious behaviour.
17. A non-transitory computer readable medium storing thereon
computer executable instructions for token generation using SMPC
compute engines, comprising instructions for: hashing, by a node, a
data input with a salt value; splitting, by the node, the hashed
data input into a plurality of secret shares, wherein each
respective secret share of the plurality of secret shares is
assigned to a respective SMPC compute engine of a plurality of SMPC
compute engines, wherein the respective SMPC compute engine is
configured to: securely hash the respective secret share with a
secret salt value unique to the respective SMPC compute engine;
transmit the respective hashed secret share to the remaining SMPC
compute engines of the plurality of SMPC compute engines; receive a
plurality of hashed secret shares from remaining SMPC compute
engines of the plurality of SMPC compute engines; and generate a
token, wherein the token is a combination of the hashed respective
secret share and the plurality of hashed secret shares.
18. The non-transitory computer readable medium of claim 17,
wherein the node is one data source of multiple data sources,
wherein another token corresponding to data inputs from both the
node and at least one other node of the multiple data sources
should be generated, further comprising instructions for:
transmitting the salt value to the at least one other node, wherein
the at least one other node is configured to: hash at least one
other data input with the salt value; and split the hashed at least
one other data input into at least one other plurality of secret
shares, wherein each respective secret share of the at least one
other plurality of secret shares is assigned to a respective SMPC
compute engine of the plurality of SMPC compute engines.
19. The non-transitory computer readable medium of claim 17,
wherein the data input is an identifier of a respective row or
column in a dataset being uploaded by the node to the plurality of
SMPC compute engines.
20. The non-transitory computer readable medium of claim 17,
further comprising instructions for: prior to splitting the hashed
data input, combining the hashed data input with a passcode
provided by a user of the node; and splitting the hashed data input
combined with the passcode into the plurality of secret shares.
Description
FIELD OF TECHNOLOGY
[0001] The present disclosure relates to the field of secure
multiparty computation (SMPC), and, more specifically, to systems
and methods for generating tokens using SMPC compute engines.
BACKGROUND
[0002] When processing, storing, or transmitting customer/user
information, IT companies must often separate user identification
information from user data to comply with privacy and data security
regulations. For example, a piece of data regarding salary such as
(Bob, $90k) can be quite revealing as it indicates that Bob earns a
salary of $90k. To perform computations on information, such as a
dataset, identification is not necessarily needed. For example,
performing a computation using the value $90k can happen
successfully regardless of whether the data input is (Bob, $90k) or
(Anonymous, $90k). Given that certain countries have very strict
privacy and data security laws aimed at protecting personal
information, with fines and other punishments in place for
breaching the laws, and given that individuals may not wish to give
up personal information, anonymizing personal identifiable
information is very important.
[0003] When multiple sources are involved that collectively link
data, it becomes even more important to keep each source's
identification confidential by converting the identification to
some value that does not reveal the original input and is unique
and deterministic--allowing for the joining of data. For example,
several companies may wish to determine an average employee salary
amongst each other to evaluate whether their own salaries are
competitive in the market. However, in order to determine the
average salary across the companies, their salary datasets need to
be combined to identify employees working for multiple companies.
To preserve individual confidential data and their own company
names, the identification information should be anonymized.
[0004] A common approach to this problem is using a tokenizer,
which takes some input and gives an obfuscated, seemingly random
output. Therefore, an email or identification number becomes a
random token. But for this to be useful in the case of multiple
data sources, the same input needs to give the same output in order
for the two sources to be joined together. Hence a random mapping
cannot be used, unless each uploader possesses the same mapping.
Unfortunately, a single mapping becomes a single point of failure
because all tokens can be revealed to a malicious entity if the
mapping is compromised. Furthermore, searchable encryption cannot
be used easily in a secure multiparty computation (SMPC)
environment because the data needs to be processed in the same
order across multiple nodes, where identification values may be
used for ordering the dataset; most searchable encryption
techniques cannot easily guarantee this ordering.
SUMMARY
[0005] To address these shortcomings, aspects of the present
disclosure describe methods and systems for generating tokens using
SMPC compute engines.
[0006] In one exemplary aspect, a method for token generation using
SMPC compute engines may apply a deterministic function, by a data
source node, to a data input. The method may split, at the data
source node, the output of the deterministic function into a
plurality of secret shares, wherein each respective secret share of
the plurality of secret shares is assigned to a respective SMPC
compute engine of a plurality of SMPC compute engines. The
respective SMPC compute engine may be configured to apply another
deterministic function to the respective secret share. The
respective SMPC compute engine may transmit the respective secret
share to the remaining SMPC compute engines of the plurality of
SMPC compute engines. The respective SMPC compute engine may
further receive a plurality of secret shares from the remaining
SMPC compute engines of the plurality of SMPC compute engines, and
generate a token, wherein the token is a combination of the
respective secret share and the plurality of secret shares.
[0007] In some aspects, the respective SMPC compute engine of the
plurality of SMPC compute engines is one data source of a plurality
of data sources. Alternatively, the respective SMPC compute engines
of the plurality of SMPC compute engines may be separate entities
that should not have access to private inputs or private
outputs.
[0008] In some aspects, data inputs that contain identifiable
information, but are required for joining data together, are
deterministically converted into a token. Data from one data source
and at least one other data source of the multiple data sources
should generate the same token if the method is given the same
input value.
[0009] In some aspects, the token may be partially generated on the
data source node using a deterministic function (e.g., MD5, SHA1,
AES with the same key and initialization vector, etc.), before being
split into a plurality of secret shares, wherein each respective
secret share of the plurality of secret shares is assigned to a
respective SMPC compute engine of the plurality of SMPC compute
engines. The method may then continue the token generation process
within the SMPC compute engines. For example, the respective SMPC
compute engine may use an SMPC function to compute: (1) a hash value
(e.g., based on MD5, SHA1, etc.) of a respective secret share with
a secret salt value that is unknown to any other SMPC compute
engine of the plurality of SMPC compute engines, (2) an encryption
function with a deterministic output (e.g., AES CBC with a secret
key and a secret static initialization vector), or (3) any
combination of deterministic functions involving secret values.
[0010] The method may generate the secret token when each SMPC
compute engine sends the remaining SMPC compute engines of the
plurality of SMPC compute engines its secret share of the plurality
of secret shares. The method may receive a plurality of secret
shares from other SMPC compute engines, wherein joining the secret
share with the plurality of secret shares will generate the secret
token. Each SMPC compute engine receives the same token, which can
be included with any other data values for uses such as, but not
limited to, storing a dataset, joining multiple datasets together,
or ordering data inputs for the use in another SMPC function.
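The combination step above can be modeled in plain Python. This is an illustrative sketch only, not the patented protocol: in a real SMPC deployment the hashing happens jointly inside the protocol so the secret salt stays secret-shared, whereas here the salt is an ordinary constant. The engine names, share values, and choice of SHA-256 are assumptions for illustration.

```python
import hashlib

SECRET_SALT = b"beta"  # hypothetical value; secret-shared among engines in practice

def engine_hash(share: int) -> str:
    # Each engine hashes its own secret share together with the salt.
    return hashlib.sha256(str(share).encode() + SECRET_SALT).hexdigest()

def generate_token(shares_by_engine: dict[str, int]) -> str:
    # Combine the broadcast digests in a fixed (sorted) engine order so
    # that every engine derives an identical token.
    digests = [engine_hash(shares_by_engine[e]) for e in sorted(shares_by_engine)]
    return hashlib.sha256("".join(digests).encode()).hexdigest()

shares = {"A": 32150327, "B": 11041241, "Z": 31100301}
assert generate_token(shares) == generate_token(shares)  # same token on every engine
```

Because the digests are combined in a fixed engine order, the token is independent of the order in which broadcasts arrive.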
[0011] Continuing with the previously mentioned example, suppose
that multiple companies desire to find an average employee salary
amongst each other. Each company may possess a dataset containing
social security numbers and salaries for computer engineers. The
datasets are to be uploaded into an SMPC compute system. A
contractor may be employed at the multiple companies, and
therefore, their separate salaries need to be totaled first to form
their overall salary, before the average can be computed. In this
case, the social security numbers are identifiable information and
should not be stored. However, the social security numbers may be
required to join together the salaries someone may be receiving
from different companies. Thus, the tokens may be used to identify
two salaries which need to be securely summed before the secure
average function can proceed. Similarly, a token may be required
when using an SMPC function to compute the average where the secret
shares need to be aligned when entering the function. For example,
if the salary $80k of an employee is split into $25k, $40k, and
$15k, the three shares need to be processed together on the
plurality of SMPC compute engines. Accordingly, an SMPC compute
engine that should receive $25k (i.e., the first secret share)
cannot receive $45k because then an incorrect salary of $100k is
the result. The inputs may be sorted by the token, thus, the tokens
on each of the SMPC compute engines need to be identical.
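The join-then-average flow in the salary example can be sketched in plaintext Python (the real computation would run on secret shares inside the SMPC engines; the token strings and salary figures below are hypothetical):

```python
from collections import defaultdict

# Rows carry a token in place of a social security number. Salaries that
# share a token (a contractor paid by both companies) are summed first.
company_a = [("tok_alice", 80_000), ("tok_bob", 90_000)]
company_b = [("tok_bob", 30_000), ("tok_carol", 70_000)]

totals: defaultdict[str, int] = defaultdict(int)
for token, salary in company_a + company_b:
    totals[token] += salary  # identical tokens merge rows across sources

average = sum(totals.values()) / len(totals)
# tok_bob's two rows (90k + 30k) collapse into one 120k entry, so the
# average runs over three distinct employees rather than four rows.
assert totals["tok_bob"] == 120_000 and average == 90_000.0
```

This is exactly why the tokens must be deterministic: without identical tokens on both sources, tok_bob's two rows could never be matched.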
[0012] The method may transmit a static value (e.g., a salt value
for a hash function) to at least one other data source, in order to
allow data to be joined from the multiple sources. For example, at
least one other data source node may be configured to hash at least
one other data input with a salt value, and split the hashed value
into a plurality of secret shares, wherein each respective secret
share is assigned to a respective SMPC compute engine of the
plurality of SMPC compute engines. Therefore, for tokens across
multiple sources to be linked together, the same salt value is
required.
[0013] In some aspects, all SMPC compute engines of the plurality
of SMPC compute engines receive the same token which represents the
same plaintext input identifier value. In some aspects, the data
input is an identifier that should be confidential.
[0014] In some aspects, prior to splitting the token input into a
plurality of secret shares, the method may combine the input with a
passcode provided by a user of the node (e.g., data source node).
In the case of a hash function, this passcode can be hashed with
the token input, where the output may then be split into the
plurality of secret shares.
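A minimal sketch of this optional passcode step, assuming SHA-256 and using the simplified 8-digit hash and the "1010" passcode from the examples in this disclosure: the passcode is hashed together with the already-hashed token input before splitting, so the compute engines never store the passcode itself.

```python
import hashlib

def with_passcode(hashed_input: str, passcode: str) -> str:
    # Hash the user's passcode into the token input prior to secret sharing.
    return hashlib.sha256((hashed_input + passcode).encode()).hexdigest()

pre_split = with_passcode("74291869", "1010")
assert pre_split == with_passcode("74291869", "1010")  # deterministic
assert pre_split != with_passcode("74291869", "0000")  # passcode-dependent
```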
[0015] In some aspects, the respective SMPC compute engine is
further configured to store a plurality of generated tokens with
corresponding data inputs and in response to detecting that the
stored plurality of generated tokens is being overwritten, or that
generated tokens are not part of a wider dataset, generate an alert
indicating malicious behaviour.
[0016] In some aspects, the data input is an identifier of a
respective row or column in a dataset being uploaded by a data
source to the plurality of SMPC compute engines.
[0017] It should be noted that the methods described above may be
implemented in a system comprising a hardware processor.
Alternatively, the methods may be implemented using computer
executable instructions of a non-transitory computer readable
medium.
[0018] The above simplified summary of example aspects serves to
provide a basic understanding of the present disclosure. This
summary is not an extensive overview of all contemplated aspects,
and is intended to neither identify key or critical elements of all
aspects nor delineate the scope of any or all aspects of the
present disclosure. Its sole purpose is to present one or more
aspects in a simplified form as a prelude to the more detailed
description of the disclosure that follows. To the accomplishment
of the foregoing, the one or more aspects of the present disclosure
include the features described and exemplarily pointed out in the
claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The accompanying drawings, which are incorporated into and
constitute a part of this specification, illustrate one or more
example aspects of the present disclosure and, together with the
detailed description, serve to explain their principles and
implementations.
[0020] FIG. 1 is a block diagram illustrating a system for
generating secret shares of an input value for SMPC compute
engines, in accordance with aspects of the present disclosure.
[0021] FIG. 2 is a block diagram illustrating a system for
generating tokens using the SMPC compute engines, in accordance
with aspects of the present disclosure.
[0022] FIG. 3 illustrates a flow diagram of a method for
generating a token for one source using SMPC compute engines, in
accordance with aspects of the present disclosure.
[0023] FIG. 4 illustrates a flow diagram of a method for
generating a token for multiple sources using SMPC compute engines,
in accordance with aspects of the present disclosure.
[0024] FIG. 5 presents an example of a general-purpose computer
system on which aspects of the present disclosure can be
implemented.
DETAILED DESCRIPTION
[0025] Exemplary aspects are described herein in the context of a
system, method, and computer program product for generating tokens
using SMPC compute engines. Those of ordinary skill in the art will
realize that the following description is illustrative only and is
not intended to be in any way limiting. Other aspects will readily
suggest themselves to those skilled in the art having the benefit
of this disclosure. Reference will now be made in detail to
implementations of the example aspects as illustrated in the
accompanying drawings. The same reference indicators will be used
to the extent possible throughout the drawings and the following
description to refer to the same or like items.
[0026] In secure multiparty computation (SMPC), secret sharing
refers to distributing a secret amongst a group of participants,
where each participant is allocated a share of the secret.
Individual shares are of no use on their own, as the secret can
only be reconstructed when a number of shares are combined
together. SMPC frameworks allow multiple parties to jointly compute
a function, such that their inputs remain private, using secret
sharing. More specifically, data is protected by being split into
secret shares, where each party receives a subset of these shares.
Therefore, no individual party can see the real data. For example, the
secret "10" can be split into secret shares "3," "2," and "5,"
whose sum (i.e., 3+2+5) gives 10.
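The "10 = 3 + 2 + 5" example is an instance of additive secret sharing, which can be sketched as follows (the modulus is an illustrative choice; real frameworks fix their own field or ring):

```python
import random

MODULUS = 2**31 - 1  # illustrative public modulus

def split(secret: int, n_shares: int) -> list[int]:
    # All but the last share are uniformly random; the last share is
    # chosen so that all shares sum to the secret modulo MODULUS.
    shares = [random.randrange(MODULUS) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> int:
    # The secret is recovered only when all shares are combined.
    return sum(shares) % MODULUS

assert reconstruct(split(10, 3)) == 10
assert reconstruct([3, 2, 5]) == 10  # the worked example from the text
```

Any proper subset of the shares is uniformly distributed, so an individual participant learns nothing about the secret from its own share.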
[0027] FIG. 1 is a block diagram illustrating system 100 for
generating secret shares of an input value for SMPC compute
engines, in accordance with aspects of the present disclosure. FIG.
1 is an example of a flow which has two sources of data (e.g., node
102, 104). Both nodes may be any type of electronic device that can
store and share data such as computers and servers. One skilled in
the art will appreciate that the methods to be discussed are
applicable when only one node is involved in system 100 and also
when more than two nodes are involved.
[0028] Nodes 102 and 104 may possess plaintext data such as ID 106
and ID 108, respectively. The plaintext data may be identification
numbers of the respective nodes, or be identification numbers of
records in a dataset, and should be kept confidential. This is done
by creating a token for the respective IDs. A token is a
representation of some other data. For example, a token may be a
random number or a series of randomized characters representing
sensitive information. The output from hashing or encrypting may
also be a token.
[0029] Given that the target platform is an SMPC environment, the
SMPC compute engines (e.g., engines A, B, . . . Z) themselves may
be used to securely generate tokens. An SMPC compute engine is a
module on a server where the SMPC protocol is actually computed.
For example, an SMPC compute engine may perform a secure function,
which is a mathematical operation (e.g., determining the average
for a plurality of input numerical values), on a secret share
received from node 102. The server may be accessed via a network
connection (e.g., the Internet). Each SMPC compute engine A, B, . . . Z
may be able to communicate with one another, but will not exchange
information which will reveal secret shares unless specified in the
function. This ensures that the SMPC compute engines individually
do not know any private secret.
[0030] However, in the case where all the SMPC compute engines are
breached, thereby exposing sensitive data associated with ID 106
and ID 108, the tokens themselves should still remain protected.
This is done by having nodes 102 and 104 involved with the token
generation process, and the option of having a password/phrase 118
that is not stored within the compute engines for the tokenization
process.
[0031] Accordingly, node 102 hashes ID 106 using hash generator
114, which may be a module that applies a cryptographic hash
function on an input value. Examples of the cryptographic hash
functions include, but are not limited to, MD5, SHA-1, and
Whirlpool. Hashing is a cryptographic method for taking some
arbitrary length input, and outputting a fixed length, seemingly
random output. Hashing is known as a one-way function, as reversing
the hashing process is very difficult.
[0032] Node 102 may utilize a salt value, namely, salt 110. A salt
value is a combination of characters used in hashing to prevent
pre-generated tables being used to reverse the hash function. While
node 102 is computing the hash of ID 106, node 104 may be hashing
ID 108 (node 102 and node 104 may also hash their respective inputs
at different times). To ensure consistency, when hashing, both hash
generator 114 and 116 must use the same hash function. Furthermore,
both node 102 and node 104 must use the same salt 110. Node 102 may
transmit salt 110 to node 104 via on-premise service node 112A
(received by on-premise service node 112B) or other mediums such as
email 122. On-Premise service nodes 112A and 112B may store salt
110 for future use. If salt 110 is shared over email 122,
on-premise service nodes 112A and 112B are not needed in system
100.
[0033] In one example, referring to node 102, suppose that ID 106
is the mixed order of numbers "2812." Salt 110 may be another mixed
order of numbers "3432." When applying salt 110 to ID 106, the
numbers of salt 110 may be appended to the numbers of ID 106
yielding "28123432." Suppose that subsequent to applying a hash
function via hash generator 114, the result is "74291869." It
should be noted that this example is oversimplified for easier
comprehension. In a real setting, ID 106 and salt 110 may each
include several characters such as numbers, letters, symbols, and
bytes. Furthermore, hashes tend to be longer and more complex
numbers. For example, when using a hash function such as SHA-256 on
the value "28123432," the result is
"6c13285e199a7fdd53cebcea0c86cfeeaf3cea6e7cde83c5846b41cc1fe8df7."
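The salting step in this example can be reproduced directly. SHA-256 is one possible choice of hash function; note that the short 8-digit result "74291869" in the simplified example is illustrative and will not match a real SHA-256 output.

```python
import hashlib

def salted_hash(identifier: str, salt: str) -> str:
    # Append the salt to the ID (e.g., "2812" + "3432" = "28123432"),
    # then hash the combined string.
    return hashlib.sha256((identifier + salt).encode()).hexdigest()

digest = salted_hash("2812", "3432")
assert len(digest) == 64                      # SHA-256 yields 64 hex characters
assert digest == salted_hash("2812", "3432")  # deterministic: both nodes agree
```

Determinism is the key property here: as long as node 102 and node 104 use the same hash function and the same salt, the same plaintext ID always produces the same pre-split value.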
[0034] Nodes 102 and 104 may then split the hashed data input into
a plurality of secret shares, wherein each respective secret share
of the plurality of secret shares is assigned to a respective SMPC
compute engine of a plurality of SMPC compute engines. The
plurality of SMPC compute engines comprises at least two SMPC
compute engines. In some aspects, the number of secret shares in
the plurality of secret shares is equal to the number of SMPC
compute engines to be used. FIG. 1 depicts engines A-Z. However,
more or fewer engines may be utilized. For simplicity, suppose that
only three compute engines are utilized by node 102 and node 104 to
perform secure functions (i.e., SMPC compute engines A, B, and Z).
Nodes 102 and 104
may communicate with the respective SMPC compute engines using a
wide area network (e.g., the Internet) or a local area network
(LAN).
[0035] The actual splitting may be performed by secret share
generators 124 and 126, which are modules that are configured to
split an input value using techniques such as additive secret
sharing, multiplicative secret sharing, etc. Secret share
generators 124 and 126 may further be configured to assign each
respective secret share to a respective SMPC compute engine and
transmit the secret shares according to their assignments (e.g.,
share A1 is transmitted to SMPC compute engine A). Generators 124
and 126 may coordinate to ensure that the same SMPC compute engines
are used (e.g., A, B, and Z) by both nodes.
[0036] In some aspects, each transmission of a secret share to a
respective SMPC compute engine may further comprise identification
information (e.g., node ID, the IP address, MAC address, etc.) of
the other SMPC compute engines that received the other secret
shares. This way, SMPC compute engine A is able to determine that
SMPC compute engines B and Z received the other portions of the
secret (e.g., shares B1, B2, Z1, and Z2).
[0037] Following the overarching example, node 102 may split the
hashed data input "74291869" using a technique such as additive
secret sharing. For example, node 102 may split the hashed data
input into three secret shares (e.g., share A1, share B1, and share
Z1). These shares may be "32150327" as share A1, "11041241" as
share B1, and "31100301" as share Z1--which all add up to
"74291869." Likewise, node 104 may hash ID 108 using the salt 110
value "3432," and may split the hashed value into three secret
shares: share A2, B2, and Z2.
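The additive split used in this example can be sketched as below. Real deployments perform the arithmetic modulo a value fixed by the SMPC protocol; the modulus and helper names here are illustrative assumptions:

```python
import secrets

MODULUS = 2**64  # illustrative; the real protocol fixes its own modulus

def split_additive(value: int, n_shares: int) -> list[int]:
    """Additively split value into n random-looking shares that sum
    back to value (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % MODULUS

# e.g., the hashed data input "74291869" split into shares A1, B1, Z1
shares = split_additive(74291869, 3)
```

No individual share reveals anything about the hashed input; only the sum of all three does.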
[0038] In some aspects, the hashed data inputs may be combined with
a password or a phrase that the data uploader (e.g., a user of node
102 or node 104) manually enters. For example, password 118 may be
shared by node 102 with node 104 via an out-of-system method such
as a phone call, text, or email 122. Suppose that password 118 is
"1010." Prior to secret share generators 124 and 126 splitting the
hashed value, password 118 may be appended by the respective nodes
to the end/beginning of the respective hashed data inputs. For
example, "1010" may applied as an additional message or round from
the hash generated by 114. Alternatively, processes 114 and 118 can
be combined such that the input message to the hash function would
be "281234321010" for ID 106. This adds a level of complexity to
the data security to prevent a brute-force attack on the hashed
data input.
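The combined-input variant of paragraph [0038] can be sketched as a single hash over the concatenated ID, salt, and password. This assumes plain concatenation; the disclosure also permits applying the password as an extra round of the hash function:

```python
import hashlib

def hash_with_salt_and_password(data_id: str, salt: str, password: str) -> str:
    """Append both the shared salt and the out-of-band password before
    hashing, so brute-forcing the token also requires the password."""
    # "2812" + "3432" + "1010" -> "281234321010" as the hash input
    return hashlib.sha256((data_id + salt + password).encode()).hexdigest()
```

Because password 118 is exchanged out-of-system (phone, text, or email), an attacker who compromises the SMPC platform alone still lacks one input to the token derivation.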
[0039] FIG. 2 is a block diagram illustrating system 200 for
generating tokens using the SMPC compute engines, in accordance
with aspects of the present disclosure. In system 200, the SMPC
compute engines may apply any deterministic function such as
hashing and encrypting, or any combination of deterministic
functions.
[0040] In some aspects, the SMPC compute engines may take the
hashing approach. More specifically, a respective SMPC compute
engine is configured to hash the respective secret share with a
secret salt value unique to the respective SMPC compute engine
and/or unknown to any other SMPC compute engine of the plurality of
SMPC compute engines (as it exists as a secret share). Suppose that
only node 102 is in system 200. Referring to the overarching
example, "32150327" as share A1 is sent to SMPC compute engine A,
"11041241" as share B1 is sent to SMPC compute engine B, and
"31100301" as share Z1 is sent to SMPC compute engine Z. Each SMPC
compute engine may possess a unique secret share salt value (i.e.,
secret salt A, B, Z). For example, engine A may have a secret salt
value of "1234," engine B may have "6545," and engine Z may have
"89482." It should be noted that the secret salt values shown above
are simplified, but in reality may be any arbitrary length and any
combination of letters, numbers, symbols, and bytes. Considering
the perspective of engine A, secret salt A may be appended to the
received hash value yielding "321503271234." The shares with
appended salts A1, B1, and Z1 may be jointly hashed with the hash
generators 202A, 202B, and 202Z using an SMPC protocol.
Alternatively, the values A1, B1, and Z1 could be the output from a
hash round in hash generator 114, where the secret salt is the
input for the final rounds in the hash function. In some aspects,
each of the SMPC compute engines of the plurality of SMPC compute
engines use the same hashing function, and communicate with each of
the other hash generators to securely compute the hash function
using SMPC protocols. For example, each hash generation module
202A, 202B, and 202Z may communicate with one another to select a
single hashing function. Suppose that subsequent to each SMPC
compute engine collectively computing the hash value using their
respective secret salt value, the respective results are
"324351253," "87585323," and "234324320." Each respective SMPC
compute engine may transmit their calculated value to the other
SMPC compute engines and may receive a plurality of hashed secret
shares from remaining SMPC compute engines of the plurality of SMPC
compute engines. For example, engine A may transmit "324351253" to
engines B and Z, and may receive "87585323" and "234324320" from
engines B and Z, respectively. Each SMPC compute engine may then
generate a token, wherein the token is a combination of the hashed
respective secret share and the plurality of hashed secret shares.
For example, each SMPC compute engine may add up the three values
to get the sum "646260896." This sum is the token that represents
ID 106. Tokens A, B, and Z are all the same value and are used by
SMPC compute engines A, B, and Z to identify ID 106.
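The exchange-and-combine step of paragraph [0040] can be simulated in plaintext as below. Note that this is only an illustration of the data flow: in the actual system each hash is computed jointly under SMPC protocols, so no single engine ever sees another engine's salt or share in the clear. All names and the modulus are illustrative assumptions:

```python
import hashlib

MODULUS = 2**64

def engine_hash(share: int, secret_salt: str) -> int:
    # Append the engine's secret salt to its share and hash the result.
    digest = hashlib.sha256((str(share) + secret_salt).encode()).hexdigest()
    return int(digest, 16) % MODULUS

shares = {"A": 32150327, "B": 11041241, "Z": 31100301}  # shares A1, B1, Z1
salts = {"A": "1234", "B": "6545", "Z": "89482"}        # secret salts A, B, Z

# Each engine broadcasts its hashed share; every engine then sums the
# same three values, so all engines derive an identical token.
hashed = {e: engine_hash(s, salts[e]) for e, s in shares.items()}
token = sum(hashed.values()) % MODULUS
```

Summing is one possible combination; the disclosure also allows appending the hashed shares.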
[0041] Suppose that more than one node is using system 200. For
example, node 104 may also transmit secret shares of its determined
hash value to engines A, B, and Z. In some aspects, the SMPC
compute engines each generate an individual token for each value
submitted by the nodes. For example, just as the process for
generating a token for ID 106 is described above, the SMPC compute
engines may generate another token using solely the secret shares
of ID 108. This results in a second token solely for ID 108.
[0042] In some aspects, one token is generated for each of the
plurality of inputs submitted by a plurality of nodes. More
specifically, in response to determining that node 102 and node 104
are both seeking to compute a secure function together, system 200
may join the individually received datasets from each node. For
example, engine A may receive share A1 from node 102 and share A2
from node 104. Engine A may then add these values, or append them
one after another, depending on their token. Following the
overarching example of multiple companies determining the average
salary of their employees, suppose that a dataset of a first
company is:
TABLE-US-00001
  123-45-6789   $90k
  987-65-4321   $40k
  001-02-0003   $85k
[0043] Suppose that a dataset of a second company is:
TABLE-US-00002
  004-05-0006   $87k
  987-65-4321   $39k
  991-992-9993  $95k
[0044] The respective datasets indicate the salary provided to each
employee, where each employee is identified by their social
security number. The social security numbers should be
confidential, and thus tokens should be generated for each
employee. It should be noted that although there are six entries,
only five employees are actually listed. The employee with the
social security number 987-65-4321 works for both companies and has
a combined salary of $79k. Thus, in general, if any tokens from
node 102's dataset match any tokens from node 104's dataset, their
secret shares may be added together (i.e., adding the salaries
from multiple companies for the same social security number);
otherwise, they are appended as separate rows/columns in the
dataset, forming a dataset from both node 102 and node 104 without
revealing any personally identifiable information.
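The join rule described above (add values on a token match, append otherwise) can be sketched in plaintext. The token strings and dollar amounts are stand-ins; on the real engines both would exist only as secret shares:

```python
def join_by_token(ds_a: dict, ds_b: dict) -> dict:
    """Tokens stand in for social security numbers; matching tokens mean
    the same individual, so their values (salaries) are added; non-matching
    tokens are carried over as separate rows."""
    joined = dict(ds_a)
    for token, salary in ds_b.items():
        joined[token] = joined.get(token, 0) + salary
    return joined

company_1 = {"tok_1": 90, "tok_2": 40, "tok_3": 85}  # salaries in $k
company_2 = {"tok_4": 87, "tok_2": 39, "tok_5": 95}
joined = join_by_token(company_1, company_2)
# "tok_2" (the shared employee) appears in both datasets: 40 + 39 = 79
```

Six input rows collapse to five joined rows, matching the example of the employee who works for both companies.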
[0045] In some aspects, the SMPC compute engines may take the encryption approach.
In this case, each respective SMPC compute engine is configured to
encrypt the respective secret share (e.g., share A1 and A2) using
an encryption scheme (e.g., Advanced Encryption Standard (AES)),
wherein an initialization vector and key of the encryption scheme
are in a secret share unique to the respective SMPC compute engine.
For example, instead of secret salt A, B, and Z, each engine may
use an initialization vector and key (e.g., vector and key A, B,
and Z). Subsequent to encrypting each share collectively using SMPC
protocols at an engine, the engine may transmit the encrypted
secret share to the remaining SMPC compute engines and receive a
plurality of encrypted secret shares from the remaining SMPC
compute engines of the plurality of SMPC compute engines. Each
respective SMPC compute engine may then generate the token, wherein
the token is a combination (e.g., by adding or appending) of the
encrypted respective secret share and the plurality of encrypted
secret shares. For an asymmetric encryption scheme, the public key
can exist in secret shares, with the private key discarded.
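The structure of the encryption approach can be sketched as below. AES is not in the Python standard library, so this sketch substitutes a deterministic keyed function (HMAC-SHA256) for the fixed-key, fixed-IV AES step; it illustrates only the per-engine-keyed-transform-then-combine flow, and the key values are invented:

```python
import hashlib
import hmac

MODULUS = 2**64

def engine_encrypt(share: int, key: bytes) -> int:
    # Deterministic keyed transform of the share; stands in here for AES
    # with a fixed initialization vector and key held by each engine.
    mac = hmac.new(key, str(share).encode(), hashlib.sha256).hexdigest()
    return int(mac, 16) % MODULUS

shares = {"A": 32150327, "B": 11041241, "Z": 31100301}
keys = {"A": b"key-A", "B": b"key-B", "Z": b"key-Z"}  # vector/key A, B, Z

encrypted = {e: engine_encrypt(s, keys[e]) for e, s in shares.items()}
token = sum(encrypted.values()) % MODULUS  # combined (e.g., by adding)
```

Determinism is the essential property: the same input share under the same key always yields the same ciphertext, so the combined token is repeatable.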
[0046] Once the inputted values to system 200 (e.g., ID 106 and
108) have been hashed, encrypted, or both, the values may then be
revealed to each party (e.g., node 102 and 104). Therefore, each
party gets the same token, which is useful when data needs to be in
the same order for processing in an SMPC environment. Alternatively,
when uploading datasets into an SMPC environment such as system
200, the tokens may be revealed and stored with the dataset on SMPC
compute engines A, B, and Z, and thus not revealed back to node
102 or 104. In particular, to run a secure query over some input
data (e.g., determine an average salary of a plurality of employees
across multiple companies), which has a column as a token for
searching or lookup, the order in which the matching rows (or
columns depending on how the dataset is divided) are fed into the
SMPC secure process is critical. Having the same token on each SMPC
compute engine means the order of rows (or columns) can be
preserved by sorting on the token. For example, if the same token
is received for a particular dataset, the SMPC compute engines may
proceed to perform a secure function and combine the results. For
example, if two sources of data node 102 and node 104 contain
separate parts of information on individuals, such as salary and
property valuation respectively, this can be joined together to
form a larger dataset containing an individual's salary and value
of their house if they own one. However, if the tokens are
different from each node for the same individual in the two
datasets, then the corresponding data values (e.g., salaries and
property valuation) are not going to be aligned in the joined
dataset.
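The row-alignment property described here can be illustrated with plaintext stand-ins. The token strings and values are invented for illustration; on real engines the salaries and valuations would be held only as secret shares, with the token column in the clear:

```python
# Engine-side view: each engine holds values keyed by the same token;
# sorting on the token gives every engine an identical row order, so
# the secret-shared columns stay aligned across all engines.
salaries = {"tok_b": 40, "tok_a": 90}      # from node 102 (values in $k)
valuations = {"tok_a": 500, "tok_b": 300}  # from node 104 (values in $k)

order = sorted(set(salaries) & set(valuations))
aligned = [(t, salaries[t], valuations[t]) for t in order]
```

If the two nodes had produced different tokens for the same individual, the intersection would miss that row and the joined columns would misalign, as the paragraph warns.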
[0047] Some features of the systems and methods of the present
disclosure thus include that no single party of the SMPC network
has knowledge of the salt/keys used during the token generation.
Furthermore, an input (salt 110 or password 118) that is not stored
within the SMPC platform is required to generate the tokens.
Therefore, in the event of a breach on all SMPC compute engines
within the SMPC platform, parts of the token generation process are
still unknown, protecting the personally identifiable information.
Because the methods are built on an SMPC platform, trusted and
uncompromised nodes can detect some level of malicious activity if
another node has been compromised.
[0048] It should be noted that using a tokenizer based on an
initial hashing function means that there is a possibility of a
collision occurring (same token generated) between two different
inputs. If there were only a single party performing the
tokenization, then a random mapping could be used to guarantee
collisions do not occur. However, with multiple data sources
performing part of the tokenization process, handling collisions is
important. For example, if two pieces of information are
incorrectly joined together (e.g., when computing a secure
function) based on the colliding token, then results based upon the
join are invalid--without knowledge of the fact that they are
invalid.
[0049] The issue depends on the type of input data. For example, in
Singapore, if a personal identification number or a phone number is
being tokenized, then there is a near-zero chance of a collision,
because the input space is small. Email addresses also have a
maximum size and limited character set, so the probability of a
collision is low. In cases where it is a possibility, even though
slim, two or more separate tokens may be generated such that they
use different salts and/or algorithms, thus reducing the chance
that the token pair is the same.
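The pair-of-tokens mitigation can be sketched as below; the salts and helper name are illustrative assumptions:

```python
import hashlib

def token_pair(data_id: str, salt_1: str, salt_2: str) -> tuple[str, str]:
    """Derive two independent tokens per input. A collision on one salt
    is vanishingly unlikely to repeat under the second salt, so comparing
    the pair rather than a single token reduces false matches."""
    t1 = hashlib.sha256((data_id + salt_1).encode()).hexdigest()
    t2 = hashlib.sha256((data_id + salt_2).encode()).hexdigest()
    return (t1, t2)
```

Two inputs are treated as the same individual only if both tokens of the pair match, which squares the already-small collision probability.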
[0050] Any tokenization method in which the tokens need to be
generated by different parties opens the process up to a wider
range of attacks. As mentioned previously, some data inputs have a
limited input space. This makes brute forcing (i.e., generating all
possible tokens) within a reasonable timeframe feasible. Because
of this, if someone were to gain access to the first salt value,
and access to one of the compute nodes, they could input desirable
values into the system, and record the output tokens, allowing them
to find the data associated with the correct token. However, because
the actual meaningful data is split into secret shares, the
infiltrator would then need to gain access to all the SMPC compute
engines. Similarly, if someone gained access to all the SMPC compute
engines, but did not know the password/phrase used in the initial
hashing process, then the tokens cannot be brute forced. This is
the advantage of using a distributed system for token
generation.
[0051] Because all the SMPC engines perform the same operations to
compute the secret hash and/or encryption, if the other SMPC
engines notice that the tokens being requested are not fed into a
meaningful dataset or function, the engines may raise an alert
that some suspicious activity is taking place. This
cannot be said of other conventional tokenization methods or
services, because they only create tokens; whereas here, the system
has broader knowledge and can detect some levels of malicious
behaviour. Furthermore, because the tokens need to be generated
within the SMPC environment, the performance makes it harder to
quickly generate all possible tokens.
[0052] FIG. 3 illustrates a flow diagram of method 300 for
generating a token for one source using SMPC compute engines, in
accordance with aspects of the present disclosure. At 302, a node
(e.g., node 102) hashes a data input (e.g., ID 106) with a salt
value (e.g., salt 110). At 304, the node splits the hashed data
input into a plurality of secret shares (e.g., share A1, B1, . . .
Z1), wherein each respective secret share of the plurality of
secret shares is assigned to a respective SMPC compute engine
(e.g., SMPC compute engine A, B, Z) of a plurality of SMPC compute
engines.
[0053] At 306, the respective SMPC compute engines (e.g., engine A)
collectively hash the respective secret shares (e.g., share A1)
with a secret salt value using SMPC protocols running on each of
the plurality of SMPC compute engines. At 308, the respective SMPC
compute engine transmits its respective hashed secret share to the
other SMPC compute engines of the plurality of SMPC compute
engines. At 310, the respective SMPC compute engine receives a
plurality of hashed secret shares from the remaining SMPC compute
engines (e.g., engine B, C, Z) of the plurality of SMPC compute
engines. At 312, the respective SMPC compute engine generates a
token, wherein the token is a combination of the hashed respective
secret share and the plurality of hashed secret shares.
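Steps 302 through 312 of method 300 can be sketched end-to-end as a plaintext simulation. In the real system step 306 runs jointly under SMPC protocols across the engines; here the per-engine work is simply computed locally, and the modulus, salts, and helper names are illustrative assumptions:

```python
import hashlib
import secrets

MODULUS = 2**64

def sha_int(text: str) -> int:
    return int(hashlib.sha256(text.encode()).hexdigest(), 16) % MODULUS

# 302: the node hashes the data input ("2812") with the salt ("3432")
hashed_input = sha_int("2812" + "3432")

# 304: split the hashed input into three additive secret shares
shares = [secrets.randbelow(MODULUS) for _ in range(2)]
shares.append((hashed_input - sum(shares)) % MODULUS)

# 306-312: each engine salts and hashes its share, the hashed shares are
# exchanged (308, 310), and every engine sums them into the same token (312)
engine_salts = ["1234", "6545", "89482"]
hashed_shares = [sha_int(str(s) + salt) for s, salt in zip(shares, engine_salts)]
token = sum(hashed_shares) % MODULUS
```

Because the shares are random, the token differs per run in this sketch; in the disclosed system the engines' fixed secret salts and a deterministic split keep tokens repeatable across uploads.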
[0054] FIG. 4 illustrates a flow diagram of method 400 for
generating a token for multiple sources using SMPC compute engines,
in accordance with aspects of the present disclosure. At 402, a
node (e.g., node 102) hashes a data input (e.g., ID 106) with a
salt value (e.g., salt 110). At 404, the node splits the hashed
data input into a plurality of secret shares (e.g., share A1, B1,
Z1), wherein each respective secret share of the plurality of
secret shares is assigned to a respective SMPC compute engine
(e.g., SMPC compute engine A, B, Z) of a plurality of SMPC compute
engines. At 406, the node transmits the salt value (e.g., via
on-premise service 112); note that this process may be manual or
automatic.
[0055] At 408, at least one other node (e.g., node 104) receives
the salt value (e.g., salt 110). At 410, the at least one other
node hashes at least one other data input (e.g., ID 108) with the
salt value. At 412, the at least one other node splits the hashed
at least one other data input into at least one other plurality of
secret shares (e.g., share A2, B2, . . . , Z2), wherein each
respective secret share of the at least one other plurality of
secret shares is assigned to a respective SMPC compute engine
(e.g., SMPC compute engine A, B, Z) of the plurality of SMPC
compute engines.
[0056] At 414, each respective SMPC compute engine (e.g., SMPC
compute engine A) jointly hashes the respective secret share from
the plurality of secret shares (e.g., from 404) and then the
respective secret share from the at least one other plurality of
secret shares (e.g., from 412) with a secret salt value β
(i.e., the same secret shared salt value is used). At 416, each
respective SMPC compute engine transmits the respective hashed
secret share to the other SMPC compute engines of the plurality of
SMPC compute engines. At 418, each respective SMPC compute engine
receives another plurality of hashed secret shares from the
remaining SMPC compute engines (e.g., SMPC compute engine B, C, Z)
of the plurality of SMPC compute engines. At 420, each respective
SMPC compute engine generates the token of each data input, wherein
each token is a combination of the plurality of hashed secret
shares, irrespective of the data source.
[0057] FIG. 5 is a block diagram illustrating a computer system 20
on which aspects of systems and methods for generating tokens using
SMPC compute engines may be implemented in accordance with an
exemplary aspect. The computer system 20 can be in the form of
multiple computing devices, or in the form of a single computing
device, for example, a desktop computer, a notebook computer, a
laptop computer, a mobile computing device, a smart phone, a tablet
computer, a server, a mainframe, an embedded device, and other
forms of computing devices.
[0058] As shown, the computer system 20 includes a central
processing unit (CPU) 21, a system memory 22, and a system bus 23
connecting the various system components, including the memory
associated with the central processing unit 21. The system bus 23
may comprise a bus memory or bus memory controller, a peripheral
bus, and a local bus that is able to interact with any other bus
architecture. Examples of the buses may include PCI, ISA,
PCI-Express, HyperTransport™, InfiniBand™, Serial ATA,
I²C, and other suitable interconnects. The central processing
unit 21 (also referred to as a processor) can include a single or
multiple sets of processors having single or multiple cores. The
processor 21 may execute computer-executable code
implementing the techniques of the present disclosure. For example,
any of the commands/steps discussed in FIGS. 1-4 may be performed by
processor 21. The system memory 22 may be any memory for storing
data used herein and/or computer programs that are executable by
the processor 21. The system memory 22 may include volatile memory
such as a random access memory (RAM) 25 and non-volatile memory
such as a read only memory (ROM) 24, flash memory, etc., or any
combination thereof. The basic input/output system (BIOS) 26 may
store the basic procedures for transfer of information between
elements of the computer system 20, such as those at the time of
loading the operating system with the use of the ROM 24.
[0059] The computer system 20 may include one or more storage
devices such as one or more removable storage devices 27, one or
more non-removable storage devices 28, or a combination thereof.
The one or more removable storage devices 27 and non-removable
storage devices 28 are connected to the system bus 23 via a storage
interface 32. In an aspect, the storage devices and the
corresponding computer-readable storage media are power-independent
modules for the storage of computer instructions, data structures,
program modules, and other data of the computer system 20. The
system memory 22, removable storage devices 27, and non-removable
storage devices 28 may use a variety of computer-readable storage
media. Examples of computer-readable storage media include machine
memory such as cache, SRAM, DRAM, zero capacitor RAM, twin
transistor RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS,
PRAM; flash memory or other memory technology such as in solid
state drives (SSDs) or flash drives; magnetic cassettes, magnetic
tape, and magnetic disk storage such as in hard disk drives or
floppy disks; optical storage such as in compact disks (CD-ROM) or
digital versatile disks (DVDs); and any other medium which may be
used to store the desired data and which can be accessed by the
computer system 20.
[0060] The system memory 22, removable storage devices 27, and
non-removable storage devices 28 of the computer system 20 may be
used to store an operating system 35, additional program
applications 37, other program modules 38, and program data 39. The
computer system 20 may include a peripheral interface 46 for
communicating data from input devices 40, such as a keyboard,
mouse, stylus, game controller, voice input device, touch input
device, or other peripheral devices, such as a printer or scanner
via one or more I/O ports, such as a serial port, a parallel port,
a universal serial bus (USB), or other peripheral interface. A
display device 47 such as one or more monitors, projectors, or
integrated display, may also be connected to the system bus 23
across an output interface 48, such as a video adapter. In addition
to the display devices 47, the computer system 20 may be equipped
with other peripheral output devices (not shown), such as
loudspeakers and other audiovisual devices.
[0061] The computer system 20 may operate in a network environment,
using a network connection to one or more remote computers 49. The
remote computer (or computers) 49 may be local computer
workstations or servers comprising most or all of the elements
described above in relation to computer system 20. Other devices
may also be present in the computer
network, such as, but not limited to, routers, network stations,
peer devices or other network nodes. The computer system 20 may
include one or more network interfaces 51 or network adapters for
communicating with the remote computers 49 via one or more networks
such as a local-area computer network (LAN) 50, a wide-area
computer network (WAN), an intranet, and the Internet. Examples of
the network interface 51 may include an Ethernet interface, a Frame
Relay interface, SONET interface, and wireless interfaces.
[0062] Aspects of the present disclosure may be a system, a method,
and/or a computer program product. The computer program product may
include a computer readable storage medium (or media) having
computer readable program instructions thereon for causing a
processor to carry out aspects of the present disclosure.
[0063] The computer readable storage medium can be a tangible
device that can retain and store program code in the form of
instructions or data structures that can be accessed by a processor
of a computing device, such as the computing system 20. The
computer readable storage medium may be an electronic storage
device, a magnetic storage device, an optical storage device, an
electromagnetic storage device, a semiconductor storage device, or
any suitable combination thereof. By way of example, such
computer-readable storage medium can comprise a random access
memory (RAM), a read-only memory (ROM), EEPROM, a portable compact
disc read-only memory (CD-ROM), a digital versatile disk (DVD),
flash memory, a hard disk, a portable computer diskette, a memory
stick, a floppy disk, or even a mechanically encoded device such as
punch-cards or raised structures in a groove having instructions
recorded thereon. As used herein, a computer readable storage
medium is not to be construed as being transitory signals per se,
such as radio waves or other freely propagating electromagnetic
waves, electromagnetic waves propagating through a waveguide or
transmission media, or electrical signals transmitted through a
wire.
[0064] Computer readable program instructions described herein can
be downloaded to respective computing devices from a computer
readable storage medium or to an external computer or external
storage device via a network, for example, the Internet, a local
area network, a wide area network and/or a wireless network. The
network may comprise copper transmission cables, optical
transmission fibers, wireless transmission, routers, firewalls,
switches, gateway computers and/or edge servers. A network
interface in each computing device receives computer readable
program instructions from the network and forwards the computer
readable program instructions for storage in a computer readable
storage medium within the respective computing device.
[0065] Computer readable program instructions for carrying out
operations of the present disclosure may be assembly instructions,
instruction-set-architecture (ISA) instructions, machine
instructions, machine dependent instructions, microcode, firmware
instructions, state-setting data, or either source code or object
code written in any combination of one or more programming
languages, including an object oriented programming language, and
conventional procedural programming languages. The computer
readable program instructions may execute entirely on the user's
computer, partly on the user's computer, as a stand-alone software
package, partly on the user's computer and partly on a remote
computer or entirely on the remote computer or server. In the
latter scenario, the remote computer may be connected to the user's
computer through any type of network, including a LAN or WAN, or
the connection may be made to an external computer (for example,
through the Internet). In some embodiments, electronic circuitry
including, for example, programmable logic circuitry,
field-programmable gate arrays (FPGA), or programmable logic arrays
(PLA) may execute the computer readable program instructions by
utilizing state information of the computer readable program
instructions to personalize the electronic circuitry, in order to
perform aspects of the present disclosure.
[0066] In various aspects, the systems and methods described in the
present disclosure can be addressed in terms of modules. The term
"module" as used herein refers to a real-world device, component,
or arrangement of components implemented using hardware, such as by
an application specific integrated circuit (ASIC) or FPGA, for
example, or as a combination of hardware and software, such as by a
microprocessor system and a set of instructions to implement the
module's functionality, which (while being executed) transform the
microprocessor system into a special-purpose device. A module may
also be implemented as a combination of the two, with certain
functions facilitated by hardware alone, and other functions
facilitated by a combination of hardware and software. In certain
implementations, at least a portion, and in some cases, all, of a
module may be executed on the processor of a computer system.
Accordingly, each module may be realized in a variety of suitable
configurations, and should not be limited to any particular
implementation exemplified herein.
[0067] In the interest of clarity, not all of the routine features
of the aspects are disclosed herein. It would be appreciated that
in the development of any actual implementation of the present
disclosure, numerous implementation-specific decisions must be made
in order to achieve the developer's specific goals, and these
specific goals will vary for different implementations and
different developers. It is understood that such a development
effort might be complex and time-consuming, but would nevertheless
be a routine undertaking of engineering for those of ordinary skill
in the art, having the benefit of this disclosure.
[0068] Furthermore, it is to be understood that the phraseology or
terminology used herein is for the purpose of description and not
of restriction, such that the terminology or phraseology of the
present specification is to be interpreted by the skilled in the
art in light of the teachings and guidance presented herein, in
combination with the knowledge of those skilled in the relevant
art(s). Moreover, it is not intended for any term in the
specification or claims to be ascribed an uncommon or special
meaning unless explicitly set forth as such.
[0069] The various aspects disclosed herein encompass present and
future known equivalents to the known modules referred to herein by
way of illustration. Moreover, while aspects and applications have
been shown and described, it would be apparent to those skilled in
the art having the benefit of this disclosure that many more
modifications than mentioned above are possible without departing
from the inventive concepts disclosed herein.
* * * * *