U.S. patent application number 17/481568, published by the patent office on 2022-03-10, is for discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers. The applicant listed for this patent is D-WAVE SYSTEMS INC. The invention is credited to Jason Rolfe.
United States Patent Application 20220076131
Kind Code: A1
Inventor: Rolfe; Jason
Publication Date: March 10, 2022
Application Number: 17/481568
Family ID: 58050832
DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE
LEARNING USING ADIABATIC QUANTUM COMPUTERS
Abstract
A computational system can include digital circuitry and analog
circuitry, for instance a digital processor and a quantum
processor. The quantum processor can operate as a sample generator
providing samples. Samples can be employed by the digital processor in implementing various machine learning techniques. For example, the computational system can perform unsupervised learning over an input space, for example via a discrete variational auto-encoder, and attempt to maximize the log-likelihood of an
observed dataset. Maximizing the log-likelihood of the observed
dataset can include generating a hierarchical approximating
posterior.
Inventors: Rolfe; Jason (Vancouver, CA)
Applicant: D-WAVE SYSTEMS INC. (Burnaby, CA)
Family ID: 58050832
Appl. No.: 17/481568
Filed: September 22, 2021
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
15753666 | Feb 20, 2018 | 11157817
PCT/US2016/047627 | Aug 18, 2016 |
17481568 | |
62206974 | Aug 19, 2015 |
62268321 | Dec 16, 2015 |
62307929 | Mar 14, 2016 |
Current U.S. Class: 1/1
Current CPC Class: G06N 3/08 20130101; G06N 3/084 20130101; G06N 3/086 20130101; G06N 3/0445 20130101; G06N 3/0454 20130101; G06N 3/088 20130101; G06N 3/0472 20130101; G06N 10/00 20190101
International Class: G06N 3/08 20060101 G06N003/08; G06N 10/00 20060101 G06N010/00; G06N 3/04 20060101 G06N003/04
Claims
1-38. (canceled)
39. A method of unsupervised learning by a computational system,
the method executed by circuitry including at least one processor
and comprising: determining by the circuitry a first approximating
posterior distribution over at least one group of a set of discrete
random variables; sampling by the circuitry from at least one group
of a set of supplementary continuous random variables using the
first approximating posterior distribution over the at least one
group of the set of discrete random variables to generate one or
more samples, wherein a transforming distribution comprises a
conditional distribution over the set of supplementary continuous
random variables, conditioned on the at least one group of a set of
discrete random variables; determining by the circuitry a second
approximating posterior distribution and a first prior
distribution, the first prior distribution over at least one layer
of a set of continuous variables; sampling by the circuitry from
the second approximating posterior distribution; determining by the
circuitry an auto-encoding loss on an input space comprising
discrete or continuous variables, the auto-encoding loss
conditioned on the one or more samples; determining by the
circuitry a first KL-divergence, or at least an approximation
thereof, between the second approximating posterior distribution
and the first prior distribution; determining by the circuitry a
second KL-divergence, or at least an approximation thereof, between
the first approximating posterior distribution and a second prior
distribution, the second prior distribution over the set of
discrete random variables; and backpropagating by the circuitry a
sum of the first and the second KL-divergence and the auto-encoding
loss on the input space conditioned on the one or more samples.
40. The method of claim 39 wherein the auto-encoding loss is a
log-likelihood.
41. A method of unsupervised learning by a computational system,
the method executed by circuitry including at least one processor
and comprising: determining by the circuitry a first approximating
posterior distribution over a first group of discrete random
variables conditioned on an input space comprising discrete or
continuous variables; sampling by the circuitry from a first group
of supplementary continuous random variables based on the first
approximating posterior distribution; determining by the circuitry
a second approximating posterior distribution over a second group
of discrete random variables conditioned on the input space and
samples from the first group of supplementary continuous random
variables; sampling by the circuitry from a second group of
supplementary continuous random variables based on the second
approximating posterior distribution; determining by the circuitry
a third approximating posterior distribution and a first prior
distribution over a first layer of additional continuous random
variables, the third approximating posterior distribution
conditioned on the input space, samples from at least one of the
first and the second group of supplementary continuous random
variables, and the first prior distribution conditioned on samples
from at least one of the first and the second group of
supplementary continuous random variables; sampling by the
circuitry from the first layer of additional continuous random
variables based on the third approximating posterior distribution;
determining by the circuitry a fourth approximating posterior
distribution and a second prior distribution over a second layer of
additional continuous random variables, the fourth approximating
posterior distribution conditioned on the input space, samples from
at least one of the first and the second group of supplementary
continuous random variables, samples from the first layer of
additional continuous random variables, and the second prior
distribution conditioned on at least one of samples from at least
one of the first and the second group of supplementary continuous
random variables, and samples from the first layer of additional
continuous random variables; determining by the circuitry a first
gradient of a KL-divergence, or at least a stochastic approximation
thereof, between the third approximating posterior distribution and
the first prior distribution with respect to the third
approximating posterior distribution and the first prior
distribution; determining by the circuitry a second gradient of a
KL-divergence, or at least a stochastic approximation thereof,
between the fourth approximating posterior distribution and the
second prior distribution with respect to the fourth approximating
posterior distribution and the second prior distribution;
determining by the circuitry a third gradient of a KL-divergence,
or at least a stochastic approximation thereof, between an
approximating posterior distribution over a third group of discrete
random variables and a third prior distribution with respect to the
approximating posterior distribution over the third group of
discrete random variables and the third prior distribution, wherein
the approximating posterior distribution over the third group of
discrete random variables is a combination of the first
approximating posterior distribution over the first group of
discrete random variables, and the second approximating posterior
distribution over the second group of discrete random variables;
and backpropagating by the circuitry the first, the second and the
third gradients of the KL-divergence to the input space.
42. The method of claim 41 wherein determining by the circuitry a
third gradient of a KL-divergence, or at least a stochastic
approximation thereof, between an approximating posterior
distribution over the third group of discrete random variables and
a third prior distribution with respect to the approximating
posterior distribution over the third group of discrete random
variables and the third prior distribution comprises determining by
the circuitry a third gradient of a KL-divergence, or at least a
stochastic approximation thereof, between an approximating
posterior distribution over the third group of discrete random
variables and a third prior distribution with respect to the
approximating posterior distribution over the third group of
discrete random variables and the third prior distribution, the
third prior distribution comprising a restricted Boltzmann
machine.
43. The method of claim 39 wherein determining by the circuitry a
first KL-divergence comprises computing by the circuitry a loss
function analytically.
44. The method of claim 39 wherein determining by the circuitry a
first KL-divergence comprises estimating by the circuitry a loss
function stochastically.
45. The method of claim 39 wherein determining by the circuitry a
second KL-divergence comprises computing by the circuitry a loss
function analytically.
46. The method of claim 39 wherein determining by the circuitry a
second KL-divergence comprises estimating by the circuitry a loss
function stochastically.
47. The method of claim 39 wherein determining by the circuitry a
second approximating posterior distribution and a first prior
distribution, the first prior distribution over at least one layer
of a set of continuous variables comprises determining by the
circuitry a second approximating posterior distribution and a first
prior distribution, the first prior distribution comprising a
restricted Boltzmann machine.
48. The method of claim 39 wherein determining by the circuitry a
second KL-divergence, or at least an approximation thereof, between
the first approximating posterior distribution and a second prior
distribution, the second prior distribution over the set of discrete random variables comprises determining by the circuitry
a second KL-divergence, or at least an approximation thereof,
between the first approximating posterior distribution and a second
prior distribution, the second prior comprising a restricted
Boltzmann machine.
49. The method of claim 39 wherein sampling by the circuitry from
the second approximating posterior distribution includes at least
one of generating samples by the circuitry or causing samples to be
generated by a digital processor.
50. The method of claim 39 wherein sampling by the circuitry from
the second approximating posterior distribution includes at least
one of generating samples by the circuitry or causing samples to be
generated by a quantum processor.
51. The method of claim 41 wherein sampling by the circuitry from a
first group of supplementary continuous variables based on the
first approximating posterior distribution includes at least one of
generating samples by the circuitry or causing samples to be
generated by one of a digital processor and a quantum
processor.
52. The method of claim 41 wherein sampling by the circuitry from a
second group of supplementary continuous variables based on the
second approximating posterior distribution includes at least one
of generating samples by the circuitry or causing samples to be
generated by one of a digital processor and a quantum
processor.
53. The method of claim 41 wherein sampling by the circuitry from
the first layer of additional continuous random variables based on
the third approximating posterior distribution includes at least
one of generating samples by the circuitry or causing samples to be
generated by one of a digital processor and a quantum
processor.
54. The method of claim 41 wherein determining by the circuitry a
third approximating posterior distribution and a first prior
distribution over a first layer of additional continuous random
variables comprises determining by the circuitry a third
approximating posterior distribution and a first prior distribution
over a first layer of additional continuous random variables, the
first prior distribution comprising a restricted Boltzmann
machine.
55. The method of claim 41 wherein determining by the circuitry a
fourth approximating posterior distribution and a second prior
distribution over a second layer of additional continuous random
variables comprises determining by the circuitry a fourth
approximating posterior distribution and a second prior
distribution over a second layer of additional continuous random
variables, the second prior comprising a restricted Boltzmann
machine.
Description
BACKGROUND
Field
[0001] The present disclosure generally relates to machine
learning.
Machine Learning
[0002] Machine learning relates to methods and circuitry that can
learn from data and make predictions based on data. In contrast to
methods or circuitry that follow static program instructions,
machine learning methods and circuitry can include deriving a model
from example inputs (such as a training set) and then making
data-driven predictions.
[0003] Machine learning is related to optimization. Some problems
can be expressed in terms of minimizing a loss function on a
training set, where the loss function describes the disparity
between the predictions of the model being trained and observable
data.
[0004] Machine learning tasks can include unsupervised learning,
supervised learning, and reinforcement learning. Approaches to
machine learning include, but are not limited to, decision trees,
linear and quadratic classifiers, case-based reasoning, Bayesian
statistics, and artificial neural networks.
[0005] Machine learning can be used in situations where explicit
approaches are considered infeasible. Example application areas
include optical character recognition, search engine optimization,
and computer vision.
Quantum Processor
[0006] A quantum processor is a computing device that can harness
quantum physical phenomena (such as superposition, entanglement,
and quantum tunneling) unavailable to non-quantum devices. A
quantum processor may take the form of a superconducting quantum
processor. A superconducting quantum processor may include a number
of qubits and associated local bias devices, for instance two or
more superconducting qubits. An example of a qubit is a flux qubit.
A superconducting quantum processor may also employ coupling
devices (i.e., "couplers") providing communicative coupling between
qubits. Further details and embodiments of exemplary quantum
processors that may be used in conjunction with the present systems
and devices are described in, for example, U.S. Pat. Nos.
7,533,068; 8,008,942; 8,195,596; 8,190,548; and 8,421,053.
Adiabatic Quantum Computation
[0007] Adiabatic quantum computation typically involves evolving a
system from a known initial Hamiltonian (the Hamiltonian being an
operator whose eigenvalues are the allowed energies of the system)
to a final Hamiltonian by gradually changing the Hamiltonian. A
simple example of an adiabatic evolution is a linear interpolation
between initial Hamiltonian and final Hamiltonian. An example is
given by:
H.sub.e=(1-s)H.sub.i+sH.sub.f
where H.sub.i is the initial Hamiltonian, H.sub.f is the final
Hamiltonian, H.sub.e is the evolution or instantaneous Hamiltonian,
and s is an evolution coefficient which controls the rate of
evolution (i.e., the rate at which the Hamiltonian changes).
[0008] As the system evolves, the evolution coefficient s goes from
0 to 1 such that at the beginning (i.e., s=0) the evolution
Hamiltonian H.sub.e is equal to the initial Hamiltonian H.sub.i and
at the end (i.e., s=1) the evolution Hamiltonian H.sub.e is equal
to the final Hamiltonian H.sub.f. Before the evolution begins, the
system is typically initialized in a ground state of the initial
Hamiltonian H.sub.i and the goal is to evolve the system in such a
way that the system ends up in a ground state of the final
Hamiltonian H.sub.f at the end of the evolution. If the evolution
is too fast, then the system can transition to a higher energy
state, such as the first excited state. As used herein an
"adiabatic" evolution is an evolution that satisfies the adiabatic
condition:
{dot over (s)}|1|dH.sub.e/ds|0|=.delta.g.sup.2(s)
where {dot over (s)} is the time derivative of s, 1|dH.sub.e/ds|0 denotes the matrix element of dH.sub.e/ds taken between the first excited state and the ground state of the system, g(s) is the difference in energy between the ground state and first excited state of the system (also referred to herein as the "gap size") as a function of s, and .delta. is a coefficient much less than 1.
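For illustration only, the evolution Hamiltonian and the gap g(s) can be computed numerically for a small example. The sketch below assumes an arbitrary two-qubit transverse-field initial Hamiltonian and a diagonal Ising problem Hamiltonian (neither taken from this disclosure), forms H.sub.e=(1-s)H.sub.i+sH.sub.f, and evaluates the gap between the two lowest eigenvalues along the evolution.

```python
import numpy as np

# Assumed two-qubit example (not taken from the disclosure): H_e(s) = (1 - s) H_i + s H_f.
sx = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli X
sz = np.array([[1.0, 0.0], [0.0, -1.0]])  # Pauli Z
I2 = np.eye(2)

# Initial Hamiltonian: transverse fields, with an easily prepared ground state.
H_i = -(np.kron(sx, I2) + np.kron(I2, sx))
# Final (problem) Hamiltonian: an arbitrary diagonal Ising instance.
H_f = 0.5 * np.kron(sz, I2) - 0.3 * np.kron(I2, sz) + 1.0 * np.kron(sz, sz)

for s in np.linspace(0.0, 1.0, 5):
    H_e = (1.0 - s) * H_i + s * H_f      # evolution (instantaneous) Hamiltonian
    evals = np.linalg.eigvalsh(H_e)      # sorted eigenvalues
    g = evals[1] - evals[0]              # gap g(s) between ground and first excited state
    print(f"s = {s:.2f}   g(s) = {g:.4f}")
```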
[0009] If the evolution is slow enough that the system is always in
the instantaneous ground state of the evolution Hamiltonian, then
transitions at anti-crossings (when the gap size is smallest) are
avoided. Other evolution schedules, besides the linear evolution
described above, are possible including non-linear evolution,
parametric evolution, and the like. Further details on adiabatic
quantum computing systems, methods, and apparatus are described in,
for example, U.S. Pat. Nos. 7,135,701; and 7,418,283.
Quantum Annealing
[0010] Quantum annealing is a computation method that may be used
to find a low-energy state, typically preferably the ground state,
of a system. Similar in concept to classical simulated annealing,
the method relies on the underlying principle that natural systems
tend towards lower energy states because lower energy states are
more stable. While classical annealing uses classical thermal
fluctuations to guide a system to a low-energy state and ideally
its global energy minimum, quantum annealing may use quantum
effects, such as quantum tunneling, as a source of disordering to
reach a global energy minimum more accurately and/or more quickly
than classical annealing. In quantum annealing, thermal effects and other noise may be present and may interfere with the annealing. The final low-energy state
may not be the global energy minimum. Adiabatic quantum computation
may be considered a special case of quantum annealing for which the
system, ideally, begins and remains in its ground state throughout
an adiabatic evolution. Thus, those of skill in the art will
appreciate that quantum annealing systems and methods may generally
be implemented on an adiabatic quantum computer. Throughout this
specification and the appended claims, any reference to quantum
annealing is intended to encompass adiabatic quantum computation
unless the context requires otherwise.
[0011] Quantum annealing uses quantum mechanics as a source of
disorder during the annealing process. An objective function, such
as an optimization problem, is encoded in a Hamiltonian H.sub.P,
and the algorithm introduces quantum effects by adding a
disordering Hamiltonian H.sub.D that does not commute with H.sub.P.
An example case is:
H.sub.E.varies.A(t)H.sub.D+B(t)H.sub.P,
where A(t) and B(t) are time dependent envelope functions. For
example, A(t) can change from a large value to substantially zero
during the evolution and H.sub.E can be thought of as an evolution
Hamiltonian similar to H.sub.e described in the context of
adiabatic quantum computation above. The disorder is slowly removed
by removing H.sub.D (i.e., by reducing A(t)).
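As an illustration of the envelope functions, the sketch below assumes simple linear envelopes A(t) and B(t) acting on a single qubit, with a transverse-field disordering Hamiltonian and a diagonal problem Hamiltonian; the specific forms are assumptions for demonstration only, not the schedules of any particular processor.

```python
import numpy as np

# Assumed linear envelopes over a total anneal time t_f (illustration only; real
# processors have their own, generally non-linear, schedules).
t_f = 1.0
A = lambda t: 1.0 - t / t_f      # disorder envelope: large at t = 0, ~0 at t = t_f
B = lambda t: t / t_f            # problem envelope: 0 at t = 0, large at t = t_f

sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])
H_D = -sx                        # single-qubit transverse-field (disordering) Hamiltonian
H_P = 0.7 * sz                   # diagonal problem Hamiltonian encoding the objective

for t in np.linspace(0.0, t_f, 5):
    H_E = A(t) * H_D + B(t) * H_P
    e0 = np.linalg.eigvalsh(H_E)[0]
    print(f"t = {t:.2f}   A = {A(t):.2f}   B = {B(t):.2f}   ground energy = {e0:.3f}")
```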
[0012] Thus, quantum annealing is similar to adiabatic quantum
computation in that the system starts with an initial Hamiltonian
and evolves through an evolution Hamiltonian to a final "problem"
Hamiltonian H.sub.P whose ground state encodes a solution to the
problem. If the evolution is slow enough, the system may settle in
the global minimum (i.e., the exact solution), or in a local
minimum close in energy to the exact solution. The performance of
the computation may be assessed via the residual energy (difference
from exact solution using the objective function) versus evolution
time. The computation time is the time required to generate a
residual energy below some acceptable threshold value. In quantum
annealing, H.sub.P may encode an optimization problem and therefore
H.sub.P may be diagonal in the subspace of the qubits that encode
the solution, but the system does not necessarily stay in the
ground state at all times. The energy landscape of H.sub.P may be
crafted so that its global minimum is the answer to the problem to
be solved, and low-lying local minima are good approximations.
[0013] The gradual reduction of disordering Hamiltonian H.sub.D
(i.e., reducing A(t)) in quantum annealing may follow a defined
schedule known as an annealing schedule. Unlike adiabatic quantum
computation where the system begins and remains in its ground state
throughout the evolution, in quantum annealing the system may not
remain in its ground state throughout the entire annealing
schedule. As such, quantum annealing may be implemented as a
heuristic technique, where low-energy states with energy near that
of the ground state may provide approximate solutions to the
problem.
BRIEF SUMMARY
[0014] A method for unsupervised learning over an input space
comprising discrete or continuous variables, and at least a subset
of a training dataset of samples of the respective variables, to
attempt to identify the value of at least one parameter that
increases the log-likelihood of the at least a subset of a training
dataset with respect to a model, the model expressible as a
function of the at least one parameter, the method executed by
circuitry including at least one processor, may be summarized as
including forming a first latent space comprising a plurality of
random variables, the plurality of random variables comprising one
or more discrete random variables; forming a second latent space
comprising the first latent space and a set of supplementary
continuous random variables; forming a first transforming
distribution comprising a conditional distribution over the set of
supplementary continuous random variables, conditioned on the one
or more discrete random variables of the first latent space;
forming an encoding distribution comprising an approximating
posterior distribution over the first latent space, conditioned on
the input space; forming a prior distribution over the first latent
space; forming a decoding distribution comprising a conditional
distribution over the input space conditioned on the set of
supplementary continuous random variables; determining an ordered
set of conditional cumulative distribution functions of the
supplementary continuous random variables, each cumulative
distribution function comprising functions of a full distribution
of at least one of the one or more discrete random variables of the
first latent space; determining an inversion of the ordered set of
conditional cumulative distribution functions of the supplementary
continuous random variables; constructing a first stochastic
approximation to a lower bound on the log-likelihood of the at
least a subset of a training dataset; constructing a second
stochastic approximation to a gradient of the lower bound on the
log-likelihood of the at least a subset of a training dataset; and
increasing the lower bound on the log-likelihood of the at least a
subset of a training dataset based at least in part on the gradient
of the lower bound on the log-likelihood of the at least a subset
of a training dataset.
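As a rough sketch of how the supplementary continuous random variables and the inverted conditional CDF can be used, the example below assumes a spike-and-exponential transforming distribution r(.zeta.|z), one concrete choice among many: r(.zeta.|z=0) is a point mass at .zeta.=0 and r(.zeta.|z=1) is proportional to exp(.beta..zeta.) on [0,1]. Under an approximating posterior probability q=q(z=1|x), the marginal CDF over .zeta. inverts in closed form, so a uniform sample .rho. yields .zeta. as a differentiable function of q, which is what permits backpropagation through the discrete latent variables.

```python
import numpy as np

# Assumed spike-and-exponential transforming distribution (one illustrative choice):
#   r(zeta | z = 0) = point mass at zeta = 0
#   r(zeta | z = 1) = beta * exp(beta * zeta) / (exp(beta) - 1)   on zeta in [0, 1]
# With q = q(z = 1 | x), the marginal CDF is
#   F(zeta) = (1 - q) + q * (exp(beta * zeta) - 1) / (exp(beta) - 1),
# which inverts in closed form, so zeta is a differentiable function of q.
beta = 5.0

def sample_zeta(q, rho):
    """Inverse-CDF sample of the supplementary continuous variable zeta given uniform rho."""
    c = (np.exp(beta) - 1.0) * np.maximum(rho - (1.0 - q), 0.0) / q
    return np.log1p(c) / beta              # zeta = 0 whenever rho <= 1 - q (the spike at zero)

rng = np.random.default_rng(0)
q = 0.7                                     # approximating posterior probability that z = 1
rho = rng.uniform(size=5)                   # uniform noise, independent of q
print(sample_zeta(q, rho))                  # gradients with respect to q flow through this map
```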
[0015] Increasing the lower bound on the log-likelihood of the at
least a subset of a training dataset based at least in part on the
gradient of the lower bound on the log-likelihood of the at least a
subset of a training dataset may include increasing the lower bound
on the log-likelihood of the at least a subset of a training
dataset using a method of gradient descent. Increasing the lower
bound on the log-likelihood of the at least a subset of a training
dataset using a method of gradient descent may include attempting
to maximize the lower bound on the log-likelihood of the at least a
subset of a training dataset using a method of gradient descent.
The encoding distribution and decoding distribution may be
parameterized by deep neural networks. Determining an ordered set
of conditional cumulative distribution functions of the
supplementary continuous random variables may include analytically
determining an ordered set of conditional cumulative distribution
functions of the supplementary continuous random variables. The
lower bound may be an evidence lower bound.
[0016] Constructing a first stochastic approximation to the lower
bound of the log-likelihood of the at least a subset of a training
dataset may include decomposing the first stochastic approximation
to the lower bound into at least a first part comprising negative
KL-divergence between the approximating posterior and the prior
distribution over the first latent space, and a second part
comprising an expectation, or at least a stochastic approximation
to an expectation, with respect to the approximating posterior over
the second latent space of the conditional log-likelihood of the at
least a subset of a training dataset under the decoding
distribution.
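A minimal numeric sketch of this decomposition, assuming for simplicity factorial Bernoulli approximating posterior and prior distributions over the first latent space (the disclosure also contemplates RBM priors, for which the KL term is generally estimated from samples rather than computed in this closed form):

```python
import numpy as np

def elbo_one_sample(q, prior_p, log_px_given_zeta, eps=1e-12):
    """One-sample stochastic estimate of the lower bound (ELBO).

    Assumes factorial Bernoulli approximating posterior probabilities q and prior
    probabilities prior_p over the discrete latent variables, plus a caller-supplied
    reconstruction term log p(x | zeta) evaluated at one sample zeta drawn from q.
    """
    kl = np.sum(q * np.log((q + eps) / (prior_p + eps))
                + (1 - q) * np.log((1 - q + eps) / (1 - prior_p + eps)))
    return -kl + log_px_given_zeta   # ELBO = -KL(q || p) + E_q[log p(x | zeta)] (one sample)

print(elbo_one_sample(np.array([0.9, 0.2]), np.array([0.5, 0.5]), log_px_given_zeta=-3.1))
```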
[0017] Constructing a second stochastic approximation to the
gradient of the lower bound may include determining the gradient of
the second part of the first stochastic approximation by
backpropagation; approximating the gradient of the first part of
the first stochastic approximation with respect to one or more
parameters of the prior distribution over the first latent space
using samples from the prior distribution; and determining a
gradient of the first part of the first stochastic approximation
with respect to parameters of the encoding distribution by
backpropagation. Approximating the gradient of the first part of
the first stochastic approximation with respect to one or more
parameters of the prior distribution over the first latent space
using samples from the prior distribution may include at least one
of generating samples or causing samples to be generated by a
quantum processor. A logarithm of the prior distribution may be, to
within a constant, a problem Hamiltonian of a quantum
processor.
[0018] The method may further include generating samples or causing
samples to be generated by a quantum processor; and determining an
expectation with respect to the prior distribution from the
samples. Generating samples or causing samples to be generated by
at least one quantum processor may include performing at least one
post-processing operation on the samples. Generating samples or
causing samples to be generated by at least one quantum processor
may include operating the at least one quantum processor as a
sample generator to provide the samples from a probability
distribution, wherein a shape of the probability distribution
depends on a configuration of a number of programmable parameters
for the at least one quantum processor, and wherein operating the
at least one quantum processor as a sample generator comprises:
programming the at least one quantum processor with a configuration
of the number of programmable parameters for the at least one
quantum processor, wherein the configuration of a number of
programmable parameters corresponds to the probability distribution
over the plurality of qubits of the at least one quantum processor;
evolving the quantum processor; and reading out states for the
qubits in the plurality of qubits of the at least one quantum
processor, wherein the states for the qubits in the plurality of
qubits correspond to a sample from the probability
distribution.
[0019] The method may further include at least one of generating,
or at least approximating, samples or causing samples to be
generated, or least approximated, by a restricted Boltzmann
machine; and determining the expectation with respect to the prior
distribution from the samples. The set of supplementary continuous
random variables may include a plurality of continuous variables,
and each one of the plurality of continuous variables may be
conditioned on a different respective one of the plurality of
random variables.
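When samples are generated or approximated by a restricted Boltzmann machine rather than a quantum processor, one common classical approximation is block Gibbs sampling. The sketch below is such an approximation (the parameter values, sizes, and number of Gibbs steps are arbitrary assumptions); the returned samples can be averaged to estimate expectations with respect to the prior distribution.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_block_gibbs(W, b_vis, b_hid, n_samples=100, n_steps=50, rng=None):
    """Approximate samples from an RBM prior by block Gibbs sampling (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    v = (rng.uniform(size=(n_samples, len(b_vis))) < 0.5).astype(float)
    h = np.zeros((n_samples, len(b_hid)))
    for _ in range(n_steps):
        h = (rng.uniform(size=(n_samples, len(b_hid))) < sigmoid(v @ W + b_hid)).astype(float)
        v = (rng.uniform(size=(n_samples, len(b_vis))) < sigmoid(h @ W.T + b_vis)).astype(float)
    return v, h

# Expectations with respect to the prior are then estimated from the samples:
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(6, 4))
b_vis = np.zeros(6)
b_hid = np.zeros(4)
v, h = rbm_block_gibbs(W, b_vis, b_hid, rng=rng)
print(v.mean(axis=0))          # estimate of E_prior[v_i]
print((v.T @ h) / len(v))      # estimate of E_prior[v_i h_j], used in gradients of the prior
```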
[0020] The method may further include forming a second transforming
distribution, wherein the input space comprises a plurality of
input variables, and the second transforming distribution is
conditioned on one or more of the plurality of input variables and
at least one of the one or more discrete random variables.
[0021] A computational system may be summarized as including
hardware or circuitry, for example including at least one
processor; and at least one nontransitory processor-readable
storage medium that stores at least one of processor-executable
instructions or data which, when executed by the at least one
processor cause the at least one processor to execute any of the
above described acts or any of the methods of claims 1 through
16.
[0022] A method for unsupervised learning by a computational
system, the method executable by circuitry including at least one
processor, may be summarized as including forming a model, the
model comprising one or more model parameters; initializing the
model parameters; receiving a training dataset comprising a
plurality of subsets of the training dataset; testing to determine
if a stopping criterion has been met; in response to determining
the stopping criterion has not been met: fetching a mini-batch
comprising one of the plurality of subsets of the training dataset,
the mini-batch comprising input data; performing propagation
through an encoder that computes an approximating posterior
distribution over a discrete space; sampling from the approximating
posterior distribution over a set of continuous random variables
via a sampler; performing propagation through a decoder that
computes an auto-encoded distribution over the input data;
performing backpropagation through the decoder of a log-likelihood
of the input data with respect to the auto-encoded distribution
over the input data; performing backpropagation through the sampler
that samples from the approximating posterior distribution over the
set of continuous random variables to generate an auto-encoding
gradient; determining a first gradient of a KL-divergence, with
respect to the approximating posterior, between the approximating
posterior distribution and a true prior distribution over the
discrete space; performing backpropagation through the encoder of a
sum of the auto-encoding gradient and the first gradient of the
KL-divergence with respect to the approximating posterior;
determining a second gradient of a KL-divergence, with respect to
parameters of the true prior distribution, between the
approximating posterior and the true prior distribution over the
discrete space; determining at least one of a gradient or at least
a stochastic approximation of a gradient, of a bound on the
log-likelihood of the input data; updating the model parameters
based at least in part on the determined at least one of the
gradient or at least a stochastic approximation of the gradient, of
the bound on the log-likelihood of the input data. Initializing the
model parameters may include initializing the model parameters
using random variables. Initializing the model parameters may
include initializing the model parameters based at least in part on
a pre-training procedure. Testing to determine if a stopping
criterion has been met may include testing to determine if a
threshold number N of passes through the training dataset have been
run.
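The training procedure described in the preceding paragraph can be restated, very loosely, as the following loop. Every callable passed to train below (encoder, sampler, decoder, reconstruction_grads, kl_grads, update_parameters, stopping_criterion_met, minibatches) is a hypothetical placeholder named only for illustration; a practical implementation would typically rely on automatic differentiation rather than hand-wired backpropagation steps.

```python
def train(model, training_dataset, *, encoder, sampler, decoder, reconstruction_grads,
          kl_grads, update_parameters, stopping_criterion_met, minibatches):
    """Loose restatement of the loop in paragraph [0022]; the model parameters are
    assumed to have been initialized already (randomly or by pre-training). Every
    callable argument is a hypothetical placeholder named only for illustration."""
    while not stopping_criterion_met(model):         # e.g. N passes, or a validation-loss test
        for x in minibatches(training_dataset):      # fetch a mini-batch of input data
            q = encoder(model, x)                    # approximating posterior over the discrete space
            zeta = sampler(q)                        # sample supplementary continuous variables
            x_hat = decoder(model, zeta)             # auto-encoded distribution over the input
            g_recon = reconstruction_grads(model, x, x_hat, zeta, q)   # backprop of log-likelihood
            g_kl_q, g_kl_prior = kl_grads(model, q)  # KL gradients w.r.t. posterior and prior params
            update_parameters(model, g_recon, g_kl_q, g_kl_prior)      # e.g. one gradient step
    return model
```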
[0023] The method may further include receiving at least a subset
of a validation dataset, wherein testing to determine if a stopping
criterion has been met includes determining a measure of validation
loss on the at least a subset of a validation dataset computed on
two or more successive passes, and testing to determine if the
measure of validation loss meets a predetermined criterion.
Determining a second gradient of a KL-divergence, with respect to
parameters of the true prior distribution, between the
approximating posterior and the true prior distribution over the
discrete space may include determining a second gradient of a
KL-divergence, with respect to parameters of the true prior
distribution, between the approximating posterior and the true
prior distribution over the discrete space by generating samples or
causing samples to be generated by a quantum processor.
[0024] Generating samples or causing samples to be generated by a
quantum processor may include operating the at least one quantum
processor as a sample generator to provide the samples from a
probability distribution, wherein a shape of the probability
distribution depends on a configuration of a number of programmable
parameters for the at least one quantum processor, and wherein
operating the at least one quantum processor as a sample generator
comprises programming the at least one quantum processor with a
configuration of the number of programmable parameters for the at
least one quantum processor, wherein the configuration of a number
of programmable parameters corresponds to the probability
distribution over the plurality of qubits of the at least one
quantum processor; evolving the at least one quantum processor; and
reading out states for the qubits in the plurality of qubits of the at
least one quantum processor, wherein the states for the qubits in
the plurality of qubits correspond to a sample from the probability
distribution. Operating the at least one quantum processor as a
sample generator to provide the samples from a probability
distribution may include operating the at least one quantum
processor to perform at least one post-processing operation on the
samples. Sampling from the approximating posterior distribution
over a set of continuous random variables may include generating
samples or causing samples to be generated by a digital
processor.
[0025] The method for unsupervised learning may further include
dividing the discrete space into a first plurality of disjoint
groups; and dividing the set of supplementary continuous random
variables into a second plurality of disjoint groups, wherein
performing propagation through an encoder that computes an
approximating posterior over a discrete space includes: determining
a processing sequence for the first and the second plurality of
disjoint groups; and for each of the first plurality of disjoint
groups in an order determined by the processing sequence,
performing propagation through an encoder that computes an
approximating posterior, the approximating posterior conditioned on
at least one of the previous ones in the processing sequence of the
second plurality of disjoint groups and at least one of the
plurality of input variables. Dividing the discrete space into a
first plurality of disjoint groups may include dividing the
discrete space into a first plurality of disjoint groups by random
assignment of discrete variables in the discrete space. Dividing
the discrete space into a first plurality of disjoint groups may
include dividing the discrete space into a first plurality of
disjoint groups to generate even-sized groups in the first
plurality of disjoint groups. Initializing the model parameters may
include initializing the model parameters using random variables. Initializing the model parameters may include initializing the model parameters based at least in part on a pre-training procedure.
Testing to determine if a stopping criterion has been met may
include testing to determine if a threshold number N of passes
through the training dataset have been run.
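The hierarchical decomposition described above, in which each group's approximating posterior is conditioned on the input and on the continuous samples already drawn for earlier groups in the processing sequence, can be sketched as a simple loop. The callables group_posterior and sample_zeta are hypothetical placeholders for the per-group encoder network and the inverse-CDF sampler.

```python
import numpy as np

def hierarchical_posterior(x, groups, group_posterior, sample_zeta, rng):
    """Sequentially build the approximating posterior over disjoint groups of discrete
    variables. `group_posterior` and `sample_zeta` are hypothetical placeholders for
    the per-group encoder network and the inverse-CDF sampler of the continuous
    relaxation; each group is conditioned on x and on earlier groups' samples."""
    qs, zetas = [], []
    for g in groups:                            # processing sequence over the disjoint groups
        q_g = group_posterior(g, x, zetas)      # conditioned on input and earlier samples
        rho = rng.uniform(size=np.shape(q_g))   # independent uniform noise
        zetas.append(sample_zeta(q_g, rho))     # continuous sample for this group
        qs.append(q_g)
    return qs, zetas
```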
[0026] The method may further include receiving at least a subset
of a validation dataset, wherein testing to determine if a stopping
criterion has been met includes determining a measure of validation
loss on the at least a subset of a validation dataset computed on
two or more successive passes, and testing to determine if the
measure of validation loss meets a predetermined criterion.
Determining a second gradient of a KL-divergence, with respect to
parameters of the true prior distribution, between the
approximating posterior and the true prior distribution over the
discrete space may include determining a second gradient of a
KL-divergence, with respect to parameters of the true prior
distribution, between the approximating posterior and the true
prior distribution over the discrete space by generating samples or
causing samples to be generated by a quantum processor.
[0027] Generating samples or causing samples to be generated by a
quantum processor may include operating the at least one quantum
processor as a sample generator to provide the samples from a
probability distribution, wherein a shape of the probability
distribution depends on a configuration of a number of programmable
parameters for the at least one quantum processor, and wherein operating the at
least one quantum processor as a sample generator comprises:
programming the at least one quantum processor with a configuration
of the number of programmable parameters for the at least one
quantum processor, wherein the configuration of a number of
programmable parameters corresponds to the probability distribution
over the plurality of qubits of the at least one quantum processor,
evolving the at least one quantum processor, and reading out states
for the qubits in the plurality of qubits of the at least one quantum
processor, wherein the states for the qubits in the plurality of
qubits correspond to a sample from the probability distribution.
Operating the at least one quantum processor as a sample generator
to provide the samples from a probability distribution may include
operating the at least one quantum processor to perform at least
one post-processing operation on the samples. Sampling from the
approximating posterior over a set of continuous random variables
may include generating samples or causing samples to be generated
by a digital processor.
[0028] A computational system may be summarized as including
hardware or circuitry, for example including at least one
processor; and at least one nontransitory processor-readable
storage medium that stores at least one of processor-executable
instructions or data which, when executed by the at least one
processor cause the at least one processor to execute any of the above
described acts or any of the methods of claims 18 through 37.
[0029] A method of unsupervised learning by a computational system,
the method executable by circuitry including at least one
processor, may be summarized as including determining a first
approximating posterior distribution over at least one group of a
set of discrete random variables; sampling from at least one group
of a set of supplementary continuous random variables using the
first approximating posterior distribution over the at least one
group of the set of discrete random variables to generate one or
more samples, wherein a transforming distribution comprises a
conditional distribution over the set of supplementary continuous
random variables, conditioned on the one or more discrete random
variables; determining a second approximating posterior
distribution and a first prior distribution, the first prior
distribution over at least one layer of a set of continuous
variables; sampling from the second approximating posterior
distribution; determining an auto-encoding loss on an input space
comprising discrete or continuous variables, the auto-encoding loss
conditioned on the one or more samples; determining a first
KL-divergence, or at least an approximation thereof, between the
second posterior distribution and the first prior distribution;
determining a second KL-divergence, or at least an approximation
thereof, between the first posterior distribution and a second
prior distribution, the second prior distribution over the set of
discrete random variables; and backpropagating the sum of the first
and the second KL-divergence and the auto-encoding loss on the
input space conditioned on the one or more samples. The
auto-encoding loss may be a log-likelihood.
[0030] A computational system may be summarized as including
hardware or circuitry, for example including at least one
processor; and at least one nontransitory processor-readable
storage medium that stores at least one of processor-executable
instructions or data which, when executed by the at least one
processor cause the at least one processor to execute any of the
immediately above described acts or any of the methods of claims 39
through 40.
[0031] A method of unsupervised learning by a computational system,
the method executable by circuitry including at least one
processor, may be summarized as including determining a first
approximating posterior distribution over a first group of discrete
random variables conditioned on an input space comprising discrete
or continuous variables; sampling from a first group of
supplementary continuous variables based on the first approximating
posterior distribution; determining a second approximating
posterior distribution over a second group of discrete random
variables conditioned on the input space and samples from the first
group of supplementary continuous random variables; sampling from a
second group of supplementary continuous variables based on the
second approximating posterior distribution; determining a third
approximating posterior distribution and a first prior distribution
over a first layer of additional continuous random variables, the
third approximating distribution conditioned on the input space,
samples from at least one of the first and the second group of
supplementary continuous random variables, and the first prior
distribution conditioned on samples from at least one of the first
and the second group of supplementary continuous random variables;
sampling from the first layer of additional continuous random
variables based on the third approximating posterior distribution;
determining a fourth approximating posterior distribution and a
second prior distribution over a second layer of additional
continuous random variables, the fourth approximating distribution
conditioned on the input space, samples from at least one of the
first and the second group of supplementary continuous random
variables, samples from the first layer of additional continuous
random variables, and the second prior distribution conditioned on
at least one of samples from at least one of the first and the
second group of supplementary continuous random variables, and
samples from the first layer of additional continuous random
variables; determining a first gradient of a KL-divergence, or at
least a stochastic approximation thereof, between the third
approximating posterior distribution and the first prior
distribution with respect to the third approximating posterior
distribution and the first prior distribution; determining a second
gradient of a KL-divergence, or at least a stochastic approximation
thereof, between the fourth approximating posterior distribution
and the second prior distribution with respect to the fourth
approximating posterior distribution and the second prior
distribution; determining a third gradient of a KL-divergence, or
at least a stochastic approximation thereof, between an
approximating posterior distribution over the discrete random
variables and a third prior distribution with respect to the
approximating posterior distribution over the discrete random
variables and the third prior distribution, wherein the
approximating posterior distribution over the discrete random
variables is a combination of the first approximating posterior
distribution over the first group of discrete random variables, and
the second approximating posterior distribution over the second
group of discrete random variables; backpropagating the first, the
second and the third gradients of the KL-divergence to the input
space. The third prior distribution may be a restricted Boltzmann
machine.
[0032] A computational system may be summarized as including
hardware or circuitry, for example including at least one processor
and at least one nontransitory processor-readable storage medium
that stores at least one of processor-executable instructions or
data which, when executed by the at least one processor cause the
at least one processor to execute any of the immediately above
described acts or any of the methods of claims 41 through 42.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
[0033] In the drawings, identical reference numbers identify
similar elements or acts. The sizes and relative positions of
elements in the drawings are not necessarily drawn to scale. For
example, the shapes of various elements and angles are not
necessarily drawn to scale, and some of these elements may be
arbitrarily enlarged and positioned to improve drawing legibility.
Further, the particular shapes of the elements as drawn, are not
necessarily intended to convey any information regarding the actual
shape of the particular elements, and may have been solely selected
for ease of recognition in the drawings.
[0034] FIG. 1 is a schematic diagram of an exemplary hybrid
computer including a digital computer and an analog computer in
accordance with the present systems, devices, methods, and
articles.
[0035] FIG. 2A is a schematic diagram of an exemplary topology for
a quantum processor.
[0036] FIG. 2B is a schematic diagram showing a close-up of the
exemplary topology for a quantum processor.
[0037] FIG. 3 is a schematic diagram illustrating an example
implementation of a variational auto-encoder (VAE).
[0038] FIG. 4 is a flow chart illustrating a method for
unsupervised learning, in accordance with the presently described
systems, devices, articles, and methods.
[0039] FIG. 5 is a schematic diagram illustrating an example
implementation of a hierarchical variational auto-encoder
(VAE).
[0040] FIG. 6 is a schematic diagram illustrating an example
implementation of a variational auto-encoder (VAE) with a hierarchy
of continuous latent variables.
[0041] FIG. 7 is a flow chart illustrating a method for
unsupervised learning via a hierarchical variational auto-encoder
(VAE), in accordance with the present systems, devices, articles
and methods.
DETAILED DESCRIPTION
Generalities
[0042] In the following description, some specific details are
included to provide a thorough understanding of various disclosed
embodiments. One skilled in the relevant art, however, will
recognize that embodiments may be practiced without one or more of
these specific details, or with other methods, components,
materials, etc. In other instances, well-known structures
associated with quantum processors, such as quantum devices,
coupling devices, and control systems including microprocessors and
drive circuitry have not been shown or described in detail to avoid
unnecessarily obscuring descriptions of the embodiments of the
present methods. Throughout this specification and the appended
claims, the words "element" and "elements" are used to encompass,
but are not limited to, all such structures, systems, and devices
associated with quantum processors, as well as their related
programmable parameters.
[0043] Unless the context requires otherwise, throughout the
specification and claims that follow, the word "comprising" is
synonymous with "including," and is inclusive or open-ended (i.e.,
does not exclude additional, unrecited elements or method
acts).
[0044] Reference throughout this specification to "one embodiment"
"an embodiment", "another embodiment", "one example", "an example",
or "another example" means that a particular referent feature,
structure, or characteristic described in connection with the
embodiment or example is included in at least one embodiment or
example. Thus, the appearances of the phrases "in one embodiment",
"in an embodiment", "another embodiment" or the like in various
places throughout this specification are not necessarily all
referring to the same embodiment or example. Furthermore, the
particular features, structures, or characteristics may be combined
in any suitable manner in one or more embodiments or examples.
[0045] It should be noted that, as used in this specification and
the appended claims, the singular forms "a," "an," and "the"
include plural referents unless the content clearly dictates
otherwise. Thus, for example, reference to a problem-solving system
including "a quantum processor" includes a single quantum
processor, or two or more quantum processors. It should also be
noted that the term "or" is generally employed in its sense
including "and/or" unless the content clearly dictates
otherwise.
[0046] References to a processor or at least one processor refer to
hardware or circuitry, with discrete or integrated, for example
single or multi-core microprocessors, microcontrollers, central
processor units, digital signal processors, graphical processing
units, programmable gate arrays, programmed logic controllers, and
analog processors, for instance quantum processors. Various
algorithms and methods and specific acts are executable via one or
more processors.
[0047] The headings provided herein are for convenience only and do
not interpret the scope or meaning of the embodiments.
Quantum Hardware
[0048] FIG. 1 illustrates a hybrid computing system 100 including a
digital computer 105 coupled to an analog computer 150. In some
implementations analog computer 150 is a quantum processor. The
exemplary digital computer 105 includes a digital processor (CPU)
110 that may be used to perform classical digital processing
tasks.
[0049] Digital computer 105 may include at least one digital
processor (such as central processor unit 110 with one or more
cores), at least one system memory 120, and at least one system bus
117 that couples various system components, including system memory
120 to central processor unit 110.
[0050] The digital processor may be any logic processing unit, such
as one or more central processing units ("CPUs"), graphics
processing units ("GPUs"), digital signal processors ("DSPs"),
application-specific integrated circuits ("ASICs"), programmable
gate arrays ("FPGAs"), programmable logic controllers (PLCs), etc.,
and/or combinations of the same.
[0051] Unless described otherwise, the construction and operation
of the various blocks shown in FIG. 1 are of conventional design.
As a result, such blocks need not be described in further detail
herein, as they will be understood by those skilled in the relevant
art.
[0052] Digital computer 105 may include a user input/output
subsystem 111. In some implementations, the user input/output
subsystem includes one or more user input/output components such as
a display 112, mouse 113, and/or keyboard 114.
[0053] System bus 117 can employ any known bus structures or
architectures, including a memory bus with a memory controller, a
peripheral bus, and a local bus. System memory 120 may include
non-volatile memory, such as read-only memory ("ROM"), static
random access memory ("SRAM"), Flash NAND; and volatile memory such
as random access memory ("RAM") (not shown).
[0054] Digital computer 105 may also include other non-transitory
computer- or processor-readable storage media or non-volatile
memory 115. Non-volatile memory 115 may take a variety of forms,
including: a hard disk drive for reading from and writing to a hard
disk, an optical disk drive for reading from and writing to
removable optical disks, and/or a magnetic disk drive for reading
from and writing to magnetic disks. The optical disk can be a
CD-ROM or DVD, while the magnetic disk can be a magnetic floppy
disk or diskette. Non-volatile memory 115 may communicate with
digital processor via system bus 117 and may include appropriate
interfaces or controllers 116 coupled to system bus 117.
Non-volatile memory 115 may serve as long-term storage for
processor- or computer-readable instructions, data structures, or
other data (sometimes called program modules) for digital computer
105.
[0055] Although digital computer 105 has been described as
employing hard disks, optical disks and/or magnetic disks, those
skilled in the relevant art will appreciate that other types of
non-volatile computer-readable media may be employed, such as magnetic cassettes, flash memory cards, Flash, ROMs, smart cards, etc. Those skilled in the relevant art will appreciate that some computer architectures employ volatile memory and non-volatile memory. For example, data in volatile memory can be cached to non-volatile memory, or to a solid-state disk that employs integrated circuits to provide non-volatile memory.
[0056] Various processor- or computer-readable instructions, data
structures, or other data can be stored in system memory 120. For
example, system memory 120 may store instructions for communicating
with remote clients and scheduling use of resources including
resources on the digital computer 105 and analog computer 150. Also
for example, system memory 120 may store at least one of processor
executable instructions or data that, when executed by at least one
processor, causes the at least one processor to execute the various
algorithms described elsewhere herein, including machine learning
related algorithms.
[0057] In some implementations system memory 120 may store
processor- or computer-readable calculation instructions to perform
pre-processing, co-processing, and post-processing to analog
computer 150. System memory 120 may store a set of analog computer
interface instructions to interact with analog computer 150.
[0058] Analog computer 150 may include at least one analog
processor such as quantum processor 140. Analog computer 150 can be
provided in an isolated environment, for example, in an isolated
environment that shields the internal elements of the quantum
computer from heat, magnetic field, and other external noise (not
shown). The isolated environment may include a refrigerator, for
instance a dilution refrigerator, operable to cryogenically cool
the analog processor, for example to a temperature below approximately 1 Kelvin.
[0059] FIG. 2A shows an exemplary topology 200a for a quantum
processor, in accordance with the presently described systems,
devices, articles, and methods. Topology 200a may be used to
implement quantum processor 140 of FIG. 1, however other topologies
can also be used for the systems and methods of the present
disclosure. Topology 200a comprises a grid of 2.times.2 cells
210a-210d, each cell comprised of 8 qubits such as qubit 220 (only
one called out in FIG. 2A).
[0060] Within each cell 210a-210d, there are eight qubits 220 (only
one called out for drawing clarity), the qubits 220 in each cell
210a-210d arranged in four rows (extending horizontally in drawing
sheet) and four columns (extending vertically in drawing sheet).
Pairs of qubits 220 from the rows and columns can be
communicatively coupled to one another by a respective coupler such
as coupler 230 (illustrated by bold cross shapes, only one called
out in FIG. 2A). A respective coupler 230 is positioned and
operable to communicatively couple the qubit in each column
(vertically-oriented qubit in drawing sheet) in each cell to the
qubits in each row (horizontally-oriented qubit in drawing sheet)
in the same cell. Additionally, a respective coupler, such as
coupler 240 (only one called out in FIG. 2A) is positioned and
operable to communicatively couple the qubit in each column
(vertically-oriented qubit in drawing sheet) in each cell with a
corresponding qubit in each column (vertically-oriented qubit in
drawing sheet) in a nearest neighboring cell in a same direction as
the orientation of the columns. Similarly, a respective coupler,
such as coupler 250 (only one called out in FIG. 2A) is positioned
and operable to communicatively couple the qubit in each row
(horizontally-oriented qubit in drawing sheet) in each cell with a
corresponding qubit in each row (horizontally-oriented qubit in
drawing sheet) in each nearest neighboring cell in a same direction
as the orientation of the rows. Since the couplers 240, 250 couple
qubits 220 between cells 210 such couplers 240, 250 may at times be
denominated as inter-cell couplers. Since the couplers 230 couple
qubits within a cell 210, such couplers 230 may at times be
denominated as intra-cell couplers.
[0061] FIG. 2B shows an exemplary topology 200b for a quantum
processor, in accordance with the presently described systems,
devices, articles, and methods. Topology 200b shows nine cells,
such as cell 210b (only one called out in FIG. 2B), each cell comprising eight qubits, for a total of seventy-two qubits q1 through q72. FIG. 2B illustrates the
intra-coupling, such as coupler 230b (only one called out in FIG.
2B), and inter-coupling, such as coupler 260 (only one called out
in FIG. 2B), for the cell 210b.
[0062] The non-planarity of the connections between qubits q1-q72
makes the problem of finding the lowest energy state of the qubits
q1-q72 an NP-hard problem, which means that it is possible to map
many practical problems to the topology illustrated in FIGS. 2A and
2B, and described above.
[0063] Use of the quantum processor 140 with the topology
illustrated in FIGS. 2A and 2B is not limited only to problems that
fit the native topology. For example, it is possible to embed a
complete graph of size N on a quantum processor of size O(N.sup.2)
by chaining qubits together.
[0064] A computational system 100 (FIG. 1) comprising a quantum
processor 140 with topology 200a of FIG. 2A or topology 200b of
FIG. 2B can specify an energy function over spin variables +1/-1,
and receive from the quantum processor with topology 200a or
topology 200b samples of lower energy spin configurations in an
approximately Boltzmann distribution according to the Ising model
as follows:
E(s) = \sum_i h_i s_i + \sum_{i,j} J_{i,j} s_i s_j
where h_i are local biases and J_{i,j} are coupling terms.
[0065] The spin variables can be mapped to binary variables 0/1.
Higher-order energy functions can be expressed by introducing
additional constraints over auxiliary variables.
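To make the energy function above concrete, the following is a minimal sketch (not part of the disclosure) that evaluates E(s) for a spin configuration and maps spins to binary variables; the array names h, J, and spins, and the assumption that J stores each coupling only once (upper-triangular), are illustrative choices.

    import numpy as np

    def ising_energy(spins, h, J):
        # E(s) = sum_i h_i s_i + sum_{i,j} J_ij s_i s_j, with spins in {-1, +1}
        # and J assumed upper-triangular so each coupling is counted once.
        spins = np.asarray(spins, dtype=float)
        return float(h @ spins + spins @ J @ spins)

    def spins_to_binary(spins):
        # Map spin variables {-1, +1} to binary variables {0, 1} via x = (s + 1) / 2.
        return (np.asarray(spins) + 1) // 2

    # Example: two qubits with a single coupling.
    h = np.array([0.1, -0.2])
    J = np.array([[0.0, 0.5], [0.0, 0.0]])
    print(ising_energy([+1, -1], h, J), spins_to_binary([+1, -1]))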
Machine Learning
[0066] Various systems and methods for augmenting conventional
machine learning hardware such as Graphics Processing Units (GPUs)
and Central Processing Units (CPUs) with quantum hardware are
described herein. Quantum hardware typically includes one or more
quantum processors or quantum processing units (QPUs). The systems
and methods described herein adapt machine learning architectures
and methods to exploit QPUs to advantageously achieve improved
machine performance. Improved machine performance typically
includes reduced training time and/or increased generalization
accuracy.
[0067] Optimization and sampling can be computational bottlenecks
in machine learning systems and methods. The systems and methods
described herein integrate the QPU into the machine learning
pipeline (including the architecture and methods) to perform
optimization and/or sampling with improved performance over
classical hardware. The machine learning pipeline can be modified
to suit QPUs that can be realized in practice.
Sampling in Training Probabilistic Models
[0068] Boltzmann machines including restricted Boltzmann machines
(RBMs) can be used in deep learning systems. Boltzmann machines are
particularly suitable for unsupervised learning and probabilistic
modeling such as in-painting and classification.
[0069] A shortcoming of existing approaches is that Boltzmann
machines typically use costly Markov Chain Monte Carlo (MCMC)
techniques to approximate samples drawn from an empirical
distribution. These MCMC techniques serve only as a proxy for a physical Boltzmann sampler.
[0070] A QPU can be integrated into machine learning systems and
methods to reduce the time taken to perform training. For example,
the QPU can be used as a physical Boltzmann sampler. The approach
involves programming the QPU (which is an Ising system) such that
the spin configurations realize a user-defined Boltzmann
distribution natively. The approach can then draw samples directly
from the QPU.
Restricted Boltzmann Machine (RBM)
[0071] The restricted Boltzmann machine (RBM) is a probabilistic
graphical model that represents a joint probability distribution
p(x,z) over binary visible units x and binary hidden units z. The
restricted Boltzmann machine can be used as an element in a deep
learning network.
[0072] The RBM network has the topology of a bipartite graph with
biases on each visible unit and on each hidden unit, and weights
(couplings) on each edge. An energy E(x,z) can be associated with
the joint probability distribution p(x,z) over the visible and the
hidden units, as follows:
p(x,z) = e^{-E(x,z)}/Z
where Z is the partition function.
[0073] For a restricted Boltzmann machine, the energy is:
E(x,z) = -b^T x - c^T z - z^T W x
where b and c are bias vectors, W is a coupling (weight) matrix, and T denotes the transpose. The conditional probabilities can be computed:
p(x|z) = \sigma(b + W^T z)
p(z|x) = \sigma(c + W x)
where σ is the sigmoid function, used to ensure the values of the conditional probabilities lie in the range [0,1].
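As an illustrative sketch only (the variable names x, z, b, c, and W are assumptions, with W of shape hidden × visible), the energy and conditional probabilities above can be computed as follows:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def rbm_energy(x, z, b, c, W):
        # E(x, z) = -b^T x - c^T z - z^T W x for binary visible x and hidden z.
        return float(-(b @ x) - (c @ z) - (z @ W @ x))

    def p_hidden_given_visible(x, c, W):
        # p(z_j = 1 | x) = sigma(c + W x), computed elementwise over hidden units.
        return sigmoid(c + W @ x)

    def p_visible_given_hidden(z, b, W):
        # p(x_i = 1 | z) = sigma(b + W^T z), computed elementwise over visible units.
        return sigmoid(b + W.T @ z)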
Training RBMs
[0074] Training is the process by which the parameters of the model
are adjusted to favor producing the desired training distribution.
Typically, this is done by maximizing the log-likelihood of the observed data distribution with respect to the model parameters. One part of the
process involves sampling over the given data distribution, and
this part is generally straightforward. Another part of the process
involves sampling over the predicted model distribution, and this
is generally intractable, in the sense that it would use
unmanageable amounts of computational resources.
[0075] Some existing approaches use a Markov Chain Monte Carlo
(MCMC) method to perform sampling. MCMC constructs a Markov chain
that has the desired distribution as its equilibrium distribution.
The state of the chain after k>>1 steps is used as a sample
of the desired distribution. The quality of the sample improves as
a function of the number of steps which means that MCMC makes
training a slow process.
[0076] To speed up the MCMC process, Contrastive Divergence-k
(CD-k) can be used, in which the method only takes k steps of the
MCMC process. Another way to speed up the process is to use
Persistent Contrastive Divergence (PCD), in which a Markov Chain is
initialized in the state where it ended from the previous model.
CD-k and PCD methods tend to perform poorly when the distribution
is multi-modal and the modes are separated by regions of low
probability.
[0077] Even approximate sampling is NP-hard. The cost of sampling
grows exponentially with problem size. Samples drawn from a native
QPU network (as described above) are close to a Boltzmann
distribution. It is possible to quantify the rate of convergence to
a true Boltzmann distribution by evaluating the KL-divergence
between the empirical distribution and the true distribution as a
function of the number of samples.
[0078] Noise limits the precision with which the parameters of the
model can be set in the quantum hardware. In practice, this means
that the QPU is sampling from a slightly different energy function.
The effects can be mitigated by sampling from the QPU and using the
samples as starting points for non-quantum post-processing e.g., to
initialize MCMC, CD, and PCD. The QPU is performing the hard part
of the sampling process. The QPU finds a diverse set of valleys,
and the post-processing operation samples within the valleys.
Post-processing can be implemented in a GPU and can be at least
partially overlapped with sampling in the quantum processor to
reduce the impact of post-processing on the overall timing.
Sampling to Train RBMs
[0079] A training data set can comprise a set of visible vectors.
Training comprises adjusting the model parameters such that the
model is most likely to reproduce the distribution of the training
set. Typically, training comprises maximizing the log-likelihood of
the observed data distribution with respect to the model parameters
.theta.:
\frac{\partial \log\left(\sum_z p(x,z)\right)}{\partial\theta} = -\left\langle\frac{\partial E(x,z)}{\partial\theta}\right\rangle_{p(z|x)} + \left\langle\frac{\partial E(x,z)}{\partial\theta}\right\rangle_{p(x,z)}
[0080] The first term on the right-hand side (RHS) in the above
equation is related to the positive phase and computes an expected
value of energy E over p(z|x). The term involves sampling over the
given data distribution.
[0081] The second term on the RHS is related to the negative phase, and computes an expected value of the energy over the model's joint distribution p(x,z). The term involves sampling over the predicted model distribution.
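The following sketch is offered only as an illustration under assumed conventions (W of shape hidden × visible, x a binary data vector): it computes the positive phase exactly and approximates the negative phase with k steps of block Gibbs sampling, in the spirit of the CD-k procedure described above.

    import numpy as np

    def cd_k_weight_gradient(x, b, c, W, k=1, rng=None):
        # Estimate d log p(x) / dW = <z x^T>_{p(z|x)} - <z x^T>_{p(x,z)},
        # approximating the model (negative-phase) expectation with k Gibbs steps.
        rng = np.random.default_rng() if rng is None else rng
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        q_pos = sigmoid(c + W @ x)                 # positive phase: p(z|x) for the data
        positive = np.outer(q_pos, x)
        x_neg = np.asarray(x, dtype=float).copy()
        for _ in range(k):                         # negative phase: k steps of block Gibbs
            z_neg = (rng.random(c.shape) < sigmoid(c + W @ x_neg)).astype(float)
            x_neg = (rng.random(b.shape) < sigmoid(b + W.T @ z_neg)).astype(float)
        q_neg = sigmoid(c + W @ x_neg)
        negative = np.outer(q_neg, x_neg)
        return positive - negative                 # ascend this to increase log-likelihood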
Variational Auto-Encoder
[0082] Unsupervised learning of probabilistic models is a technique
for machine learning. It can facilitate tasks such as denoising to
extract a signal from a mixture of signal and noise, and inpainting
to reconstruct lost or corrupted parts of an image. It can also
regularize supervised tasks such as classification.
[0083] One approach to unsupervised learning can include attempting
to maximize the log-likelihood of an observed dataset under a
probabilistic model. Equivalently, unsupervised learning can
include attempting to minimize the KL-divergence from the data
distribution to that of the model. While the exact gradient of the
log-likelihood function is frequently intractable, stochastic
approximations can be computed, provided samples can be drawn from
the probabilistic model and its posterior distribution given the
observed data.
[0084] The efficiency of using stochastic approximations to arrive
at a maximum of the log-likelihood function can be limited by the
poor availability of desirable distributions for which the
requisite sampling operations are computationally efficient. Hence,
applicability of the techniques can be similarly limited.
[0085] Although sampling can be efficient in undirected graphical
models provided there are no loops present among the connections,
the range of representable relationships can be limited. Boltzmann
machines (including restricted Boltzmann machines) can generate
approximate samples using generally costly and inexact Markov Chain
Monte Carlo (MCMC) techniques.
[0086] Sampling can be efficient in directed graphical models
comprising a directed acyclic graph since sampling can be performed
by an ancestral pass. Even so, it can be inefficient to compute the
posterior distributions over the hidden causes of observed data in
such models, and samples from the posterior distributions are
required to compute the gradient of the log-likelihood
function.
[0087] Another approach to unsupervised learning is to optimize a
lower bound on the log-likelihood function. This approach can be
more computationally efficient. An example of a lower bound is the
evidence lower bound (ELBO) which differs from the true
log-likelihood by the KL-divergence between an approximating
posterior distribution, q(z|x,φ), and the true posterior distribution, p(z|x,θ). The approximating posterior
distribution can be designed to be computationally tractable even
though the true posterior distribution is not computationally
tractable. The ELBO can be expressed as follows:
L(x,\theta,\phi) = \log p(x|\theta) - \mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|x,\theta)\right] = \int_z q(z|x,\phi)\,\log\left[\frac{p(x,z|\theta)}{q(z|x,\phi)}\right]
where x denotes the observed random variables, z the latent random variables, θ the parameters of the generative model, and φ the parameters of the approximating posterior.
[0088] Successive optimization of the ELBO with respect to .PHI.
and .theta. is analogous to variational expectation-maximization
(EM). It is generally possible to construct a stochastic
approximation to gradient descent on the ELBO that only requires
exact, computationally tractable samples. A drawback of this
approach is that it can lead to high variance in the gradient
estimate, and can result in slow training and poor performance.
[0089] The variational auto-encoder can regroup the ELBO as:
L(x,\theta,\phi) = -\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] + \mathbb{E}_q\left[\log p(x|z,\theta)\right].
[0090] The KL-divergence between the approximating posterior and
the true prior is analytically simple and computationally efficient
for commonly chosen distributions, such as Gaussians.
[0091] A low-variance stochastic approximation to the gradient of the auto-encoding term \mathbb{E}_q\left[\log p(x|z,\theta)\right] can be backpropagated efficiently, so long as samples from the approximating posterior q(z|x) can be drawn using a differentiable, deterministic function f(x,φ,ρ) of the combination of the inputs x, the parameters φ, and a set of input- and parameter-independent random variables ρ ~ D. For instance, given a Gaussian distribution with mean m(x,φ) and variance v(x,φ) determined by the input, \mathcal{N}(m(x,\phi), v(x,\phi)), samples can be drawn using
f(x,\phi,\rho) = m(x,\phi) + \sqrt{v(x,\phi)}\,\rho, \quad \text{where } \rho \sim \mathcal{N}(0,1).
When such an f(x,φ,ρ) exists,
\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\log p(x|f(x,\rho,\phi),\theta)\right]
\frac{\partial}{\partial\phi}\,\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\frac{\partial}{\partial\phi}\log p(x|f(x,\rho,\phi),\theta)\right] \approx \frac{1}{N}\sum_{\rho\sim D}\frac{\partial}{\partial\phi}\log p(x|f(x,\rho,\phi),\theta), \quad (1)
and the stochastic approximation to the derivative in equation 1 is
analytically tractable so long as p(x|z,θ) and f(x,ρ,φ) are defined so as to have tractable derivatives.
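For the Gaussian case above, a minimal sketch of such a differentiable, deterministic sampling function f(x,φ,ρ) is as follows (m and v stand in for the encoder outputs m(x,φ) and v(x,φ); the names are illustrative):

    import numpy as np

    def gaussian_reparameterized_sample(m, v, rng=None):
        # Draw z = m + sqrt(v) * rho with rho ~ N(0, 1); z is then a deterministic,
        # differentiable function of (m, v) given the externally sampled rho.
        rng = np.random.default_rng() if rng is None else rng
        rho = rng.standard_normal(np.shape(m))
        return m + np.sqrt(v) * rho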
[0092] This approach is possible whenever the approximating posteriors for each hidden variable, q_i(z_i|x,φ), are independent given x and φ; the cumulative distribution function (CDF) of each q_i is invertible; and the inverse CDF of each q_i is differentiable. Specifically, choose D to be the uniform distribution between 0 and 1, and f_i to be the inverse CDF of q_i.
[0093] The conditional-marginal cumulative distribution function (CDF) is defined by
F_i(x) = \int_{x_i'=-\infty}^{x} p(x_i'|x_1,\ldots,x_{i-1})\,dx_i'
[0094] Since the approximating posterior distribution q(z|x,.PHI.)
maps each input to a distribution over the latent space, it is
called the "encoder". Correspondingly, since the conditional
likelihood distribution p(x|z,.theta.) maps each configuration of
the latent variables to a distribution over the input space, it is
called the "decoder".
[0095] Unfortunately, a multivariate CDF is generally not
invertible. One way to deal with this is to define a set of CDFs as
follows:
F_i(x) = \int_{x_i'=-\infty}^{x} p(x_i'|x_1,\ldots,x_{i-1})\,dx_i'
and invert each conditional CDF in turn. The CDF F.sub.i(x) is the
CDF of x.sub.i conditioned on all x.sub.j where j<i, and
marginalized over all x.sub.k where i<k. Such inverses generally
exist provided the conditional-marginal probabilities are
everywhere non-zero.
Discrete Variational Auto-Encoders
[0096] The approach can run into challenges with discrete
distributions, such as, for example, Restricted Boltzmann Machines
(RBMs). An approximating posterior that only assigns non-zero
probability to a discrete domain corresponds to a CDF that is
piecewise-constant. That is, the range of the CDF is a proper
subset of the interval [0, 1]. The domain of the inverse CDF is
thus also a proper subset of the interval [0, 1] and its derivative
is generally not defined.
[0097] The difficulty can remain even if a quantile function as
follows is used:
F_p^{-1}(\rho) = \inf\left\{ z \in \mathbb{R} : \int_{z'=-\infty}^{z} p(z')\,dz' \geq \rho \right\}
The derivative of the quantile function is either zero or infinite
for a discrete distribution.
[0098] One method for discrete distributions is to use a
reinforcement learning method such as REINFORCE (Williams,
http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf). The REINFORCE method adjusts weights following receipt of a reinforcement value by an amount proportional to the difference between a reinforcement baseline and the reinforcement value.
Rather than differentiating the conditional log-likelihood directly
in REINFORCE, the gradient of the log of the conditional likelihood
distribution is estimated, in effect, by a finite difference
approximation. The conditional log-likelihood log p(x|z,.theta.) is
evaluated at many different points z.about.q(z|x,.PHI.), and the
gradient
\frac{\partial}{\partial\phi}\log q(z|x,\phi)
is weighted more strongly when p(x|z,θ) differs more greatly from the baseline.
[0099] One disadvantage is that the change of p(x|z,.theta.) in a
given direction can only affect the REINFORCE gradient estimate if
a sample is taken with a component in the same direction. In a
D-dimensional latent space, at least D samples are required to
capture the variation of the conditional distribution
p(x|z,.theta.) in all directions. Since the latent representation
can typically consist of hundreds of variables, the REINFORCE
gradient estimate can be much less efficient than one that makes
more direct use of the gradient of the conditional distribution
p(x|z,.theta.).
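A rough sketch of the REINFORCE-style estimator discussed above, assuming the per-sample log-likelihoods and score-function gradients have already been evaluated (all argument names are illustrative placeholders):

    import numpy as np

    def reinforce_gradient(log_p_x_given_z, grad_log_q, baseline=0.0):
        # Average of (log p(x|z) - baseline) * d/dphi log q(z|x, phi) over samples z,
        # where grad_log_q has one row of score-function gradients per sample.
        weights = np.asarray(log_p_x_given_z, dtype=float) - baseline
        return (weights[:, None] * np.asarray(grad_log_q, dtype=float)).mean(axis=0)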
[0100] A discrete variational auto-encoder (DVAE) is a hierarchical
probabilistic model consisting of an RBM, followed by multiple
layers of continuous latent variables, allowing the binary
variables to be marginalized out, and the gradient to backpropagate
smoothly through the auto-encoding component of the ELBO.
[0101] The generative model is redefined so that the conditional
distribution of the observed variables given the latent variables
only depends on the new continuous latent space.
[0102] A discrete distribution is thereby transformed into a
mixture distribution over this new continuous latent space. This
does not alter the fundamental form of the model, nor the
KL-divergence term of the ELBO; rather it adds a stochastic
component to the approximating posterior and the prior.
[0103] One interpretation of the way that VAEs work is that they
break the encoder distribution into "packets" of probability, each
packet having infinitesimal but equal probability mass. Within each
packet, the values of the latent variables are approximately
constant. The packets correspond to a region in the latent space,
and the expectation value is taken over the packets. There are
generally more packets in regions of high probability, so more
probable values are more likely to be selected.
[0104] As the parameters of the encoder are changed, the location
of each packet can move, while its probability mass stays constant.
So long as F_{q(z|x,φ)}^{-1} exists and is differentiable, a small change in φ will correspond to a small change in the location of each packet. This allows the use of the gradient of the decoder
to estimate the change in the loss function, since the gradient of
the decoder captures the effect of small changes in the location of
a selected packet in the latent space.
[0105] In contrast, REINFORCE works by breaking the latent
representation into segments of infinitesimal but equal volume,
within which the latent variables are also approximately constant,
while the probability mass varies between segments. Once a segment
is selected in the latent space, its location is independent of the
parameters of the encoder. As a result, the contribution of the
selected location to the loss function is not dependent on the
gradient of the decoder. On the other hand, the probability mass
assigned to the region in the latent space around the selected
location is relevant.
[0106] Though VAEs can make use of gradient information from the
decoder, the gradient estimate is generally only low-variance
provided the motion of most probability packets has a similar
effect on the loss function. This is likely to be the case when the
packets are tightly clustered (e.g., if the encoder produces a
Gaussian distribution with low variance) or if the movements of
well-separated packets have a similar effect on the loss function
(e.g., if the decoder is roughly linear).
[0107] One difficulty is that VAEs cannot generally be used
directly with discrete latent representations because changing the
parameters of a discrete encoder moves probability mass between the
allowed discrete values, and the allowed discrete values are
generally far apart. As the encoder parameters change, a selected
packet either remains in place or jumps more than an infinitesimal
distance to an allowed discrete value. Consequently, small changes
to the parameters of the encoder do not affect most of the
probability packets. Even when a packet jumps between discrete
values of the latent representation, the gradient of the decoder
generally cannot be used to estimate the change in loss function
accurately, because the gradient generally captures only the
effects of very small movements of the probability packet.
[0108] Therefore, to use discrete latent representations in the VAE
framework, the method described herein for unsupervised learning
transforms the distributions to a continuous latent space within
which the probability packets move smoothly. The encoder q(z|x,φ) and prior distribution p(z|θ) are extended by a
transformation to a continuous, auxiliary latent representation
.zeta., and the decoder is correspondingly transformed to be a
function of the continuous representation. By extending the encoder
and the prior distribution in the same way, the remaining
KL-divergence (referred to above) is unaffected.
[0109] In the transformation, one approach maps each point in the
discrete latent space to a non-zero probability over the entire
auxiliary continuous space. In so doing, if the probability at a
point in the discrete latent space increases from zero to a
non-zero value, a probability packet does not have to jump a large
distance to cover the resulting region in the auxiliary continuous
space. Moreover, it ensures that the CDFs F.sub.i(x) are strictly
increasing as a function of their main argument, and thus are
invertible. The method described herein for unsupervised learning
smooths the conditional-marginal CDF F.sub.i(x) of an approximating
posterior distribution, and renders the distribution invertible,
and its inverse differentiable, by augmenting the latent discrete
representation with a set of continuous random variables. The
generative model is redefined so that the conditional distribution
of the observed variables given the latent variables only depends
on the new continuous latent space.
[0110] The discrete distribution is thereby transformed into a
mixture distribution over the continuous latent space, each value
of each discrete random variable associated with a distinct mixture
component on the continuous expansion. This does not alter the
fundamental form of the model, nor the KL-divergence term of the
ELBO; rather it adds a stochastic component to the approximating
posterior and the prior.
[0111] The method augments the latent representation with
continuous random variables .zeta., conditioned on z, as
follows:
q(\zeta,z|x,\phi) = r(\zeta|z)\,q(z|x,\phi)
where the support of r(ζ|z) for all values of z is connected, so the marginal distribution q(\zeta|x,\phi) = \sum_z r(\zeta|z)\,q(z|x,\phi) has a constant, connected support so long as 0 < q(z|x,φ) < 1. The approximating posterior r(ζ|z) is continuous and differentiable except at the end points of its support, so that the inverse conditional-marginal CDF is differentiable.
[0112] FIG. 3 shows an example implementation of a VAE. The
variable z is a latent variable. The variable x is a visible
variable (for example, pixels in an image data set). The variable ζ is a continuous variable conditioned on a discrete z, as described above in the present disclosure. The variable ζ can serve to smooth out the discrete random variables in the auto-encoder term. As described above, the variable ζ generally does not directly affect the KL-divergence between the approximating posterior and the true prior.
[0113] In the example, the variables z.sub.1, z.sub.2, and z.sub.3
are disjoint subsets of qubits in the quantum processor. The
computational system samples from the RBM using the quantum
processor. The computational system generates the hierarchical
approximating posteriors using a digital (classical) computer. The
computational system uses priors 310 and 330, and hierarchical
approximating posteriors 320 and 340.
[0114] For the prior 330 and the approximating posterior 340, the system adds continuous variables ζ_1, ζ_2, ζ_3 below the latent variables z_1, z_2, z_3.
[0115] FIG. 3 also shows the auto-encoding loop 350 of the VAE.
Initially, input x is passed into a deterministic feedforward network q(z=1|x,φ), for which the final non-linearity is the logistic function. Its output q, along with an independent random variable ρ, is passed into the deterministic function F_{q(ζ|x,φ)}^{-1} to produce a sample of ζ. This ζ, along with the original input x, is finally passed to log p(x|ζ,θ). The expectation of this log probability with respect to ρ is the auto-encoding term of the VAE. This auto-encoder, conditioned on the input and the independent ρ,
is deterministic and differentiable, so backpropagation can be used
to produce a low-variance, computationally efficient approximation
to the gradient.
[0116] The distribution remains continuous as q(z|x,.PHI.) changes.
The distribution is also everywhere non-zero in the approach that
maps each point in the discrete latent space to a non-zero
probability over the entire auxiliary continuous space.
Correspondingly, p(.zeta.,z|.theta.) is defined as
p(.zeta.,z|.theta.)=r(.zeta.|z)p(z|.theta.), where r(.zeta.|z) is
the same as for the approximating posterior, and
p(x|.zeta.,z,.theta.)=p(x|.zeta.,.theta.). This transformation
renders the model a continuous distribution over z.
[0117] The method described herein can generate low-variance
stochastic approximations to the gradient. The KL-divergence
between the approximating posterior and the true prior distribution
is unaffected by the introduction of auxiliary continuous latent
variables, provided the same expansion is used for both.
[0118] The auto-encoder portion of the loss function is evaluated
in the space of continuous random variables, and the KL-divergence
portion of the loss function is evaluated in the discrete
space.
[0119] The KL-divergence portion of the loss function is as
follows:
-\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] = \sum_z q(z|x,\phi)\left[\log p(z|\theta) - \log q(z|x,\phi)\right]
[0120] The gradient of the KL-divergence portion of the loss function in the above equation with respect to θ can be estimated stochastically using samples from the true prior distribution p(z|θ). The gradient of the KL-divergence portion of the loss function can be expressed as follows:
\frac{\partial\,\mathrm{KL}(q\,\|\,p)}{\partial\theta} = -\left\langle\frac{\partial E_p(z|\theta)}{\partial\theta}\right\rangle_{q(z|x,\phi)} + \left\langle\frac{\partial E_p(z|\theta)}{\partial\theta}\right\rangle_{p(z|\theta)}
[0121] In one approach, the method computes the gradients of the
KL-divergence portion of the loss function analytically, for
example by first directly parameterizing a factorial q(z|x,.PHI.)
with a deep network g(x):
q(z|x,\phi) = \frac{e^{-E_q(z|x,\phi)}}{\sum_{z'} e^{-E_q(z'|x,\phi)}}, \quad \text{where } E_q(z|x) = -g(x)^T z
and then using the following expression:
\frac{\partial\,\mathrm{KL}(q\,\|\,p)}{\partial\phi} = \left(\left(g(x) - h - (J^T + J)\langle z\rangle_q\right)^T \odot \left(\langle z\rangle_q - \langle z\rangle_q^2\right)^T\right)\frac{\partial g(x)}{\partial\phi}
Equation 1 can therefore be simplified by dropping the dependence of p on z and then marginalizing z out of q, as follows:
\frac{\partial}{\partial\phi}\,\mathbb{E}_{q(\zeta,z|x,\phi)}\left[\log p(x|\zeta,z,\theta)\right] \approx \frac{1}{N}\sum_{\rho\sim U(0,1)^n}\left.\frac{\partial}{\partial\phi}\log p(x|\zeta(\rho),\theta)\right|_{\zeta=\zeta(\rho)} \quad (2)
[0122] An example of a transformation from the discrete latent
space to a continuous latent space is the spike-and-slab
transformation:
r(\zeta_i|z_i=0) = \begin{cases}\infty, & \text{if } \zeta_i = 0\\ 0, & \text{otherwise}\end{cases} \qquad r(\zeta_i|z_i=1) = \begin{cases}1, & \text{if } 0 \le \zeta_i \le 1\\ 0, & \text{otherwise}\end{cases}
This transformation is consistent with sparse coding.
[0123] Other expansions to the continuous space are also possible.
As an example a combination of delta spike and exponential function
can be used:
r(\zeta_i|z_i=0) = \begin{cases}\infty, & \text{if } \zeta_i = 0\\ 0, & \text{otherwise}\end{cases} \qquad r(\zeta_i|z_i=1) = \begin{cases}\dfrac{\beta e^{\beta\zeta_i}}{e^{\beta}-1}, & \text{if } 0 \le \zeta_i \le 1\\ 0, & \text{otherwise}\end{cases}
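For the spike-and-exponential expansion above, the conditional-marginal CDF can be inverted in closed form. The sketch below is an illustration rather than the disclosed implementation, and assumes 0 < q < 1; it maps a uniform random variable rho and the posterior probability q = q(z=1|x,φ) to a sample of ζ.

    import numpy as np

    def spike_exp_inverse_cdf(rho, q, beta=1.0):
        # F(zeta) = (1 - q) + q * (exp(beta*zeta) - 1) / (exp(beta) - 1) for zeta in [0, 1];
        # invert it: zeta = 0 when rho falls in the spike, otherwise solve the slab part.
        rho = np.asarray(rho, dtype=float)
        q = np.clip(np.asarray(q, dtype=float), 1e-7, 1.0 - 1e-7)
        slab = 1.0 + (rho - (1.0 - q)) / q * np.expm1(beta)
        return np.where(rho <= 1.0 - q, 0.0, np.log(np.maximum(slab, 1.0)) / beta)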
[0124] Alternatively, it is possible to define a transformation
from discrete to continuous variables in the approximating
posterior, r(.zeta.|z), where the transformation is not independent
of the input x. In the true posterior distribution,
p(.zeta.|z,x).apprxeq.p(.zeta.|z) only if z already captures most
of the information about x and p(.zeta.|z,x) changes little as a
function of x. In a case where it may be desirable for
q(.zeta..sub.i|z.sub.i,x,.PHI.) to be a separate Gaussian for both
values of the binary z.sub.i, it is possible to use a mixture of a
delta spike and a Gaussian to define a transformation from the
discrete to the continuous space for which the CDF can be inverted
piecewise.
[0125] FIG. 4 shows a method 400 of unsupervised learning using a
discrete variational auto-encoder. Execution of the method 400 by
one or more processor-based devices may occur in accordance with
the present system, devices, articles, and methods. Method 400,
like other methods herein may be implemented by a series or set of
processor-readable instructions executed by one or more processors
(i.e., hardware circuitry).
[0126] Method 400 starts at 405, for example in response to a call
from another routine or other invocation.
[0127] At 410, the system initializes the model parameters with
random values. Alternatively, the system can initialize the model
parameters based on a pre-training procedure. At 415, the system
tests to determine if a stopping criterion has been reached. The
stopping criterion can, for example, be related to the number of
epochs (i.e., passes through the dataset) or a measurement of
performance between successive passes through a validation dataset.
In the latter case, when performance begins to degrade, it is an
indication that the system is over-fitting and should stop.
[0128] In response to determining the stopping criterion has been reached, the system ends method 400 at 475, until invoked again, for example, by a request to repeat the learning.
[0129] In response to determining the stopping criterion has not
been reached, the system fetches a mini-batch of the training data
set at 420. At 425, the system propagates the training data set
through the encoder to compute the full approximating posterior
over discrete space z.
[0130] At 430, the system generates or causes generation of samples
from the approximating posterior over .zeta., given the full
distribution over z. Typically, this is performed by a non-quantum
processor, and uses the inverse of the CDF F.sub.i(x) described
above. The non-quantum processor can, for example, take the form of
one or more of one or more digital microprocessors, digital signal
processors, graphical processing units, central processing units,
digital application specific integrated circuits, digital field
programmable gate arrays, digital microcontrollers, and/or any
associated memories, registers or other nontransitory computer- or
processor-readable media, communicatively coupled to the
non-quantum processor.
[0131] At 435, the system propagates the samples through the
decoder to compute the distribution over the input.
[0132] At 440, the system performs backpropagation through the
decoder.
[0133] At 445, the system performs backpropagation through the
sampler over the approximating posterior over .zeta.. In this
context, backpropagation is an efficient computational approach to
determining the gradient.
[0134] At 450, the system computes the gradient of the
KL-divergence between the approximating posterior and the true
prior over z. At 455, the system performs backpropagation through
the encoder.
[0135] At 457, the system determines a gradient of a KL-divergence,
with respect to parameters of the true prior distribution, between
the approximating posterior and the true prior distribution over
the discrete space.
[0136] At 460, the system determines at least one of a gradient or
at least a stochastic approximation of a gradient, of a bound on
the log-likelihood of the input data.
[0137] In some embodiments, the system generates samples or causes
samples to be generated by a quantum processor. At 465, the system
updates the model parameters based at least in part on the
gradient.
[0138] At 470, the system tests to determine if the current
mini-batch is the last mini-batch to be processed. In response to
determining that the current mini-batch is the last mini-batch to
be processed, the system returns control to 415. In response to
determining that the current mini-batch is not the last mini-batch
to be processed, the system returns control to 420.
[0139] In some implementations, act 470 is omitted, and control
passes directly to 415 from 465. The decision whether to fetch
another mini-batch can be incorporated in 415.
[0140] In summary, as described in more detail above, the discrete
VAE method extends the encoder and the prior with a transformation
to a continuous, auxiliary latent representation, and
correspondingly makes the decoder a function of the same continuous
representation. The method evaluates the auto-encoder portion of
the loss function in the continuous representation while evaluating
the KL-divergence portion of the loss function in the z space.
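The control flow of method 400 can be summarized in a schematic skeleton such as the following; every name (model, its methods, minibatches, update) is a placeholder assumption standing in for acts 410-475, not the disclosed implementation.

    def train_discrete_vae(model, minibatches, stopping_criterion, update):
        # Schematic outline of method 400; comments reference the act numbers above.
        model.initialize_parameters()                       # 410: random or pre-trained values
        while not stopping_criterion(model):                # 415: epochs or validation performance
            for x in minibatches():                         # 420: fetch a mini-batch
                q = model.encode(x)                         # 425: approximating posterior over z
                zeta = model.sample_zeta(q)                 # 430: inverse-CDF sampling over zeta
                p_x = model.decode(zeta)                    # 435: distribution over the input
                grad = model.autoencoding_gradient(p_x, x)  # 440-445: backprop through decoder and sampler
                grad = grad + model.kl_gradient(q)          # 450-457: KL between posterior and prior over z
                update(model, grad)                         # 460-465: update parameters from the gradient
        # 475: end, until invoked again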
[0141] Accommodating Explaining-Away with a Hierarchical
Approximating Posterior
[0142] When a probabilistic model is defined in terms of a prior
distribution p(z) over latent variables z and a conditional
distribution p(x|z) over observed variables x given the latent
variables, the observation of x often induces strong correlations
of the z, given x, in the posterior p(z|x) due to phenomena such as
explaining-away, a pattern of reasoning where the confirmation of
one cause reduces the need to search for alternative causes.
Moreover, an RBM used as the prior distribution may have strong
correlations between the units of the RBM.
[0143] To accommodate the strong correlations expected in the
posterior distribution while maintaining tractability, hierarchy
can be introduced into the approximating posterior q(z|x). Although
the variables of each hierarchical layer are independent given the
previous layers, the total distribution can capture strong
correlations, especially as the size of each hierarchical layer
shrinks towards a single variable.
[0144] The latent variables z of the RBM are divided into disjoint
groups, z.sub.1, . . . , z.sub.k. The continuous latent variables
.zeta. are divided into complementary disjoint groups .zeta..sub.1,
. . . , ζ_k. In one implementation, the groups may be chosen at random, while in other implementations the groups may be defined so as to be of equal size. The hierarchical variational
auto-encoder defines the approximating posterior via a directed
acyclic graphical model over these groups.
q(z_1,\zeta_1,\ldots,z_k,\zeta_k|x,\phi) = \prod_{1\le j\le k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x,\phi)
where
q(z_j|\zeta_{i<j},x,\phi) = \frac{e^{g_j(\zeta_{i<j},x)^T z_j}}{\prod_{z_\iota\in z_j}\left(1 + e^{g_{z_\iota}(\zeta_{i<j},x)}\right)}
[0145] z_j ∈ {0,1}, and g_j(ζ_{i<j},x,φ) is a parameterized function of the input and the preceding ζ_i, such as a neural network. The corresponding graphical model is shown in FIG. 5.
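The hierarchical factorization above can be sampled group by group. The sketch below is illustrative only: g_networks stands in for the parameterized functions g_j and smoother for an inverse conditional-marginal CDF such as the spike-and-exponential one sketched earlier; neither name comes from the disclosure.

    import numpy as np

    def sample_hierarchical_posterior(g_networks, x, smoother, rng=None):
        # For each group j in turn: compute logits g_j(zeta_{i<j}, x), take the
        # elementwise logistic to get q(z_j = 1 | zeta_{i<j}, x, phi), then draw
        # zeta_j through the smoothing transformation using fresh uniform noise.
        rng = np.random.default_rng() if rng is None else rng
        zetas = []
        for g_net in g_networks:                      # one callable per group z_1, ..., z_k
            q_j = 1.0 / (1.0 + np.exp(-g_net(x, zetas)))
            rho = rng.random(np.shape(q_j))
            zetas.append(smoother(rho, q_j))
        return zetas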
[0146] FIG. 5 is a schematic diagram illustrating an example implementation of a hierarchical variational auto-encoder (VAE).
The model uses approximating posterior 510, where latent variable
z.sub.3 is conditioned on the continuous variables .zeta..sub.2 and
.zeta..sub.1 while z.sub.2 is conditioned on .zeta..sub.1.
[0147] The dependence of z.sub.j on the discrete variables
z.sub.i<j is mediated by the continuous variables
.zeta..sub.i<j.
[0148] This hierarchical approximating posterior does not affect the form of the auto-encoding term 520 of FIG. 5, except to increase the depth of the auto-encoder. Each ζ_j can be computed via the stochastic nonlinearity F_{q_j(ζ_j|ζ_{i<j},x,φ)}^{-1}(ρ), where the function q_j can take the previous ζ_{i<j} as input.
[0149] The deterministic probability value q(z=1|ζ_{i<j},x,φ) is parameterized, for example by a neural network.
[0150] For each successive layer j of the auto-encoder, input x and all previous ζ_{i<j} are passed into the network computing q(z=1|ζ_{i<j},x,φ). Its output q_j, along with an independent random variable ρ, is passed into the deterministic function F_{q_j(ζ_j|ζ_{i<j},x,φ)}^{-1}(ρ) to produce a sample of ζ_j. Once all ζ_j have been recursively computed, the full ζ, along with the original input x, is finally passed to log p(x|ζ,θ).
[0151] The KL-divergence between the approximating posterior and
the true prior is also not significantly affected by the
introduction of additional continuous latent variables .zeta., so
long as the approach uses the same expansion r(.zeta.|z) for both
the approximating posterior and the prior, as follows:
\mathrm{KL}\left[q\,\|\,p\right] = \sum_z\int_\zeta\left(\prod_{1\le j\le k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)\right)\log\frac{\prod_{1\le j\le k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)}{p(z)\,\prod_{1\le j\le k} r(\zeta_j|z_j)}
= \sum_z\int_\zeta\left(\prod_{1\le j\le k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)\right)\log\frac{\prod_{1\le j\le k} q(z_j|\zeta_{i<j},x)}{p(z)}
[0152] The gradient of the KL-divergence with respect to the
parameter of the prior p(z|.theta.) can be estimated stochastically
using samples from the approximating posterior q(.zeta.,z|x,.PHI.)
and the true prior p(z|.theta.). The prior can be, for example, an
RBM.
[0153] The final expectation with respect to q(z_k|ζ_{i<k},x,φ) can be performed analytically;
all other expectations require samples from the approximating
posterior. Similarly, the prior requires samples from, for example,
an RBM.
[0154] Samples from the same prior distribution are required for an
entire mini-batch, independent from the samples chosen from the
training dataset.
[0155] Hierarchical Variational Auto-Encoders
[0156] Convolutional architectures are an essential component of
state-of-the-art approaches to visual object classification, speech
recognition, and numerous other tasks. In particular, they have
been successfully applied to generative modeling, such as in
deconvolutional networks and LAPGAN. There is, therefore, technical
benefit in incorporating convolutional architectures into
variational auto-encoders, as such can provide a technical solution
to a technical problem, and thereby achieve a technical result.
[0157] Convolutional architectures are necessarily hierarchical. In
the feedforward direction, they build from local, high-resolution
features to global, low-resolution features through the application
of successive layers of convolution, point-wise nonlinear
transformations, and pooling. When used generatively, this process
is reversed, with global, low-resolution features building towards
local, high-resolution features through successive layers of
deconvolution, point-wise nonlinear transformations, and
unpooling.
[0158] Incorporating this architecture into the variational
auto-encoder framework, it is natural to associate the upward
pathway (from local to global) with the approximating posterior,
and the downward pathway (from global to local) with the generative
model. However, if the random variables of the generative model are
defined to be the units of the deconvolutional network itself, then
samples from the approximating posterior of the last hidden layer
of the deconvolutional decoder can be determined directly by the
convolutional encoder. In particular, it can be natural to define
the samples from the last layer of the deconvolutional decoder to
be a function solely of the first layer of the convolutional
encoder. As a result, the auto-encoding component of the VAE
parameter update depends on the bottom-most layer of random
variables. This seems contradictory to the intuitive structure of a
convolutional auto-encoder.
[0159] Instead, ancillary random variables can be defined at each
layer of the deconvolutional decoder network. Ancillary random
variables can be discrete random variables or continuous random
variables.
[0160] In the deconvolutional decoder, the ancillary random
variables of layer n are used in conjunction with the signal from
layer n+1 to determine the signal to layer n-1. The approximating
posterior over the ancillary random variables of layer n is defined
to be a function of the convolutional encoder, generally restricted
to layer n of the convolutional encoder. To compute a stochastic
approximation to the gradient of the evidence lower bound, the approach can perform a single pass up the convolutional encoder
network, followed by a single pass down the deconvolutional decoder
network. In the pass down the deconvolutional decoder network, the
ancillary random variables are sampled from the approximating
posteriors computed in the pass up the convolutional encoder
network.
[0161] A Problem with the Traditional Approach
[0162] A traditional approach can result in approximating
posteriors that poorly match the true posterior, and consequently
can result in poor samples in the auto-encoding loop. In
particular, the approximating posterior defines independent
distributions over each layer. This product of independent
distributions ignores the strong correlations between adjacent
layers in the true posterior, conditioned on the underlying
data.
[0163] The representation throughout layer n should be mutually
consistent, and consistent with the representation in layer n-1 and
n+1. However, in the architecture described above, the
approximating posterior over every random variable is independent.
In particular, the variability in the higher (more abstract) layers
is uncorrelated with that in the lower layers, and consistency
cannot be enforced across layers unless the approximating posterior
collapses to a single point.
[0164] This problem is apparent in the case of (hierarchical)
sparse coding. At every layer, the true posterior has many modes,
constrained by long-range correlations within each layer. For
instance, if a line in an input image is decomposed into a
succession of short line segments (e.g., Gabor filters), it is
essential that the end of one segment line up with the beginning of
the next segment. With a sufficiently overcomplete dictionary,
there may be many sets of segments that cover the line, but differ
by a small offset along the line. A factorial posterior can
reliably represent one such mode.
[0165] These equivalent representations can be disambiguated by the
successive layers of the representation. For instance, a single
random variable at a higher layer may specify the offset of all the
line segments in the previous example. In the traditional approach,
the approximating posteriors of the (potentially disambiguating)
higher layers are computed after approximating posteriors of the
lower layers have been computed. In contrast, an efficient
hierarchical variational auto-encoder could infer the approximating
posterior over the top-most layer first, potentially using a deep,
convolutional computation. It would then compute the conditional
approximating posteriors of lower layers given a sample from the
approximating posterior of the higher layers.
[0166] A Proposed Approach: Hierarchical Priors and Approximating Posteriors
[0167] In the present approach, rather than defining the
approximating posterior to be fully factorial, the computational
system conditions the approximating posterior for the n.sup.th
layer on the sample from the approximating posterior of the higher
layers preceding it in the downward pass through the
deconvolutional decoder. In an example case, the computational
system conditions the approximating posterior for the n.sup.th
layer on the sample from the (n-1).sup.th layer. This corresponds
to a directed graphical model, flowing from the higher, more
abstract layers to the lower, more concrete layers. Consistency
between the approximating posterior distributions over each pair of
layers is ensured directly.
[0168] With such a directed approximating posterior, it is possible
to do away with ancillary random variables, and define the
distribution directly over the primary units of the deconvolutional
network. In this case, the system can use a parameterized
distribution for the deconvolutional component of the approximating
posterior that shares structure and parameters with the generative
model. Alternatively, the system can continue to use a separately
parameterized directed model.
[0169] In the example case and other cases, a stochastic
approximation to the gradient of the evidence lower bound can be
computed via one pass up the convolutional encoder, one pass down
the deconvolutional decoder of the approximating posterior, and
another pass down the deconvolutional decoder of the prior,
conditioned on the sample from the approximating posterior. Note
that if the approximating posterior is defined directly over the
primary units of the deconvolutional generative model, as opposed
to ancillary random variables, the final pass down the
deconvolutional decoder of the prior does not actually pass signals
from layer to layer. Rather, the input to each layer is determined
by the approximating posterior.
[0170] Below is an outline of the computations for two adjacent
hidden layers, highlighting the hierarchical components and
ignoring the details of convolution and deconvolution. If the
approximating posterior is defined directly over the primary units
of the deconvolutional generative model, then it is natural to use
a structure such as:
q(z_{n-1}, z_n|x,\phi) = q(z_{n-1}|x,\phi)\,q(z_n|z_{n-1},x,\phi)
p(z_{n-1}, z_n|\theta) = p(z_n|z_{n-1},\theta)\,p(z_{n-1}|\theta)
[0171] This builds the prior by conditioning the more local
variables of the (n-1).sup.th layer on the more global variables of
the n.sup.th layer. With ancillary random variables, we might
choose to use a simpler prior structure:
p(z_{n-1}, z_n|\theta) = p(z_{n-1}|\theta)\,p(z_n|\theta)
[0172] The evidence lower bound decomposes as:
L_{VAE}(x,\theta,\phi) = \log p(x|\theta) - \mathrm{KL}\left[q(z_n,z_{n-1}|x,\phi)\,\|\,p(z_n,z_{n-1}|x,\theta)\right]
= \log p(x|\theta) - \mathrm{KL}\left[q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)\,\|\,p(z_{n-1}|z_n,x,\theta)\,p(z_n|x,\theta)\right]
= \sum_{z_n}\int_{z_{n-1}} q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)\,\log\left[\frac{p(x|z_{n-1},\theta)\,p(z_{n-1}|z_n,\theta)\,p(z_n|\theta)}{q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)}\right]
= \mathbb{E}_{q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)}\left[\log p(x|z_n,z_{n-1},\theta)\right] - \mathrm{KL}\left[q(z_n|x,\phi)\,\|\,p(z_n|\theta)\right] - \sum_{z_n} q(z_n|x,\phi)\,\mathrm{KL}\left[q(z_{n-1}|z_n,x,\phi)\,\|\,p(z_{n-1}|z_n,\theta)\right] \quad (3)
[0173] If the approximating posterior is defined directly over the
primary units of the deconvolutional generative model, then it may
be the case that
p(x|z.sub.n,z.sub.n-1,.theta.)=p(x|z.sub.n-1,.theta.).
[0174] If both q(z_{n-1}|z_n,x,φ) and p(z_{n-1}|z_n,θ) are Gaussian, then their KL-divergence has a simple closed form, which can be computationally efficient if the
covariance matrices are diagonal. The gradients with respect to
q(z.sub.n|x,.PHI.) in the last term of Equation 3 can be obtained
using the same reparameterization method used in a standard
VAE.
[0175] To compute the auto-encoding portion of the ELBO, the system
propagates up the convolutional encoder and down the
deconvolutional decoder of the approximating posterior, to compute
the parameters of the approximating posterior. In an example
parameterization, this can compute the conditional approximating
posterior of the n.sup.th layer based on both the n.sup.th layer of
the convolutional encoder, and the preceding (n-1).sup.th layer of
the deconvolutional decoder of the approximating posterior. In
principle, the approximating posterior of the n.sup.th layer may be
based upon the input, the entire convolutional encoder, and layers
i.ltoreq.n of the deconvolutional decoder of the approximating
posterior (or a subset thereof).
[0176] The configuration sampled from the approximating posterior
is then used in a pass down the deconvolutional decoder of the
prior. If the approximating posterior is defined over the primary
units of the deconvolutional network, then the signal from the
(n-1).sup.th layer to the n.sup.th layer is determined by the
approximating posterior for the (n-1).sup.th layer, independent of
the preceding layers of the prior. If the approach uses auxiliary
random variables, the sample from the n.sup.th layer depends on the
(n-1).sup.th layer of the deconvolutional decoder of the prior, and
the n.sup.th layer of the approximating posterior.
[0177] This approach can be extended to arbitrary numbers of
layers, and to posteriors and priors that condition on more than
one preceding layer, e.g. where layer n is conditioned on all
layers m<n preceding it.
[0178] The approximating posterior and the prior can be defined to
be fully autoregressive directed graphical models.
[0179] The directed graphical models of the approximating posterior
and prior can be defined as follows:
q(\zeta_1,\ldots,\zeta_n|x,\phi) = \prod_{1\le m\le n} q(\zeta_m|\zeta_{l<m},x,\phi)
p(\zeta_1,\ldots,\zeta_n|\theta) = \prod_{1\le m\le n} p(\zeta_m|\zeta_{l<m},\theta)
where the entire RBM and its associated continuous latent variables are now denoted by ζ_1 = {z_1, ζ_1, . . . , z_k, ζ_k}. This builds an approximating posterior and prior by conditioning the more local variables of layer m on the more global variables of layers m-1, . . . , 1. However, the conditional distribution in p(ζ_1, . . . , ζ_n|θ) depends only on the continuous variables.
FIG. 6 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE) with a hierarchy of continuous latent variables, with an approximating posterior 610 and a prior 620.
[0180] Each ζ_m with m>1 in approximating posterior 610 and prior 620, respectively, denotes a layer of continuous latent variables and is conditioned on the layers preceding it. In the example implementation of FIG. 6, there are three levels of hierarchy.
[0181] Alternatively, the prior can be left factorial while the approximating posterior remains hierarchical, as follows:
p(\zeta_1,\ldots,\zeta_n|\theta) = \prod_{1\le m\le n} p(\zeta_m|\theta)
[0182] The ELBO decomposes as
L(x,\theta,\phi) = \log p(x|\theta) - \mathrm{KL}\left[\prod_m q(\zeta_m|\zeta_{l<m},x,\phi)\,\Big\|\,\prod_m p(\zeta_m|\zeta_{l<m},x,\theta)\right]
= \int_{\zeta_1}\int_{\zeta_2}\cdots\int_{\zeta_n}\prod_m q(\zeta_m|\zeta_{l<m},x,\phi)\,\log\left[\frac{p(x|\zeta,\theta)\,\prod_m p(\zeta_m|\zeta_{l<m},\theta)}{\prod_m q(\zeta_m|\zeta_{l<m},x,\phi)}\right]
= \mathbb{E}_{q(\zeta|x,\phi)}\left[\log p(x|\zeta,\theta)\right] - \sum_m \mathbb{E}_{q(\zeta_{l<m}|x,\phi)}\,\mathrm{KL}\left[q(\zeta_m|\zeta_{l<m},x,\phi)\,\|\,p(\zeta_m|\zeta_{l<m},\theta)\right] \quad (4)
[0183] In the case where both q(ζ_m|ζ_{l<m},x,φ) and p(ζ_m|ζ_{l<m},θ) are Gaussian distributions, the KL-divergence can be computationally efficient, and the gradient of the last term in Equation 4 with respect to q(ζ_{l<m}|x,φ) can be obtained by reparametrizing, as commonly done in a traditional VAE. In all cases, a stochastic approximation to the gradient of the ELBO can be computed via one pass down approximating posterior 610, sampling from each continuous latent ζ_i and each layer ζ_m with m>1 in turn, and another pass down prior 620, conditioned on the samples from the approximating posterior. In the pass down the approximating posterior, samples at each layer n may be based upon both the input and all the preceding layers m<n. To compute the auto-encoding portion of the ELBO, p(x|ζ,θ) can be applied from the prior to the sample from the approximating posterior.
[0184] The pass down the prior need not pass signal from layer to
layer. Rather, the input to each layer can be determined by the
approximating posterior using equation 4.
[0185] The KL-divergence is then taken between the approximating
posterior and true prior at each layer, conditioned on the layers
above. Re-parametrization can be used to include
parameter-dependent terms into the KL-divergence term.
[0186] Both the approximating posterior and the prior distribution of each layer ζ_m with m>1 are defined by neural networks, the inputs of which are the ζ_l with l<m, and x in the case of the approximating posterior. The outputs of these networks are the mean and variance of a diagonal-covariance Gaussian distribution.
[0187] To ensure that all the units in the RBM are sometimes active and sometimes inactive, and thus that all units in the RBM are used, when calculating the approximating posterior over the RBM units, rather than using traditional batch normalization, the system bases the batch normalization on the L1 norm. In an alternative approach, the system may base the batch normalization on the L2 norm.
[0188] Specifically, the system may use:
y = x - \bar{x}
x_{bn} = \frac{y}{\overline{|y|} + \epsilon}\odot s + o
and bound 2 \le s \le 3 and -s \le o \le s, where the overbars denote averages taken over the batch.
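A minimal sketch of such an L1-based batch normalization (interpreting the overbars as batch means, with s and o clipped to the stated bounds; not the disclosed implementation) is:

    import numpy as np

    def l1_batch_norm(x, s, o, eps=1e-5):
        # Center each unit over the batch, divide by the batch mean absolute deviation
        # (an L1-based scale rather than the usual standard deviation), then apply the
        # scale s (clipped to [2, 3]) and offset o (clipped to [-s, s]).
        y = x - x.mean(axis=0, keepdims=True)
        scale = np.clip(s, 2.0, 3.0)
        offset = np.clip(o, -scale, scale)
        return y / (np.abs(y).mean(axis=0, keepdims=True) + eps) * scale + offset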
[0189] ISTA-Like Generative Model
[0190] The training of variational auto-encoders is typically
limited by the form of the approximating posterior. However, there
can be challenges using an approximating posterior other than a
factorial posterior. The entropy of the approximating posterior,
which constitutes one of the components of the KL-divergence
between the approximating and true posterior (or true prior), can
be trivial if the approximating posterior is factorial, and close
to intractable if it is a mixture of factorial distributions. While
one might consider using normalizing flows, importance weighting,
or other methods to allow non-factorial approximating posteriors,
it may be easier to change the model to make the true posterior
more factorial.
[0191] In particular, with large numbers of latent variables, it
may be desirable to use a sparse, overcomplete representation. In
such a representation, there are many ways of representing a given
input, although some will be more probable than others. At the same
time, the model is sensitive to duplicate representations. Using
two latent variables that represent similar features is not
equivalent to using just one.
[0192] A similar problem arises in models with linear decoders and
a sparsity prior; i.e., sparse coding. ISTA (and LISTA) address
this by (approximately) following the gradient (with proximal
descent) of the L1-regularized reconstruction error. The resulting
transformation of the hidden representation is mostly linear in the
input and the hidden representation:
z \leftarrow (I - \epsilon W^{T} W)\, z - \epsilon \lambda\, \mathrm{sign}(z) + \epsilon W^{T} x
[0193] Note, though, that the input must be provided to every
layer.
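By way of example only, a single ISTA-like update of the form shown above may be sketched as follows, with step size eps and sparsity weight lam as illustrative names:

import numpy as np

def ista_step(z, x, W, eps, lam):
    """One ISTA-like update: a gradient step on the L1-regularized
    reconstruction error, largely linear in both x and z."""
    return z - eps * (W.T @ (W @ z)) - eps * lam * np.sign(z) + eps * (W.T @ x)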
[0194] A somewhat similar approach can be employed in the deconvolutional decoder of the approximating posterior. Consider the case where the conditional approximating posterior of layer z.sub.n given layer z.sub.n-1 is computed by a multi-layer deterministic network. Rather than making a deterministic transformation of the input available to the first layer of this network, the system can instead provide the deterministic transformation of the input to the internal layers, or any subset of the internal layers. The approximating posterior over the final Gaussian units may then employ sparse coding via LISTA, suppressing redundant higher-level units, and thus allowing factorial posteriors even where more than one unit coding for a given feature may be active. In the prior pathway, there is no input to govern the disambiguation between redundant features, so the winner-take-all selection must be achieved via other means, and a more conventional deep network may be sufficient.
[0195] Combination With Discrete Variational Auto-Encoder
[0196] The discrete variational auto-encoder can also be
incorporated into a convolutional auto-encoder. It is possible to
put a discrete VAE on the very top of the prior, where it can
generate multi-modal distributions that then propagate down the
deconvolutional decoder, readily allowing the production of more
sophisticated multi-modal distributions. If using ancillary random
variables, it would also be straightforward to include discrete
random variables at every layer.
[0197] Hierarchical Approximating Posteriors
[0198] True posteriors can be multi-modal. Multiple plausible
explanations for an observation can lead to a multi-modal
posterior. In one implementation, a quantum processor can employ a
Chimera topology. A Chimera topology can be defined as a tiled
topology with intra-cell couplings at crossings between qubits
within the cell and inter-cell couplings between respective qubits
in adjacent cells. Traditional VAEs typically use a factorial
approximating posterior. As a result, traditional VAEs have
difficulty capturing correlations between latent variables.
[0199] One approach is to refine the approximating posterior
automatically. This approach can be complex. Another, generally
simpler, approach is to make the approximating posterior
hierarchical. A benefit of this approach is that it can capture any
distribution, or at least a wider range of distributions.
[0200] FIG. 7 shows a method 700 for unsupervised learning via a
hierarchical variational auto-encoder (VAE), in accordance with the
present systems, devices, articles and methods. Method 700 may be
implemented as an extension of method 400 employing a hierarchy of
random variables.
[0201] Method 700 starts at 705, for example in response to a call
from another routine or other invocation.
[0202] At 710, the system initializes the model parameters with
random values, as described above with reference to 410 of method
400.
[0203] At 715, the system tests to determine if a stopping
criterion has been reached, as described above with reference to
415 of method 400.
[0204] In response to determining the stopping criterion has been reached, the system ends method 700 at 775, until invoked again, for example by a request to repeat the learning.
[0205] In response to determining the stopping criterion has not
been reached, the system, at 720, fetches a mini-batch of the
training data set.
[0206] At 722, the system divides the latent variables z into disjoint groups z.sub.1, . . . , z.sub.k and the corresponding continuous latent variables into disjoint groups .zeta..sub.1, . . . , .zeta..sub.k.
[0207] At 725, the system propagates the training data set through the encoder to compute the full approximating posterior over the discrete z.sub.j. As mentioned before, this hierarchical approximation does not alter the form of the gradient of the auto-encoding term.
[0208] At 730, the system generates or causes generation of samples
from the approximating posterior over n layers of continuous
variables given the full distribution over z. The number of layers
n may be 1 or more.
[0209] At 735, the system propagates the samples through the
decoder to compute the distribution over the input, as described
above with reference to 435 of method 400.
[0210] At 740, the system performs backpropagation through the
decoder, as described above with reference to 440 of method 400.
[0211] At 745, the system performs backpropagation through the sampler over the approximating posterior, as described above with reference to 445 of method 400.
[0212] At 750, the system computes the gradient of the
KL-divergence between the approximating posterior and the true
prior over z, as described above with reference to 450 of method
400.
[0213] At 755, the system performs backpropagation through the
encoder, as described above with reference to 455 of method 400.
[0214] At 757, the system determines a gradient of a KL-divergence,
with respect to parameters of the true prior distribution, between
the approximating posterior and the true prior distribution over
the discrete space.
[0215] At 760, the system determines at least one of a gradient or
at least a stochastic approximation of a gradient, of a bound on
the log-likelihood of the input data.
[0216] In some embodiments, the system generates samples or causes
samples to be generated by a quantum processor, as described above
with reference to 460 of method 400.
[0217] At 765, the system updates the model parameters based at
least in part on the gradient, as described above with reference to
465 of method 400.
[0218] At 770, the system tests to determine if the current
mini-batch is the last mini-batch to be processed, as described
above with reference to 470 of method 400. In some implementations,
act 770 is omitted, and control passes directly to 715 from 765.
The decision whether to fetch another mini-batch can be
incorporated in 715.
[0219] In response to determining that the current mini-batch is
the last mini-batch to be processed, the system returns control to
715. In response to determining that the current mini-batch is not
the last mini-batch to be processed, the system returns control to
720.
[0220] In summary and as described in more detail above, method
700 renders the approximating posterior hierarchical over the
discrete latent variables. In addition, method 700 also adds a
hierarchy of continuous latent variables below them.
[0221] Computing the Gradients of the KL Divergence
[0222] The remaining component of the loss function can be
expressed as follows:
-\mathrm{KL}\!\left[q(z \mid x, \phi) \,\|\, p(z \mid \theta)\right] = \sum_{z} q(z \mid x, \phi)\left[\log p(z \mid \theta) - \log q(z \mid x, \phi)\right]
[0223] In some implementations, such as when the samples are
generated using an example embodiment of a quantum processor, the
prior distribution is a Restricted Boltzmann Machine (RBM), as
follows:
p(z \mid \theta) = \frac{e^{-E_p(z, \theta)}}{\mathcal{Z}_p}, \qquad \text{where} \qquad E_p(z) = -z^{T} J z - h^{T} z \qquad \text{and} \qquad \mathcal{Z}_p = \sum_{z} e^{-E_p(z, \theta)}

where z \in \{0,1\}^{n}, \mathcal{Z}_p is the partition function, and the lateral connection matrix J is bipartite and very sparse.
The prior distribution described by the above equation contains
strong correlations, and the present computational system can use a
hierarchical approximating posterior.
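For illustration, the energy and unnormalized log-probability of such an RBM prior may be sketched as follows, assuming a binary vector z and coupling matrix J; the partition function is deliberately not computed, since it is generally intractable, and all names are illustrative:

import numpy as np

def rbm_energy(z, J, h):
    """E_p(z) = -z^T J z - h^T z for z in {0,1}^n."""
    return -(z @ J @ z) - (h @ z)

def unnormalized_log_prob(z, J, h):
    # log p(z | theta) up to the (intractable) log partition function log Z_p
    return -rbm_energy(z, J, h)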
[0224] The present method divides the latent variables into two
groups and defines the approximating posterior via a directed
acyclic graphical model over the two groups z.sub.a and z.sub.b, as
follows:
q(z \mid x, \phi) = \frac{e^{-E_a(z_a \mid x, \phi)}}{\mathcal{Z}_a(x)} \cdot \frac{e^{-E_{b|a}(z_b \mid z_a, x, \phi)}}{\mathcal{Z}_{b|a}(z_a, x)}

where

E_a(z_a \mid x) = -g_a(x)^{T} z_a

E_{b|a}(z_b \mid z_a, x) = -g_{b|a}(x, z_a)^{T} z_b

\mathcal{Z}_a(x) = \sum_{z_a} e^{-E_a(z_a \mid x, \phi)} = \prod_{a_i \in a}\left(1 + e^{g_{a_i}(x)}\right)

\mathcal{Z}_{b|a}(x, z_a) = \sum_{z_b} e^{-E_{b|a}(z_b \mid z_a, x, \phi)} = \prod_{b_i \in b}\left(1 + e^{g_{b_i|a}(x, z_a)}\right)
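A sketch of drawing a sample from this two-group hierarchical posterior is shown below; g_a and g_b_given_a stand for arbitrary feedforward networks returning logits, and all names are illustrative assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_hierarchical_posterior(x, g_a, g_b_given_a, rng):
    """q(z|x,phi) = q_a(z_a|x) * q_{b|a}(z_b|z_a,x), each factor logistic."""
    q_a = sigmoid(g_a(x))                        # q_a(z_a = 1 | x)
    z_a = (rng.random(q_a.shape) < q_a) * 1.0    # sample the first group
    q_b = sigmoid(g_b_given_a(x, z_a))           # q_{b|a}(z_b = 1 | z_a, x)
    z_b = (rng.random(q_b.shape) < q_b) * 1.0    # sample the second group
    return z_a, z_b, q_a, q_b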
[0225] The gradient of -\mathrm{KL}[q(z \mid x, \phi) \,\|\, p(z \mid \theta)] with respect to the parameters \theta of the prior can be estimated stochastically using samples from the approximating posterior q(z \mid x) = q_a(z_a \mid x)\, q_{b|a}(z_b \mid z_a, x) and the true prior, as follows:
-\frac{\partial}{\partial \theta} \mathrm{KL}\!\left[q(z \mid x, \phi) \,\|\, p(z \mid \theta)\right] = -\sum_{z} q(z \mid x, \phi)\, \frac{\partial E_p(z, \theta)}{\partial \theta} + \sum_{z'} p(z' \mid \theta)\, \frac{\partial E_p(z', \theta)}{\partial \theta} = -\mathbb{E}_{q_a(z_a \mid x, \phi)}\!\left[\mathbb{E}_{q_{b|a}(z_b \mid z_a, x, \phi)}\!\left[\frac{\partial E_p(z, \theta)}{\partial \theta}\right]\right] + \mathbb{E}_{p(z \mid \theta)}\!\left[\frac{\partial E_p(z, \theta)}{\partial \theta}\right]
[0226] The expectation with respect to
q.sub.b|a(z.sub.b|z.sub.a,x,.PHI.) can be performed analytically;
the expectation with respect to q.sub.a(z.sub.a|x,.PHI.) requires
samples from the approximating posterior. Similarly, for the prior,
sampling is from the native distribution of the quantum processor.
Rao-Blackwellization can be used to marginalize half of the units.
Samples from the same prior distribution are used for a mini-batch,
independent of the samples chosen from the training dataset.
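For the bipartite RBM prior above, this gradient estimate reduces to a difference of sufficient statistics between posterior ("positive") and prior ("negative") samples, for example as sketched below; the arrangement of samples as columns and the function name are assumptions made for the example:

import numpy as np

def kl_theta_gradients(Z_pos, Z_neg):
    """Stochastic estimate of -dKL/dtheta for an RBM prior:
    positive phase from approximating-posterior samples Z_pos,
    negative phase from prior samples Z_neg (each column one sample)."""
    n_pos = Z_pos.shape[1]
    n_neg = Z_neg.shape[1]
    grad_J = (Z_pos @ Z_pos.T) / n_pos - (Z_neg @ Z_neg.T) / n_neg
    grad_h = Z_pos.mean(axis=1) - Z_neg.mean(axis=1)
    return grad_J, grad_h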
[0227] The gradient of -\mathrm{KL}[q(z \mid x, \phi) \,\|\, p(z \mid \theta)] with respect to the parameters \phi of the approximating posterior does not depend on the partition function of the prior \mathcal{Z}_p, since:

\mathrm{KL}\!\left(q \,\|\, p\right) = \sum_{z}\left(q \log q - q \log p\right) = \sum_{z}\left(q \log q + q\, E_p + q \log \mathcal{Z}_p\right) = \sum_{z}\left(q \log q + q\, E_p\right) + \log \mathcal{Z}_p
[0228] Consider a case where q is hierarchical, with q = q_a\, q_{b|a} \cdots. Since the random variables are fundamentally continuous after marginalizing out the discrete random variables, the re-parameterization technique can be used to backpropagate through \prod_{j<i} q_{j|k<j}.
[0229] The entropy term of the KL divergence is then:
H(q) = \sum_{z} q \log q = \sum_{z}\left(\prod_{i} q_{i|j<i}\right)\left(\sum_{i} \log q_{i|k<i}\right) = \sum_{i} \sum_{z}\left(\prod_{j \leq i} q_{j|k<j}\right) \log q_{i|k<i} = \sum_{i} \mathbb{E}_{\prod_{j<i} q_{j|k<j}}\!\left[\sum_{z_i} q_{i|k<i} \log q_{i|k<i}\right] = \sum_{i} \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} q_{i|\rho_{k<i}} \log q_{i|\rho_{k<i}}\right]
where indices i, j, and k denote hierarchical groups of variables. The probability q_{i|\rho_{k<i}}(z_i) is evaluated analytically, whereas all variables k<i are sampled stochastically via \rho_{k<i}. Taking the gradient of H(q) in the above equation and using the identity:
\mathbb{E}_{q}\!\left[c\, \frac{\partial}{\partial \phi} \log q\right] = c \sum_{z} q \left(\frac{\partial q}{\partial \phi} \Big/ q\right) = c\, \frac{\partial}{\partial \phi}\left(\sum_{z} q\right) = 0
for a constant c, allows elimination of the gradient of \log q_{i|\rho_{k<i}} in the earlier equation, to obtain:
\frac{\partial}{\partial \phi} H(q) = \sum_{i} \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i}\left(\frac{\partial}{\partial \phi} q_{i|\rho_{k<i}}\right) \log q_{i|\rho_{k<i}}\right]
[0230] Moreover, elimination of a log-partition function in \log q_{i|\rho_{k<i}} is achieved by an analogous argument. By repeating this argument one more time, \partial\!\left(q_{i|\rho_{k<i}}\right)\!/\partial \phi can be broken into its factorial component. If q_{i|\rho_{k<i}} is a logistic function of the input and z_i \in \{0,1\}, the gradient of the entropy reduces to:
\frac{\partial}{\partial \phi} H(q) = \sum_{i} \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{l \in i} \sum_{z_l} q_l(z_l)\left(z_l\, \frac{\partial g_l}{\partial \phi} - \sum_{z_l'} q_l(z_l')\, z_l'\, \frac{\partial g_l}{\partial \phi}\right)\left(g_l\, z_l\right)\right] = \sum_{i} \mathbb{E}_{\rho_{k<i}}\!\left[\frac{\partial g_i^{T}}{\partial \phi}\left(g_i \odot \left[q_i(z_i = 1) - q_i^{2}(z_i = 1)\right]\right)\right]
where l and z.sub.l correspond to single variables within the
hierarchical groups denoted by i. In TensorFlow, it might be
simpler to write:
\frac{\partial}{\partial \phi} H(q) = \sum_{i} \mathbb{E}_{\rho_{k<i}}\!\left[\frac{\partial q_i^{T}(z_i = 1)}{\partial \phi}\, g_i\right]
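In an automatic-differentiation framework, one way to realize this is to differentiate a surrogate in which the logits are treated as constants; the following sketch assumes TensorFlow 2.x, a single sample \rho, and q = \sigma(g), with all names illustrative:

import tensorflow as tf

def entropy_gradient_surrogate(g):
    """Surrogate whose gradient with respect to the encoder parameters matches
    E_rho[(dq(z=1)/dphi)^T g]: differentiate q, but treat the logits g as constant."""
    q = tf.sigmoid(g)  # q(z = 1), a function of phi through g
    return tf.reduce_sum(q * tf.stop_gradient(g))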
[0231] The remaining cross-entropy term is:
\sum_{z} q\, E_p = -\mathbb{E}_{\rho}\!\left[z^{T} J z + h^{T} z\right]
[0232] The term h^{T} z can be handled analytically, since z_i \in \{0,1\}, and

\mathbb{E}_{\rho}\!\left[h^{T} z\right] = h^{T}\, \mathbb{E}_{\rho}\!\left[q(z = 1)\right]
[0233] The approximating posterior q is continuous in this case,
with non-zero derivative, so the re-parameterization technique can
be applied to backpropagate gradients:
\frac{\partial}{\partial \phi}\, \mathbb{E}_{\rho}\!\left[h^{T} z\right] = h^{T}\, \mathbb{E}_{\rho}\!\left[\frac{\partial}{\partial \phi}\, q(z = 1)\right]
[0234] In contrast, each element of the sum:
z^{T} J z = \sum_{i,j} J_{ij}\, z_i z_j
[0235] depends upon variables which are not usually in the same
hierarchical level, so, in general:
\mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] \neq J_{ij}\, \mathbb{E}_{\rho}\!\left[z_i\right] \mathbb{E}_{\rho}\!\left[z_j\right]
[0236] This term can be decomposed into:

\mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[z_i\, \mathbb{E}_{\rho_{k \geq i}}\!\left[z_j\right]\right]

where, without loss of generality, z_i is in a higher hierarchical layer than z_j. It can be challenging to take the derivative of z_j because it is a discontinuous function of \rho_{k<i}.
[0237] Direct Decomposition of \partial\!\left(J_{ij}\, z_i z_j\right)/\partial \phi
[0238] The re-parameterization technique initially makes z.sub.i a
function of .rho. and .PHI.. However, it is possible to marginalize
over values of the re-parameterization variables .rho. for which z
is consistent, thereby rendering z.sub.i a constant. Assuming,
without loss of generality, that i<j, \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] can be expressed as follows:
\mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\mathbb{E}_{z_i \sim q_{i|\rho_{k<i}, \phi}}\!\left[z_i(\rho, \phi)\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[z_j(\rho_{z_i}, \phi)\right]\right]\right]\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\, q_i(z_i \mid \rho_{k<i}, \phi)\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j} z_j\, q_j(z_j \mid \rho_{z_i, k<j}, \phi)\right]\right]\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\, q_i(z_i \mid \rho_{k<i}, \phi)\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi)\right]\right]\right]
[0239] The quantity q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi) is not directly a function of the original \rho, since \rho_i is sampled from the distribution conditioned on the value of z_i. It is this conditioning that coalesces q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi), which should be differentiated.
[0240] With z_i fixed, sampling from \rho_i is equivalent to sampling from \zeta_i \mid z_i. In particular, \rho_i is not a function of q_{k<i} or of parameters from previous layers. Combining this with the chain rule, \zeta_i can be held fixed when differentiating q_j, with gradients not backpropagating from q_j through \zeta_i.
[0241] Using the chain rule, the term due to the gradient of q_i(z_i \mid \rho_{k<i}, \phi) is:

\frac{\partial}{\partial \phi}\, \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j} z_j\, q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi)\right]\right]\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\mathbb{E}_{z_i}\!\left[\frac{\partial q_i(z_i = 1)/\partial \phi}{q_i(z_i = 1 \mid \rho_{k<i}, \phi)}\, z_i\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[z_j(\rho, \phi)\right]\right]\right]\right] = \mathbb{E}_{\rho}\!\left[J_{ij}\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\, \frac{z_i(\rho, \phi)}{q_i(z_i = 1 \mid \rho_{k<i}, \phi)}\, z_j(\rho, \phi)\right] = \mathbb{E}_{\rho}\!\left[J_{ij}\, \frac{z_i(\rho, \phi)}{q_i(z_i = 1)}\, q_j(z_j = 1)\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\right]
where, in the second line, we reintroduce sampling over z.sub.i,
but reweight the samples so the expectation is unchanged.
[0242] The term due to the gradient of q_j(z_j \mid \rho, \phi) is:

\frac{\partial}{\partial \phi}\, \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = J_{ij}\, \mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\, q_i(z_i \mid \rho_{k<i}, \phi)\, \mathbb{E}_{\rho_i | z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j} z_j\, \frac{\partial q_j}{\partial \phi}\right]\right]\right] = J_{ij}\, \mathbb{E}_{\rho_{k<j}}\!\left[\mathbb{E}_{q_j(\cdot \mid \rho_{k<j}, \phi)}\!\left[\frac{z_i(\rho, \phi)\, z_j(\rho, \phi)}{q_j(z_j \mid \rho_{k<j}, \phi)}\, \frac{\partial q_j}{\partial \phi}\right]\right] = \mathbb{E}_{\rho}\!\left[J_{ij}\, \frac{z_i(\rho, \phi)\, z_j(\rho, \phi)}{q_j(z_j = 1)}\, \frac{\partial q_j}{\partial \phi}\right]
[0243] For both z_i and z_j, the derivative with respect to q(z = 0) can be ignored, since it is scaled by z = 0. Once again, gradients can be prevented from backpropagating through \zeta_i. The sum is taken over z_i, and then the expectation of \rho_i is taken conditioned on the chosen value of z_i. As a result, q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi) depends upon z_i being fixed, independent of the preceding \rho and \zeta in the hierarchy.
[0244] Further marginalize over z.sub.j to obtain:
\frac{\partial}{\partial \phi}\, \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i\, \frac{\partial q_j(z_j = 1)}{\partial \phi}\right]
[0245] Decomposition of \partial\!\left(J_{ij}\, z_i z_j\right)/\partial \phi Via the Chain Rule
[0246] In another approach, the gradient of \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] can be decomposed using the chain rule. Previously, z has been considered to be a function of \rho and \phi. Instead, z can be formulated as a function of q(z = 1) and \rho, where q(z = 1) is itself a function of \rho and \phi. Specifically,

z_i\!\left(q_i(z_i = 1), \rho_i\right) = \begin{cases} 0 & \text{if } \rho_i < 1 - q_i(z_i = 1) = q_i(z_i = 0) \\ 1 & \text{otherwise} \end{cases}
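A sketch of this reformulated sampling step follows, assuming q holds q_i(z_i = 1) and rho is uniform noise of the same shape; all names are illustrative:

import numpy as np

def sample_z(q, rho):
    """z_i = 0 if rho_i < 1 - q_i(z_i = 1), else 1; a step function of q and rho."""
    return (rho >= 1.0 - q).astype(float)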
[0247] The chain rule can be used to differentiate with respect to
q(z=1) since it allows pulling part of the integral over .rho.
inside the derivative with respect to .PHI..
[0248] Expanding the desired gradient using the re-parameterization technique and the chain rule, we find:

\frac{\partial}{\partial \phi}\, \mathbb{E}_{q}\!\left[J_{ij}\, z_i z_j\right] = \frac{\partial}{\partial \phi}\, \mathbb{E}_{\rho}\!\left[J_{ij}\, z_i z_j\right] = \mathbb{E}_{\rho}\!\left[\sum_{k} \frac{\partial\!\left(J_{ij}\, z_i z_j\right)}{\partial q_k(z_k = 1)}\, \frac{\partial q_k(z_k = 1)}{\partial \phi}\right]
[0249] The order of integration (via the expectation) and differentiation can be changed. Although z(q, \rho) is a step function, and its derivative is a delta function, the integral of its derivative is finite. Rather than dealing with generalized functions directly, the definition of the derivative can be applied, and the matching integral pushed through, to recover a finite quantity. For simplicity, the sum over k can be pulled out of the expectation in the above equation, and each summand considered independently.
[0250] Since z_i is only a function of q_i, the terms in the sum over k in the above equation vanish except for k = i and k = j. Without loss of generality, consider the term k = i; the term k = j is symmetric. Applying the definition of the gradient to one of the summands, and then analytically taking the expectation with respect to \rho_i, obtains:
\mathbb{E}_{\rho}\!\left[\frac{\partial\!\left(J_{ij}\, z_i(q, \rho)\, z_j(q, \rho)\right)}{\partial q_i(z_i = 1)}\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\right] = \mathbb{E}_{\rho}\!\left[\lim_{\delta q_i(z_i = 1) \to 0} \frac{J_{ij}\, z_i(q + \delta q_i, \rho)\, z_j(q + \delta q_i, \rho) - J_{ij}\, z_i(q, \rho)\, z_j(q, \rho)}{\delta q_i(z_i = 1)}\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\right] = \mathbb{E}_{\rho_{k \neq i}}\!\left[\lim_{\delta q_i(z_i = 1) \to 0} \delta q_i\, \frac{J_{ij} \cdot 1 \cdot z_j(q, \rho) - J_{ij} \cdot 0 \cdot z_j(q, \rho)}{\delta q_i(z_i = 1)}\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\, \Bigg|_{\rho_i = q_i(z_i = 0)}\right] = \mathbb{E}_{\rho_{k \neq i}}\!\left[J_{ij}\, z_j(q, \rho)\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\, \Bigg|_{\rho_i = q_i(z_i = 0)}\right]
[0251] Since .rho..sub.i is fixed such that .zeta..sub.i=0, units
further down the hierarchy can be sampled in a manner consistent
with this restriction. The gradient is computed with a stochastic
approximation by multiplying each sample by 1-z.sub.i, so that
terms with .zeta..sub.i.noteq.0 can be ignored, and scaling up the
gradient when z.sub.i=0 by 1/q.sub.i(z.sub.i=0), as follows:
\frac{\partial}{\partial \phi}\, \mathbb{E}\!\left[J_{ij}\, z_i z_j\right] = \mathbb{E}_{\rho}\!\left[J_{ij}\, \frac{1 - z_i}{1 - q_i(z_i = 1)}\, z_j\, \frac{\partial q_i(z_i = 1)}{\partial \phi}\right]
[0252] While this corresponds to taking the expectation of the
gradient of the log-probability, it is done for each unit
independently, so the total increase in variance can be modest.
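One possible automatic-differentiation surrogate for this estimator is sketched below (TensorFlow 2.x assumed); it covers only the k = i term for a single sample, the symmetric k = j term being analogous, and it omits the requirement, discussed above, that units below z_i be sampled consistently with \zeta_i = 0. All names are illustrative:

import tensorflow as tf

def coupling_gradient_surrogate(q, z, J, eps=1e-6):
    """Surrogate whose phi-gradient matches
    E_rho[ J_ij * (1 - z_i)/(1 - q_i(z_i=1)) * z_j * dq_i(z_i=1)/dphi ], summed over i, j."""
    weight = tf.stop_gradient((1.0 - z) / (1.0 - q + eps))  # (1 - z_i) / q_i(z_i = 0)
    coeff = tf.stop_gradient(tf.linalg.matvec(J, z))        # sum_j J_ij z_j
    return tf.reduce_sum(weight * coeff * q)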
[0253] Alternative Approach
[0254] An alternative approach is to take the gradient of the
expectation using the gradient of log-probabilities over all
variables:
\frac{\partial}{\partial \phi}\, \mathbb{E}\!\left[J_{ij}\, z_i z_j\right] = \mathbb{E}_{q_1, q_{2|1}, \ldots}\!\left[J_{ij}\, z_i z_j \sum_{k} \frac{\partial}{\partial \phi} \log q_{k|\kappa<k}\right] = \mathbb{E}_{q_1, q_{2|1}, \ldots}\!\left[J_{ij}\, z_i z_j \sum_{k} \frac{1}{q_{k|\kappa<k}}\, \frac{\partial q_{k|\kappa<k}}{\partial \phi}\right]
For the gradient term on the right-hand side, terms involving only z_{\kappa<k}, which occur hierarchically before k, can be dropped, since those terms can be pulled out of the expectation over q_k. However, for terms involving z_{\kappa>k}, which occur hierarchically after k, the expected value of z_{\kappa} depends upon the chosen value of z_k.
[0255] Generally, no single term in the sum is expected to have a particularly high variance. However, the variance of the estimate is proportional to the number of terms, and the number of terms contributing to each gradient can grow quadratically with the number of units in a bipartite model, and linearly in a Chimera-structured model. In contrast, in the previously described approach, the number of terms contributing to each gradient can grow linearly with the number of units in a bipartite model, and be constant in a Chimera-structured model.
[0256] Introducing a baseline:

\mathbb{E}_{q}\!\left[\left(J_{ij}\, z_i z_j - c(x)\right) \frac{\partial}{\partial \phi} \log q\right]
[0257] Non-Factorial Approximating Posteriors Via Ancillary
Variables
[0258] Alternatively, or in addition, a factorial distribution over
discrete random variables can be retained, and made conditional on
a separate set of ancillary random variables.
\frac{\partial}{\partial \phi}\left(\sum_{z} q(z \mid \alpha)\left(z^{T} J z\right)\right) = \frac{\partial}{\partial \phi}\left(q^{T}(z = 1 \mid \alpha)\, J\, q(z = 1 \mid \alpha)\right)
so long as J is bipartite. The full gradient of the KL-divergence
with respect to the parameters of the approximating posterior is
then as follows:
\frac{\partial}{\partial \phi}\, \mathrm{KL}\!\left(q \,\|\, p\right) = \mathbb{E}_{\rho}\!\left[\left(g - h - \left(J^{T} + J\right) q(z = 1)\right) \frac{\partial}{\partial \phi}\, q(z = 1)\right]
[0259] Apart from conditioning the distributions of the approximating posterior on the ancillary random variables .alpha., the KL-divergence between the approximating posterior and the true prior over the ancillary variables must also be subtracted. The rest of the prior is unaltered, since the ancillary random variables .alpha. govern the approximating posterior, rather than the generative model.
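By way of example, the gradient above can be produced in an automatic-differentiation framework by differentiating a surrogate in which the bracketed coefficient is held constant; the sketch below assumes TensorFlow 2.x, with g denoting the encoder logits and q = \sigma(g), and all names illustrative:

import tensorflow as tf

def kl_phi_gradient_surrogate(g, J, h):
    """Surrogate whose phi-gradient matches
    E_rho[(g - h - (J^T + J) q(z=1)) * dq(z=1)/dphi]."""
    q = tf.sigmoid(g)
    coeff = tf.stop_gradient(g - h - tf.linalg.matvec(J + tf.transpose(J), q))
    return tf.reduce_sum(coeff * q)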
[0260] Implementation
[0261] The following can be parameterized:
q(z \mid x, \phi) = \prod_{i} q_i(z_i \mid x, \phi)
using a feedforward neural network g(x). Each layer i of the neural
network g(x) consists of a linear transformation, parameterized by
weight matrix W.sub.i and bias vector b.sub.i, followed by a
pointwise nonlinearity. While intermediate layers can consist of
ReLU or soft-plus units, with nonlinearity denoted by .tau., the
logistic function .sigma. can be used as the nonlinearity in the
top layer of the encoder to ensure the requisite range [0,1].
Parameters for each q.sub.i(z.sub.i|x,.PHI.) are shared across
inputs x, and 0.ltoreq.g.sub.i(x).ltoreq.1.
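A sketch of such an encoder follows, assuming softplus intermediate nonlinearities \tau and a logistic top layer; the layer count and all names are illustrative assumptions:

import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encoder_g(x, weights, biases):
    """g(x): linear transformations W_i, b_i with pointwise nonlinearities;
    the top layer uses the logistic function so that 0 <= g_i(x) <= 1."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = softplus(h @ W + b)                      # intermediate layers: tau
    return sigmoid(h @ weights[-1] + biases[-1])     # q_i(z_i = 1 | x, phi)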
[0262] Similarly, p(x \mid \zeta, \theta) can be parameterized using another feedforward neural network f(\zeta), with complementary parameterization. If x is binary, p_i(x_i = 1 \mid \zeta, \theta) = \sigma(f_i(\zeta)) can again be used. If x is real, an additional neural network f'(\zeta) can be introduced to calculate the variance of each variable, taking an approach analogous to traditional variational auto-encoders by using p_i(x_i \mid \zeta, \theta) = \mathcal{N}\!\left(f_i(\zeta), f'_i(\zeta)\right). The final nonlinearity of the network f(\zeta) should be linear, and the final nonlinearity of f'(\zeta) should be non-negative.
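A corresponding sketch of the decoder for real-valued x follows, assuming f and f' share hidden layers, with a linear final layer for the mean and a softplus final layer for the variance; all names are illustrative:

import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def decoder_p(zeta, weights, biases, W_mu, b_mu, W_var, b_var):
    """p(x | zeta, theta) = N(f(zeta), f'(zeta)): linear final layer for the
    mean f, non-negative (softplus) final layer for the variance f'."""
    h = zeta
    for W, b in zip(weights, biases):
        h = softplus(h @ W + b)           # shared hidden layers
    mean = h @ W_mu + b_mu                # f(zeta): final nonlinearity is linear
    var = softplus(h @ W_var + b_var)     # f'(zeta): non-negative
    return mean, var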
[0263] Algorithm 1 (shown below) illustrates an example
implementation of training a network expressed as pseudocode.
Algorithm 1 describes training a generic network with gradient
descent. In other implementations, other methods could be used to
train the network without loss of generality with respect to the
approach.
[0264] Algorithm 1 establishes the input and output, and initializes the model parameters; it then determines whether a stopping criterion has been met. In addition, Algorithm 1 defines the processing of each mini-batch or subset.
[0265] Algorithms 1 and 2 (shown below) comprise pseudocode for
binary visible units. Since J is bipartite, J.sub.q can be used to
denote the upper-right quadrant of J, where the non-zero values
reside. Gradient descent is one approach that can be used. In other
implementations, gradient descent can be replaced by another
technique, such as RMSprop, adagrad, or ADAM.
TABLE-US-00001
Algorithm 1: Train generic network with simple gradient descent
def train ( )
    Input: A data set X, where X[:, i] is the ith element, and a learning rate parameter ε
    Output: Model parameters: {W, b, J_q, h}
    Initialize model parameters with random values
    while stopping criterion is not met do
        foreach minibatch X_pos = getMinibatch(X, m) of the training dataset do
            Draw a sample from the approx posterior: ζ, Z_pos, X_out ← posSamples(X_pos)
            Draw a sample from the prior: Z_neg ← negSamples(Z_pos)
            Estimate ∂L/∂θ using calcGradients(X_pos, ζ, Z_pos, Z_neg, X_out)
            Update parameters according to θ_{t+1} ← θ_t + ε ∂L/∂θ
        end
    end
[0266] At first, this approach appears to be caught between two
conflicting constraints when trying to apply the variational
auto-encoder technique to discrete latent representations. On the
one hand, a discrete latent representation does not allow use of
the gradient of the decoder, since the reparametrized latent
representation jumps discontinuously or remains constant as the
parameters of the approximating posterior are changed. On the other
hand, KL[q(z \mid x, \phi) \,\|\, p(z \mid \theta)] is only easy to evaluate by remaining in the original discrete space.
[0267] The presently disclosed systems and methods avoid these
problems by symmetrically projecting the approximating posterior
and the prior into a continuous space. The computational system
evaluates the auto-encoder portion of the loss function in the
continuous space, marginalizing out the original discrete latent
representation. At the same time, the computational system evaluates the KL-divergence between the approximating posterior and the true prior in the original discrete space, and, owing to the symmetry of the projection into the continuous space, the projection does not contribute to this term.
TABLE-US-00002
Algorithm 2: Helper functions for discrete VAE
L ← L_up + L_down
def getMinibatch (X, m)
    k ← k + 1
    X_pos ← X[:, k m : (k + 1) m]
def posSamples (X_pos)
    Z_0 ← X_pos
    for i ← 1 to L_up − 1 do
        Z_i ← τ(W_{i−1} Z_{i−1} + b_{i−1})
    end
    Z ← W_{L_up−1} Z_{L_up−1} + b_{L_up−1}
    Z_pos ← σ(Z)
    Z_{L_up} ← G^{−1}(ρ), where q'(ζ = 1 | x, φ) = Z_pos and ρ ~ U(0, 1)^{n×m}
    for i ← L_up + 1 to L_last − 1 do
        Z_i ← τ(W_{i−1} Z_{i−1} + b_{i−1})
    end
    X_out ← σ(W_{L_last−1} Z_{L_last−1} + b_{L_last−1})
def negSamples (Z_pos)
    if using D-Wave then
        sample Z_neg from D-Wave using h and J_q
        post-process samples
    else
        if using CD then
            Z_neg ← sample(Z_pos)
        else if using PCD then
            Z_neg initialized to result of last call to negSamples ( )
        end
        for i ← 1 to n do
            sample "left" half from p(Z_neg[: d/2, :] = 1) = σ(J_q Z_neg[d/2 :, :] + h[: d/2])
            sample "right" half from p(Z_neg[d/2 :, :] = 1) = σ(J_q^T Z_neg[: d/2, :] + h[d/2 :])
        end
    end
def calcGradients (X_pos, ζ, Z_pos, Z_neg, X_out)
    B_{L_last} ← σ'(W_{L_last−1} Z_{L_last−1} + b_{L_last−1}) ⊙ (X_pos / X_out − (1 − X_pos)/(1 − X_out))
    for i ← L_last − 1 to L_up do
        ∂L/∂W_i ← B_{i+1} Z_i^T
        ∂L/∂b_i ← B_{i+1} 1
        B_i ← τ'(W_{i−1} Z_{i−1} + b_{i−1}) ⊙ W_i^T B_{i+1}
    end
    B_pos ← (∂ζ/∂q) ⊙ W_{L_up}^T B_{L_up+1}
    B_KL ← (Z − h − vstack(J_q Z_pos[d/2 :, :], J_q^T Z_pos[: d/2, :])) ⊙ (Z_pos − Z_pos^2)
    B_{L_up} ← σ'(W_{L_up−1} Z_{L_up−1} + b_{L_up−1}) ⊙ B_pos − B_KL
    for i ← L_up − 1 to 0 do
        ∂L/∂W_i ← B_{i+1} Z_i^T
        ∂L/∂b_i ← B_{i+1} 1
        B_i ← τ'(W_{i−1} Z_{i−1} + b_{i−1}) ⊙ W_i^T B_{i+1}
    end
    ∂L/∂J_q ← Z_pos[: d/2, :] Z_pos[d/2 :, :]^T − Z_neg[: d/2, :] Z_neg[d/2 :, :]^T
    ∂L/∂h ← Z_pos 1 − Z_neg 1
[0268] The above description of illustrated embodiments, including
what is described in the Abstract, is not intended to be exhaustive
or to limit the embodiments to the precise forms disclosed.
Although specific embodiments of and examples are described herein
for illustrative purposes, various equivalent modifications can be
made without departing from the spirit and scope of the disclosure,
as will be recognized by those skilled in the relevant art. The
teachings provided herein of the various embodiments can be applied
to other methods of quantum computation, not necessarily the
exemplary methods for quantum computation generally described
above.
[0269] The various embodiments described above can be combined to
provide further embodiments. All of the U.S. patents, U.S. patent
application publications, U.S. patent applications, foreign
patents, foreign patent applications and non-patent publications
referred to in this specification and/or listed in the Application
Data Sheet including: U.S. patent application publication
2015/0006443 published Jan. 1, 2015; U.S. patent application
publication 2015/0161524 published Jun. 11, 2015; U.S. provisional
patent application Ser. No. 62/207,057, filed Aug. 19, 2015,
entitled "SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC
QUANTUM COMPUTERS"; U.S. provisional patent application Ser. No.
62/206,974, filed Aug. 19, 2015, entitled "DISCRETE VARIATIONAL
AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING
ADIABATIC QUANTUM COMPUTERS"; U.S. provisional patent application
Ser. No. 62/268,321, filed Dec. 16, 2015, entitled "DISCRETE
VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING
USING ADIABATIC QUANTUM COMPUTERS"; and U.S. provisional patent
application Ser. No. 62/307,929, filed Mar. 14, 2016, entitled
"DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE
LEARNING USING ADIABATIC QUANTUM COMPUTERS", each of which is
incorporated herein by reference in its entirety. Aspects of the
embodiments can be modified, if necessary, to employ systems,
circuits, and concepts of the various patents, applications, and
publications to provide yet further embodiments.
* * * * *