U.S. patent application number 17/323143 was filed with the patent office on 2021-05-18 and published on 2021-11-25 for discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers.
The applicant listed for this patent is D-WAVE SYSTEMS INC. The invention is credited to Zhengbing Bian, Fabian A. Chudak, William G. Macready, and Jason Rolfe.
Publication Number | 20210365826
Application Number | 17/323143
Family ID | 1000005692531
Filed Date | 2021-05-18
Publication Date | 2021-11-25
United States Patent Application | 20210365826
Kind Code | A1
Rolfe; Jason; et al. | November 25, 2021
DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS
Abstract
A computational system can include digital circuitry and analog
circuitry, for instance a digital processor and a quantum
processor. The quantum processor can operate as a sample generator
providing samples. Samples can be employed by the digital processor in implementing various machine learning techniques. For example, the computational system can perform unsupervised learning over an input space, for example via a discrete variational auto-encoder, and attempt to maximize the log-likelihood of an
observed dataset. Maximizing the log-likelihood of the observed
dataset can include generating a hierarchical approximating
posterior. Unsupervised learning can include generating samples of
a prior distribution using the quantum processor. Generating
samples using the quantum processor can include forming chains of
qubits and representing discrete variables by chains.
Inventors: | Rolfe; Jason (Vancouver, CA); Macready; William G. (West Vancouver, CA); Bian; Zhengbing (Burnaby, CA); Chudak; Fabian A. (Vancouver, CA)
Applicant: |
Name | City | State | Country | Type
D-WAVE SYSTEMS INC. | Burnaby | | CA |
Family ID: | 1000005692531
Appl. No.: | 17/323143
Filed: | May 18, 2021
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
15725600 | Oct 5, 2017 | 11042811
17323143 | |
62404591 | Oct 5, 2016 |
62462821 | Feb 23, 2017 |
Current U.S. Class: | 1/1
Current CPC Class: | G06N 3/084 20130101; G06N 10/00 20190101; G06N 20/00 20190101; G06N 3/088 20130101; G06K 9/6256 20130101; G06N 20/10 20190101; G06N 3/0472 20130101; G06N 7/005 20130101; G06N 3/0454 20130101; G06K 9/00986 20130101; G06F 15/80 20130101
International Class: | G06N 10/00 20060101 G06N010/00; G06N 3/04 20060101 G06N003/04; G06K 9/00 20060101 G06K009/00; G06N 3/08 20060101 G06N003/08; G06F 15/80 20060101 G06F015/80; G06N 20/00 20060101 G06N020/00; G06K 9/62 20060101 G06K009/62; G06N 20/10 20060101 G06N020/10
Claims
1. A method for searching an input space characterized by an
objective function and comprising discrete or continuous variables,
to attempt to identify an element of the input space based on the
objective function, the method executed by circuitry including at
least one processor, the method comprising: training a machine
learning model based on at least a subset of a training dataset of
samples of respective variables of the input space, said training
comprising: forming a latent representation of a latent space based
on the input space and comprising a plurality of random variables,
the plurality of random variables comprising one or more discrete
random variables; and forming an encoding distribution over the
latent space, conditioned on the input space; searching the latent
space based on at least a subset of the one or more discrete random
variables and the objective function, said searching comprising:
selecting an initial point in the latent space based on the latent
representation; determining one or more objective values for one or
more points in the input space based on the objective function;
determining one or more points in the latent space corresponding to
the input space based on the encoding distribution; and optimizing
the initial point in the latent space based on the latent
representation and at least a subset of the one or more discrete
random variables, said optimizing comprising associating the one or
more objective values with the one or more points in the latent
space.
2. The method of claim 1 wherein: training further comprises
forming a decoding distribution over the input space conditioned on
the latent representation; and associating the one or more
objective values with the one or more points in the latent space
comprises determining a composition of the objective function with
the decoding distribution to provide a composed objective function
with a domain in the latent space.
3. The method of claim 2 wherein training a machine learning model
comprises training a variational autoencoder, forming an encoding
distribution comprises forming an approximating posterior
distribution, and forming a decoding distribution comprises forming
a conditional distribution over the input space.
4. The method of claim 1 wherein forming a latent representation of
a latent space based on the input space and comprising a plurality
of random variables comprises forming a latent representation of a
latent space comprising one or more continuous random variables and
wherein optimizing the initial point in the latent space comprises
optimizing over at least a subset of the one or more continuous
random variables.
5. The method of claim 4 wherein optimizing comprises one of fixing
and varying at least one of the one or more discrete random
variables and varying at least one of the one or more continuous
random variables.
6. The method of claim 1 wherein searching comprises selecting a
plurality of initial points and optimizing the plurality of
initial points in the latent space based on the latent
representation and at least a subset of the one or more discrete
random variables.
7. The method of claim 1 wherein searching comprises determining a
result in the latent space and decoding the result into the input
space via a decoding distribution over the input space conditioned
on the latent representation.
8. The method of claim 1 wherein training the machine learning
model based on at least a subset of a training dataset of samples
comprises training the machine learning model based on at least a
subset of a training dataset of samples comprising supervised data
and forming the latent representation of the latent space based on
the input space comprises forming a latent representation of a
latent space based on at least a subset of the supervised data.
9. The method of claim 8 wherein searching comprises optimizing the
latent representation based on a gradient over the latent space,
the gradient based on the objective function, the objective
function comprising a loss function defined on one or more
properties represented in the supervised data.
10. The method of claim 9 wherein optimizing the latent
representation based on the gradient comprises backpropagating
values of the loss function from the input space to the latent
space via a decoding distribution over the input space conditioned
on the latent representation.
11. The method of claim 10 wherein optimizing comprises optimizing
based on both a log-probability of the properties and a log
probability of a prior distribution over the latent space.
12. A computational system, comprising: at least one processor; and
at least one nontransitory processor-readable storage medium that
stores at least one of processor-executable instructions or data
which, when executed by the at least one processor, cause the at
least one processor to: train a machine learning model, based on at
least a subset of a training dataset of samples of respective
variables of an input space, the input space characterized by an
objective function and comprising discrete or continuous variables,
to: form a latent representation of a latent space based on the
input space and comprising a plurality of random variables, the
plurality of random variables comprising one or more discrete
random variables; and form an encoding distribution over the latent
space, conditioned on the input space; search the latent space,
based on at least a subset of the one or more discrete random
variables and the objective function, to: select an initial point
in the latent space based on the latent representation; determine
one or more objective values for one or more points in the input
space based on the objective function; determine one or more points
in the latent space corresponding to the input space based on the
encoding distribution; and optimize the initial point in the latent
space, based on the latent representation and at least a subset of
the one or more discrete random variables, to associate the one or
more objective values with the one or more points in the latent
space.
13. The computational system of claim 12 wherein: the machine
learning model is trained to form a decoding distribution over the
input space conditioned on the latent representation; and the one
or more objective values are associated with the one or more points in
the latent space to determine a composition of the objective
function with the decoding distribution to provide a composed
objective function with a domain in the latent space.
14. The computational system of claim 13 wherein the machine
learning model comprises a variational autoencoder, the encoding
distribution comprises an approximating posterior distribution, and
the decoding distribution comprises a conditional distribution over
the input space.
15. The computational system of claim 12 wherein the plurality of
random variables comprises one or more continuous random variables
and the initial point in the latent space is optimized over at
least a subset of the one or more continuous random variables.
16. The computational system of claim 15 wherein the at least one
of processor-executable instructions or data cause the at least one
processor to optimize the initial point in the latent space
comprising at least one of the one or more discrete random
variables being one of fixed and varied and at least one of the one
or more continuous random variables being varied.
17. The computational system of claim 12 wherein the search of the
latent space comprises a selection of a plurality of initial points
and an optimization of the plurality of initial points in the latent
space based on the latent representation and at least a subset of
the one or more discrete random variables.
18. The computational system of claim 12 wherein the search of the
latent space comprises a determination of a result in the latent
space and the result being decoded into the input space via a
decoding distribution over the input space conditioned on the
latent representation.
19. The computational system of claim 12 wherein the training
dataset comprises supervised data and the latent representation of
the latent space is formed based on at least a subset of the
supervised data.
20. The computational system of claim 19 wherein: the objective
function comprises a loss function defined on one or more
properties represented in the supervised data; and the latent
representation is optimized based on a gradient over the latent
space, the gradient based on the loss function.
21. The computational system of claim 20 wherein the latent
representation is optimized by backpropagating values of the loss
function from the input space to the latent space via a decoding
distribution over the input space conditioned on the latent
representation.
22. The computational system of claim 20 wherein the loss function
is defined only on the one or more properties represented in the
supervised data.
23. The computational system of claim 20 wherein the latent
representation is optimized based on both a log-probability of the
properties and a log probability of a prior distribution over the
latent space.
Description
BACKGROUND
Field
[0001] The present disclosure generally relates to machine
learning, and, more particularly, to machine learning using quantum
computers.
Machine Learning
[0002] Machine learning relates to methods and circuitry that can
learn from data and make predictions based on data. In contrast to
methods or circuitry that follow static program instructions,
machine learning methods and circuitry can include deriving a model
from example inputs (such as a training set) and then making
data-driven predictions.
[0003] Machine learning is related to optimization. Some problems
can be expressed in terms of minimizing a loss function on a
training set, where the loss function describes the disparity
between the predictions of the model being trained and observable
data.
[0004] Machine learning tasks can include unsupervised learning,
supervised learning, and reinforcement learning. Approaches to
machine learning include, but are not limited to, decision trees,
linear and quadratic classifiers, case-based reasoning, Bayesian
statistics, and artificial neural networks.
[0005] Machine learning can be used in situations where explicit
approaches are considered infeasible. Example application areas
include optical character recognition, search engine optimization,
and computer vision.
Quantum Processor
[0006] A quantum processor is a computing device that can harness
quantum physical phenomena (such as superposition, entanglement,
and quantum tunneling) unavailable to non-quantum devices. A
quantum processor may take the form of a superconducting quantum
processor. A superconducting quantum processor may include a number
of qubits and associated local bias devices, for instance two or
more superconducting qubits. An example of a qubit is a flux qubit.
A superconducting quantum processor may also employ coupling
devices (i.e., "couplers") providing communicative coupling between
qubits. Further details and embodiments of example quantum
processors that may be used in conjunction with the present systems
and devices are described in, for example, U.S. Pat. Nos.
7,533,068; 8,008,942; 8,195,596; 8,190,548; and 8,421,053.
Adiabatic Quantum Computation
[0007] Adiabatic quantum computation typically involves evolving a
system from a known initial Hamiltonian (the Hamiltonian being an
operator whose eigenvalues are the allowed energies of the system)
to a final Hamiltonian by gradually changing the Hamiltonian. A
simple example of an adiabatic evolution is a linear interpolation
between initial Hamiltonian and final Hamiltonian. An example is
given by:
$$H_e = (1-s)H_i + sH_f$$
where H_i is the initial Hamiltonian, H_f is the final Hamiltonian, H_e is the evolution or instantaneous Hamiltonian, and s is an evolution coefficient which controls the rate of evolution (i.e., the rate at which the Hamiltonian changes).
[0008] As the system evolves, the evolution coefficient s goes from 0 to 1 such that at the beginning (i.e., s=0) the evolution Hamiltonian H_e is equal to the initial Hamiltonian H_i and at the end (i.e., s=1) the evolution Hamiltonian H_e is equal to the final Hamiltonian H_f. Before the evolution begins, the system is typically initialized in a ground state of the initial Hamiltonian H_i, and the goal is to evolve the system in such a way that the system ends up in a ground state of the final Hamiltonian H_f at the end of the evolution. If the evolution is too fast, then the system can transition to a higher energy state, such as the first excited state. As used herein, an "adiabatic" evolution is an evolution that satisfies the adiabatic condition:
$$s' \left| \langle 1 | \, dH_e/ds \, | 0 \rangle \right| = \delta \, g^2(s)$$
where s' is the time derivative of s, g(s) is the difference in energy between the ground state and first excited state of the system (also referred to herein as the "gap size") as a function of s, and δ is a coefficient much less than 1.
[0009] If the evolution is slow enough that the system is always in
the instantaneous ground state of the evolution Hamiltonian, then
transitions at anti-crossings (when the gap size is smallest) are
avoided. Other evolution schedules, besides the linear evolution
described above, are possible including non-linear evolution,
parametric evolution, and the like. Further details on adiabatic
quantum computing systems, methods, and apparatus are described in,
for example, U.S. Pat. Nos. 7,135,701; and 7,418,283.
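For illustration only (not part of the disclosed subject matter), the following Python sketch builds an assumed two-qubit initial Hamiltonian and problem Hamiltonian, forms the linear interpolation H_e(s) = (1-s)H_i + sH_f, and tracks the gap g(s) along the schedule; the matrices and the schedule are illustrative choices, not values taken from this specification.

```python
import numpy as np

# Pauli matrices and identity
I = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=float)
Z = np.array([[1, 0], [0, -1]], dtype=float)

def kron(*ops):
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Illustrative two-qubit Hamiltonians (assumed, not from the specification):
# H_i drives all qubits with a transverse field; H_f is a small Ising problem.
H_i = -(kron(X, I) + kron(I, X))
H_f = 0.5 * kron(Z, I) - 0.3 * kron(I, Z) + 1.0 * kron(Z, Z)

def evolution_hamiltonian(s):
    """H_e(s) = (1 - s) H_i + s H_f."""
    return (1.0 - s) * H_i + s * H_f

def gap(s):
    """Energy difference g(s) between the ground and first excited states."""
    energies = np.linalg.eigvalsh(evolution_hamiltonian(s))
    return energies[1] - energies[0]

schedule = np.linspace(0.0, 1.0, 101)
gaps = [gap(s) for s in schedule]
s_min = schedule[int(np.argmin(gaps))]
print(f"minimum gap {min(gaps):.4f} at s = {s_min:.2f}")
```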
Quantum Annealing
[0010] Quantum annealing is a computation method that may be used
to find a low-energy state, typically preferably the ground state,
of a system. Similar in concept to classical simulated annealing,
the method relies on the underlying principle that natural systems
tend towards lower energy states because lower energy states are
more stable. While classical annealing uses classical thermal
fluctuations to guide a system to a low-energy state and ideally
its global energy minimum, quantum annealing may use quantum
effects, such as quantum tunneling, as a source of disordering to
reach a global energy minimum more accurately and/or more quickly
than classical annealing. In quantum annealing, thermal effects and other noise may be present during annealing. The final low-energy state
may not be the global energy minimum. Adiabatic quantum computation
may be considered a special case of quantum annealing for which the
system, ideally, begins and remains in its ground state throughout
an adiabatic evolution. Thus, those of skill in the art will
appreciate that quantum annealing systems and methods may generally
be implemented on an adiabatic quantum computer. Throughout this
specification and the appended claims, any reference to quantum
annealing is intended to encompass adiabatic quantum computation
unless the context requires otherwise.
[0011] Quantum annealing uses quantum mechanics as a source of
disorder during the annealing process. An objective function, such
as an optimization problem, is encoded in a Hamiltonian H_P, and the algorithm introduces quantum effects by adding a disordering Hamiltonian H_D that does not commute with H_P. An example case is:
$$H_E \propto A(t)H_D + B(t)H_P$$
where A(t) and B(t) are time-dependent envelope functions. For example, A(t) can change from a large value to substantially zero during the evolution, and H_E can be thought of as an evolution Hamiltonian similar to H_e described in the context of adiabatic quantum computation above. The disorder is slowly removed by removing H_D (i.e., by reducing A(t)).
[0012] Thus, quantum annealing is similar to adiabatic quantum
computation in that the system starts with an initial Hamiltonian
and evolves through an evolution Hamiltonian to a final "problem"
Hamiltonian H_P whose ground state encodes a solution to the
problem. If the evolution is slow enough, the system may settle in
the global minimum (i.e., the exact solution), or in a local
minimum close in energy to the exact solution. The performance of
the computation may be assessed via the residual energy (difference
from exact solution using the objective function) versus evolution
time. The computation time is the time required to generate a
residual energy below some acceptable threshold value. In quantum
annealing, H_P may encode an optimization problem and therefore H_P may be diagonal in the subspace of the qubits that encode the solution, but the system does not necessarily stay in the ground state at all times. The energy landscape of H_P may be
crafted so that its global minimum is the answer to the problem to
be solved, and low-lying local minima are good approximations.
[0013] The gradual reduction of disordering Hamiltonian H_D
(i.e., reducing A(t)) in quantum annealing may follow a defined
schedule known as an annealing schedule. Unlike adiabatic quantum
computation where the system begins and remains in its ground state
throughout the evolution, in quantum annealing the system may not
remain in its ground state throughout the entire annealing
schedule. As such, quantum annealing may be implemented as a
heuristic technique, where low-energy states with energy near that
of the ground state may provide approximate solutions to the
problem.
BRIEF SUMMARY
[0014] A method for unsupervised learning over an input space
including discrete or continuous variables, and at least a subset
of a training dataset of samples of the respective variables, to
attempt to identify a value of at least one parameter that
increases a log-likelihood of at least the subset of the training
dataset with respect to a model, the model expressible as a
function of the at least one parameter, the method executed by
circuitry including at least one processor, may be summarized as
including forming a first latent space comprising a plurality of
random variables, the plurality of random variables comprising one
or more discrete random variables; forming a second latent space
comprising the first latent space and a set of supplementary
continuous random variables; forming a first transforming
distribution comprising a conditional distribution over the set of
supplementary continuous random variables, conditioned on the one
or more discrete random variables of the first latent space;
forming an encoding distribution comprising an approximating
posterior distribution over the first latent space, conditioned on
the input space; forming a prior distribution over the first latent
space; forming a decoding distribution comprising a conditional
distribution over the input space conditioned on the set of
supplementary continuous random variables; determining an ordered
set of conditional cumulative distribution functions of the
supplementary continuous random variables, each cumulative
distribution function comprising functions of a full distribution
of at least one of the one or more discrete random variables of the
first latent space; determining an inversion of the ordered set of
conditional cumulative distribution functions of the supplementary
continuous random variables; constructing a first stochastic
approximation to a lower bound on the log-likelihood of the at
least a subset of a training dataset; constructing a second
stochastic approximation to a gradient of the lower bound on the
log-likelihood of at least the subset of the training dataset; and
increasing the lower bound on the log-likelihood of at least the
subset of the training dataset based at least in part on the
gradient of the lower bound on the log-likelihood of at least the
subset of the training dataset, wherein constructing a second
stochastic approximation to a gradient of the lower bound includes
approximating a gradient of at least a first part of the first
stochastic approximation with respect to one or more parameters of
the prior distribution over the first latent space using samples
from the prior distribution, wherein approximating the gradient of
at least a first part of the first stochastic approximation with
respect to one or more parameters of the prior distribution over
the first latent space using samples from the prior distribution
includes at least one of generating a plurality of samples or
causing a plurality of samples to be generated by a quantum
processor comprising a plurality of qubits and a plurality of
coupling devices providing communicative coupling between
respective pairs of qubits, wherein at least one of generating a
plurality of samples or causing a plurality of samples to be
generated by a quantum processor includes: forming one or more
chains, each chain comprising a respective subset of the plurality
of qubits; and representing at least one of the one or more
discrete random variables of the first latent space by a respective
chain. Forming one or more chains may include initiating a coupling
strength of at least one coupling device, the coupling device
selected to induce a correlation between a respective pair of
qubits.
[0015] The method may further include determining an approximating
posterior distribution over each of the first latent space and the
second latent space, wherein the approximating posterior
distribution is a hierarchical approximating posterior distribution
comprising a plurality of levels of the hierarchy, and wherein at
least one of generating a plurality of samples or causing a
plurality of samples to be generated by a quantum processor further
includes assigning a first qubit of each chain to a respective
first level of the hierarchy.
[0016] At least one of generating a plurality of samples or causing
a plurality of samples to be generated by a quantum processor may
further include assigning a second qubit of each chain to a
respective second level of the hierarchy, wherein the second qubit
of each chain is successively adjacent in the respective chain to
the first qubit of the respective chain, and the likelihood of the
second qubit of each chain for a given sample having the same value
as the first qubit of the respective chain exceeds a predetermined
threshold.
[0017] At least one of generating a plurality of samples or causing
a plurality of samples to be generated by a quantum processor may
further include assigning a third qubit of each chain to a
respective third level of the hierarchy, wherein the third qubit of
each chain is successively adjacent in the respective chain to the
second qubit of the respective chain, and the likelihood of the
third qubit of each chain for a given sample having the same value
as the second qubit of the respective chain exceeds the
predetermined threshold.
[0018] At least one of generating a plurality of samples or causing
a plurality of samples to be generated by a quantum processor may
further include assigning a third qubit of each chain to a
respective third level of the hierarchy, wherein the third qubit of
each chain is successively adjacent in the respective chain to the
second qubit of the respective chain, and the likelihood of the
third qubit of each chain for a given sample having the same value
as the first qubit of the respective chain exceeds the
predetermined threshold. The first qubit of each chain may be at
one end of the chain.
[0019] The first qubit of each chain may be in the interior of the
chain, and at least one of generating a plurality of samples or
causing a plurality of samples to be generated by a quantum
processor may further include assigning a second and a third qubit
of each chain to a respective second level of the hierarchy,
wherein the second and third qubits are both successively adjacent
in the respective chain to the respective first qubit.
[0020] The method may further include determining an approximating
posterior distribution over each of the first latent space and the
second latent space, wherein the approximating posterior
distribution is a hierarchical approximating posterior distribution
comprising a plurality of levels of the hierarchy, wherein at least
one of generating a plurality of samples or causing a plurality of
samples to be generated by a quantum processor further includes
assigning a single qubit of each chain to a respective level of the
hierarchy.
[0021] The method may further include determining an approximating
posterior distribution over each of the first latent space and the
second latent space, wherein the approximating posterior
distribution is a hierarchical approximating posterior distribution
comprising a plurality of levels of the hierarchy, wherein at least
one of generating a plurality of samples or causing a plurality of
samples to be generated by a quantum processor further includes
assigning a single qubit of each chain to a single level of the
hierarchy.
[0022] The method may further include determining an approximating
posterior distribution over each of the first latent space and the
second latent space, wherein the approximating posterior
distribution is a hierarchical approximating posterior distribution
comprising a plurality of levels of the hierarchy, wherein at least
one of generating a plurality of samples or causing a plurality of
samples to be generated by a quantum processor further includes
assigning a single qubit of each chain to a first level of the
hierarchy, and remaining qubits of each chain to a second level of
the hierarchy. Increasing the lower bound on the log-likelihood of
at least the subset of the training dataset based at least in part
on the gradient of the lower bound on the log-likelihood of at
least the subset of the training dataset may include increasing the
lower bound on the log-likelihood of at least the subset of the
training dataset using a method of gradient descent. Increasing the
lower bound on the log-likelihood of at least the subset of the
training dataset using a method of gradient descent may include
attempting to maximize the lower bound on the log-likelihood of at
least the subset of the training dataset using a method of gradient
descent. The encoding distribution and decoding distribution may be
parameterized by deep neural networks. Determining an ordered set
of conditional cumulative distribution functions of the
supplementary continuous random variables may include analytically
determining an ordered set of conditional cumulative distribution
functions of the supplementary continuous random variables. The
lower bound may be an evidence lower bound. Constructing a first
stochastic approximation to the lower bound of the log-likelihood
of the at least a subset of a training dataset may include
decomposing the first stochastic approximation to the lower bound
into at least a first part comprising negative KL-divergence
between the approximating posterior and the prior distribution over
the first latent space, and a second part comprising an
expectation, or at least a stochastic approximation to an
expectation, with respect to the approximating posterior over the
second latent space of the conditional log-likelihood of at least
the subset of the training dataset under the decoding
distribution.
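As a rough illustration of that decomposition (a sketch under simplified assumptions, not the implementation described in this specification), the following Python snippet estimates a lower bound as the negative KL-divergence between a factorial Bernoulli approximating posterior and a factorial Bernoulli prior, plus a one-sample stochastic approximation to the expected conditional log-likelihood under a toy sigmoid decoder; the factorial distributions and the decoder are stand-ins for the hierarchical posterior and quantum prior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, q_probs, prior_probs, decoder):
    """One-sample estimate of ELBO = -KL(q || p_prior) + E_q[log p(x | z)].

    Assumes factorial Bernoulli q(z|x) and prior p(z) for illustration;
    `decoder` maps a binary latent sample to Bernoulli probabilities over x.
    """
    eps = 1e-9
    # Analytic KL-divergence between factorial Bernoulli distributions.
    kl = np.sum(
        q_probs * np.log((q_probs + eps) / (prior_probs + eps))
        + (1 - q_probs) * np.log((1 - q_probs + eps) / (1 - prior_probs + eps))
    )
    # Stochastic approximation of the reconstruction term with one sample.
    z = (rng.random(q_probs.shape) < q_probs).astype(float)
    x_probs = decoder(z)
    log_lik = np.sum(
        x * np.log(x_probs + eps) + (1 - x) * np.log(1 - x_probs + eps)
    )
    return -kl + log_lik

# Toy example: 4 latent variables, 6 visible variables, random decoder weights.
W = rng.normal(size=(6, 4))
b = rng.normal(size=6)
decoder = lambda z: 1.0 / (1.0 + np.exp(-(W @ z + b)))  # sigmoid decoder

x = rng.integers(0, 2, size=6).astype(float)
q = rng.uniform(0.1, 0.9, size=4)   # approximating posterior probabilities
prior = np.full(4, 0.5)             # uniform Bernoulli prior
print("ELBO estimate:", elbo_estimate(x, q, prior, decoder))
```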
[0023] Constructing a second stochastic approximation to the
gradient of the lower bound may further include determining the
gradient of the second part of the first stochastic approximation
by backpropagation; and determining a gradient of the first part of
the first stochastic approximation with respect to parameters of
the encoding distribution by backpropagation. A logarithm of the
prior distribution may be, to within a constant, a problem
Hamiltonian of a quantum processor.
[0024] The method may further include determining an expectation
with respect to the prior distribution from the plurality of
samples. Generating a plurality of samples or causing a plurality
of samples to be generated by at least one quantum processor may
include performing at least one post-processing operation on the
plurality of samples.
[0025] Generating a plurality of samples or causing a plurality of
samples to be generated by at least one quantum processor may
further include operating the at least one quantum processor as a
sample generator to provide the plurality of samples from a
probability distribution, wherein a shape of the probability
distribution may depend on a configuration of a number of
programmable parameters for the at least one quantum processor, and
wherein operating the at least one quantum processor as a sample
generator may include programming the at least one quantum
processor with a configuration of the number of programmable
parameters for the at least one quantum processor, wherein the
configuration of a number of programmable parameters corresponds to
the probability distribution over the plurality of qubits of the at
least one quantum processor; evolving the quantum processor; and
reading out states for the qubits in the plurality of qubits of the at
least one quantum processor, wherein the states for the qubits in
the plurality of qubits correspond to a sample from the probability
distribution. Operating the at least one quantum processor as a
sample generator to provide the samples from a probability
distribution may include operating the at least one quantum
processor as a sample generator to provide the samples from a
probability distribution wherein the shape of the probability
distribution depends on at least one programmable qubit bias, and
at least one programmable coupling strength of a coupling device
providing communicative coupling between a respective pair of
qubits.
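One way such a sample-generation workflow can look in practice is sketched below using the publicly available D-Wave Ocean SDK; the SDK, the solver access it requires, and the specific bias and coupling values are assumptions for illustration and are not specified by this disclosure.

```python
# Sketch of operating a quantum processor as a sample generator.
# Assumes the D-Wave Ocean SDK (dwave-system) and access to a solver;
# neither is specified by this disclosure.
from dwave.system import DWaveSampler, EmbeddingComposite

# Programmable parameters: local qubit biases h_i and coupling strengths J_ij.
h = {0: -0.5, 1: 0.25, 2: 0.0}
J = {(0, 1): -1.0, (1, 2): 0.8}

sampler = EmbeddingComposite(DWaveSampler())

# Each read programs the processor, evolves (anneals) it, and reads out the
# qubit states, yielding one spin configuration (+1/-1) drawn from a
# distribution whose shape depends on h and J.
sampleset = sampler.sample_ising(h, J, num_reads=100)

for sample, energy, occurrences in sampleset.data(
    ["sample", "energy", "num_occurrences"]
):
    print(sample, energy, occurrences)
```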
[0026] Generating a plurality of samples or causing a plurality of
samples to be generated by at least one quantum processor may
include generating a plurality of samples or causing a plurality of
samples to be generated by a restricted Boltzmann machine, and may
further include determining the expectation with respect to the
prior distribution from the plurality of samples. The set of
supplementary continuous random variables may include a plurality
of continuous variables, and each one of the plurality of
continuous variables is conditioned on a different respective one
of the plurality of random variables.
[0027] The method may further include forming a second transforming
distribution, wherein the input space comprises a plurality of
input variables, and the second transforming distribution is
conditioned on one or more of the plurality of input variables and
at least one of the one or more discrete random variables.
[0028] A computational system may be summarized as including at
least one processor; and at least one nontransitory
processor-readable storage medium that stores at least one of
processor-executable instructions or data which, when executed by
the at least one processor, cause the at least one processor to
execute any of the methods of claims 1 through 26.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
[0029] In the drawings, identical reference numbers identify
similar elements or acts. The sizes and relative positions of
elements in the drawings are not necessarily drawn to scale. For
example, the shapes of various elements and angles are not
necessarily drawn to scale, and some of these elements may be
arbitrarily enlarged and positioned to improve drawing legibility.
Further, the particular shapes of the elements as drawn, are not
necessarily intended to convey any information regarding the actual
shape of the particular elements, and may have been solely selected
for ease of recognition in the drawings.
[0030] FIG. 1 is a schematic diagram of an example hybrid computer
including a digital computer and an analog computer in accordance
with the present systems, devices, methods, and articles.
[0031] FIG. 2A is a schematic diagram illustrating a cell used as a
building block in an example physical topology of a quantum
processor such as the quantum processor of FIG. 1.
[0032] FIG. 2B is a schematic diagram illustrating an example
physical topology of a quantum processor such as the quantum
processor of FIG. 1.
[0033] FIG. 2C is a schematic diagram illustrating an example
logical topology of a quantum processor such as the quantum
processor of FIG. 1.
[0034] FIG. 3 is a schematic diagram illustrating an example
implementation of a variational auto-encoder (VAE).
[0035] FIG. 4 is a flow chart illustrating a method for
unsupervised learning, in accordance with the presently described
systems, devices, articles, and methods.
[0036] FIG. 5 is a schematic diagram illustrating an example
implementation of a hierarchical variational auto-encoder
(VAE).
[0037] FIG. 6 is a schematic diagram illustrating an example
implementation of a variational auto-encoder (VAE) with a hierarchy
of continuous latent variables.
[0038] FIG. 7 is a flow chart illustrating a method for
unsupervised learning via a hierarchical variational auto-encoder
(VAE), in accordance with the present systems, devices, articles
and methods.
[0039] FIG. 8 is a schematic diagram illustrating an example
logical topology of a quantum processor such as the quantum
processor of FIG. 1.
[0040] FIG. 9 shows an example method for searching an input space
characterized by an objective function.
DETAILED DESCRIPTION
Generalities
[0041] In the following description, some specific details are
included to provide a thorough understanding of various disclosed
embodiments. One skilled in the relevant art, however, will
recognize that embodiments may be practiced without one or more of
these specific details, or with other methods, components,
materials, etc. In other instances, well-known structures
associated with quantum processors, such as quantum devices,
coupling devices, and control systems including microprocessors and
drive circuitry have not been shown or described in detail to avoid
unnecessarily obscuring descriptions of the embodiments of the
present methods. Throughout this specification and the appended
claims, the words "element" and "elements" are used to encompass,
but are not limited to, all such structures, systems, and devices
associated with quantum processors, as well as their related
programmable parameters.
[0042] Unless the context requires otherwise, throughout the
specification and claims that follow, the word "comprising" is
synonymous with "including," and is inclusive or open-ended (i.e.,
does not exclude additional, unrecited elements or method
acts).
[0043] Reference throughout this specification to "one embodiment",
"an embodiment", "another embodiment", "one example", "an example",
or "another example" means that a particular referent feature,
structure, or characteristic described in connection with the
embodiment or example is included in at least one embodiment or
example. Thus, the appearances of the phrases "in one embodiment",
"in an embodiment", "another embodiment" or the like in various
places throughout this specification are not necessarily all
referring to the same embodiment or example. Furthermore, the
particular features, structures, or characteristics may be combined
in any suitable manner in one or more embodiments or examples.
[0044] It should be noted that, as used in this specification and
the appended claims, the singular forms "a," "an," and "the"
include plural referents unless the content clearly dictates
otherwise. Thus, for example, reference to a problem-solving system
including "a quantum processor" includes a single quantum
processor, or two or more quantum processors. It should also be
noted that the term "or" is generally employed in its sense
including "and/or" unless the content clearly dictates
otherwise.
[0045] References to a processor or at least one processor refer to
hardware or circuitry, whether discrete or integrated, for example
single or multi-core microprocessors, microcontrollers, central
processor units, digital signal processors, graphical processing
units, programmable gate arrays, programmed logic controllers, and
analog processors, for instance quantum processors. Various
algorithms and methods and specific acts are executable via one or
more processors.
[0046] The headings provided herein are for convenience only and do
not interpret the scope or meaning of the embodiments.
Quantum Hardware
[0047] FIG. 1 illustrates a hybrid computing system 100 including a
digital computer 102 coupled to an analog computer 104. The example
digital computer 102 includes one or more digital processors 106
that may be used to perform classical digital processing tasks.
Digital computer 102 may further include at least one system memory
108, and at least one system bus 110 that couples various system
components, including system memory 108 to digital processor(s)
106. System memory 108 may store a VAE instructions module 112.
[0048] The digital processor(s) 106 may be any logic processing
unit or circuitry (e.g., integrated circuits), such as one or more
central processing units ("CPUs"), graphics processing units
("GPUs"), digital signal processors ("DSPs"), application-specific
integrated circuits ("ASICs"), programmable gate arrays ("FPGAs"),
programmable logic controllers (PLCs), etc., and/or combinations of
the same.
[0049] In some implementations analog computer 104 includes one or
more quantum processors 114.
[0050] Digital computer 102 may include a user input/output
subsystem 116. In some implementations, the user input/output
subsystem includes one or more user input/output components such as
a display 118, mouse 120, and/or keyboard 122.
[0051] System bus 110 can employ any known bus structures or
architectures, including a memory bus with a memory controller, a
peripheral bus, and a local bus. System memory 108 may include
non-volatile memory, such as read-only memory ("ROM"), static
random access memory ("SRAM"), Flash NAND; and volatile memory such
as random access memory ("RAM") (not shown).
[0052] Digital computer 102 may also include other non-transitory
computer- or processor-readable storage media or non-volatile
memory 124. Non-volatile memory 124 may take a variety of forms,
including: a hard disk drive for reading from and writing to a hard
disk (e.g., magnetic disk), an optical disk drive for reading from
and writing to removable optical disks, and/or a solid state drive
(SSD) for reading from and writing to solid state media (e.g.,
NAND-based Flash memory). The optical disk can be a CD-ROM or DVD,
while the magnetic disk can be a rigid spinning magnetic disk or a
magnetic floppy disk or diskette. Non-volatile memory 124 may
communicate with digital processor(s) via system bus 110 and may
include appropriate interfaces or controllers 126 coupled to system
bus 110. Non-volatile memory 124 may serve as long-term storage for
processor- or computer-readable instructions, data structures, or
other data (sometimes called program modules) for digital computer
102.
[0053] Although digital computer 102 has been described as
employing hard disks, optical disks and/or solid state storage
media, those skilled in the relevant art will appreciate that other
types of nontransitory and non-volatile computer-readable media may
be employed, such as magnetic cassettes, flash memory cards, Flash,
ROMs, smart cards, etc. Those skilled in the relevant art will
appreciate that some computer architectures employ nontransitory
volatile memory and nontransitory non-volatile memory. For example,
data in volatile memory can be cached to non-volatile memory, or to a solid-state disk that employs integrated circuits to provide non-volatile memory.
[0054] Various processor- or computer-readable instructions, data
structures, or other data can be stored in system memory 108. For
example, system memory 108 may store instructions for communicating
with remote clients and scheduling use of resources including
resources on the digital computer 102 and analog computer 104. Also
for example, system memory 108 may store at least one of processor
executable instructions or data that, when executed by at least one
processor, causes the at least one processor to execute the various
algorithms described elsewhere herein, including machine learning
related algorithms.
[0055] In some implementations system memory 108 may store
processor- or computer-readable calculation instructions and/or
data to perform pre-processing, co-processing, and post-processing
to analog computer 104. As described above, system memory 108 may
store a VAE instructions module that includes processor- or
computer-readable instructions to perform VAE. System memory 108
may store a set of analog computer interface instructions to
interact with analog computer 104. When executed, the stored
instructions and/or data cause the system to operate as a special
purpose machine.
[0056] Analog computer 104 may include at least one analog
processor such as quantum processor 114. Analog computer 104 can be
provided in an isolated environment, for example, in an isolated
environment that shields the internal elements of the quantum
computer from heat, magnetic field, and other external noise (not
shown). The isolated environment may include a refrigerator, for
instance a dilution refrigerator, operable to cryogenically cool
the analog processor, for example to temperatures below approximately 1 kelvin.
[0057] FIG. 2A is a schematic diagram illustrating a cell 200a used
as a building block in an example physical topology of quantum
processor 114 of FIG. 1. Cell 200a comprises qubits and coupling
devices, the coupling devices providing communicative coupling
between respective pairs of qubits. In one implementation, the
qubits and coupling devices shown in FIG. 2A are superconducting
flux qubits and superconducting coupling devices.
[0058] The cell or sub-topology illustrated in FIG. 2A can be
replicated and laid out in a grid to form a topology that is
referred to as a Chimera topology. The Chimera topology is
described in more detail below with reference to FIG. 2B.
[0059] Each cell 200 comprises four qubits 210a-210d (collectively
210) laid out horizontally and four qubits 220a-220d (collectively
220) laid out vertically. A person of skill in the art will
appreciate that while four qubits are illustrated both horizontally
and vertically, this number is arbitrary and embodiments may
comprise more or fewer than four qubits.
[0060] Crosstalk between qubits laid out horizontally and qubits
laid out vertically may not exist in some embodiments of the
present systems and methods. For crosstalk, or the unintended
coupling of qubits, to exist, two current carrying wires from two
respective qubits must run parallel in some manner to allow flux
from current within a first wire to induce a current to flow in a
second wire. Since qubits laid out horizontally run perpendicular
to qubits laid out vertically, crosstalk between them may be
limited. Applicant notes the qubits from one set may not
necessarily be limited to being perpendicular to the qubits of
another set, and the qubits can cross one another at angles other
than 90 degrees. Coupling between pairs of qubits can occur through
a third structure, such as a coupling device.
[0061] Each qubit 210a-210d may be a respective loop of
superconducting material 212a-212d interrupted by at least one
respective Josephson junction 215a-215d. Each qubit 220a-220d may
be a respective loop of superconducting material 222a-222d
interrupted by at least one respective Josephson junction
225a-225d.
[0062] Couplers 230a-230p (collectively 230) couple qubits 210 and
220. Each qubit 210a-210d is coupled to each qubit 220a-220d
through four respective couplers from couplers 230 in a region
proximate to where a portion of each qubit 210a-210d crosses a
portion of qubit 220a-220d. Each coupler 230a-230p may be a
respective loop of superconducting material wherein the loop of superconducting material may define a perimeter to a coupling region. Each coupler 230a-230p may be a respective loop of superconducting material interrupted by at least one respective Josephson junction wherein the loop of superconducting material may
define a perimeter to a coupling region wherein coupling occurs
along the perimeter by having a current carrying wire, such as loop
of superconducting material 212a-212d, 222a-222d, run parallel in
some manner to coupler 230a-230p to allow flux from current within
loop of superconducting material 212a-212d, 222a-222d to induce a
current to flow in a coupler 230a-230p and vice versa.
[0063] Couplers 230 may be tunable in that the coupling that couplers 230 create between two respective qubits 210, 220 can be changed
during the operation of an analog processor. The coupling may
change during computation. The coupling may change between
computations to embed a problem into the analog processor.
[0064] While a topology based on cell 200a may be used to implement
quantum processor 114 of FIG. 1, other topologies can also be used
for the systems and methods of the present disclosure.
[0065] FIG. 2B is a schematic diagram illustrating an example
physical topology of quantum processor 114 of FIG. 1. Topology 200b
comprises a grid of 2×2 cells 200a-1, 200a-2, 200a-3, and
200a-4, each cell comprised of 8 qubits. Each of the cells in
topology 200b can have the structure of cell 200a of FIG. 2A. A
person of skill in the art will appreciate that while FIG. 2B
illustrates an example topology comprising a 2×2 array of
cells, this array size is arbitrary and embodiments may comprise a
suitable M×N array of cells where M>1 and N>1.
[0066] As described in reference to FIG. 2A, pairs of qubits from
the rows and columns of cell 200a can be communicatively coupled to
one another by a respective coupling device (coupling devices
230a-230p in FIG. 2A, illustrated by bold cross shapes). A
respective coupling device is positioned and operable to
communicatively couple the qubit in each column
(vertically-oriented qubit in drawing sheet) in each cell to the
qubits in each row (horizontally-oriented qubit in drawing sheet)
in the same cell.
[0067] Additionally, a respective coupling device, such as coupling
device 240 (only one called out in FIG. 2B) is positioned and
operable to communicatively couple the qubit in each column
(vertically-oriented qubit in drawing sheet) in each cell with a
corresponding qubit in each column (vertically-oriented qubit in
drawing sheet) in a nearest neighboring cell in a same direction as
the orientation of the columns. Similarly, a respective coupling
device, such as coupling device 250 (only one called out in FIG.
2B) is positioned and operable to communicatively couple the qubit
in each row (horizontally-oriented qubit in drawing sheet) in each
cell with a corresponding qubit in each row (horizontally-oriented
qubit in drawing sheet) in each nearest neighboring cell in a same
direction as the orientation of the rows. Since the coupling
devices 240, 250 couple qubits between cells such coupling devices
240, 250 may at times be denominated as inter-cell coupling
devices. Since the coupling devices 230a-230p of FIG. 2A couple
qubits within a cell 200a, such coupling devices 230a-230p may at
times be denominated as intra-cell coupling devices.
[0068] Coupling devices 230a-230p, 240, and 250 may be controllable
such that ferromagnetic couplings, anti-ferromagnetic
couplings, zero couplings and transverse couplings may be created
between pairs of adjacent qubits.
[0069] FIG. 2C is a schematic diagram illustrating an example
logical topology of quantum processor 114 of FIG. 1. Topology 200c
shows nine cells, such as cell 200a-5 (only one called out in FIG.
2C), each cell comprising eight qubits, for example qubits q65 through q72 in cell 200a-5. Cell 200a-5
can have the structure of cell 200a of FIG. 2A. FIG. 2C illustrates
the intra-coupling, such as coupling device 230 (only one called
out in FIG. 2C), and inter-coupling, such as coupling devices 240
and 250 (only one of each called out in FIG. 2C), for the cell
200a-5.
[0070] The non-planarity of the connections between qubits q1-q72
makes the problem of finding the lowest energy state of the qubits
q1-q72 an NP-hard problem, which means that it is possible to map
many practical problems to the topology illustrated in FIGS. 2A, 2B
and 2C, and described above.
[0071] Use of the quantum processor 114 with the topology
illustrated in FIGS. 2A, 2B, and 2C is not limited only to problems
that fit the native topology. For example, it is possible to embed
a complete graph of size N on a quantum processor of size
O(N^2) by chaining qubits together.
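To make the chaining idea concrete, the following sketch finds an embedding of a complete graph K_N into a Chimera topology so that each logical variable is represented by a chain of physical qubits; it assumes the open-source dwave_networkx and minorminer packages, which are illustrative choices rather than part of this disclosure.

```python
# Sketch: embed a complete graph into a Chimera topology using chains.
# Assumes dwave_networkx and minorminer are installed (illustration only).
import itertools

import dwave_networkx as dnx
import minorminer

N = 8  # size of the complete graph K_N
source_edges = list(itertools.combinations(range(N), 2))

# A 4x4 Chimera graph with 8-qubit cells, similar in spirit to FIGS. 2A-2C.
target_graph = dnx.chimera_graph(4, 4, 4)

# Each logical variable is mapped to a chain of physical qubits.
embedding = minorminer.find_embedding(source_edges, target_graph)
for variable, chain in embedding.items():
    print(f"variable {variable} -> chain of {len(chain)} qubits: {sorted(chain)}")
```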
[0072] Computational system 100 of FIG. 1 comprising quantum
processor 114 with a topology illustrated in FIGS. 2A, 2B, and 2C
can specify an energy function over spin variables +1/-1, and
receive from quantum processor 114 samples of lower-energy spin
configurations in an approximately Boltzmann distribution according
to an Ising model as follows:
$$E(s) = \sum_i h_i s_i + \sum_{i,j} J_{i,j} s_i s_j$$
where h_i are local biases and J_{i,j} are coupling terms.
[0073] The spin variables can be mapped to binary variables 0/1.
Higher-order energy functions can be expressed by introducing
additional constraints over auxiliary variables.
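A minimal Python sketch of the energy function and the spin-to-binary mapping described above, with toy bias and coupling values assumed for illustration:

```python
import numpy as np

def ising_energy(s, h, J):
    """E(s) = sum_i h_i s_i + sum_{i,j} J_ij s_i s_j for spins s_i in {-1, +1}."""
    energy = float(np.dot(h, s))
    for (i, j), coupling in J.items():
        energy += coupling * s[i] * s[j]
    return energy

def spins_to_binary(s):
    """Map spin variables {-1, +1} to binary variables {0, 1}."""
    return (np.asarray(s) + 1) // 2

# Toy problem with illustrative biases and couplings (not from the specification).
h = np.array([0.5, -0.3, 0.1])
J = {(0, 1): -1.0, (1, 2): 0.4}

s = np.array([+1, -1, +1])
print("energy:", ising_energy(s, h, J))
print("binary:", spins_to_binary(s))
```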
Machine Learning
[0074] Various systems and methods for augmenting conventional
machine learning hardware such as Graphics Processing Units (GPUs)
and Central Processing Units (CPUs) with quantum hardware are
described herein. Quantum hardware typically includes one or more
quantum processors or quantum processing units (QPUs). The systems
and methods described herein adapt machine learning architectures
and methods to exploit QPUs to advantageously achieve improved
machine performance. Improved machine performance typically
includes reduced training time and/or increased generalization
accuracy.
[0075] Optimization and sampling can be computational bottlenecks
in machine learning systems and methods. The systems and methods
described herein integrate the QPU into the machine learning
pipeline (including the architecture and methods) to perform
optimization and/or sampling with improved performance over
classical hardware. The machine learning pipeline can be modified
to suit QPUs that can be realized in practice.
Sampling in Training Probabilistic Models
[0076] Boltzmann machines including restricted Boltzmann machines
(RBMs) can be used in deep learning systems. Boltzmann machines are
particularly suitable for unsupervised learning and probabilistic
modeling such as in-painting and classification.
[0077] A shortcoming of existing approaches is that Boltzmann
machines typically use costly Markov Chain Monte Carlo (MCMC)
techniques to approximate samples drawn from an empirical
distribution; these MCMC techniques serve as a proxy for a physical Boltzmann sampler.
[0078] A QPU can be integrated into machine learning systems and
methods to reduce the time taken to perform training. For example,
the QPU can be used as a physical Boltzmann sampler. The approach
involves programming the QPU (which is an Ising system) such that
the spin configurations realize a user-defined Boltzmann
distribution natively. The approach can then draw samples directly
from the QPU.
Restricted Boltzmann Machine (RBM)
[0079] The restricted Boltzmann machine (RBM) is a probabilistic
graphical model that represents a joint probability distribution
p(x, z) over binary visible units x and binary hidden units z. The
restricted Boltzmann machine can be used as an element in a deep
learning network.
[0080] The RBM network has the topology of a bipartite graph with
biases on each visible unit and on each hidden unit, and weights
(couplings) on each edge. An energy E(x, z) can be associated with
the joint probability distribution p(x, z) over the visible and the
hidden units, as follows:
$$p(x,z) = e^{-E(x,z)}/Z$$
where Z is the partition function.
[0081] For a restricted Boltzmann machine, the energy is:
$$E(x,z) = -b^T x - c^T z - z^T W x$$
where b and c are bias terms expressed as matrices, W is a coupling
term expressed as a matrix, and T denotes the transpose of a
matrix. The conditional probabilities can be computed:
$$p(x|z) = \sigma(b + W^T z)$$
$$p(z|x) = \sigma(c + W x)$$
where σ is the sigmoid function, used to ensure the values of the conditional probabilities lie in the range [0,1].
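For illustration, a minimal NumPy sketch of the energy and conditional probabilities above, with small random parameters assumed rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_visible, n_hidden = 6, 4
b = rng.normal(scale=0.1, size=n_visible)               # visible biases
c = rng.normal(scale=0.1, size=n_hidden)                # hidden biases
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # couplings

def rbm_energy(x, z):
    """E(x, z) = -b^T x - c^T z - z^T W x."""
    return -b @ x - c @ z - z @ W @ x

def p_x_given_z(z):
    """p(x_i = 1 | z) = sigmoid(b + W^T z)."""
    return sigmoid(b + W.T @ z)

def p_z_given_x(x):
    """p(z_j = 1 | x) = sigmoid(c + W x)."""
    return sigmoid(c + W @ x)

x = rng.integers(0, 2, size=n_visible).astype(float)
z_probs = p_z_given_x(x)
z = (rng.random(n_hidden) < z_probs).astype(float)  # sample hidden units
print("E(x, z) =", rbm_energy(x, z))
print("p(x | z) =", p_x_given_z(z))
```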
Training RBMs
[0082] Training is the process by which the parameters of the model
are adjusted to favor producing the desired training distribution.
Typically, this is done by maximizing the log-likelihood of the observed data distribution with respect to the model parameters. One part of the
process involves sampling over the given data distribution, and
this part is generally straightforward. Another part of the process
involves sampling over the predicted model distribution, and this
is generally intractable, in the sense that it would use
unmanageable amounts of computational resources.
[0083] Some existing approaches use a Markov Chain Monte Carlo
(MCMC) method to perform sampling. MCMC constructs a Markov chain
that has the desired distribution as its equilibrium distribution.
The state of the chain after k>>1 steps is used as a sample
of the desired distribution. The quality of the sample improves as
a function of the number of steps which means that MCMC makes
training a slow process.
[0084] To speed up the MCMC process, Contrastive Divergence-k
(CD-k) can be used, in which the method only takes k steps of the
MCMC process. Another way to speed up the process is to use
Persistent Contrastive Divergence (PCD), in which a Markov Chain is
initialized in the state where it ended from the previous model.
CD-k and PCD methods tend to perform poorly when the distribution
is multi-modal and the modes are separated by regions of low
probability.
[0085] Even approximate sampling is NP-hard. The cost of sampling
grows exponentially with problem size. Samples drawn from a native
QPU network (as described above) are close to a Boltzmann
distribution. It is possible to quantify the rate of convergence to
a true Boltzmann distribution by evaluating the KL-divergence
between the empirical distribution and the true distribution as a
function of the number of samples.
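A minimal sketch of such a convergence check is shown below; it assumes
the state space is small enough that the true Boltzmann probabilities can
be enumerated, which is an assumption of the example only.

```python
import numpy as np

def empirical_kl(samples, true_probs):
    # KL(empirical || true) for discrete samples.  samples is a sequence of
    # hashable configurations (e.g. tuples of 0/1); true_probs maps each
    # configuration to its Boltzmann probability.  Evaluating this for
    # growing sample counts traces the rate of convergence.
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    n = len(samples)
    return sum((cnt / n) * np.log((cnt / n) / true_probs[s])
               for s, cnt in counts.items())
```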
[0086] Noise limits the precision with which the parameters of the
model can be set in the quantum hardware. In practice, this means
that the QPU is sampling from a slightly different energy function.
The effects can be mitigated by sampling from the QPU and using the
samples as starting points for non-quantum post-processing, e.g., to
initialize MCMC, CD, and PCD. The QPU performs the hard part
of the sampling process. The QPU finds a diverse set of valleys,
and the post-processing operation samples within the valleys.
Post-processing can be implemented in a GPU and can be at least
partially overlapped with sampling in the quantum processor to
reduce the impact of post-processing on the overall timing.
Sampling to Train RBMs
[0087] A training data set can comprise a set of visible vectors.
Training comprises adjusting the model parameters such that the
model is most likely to reproduce the distribution of the training
set. Typically, training comprises maximizing the log-likelihood of
the observed data distribution with respect to the model parameters
.theta.:
$$\frac{\partial \log\left(\sum_{z} p(x,z)\right)}{\partial\theta} = -\left\langle \frac{\partial E(x,z)}{\partial\theta}\right\rangle_{p(z|x)} + \left\langle \frac{\partial E(x,z)}{\partial\theta}\right\rangle_{p(x|z)}$$
[0088] The first term on the right-hand side (RHS) in the above
equation is related to the positive phase, and computes an expected
value of the energy E over p(z|x). This term involves sampling over the
given data distribution.
[0089] The second term on the RHS is related to the negative phase,
and computes an expected value of the energy over p(x|z). This term
involves sampling over the predicted model distribution.
Variational Auto-Encoder
[0090] Unsupervised learning of probabilistic models is a technique
for machine learning. It can facilitate tasks such as de-noising to
extract a signal from a mixture of signal and noise, and inpainting
to reconstruct lost or corrupted parts of an image. It can also
regularize supervised tasks such as classification.
[0091] One approach to unsupervised learning can include attempting
to maximize the log-likelihood of an observed dataset under a
probabilistic model. Equivalently, unsupervised learning can
include attempting to minimize the KL-divergence from the data
distribution to that of the model. While the exact gradient of the
log-likelihood function is frequently intractable, stochastic
approximations can be computed, provided samples can be drawn from
the probabilistic model and its posterior distribution given the
observed data.
[0092] The efficiency of using stochastic approximations to arrive
at a maximum of the log-likelihood function can be limited by the
poor availability of desirable distributions for which the
requisite sampling operations are computationally efficient. Hence,
applicability of the techniques can be similarly limited. Although
sampling can be efficient in undirected graphical models provided
there are no loops present among the connections, the range of
representable relationships can be limited. Boltzmann machines
(including restricted Boltzmann machines) can generate approximate
samples using generally costly and inexact Markov Chain Monte Carlo
(MCMC) techniques.
[0093] Sampling can be efficient in directed graphical models
comprising a directed acyclic graph since sampling can be performed
by an ancestral pass. Even so, it can be inefficient to compute the
posterior distributions over the hidden causes of observed data in
such models, and samples from the posterior distributions are
required to compute the gradient of the log-likelihood
function.
[0094] Another approach to unsupervised learning is to optimize a
lower bound on the log-likelihood function. This approach can be
more computationally efficient. An example of a lower bound is the
evidence lower bound (ELBO), which differs from the true
log-likelihood by the KL-divergence between an approximating
posterior distribution, q(z|x, φ), and the true posterior
distribution, p(z|x, θ). The approximating posterior
distribution can be designed to be computationally tractable even
though the true posterior distribution is not computationally
tractable. The ELBO can be expressed as follows:
$$\mathcal{L}(x,\theta,\phi) = \log p(x|\theta) - KL\left[q(z|x,\phi)\,\|\,p(z|x,\theta)\right] = \int_{z} q(z|x,\phi)\,\log\!\left[\frac{p(x,z|\theta)}{q(z|x,\phi)}\right]$$
where x denotes the observed random variables, z the latent random
variables, θ the parameters of the generative model, and φ the
parameters of the approximating posterior.
[0095] Successive optimization of the ELBO with respect to .PHI.
and .theta. is analogous to variational expectation-maximization
(EM). It is generally possible to construct a stochastic
approximation to gradient descent on the ELBO that only requires
exact, computationally tractable samples. A drawback of this
approach is that it can lead to high variance in the gradient
estimate, and can result in slow training and poor performance.
[0096] The variational auto-encoder can regroup the ELBO as:
$$\mathcal{L}(x,\theta,\phi) = -KL\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] + \mathbb{E}_{q}\left[\log p(x|z,\theta)\right]$$
[0097] The KL-divergence between the approximating posterior and
the true prior is analytically simple and computationally efficient
for commonly chosen distributions, such as Gaussians.
[0098] A low-variance stochastic approximation to the gradient of
the auto-encoding term E_q can be backpropagated efficiently, so
long as samples from the approximating posterior q(z|x) can be
drawn using a differentiable, deterministic function f(x, φ, ρ) of
the combination of the inputs x, the parameters φ, and a set of
input- and parameter-independent random variables ρ ∼ D. For
instance, given a Gaussian distribution N(m(x, φ), v(x, φ)) with
mean m(x, φ) and variance v(x, φ) determined by the input, samples
can be drawn using:
$$f(x,\phi,\rho) = m(x,\phi) + \sqrt{v(x,\phi)}\,\rho, \quad \text{where } \rho \sim \mathcal{N}(0,1).$$
When such an f(x, φ, ρ) exists,
$$\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\log p\left(x\,|\,f(x,\rho,\phi),\theta\right)\right]$$
$$\frac{\partial}{\partial\phi}\,\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\frac{\partial}{\partial\phi}\log p\left(x\,|\,f(x,\rho,\phi),\theta\right)\right] \approx \frac{1}{N}\sum_{\rho\sim D}\frac{\partial}{\partial\phi}\log p\left(x\,|\,f(x,\rho,\phi),\theta\right), \quad (1)$$
and the stochastic approximation to the derivative in equation 1 is
analytically tractable so long as p(x|z, θ) and f(x, ρ, φ) are
defined so as to have tractable derivatives.
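For illustration, a minimal sketch of this re-parameterized sampling and
the Monte Carlo estimate in equation (1) is given below; the function
names and the use of NumPy (rather than an automatic-differentiation
framework, which would be used in practice to obtain the gradients) are
assumptions of the example.

```python
import numpy as np

def reparameterized_sample(m, v, rng=np.random.default_rng()):
    # z = m + sqrt(v) * rho with rho ~ N(0, 1).  Because the sample is a
    # deterministic, differentiable function of (m, v) given rho, gradients
    # of downstream quantities can be propagated back to the encoder
    # parameters that produced m and v.
    rho = rng.standard_normal(np.shape(m))
    return m + np.sqrt(v) * rho

def mc_autoencoding_term(m, v, log_p_x_given_z, n=100):
    # Monte Carlo estimate of E_q[log p(x|z)], as in equation (1).
    return np.mean([log_p_x_given_z(reparameterized_sample(m, v)) for _ in range(n)])
```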
[0099] This approach is possible whenever the approximating
posteriors for each hidden variable, q_i(z_i|x, φ), are independent
given x and φ; the cumulative distribution function (CDF) of each
q_i is invertible; and the inverse CDF of each q_i is
differentiable. Specifically, choose D to be the uniform
distribution between 0 and 1, and f_i to be the inverse CDF of
q_i.
[0100] The conditional-marginal cumulative distribution function
(CDF) is defined by:
$$F_i(x) = \int_{x_i'=-\infty}^{x} p(x_i'\,|\,x_1,\ldots,x_{i-1})\,dx_i'$$
[0101] Since the approximating posterior distribution q(z|x, φ)
maps each input to a distribution over the latent space, it is
called the "encoder". Correspondingly, since the conditional
likelihood distribution p(x|z, θ) maps each configuration of the
latent variables to a distribution over the input space, it is
called the "decoder".
[0102] Unfortunately, a multivariate CDF is generally not
invertible. One way to deal with this is to define a set of
conditional-marginal CDFs as follows:
$$F_i(x) = \int_{x_i'=-\infty}^{x} p(x_i'\,|\,x_1,\ldots,x_{i-1})\,dx_i'$$
and invert each conditional CDF in turn. The CDF F_i(x) is the CDF
of x_i conditioned on all x_j where j<i, and marginalized over all
x_k where i<k. Such inverses generally exist provided the
conditional-marginal probabilities are everywhere non-zero.
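As a purely illustrative sketch, a strictly increasing one-dimensional
CDF can be inverted numerically by bisection as below; the bracketing
interval and tolerance are assumptions of the example.

```python
def inverse_cdf(cdf, rho, lo=-10.0, hi=10.0, tol=1e-8):
    # Numerically invert a strictly increasing CDF: find z with cdf(z) = rho,
    # where rho is drawn uniformly from (0, 1).  Applied to each
    # conditional-marginal CDF F_i in turn, this yields a sample of the
    # corresponding variable.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```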
Discrete Variational Auto-Encoders
[0103] The approach can run into challenges with discrete
distributions, such as, for example, Restricted Boltzmann Machines
(RBMs). An approximating posterior that only assigns non-zero
probability to a discrete domain corresponds to a CDF that is
piecewise-constant. That is, the range of the CDF is a proper
subset of the interval [0, 1]. The domain of the inverse CDF is
thus also a proper subset of the interval [0, 1] and its derivative
is generally not defined.
[0104] The difficulty can remain even if a quantile function as
follows is used:
$$F_p^{-1}(\rho) = \inf\left\{z : \int_{z'=-\infty}^{z} p(z')\,dz' \geq \rho\right\}$$
The derivative of the quantile function is either zero or infinite
for a discrete distribution.
[0105] One method for discrete distributions is to use a
reinforcement learning method such as REINFORCE (Williams,
http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf).
The REINFORCE method adjusts weights following receipt of a
reinforcement value by an amount proportional to the difference
between a reinforcement baseline and the reinforcement value.
Rather than differentiating the conditional log-likelihood directly,
REINFORCE estimates the gradient of the log of the conditional
likelihood distribution, in effect, by a finite difference
approximation. The conditional log-likelihood log p(x|z, θ) is
evaluated at many different points z ∼ q(z|x, φ), and the gradient
$$\frac{\partial}{\partial\phi}\log q(z|x,\phi)$$
is weighted more strongly when p(x|z, θ) differs more greatly
from the baseline.
[0106] One disadvantage is that the change of p(x|z, .theta.) in a
given direction can only affect the REINFORCE gradient estimate if
a sample is taken with a component in the same direction. In a
D-dimensional latent space, at least D samples are required to
capture the variation of the conditional distribution p(x|z,
.theta.) in all directions. Since the latent representation can
typically consist of hundreds of variables, the REINFORCE gradient
estimate can be much less efficient than one that makes more direct
use of the gradient of the conditional distribution p(x|z,
.theta.).
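A minimal sketch of the score-function (REINFORCE) estimator described
above is given below; the callables passed in are hypothetical
placeholders, not an interface defined by the application.

```python
import numpy as np

def reinforce_gradient(sample_z, grad_log_q, log_p, baseline=0.0, n=100):
    # Score-function (REINFORCE) estimate of d/dphi E_q[log p(x|z)]:
    #   (1/N) * sum_k (log p(x|z_k) - baseline) * d/dphi log q(z_k|x, phi)
    # sample_z()     draws z ~ q(z|x, phi)
    # grad_log_q(z)  returns d/dphi log q(z|x, phi) as a vector
    # log_p(z)       returns log p(x|z, theta)
    return np.mean([(log_p(z) - baseline) * grad_log_q(z)
                    for z in (sample_z() for _ in range(n))], axis=0)
```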
[0107] A discrete variational auto-encoder (DVAE) is a hierarchical
probabilistic model consisting of an RBM, followed by multiple
layers of continuous latent variables, allowing the binary
variables to be marginalized out, and the gradient to backpropagate
smoothly through the auto-encoding component of the ELBO.
[0108] The generative model is redefined so that the conditional
distribution of the observed variables given the latent variables
only depends on the new continuous latent space.
[0109] A discrete distribution is thereby transformed into a
mixture distribution over this new continuous latent space. This
does not alter the fundamental form of the model, nor the
KL-divergence term of the ELBO; rather it adds a stochastic
component to the approximating posterior and the prior.
[0110] One interpretation of the way that VAEs work is that they
break the encoder distribution into "packets" of probability, each
packet having infinitesimal but equal probability mass. Within each
packet, the values of the latent variables are approximately
constant. The packets correspond to a region in the latent space,
and the expectation value is taken over the packets. There are
generally more packets in regions of high probability, so more
probable values are more likely to be selected.
[0111] As the parameters of the encoder are changed, the location
of each packet can move, while its probability mass stays constant.
So long as the inverse CDF F_{q(z|x, φ)}^{-1} exists and is differentiable,
a small change in .PHI. will correspond to a small change in the
location of each packet. This allows the use of the gradient of the
decoder to estimate the change in the loss function, since the
gradient of the decoder captures the effect of small changes in the
location of a selected packet in the latent space.
[0112] In contrast, REINFORCE works by breaking the latent
representation into segments of infinitesimal but equal volume,
within which the latent variables are also approximately constant,
while the probability mass varies between segments. Once a segment
is selected in the latent space, its location is independent of the
parameters of the encoder. As a result, the contribution of the
selected location to the loss function is not dependent on the
gradient of the decoder. On the other hand, the probability mass
assigned to the region in the latent space around the selected
location is relevant.
[0113] Though VAEs can make use of gradient information from the
decoder, the gradient estimate is generally only low-variance
provided the motion of most probability packets has a similar
effect on the loss function. This is likely to be the case when the
packets are tightly clustered (e.g., if the encoder produces a
Gaussian distribution with low variance) or if the movements of
well-separated packets have a similar effect on the loss function
(e.g., if the decoder is roughly linear).
[0114] One difficulty is that VAEs cannot generally be used
directly with discrete latent representations because changing the
parameters of a discrete encoder moves probability mass between the
allowed discrete values, and the allowed discrete values are
generally far apart. As the encoder parameters change, a selected
packet either remains in place or jumps more than an infinitesimal
distance to an allowed discrete value. Consequently, small changes
to the parameters of the encoder do not affect most of the
probability packets. Even when a packet jumps between discrete
values of the latent representation, the gradient of the decoder
generally cannot be used to estimate the change in loss function
accurately, because the gradient generally captures only the
effects of very small movements of the probability packet.
[0115] Therefore, to use discrete latent representations in the VAE
framework, the method described herein for unsupervised learning
transforms the distributions to a continuous latent space within
which the probability packets move smoothly. The encoder q(z|x, φ)
and prior distribution p(z|θ) are extended by a
transformation to a continuous, auxiliary latent representation
.zeta., and the decoder is correspondingly transformed to be a
function of the continuous representation. By extending the encoder
and the prior distribution in the same way, the remaining
KL-divergence (referred to above) is unaffected.
[0116] In the transformation, one approach maps each point in the
discrete latent space to a non-zero probability over the entire
auxiliary continuous space. In so doing, if the probability at a
point in the discrete latent space increases from zero to a
non-zero value, a probability packet does not have to jump a large
distance to cover the resulting region in the auxiliary continuous
space. Moreover, it ensures that the CDFs F.sub.i(x) are strictly
increasing as a function of their main argument, and thus are
invertible.
[0117] The method described herein for unsupervised learning
smooths the conditional-marginal CDF F.sub.i(x) of an approximating
posterior distribution, and renders the distribution invertible,
and its inverse differentiable, by augmenting the latent discrete
representation with a set of continuous random variables. The
generative model is redefined so that the conditional distribution
of the observed variables given the latent variables only depends
on the new continuous latent space.
[0118] The discrete distribution is thereby transformed into a
mixture distribution over the continuous latent space, each value
of each discrete random variable associated with a distinct mixture
component on the continuous expansion. This does not alter the
fundamental form of the model, nor the KL-divergence term of the
ELBO; rather it adds a stochastic component to the approximating
posterior and the prior.
[0119] The method augments the latent representation with
continuous random variables ζ, conditioned on z, as follows:
$$q(\zeta,z\,|\,x,\phi) = r(\zeta|z)\,q(z|x,\phi)$$
where the support of r(ζ|z) for all values of z is connected, so
the marginal distribution q(ζ|x, φ) = Σ_z r(ζ|z) q(z|x, φ) has a
constant, connected support so long as 0 < q(z|x, φ) < 1. The
approximating posterior r(ζ|z) is continuous and differentiable
except at the end points of its support, so that the inverse
conditional-marginal CDF is differentiable.
[0120] FIG. 3 shows an example implementation of a VAE. The
variable z is a latent variable. The variable x is a visible
variable (for example, pixels in an image data set). The variable
.zeta. is a continuous variable conditioned on a discrete z as
described above in the present disclosure. The variable .zeta. can
serve to smooth out the discrete random variables in the
auto-encoder term. As described above, the variable .zeta.
generally does not directly affect the KL-divergence between the
approximating posterior and the true prior.
[0121] In the example, the variables z.sub.1, z.sub.2, and z.sub.3
are disjoint subsets of qubits in the quantum processor. The
computational system samples from the RBM using the quantum
processor. The computational system generates the hierarchical
approximating posteriors using a digital (classical) computer. The
computational system uses priors 310 and 330, and hierarchical
approximating posteriors 320 and 340.
[0122] For the prior 330 and the approximating posterior 340, the
system adds continuous variables .zeta..sub.1, .zeta..sub.2,
.zeta..sub.3 below the latent variables z.sub.1, z.sub.2,
z.sub.3.
[0123] FIG. 3 also shows the auto-encoding loop 350 of the VAE.
Initially, input x is passed into a deterministic feedforward
network q(z=1|x, φ), for which the final non-linearity is the
logistic function. Its output q, along with an independent random
variable ρ, is passed into the deterministic function
F_{q(ζ|x,φ)}^{-1} to produce a sample of ζ. This ζ, along with the
original input x, is finally passed to log p(x|ζ, θ). The
expectation of this log probability with respect to ρ is the
auto-encoding term of the VAE. This auto-encoder, conditioned on
the input and the independent ρ, is deterministic and
differentiable, so backpropagation can be used to produce a
low-variance, computationally efficient approximation to the
gradient.
[0124] The distribution remains continuous as q(z|x, φ) changes.
The distribution is also everywhere non-zero in the approach that
maps each point in the discrete latent space to a non-zero
probability over the entire auxiliary continuous space.
Correspondingly, p(ζ, z|θ) is defined as p(ζ, z|θ) = r(ζ|z) p(z|θ),
where r(ζ|z) is the same as for the approximating posterior, and
p(x|ζ, z, θ) = p(x|ζ, θ). This transformation renders the model a
continuous mixture distribution over ζ.
[0125] The method described herein can generate low-variance
stochastic approximations to the gradient. The KL-divergence
between the approximating posterior and the true prior distribution
is unaffected by the introduction of auxiliary continuous latent
variables, provided the same expansion is used for both.
[0126] The auto-encoder portion of the loss function is evaluated
in the space of continuous random variables, and the KL-divergence
portion of the loss function is evaluated in the discrete
space.
[0127] The KL-divergence portion of the loss function is as
follows:
$$-KL\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] = \sum_{z} q(z|x,\phi)\left[\log p(z|\theta) - \log q(z|x,\phi)\right]$$
[0128] The gradient of the KL-divergence portion of the loss
function in the above equation with respect to θ can be estimated
stochastically using samples from the true prior distribution
p(z|θ). The gradient of the KL-divergence portion of the loss
function can be expressed as follows:
$$\frac{\partial KL(q\,\|\,p)}{\partial\theta} = -\left\langle\frac{\partial E_p(z|\theta)}{\partial\theta}\right\rangle_{q(z|x,\phi)} + \left\langle\frac{\partial E_p(z|\theta)}{\partial\theta}\right\rangle_{p(z|\theta)}$$
[0129] In one approach, the method computes the gradients of the
KL-divergence portion of the loss function analytically, for
example by first directly parameterizing a factorial q(z|x, φ)
with a deep network g(x):
$$q(z|x,\phi) = \frac{e^{-E_q(z|x,\phi)}}{\sum_{z'} e^{-E_q(z'|x,\phi)}}$$
where E_q(z|x) = -g(x)^T z, and then using the following
expression:
$$\frac{\partial KL(q\,\|\,p)}{\partial\phi} = \left(\left(g(x) - h - (J^{T}+J)\langle z\rangle_{q}\right)^{T} \odot \left(\langle z\rangle_{q} - \langle z\rangle_{q}^{2}\right)^{T}\right)\frac{\partial g(x)}{\partial\phi}$$
Equation 1 can therefore be simplified by dropping the dependence
of p on z and then marginalizing z out of q, as follows:
$$\frac{\partial}{\partial\phi}\,\mathbb{E}_{q(\zeta,z|x,\phi)}\left[\log p(x|\zeta,z,\theta)\right] \approx \frac{1}{N}\sum_{\rho\sim U(0,1)^{n}}\frac{\partial}{\partial\phi}\log p\left(x\,|\,\zeta,\theta\right)\Big|_{\zeta=\zeta(\rho)} \quad (2)$$
[0130] An example of a transformation from the discrete latent
space to a continuous latent space is the spike-and-slab
transformation:
$$r(\zeta_i\,|\,z_i=0) = \begin{cases}\infty, & \text{if } \zeta_i = 0\\ 0, & \text{otherwise}\end{cases} \qquad r(\zeta_i\,|\,z_i=1) = \begin{cases}1, & \text{if } 0\leq\zeta_i\leq 1\\ 0, & \text{otherwise}\end{cases}$$
This transformation is consistent with sparse coding.
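For illustration, a minimal sketch of the smoothed sampling implied by
the spike-and-slab transformation is given below; the closed-form
inverse of the marginal CDF used here follows from the definitions
above, and the function name is an assumption of the example.

```python
import numpy as np

def spike_and_slab_zeta(q, rho):
    # Smoothed sample of zeta given q = q(z = 1 | x, phi) and rho ~ U(0, 1).
    # The marginal over zeta is (1 - q) * delta(zeta) + q * Uniform[0, 1],
    # whose CDF is F(zeta) = (1 - q) + q * zeta on [0, 1].  Inverting F gives
    # a deterministic function of q that is differentiable almost
    # everywhere, so gradients can flow back to the encoder.
    return np.maximum(0.0, (rho - (1.0 - q)) / q)
```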
[0131] Other expansions to the continuous space are also possible.
As an example a combination of delta spike and exponential function
can be used:
$$r(\zeta_i\,|\,z_i=0) = \begin{cases}\infty, & \text{if } \zeta_i = 0\\ 0, & \text{otherwise}\end{cases} \qquad r(\zeta_i\,|\,z_i=1) = \begin{cases}\dfrac{\beta e^{\beta\zeta_i}}{e^{\beta}-1}, & \text{if } 0\leq\zeta_i\leq 1\\ 0, & \text{otherwise}\end{cases}$$
[0132] Alternatively, it is possible to define a transformation
from discrete to continuous variables in the approximating
posterior, r(ζ|z), where the transformation is not independent of
the input x. In the true posterior distribution,
p(ζ|z, x) ≈ p(ζ|z) only if z already captures most of the
information about x and p(ζ|z, x) changes little as a function of
x. In a case where it may be desirable for q(ζ_i|z_i, x, φ) to be
a separate Gaussian for both values of the binary z_i, it is
possible to use a mixture of a delta spike and a Gaussian to define
a transformation from the discrete to the continuous space for
which the CDF can be inverted piecewise.
[0133] FIG. 4 shows a method 400 of unsupervised learning using a
discrete variational auto-encoder. Execution of the method 400 by
one or more processor-based devices may occur in accordance with
the present system, devices, articles, and methods. Method 400,
like other methods herein may be implemented by a series or set of
processor-readable instructions executed by one or more processors
(i.e., hardware circuitry).
[0134] Method 400 starts at 405, for example in response to a call
from another routine or other invocation.
[0135] At 410, the system initializes the model parameters with
random values. Alternatively, the system can initialize the model
parameters based on a pre-training procedure. At 415, the system
tests to determine if a stopping criterion has been reached. The
stopping criterion can, for example, be related to the number of
epochs (i.e., passes through the dataset) or a measurement of
performance between successive passes through a validation dataset.
In the latter case, when performance begins to degrade, it is an
indication that the system is over-fitting and should stop.
[0136] In response to determining the stopping criterion has been
reached, the system ends method 400 at 475, until invoked again,
for example, by a request to repeat the learning.
[0137] In response to determining the stopping criterion has not
been reached, the system fetches a mini-batch of the training data
set at 420. At 425, the system propagates the training data set
through the encoder to compute the full approximating posterior
over discrete space z.
[0138] At 430, the system generates or causes generation of samples
from the approximating posterior over .zeta., given the full
distribution over z. Typically, this is performed by a non-quantum
processor, and uses the inverse of the CDF F.sub.i(x) described
above. The non-quantum processor can, for example, take the form of
one or more of one or more digital microprocessors, digital signal
processors, graphical processing units, central processing units,
digital application specific integrated circuits, digital field
programmable gate arrays, digital microcontrollers, and/or any
associated memories, registers or other nontransitory computer- or
processor-readable media, communicatively coupled to the
non-quantum processor.
[0139] At 435, the system propagates the samples through the
decoder to compute the distribution over the input.
[0140] At 440, the system performs backpropagation through the
decoder.
[0141] At 445, the system performs backpropagation through the
sampler over the approximating posterior over .zeta.. In this
context, backpropagation is an efficient computational approach to
determining the gradient.
[0142] At 450, the system computes the gradient of the
KL-divergence between the approximating posterior and the true
prior over z. At 455, the system performs backpropagation through
the encoder.
[0143] At 457, the system determines a gradient of a KL-divergence,
with respect to parameters of the true prior distribution, between
the approximating posterior and the true prior distribution over
the discrete space.
[0144] At 460, the system determines at least one of a gradient or
at least a stochastic approximation of a gradient, of a bound on
the log-likelihood of the input data.
[0145] In some embodiments, the system generates samples or causes
samples to be generated by a quantum processor. At 465, the system
updates the model parameters based at least in part on the
gradient.
[0146] At 470, the system tests to determine if the current
mini-batch is the last mini-batch to be processed. In response to
determining that the current mini-batch is the last mini-batch to
be processed, the system returns control to 415. In response to
determining that the current mini-batch is not the last mini-batch
to be processed, the system returns control to 420.
[0147] In some implementations, act 470 is omitted, and control
passes directly to 415 from 465. The decision whether to fetch
another mini-batch can be incorporated in 415.
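A schematic sketch of a loop paralleling acts 415 through 465 of method
400 is given below; the function arguments are hypothetical,
caller-supplied callables (this interface is an assumption of the
example, not one prescribed by the application), initialization per act
410 is assumed to have been done, and the prior samples passed to the
gradient step may in practice be produced by a quantum processor.

```python
def train_dvae(encoder, sample_zeta, decoder, grad_step, sample_prior,
               minibatches, stopping_criterion):
    # Schematic training loop for a discrete variational auto-encoder.
    while not stopping_criterion():                       # act 415
        for x in minibatches():                           # act 420
            q = encoder(x)                                # act 425: posterior over z
            zeta = sample_zeta(q)                         # act 430: inverse-CDF samples
            recon = decoder(zeta)                         # act 435: distribution over input
            grad_step(x, q, zeta, recon, sample_prior())  # acts 440-465
```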
[0148] In summary, as described in more detail above, the discrete
VAE method extends the encoder and the prior with a transformation
to a continuous, auxiliary latent representation, and
correspondingly makes the decoder a function of the same continuous
representation. The method evaluates the auto-encoder portion of
the loss function in the continuous representation while evaluating
the KL-divergence portion of the loss function in the z space.
[0149] Accommodating Explaining-Away with a Hierarchical
Approximating Posterior
[0150] When a probabilistic model is defined in terms of a prior
distribution p(z) over latent variables z and a conditional
distribution p(x|z) over observed variables x given the latent
variables, the observation of x often induces strong correlations
of the z, given x, in the posterior p(z|x) due to phenomena such as
explaining-away, a pattern of reasoning where the confirmation of
one cause reduces the need to search for alternative causes.
Moreover, an RBM used as the prior distribution may have strong
correlations between the units of the RBM.
[0151] To accommodate the strong correlations expected in the
posterior distribution while maintaining tractability, hierarchy
can be introduced into the approximating posterior q(z|x). Although
the variables of each hierarchical layer are independent given the
previous layers, the total distribution can capture strong
correlations, especially as the size of each hierarchical layer
shrinks towards a single variable.
[0152] The latent variables z of the RBM are divided into disjoint
groups, z.sub.1, . . . , z.sub.k. The continuous latent variables
are divided into complementary disjoint groups .zeta..sub.1, . . .
.zeta..sub.K. In one implementation, the groups may be chosen at
random, while in other implementations the groups may be defined so
as to be of equal size. The hierarchical variational auto-encoder
defines the approximating posterior via a directed acyclic
graphical model over these groups.
$$q(z_1,\zeta_1,\ldots,z_k,\zeta_k\,|\,x,\phi) = \prod_{1\leq j\leq k} r(\zeta_j|z_j)\,q(z_j\,|\,\zeta_{i<j},x,\phi)$$
where
$$q(z_j\,|\,\zeta_{i<j},x,\phi) = \frac{e^{g_j(\zeta_{i<j},x)^{T}z_j}}{\prod_{z_l\in z_j}\left(1+e^{g_{z_l}(\zeta_{i<j},x)}\right)}$$
[0153] z_j ∈ {0,1}, and g_j(ζ_{i<j}, x, φ) is a parameterized
function of the input and the preceding ζ_{i<j}, such as a neural
network. The corresponding graphical model is shown in FIG. 5.
[0154] FIG. 5 is a schematic diagram illustrating an example
implementation of a hierarchical variational auto-encoder (VAE).
The model uses approximating posterior 510, where latent variable
z.sub.3 is conditioned on the continuous variables .zeta..sub.2 and
.zeta..sub.1, while z.sub.2 is conditioned on .zeta..sub.1.
[0155] The dependence of z.sub.j on the discrete variables
z.sub.i<j is mediated by the continuous variables
.zeta..sub.i<j.
[0156] This hierarchical approximating posterior does not affect
the form of the auto-encoding term 520 of FIG. 5, except to
increase the depth of the auto-encoder. Each ζ_j can be computed
via the stochastic nonlinearity F_{q_j(ζ_j|ζ_{i<j}, x, φ)}^{-1}(ρ),
where the function q_j can take the previous ζ_{i<j} as input.
[0157] The deterministic probability value q(z=1|ζ_{i<j}, x, φ) is
parameterized, for example by a neural network.
[0158] For each successive layer j of the autoencoder, input x and
all previous ζ_{i<j} are passed into the network computing
q(z=1|ζ_{i<j}, x, φ). Its output q_j, along with an independent
random variable ρ, is passed into the deterministic function
F_{q(ζ_j|ζ_{i<j}, x, φ)}^{-1}(ρ) to produce a sample of ζ_j. Once
all ζ_j have been recursively computed, the full ζ, along with the
original input x, is finally passed to log p(x|ζ, θ).
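A minimal sketch of this recursive downward pass is given below; the
per-layer networks and the inverse-CDF transformation are passed in as
hypothetical callables, which is an assumption of the example only.

```python
import numpy as np

def sample_hierarchical_posterior(x, q_nets, inverse_cdf, rng=np.random.default_rng()):
    # One downward pass through the hierarchical approximating posterior.
    # q_nets[j](x, prev_zetas) returns q(z_j = 1 | zeta_{i<j}, x), and
    # inverse_cdf(q_j, rho) maps uniform noise to a smoothed sample zeta_j
    # (e.g. via the spike-and-slab transformation).
    zetas = []
    for q_net in q_nets:
        q_j = q_net(x, zetas)
        rho = rng.uniform(size=np.shape(q_j))
        zetas.append(inverse_cdf(q_j, rho))
    return zetas
```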
[0159] The KL-divergence between the approximating posterior and
the true prior is also not significantly affected by the
introduction of additional continuous latent variables .zeta., so
long as the approach uses the same expansion r(.zeta.|z) for both
the approximating posterior and the prior, as follows:
$$KL\left[q\,\|\,p\right] = \sum_{z}\int_{\zeta}\left(\prod_{1\leq j\leq k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)\right)\log\frac{\prod_{1\leq j\leq k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)}{p(z)\prod_{1\leq j\leq k} r(\zeta_j|z_j)}$$
$$= \sum_{z}\int_{\zeta}\left(\prod_{1\leq j\leq k} r(\zeta_j|z_j)\,q(z_j|\zeta_{i<j},x)\right)\log\frac{\prod_{1\leq j\leq k} q(z_j|\zeta_{i<j},x)}{p(z)}$$
[0160] The gradient of the KL-divergence with respect to the
parameter .theta. of the prior p(z|.theta.) can be estimated
stochastically using samples from the approximating posterior
q(.zeta., z|x, .PHI.) and the true prior p(z|.theta.). The prior
can be, for example, an RBM.
[0161] The final expectation with respect to
q(z.sub.k|.zeta..sub.i<j, x, .PHI.) can be performed
analytically; all other expectations require samples from the
approximating posterior. Similarly, the prior requires samples
from, for example, an RBM.
[0162] Samples from the same prior distribution are required for an
entire mini-batch, independent from the samples chosen from the
training dataset.
[0163] Hierarchical Variational Auto-Encoders
[0164] Convolutional architectures are an essential component of
state-of-the-art approaches to visual object classification, speech
recognition, and numerous other tasks. In particular, they have
been successfully applied to generative modeling, such as in
deconvolutional networks and LAPGAN. There is, therefore, technical
benefit in incorporating convolutional architectures into
variational auto-encoders, as such can provide a technical solution
to a technical problem, and thereby achieve a technical result.
[0165] Convolutional architectures are necessarily hierarchical. In
the feedforward direction, they build from local, high-resolution
features to global, low-resolution features through the application
of successive layers of convolution, point-wise nonlinear
transformations, and pooling. When used generatively, this process
is reversed, with global, low-resolution features building towards
local, high-resolution features through successive layers of
deconvolution, point-wise nonlinear transformations, and
unpooling.
[0166] Incorporating this architecture into the variational
auto-encoder framework, it is natural to associate the upward
pathway (from local to global) with the approximating posterior,
and the downward pathway (from global to local) with the generative
model. However, if the random variables of the generative model are
defined to be the units of the deconvolutional network itself, then
samples from the approximating posterior of the last hidden layer
of the deconvolutional decoder can be determined directly by the
convolutional encoder. In particular, it can be natural to define
the samples from the last layer of the deconvolutional decoder to
be a function solely of the first layer of the convolutional
encoder. As a result, the auto-encoding component of the VAE
parameter update depends on the bottom-most layer of random
variables. This seems contradictory to the intuitive structure of a
convolutional auto-encoder.
[0167] Instead, ancillary random variables can be defined at each
layer of the deconvolutional decoder network. Ancillary random
variables can be discrete random variables or continuous random
variables.
[0168] In the deconvolutional decoder, the ancillary random
variables of layer n are used in conjunction with the signal from
layer n+1 to determine the signal to layer n-1. The approximating
posterior over the ancillary random variables of layer n is defined
to be a function of the convolutional encoder, generally restricted
to layer n of the convolutional encoder. To compute a stochastic
approximation to the gradient of the evidence lower bound, the
approach can perform a single pass up the convolutional encoder
network, followed by a single pass down the deconvolutional decoder
network. In the pass down the deconvolutional decoder network, the
ancillary random variables are sampled from the approximating
posteriors computed in the pass up the convolutional encoder
network.
[0169] A Problem with the Traditional Approach
[0170] A traditional approach can result in approximating
posteriors that poorly match the true posterior, and consequently
can result in poor samples in the auto-encoding loop. In
particular, the approximating posterior defines independent
distributions over each layer. This product of independent
distributions ignores the strong correlations between adjacent
layers in the true posterior, conditioned on the underlying
data.
[0171] The representation throughout layer n should be mutually
consistent, and consistent with the representation in layer n-1 and
n+1. However, in the architecture described above, the
approximating posterior over every random variable is independent.
In particular, the variability in the higher (more abstract) layers
is uncorrelated with that in the lower layers, and consistency
cannot be enforced across layers unless the approximating posterior
collapses to a single point.
[0172] This problem is apparent in the case of (hierarchical)
sparse coding. At every layer, the true posterior has many modes,
constrained by long-range correlations within each layer. For
instance, if a line in an input image is decomposed into a
succession of short line segments (e.g., Gabor filters), it is
essential that the end of one segment line up with the beginning of
the next segment. With a sufficiently over-complete dictionary,
there may be many sets of segments that cover the line, but differ
by a small offset along the line. A factorial posterior can
reliably represent only one such mode at a time.
[0173] These equivalent representations can be disambiguated by the
successive layers of the representation. For instance, a single
random variable at a higher layer may specify the offset of all the
line segments in the previous example. In the traditional approach,
the approximating posteriors of the (potentially disambiguating)
higher layers are computed after approximating posteriors of the
lower layers have been computed. In contrast, an efficient
hierarchical variational auto-encoder could infer the approximating
posterior over the top-most layer first, potentially using a deep,
convolutional computation. It would then compute the conditional
approximating posteriors of lower layers given a sample from the
approximating posterior of the higher layers.
[0174] A Proposed Approach--Hierarchical Priors and Approximating
Posteriors
[0175] In the present approach, rather than defining the
approximating posterior to be fully factorial, the computational
system conditions the approximating posterior for the n.sup.th
layer on the sample from the approximating posterior of the higher
layers preceding it in the downward pass through the
deconvolutional decoder. In an example case, the computational
system conditions the approximating posterior for the n.sup.th
layer on the sample from the (n-1).sup.th layer. This corresponds
to a directed graphical model, flowing from the higher, more
abstract layers to the lower, more concrete layers. Consistency
between the approximating posterior distributions over each pair of
layers is ensured directly.
[0176] With such a directed approximating posterior, it is possible
to do away with ancillary random variables, and define the
distribution directly over the primary units of the deconvolutional
network. In this case, the system can use a parameterized
distribution for the deconvolutional component of the approximating
posterior that shares structure and parameters with the generative
model. Alternatively, the system can continue to use a separately
parameterized directed model.
[0177] In the example case and other cases, a stochastic
approximation to the gradient of the evidence lower bound can be
computed via one pass up the convolutional encoder, one pass down
the deconvolutional decoder of the approximating posterior, and
another pass down the deconvolutional decoder of the prior,
conditioned on the sample from the approximating posterior. Note
that if the approximating posterior is defined directly over the
primary units of the deconvolutional generative model, as opposed
to ancillary random variables, the final pass down the
deconvolutional decoder of the prior does not actually pass signals
from layer to layer. Rather, the input to each layer is determined
by the approximating posterior.
[0178] Below is an outline of the computations for two adjacent
hidden layers, highlighting the hierarchical components and
ignoring the details of convolution and deconvolution. If the
approximating posterior is defined directly over the primary units
of the deconvolutional generative model, then it is natural to use
a structure such as:
$$q(z_{n-1},z_n\,|\,x,\phi) = q(z_{n-1}|x,\phi)\,q(z_n|z_{n-1},x,\phi)$$
$$p(z_{n-1},z_n\,|\,\theta) = p(z_n|z_{n-1},\theta)\,p(z_{n-1}|\theta)$$
[0179] This builds the prior by conditioning the more local
variables of the (n-1).sup.th layer on the more global variables of
the n.sup.th layer. With ancillary random variables, we might
choose to use a simpler prior structure:
$$p(z_{n-1},z_n\,|\,\theta) = p(z_{n-1}|\theta)\,p(z_n|\theta)$$
[0180] The evidence lower bound decomposes as:
$$\begin{aligned}
\mathcal{L}_{VAE}(x,\theta,\phi) &= \log p(x|\theta) - KL\left[q(z_n,z_{n-1}|x,\phi)\,\|\,p(z_n,z_{n-1}|x,\theta)\right]\\
&= \log p(x|\theta) - KL\left[q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)\,\|\,p(z_{n-1}|z_n,x,\theta)\,p(z_n|x,\theta)\right]\\
&= \sum_{z_n}\int_{z_{n-1}} q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)\,\log\!\left[\frac{p(x|z_{n-1},\theta)\,p(z_{n-1}|z_n,\theta)\,p(z_n|\theta)}{q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)}\right]\\
&= \mathbb{E}_{q(z_{n-1}|z_n,x,\phi)\,q(z_n|x,\phi)}\left[\log p(x|z_n,z_{n-1},\theta)\right] - KL\left[q(z_n|x,\phi)\,\|\,p(z_n|\theta)\right]\\
&\quad - \sum_{z_n} q(z_n|x,\phi)\,KL\left[q(z_{n-1}|z_n,x,\phi)\,\|\,p(z_{n-1}|z_n,\theta)\right] \quad (3)
\end{aligned}$$
[0181] If the approximating posterior is defined directly over the
primary units of the deconvolutional generative model, then it may
be the case that
p(x|z_n, z_{n-1}, θ) = p(x|z_{n-1}, θ).
[0182] If both q(z_{n-1}|z_n, x, φ) and p(z_{n-1}|z_n, θ) are
Gaussian, then their KL-divergence has a simple closed form, which
can be computationally efficient if the covariance matrices are
diagonal. The gradients with respect to q(z_n|x, φ) in the last
term of Equation 3 can be obtained using the same
re-parameterization method used in a standard VAE.
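For illustration, the closed form referred to above for
diagonal-covariance Gaussians is sketched below; the argument names are
assumptions of the example.

```python
import numpy as np

def gaussian_kl_diag(m_q, v_q, m_p, v_p):
    # KL( N(m_q, diag(v_q)) || N(m_p, diag(v_p)) ) in closed form, summed
    # over dimensions; cheap to evaluate when the covariances are diagonal.
    return 0.5 * np.sum(np.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)
```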
[0183] To compute the auto-encoding portion of the ELBO, the system
propagates up the convolutional encoder and down the
deconvolutional decoder of the approximating posterior, to compute
the parameters of the approximating posterior. In an example
parameterization, this can compute the conditional approximating
posterior of the n.sup.th layer based on both the n.sup.th layer of
the convolutional encoder, and the preceding (n-1).sup.th layer of
the deconvolutional decoder of the approximating posterior. In
principle, the approximating posterior of the n.sup.th layer may be
based upon the input, the entire convolutional encoder, and layers
i.ltoreq.n of the deconvolutional decoder of the approximating
posterior (or a subset thereof).
[0184] The configuration sampled from the approximating posterior
is then used in a pass down the deconvolutional decoder of the
prior. If the approximating posterior is defined over the primary
units of the deconvolutional network, then the signal from the
(n-1).sup.th layer to the n.sup.th layer is determined by the
approximating posterior for the (n-1).sup.th layer, independent of
the preceding layers of the prior. If the approach uses auxiliary
random variables, the sample from the n.sup.th layer depends on the
(n-1).sup.th layer of the deconvolutional decoder of the prior, and
the n.sup.th layer of the approximating posterior.
[0185] This approach can be extended to arbitrary numbers of
layers, and to posteriors and priors that condition on more than
one preceding layer, e.g. where layer n is conditioned on all
layers m<n preceding it.
[0186] The approximating posterior and the prior can be defined to
be fully autoregressive directed graphical models.
[0187] The directed graphical models of the approximating posterior
and prior can be defined as follows:
$$q(\xi_1,\ldots,\xi_n\,|\,x,\phi) = \prod_{1\leq m\leq n} q(\xi_m\,|\,\xi_{l<m},x,\phi)$$
$$p(\xi_1,\ldots,\xi_n\,|\,\theta) = \prod_{1\leq m\leq n} p(\xi_m\,|\,\xi_{l<m},\theta)$$
where the entire RBM and its associated continuous latent variables
are now denoted by ξ_1 = {z_1, ζ_1, . . . , z_k, ζ_k}. This builds
an approximating posterior and prior by conditioning the more local
variables of layer m on the more global variables of layers
m-1, . . . , 1. However, the conditional distribution in
p(ξ_1, . . . , ξ_n|θ) only depends on the continuous ζ_j.
[0188] FIG. 6 is a schematic diagram illustrating an example
implementation of a variational auto-encoder (VAE) with a hierarchy
of continuous latent variables with an approximating posterior 610
and a prior 620.
[0189] Each .xi..sub.m>1 in approximating posterior 610 and
prior 620, respectively, denotes a layer of continuous latent
variables and is conditioned on the layers preceding it. In the
example implementation of FIG. 6, there are three levels of
hierarchy.
[0190] Alternatively, the approximating posterior can be made
hierarchical, as follows:
$$p(\xi_1,\ldots,\xi_n\,|\,\theta) = \prod_{1\leq m\leq n} p(\xi_m\,|\,\theta)$$
[0191] The ELBO decomposes as
$$\begin{aligned}
\mathcal{L}(x,\theta,\phi) &= \log p(x|\theta) - KL\left[\prod_m q(\xi_m|\xi_{l<m},x,\phi)\,\Big\|\,\prod_m p(\xi_m|\xi_{l<m},x,\theta)\right]\\
&= \mathbb{E}_{\prod_m q(\xi_m|\xi_{l<m},x,\phi)}\log\!\left[\frac{p(x|z,\theta)\,\prod_m p(\xi_m|\xi_{l<m},\theta)}{\prod_m q(\xi_m|\xi_{l<m},x,\phi)}\right]\\
&= \mathbb{E}_{\prod_m q(\xi_m|\xi_{l<m},x,\phi)}\left[\log p(x|z,\theta)\right] - \sum_m \mathbb{E}_{\prod_{l<m} q(\xi_l|\xi_{k<l},x,\phi)}\,KL\left[q(\xi_m|\xi_{l<m},x,\phi)\,\|\,p(\xi_m|\xi_{l<m},\theta)\right] \quad (4)\\
&= \mathbb{E}\left[\log p(x|z,\theta)\right] - \sum_m KL\left[q(\xi_m|\xi_{l<m},x,\phi)\,\|\,p(\xi_m|\xi_{l<m},\theta)\right]
\end{aligned}$$
[0192] In the case where both q(.xi.m|.xi.l<m, x, .PHI.) and
p(.xi.m|.xi.l<m, .theta.) are Gaussian distributions, the
KL-divergence can be computationally efficient, and the gradient of
the last term in Equation 4 with respect to q(.xi..sub.n-1|x,
.PHI.) can be obtained by reparametrizing, as commonly done in a
traditional VAE. In all cases, a stochastic approximation to the
gradient of the ELBO can be computed via one pass down
approximating posterior 610, sampling from each continuous latent
.zeta..sub.i and .xi..sub.m>1 in turn, and another pass down
prior 620, conditioned on the samples from the approximating
posterior. In the pass down the approximating posterior, samples at
each layer n may be based upon both the input and all the preceding
layers m<n. To compute the auto-encoding portion of the ELBO,
p(x|ξ) can be applied from the prior to the sample from the
approximating posterior.
[0193] The pass down the prior need not pass signal from layer to
layer. Rather, the input to each layer can be determined by the
approximating posterior using Equation 4.
[0194] The KL-divergence is then taken between the approximating
posterior and true prior at each layer, conditioned on the layers
above. Re-parametrization can be used to include
parameter-dependent terms into the KL-divergence term.
[0195] Both the approximating posterior and the prior distribution
of each layer ξ_{m>1} are defined by neural networks, the inputs of
which are ζ, ξ_{1≤l<m}, and x in the case of the approximating
posterior. The outputs of these networks are the mean and variance
of a diagonal-covariance Gaussian distribution.
[0196] To ensure that all the units in the RBM are sometimes active
and sometimes inactive, and thus that all units in the RBM are
used, when calculating the approximating posterior over the RBM
units, rather than using traditional batch normalization, the
system bases the batch normalization on the L1 norm. In an
alternative approach, the system may base the batch normalization
on the L2 norm.
[0197] Specifically, the system may use:
$$y = x - \bar{x}$$
$$x_{bn} = \frac{y}{\overline{|y|}+\epsilon}\odot s + o$$
and bound 2 ≤ s ≤ 3 and -s ≤ o ≤ s.
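A minimal sketch of this L1-based batch normalization is given below;
it assumes the bar denotes a mean over the mini-batch, and the function
name and array layout are assumptions of the example.

```python
import numpy as np

def l1_batch_norm(x, s, o, eps=1e-5):
    # Batch normalization based on the L1 norm: centre each unit over the
    # batch and divide by the mean absolute deviation rather than the
    # usual L2-based standard deviation.  x has shape (batch, units);
    # s and o are per-unit scale and offset, with 2 <= s <= 3 and
    # -s <= o <= s per paragraph [0197].
    y = x - x.mean(axis=0)
    return y / (np.abs(y).mean(axis=0) + eps) * s + o
```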
[0198] ISTA-Like Generative Model
[0199] The training of variational auto-encoders is typically
limited by the form of the approximating posterior. However, there
can be challenges using an approximating posterior other than a
factorial posterior. The entropy of the approximating posterior,
which constitutes one of the components of the KL-divergence
between the approximating and true posterior (or true prior), can
be trivial if the approximating posterior is factorial, and close
to intractable if it is a mixture of factorial distributions. While
one might consider using normalizing flows, importance weighting,
or other methods to allow non-factorial approximating posteriors,
it may be easier to change the model to make the true posterior
more factorial.
[0200] In particular, with large numbers of latent variables, it
may be desirable to use a sparse, over-complete representation. In
such a representation, there are many ways of representing a given
input, although some will be more probable than others. At the same
time, the model is sensitive to duplicate representations. Using
two latent variables that represent similar features is not
equivalent to using just one.
[0201] A similar problem arises in models with linear decoders and
a sparsity prior; i.e., sparse coding. ISTA (and LISTA) address
this by (approximately) following the gradient (with proximal
descent) of the L1-regularized reconstruction error. The resulting
transformation of the hidden representation is mostly linear in the
input and the hidden representation:
$$z \leftarrow (I - \tilde{W}^{T}\tilde{W})z - \tilde{\lambda}\,\mathrm{sign}(z) + \tilde{W}^{T}x$$
[0202] Note, though, that the input must be provided to every
layer.
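For illustration, one ISTA-style update is sketched below; it uses a
standard soft-thresholding proximal step rather than the sign-based
form written above, and the step size and function name are assumptions
of the example.

```python
import numpy as np

def ista_step(z, x, W, lam, step=1.0):
    # One ISTA-style update for the L1-regularized reconstruction error
    # 0.5 * ||x - W z||^2 + lam * ||z||_1: a gradient step on the
    # reconstruction term followed by soft-thresholding (the proximal
    # operator of the L1 penalty).
    z = z - step * (W.T @ (W @ z - x))
    return np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```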
[0203] A somewhat similar approach can be employed in the
deconvolutional decoder of the approximating posterior. Consider the
the case where the conditional approximating posterior of layer
z.sub.n given layer z.sub.n-1 is computed by a multi-layer
deterministic network. Rather than making a deterministic
transformation of the input available to the first layer of this
network, the system can instead provide the deterministic
transformation of the input to the internal layers, or any subset
of the internal layers. The approximating posterior over the final
Gaussian units may then employ sparse coding via LISTA, suppressing
redundant higher-level units, and thus allowing factorial
posteriors where more than one unit coding for a given feature may
be active. In the prior pathway, there is no input to govern the
disambiguation between redundant features, so the winner-take-all
selection must be achieved via other means, and a more conventional
deep network may be sufficient.
[0204] Combination with Discrete Variational Auto-Encoder
[0205] The discrete variational auto-encoder can also be
incorporated into a convolutional auto-encoder. It is possible to
put a discrete VAE on the very top of the prior, where it can
generate multi-modal distributions that then propagate down the
deconvolutional decoder, readily allowing the production of more
sophisticated multi-modal distributions. If using ancillary random
variables, it would also be straightforward to include discrete
random variables at every layer.
[0206] Hierarchical Approximating Posteriors
[0207] True posteriors can be multi-modal. Multiple plausible
explanations for an observation can lead to a multi-modal
posterior. In one implementation, a quantum processor can employ a
Chimera topology. The Chimera topology is described above in
reference to FIGS. 2A, 2B, and 2C. Traditional VAEs typically use a
factorial approximating posterior. As a result, traditional VAEs
have difficulty capturing correlations between latent
variables.
[0208] One approach is to refine the approximating posterior
automatically. This approach can be complex. Another, generally
simpler, approach is to make the approximating posterior
hierarchical. A benefit of this approach is that it can capture any
distribution, or at least a wider range of distributions.
[0209] FIG. 7 shows a method 700 for unsupervised learning via a
hierarchical variational auto-encoder (VAE), in accordance with the
present systems, devices, articles and methods. Method 700 may be
implemented as an extension of method 400 employing a hierarchy of
random variables.
[0210] Method 700 starts at 705, for example in response to a call
from another routine or other invocation.
[0211] At 710, the system initializes the model parameters with
random values, as described above with reference to 410 of method
400.
[0212] At 715, the system tests to determine if a stopping
criterion has been reached, as described above with reference to
415 of method 400.
[0213] In response to determining the stopping criterion has been
reached, the system ends method 700 at 775, until invoked again,
for example, by a request to repeat the learning.
[0214] In response to determining the stopping criterion has not
been reached, the system, at 720, fetches a mini-batch of the
training data set.
[0215] At 722, the system divides the latent variables z into
disjoint groups z.sub.1, . . . , z.sub.k and the corresponding
continuous latent variables .zeta. into disjoint groups
.zeta..sub.1, . . . .zeta..sub.k.
[0216] At 725, the system propagates the training data set through
the encoder to compute the full approximating posterior over
discrete z_j. As mentioned before, this hierarchical approximation
does not alter the form of the gradient of the auto-encoding term.
[0217] At 730, the system generates or causes generation of samples
from the approximating posterior over n layers of continuous
variables .zeta..sub.j, given the full distribution over z. The
number of layers n may be 1 or more.
[0218] At 735, the system propagates the samples through the
decoder to compute the distribution over the input, as described
above with reference to 435 of method 400.
[0219] At 740, the system performs backpropagation through the
decoder, as described above with reference to 440 of method 400.
[0220] At 745, the system performs backpropagation through the
sampler over the approximating posterior over ζ, as described
above with reference to 445 of method 400.
[0221] At 750, the system computes the gradient of the
KL-divergence between the approximating posterior and the true
prior over z, as described above with reference to 450 of method
400.
[0222] At 755, the system performs backpropagation through the
encoder, as described above with reference to 455 of method 400.
[0223] At 757, the system determines a gradient of a KL-divergence,
with respect to parameters of the true prior distribution, between
the approximating posterior and the true prior distribution over
the discrete space.
[0224] At 760, the system determines at least one of a gradient or
at least a stochastic approximation of a gradient, of a bound on
the log-likelihood of the input data.
[0225] In some embodiments, the system generates samples or causes
samples to be generated by a quantum processor, as described above
with reference to 460 of method 400.
[0226] At 765, the system updates the model parameters based at
least in part on the gradient, as described above with reference to
465 of method 400.
[0227] At 770, the system tests to determine if the current
mini-batch is the last mini-batch to be processed, as described
above with reference to 470 of method 400. In some implementations,
act 770 is omitted, and control passes directly to 715 from 765.
The decision whether to fetch another mini-batch can be
incorporated in 715. In response to determining that the current
mini-batch is the last mini-batch to be processed, the system
returns control to 715. In response to determining that the current
mini-batch is not the last mini-batch to be processed, the system
returns control to 720.
[0228] In summary, and as described in more detail above, method 700 renders the approximating posterior hierarchical over the discrete latent variables. In addition, method 700 adds a hierarchy of continuous latent variables below them.
[0229] Computing the Gradients of the KL Divergence
[0230] The remaining component of the loss function can be
expressed as follows:
$$-\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] = \sum_z q(z|x,\phi)\left[\log p(z|\theta) - \log q(z|x,\phi)\right]$$
[0231] In some implementations, such as when the samples are
generated using an example embodiment of a quantum processor, the
prior distribution is a Restricted Boltzmann Machine (RBM), as
follows:
$$p(z|\theta) = \frac{e^{-E_p(z,\theta)}}{\mathcal{Z}_p}, \quad\text{where}\quad E_p(z) = -z^T J z - h^T z \quad\text{and}\quad \mathcal{Z}_p = \sum_z e^{-E_p(z,\theta)}$$
where $z \in \{0,1\}^n$, $\mathcal{Z}_p$ is the partition function, and the lateral connection matrix $J$ is bipartite and very sparse. The prior distribution described by the above equation contains strong correlations, and the present computational system can use a hierarchical approximating posterior.
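For illustration only, the following Python sketch (using NumPy, with hypothetical array names and shapes) evaluates this RBM energy and the corresponding unnormalized log-probability; it is a minimal example, not the claimed implementation, and the partition function is not computed.

import numpy as np

def rbm_energy(z, J, h):
    # E_p(z) = -z^T J z - h^T z, for a binary configuration z in {0,1}^n
    return -z @ J @ z - h @ z

def unnormalized_log_prob(z, J, h):
    # log of the numerator e^{-E_p(z)}; the partition function Z_p is intractable in general
    return -rbm_energy(z, J, h)

# example: a 4-unit RBM with a sparse bipartite coupling matrix (illustrative values)
n = 4
J = np.zeros((n, n))
J[0, 2], J[1, 3] = 0.5, -0.3   # couplings only between the two halves
h = np.array([0.1, -0.2, 0.0, 0.4])
z = np.array([1, 0, 1, 1])
print(unnormalized_log_prob(z, J, h))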
[0232] The present method divides the latent variables into two groups and defines the approximating posterior via a directed acyclic graphical model over the two groups $z_a$ and $z_b$, as follows:
$$q(z|x,\phi) = \frac{e^{-E_a(z_a|x,\phi)}}{\mathcal{Z}_a(x)}\,\frac{e^{-E_{b|a}(z_b|z_a,x,\phi)}}{\mathcal{Z}_{b|a}(z_a,x)}, \quad\text{where}$$
$$E_a(z_a|x) = -g_a(x)^T z_a$$
$$E_{b|a}(z_b|z_a,x) = -g_{b|a}(x,z_a)^T z_b$$
$$\mathcal{Z}_a(x) = \sum_{z_a} e^{-E_a(z_a|x,\phi)} = \prod_{i\in a}\left(1 + e^{g_{a_i}(x)}\right)$$
$$\mathcal{Z}_{b|a}(x,z_a) = \sum_{z_b} e^{-E_{b|a}(z_b|z_a,x,\phi)} = \prod_{i\in b}\left(1 + e^{g_{b_i|a}(x,z_a)}\right)$$
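As a rough Python sketch (not the claimed implementation), a sample from this two-group hierarchical posterior can be drawn by evaluating hypothetical encoder functions g_a and g_{b|a} that produce logits, sampling z_a from q_a, and then sampling z_b from q_{b|a} conditioned on the sampled z_a:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hierarchical_posterior_sample(x, g_a, g_b_given_a, rng):
    # q_a(z_a = 1 | x) is factorial with logits g_a(x)
    qa = sigmoid(g_a(x))
    z_a = (rng.random(qa.shape) < qa).astype(float)
    # q_{b|a}(z_b = 1 | z_a, x) is factorial given the sampled z_a
    qb = sigmoid(g_b_given_a(x, z_a))
    z_b = (rng.random(qb.shape) < qb).astype(float)
    return z_a, z_b, qa, qb

# toy usage with hypothetical linear "encoder" functions
rng = np.random.default_rng(0)
g_a = lambda x: x @ np.ones((3, 2))
g_b_given_a = lambda x, z_a: x @ np.ones((3, 2)) + z_a @ np.ones((2, 2))
z_a, z_b, qa, qb = hierarchical_posterior_sample(np.array([0.2, -0.1, 0.5]), g_a, g_b_given_a, rng)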
[0233] The gradient of $-\mathrm{KL}[q(z|x,\phi)\,\|\,p(z|\theta)]$ with respect to the parameters $\theta$ of the prior can be estimated stochastically using samples from the approximating posterior $q(z|x) = q_a(z_a|x)\,q_{b|a}(z_b|z_a,x)$ and the true prior, as follows:
$$-\frac{\partial}{\partial\theta}\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] = -\sum_z q(z|x,\phi)\,\frac{\partial E_p(z,\theta)}{\partial\theta} + \sum_{z'} p(z'|\theta)\,\frac{\partial E_p(z',\theta)}{\partial\theta}$$
$$= -\mathbb{E}_{q_a(z_a|x,\phi)}\!\left[\mathbb{E}_{q_{b|a}(z_b|z_a,x,\phi)}\!\left[\frac{\partial E_p(z,\theta)}{\partial\theta}\right]\right] + \mathbb{E}_{p(z|\theta)}\!\left[\frac{\partial E_p(z,\theta)}{\partial\theta}\right]$$
[0234] The expectation with respect to $q_{b|a}(z_b|z_a,x,\phi)$ can be performed analytically; the expectation with respect to $q_a(z_a|x,\phi)$ requires samples from the approximating posterior. Similarly, for the prior, sampling is from the native distribution of the quantum processor. Rao-Blackwellization can be used to marginalize half of the units. Samples from the same prior distribution are used for a mini-batch, independent of the samples chosen from the training dataset.
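Since $\partial E_p/\partial J = -zz^T$ and $\partial E_p/\partial h = -z$ for the RBM energy above, the stochastic estimate reduces to a difference of positive-phase and negative-phase statistics. A minimal NumPy sketch, assuming z_pos holds samples from the approximating posterior and z_neg holds samples from the prior (for example, drawn by the quantum processor), is shown below; the array names are illustrative.

import numpy as np

def kl_grad_wrt_prior(z_pos, z_neg):
    # z_pos: samples from the approximating posterior, shape [num_samples, n]
    # z_neg: samples from the prior, shape [num_samples, n]
    # -dKL/dJ = E_q[z z^T] - E_p[z z^T];  -dKL/dh = E_q[z] - E_p[z]
    grad_J = z_pos.T @ z_pos / len(z_pos) - z_neg.T @ z_neg / len(z_neg)
    grad_h = z_pos.mean(axis=0) - z_neg.mean(axis=0)
    return grad_J, grad_h

rng = np.random.default_rng(0)
z_pos = (rng.random((100, 4)) < 0.6).astype(float)
z_neg = (rng.random((100, 4)) < 0.4).astype(float)
grad_J, grad_h = kl_grad_wrt_prior(z_pos, z_neg)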
[0235] The gradient of $-\mathrm{KL}[q(z|x,\phi)\,\|\,p(z|\theta)]$ with respect to the parameters $\phi$ of the approximating posterior does not depend on the partition function $\mathcal{Z}_p$ of the prior, since:
$$\mathrm{KL}(q\,\|\,p) = \sum_z\left(q\log q - q\log p\right) = \sum_z\left(q\log q + q\,E_p + q\log\mathcal{Z}_p\right) = \sum_z\left(q\log q + q\,E_p\right) + \log\mathcal{Z}_p$$
[0236] Consider a case where $q$ is hierarchical, with $q = q_a\,q_{b|a}\cdots$. Because the random variables are fundamentally continuous after marginalizing out the discrete random variables, the re-parameterization technique can be used to backpropagate through $\prod_{j<i} q_{j|k<j}$.
[0237] The entropy term of the KL divergence is then:
$$H(q) = \sum_z q\log q = \sum_z\left(\prod_i q_{i|j<i}\right)\left(\sum_i \log q_{i|k<i}\right) = \sum_i\sum_z\left(\prod_{j\le i} q_{j|k<j}\right)\log q_{i|k<i}$$
$$= \sum_i\mathbb{E}_{z_{j<i}\sim\prod_{j<i}q_{j|k<j}}\!\left[\sum_{z_i} q_{i|k<i}\log q_{i|k<i}\right] = \sum_i\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} q_{i|\rho_{k<i}}\log q_{i|\rho_{k<i}}\right]$$
where indices $i$, $j$, and $k$ denote hierarchical groups of variables. The probability $q_{i|\rho_{k<i}}(z_i)$ is evaluated analytically, whereas all variables $k<i$ are sampled stochastically via $\rho_{k<i}$. Taking the gradient of $H(q)$ in the above equation and using the identity:
$$\mathbb{E}_q\!\left[c\,\frac{\partial}{\partial\phi}\log q\right] = c\sum_z q\,\frac{\partial q/\partial\phi}{q} = c\,\frac{\partial}{\partial\phi}\left(\sum_z q\right) = 0$$
for a constant $c$, allows elimination of the gradient of $\log q_{i|\rho_{k<i}}$ in the earlier equation, obtaining:
$$\frac{\partial}{\partial\phi}H(q) = \sum_i\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i}\left(\frac{\partial}{\partial\phi}q_{i|\rho_{k<i}}\right)\log q_{i|\rho_{k<i}}\right]$$
[0238] Moreover, elimination of a log-partition function in $\log q_{i|\rho_{k<i}}$ is achieved by an analogous argument. By repeating this argument one more time, $\partial q_{i|\rho_{k<i}}/\partial\phi$ can be broken into its factorial components. If $q_{i|\rho_{k<i}}$ is a logistic function of the input and $z_i \in \{0,1\}$, the gradient of the entropy reduces to:
$$\frac{\partial}{\partial\phi}H(q) = \sum_i\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{l\in i}\sum_{z_l} q_i(z_i)\left(z_i\frac{\partial g_l}{\partial\phi} - \sum_{z_l}\left(q_l(z_l)\,z_l\frac{\partial g_i}{\partial\phi}\right)\right)g_l z_l\right]$$
$$= \sum_i\mathbb{E}_{\rho_{k<i}}\!\left[\frac{\partial g_i^T}{\partial\phi}\left(g_i\odot\left[q_i(z_i{=}1) - q_i^2(z_i{=}1)\right]\right)\right]$$
where $l$ and $z_l$ correspond to single variables within the hierarchical groups denoted by $i$. In TensorFlow, it might be simpler to write:
$$\frac{\partial}{\partial\phi}H(q) = \sum_i\mathbb{E}_{\rho_{k<i}}\!\left[\frac{\partial q_i^T(z_i{=}1)}{\partial\phi}\,g_i\right]$$
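For example, one way to realize this in TensorFlow is to build a surrogate scalar whose gradient matches the expression above, holding the logits fixed with tf.stop_gradient; this is an illustrative sketch under the assumption that q and g are tensors produced by the encoder, not the claimed implementation.

import tensorflow as tf

def entropy_gradient_surrogate(q, g):
    # q: q_i(z_i = 1) for each latent unit, shape [batch, n_latent]
    # g: the corresponding logits g_i produced by the encoder
    # Differentiating this scalar with respect to the encoder parameters phi yields
    # a Monte Carlo estimate of sum_i E_rho[(dq_i(z_i=1)/dphi)^T g_i],
    # because g is treated as a constant by tf.stop_gradient.
    return tf.reduce_mean(tf.reduce_sum(q * tf.stop_gradient(g), axis=-1))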
[0239] The remaining cross-entropy term is:
$$\sum_z q\,E_p = -\mathbb{E}_\rho\!\left[z^T J z + h^T z\right]$$
[0240] The term $h^T z$ can be handled analytically, since $z_i \in \{0,1\}$, and
$$\mathbb{E}_\rho\!\left[h^T z\right] = h^T\,\mathbb{E}_\rho\!\left[q(z{=}1)\right]$$
[0241] The approximating posterior q is continuous in this case,
with non-zero derivative, so the re-parameterization technique can
be applied to backpropagate gradients:
$$\frac{\partial}{\partial\phi}\mathbb{E}_\rho\!\left[h^T z\right] = h^T\,\mathbb{E}_\rho\!\left[\frac{\partial}{\partial\phi}q(z{=}1)\right]$$
[0242] In contrast, each element of the sum:
$$z^T J z = \sum_{i,j} J_{ij}\,z_i z_j$$
depends upon variables which are not usually in the same
hierarchical level, so, in general:
$$\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] \neq J_{ij}\,\mathbb{E}_\rho[z_i]\,\mathbb{E}_\rho[z_j]$$
[0243] This term can be decomposed into:
$$\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[z_i\,\mathbb{E}\!\left[z_j \mid \rho_{k<i}\right]\right]$$
where, without loss of generality, $z_i$ is in a higher hierarchical layer than $z_j$. It can be challenging to take the derivative of $z_i$ because it is a discontinuous function of $\rho_{k<i}$.
[0244] Direct Decomposition of $\partial(J_{ij}\,z_i z_j)/\partial\phi$
[0245] The re-parameterization technique initially makes $z_i$ a function of $\rho$ and $\phi$. However, it is possible to marginalize over values of the re-parameterization variables $\rho$ for which $z$ is consistent, thereby rendering $z_i$ a constant. Assuming, without loss of generality, that $i<j$, $\mathbb{E}_\rho[J_{ij}\,z_i z_j]$ can be expressed as follows:
$$\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\mathbb{E}_{z_i\sim q_{i|\rho_{k<i},\phi}}\!\left[z_i(\rho,\phi)\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{k>i}}\!\left[z_j(\rho_{z_i},\phi)\right]\right]\right]\right]$$
$$= J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\,q_i(z_i{=}1\,|\,\rho_{k<i},\phi)\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j} z_j\,q_j(z_j{=}1\,|\,\rho_{z_i,k<j},\phi)\right]\right]\right]$$
$$= J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} z_i\,q_i(z_i{=}1\,|\,\rho_{k<i},\phi)\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[q_j(z_j{=}1\,|\,\rho_{z_i,k<j},\phi)\right]\right]\right]$$
[0246] The quantity $q_j(z_j{=}1\,|\,\rho_{z_i,k<j},\phi)$ is not directly a function of the original $\rho$, since $\rho_i$ is sampled from the distribution conditioned on the value of $z_i$. It is this conditioning that coalesces $q_j(z_j{=}1\,|\,\rho_{z_i,k<j},\phi)$, which should be differentiated.
[0247] With $z_i$ fixed, sampling from $\rho_i$ is equivalent to sampling from $\zeta_i|z_i$. In particular, $\rho_i$ is not a function of $q_{k<i}$, or of parameters from previous layers. Combining this with the chain rule, $\zeta_i$ can be held fixed when differentiating $q_j$, with gradients not backpropagating from $q_j$ through $\zeta_i$.
[0248] Using the chain rule, the term due to the gradient of $q_i(z_i|\rho_{k<i},\phi)$ is:
$$\frac{\partial}{\partial\phi}\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i}\frac{\partial q_i(z_i{=}1)}{\partial\phi}\,z_i\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j} q_j(z_j{=}1\,|\,\rho_{k<j},\phi)\,z_j\right]\right]\right]$$
$$= J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\mathbb{E}_{z_i\sim q_i}\!\left[\frac{\partial q_i(z_i{=}1)/\partial\phi\;z_i}{q_i(z_i{=}1\,|\,\rho_{k<i},\phi)}\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{k>i}}\!\left[z_j(\rho,\phi)\right]\right]\right]\right]$$
$$= \mathbb{E}_\rho\!\left[J_{ij}\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\,\frac{z_i(\rho,\phi)}{q_i(z_i{=}1\,|\,\rho_{k<i},\phi)}\,z_j(\rho,\phi)\right]$$
$$= \mathbb{E}_\rho\!\left[J_{ij}\,\frac{z_i(\rho,\phi)}{q_i(z_i{=}1)}\,q_j(z_j{=}1)\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\right]$$
where, in the second line, we reintroduce sampling over $z_i$, but reweight the samples so the expectation is unchanged.
[0249] The term due to the gradient of $q_j(z_j|\rho,\phi)$ is:
$$\frac{\partial}{\partial\phi}\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = J_{ij}\,\mathbb{E}_{\rho_{k<i}}\!\left[\sum_{z_i} q_i(z_i\,|\,\rho_{k<i},\phi)\,z_i\,\mathbb{E}_{\rho_i|z_i}\!\left[\mathbb{E}_{\rho_{i<k<j}}\!\left[\sum_{z_j}\frac{\partial q_j}{\partial\phi}\,z_j\right]\right]\right]$$
$$= J_{ij}\,\mathbb{E}_{\rho_{k<j}}\!\left[\mathbb{E}_{z_j\sim q_{j|\rho_{k<j},\phi}}\!\left[\frac{z_i(\rho,\phi)\,z_j(\rho,\phi)}{q_j(z_j\,|\,\rho_{k<j},\phi)}\,\frac{\partial q_j}{\partial\phi}\right]\right]$$
$$= \mathbb{E}_\rho\!\left[J_{ij}\,\frac{z_i(\rho,\phi)\,z_j(\rho,\phi)}{q_j(z_j{=}1)}\,\frac{\partial q_j}{\partial\phi}\right]$$
[0250] For both $z_i$ and $z_j$, the derivative with respect to $q(z{=}0)$ can be ignored, since it is scaled by $z=0$. Once again, gradients can be prevented from backpropagating through $\zeta_i$. The sum is taken over $z_i$, and then the expectation of $\rho_i$ conditioned on the chosen value of $z_i$. As a result, $q_j(z_j{=}1\,|\,\rho_{z_i,k<j},\phi)$ depends upon $\zeta_i$ being fixed, independent of the preceding $\rho$ and $\phi$ in the hierarchy.
[0251] Further marginalize over $z_j$ to obtain:
$$\frac{\partial}{\partial\phi}\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = \mathbb{E}_\rho\!\left[J_{ij}\,z_i\,\frac{\partial q_j(z_j{=}1)}{\partial\phi}\right]$$
[0252] Decomposition of $\partial(J_{ij}\,z_i z_j)/\partial\phi$ Via the Chain Rule
[0253] In another approach, the gradient of $\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right]$ can be decomposed using the chain rule. Previously, $z$ has been considered to be a function of $\rho$ and $\phi$. Instead, $z$ can be formulated as a function of $q(z{=}1)$ and $\rho$, where $q(z{=}1)$ is itself a function of $\rho$ and $\phi$. Specifically,
$$z_i\!\left(q_i(z_i{=}1),\rho_i\right) = \begin{cases} 0 & \text{if } \rho_i < 1 - q_i(z_i{=}1) = q_i(z_i{=}0) \\ 1 & \text{otherwise} \end{cases}$$
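A minimal NumPy sketch of this step-function view, with illustrative names, recovers z_i from q_i(z_i=1) and a uniform sample rho_i as follows:

import numpy as np

def z_from_q_and_rho(q_z1, rho):
    # z_i = 0 if rho_i < 1 - q_i(z_i=1) = q_i(z_i=0), and 1 otherwise
    return (rho >= 1.0 - q_z1).astype(float)

rng = np.random.default_rng(1)
q_z1 = np.array([0.9, 0.2, 0.5])
rho = rng.random(3)
print(z_from_q_and_rho(q_z1, rho))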
[0254] The chain rule can be used to differentiate with respect to $q(z{=}1)$, since it allows pulling part of the integral over $\rho$ inside the derivative with respect to $\phi$.
[0255] Expanding the desired gradient using the re-parameterization technique and the chain rule gives:
$$\frac{\partial}{\partial\phi}\mathbb{E}_q\!\left[J_{ij}\,z_i z_j\right] = \frac{\partial}{\partial\phi}\mathbb{E}_\rho\!\left[J_{ij}\,z_i z_j\right] = \mathbb{E}_\rho\!\left[\sum_k \frac{\partial\left(J_{ij}\,z_i z_j\right)}{\partial q_k(z_k{=}1)}\,\frac{\partial q_k(z_k{=}1)}{\partial\phi}\right]$$
[0256] The order of integration (via the expectation) and differentiation can be changed. Although $z(q,\rho)$ is a step function, and its derivative is a delta function, the integral of its derivative is finite. Rather than dealing with generalized functions directly, the definition of the derivative can be applied, pushing through the matching integral to recover a finite quantity. For simplicity, the sum over $k$ can be pulled out of the expectation in the above equation, and each summand considered independently.
[0257] Since $z_i$ is only a function of $q_i$, terms in the sum over $k$ in the above equation vanish except $k=i$ and $k=j$. Without loss of generality, consider the term $k=i$; the term $k=j$ is symmetric. Applying the definition of the gradient to one of the summands, and then analytically taking the expectation with respect to $\rho_i$, obtains:
$$\mathbb{E}_\rho\!\left[\frac{\partial\left(J_{ij}\,z_i(q,\rho)\,z_j(q,\rho)\right)}{\partial q_i(z_i{=}1)}\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\right]$$
$$= \mathbb{E}_\rho\!\left[\lim_{\delta q_i(z_i{=}1)\to 0}\frac{J_{ij}\,z_i(q+\delta q_i,\rho)\,z_j(q+\delta q_i,\rho) - J_{ij}\,z_i(q,\rho)\,z_j(q,\rho)}{\delta q_i(z_i{=}1)}\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\right]$$
$$= \mathbb{E}_{\rho_{k\neq i}}\!\left[\lim_{\delta q_i(z_i{=}1)\to 0}\delta q_i\,\frac{J_{ij}\cdot 1\cdot z_j(q,\rho) - J_{ij}\cdot 0\cdot z_j(q,\rho)}{\delta q_i(z_i{=}1)}\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\,\Bigg|_{\rho_i = q_i(z_i{=}0)}\right]$$
$$= \mathbb{E}_{\rho_{k\neq i}}\!\left[J_{ij}\,z_j(q,\rho)\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\,\Bigg|_{\rho_i = q_i(z_i{=}0)}\right]$$
[0258] Since $\rho_i$ is fixed such that $\zeta_i=0$, units further down the hierarchy can be sampled in a manner consistent with this restriction. The gradient is computed with a stochastic approximation by multiplying each sample by $1-z_i$, so that terms with $\zeta_i\neq 0$ can be ignored, and scaling up the gradient when $z_i=0$ by $1/q_i(z_i{=}0)$, as follows:
$$\frac{\partial}{\partial\phi}\mathbb{E}\!\left[J_{ij}\,z_i z_j\right] = \mathbb{E}_\rho\!\left[J_{ij}\,\frac{1 - z_i}{1 - q_i(z_i{=}1)}\,z_j\,\frac{\partial q_i(z_i{=}1)}{\partial\phi}\right]$$
[0259] While this corresponds to taking the expectation of the
gradient of the log-probability, it is done for each unit
independently, so the total increase in variance can be modest.
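As an illustrative sketch (not the claimed implementation), this estimator can be written as a TensorFlow surrogate whose gradient with respect to the encoder parameters reproduces the expression above; the importance weight and z_j are held constant with tf.stop_gradient.

import tensorflow as tf

def cross_term_surrogate(J_ij, z_i, z_j, q_i_z1):
    # The gradient of this scalar with respect to the encoder parameters phi is an
    # estimate of E_rho[ J_ij * (1 - z_i)/(1 - q_i(z_i=1)) * z_j * d q_i(z_i=1)/d phi ],
    # since the weight and z_j are treated as constants by tf.stop_gradient.
    weight = tf.stop_gradient((1.0 - z_i) / (1.0 - q_i_z1))
    return tf.reduce_mean(J_ij * weight * tf.stop_gradient(z_j) * q_i_z1)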
[0260] Alternative Approach
[0261] An alternative approach is to take the gradient of the
expectation using the gradient of log-probabilities over all
variables:
$$\frac{\partial}{\partial\phi}\mathbb{E}\!\left[J_{ij}\,z_i z_j\right] = \mathbb{E}_q\!\left[J_{ij}\,z_i z_j\,\frac{\partial}{\partial\phi}\log q\right] = \mathbb{E}_{q_1,q_{2|1},\ldots}\!\left[J_{ij}\,z_i z_j\sum_k\frac{\partial}{\partial\phi}\log q_{k|\kappa<k}\right]$$
$$= \mathbb{E}_{q_1,q_{2|1},\ldots}\!\left[J_{ij}\,z_i z_j\sum_k\frac{1}{q_{k|\kappa<k}}\,\frac{\partial q_{k|\kappa<k}}{\partial\phi}\right]$$
For the gradient term on the right-hand side, terms involving only $z_{\kappa<k}$ that occur hierarchically before $k$ can be dropped, since those terms can be pulled out of the expectation over $q_k$. However, for terms involving $z_{\kappa>k}$ that occur hierarchically after $k$, the expected value of $z_{\kappa>k}$ depends upon the chosen value of $z_k$.
[0262] Generally, no single term in the sum is expected to have a
particularly high variance. However, the variance of the estimate
is proportional to the number of terms, and the number of terms
contributing to each gradient can grow quadratically with the
number of units in a bipartite model, and linearly in a
Chimera-structured model. In contrast, in the previously described approach, the number of terms contributing to each gradient can grow linearly with the number of units in a bipartite model, and be constant in a Chimera-structured model.
[0263] Introducing a baseline:
$$\mathbb{E}_q\!\left[\left(J_{ij}\,z_i z_j - c(x)\right)\frac{\partial}{\partial\phi}\log q\right]$$
[0264] Non-Factorial Approximating Posteriors Via Ancillary
Variables
[0265] Alternatively, or in addition, a factorial distribution over
discrete random variables can be retained, and made conditional on
a separate set of ancillary random variables.
$$\frac{\partial}{\partial\phi}\left(\sum_z q(z|\alpha)\left(z^T J\,z\right)\right) = \frac{\partial}{\partial\phi}\left(q^T(z{=}1|\alpha)\,J\,q(z{=}1|\alpha)\right)$$
so long as J is bipartite. The full gradient of the KL-divergence
with respect to the parameters of the approximating posterior is
then as follows:
$$\frac{\partial}{\partial\phi}\mathrm{KL}(q\,\|\,p) = \mathbb{E}_\rho\!\left[\left(g - h - \left(J^T + J\right)q(z{=}1)\right)\frac{\partial}{\partial\phi}\,q(z{=}1)\right]$$
[0266] Aside from conditioning the distributions of the approximating posterior on the ancillary random variables $\alpha$, the KL-divergence between the approximating posterior and the true prior of the ancillary variables can be subtracted. The rest of the prior is unaltered, since the ancillary random variables $\alpha$ govern the approximating posterior, rather than the generative model.
[0267] Implementation
[0268] The following can be parameterized:
$$q(z|x,\phi) = \prod_i q_i(z_i|x,\phi)$$
using a feedforward neural network $g(x)$. Each layer $i$ of the neural network $g(x)$ consists of a linear transformation, parameterized by weight matrix $W_i$ and bias vector $b_i$, followed by a pointwise nonlinearity. While intermediate layers can consist of ReLU or soft-plus units, with nonlinearity denoted by $\tau$, the logistic function $\sigma$ can be used as the nonlinearity in the top layer of the encoder to ensure the requisite range $[0,1]$. Parameters for each $q_i(z_i|x,\phi)$ are shared across inputs $x$, and $0 \le g_i(x) \le 1$.
[0269] Similarly, $p(x|\zeta,\theta)$ can be parameterized using another feedforward neural network $f(\zeta)$, with complementary parameterization. If $x$ is binary, $p_i(x_i{=}1|\zeta,\theta) = \sigma(f_i(\zeta))$ can again be used. If $x$ is real, an additional neural network $f'(\zeta)$ can be introduced to calculate the variance of each variable, and an approach analogous to traditional variational auto-encoders can be taken by using $p_i(x_i|\zeta,\theta) = \mathcal{N}(f_i(\zeta), f'_i(\zeta))$. The final nonlinearity of the network $f(\zeta)$ should be linear, and the final nonlinearity of $f'(\zeta)$ should be non-negative.
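The following Python sketch, with assumed layer shapes and NumPy in place of an automatic-differentiation framework, illustrates this parameterization: intermediate layers apply a linear transformation followed by a ReLU nonlinearity, and the top layer of the encoder applies the logistic function; a binary decoder mirrors the structure. It is a minimal example, not the claimed implementation.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def encoder_g(x, weights, biases):
    # intermediate layers: linear transform followed by ReLU;
    # top layer: logistic, so that 0 <= g_i(x) <= 1 as required
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return sigmoid(weights[-1] @ h + biases[-1])

def decoder_f(zeta, weights, biases):
    # for binary x: p_i(x_i = 1 | zeta) = sigmoid(f_i(zeta))
    h = zeta
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return sigmoid(weights[-1] @ h + biases[-1])

# toy usage with small random weights (shapes are illustrative)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
bs = [np.zeros(5), np.zeros(3)]
print(encoder_g(np.array([0.1, -0.2, 0.3, 0.7]), Ws, bs))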
[0270] Algorithm 1 (shown below) illustrates an example
implementation of training a network expressed as pseudocode.
Algorithm 1 describes training a generic network with gradient
descent. In other implementations, other methods could be used to
train the network without loss of generality with respect to the
approach.
[0271] Algorithm 1 establishes the input and output, initializes the model parameters, and then determines whether a stopping criterion has been met. In addition, Algorithm 1 defines the processing of each mini-batch or subset.
[0272] Algorithms 1 and 2 (shown below) comprise pseudocode for binary visible units. Since $J$ is bipartite, $J_q$ can be used to denote the upper-right quadrant of $J$, where the non-zero values reside. Gradient descent is one approach that can be used. In other implementations, gradient descent can be replaced by another technique, such as RMSprop, Adagrad, or Adam.
TABLE-US-00001
Algorithm 1: Train generic network with simple gradient descent
def train():
    Input: A data set X, where X[t, i] is the i-th element, and a learning rate parameter
    Output: Model parameters {W, b, J_q, h}
    Initialize model parameters with random values
    while stopping criterion is not met do
        foreach minibatch X_pos = getMinibatch(X, m) of the training dataset do
            Draw a sample from the approximating posterior: Z_enc, Z_pos, X_out <- posSamples(X_pos)
            Draw a sample from the prior: Z_neg <- negSamples(Z_neg^prev)
            Estimate dL/dtheta using calcGradients(X_pos, Z_enc, Z_pos, Z_neg, X_out)
            Update parameters: theta <- theta + (learning rate) * dL/dtheta
        end
    end
[0273] At first, this approach appears to be caught between two
conflicting constraints when trying to apply the variational
auto-encoder technique to discrete latent representations. On the
one hand, a discrete latent representation does not allow use of
the gradient of the decoder, since the reparametrized latent
representation jumps discontinuously or remains constant as the
parameters of the approximating posterior are changed. On the other hand, $\mathrm{KL}[q(z|x,\phi)\,\|\,p(z|\theta)]$ is only easy to evaluate by remaining in the original discrete space.
[0274] The presently disclosed systems and methods avoid these
problems by symmetrically projecting the approximating posterior
and the prior into a continuous space. The computational system
evaluates the auto-encoder portion of the loss function in the
continuous space, marginalizing out the original discrete latent
representation. At the same time, the computational system
evaluates the KL-divergence between the approximating posterior and
the true prior in the original discrete space, and, owing to the symmetry of the projection into the continuous space, the projection does not contribute to this term.
TABLE-US-00002
Algorithm 2: Helper functions for discrete VAE
L_last <- L_up + L_down
def getMinibatch(X, m):
    k <- k + 1
    X_pos <- X[:, k*m : (k+1)*m]
def posSamples(X_pos):
    Z_0 <- X_pos
    for i <- 1 to L_up - 1 do
        Z_i <- tau(W_{i-1} Z_{i-1} + b_{i-1})
    end
    Z_enc <- W_{L_up-1} Z_{L_up-1} + b_{L_up-1}
    Z_pos <- sigma(Z_enc)
    Z_{L_up} <- G^{-1}(rho), where q'(zeta = 1 | x, phi) = Z_pos and rho ~ U(0, 1)^{n x m}
    for i <- L_up + 1 to L_last - 1 do
        Z_i <- tau(W_{i-1} Z_{i-1} + b_{i-1})
    end
    X_out <- sigma(W_{L_last-1} Z_{L_last-1} + b_{L_last-1})
def negSamples(Z_pos):
    if using D-Wave then
        sample Z_neg from D-Wave using h and J_q
        post-process samples
    else
        if using CD then
            Z_neg <- sample(Z_pos)
        else if using PCD then
            Z_neg initialized to result of last call to negSamples()
        end
        for i <- 1 to n do
            sample "left" half from p(Z_neg[: d/2, :] = 1) = sigma(J_q Z_neg[d/2 :, :] + h[: d/2])
            sample "right" half from p(Z_neg[d/2 :, :] = 1) = sigma(J_q^T Z_neg[: d/2, :] + h[d/2 :])
        end
    end
def calcGradients(X_pos, Z_enc, Z_pos, Z_neg, X_out):
    B_{L_last} <- sigma'(W_{L_last-1} Z_{L_last-1} + b_{L_last-1}) (X_pos / X_out - (1 - X_pos)/(1 - X_out))
    for i <- L_last - 1 to L_up do
        dL/dW_i <- B_{i+1} Z_i^T
        dL/db_i <- B_{i+1} 1
        B_i <- tau'(W_{i-1} Z_{i-1} + b_{i-1}) W_i^T B_{i+1}
    end
    B_pos <- (dG^{-1}(rho)/dq) W_{L_up}^T B_{L_up+1}
    B_KL <- (Z_enc - h - vstack(J_q Z_pos[d/2 :, :], J_q^T Z_pos[: d/2, :])) (Z_pos - Z_pos^2)
    B_{L_up} <- sigma'(W_{L_up-1} Z_{L_up-1} + b_{L_up-1}) (B_pos - B_KL)
    for i <- L_up - 1 to 0 do
        dL/dW_i <- B_{i+1} Z_i^T
        dL/db_i <- B_{i+1} 1
        B_i <- tau'(W_{i-1} Z_{i-1} + b_{i-1}) W_i^T B_{i+1}
    end
    dL/dJ_q <- Z_pos[: d/2, :] Z_pos[d/2 :, :]^T - Z_neg[: d/2, :] Z_neg[d/2 :, :]^T
    dL/dh <- Z_pos 1 - Z_neg 1
Application of DVAE to Quantum Processing with Weak Chains
[0275] The following describes the use of hierarchical
approximating posteriors and weak chains to facilitate the use of a
native architecture of a quantum processor in discrete variational
autoencoders. The native Chimera architecture is used as an
example.
[0276] Undirected graphical models, including restricted Boltzmann
machines (RBMs), can be used in machine learning applications.
Various deep architectures based on RBMs have been implemented.
[0277] A preference in some applications for the use of feedforward
systems, including deep neural networks and acyclic directed
graphical models, is based, at least in part, on their more modest
computational demands relative to RBMs. For example, feedforward
networks typically require computational time that is polynomial in
the number of units. With full connectivity between layers, the
computational complexity of feedforward networks is quadratic in
the size of each layer. With convolutional connections,
computational complexity is linear in the size of each layer.
Systems can use hundreds of layers, each containing thousands of
units, so computational complexity can be a serious
consideration.
[0278] A quantum processor, such as analog computer 104 of FIG. 1,
can use quantum dynamics to natively draw samples from a quantum
Boltzmann distribution. A quantum processor can be used to help
mitigate the computational demands of RBMs. With sufficient
reduction in the sources of noise and error, a quantum processor
can reliably draw samples from QBMs (and/or at least approximate
samples from RBMs) faster than a classical device, such as a
digital computer.
[0279] A shortcoming of existing quantum processor architectures
can be their sparse connectivity. For example, an architecture may
be based on a topology of qubits and coupling devices, the coupling
devices communicatively coupling respective pairs of qubits. In
architectures with sparse connectivity, the number of qubits
communicatively coupled to a selected qubit via coupling devices
can be small compared with the total number of qubits in the
topology.
[0280] FIG. 2A (described above) illustrates an example building
block of a physical topology of a quantum processor comprising cell
200a of qubits and coupling devices. In one implementation, the
qubits and coupling devices are superconducting flux qubits and
superconducting coupling devices. As described above, cell 200a of
FIG. 2A can be replicated and laid out in a grid to form a topology
that is commonly referred to as a Chimera topology. The topology is
described in more detail above with reference to FIGS. 2A, 2B, and
2C.
[0281] The Chimera architecture shown in FIGS. 2A, 2B, and 2C has
at most six connections per qubit, and the connections tend to be strongly clustered, i.e., mainly local. Compared to a full bipartite
topology between two layers of size 512, the Chimera architecture
has a sparsity of less than 1.2%.
[0282] Machine learning techniques based on Boltzmann machines
typically assume full bipartite connectivity between adjacent
layers. When these techniques are applied, without modification, to
sparsely-connected RBMs, performance can be poor. See for example
Dumoulin, V., et al. "On the challenges of physical implementations
of RBMs", arXiv:1312.5258 (2013).
[0283] Sparse connectivity can have a lesser impact if the
connections are scale-invariant and small-world. See for example
Mocanu, D. C., et al. "A topological insight into restricted
Boltzmann machines", arXiv:1604.05978 (2016). These properties need
not be compatible with the physical constraints of a quantum
processor such as a quantum processor based on the Chimera
architecture of FIGS. 2A, 2B, and 2C.
[0284] Dense connectivity can be desirable because datasets of
interest, such as visual images, typically have strong correlations
amongst relatively large groups of variables. The strongest such
correlations can be identified by sparse coding, for example by
using a Gabor filter for image data and a gammatone filter for
audio data. The resulting dictionary elements can span many tens,
or hundreds, of pixels in the visual case, for example. Moreover,
datasets of interest can contain a large number of observed
variables compared to the size of a quantum processor. Even smaller
datasets like MNIST have 28×28=784 observed variables, and require at least a comparable number of hidden variables.
[0285] Dense connectivity can be emulated in a quantum processor by
using groups of qubits communicatively coupled with strong
ferromagnetic connections. These groups of qubits are typically
referred to as "chains".
[0286] Ferromagnetically coupled chains of qubits in a quantum
processor topology can be used to mediate long-range interactions
across a lattice of qubits and/or cells of qubits, and to increase
the connectivity of single qubits. The rationale behind this
approach is that by providing sufficiently strong ferromagnetic
coupling between qubits in a chain, their spins will remain locked
parallel and consequently the entire chain may be treated
effectively as a single qubit (i.e., an effective, or logical,
qubit). The chain can become an at least approximate 2-level
system, since a state of the chain can be restricted to reside
within that portion of Hilbert space that is spanned by the two
ferromagnetically aligned spin states. Thus, connections can
hypothetically be made anywhere along the length of the chain in
order to make use of the physically extended effective qubit. See
U.S. Pat. No. 8,174,305 for more description of chains.
[0287] Under some circumstances, the qubits in a chain can acquire
a single, common value with a high probability. For longer chains,
the ferromagnetic connections need to be stronger (relative to the
strength of other couplings) to maintain the high probability of
acquiring a single, common value. If the qubits in a chain do not
acquire a single, common value, then the chain is typically
referred to as a broken chain.
[0288] The strength of the ferromagnetic connections in a chain is
bounded. So when a 32×32 full bipartite connectivity (K32,32)
graph is embedded using 1024 qubits in a quantum processor, using
chains of length 32 that stretch laterally across, and also
vertically down, the Chimera architecture, it can become likely
that at least a few chains will be broken in any sample taken from
the quantum processor.
[0289] When samples are drawn from the quantum processor, it can be
desirable to discard at least some of the samples from broken
chains.
[0290] Another approach is to attempt to repair broken chains using a suitable method such as, for example, majority voting or gradient descent. The resulting samples may no longer be from a
well-defined probability distribution. Broken chains can be
repaired without affecting the probability distribution by
annealing from the "true" distribution of the quantum processor to
the logical distribution on the chains. This approach can require
an accurate, and computationally tractable, model of the quantum
processor.
[0291] At the same time, an RBM (or QBM) is only hard to sample
from (such as when it is in the spin-glass phase) if there are
long-range correlations, such as those induced by chains.
[0292] Another approach is to use weak chains, with moderate
ferromagnetic connections, and use models that describe individual
qubits in the quantum processor, rather than models that treat
chains as cohesive and coherent logical variables.
[0293] A discrete variational autoencoder (discrete VAE) can help
to insulate the distribution represented by the quantum processor
from the statistics of the data distribution. This, in turn, can
enable the use of RBMs and QBMs with fewer variables and sparser
connections than are present in the dataset. As a particular
example, a K32,32 graph in a discrete VAE trained on dynamically
binarized MNIST can be as effective as a K64,64 graph. Larger
complete bipartite RBMs may be unlikely to provide any additional
advantage. Nonetheless, it can be a challenge to sample from the
resulting trained RBM, and improved sampling is desirable for
improved generative performance.
[0294] If a sparsely-connected architecture (such as Chimera) is
used in the prior of a discrete VAE, and the architecture includes
weak chains, and the individual qubits are modeled, then the chain
connections will induce strong correlations in the true posterior
distribution. These correlations can be modeled in the
approximating posterior. Otherwise the ELBO will be small, and the
bound on the log-likelihood will be loose.
[0295] Traditional VAEs, in contrast, use a factorial approximating
posterior, in which the approximating posterior is a product of
independent distributions over each latent variable.
[0296] For example, consider a case where a chain is relatively
strong, but either value of the chain (or its associated logical
qubit, in the case of very strong chains) is consistent with the
observed variables. The two configurations 00 . . . 0 and 11 . . .
1 each have roughly equal probability, but other configurations
(e.g. 01 . . . 1) have lower probability, since they include one or
more breaks in the chain.
[0297] A factorial distribution can assign equal probability to 00 . . . 0 and 11 . . . 1 via a distribution $\prod_i p(z_i)$, where $p(z_i{=}0)=p(z_i{=}1)=0.5$. This maximum-entropy distribution assigns equal probability $0.5^n$ to each configuration. The KL divergence from the approximating posterior to the true prior can be minimized when all probability is assigned to 00 . . . 0 (or, symmetrically, to 11 . . . 1).
[0298] If a factorial approximating posterior accommodates bimodal
true posterior distributions over chains by selecting one mode,
then the mode associated with each element of the training set will
be randomly assigned and locked in at the very beginning of
training. The mode associated with a given element of the dataset
can change only by passing through distributions that assign
exponentially low probability to the modes.
[0299] In contrast to the factorial approximating posterior of
traditional VAEs, the discrete VAE already includes a hierarchical
approximating posterior. This hierarchy can be aligned with the
chains to allow the approximating posterior to capture bimodality
in the chains. In the simplest case, the top-level of the hierarchy
can contain one element of each chain, with other qubits in the
second level of the hierarchy. The elements in the second level can be made equal, with high probability, to the element of their chain in the top level. The approximating posterior over the top
level can be high-entropy, but the chains can still be unbroken
with high probability in the approximating posterior.
[0300] Another approach is to assign successive elements of each
chain to successive levels (or layers) of the hierarchy. For
instance, in this approach, the left, or topmost element, of each
chain can be assigned to the top level of the hierarchy, the next
element to the right, or below, to the second level of the
hierarchy, and so on.
[0301] The approximating posterior can be defined with Markovian,
rather than fully autoregressive, connections, such that each
element of the chain is dependent upon only an adjacent element of
the chain, rather than more distant elements to which it has no
direct connection (in the prior), and on which it is conditionally
independent (in the prior, but not the approximating posterior)
given the adjacent elements of the chain. In practice, this could be realized by allowing only the n-th layer of the hierarchy to depend on the (n-1)-th layer, rather than allowing the n-th layer to depend upon layers 1, 2, . . . , (n-1).
[0302] Rather than unrolling the chains from one end to the other
along the hierarchy, another approach is to start in the center,
and unroll the chains in both directions, thereby halving the
number of hierarchical levels while still allowing single chain
breaks to be more probable than double chain breaks.
[0303] In one implementation, the parameters in the prior and the
approximating posterior are initialized, or constrained, to induce
strong ferromagnetic correlations along the chains. In another
implementation the parameters in either the prior or the
approximating posterior are initialized, or constrained, to induce
strong ferromagnetic correlations along the chains. If the
initialization, or the constraint, is sufficiently strong, then it
may force at least some of the chains to be unbroken. If the
initialization, or the constraint, is insufficiently strong, then
it may induce correlations along the chains, but may allow at least
some of the chains to have one or more breaks. It is anticipated
that even relatively weak chains may be sufficient to induce
long-range correlations.
[0304] Each element of a strong chain should, in principle, have
the same semantics in the generative model. Another simpler, and
more standard, approach is to initialize the parameters of the
model independently, such that each element of the chains has
independent semantics.
[0305] In order for the elements of the chain to converge to a
single consensus meaning, a massive symmetry-breaking must occur,
with the network choosing amongst the many distinct initial
meanings for the elements of the chain. This process may be
hastened, and overfitting mitigated, by sharing parameters in the
generative model between elements of each chain.
[0306] Parameter sharing may require that the method identifies the
chains a priori, rather than allowing them to be learned. Parameter
sharing may also require that the chains be consistent across an
entire dataset, rather than encoding additional information in
chain layouts dynamically selected for each sample. A complementary
parameter sharing (analogous to the recurrent parameter sharing in
the prior of the discrete VAE) may also be applied to the
approximating posterior.
[0307] The above description provides a method for applying
discrete VAEs to a quantum processor with a sparsely-connected
architecture. The method uses hierarchical VAEs in conjunction with
a sparsely-connected undirected graphical model, in which strong
correlations are naturally induced by the prior. The method can be
applied to operation of a hybrid computing system (such as system
100 of FIG. 1) to provide samples for discrete VAEs.
[0308] As described above, to train a Boltzmann machine, the system
can determine a prior distribution which can be done, at least in
part, by drawing samples from a quantum processor. The samples can
be used to set the parameters of the prior distribution, the shape
of which can be configured by a suitable choice of programmable
parameters such as the qubit biases and the coupling strengths of
coupling devices that communicatively couple respective pairs of
qubits. These are the h and J parameters described above. The
programmable parameters determine, at least in part, the gradient
of the hierarchical approximating posterior.
[0309] FIG. 8 is a schematic diagram illustrating an example
logical topology 800 of a quantum processor such as quantum
processor 114 of FIG. 1. Topology 800 comprises 72 qubits indicated
by circles with and without a patterned fill. Only example qubits
802, 804, 806, 808, 810, and 812 are called out in FIG. 8. The 72
qubits of FIG. 8 are arranged in cells of 8 qubits in a Chimera
topology as previously described with reference to FIGS. 2A, 2B,
and 2C. Each qubit in a cell is communicatively coupled to each
other qubit in the same cell. Only one example intra-cell coupling
814 is called out in FIG. 8. Each qubit in a cell is
communicatively coupled to two other qubits in each of two other
cells. Only example inter-cell couplings 816, 818, 820, and 822 are
called out in FIG. 8.
[0310] Patterned fill of qubits, such as example qubits 802, 804,
806, 808, 810, and 812, is used in FIG. 8 to indicate chains. For
example, qubits 802, 804, and 806 belong to a first chain
(indicated by solid black fill), and qubits 808, 810, and 812
belong to a second chain (indicated by a diagonal patterned
fill).
[0311] The methods described above for unsupervised learning
include determining an approximating posterior distribution over
each of the discrete latent space and the corresponding continuous
latent space. The approximating posterior distribution can be a
hierarchical approximating posterior distribution with multiple
levels of hierarchy. The number of levels in the hierarchy can be
selected.
[0312] During computation, qubits can be assigned to levels of the hierarchy. Typically, when a chain of qubits appears for the first time in the hierarchy, only one qubit of that chain is assigned to that level of the hierarchy.
[0313] In one implementation, the first qubit in each chain (for
example, qubits 802 and 808) is assigned to a first level of the
hierarchy, and the remaining qubits in the chains (for example,
qubits 804, 806, 810, and 812) are assigned to a second level of
the hierarchy. The first qubit in each chain can be used as a
deciding qubit with the remaining qubits in the chain having the
same value as the first qubit with high probability, i.e., when the
probability exceeds a predetermined threshold.
[0314] In one approach, the chain of qubits can be built along the chain starting from an end qubit. In another implementation, the chain of qubits can be built from the interior of the chain, for example from the middle of the chain. In the latter approach, the number of levels in the hierarchy can be reduced by building the chain in both directions at once, i.e., by adding two qubits to the chain at each level of the hierarchy, instead of one, as illustrated in the sketch below.
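A small Python sketch of these two assignment schemes, with chains given as lists of qubit indices (the function names are illustrative, not part of the claimed method), is:

def levels_from_end(chains):
    # level 0 holds the first qubit of each chain, level 1 the next, and so on
    depth = max(len(c) for c in chains)
    return [[c[level] for c in chains if level < len(c)] for level in range(depth)]

def levels_from_middle(chains):
    # unroll each chain in both directions from its middle, adding up to two
    # qubits per chain per level, which roughly halves the number of levels
    levels = []
    for c in chains:
        mid = len(c) // 2
        left, right = c[:mid][::-1], c[mid:]
        for level in range(max(len(left), len(right))):
            if level >= len(levels):
                levels.append([])
            if level < len(right):
                levels[level].append(right[level])
            if level < len(left):
                levels[level].append(left[level])
    return levels

chains = [[802, 804, 806], [808, 810, 812]]   # example chains from FIG. 8
print(levels_from_end(chains))
print(levels_from_middle(chains))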
[0315] In another implementation, a Markovian property of chains
can be used, in which a qubit is expected to have the same value as
the qubit in the immediately preceding level of the hierarchy. For
example, a qubit at the third level can be independent of a qubit
in the same chain at the first level, given a qubit in the same
chain at the second level.
[0316] Referring to FIG. 8, qubit 802 can be assigned to a first
level of the hierarchy, 804 to a second level, 806 to a third
level, and so on along the chain. In one implementation, as
described above, qubit 802 can be a deciding qubit. In another
implementation, based on a Markovian property of chains as
described above, the value of 806 can be independent of the value
of 802, and be expected to have the same value as qubit 804 with
high probability.
[0317] In one implementation, both qubits 802 and 808 can be
assigned to a first level of the hierarchy, qubits 804 and 810 to a
second level, qubits 806 and 812 to a third level, and so on.
[0318] The topology illustrated in FIG. 8 is an example topology.
The chains of qubits illustrated in FIG. 8 are example chains.
Those with skill in the art will recognize that other suitable
topologies, and other suitable arrangements of chains, can be used
in the implementation of the systems and methods described
above.
Searching with a Discrete Latent Representation
[0319] Latent representations generated by variational autoencoders
can be used to search an input space characterized by an objective
function. Examples of such techniques are described, for example,
by Gomez-Bombarelli et al. in Automatic chemical design using a
data-driven continuous representation of molecules, available at
https://arxiv.org/pdf/1610.02415.pdf (October 2016), which is
incorporated herein by reference in its entirety. A challenge with
these techniques is that, in most circumstances, the vast majority
of possible inputs are not necessarily meaningful. For instance, in
the context of images, the vast majority of possible pixel
configurations yield white noise, rather than recognizable
images.
[0320] In some implementations, these challenges are ameliorated by
providing a machine learning model that accommodates the
representation of discrete variables in its latent representations.
Such implementations may take advantage of "clustered" structures
in the input space. For example, returning to the image context,
the set of well-formed images may include images of cats and images
of cars; although both cats and cars may be subject to (sometimes
complementary) transformations, images of cats may be thought to
occupy a relatively small region of the input space (i.e. a
"cluster") and similarly images of cars may occupy their own
cluster. Although there are edge cases where images of cars and
cats are difficult to distinguish (e.g. due to illumination, visual
ambiguity, etc.), they can generally be thought to have large gaps
between them--that is, to be spaced relatively far apart from each
other.
[0321] Unimodal probability distributions over continuous variables
tend to have difficulty representing gaps between clusters. In
particular, transformations from the latent space to the input
space tend to introduce sharp discontinuities; these
discontinuities may not be well-represented by continuous variables
and may make training more challenging (e.g. because they involve
large gradients over very small regions of parameter space). In
some cases, using multimodal continuous variables may not resolve
or sufficiently ameliorate these issues.
[0322] In such circumstances, explicitly representing probability
distributions over both continuous and discrete variables in the
latent representation may enable the machine learning model to more
accurately and/or efficiently represent discontinuities between
different types of data in the input space. The discrete variables in the latent representation may broadly correspond to boundaries between classes of input data, thereby defining softly-partitioned regions
within which continuous-valued variables may be relatively smooth
(as it is less likely that the continuous variables will need to
model sharp discontinuities). Moreover, the latent space tends to
have much smaller dimensionality than the input space. This
smoothness and lower dimensionality may simplify (and/or improve
the efficiency of) optimization techniques applied to the latent
space.
[0323] Searching the Latent Space with an Input Space Objective
Function
[0324] FIG. 9 shows an example method 900 for searching an input
space characterized by an objective function f(x). At 920, a
computing system receives training data 910 and trains a machine
learning model based on training data 910. The machine learning
model may comprise, for example, a discrete variational autoencoder
(e.g. as described herein) and/or any other machine learning model
capable of generating latent representations which comprise
explicit representations of discrete variables. The computer system
generates, via the training at 920, a latent representation 912
which comprises discrete variables.
[0325] At the various acts collectively denoted as 940, the computing system searches the input space based on latent representation 912 and the objective function f(x).
[0326] In some implementations, search act 940 comprises selecting
an initial point z.sub.0 for the search at 922 and optimizing in
the latent space at 924. At 922, the computing system selects an
initial point z.sub.0 in the latent space; this selection may be by
any suitable means. For example, initial point z.sub.0 may be
selected by sampling from the prior distribution over the latent
representation (denoted p(z)), by random selection, by selecting an
initial point x.sub.0 in the input space and encoding it to the
latent space (e.g. by applying an encoder of the machine learning
model, such as approximating posterior q(z|x)), by receiving a
user's selection, and/or by any other suitable means.
[0327] At 924, the computing system performs an optimization in the
latent space (i.e. based on latent representation 912). The
optimization may be performed according to any suitable method,
such as (for example) Bayesian optimization, simulated annealing,
and/or other methods. Such optimization methods would ordinarily be applied to the input space by evaluating points in the input space according to the input space objective function f(x). In some implementations, in order to perform the optimization in the latent
space, a latent space objective function g(z) is provided by
composing the input space objective function f(x) with a decoder
(e.g. the prior distribution conditioned on z, denoted p(x|z)) of
the machine learning model (e.g. a variational autoencoder) capable
of mapping points in the latent space to the input space. The
optimization function may then be applied to the latent space based
on the latent space objective function g(z). This composition (i.e.
the determination of an objective function value for a point z in
the latent space) is illustrated as act 930 in FIG. 9, and may
occur any number of times during act 924.
[0328] In some implementations, such composition is not explicit
(and a latent space objective function g(z) might not be
specifically identified). Regardless, for convenience, in this
specification (and in the appended claims) "composition" includes
implicit composition, such as decoding a latent space value z to an
input space value x, evaluating the objective function f(x) at x,
and using the resulting value as a measure of z in an optimization
method without explicitly defining a composed latent space
objective function g(z).
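A minimal Python sketch of this composition, assuming hypothetical decode(z) and f(x) callables and using a simple stochastic hill-climbing search in place of Bayesian optimization or simulated annealing, is:

import numpy as np

def latent_objective(z, decode, f):
    # implicit composition g(z) = f(decode(z)): decode a latent point, then score it
    return f(decode(z))

def search_latent_space(z0, decode, f, num_steps=200, step_size=0.1, seed=0):
    # simple stochastic hill climbing over the continuous latent variables
    rng = np.random.default_rng(seed)
    best_z, best_val = z0, latent_objective(z0, decode, f)
    for _ in range(num_steps):
        cand = best_z + step_size * rng.standard_normal(best_z.shape)
        val = latent_objective(cand, decode, f)
        if val > best_val:
            best_z, best_val = cand, val
    return best_z, best_val

# toy usage with stand-in decoder and objective (both hypothetical)
decode = lambda z: np.tanh(z)                  # stand-in for the decoder mean of p(x|z)
f = lambda x: -np.sum((x - 0.5) ** 2)          # stand-in input-space objective f(x)
z_best, v_best = search_latent_space(np.zeros(4), decode, f)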
[0329] In some implementations, optimizing at act 924 comprises
fixing the discrete variables of z.sub.0 and performing the
optimization by modifying only continuous variables in the latent
space. Such implementations may involve (but are not limited to) performing continuous-valued optimization methods which do not necessarily accommodate discrete values, such as conventional Bayesian Optimization. Alternatively, or in addition, the
optimization may comprise optimizing over some or all of the
discrete variables, either alone or together with some or all of
the continuous variables in the latent space, through the use of a
suitable optimization method which accommodates discrete-valued
(and/or mixed discrete/continuous-valued) domains.
[0330] In some implementations, method 900 performs acts 922 and
924 more than once. For example, a plurality of initial points
z.sub.0 may be selected at act 922 (e.g. simultaneously, in
parallel, sequentially, and/or otherwise) and each of those points
may be used as an initial value for optimization at 924.
[0331] In some implementations, method 900 performs act 924 a
plurality of times for an initial point z.sub.0. For example, where
the optimization method of act 924 is probabilistic, the
optimization may be performed from the same initial point z.sub.0
several times and yield a plurality of distinct results. In some
implementations, act 924 may involve performing one of a set of
optimizations, and an initial point z.sub.0 may be optimized at 924
using each (or a subset of) the set of optimizations.
[0332] The computing system determines a result 918 based on the optimization at 924, such as a result returned by the search 940.
Where multiple instances of act 924 occur, thereby producing a
plurality of candidate results, the computing system may select one
or more of the candidate results to be used as result 918. For
example, the computing system may select the candidate result with
the largest (or smallest) corresponding objective function
value.
[0333] Result 918 may be in the latent or input space. In some
implementations, the computing system determines an optimized
latent space value z as an output of the optimization method
performed at 924 and decodes the latent space value z into a result
918 in the input space. In some implementations, a mapping of the
optimized point z generated at 930 is retained by the computing
system and used as result 918 without decoding it an additional
time at the conclusion of the act 924 optimization.
[0334] Example Problem Domains and Semi-Supervised
Implementations
[0335] In some exemplary implementations, method 900 involves
searching for particular points in domains characterized by large
quantities of observational data without necessarily deep
understanding of relationships between observable quantities. For
example, method 900 may search an input space of chemical
compositions. The space of naturally occurring molecules seems to
have a clustered structure analogous to that of natural images;
there are different classes of molecules with large gaps between
them, like the underlying objects in the visual world. These
classes may be represented relatively effectively using discrete
variables. Molecules within a class may share large scale
structural similarity, with continuous latent variables describing
variations in local structure.
[0336] Such a model may be trained based on a (potentially very
large) dataset comprising, for example, the constituent elements of
small molecules (such as protein primary structure). This
information is referred to herein as unsupervised data and may
comprise, for example, SMILES encodings of molecular structure. In
some implementations, training may be semi-supervised (i.e. may
incorporate additional supervised data corresponding to a subset of
the unsupervised data); for example, some of the molecules may have
associated data on their three-dimensional structure (e.g. protein
tertiary structure) and/or additional properties (such as
pharmacological or chemical properties, such as toxicity levels
and/or catalytic properties). Even if this empirical knowledge
lacks known theoretical underpinnings to model relationships
between observed quantities, and even if supervised data is
relatively sparse (which is often the case in chemoinformatics),
the observational data may improve search by providing structure to
the latent space which may correspond to (or even predict) desired
properties.
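[0336.1] One possible (purely illustrative) representation of such semi-supervised data is sketched below; the record fields and example values are assumptions, not prescribed by this disclosure:

```python
# Hypothetical semi-supervised dataset: every record carries an unsupervised SMILES
# encoding, and only some records carry supervised properties.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoleculeRecord:
    smiles: str                                  # unsupervised data
    toxicity: Optional[float] = None             # supervised property, if known
    catalytic_activity: Optional[float] = None   # supervised property, if known

dataset = [
    MoleculeRecord(smiles="CCO", toxicity=0.1),                               # partially supervised
    MoleculeRecord(smiles="c1ccccc1"),                                        # unsupervised only
    MoleculeRecord(smiles="CC(=O)O", toxicity=0.3, catalytic_activity=0.0),   # fully supervised
]
print(sum(r.toxicity is not None for r in dataset), "records have supervised toxicity")
```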
[0337] Even where particular properties are not specifically
desired (or to be avoided), such properties may be used as
supplementary supervised signals. Their inclusion may assist in
structuring the latent representation and may thus assist in
searching, even if those properties are not specifically
discriminated for or against by the objective function. Such
supplementary supervised data may be incorporated into a machine
learning model (such as a DVAE) by, for example, training a
generative model over an extended input representation that
includes the supplementary supervised properties.
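[0337.1] For example, a minimal sketch of such an extended input representation, assuming an illustrative feature layout, might concatenate the unsupervised representation with the supplementary supervised properties before training:

```python
# Sketch of forming an extended input representation for training a generative model
# over both the unsupervised representation and the supplementary supervised properties.
import numpy as np

x_unsupervised = np.array([0.2, 0.9, 0.1, 0.7])  # e.g. an encoded molecular structure
y_supervised   = np.array([0.1, 0.0])            # e.g. toxicity and catalytic activity

x_extended = np.concatenate([x_unsupervised, y_supervised])
# x_extended would serve as one training example for the generative model.
print(x_extended)
```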
[0338] In some semi-supervised implementations, the properties of
elements for which no supervised data is available (or for which it
is incomplete) may be marginalized out; and/or, equivalently, the
supervised data may be ignored when determining reconstruction
error for such elements. This may be accomplished, for example, by
not conditioning the approximating posterior on the supervised data
during training. In some implementations, a first approximating
posterior which receives supervised data is used for elements where
supervised data is defined and a second approximating posterior
which does not receive supervised data is used for elements where
supervised data is not defined. In some implementations, the approximating posterior is defined so that the supervised properties, when defined, perturb the resulting distribution, so that the two cases (i.e. supervised data being defined vs. not defined) share a common core.
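[0338.1] A minimal PyTorch-style sketch of ignoring the supervised portion of the reconstruction error where supervised data is undefined is given below; an illustrative squared-error loss is assumed, and the actual reconstruction error of a given implementation may differ:

```python
# Sketch of masking the supervised portion of the reconstruction error for elements
# whose supervised data is undefined, effectively marginalizing those properties out.
import torch

def reconstruction_loss(x_recon, x, y_recon, y, y_available):
    # x_recon, x:  unsupervised data and its reconstruction, shape (batch, d_x)
    # y_recon, y:  supervised properties and their reconstruction, shape (batch, d_y)
    # y_available: boolean mask of shape (batch,), True where supervised data is defined
    loss_x = ((x_recon - x) ** 2).sum(dim=1)
    loss_y = ((y_recon - y) ** 2).sum(dim=1)
    loss_y = torch.where(y_available, loss_y, torch.zeros_like(loss_y))
    return (loss_x + loss_y).mean()

x = torch.randn(4, 8); x_recon = torch.randn(4, 8)
y = torch.randn(4, 2); y_recon = torch.randn(4, 2)
mask = torch.tensor([True, False, True, False])
print(reconstruction_loss(x_recon, x, y_recon, y, mask))
```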
[0339] Gradient-Based Search
[0340] In some implementations, input data 912 comprises both
supervised and unsupervised data and method 900 involves
determining a gradient in the latent space and optimizing an
estimator for one or more properties in the supervised data. The
machine learning model may be trained at 920 over a joint
distribution of the supervised and unsupervised data. Input data
912 may be fully supervised or semi-supervised.
[0341] Once the machine learning model is trained and an initial
point z.sub.0 in the latent space has been selected at 922, the
latent representation may be optimized by gradient descent and/or
any other suitable gradient-based optimization method, such as
ADAM. The gradient may be based on a loss function (e.g. as
described above, with respect to gradient descent algorithms) and
the gradient-based optimization method may aim to minimize that
loss function. In certain circumstances, such as where there are
large quantities of supervised data, such techniques may scale more
efficiently than Bayesian optimization and/or certain other optimization
methods.
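[0341.1] A minimal PyTorch sketch of such gradient-based optimization of the latent representation is given below; the decoder and property estimator are small stand-in modules (assumptions for illustration, not the trained networks of any particular implementation):

```python
# Sketch of optimizing a latent point z with a gradient-based method (Adam) to
# minimize a loss defined on a desired property estimated in the input space.
import torch

torch.manual_seed(0)
decoder = torch.nn.Linear(8, 16)        # stand-in for the trained decoder p(x|z)
property_net = torch.nn.Linear(16, 1)   # stand-in for the property estimator

z = torch.randn(8, requires_grad=True)  # initial latent point z_0 from act 922
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(100):
    optimizer.zero_grad()
    x = decoder(z)                       # decode the latent point into the input space
    loss = -property_net(x).sum()        # maximize the estimated desired property
    loss.backward()                      # backpropagate the loss into the latent space
    optimizer.step()

print(z.detach())
```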
[0343] In some such implementations of method 900, the loss
function is defined solely on the desired properties in the input
space, thereby effectively providing an objective function f(x)
with a domain in the input space. The loss is backpropagated
through the decoder (e.g. the prior distribution over the input
space conditioned on the latent representation, p(x|z)) into the
latent space, thereby providing a latent-space loss function
defined on the desired properties. This backpropagation corresponds
to the act 930 determination of g(z). (This is unlike certain
backpropagation techniques described above, which determine a loss
with respect to parameters of the model as opposed to latent/hidden
variables). Based on the resulting loss function, the optimization
method can determine a gradient for the desired properties over the
latent space. The latent representation may then be optimized
according to the chosen gradient-based optimization method based on
the latent-space gradient.
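[0343.1] The following sketch illustrates, with stand-in modules, backpropagating an input-space loss through a decoder to obtain a gradient with respect to the latent variables rather than with respect to the model parameters:

```python
# Sketch of backpropagating an input-space loss through the decoder into the latent
# space, yielding the gradient of the (implicitly composed) loss with respect to z.
import torch

decoder = torch.nn.Linear(8, 16)          # stand-in for the decoder p(x|z)
target_properties = torch.zeros(16)       # stand-in for desired properties in the input space

z = torch.randn(8, requires_grad=True)
x = decoder(z)
loss = ((x - target_properties) ** 2).sum()   # loss defined solely on the input space

grad_z, = torch.autograd.grad(loss, z)        # latent-space gradient, per act 930
print(grad_z)
```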
[0344] In some implementations, the gradient-based optimization
method optimizes a log-probability of both the desired properties
and the prior on the latent space. This corresponds to finding a
probable latent configuration conditioned on the desired properties
(e.g. pharmacological effect of a chemical composition), with other
properties (including, potentially, properties defined by the
unsupervised data, e.g. chemical structure) marginalized out. For
example, if the desired properties are denoted y, the
gradient-based optimization method may optimize the following
log-probability:
$$\log\big(p(z)\,p(x \mid z)\,p(y \mid z)\big) \propto \log p(z) + \log p(y \mid z)$$
[0345] where the log p(x|z) term is a constant factor independent
of the configuration of the latent space and has been marginalized
out. For instance, in some implementations, the gradient-based
optimization method may select a candidate point z.sub.i+1 in the
latent space according to:
$$z_{i+1} = \frac{\partial y}{\partial z} + \frac{\partial \log p(z)}{\partial z}$$
[0346] evaluated at a previously-determined candidate point
z.sub.i.
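[0346.1] For illustration, the candidate-point expression above may be evaluated as in the following sketch, which assumes (purely as stand-ins) a standard normal prior for p(z) and a linear estimator for the property y:

```python
# Sketch of evaluating the candidate-point expression at a previously determined z_i.
import torch

property_net = torch.nn.Linear(8, 1)      # stand-in for the estimator of the property y

z_i = torch.randn(8, requires_grad=True)
y = property_net(z_i).sum()               # estimated desired property at z_i
log_prior = -0.5 * (z_i ** 2).sum()       # log p(z) for a standard normal prior (up to a constant)

grad_y, = torch.autograd.grad(y, z_i, retain_graph=True)
grad_log_prior, = torch.autograd.grad(log_prior, z_i)

z_next = grad_y + grad_log_prior          # candidate point per the expression above
print(z_next)
```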
[0347] Once the gradient-based optimization method has arrived at a
latent configuration that is probable conditioned on the desired
properties, the corresponding point in the input space can be
determined via the decoder, as described above. Returning to the
example of chemoinformatics, this corresponds to finding a chemical
structure which both has a high probability of existing and has a
high probability of having the desired properties, such as
pharmacological effect, low toxicity, etc. (Note that "high" here
is used relative to the probabilities assigned by the latent
representation to other elements in the input space, and not as an
objective measure of likelihood independent of the latent
representation.)
General
[0348] The above description of illustrated embodiments, including
what is described in the Abstract, is not intended to be exhaustive
or to limit the embodiments to the precise forms disclosed.
Although specific embodiments and examples are described herein
for illustrative purposes, various equivalent modifications can be
made without departing from the spirit and scope of the disclosure,
as will be recognized by those skilled in the relevant art. The
teachings provided herein of the various embodiments can be applied
to other methods of quantum computation, not necessarily the
example methods for quantum computation generally described
above.
[0349] The various embodiments described above can be combined to
provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, and foreign patent applications referred to in this specification and/or listed in the Application Data Sheet that are owned by D-Wave Systems Inc., including but not limited to: U.S.
provisional patent application Ser. No. 62/404,591 filed Oct. 5,
2016; U.S. provisional patent application Ser. No. 62/462,821 filed
Feb. 23, 2017; U.S. patent application publication 2015/0006443
published Jan. 1, 2015; U.S. patent application publication
2015/0161524 published Jun. 11, 2015; International PCT patent
application US2016/047628, filed Aug. 18, 2016, entitled "SYSTEMS
AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM
COMPUTERS"; International PCT patent application US2016/047627,
filed Aug. 18, 2015, entitled "DISCRETE VARIATIONAL AUTO-ENCODER
SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM
COMPUTERS", is incorporated herein by reference in its entirety.
Aspects of the embodiments can be modified, if necessary, to employ
systems, circuits, and concepts of the various patents,
applications, and publications to provide yet further
embodiments.
* * * * *