It was developed in the 1980s in operations research, under the names "pathwise gradients" and "stochastic gradients".[1][2] Its use in variational inference was proposed in 2013.[3]
Mathematics
Let $z$ be a random variable with distribution $q_\phi(z)$, where $\phi$ is a vector containing the parameters of the distribution.
REINFORCE estimator
Consider an objective function of the form:

$$L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$

Without the reparameterization trick, estimating the gradient $\nabla_\phi L(\phi)$ can be challenging, because the parameter $\phi$ appears in the random variable itself. In more detail, we have to statistically estimate:

$$\nabla_\phi L(\phi) = \nabla_\phi \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$

The REINFORCE estimator, widely used in reinforcement learning and especially policy gradient,[4] uses the following equality:

$$\nabla_\phi L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)\,\nabla_\phi \ln q_\phi(z)]$$

This allows the gradient to be estimated:

$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} f(z_i)\,\nabla_\phi \ln q_\phi(z_i), \qquad z_1, \dots, z_N \sim q_\phi(z)$$

The REINFORCE estimator has high variance, and many methods were developed to reduce its variance.[5]
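As a concrete illustration, the sketch below (a toy example, not taken from the cited references) estimates $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2]$ with the REINFORCE estimator; the choice of $f$, the Gaussian family, and all variable names are assumptions made for the example.

```python
# Score-function (REINFORCE) estimate of d/dmu E_{z ~ N(mu,1)}[z^2].
# The exact gradient is 2*mu, since E[z^2] = mu^2 + 1.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 100_000
z = rng.normal(mu, 1.0, size=n)      # z_i ~ q_mu = N(mu, 1)
f = z**2                             # f(z_i)
score = z - mu                       # grad_mu ln q_mu(z_i) for a unit-variance Gaussian
grad_est = np.mean(f * score)        # (1/N) sum_i f(z_i) * grad_mu ln q_mu(z_i)
print(grad_est, "vs exact", 2 * mu)  # unbiased, but each term has high variance
```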
Reparameterization estimator
The reparameterization trick expresses $z$ as:

$$z = g_\phi(\epsilon)$$

Here, $g_\phi$ is a deterministic function parameterized by $\phi$, and $\epsilon$ is a noise variable drawn from a fixed distribution $p(\epsilon)$. This gives:

$$L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}[f(g_\phi(\epsilon))]$$

Now, the gradient can be estimated as:

$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi f(g_\phi(\epsilon_i)), \qquad \epsilon_1, \dots, \epsilon_N \sim p(\epsilon)$$
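The same toy objective as above, $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2]$, can be estimated with the pathwise form $z = g_\mu(\epsilon) = \mu + \epsilon$; the sketch and its variable names are again illustrative, and the variance comparison in the final comment refers only to this toy example.

```python
# Reparameterization (pathwise) estimate of d/dmu E_{z ~ N(mu,1)}[z^2],
# using z = g_mu(eps) = mu + eps with eps ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 100_000
eps = rng.normal(0.0, 1.0, size=n)     # eps_i ~ p(eps), a fixed parameter-free distribution
grad_samples = 2.0 * (mu + eps)        # d/dmu (mu + eps)^2, differentiating through g_mu
print(grad_samples.mean(), "vs exact", 2 * mu)
print("per-sample variance:", grad_samples.var())  # equals 4 here, far below REINFORCE's
```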
Examples
For some common distributions, the reparameterization trick takes specific forms. For instance, a normal variable $z \sim \mathcal{N}(\mu, \sigma^2)$ can be written as $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, and an exponential variable $z \sim \operatorname{Exp}(\lambda)$ as $z = -\ln(\epsilon)/\lambda$ with $\epsilon \sim U(0, 1)$. Discrete distributions have no differentiable reparameterization, but can be handled approximately through continuous relaxations such as the Gumbel-softmax (Concrete) distribution.
In general, any distribution that is differentiable with respect to its parameters can be reparameterized by inverting its multivariable CDF and then applying the implicit method. See [1] for an exposition and application to the Gamma, Beta, Dirichlet, and von Mises distributions.
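As a toy illustration of the inverse-CDF route (not taken from the cited reference), the sketch below reparameterizes an exponential random variable and differentiates through the resulting map; the objective $f(z) = z$ and all names are assumptions made for the example.

```python
# Reparameterizing z ~ Exp(lam) through the inverse CDF: z = g_lam(eps) = -ln(1 - eps)/lam,
# with eps ~ U(0, 1).  For f(z) = z, E[z] = 1/lam, so d/dlam E[f(z)] = -1/lam^2.
import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 100_000
eps = rng.uniform(size=n)               # fixed base distribution U(0, 1)
z = -np.log1p(-eps) / lam               # z = g_lam(eps)
grad_samples = np.log1p(-eps) / lam**2  # d/dlam g_lam(eps) = -z / lam
print(grad_samples.mean(), "vs exact", -1 / lam**2)
```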
Variational autoencoder
In variational autoencoders (VAEs), the objective, known as the evidence lower bound (ELBO), is of the form:

$$\operatorname{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\ln p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))$$

where $q_\phi(z|x)$ is the encoder (recognition model), $p_\theta(x|z)$ is the decoder (generative model), and $p(z)$ is the prior distribution over latent variables. The gradient of the ELBO with respect to $\theta$ is simply

$$\mathbb{E}_{z \sim q_\phi(z|x)}[\nabla_\theta \ln p_\theta(x|z)]$$

but the gradient with respect to $\phi$ requires the trick. Express the sampling operation $z \sim q_\phi(z|x)$ as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of the encoder network, and $\odot$ denotes element-wise multiplication. Then we have

$$\nabla_\phi \operatorname{ELBO}(\phi, \theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \ln p_\theta(x \mid \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\right] - \nabla_\phi D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))$$

where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. This allows us to estimate the gradient using Monte Carlo sampling:

$$\nabla_\phi \operatorname{ELBO}(\phi, \theta) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \ln p_\theta(x \mid z_l) - \nabla_\phi D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))$$

where $z_l = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_l$ and $\epsilon_l \sim \mathcal{N}(0, I)$ for $l = 1, \dots, L$.
This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants.
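The sketch below shows how this looks in code: a minimal PyTorch VAE in which the encoder outputs $\mu_\phi(x)$ and $\ln \sigma_\phi(x)^2$ and the latent sample is formed as $\mu + \sigma \odot \epsilon$. The architecture, dimensions, and Bernoulli reconstruction likelihood are illustrative assumptions, not the setup of any particular paper.

```python
# Minimal PyTorch sketch of the reparameterized sampling step inside a VAE.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.logvar_head = nn.Linear(h_dim, z_dim)    # ln sigma_phi(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                    # eps ~ N(0, I), parameter-free
        z = mu + torch.exp(0.5 * logvar) * eps        # z = mu + sigma * eps (the trick)
        x_hat = self.dec(z)
        # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian encoder
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - 1.0 - logvar, dim=-1)
        # Bernoulli reconstruction log-likelihood (negative binary cross-entropy)
        rec = -nn.functional.binary_cross_entropy_with_logits(
            x_hat, x, reduction="none").sum(dim=-1)
        elbo = rec - kl
        return -elbo.mean()                           # loss to minimize

x = torch.rand(32, 784)        # fake batch with values in [0, 1]
loss = ToyVAE()(x)
loss.backward()                # gradients flow through z because z is a deterministic
                               # function of mu, logvar, and the fixed noise eps
```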
Variational inference
More generally, the trick allows using stochastic gradient descent for variational inference. Let the variational objective (ELBO) be of the form:

$$\operatorname{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[\ln p(x, z) - \ln q_\phi(z)]$$

Using the reparameterization trick $z = g_\phi(\epsilon)$, we can estimate the gradient of this objective with respect to $\phi$:

$$\nabla_\phi \operatorname{ELBO}(\phi) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \left[\ln p(x, g_\phi(\epsilon_l)) - \ln q_\phi(g_\phi(\epsilon_l))\right], \qquad \epsilon_l \sim p(\epsilon)$$
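As a minimal sketch of this procedure (assuming a one-dimensional Gaussian variational family and a toy unnormalized target, both chosen only for illustration), the following fits $q_\phi$ by stochastic gradient ascent on a reparameterized ELBO estimate:

```python
# Stochastic gradient ascent on a reparameterized ELBO, fitting a 1-D Gaussian q_phi
# to the toy unnormalized target p(x, z) proportional to exp(-(z - 3)^2 / 2).
import torch

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

def log_joint(z):                       # ln p(x, z), up to an additive constant
    return -0.5 * (z - 3.0)**2

for step in range(2000):
    eps = torch.randn(64)               # eps_l ~ N(0, 1)
    z = mu + log_sigma.exp() * eps      # z_l = g_phi(eps_l)
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z)
    elbo = (log_joint(z) - log_q).mean()
    opt.zero_grad()
    (-elbo).backward()                  # gradients flow through z via the trick
    opt.step()

print(mu.item(), log_sigma.exp().item())   # approaches mu ~ 3, sigma ~ 1
```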
Dropout
The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. The original dropout can be reparameterized with Bernoulli distributions:

$$y = (W \odot \epsilon)x, \qquad \epsilon_{ij} \sim \operatorname{Bernoulli}(p_{ij})$$

where $W$ is the weight matrix, $x$ is the input, and $p_{ij}$ are the (fixed) dropout rates.
More generally, other distributions can be used than the Bernoulli distribution, such as Gaussian noise:

$$y_i = \mu_i + \sigma_i \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$

where $\mu_i = \sum_j w_{ij} x_j$ and $\sigma_i^2 = \alpha \sum_j w_{ij}^2 x_j^2$, with $\mu_i$ and $\sigma_i^2$ being the mean and variance of the $i$-th output neuron. The reparameterization trick can be applied to all such cases, resulting in the variational dropout method.[7]
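A minimal sketch of the Gaussian-noise case on a linear layer is given below; it illustrates the reparameterized noise only, not the full variational dropout method of [7], and the module name, the fixed value of $\alpha$, and the choice not to learn $\alpha$ are assumptions of the example.

```python
# Gaussian dropout noise in reparameterized form: y_i = mu_i + sigma_i * eps_i,
# with mu_i = sum_j w_ij x_j, sigma_i^2 = alpha * sum_j w_ij^2 x_j^2, eps_i ~ N(0, 1).
import torch
import torch.nn as nn

class GaussianDropoutLinear(nn.Module):
    def __init__(self, d_in, d_out, alpha=0.1):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.alpha = alpha                   # noise variance ratio; kept fixed in this sketch

    def forward(self, x):
        mu = self.linear(x)                               # mu_i = sum_j w_ij x_j (+ bias)
        if not self.training:
            return mu
        w = self.linear.weight                            # shape (d_out, d_in)
        var = self.alpha * (x**2) @ (w**2).t()            # sigma_i^2 = alpha * sum_j w_ij^2 x_j^2
        eps = torch.randn_like(mu)                        # eps_i ~ N(0, 1), fixed distribution
        return mu + var.sqrt() * eps                      # y_i = mu_i + sigma_i * eps_i

layer = GaussianDropoutLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()   # gradients reach the weights through both mu and sigma
```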
References
Maddison, Chris J.; Mnih, Andriy; Teh, Yee Whye (2017-03-05). "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables". arXiv:1611.00712 [cs.LG].