Gaussian processes, unlike most other exact Bayesian models, have the notable property that we can carry out full Bayesian model averaging analytically.
However, what we gain in mathematical tractability we lose in computational efficiency: exact GP inference scales as $\mathcal{O}(N^3)$ in time and $\mathcal{O}(N^2)$ in memory for $N$ training points.
At an intuitive level, we can see where the problem lies through an analogy to
communication theory: we can think of the data as serving to communicate what
the underlying function is. If the underlying signal is communicated with
redundant datapoints (in other words, it is over-sampled
0: as in the
Nyquist-Shannon sampling theorem, which says that the sampling rate (or number
of datapoints for a given volume of the input space) must be at least twice the
characteristic lengthscale of the signal to ensure there is no aliasing
distortion.
0[0]
), then this is a rather inefficient scheme. For instance,
consider the following data-generating function
1: inspired by
Snelson et al. (2005)
1[1]
with 2,000 noisy observations
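Since the exact data-generating function is not reproduced here, the following minimal sketch uses a hypothetical stand-in (a sinusoid plus a linear trend), with the seed, cluster sizes, and noise level all being illustrative assumptions; its only purpose is to make the over-sampled, unevenly covered setting concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical stand-in for the data-generating function used in the text;
    # any smooth 1-D function sampled unevenly would illustrate the same point.
    return np.sin(3.0 * x) + 0.3 * x

# 2,000 noisy observations, deliberately clustered so the domain is covered unevenly.
x_train = np.concatenate([
    rng.uniform(-3.0, -1.0, size=1400),   # heavily over-sampled region
    rng.uniform(1.0, 3.0, size=600),      # more sparsely sampled region
])
y_train = f(x_train) + 0.1 * rng.normal(size=x_train.shape)
```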
Fitting an exact GP using all these redundant datapoints, which don’t evenly cover the domain, doesn’t afford us any meaningful reduction in uncertainty where the function is already well specified. Inference is unnecessarily expensive and inefficient.
Instead, we might imagine a more efficient scheme where we somehow sub-sample or otherwise summarise this data using a low number of points, and then treat those as the new training datapoints.
We could also use the numerous datapoints to learn a simple, approximate posterior process directly, and then use this computationally cheap approximation for prediction.
Low-Rank Matrix Approximations
Before considering the approximation methods themselves, we will briefly cover low-rank approximations to matrices, which will be the principle behind most of the computational savings.
A well-known result, which we will use without proof, is that the optimal rank-$M$ approximation to a symmetric, positive semi-definite matrix $K \in \mathbb{R}^{N \times N}$ (in the Frobenius and spectral norms) is given by its truncated eigendecomposition

$$\tilde{K} = U_M \Lambda_M U_M^\top,$$

where $\Lambda_M$ contains the $M$ largest eigenvalues of $K$ and $U_M$ the corresponding eigenvectors. Storing this approximation only costs us the $N \times M$ matrix $U_M$ and the $M$ eigenvalues, i.e. $\mathcal{O}(NM)$ memory. More importantly, when $\tilde{K}$ appears inside a matrix inverse, as in the GP predictive equations, we can apply the matrix inversion (Woodbury) lemma to $(\tilde{K} + \sigma^2 I)^{-1}$, which can be done in $\mathcal{O}(NM^2)$ rather than $\mathcal{O}(N^3)$; the same trick applies to the log-determinant, which now also takes $\mathcal{O}(NM^2)$.
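As a minimal numerical sketch of the above (assuming NumPy and a generic positive semi-definite matrix standing in for the Gram matrix), the snippet below forms the truncated eigendecomposition and uses the Woodbury identity to invert $(\tilde{K} + \sigma^2 I)$ without solving an $N \times N$ system directly. Note that the eigendecomposition itself still costs $\mathcal{O}(N^3)$, which is exactly what motivates the Nyström alternative discussed next.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 20
sigma2 = 0.1

# A symmetric PSD matrix standing in for K; in practice this would be the
# Gram matrix of the training inputs under the chosen covariance function.
A = rng.normal(size=(N, N))
K = A @ A.T / N

# Optimal rank-M approximation: keep the M largest eigenpairs.
eigvals, eigvecs = np.linalg.eigh(K)        # ascending order
U, lam = eigvecs[:, -M:], eigvals[-M:]
K_approx = (U * lam) @ U.T                  # only U and lam need storing: O(NM)

# Woodbury identity: invert (K_approx + sigma2*I) via an M x M solve.
inner = np.diag(1.0 / lam) + (U.T @ U) / sigma2                       # M x M
inv = np.eye(N) / sigma2 - (U / sigma2) @ np.linalg.solve(inner, U.T / sigma2)

# Check against the direct O(N^3) inverse.
direct = np.linalg.inv(K_approx + sigma2 * np.eye(N))
print(np.max(np.abs(inv - direct)))          # agrees to numerical precision
```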
Unfortunately, computing the SVD (or eigendecomposition) required to write down this approximation itself costs $\mathcal{O}(N^3)$, which defeats the purpose. The Nyström approximation instead considers a subset of $M \ll N$ inputs, with kernel matrices $K_{MM}$ (between the subset points) and $K_{NM}$ (between all points and the subset), and approximates

$$K \approx K_{NM} K_{MM}^{-1} K_{NM}^\top,$$

where the right-hand side again has rank at most $M$ and can be stored and manipulated at the same $\mathcal{O}(NM)$ and $\mathcal{O}(NM^2)$ costs as above. To generalise the above to finding low-rank approximations of Gram matrices for arbitrary inputs $a$ and $b$, we will use the shorthand

$$Q_{ab} \triangleq K_{au} K_{uu}^{-1} K_{ub},$$

where $u$ denotes the chosen subset of $M$ points.
The approximation methods discussed below will present different schemes for selecting the $M$ points used to build these low-rank approximations, and for deciding where the approximation is applied.
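A minimal sketch of the Nyström construction, assuming one-dimensional inputs and a squared-exponential kernel (both illustrative choices), is:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential covariance on 1-D inputs; any kernel would do here.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=1000)      # N training inputs
z = np.linspace(-3, 3, 20)             # M chosen subset / inducing inputs

K_ff = rbf_kernel(x, x)                # N x N (only built here to measure the error)
K_fu = rbf_kernel(x, z)                # N x M
K_uu = rbf_kernel(z, z)                # M x M

# Nystrom low-rank approximation: Q_ff = K_fu K_uu^{-1} K_uf.
jitter = 1e-8 * np.eye(len(z))
Q_ff = K_fu @ np.linalg.solve(K_uu + jitter, K_fu.T)

print(np.max(np.abs(K_ff - Q_ff)))     # small if the M points cover the inputs well
```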
Inducing Point Methods
In this section, we consider approximation methods which aim to summarise the large, possibly redundant training dataset using a small set of points which we can condition on directly instead of the data. This provides us with computationally cheaper inference, with the downside that we risk overfitting the training data or incurring inaccuracies in both the predictive mean and uncertainty.
An alternative way to understand these methods is that we approximate the GP prior. Consider a generic GP prior over the latent function values $\mathbf{f}$ at the training inputs and $\mathbf{f}_*$ at the test inputs. The exact prior (i.e. making no approximation) is

$$p(\mathbf{f}, \mathbf{f}_*) = \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} K_{ff} & K_{f*} \\ K_{*f} & K_{**} \end{bmatrix} \right),$$
which follows from the marginalisation property of a partitioned Gaussian. Note that henceforth, we will make the common modelling assumption that we have a zero-mean GP prior.
The key assumption in this class of approximation methods is that $\mathbf{f}$ and $\mathbf{f}_*$ are conditionally independent given a small set of $M$ inducing variables $\mathbf{u}$ (the latent function values at the inducing inputs $\mathbf{Z}$):

$$p(\mathbf{f}, \mathbf{f}_*) \approx q(\mathbf{f}, \mathbf{f}_*) = \int q(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{f}_* \mid \mathbf{u})\, p(\mathbf{u})\, \mathrm{d}\mathbf{u}, \qquad p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, K_{uu}).$$

We can now write the exact conditional for the training points as

$$p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\!\left( K_{fu} K_{uu}^{-1} \mathbf{u},\; K_{ff} - Q_{ff} \right),$$

and the test conditional as

$$p(\mathbf{f}_* \mid \mathbf{u}) = \mathcal{N}\!\left( K_{*u} K_{uu}^{-1} \mathbf{u},\; K_{**} - Q_{**} \right).$$
The following sections give an overview of variations on the above scheme, with modifications introduced to provide computational advantages. Not all of these schemes are equally wise, with the final FITC scheme perhaps being the wisest and most commonly used in practice.
For each scheme, we form the approximate Gaussian prior $q(\mathbf{f}, \mathbf{f}_*)$ by modifying the two conditionals above, and then apply the usual GP prediction equations to it.
Further, each scheme has an initial training cost of $\mathcal{O}(NM^2)$, after which predictions cost $\mathcal{O}(M)$ per test point for the mean and $\mathcal{O}(M^2)$ for the variance.
Deterministic Inducing Conditional
In this approximation scheme (Quiñonero-Candela & Rasmussen, 2005), sometimes also called the subset of regressors approximation (Smola et al., 2000), we set the covariance of each of the inducing conditionals to zero, making them deterministic:

$$q(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\!\left( K_{fu} K_{uu}^{-1} \mathbf{u},\; \mathbf{0} \right), \qquad q(\mathbf{f}_* \mid \mathbf{u}) = \mathcal{N}\!\left( K_{*u} K_{uu}^{-1} \mathbf{u},\; \mathbf{0} \right).$$

The prior approximation comes to

$$q(\mathbf{f}, \mathbf{f}_*) = \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} Q_{ff} & Q_{f*} \\ Q_{*f} & Q_{**} \end{bmatrix} \right),$$

and the predictive distribution follows from the regular GP prediction equations as

$$q(\mathbf{f}_* \mid \mathbf{y}) = \mathcal{N}\!\left( \sigma^{-2} K_{*u} \Sigma K_{uf} \mathbf{y},\; K_{*u} \Sigma K_{u*} \right),$$

where we have defined $\Sigma = \left( K_{uu} + \sigma^{-2} K_{uf} K_{fu} \right)^{-1}$.
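The sketch below implements these DIC/SoR predictive equations directly, assuming a generic `kernel(a, b)` helper such as the illustrative `rbf_kernel` defined earlier; the function names and signatures are assumptions, not a reference implementation.

```python
import numpy as np

def dic_predict(x_star, x, y, z, kernel, sigma2):
    """DIC / SoR predictive mean and variance (a sketch; notation as in the text)."""
    K_uf = kernel(z, x)                                  # M x N
    K_uu = kernel(z, z)                                  # M x M
    K_su = kernel(x_star, z)                             # N* x M

    # Sigma^{-1} = K_uu + sigma^{-2} K_uf K_fu; the only M x M system to solve.
    Sigma_inv = K_uu + K_uf @ K_uf.T / sigma2
    A = np.linalg.solve(Sigma_inv, K_uf @ y) / sigma2    # Sigma K_uf y / sigma^2
    B = np.linalg.solve(Sigma_inv, K_su.T)               # Sigma K_us

    mean = K_su @ A                                      # sigma^{-2} K_*u Sigma K_uf y
    var = np.sum(K_su * B.T, axis=1)                     # diag(K_*u Sigma K_u*)
    return mean, var
```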
Using the DIC approximation on our toy regression problem yields the following predictions, where we also mark the inducing input locations:
With this toy example, the approximation in the vicinity of the data is actually rather good. However, in realistic datasets, particularly with high-dimensional data, it is not as simple to linearly space the inducing points and achieve good coverage. In particular, the DIC approximation suffers from the predictive uncertainty going to zero away from the inducing points: the approximate prior is degenerate, so far from the inducing inputs there is no prior variance left and the model becomes absurdly overconfident.
Deterministic Training Conditional
To remedy the issues with the predictive uncertainty of the DIC approximation scheme away from the inducing points, the deterministic training conditional (DTC) approximation keeps the deterministic training conditional

$$q(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\!\left( K_{fu} K_{uu}^{-1} \mathbf{u},\; \mathbf{0} \right),$$

which is identical to the DIC case; however, we allow the covariance of the test conditional to be exact:

$$q(\mathbf{f}_* \mid \mathbf{u}) = p(\mathbf{f}_* \mid \mathbf{u}) = \mathcal{N}\!\left( K_{*u} K_{uu}^{-1} \mathbf{u},\; K_{**} - Q_{**} \right).$$

The joint prior is now

$$q(\mathbf{f}, \mathbf{f}_*) = \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} Q_{ff} & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix} \right),$$

and the predictive distribution is:

$$q(\mathbf{f}_* \mid \mathbf{y}) = \mathcal{N}\!\left( \sigma^{-2} K_{*u} \Sigma K_{uf} \mathbf{y},\; K_{**} - Q_{**} + K_{*u} \Sigma K_{u*} \right),$$

where we have used the same abbreviation $\Sigma = \left( K_{uu} + \sigma^{-2} K_{uf} K_{fu} \right)^{-1}$ as in the DIC case.
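Reusing the `dic_predict` sketch from above, the only change DTC makes is to the predictive variance, where the degenerate prior term $Q_{**}$ is replaced by the exact $K_{**}$; the snippet below is again only an illustrative sketch.

```python
def dtc_predict(x_star, x, y, z, kernel, sigma2):
    """DTC prediction: same mean as DIC, exact test covariance (a sketch)."""
    mean, q_var = dic_predict(x_star, x, y, z, kernel, sigma2)

    K_su = kernel(x_star, z)
    K_uu = kernel(z, z)
    k_ss = np.diag(kernel(x_star, x_star))                          # diag(K_**)
    q_ss = np.sum(K_su * np.linalg.solve(K_uu, K_su.T).T, axis=1)   # diag(Q_**)

    # Swap the degenerate prior variance Q_** for the exact K_**.
    var = k_ss - q_ss + q_var
    return mean, var
```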
Plotting the predictions with linearly spaced inducing inputs shows that we have indeed fixed the understated predictive uncertainty:
and repeating the above with a random set of inducing locations gives the following:
We can however observe undue fluctuations in the predictions, particularly where the inducing locations do not line up well with the training data.
Fully Independent Training Conditional
While it would be appealing to rectify the issues with the DTC approach by removing the determinism from the training conditional, this would blow up the computational cost. As a compromise, we can relax the complete determinism assumption by imposing a diagonal structure on the training conditional's covariance matrix, which keeps it easy to invert. This fully independent training conditional (FITC) approximation (Snelson & Ghahramani, 2005) corresponds to a factorisation of the training conditional in which the training points are conditionally independent of each other given the inducing variables:

$$q(\mathbf{f} \mid \mathbf{u}) = \prod_{n=1}^{N} p(f_n \mid \mathbf{u}) = \mathcal{N}\!\left( K_{fu} K_{uu}^{-1} \mathbf{u},\; \mathrm{diag}\!\left[ K_{ff} - Q_{ff} \right] \right).$$
The test conditional is exactly as in DTC:

$$q(\mathbf{f}_* \mid \mathbf{u}) = p(\mathbf{f}_* \mid \mathbf{u}) = \mathcal{N}\!\left( K_{*u} K_{uu}^{-1} \mathbf{u},\; K_{**} - Q_{**} \right).$$

This leaves us with the joint prior of the form

$$q(\mathbf{f}, \mathbf{f}_*) = \mathcal{N}\!\left( \mathbf{0},\; \begin{bmatrix} Q_{ff} - \mathrm{diag}\!\left[ Q_{ff} - K_{ff} \right] & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix} \right).$$

The predictive distribution for a single test point is:

$$q(f_* \mid \mathbf{y}) = \mathcal{N}\!\left( K_{*u} \Sigma K_{uf} \Lambda^{-1} \mathbf{y},\; K_{**} - Q_{**} + K_{*u} \Sigma K_{u*} \right),$$

where we have defined $\Lambda = \mathrm{diag}\!\left[ K_{ff} - Q_{ff} + \sigma^2 I \right]$ and $\Sigma = \left( K_{uu} + K_{uf} \Lambda^{-1} K_{fu} \right)^{-1}$.
For a batch of test points, we simply treat them as conditionally independent and take the product of the single-point predictive distribution above over the test points.
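A sketch of the FITC predictive equations above, again assuming the illustrative `kernel(a, b)` helper from before. The diagonal $\Lambda$ is stored as a vector, so only $M \times M$ systems are ever solved.

```python
def fitc_predict(x_star, x, y, z, kernel, sigma2):
    """FITC prediction (a sketch following the equations above)."""
    K_uf = kernel(z, x)                                   # M x N
    K_uu = kernel(z, z)                                   # M x M
    K_su = kernel(x_star, z)                              # N* x M
    k_ff = np.diag(kernel(x, x))                          # diag(K_ff); in practice compute only the diagonal
    k_ss = np.diag(kernel(x_star, x_star))                # diag(K_**)

    # Lambda = diag[K_ff - Q_ff] + sigma^2 I, stored as a length-N vector.
    q_ff = np.sum(K_uf.T * np.linalg.solve(K_uu, K_uf).T, axis=1)
    lam = k_ff - q_ff + sigma2

    Sigma_inv = K_uu + (K_uf / lam) @ K_uf.T              # K_uu + K_uf Lambda^{-1} K_fu
    mean = K_su @ np.linalg.solve(Sigma_inv, K_uf @ (y / lam))

    q_ss = np.sum(K_su * np.linalg.solve(K_uu, K_su.T).T, axis=1)
    var = k_ss - q_ss + np.sum(K_su * np.linalg.solve(Sigma_inv, K_su.T).T, axis=1)
    return mean, var
```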
Repeating the prediction task for our toy dataset, we can observe good predictive accuracy, with reasonable uncertainty throughout:
In the following we observe that inducing points placed far from the training data no longer distort the predictions as severely:
Variational Free Energy
The approach of the inducing point methods described above was to approximate the GP prior, and then apply the usual GP predictive equations to make predictions. The approach described in this section will take the other approach: keeping the exact GP prior, while approximating the posterior.
Variational Objective
We will derive the variational free energy (VFE) approach of Titsias (2009) using finite index sets, considering the points at the training inputs $\mathbf{X}$, at the inducing inputs $\mathbf{Z}$, and at an arbitrary set of additional inputs. Denote as $\mathbf{f}$ and $\mathbf{u}$ the latent function values at the training and inducing inputs respectively, as $\mathbf{f}_{\neq}$ those at the additional inputs, and as $\mathbf{y}$ the noisy observations at the training inputs.
In the VFE approach, we attempt to optimise the parameters of an approximate posterior $q$ (including, if desired, the inducing input locations $\mathbf{Z}$) so as to make it as close as possible, in KL divergence, to the exact posterior.
The key assumption, that will result in computational savings, is that we factorise the approximate posterior as

$$q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u}),$$

where $p(\mathbf{f} \mid \mathbf{u})$ is the exact prior conditional and $q(\mathbf{u})$ is a free-form variational distribution.
Since the only connection to the training observations $\mathbf{y}$ is then through $q(\mathbf{u})$, the inducing variables act as a summary of the data.
Using Bayes' rule, we can inspect the form of the posterior over the finite-dimensional marginal:

$$p(\mathbf{f}_{\neq}, \mathbf{f}, \mathbf{u} \mid \mathbf{y}) = p(\mathbf{f}_{\neq} \mid \mathbf{f}, \mathbf{u})\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y}),$$

where we have suggestively factored out the prior conditional $p(\mathbf{f}_{\neq} \mid \mathbf{f}, \mathbf{u})$, which does not depend on the observations.
This is a convenient, and perhaps even intuitive result: it states that regardless of how many additional points we include in the index set, their posterior is obtained by conditioning the prior on $\mathbf{f}$ and $\mathbf{u}$; the data only ever enters through the finite-dimensional marginal $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$.
As our objective, we now seek to minimise the KL divergence between the simplified posteriors (omitting the additional points $\mathbf{f}_{\neq}$, which, as argued above, do not affect the result):

$$\mathrm{KL}\!\left[ q(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y}) \right] = \log p(\mathbf{y}) - \mathbb{E}_{q(\mathbf{f}, \mathbf{u})}\!\left[ \log \frac{p(\mathbf{y}, \mathbf{f}, \mathbf{u})}{q(\mathbf{f}, \mathbf{u})} \right],$$

where to get to the right-hand side we have expanded the posterior using Bayes' rule and pulled the (constant) log marginal likelihood out of the expectation. Since the KL divergence on the left-hand side is non-negative, the expectation on the right is a lower bound on $\log p(\mathbf{y})$, and minimising the KL divergence is equivalent to maximising this bound,
giving us the canonical variational free energy or ELBO objective that we want to maximise w.r.t. the variational parameters.
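For reference, the bound takes the standard sparse-GP form below (a restatement under the $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})$ factorisation; the cancellation of $p(\mathbf{f} \mid \mathbf{u})$ inside the expectation is what makes the bound tractable):

```latex
% Standard form of the sparse-GP variational lower bound (ELBO), as in Titsias (2009).
\log p(\mathbf{y})
  \;\geq\; \mathcal{L}(q)
  \;=\; \mathbb{E}_{q(\mathbf{f})}\!\left[ \log p(\mathbf{y} \mid \mathbf{f}) \right]
  \;-\; \mathrm{KL}\!\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right],
\qquad
q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u}.
```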
Inspecting the distributions involved, we note that
- the prior, $p(\mathbf{u}) = \mathcal{N}(\mathbf{0}, K_{uu})$, is of course Gaussian distributed.
- it is common to select the approximate posterior $q(\mathbf{u})$ to be Gaussian, with variational parameters $\mathbf{m}$ and $\mathbf{S}$.
- in some cases, for instance in GP regression, the likelihood $p(\mathbf{y} \mid \mathbf{f})$ may also be Gaussian. In many other cases however, for instance in classification problems or when a Gaussian-noise assumption is unreasonable, the likelihood will not be Gaussian.
If all three distributions are Gaussian, then we can maximise the VFE objective above in closed form.
The variational objective decomposes into an expected log-likelihood term, which rewards fitting the data, and a KL term between $q(\mathbf{u})$ and the prior $p(\mathbf{u})$, which acts as a regulariser.
Before we can write the log-likelihood term, we need the induced posterior over latent function values at the training points,

$$q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u} = \mathcal{N}\!\left( K_{fu} K_{uu}^{-1} \mathbf{m},\; K_{ff} - Q_{ff} + K_{fu} K_{uu}^{-1} \mathbf{S} K_{uu}^{-1} K_{uf} \right).$$

To keep things concise, let us use a zero-mean GP prior (i.e. $\mu(\mathbf{x}) \equiv 0$).
When the likelihood is Gaussian, $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I)$, the expected log-likelihood can be evaluated in closed form:

$$\mathbb{E}_{q(\mathbf{f})}\!\left[ \log p(\mathbf{y} \mid \mathbf{f}) \right] = \log \mathcal{N}\!\left( \mathbf{y} \mid K_{fu} K_{uu}^{-1} \mathbf{m},\, \sigma^2 I \right) - \frac{1}{2\sigma^2} \mathrm{tr}\!\left( K_{ff} - Q_{ff} + K_{fu} K_{uu}^{-1} \mathbf{S} K_{uu}^{-1} K_{uf} \right),$$

where we have used the standard result for the expectation of a quadratic form under a Gaussian. Hence, the concrete form of the VFE objective / ELBO with all the distributions substituted in is:

$$\mathcal{L}(\mathbf{m}, \mathbf{S}) = \log \mathcal{N}\!\left( \mathbf{y} \mid K_{fu} K_{uu}^{-1} \mathbf{m},\, \sigma^2 I \right) - \frac{1}{2\sigma^2} \mathrm{tr}\!\left( K_{ff} - Q_{ff} + K_{fu} K_{uu}^{-1} \mathbf{S} K_{uu}^{-1} K_{uf} \right) - \mathrm{KL}\!\left[ q(\mathbf{u}) \,\|\, p(\mathbf{u}) \right].$$
We can now compute the gradients of the above with respect to the variational parameters, set them to 0, and solve for the parameters $\mathbf{m}$ and $\mathbf{S}$ in closed form. For this it is convenient to use the standard Gaussian identities for the KL divergence and for expectations of quadratic forms, and we apply these with the factorised likelihood

$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{n=1}^{N} \mathcal{N}\!\left( y_n \mid f_n, \sigma^2 \right),$$

where $\sigma^2$ is the observation-noise variance.
Setting the derivatives above to $0$ and solving yields the optimal variational distribution

$$q^*(\mathbf{u}) = \mathcal{N}\!\left( \sigma^{-2} K_{uu} \Sigma K_{uf} \mathbf{y},\; K_{uu} \Sigma K_{uu} \right), \qquad \Sigma = \left( K_{uu} + \sigma^{-2} K_{uf} K_{fu} \right)^{-1}.$$

Thus, when the likelihood is Gaussian and the GP prior mean is zero, no iterative optimisation is needed: the optimal $q(\mathbf{u})$ is available in closed form.
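As a sketch, this closed-form solution requires only a couple of $M \times M$ linear solves; the helper below assumes the same illustrative `kernel(a, b)` signature used in the earlier snippets.

```python
def vfe_optimal_qu(x, y, z, kernel, sigma2):
    """Closed-form optimal q(u) = N(m, S) for a Gaussian likelihood
    (a sketch of the Titsias (2009) result, in the notation of the text)."""
    K_uf = kernel(z, x)                                   # M x N
    K_uu = kernel(z, z)                                   # M x M

    Sigma_inv = K_uu + K_uf @ K_uf.T / sigma2             # (K_uu + sigma^{-2} K_uf K_fu)
    m = K_uu @ np.linalg.solve(Sigma_inv, K_uf @ y) / sigma2
    S = K_uu @ np.linalg.solve(Sigma_inv, K_uu)
    return m, S
```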
If the likelihood is not Gaussian, we can maximise the VFE objective using stochastic optimisation methods.
Posterior Predictive
With the (approximate) posterior $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$ in hand, the predictive distribution at the test inputs follows by conditioning the prior on $\mathbf{u}$ and marginalising:

$$q(\mathbf{f}_*) = \int p(\mathbf{f}_* \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u} = \mathcal{N}\!\left( K_{*u} K_{uu}^{-1} \mathbf{m},\; K_{**} - Q_{**} + K_{*u} K_{uu}^{-1} \mathbf{S} K_{uu}^{-1} K_{u*} \right).$$
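Putting the pieces together, a sketch of this predictive distribution, reusing `vfe_optimal_qu` from above (again with the illustrative kernel helper):

```python
def vfe_predict(x_star, x, y, z, kernel, sigma2):
    """Predictive mean/variance under the VFE approximate posterior (a sketch)."""
    m, S = vfe_optimal_qu(x, y, z, kernel, sigma2)

    K_uu = kernel(z, z)
    K_su = kernel(x_star, z)
    k_ss = np.diag(kernel(x_star, x_star))                # diag(K_**)

    A = np.linalg.solve(K_uu, K_su.T)                     # K_uu^{-1} K_u*,  M x N*
    mean = A.T @ m
    var = k_ss - np.sum(K_su * A.T, axis=1) + np.sum(A * (S @ A), axis=0)
    return mean, var
```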
Applying this to our toy problem results in good predictions: