Scaling Gaussian Processes

An overview of approximation methods and computational techniques for scaling Gaussian processes to large, high-dimensional datasets; covering training conditionals and variational approximations.

October 8, 2023

London, UK

Gaussian processes, unlike most other exact Bayesian models, have the notable property that we can carry out full Bayesian model averaging analytically.

However, what we gain in mathematical tractability we lose in computational efficiency—with exact GP inference scaling as \(O(N^{3})\) in the number of parameters, and the memory cost scaling as \(O(N^{2})\). This is clearly unsuitable for even modestly sized datasets.

At an intuitive level, we can see where the problem lies through an analogy to communication theory: we can think of the data as serving to communicate what the underlying function is. If the underlying signal is communicated with redundant datapoints (in other words, it is over-sampled 0: as in the Nyquist-Shannon sampling theorem, which says that the sampling rate (or number of datapoints for a given volume of the input space) must be at least twice the characteristic lengthscale of the signal to ensure there is no aliasing distortion. 0[0] ), then this is a rather inefficient scheme. For instance, consider the following data-generating function 1: inspired by Snelson et al. (2005) 1[1] with 2,000 noisy observations \(\ervy_{i} = f(x_{i}) + \epsilon\) where the inputs are themselvves sampled from \(x_{i} \sim \gN(0, 1)\):

Fitting an exact GP using all these redundant datapoints, which don’t evenly cover the domain, doesn’t afford us any meaningful reduction in uncertainty where the function is already well specified. Inference is unnecessarily expensive and inefficient.

Instead, we might imagine a more efficient scheme where we somehow sub-sample or otherwise summarise this data using a low number of points, and then treat those as the new training datapoints.

We could also use the numerous datapoints to learn a simple, approximate posterior process directly, and then use this computationally cheap approximation for prediction.

Before considering the approximation methods themselves, we will briefly cover low-rank approximations to matrices, which will be the principle behind most of the computational savings.

A well-known result, which we will use without proof, is that a low-rank approximation to a symmetric, \(N \times N\) matrix can be derived from a singular value decomposition as

\[\begin{equation} \label{eq:low_rank_approx} \rmK_{XX} \approx \rmU \mLambda \rmU^{\top}, \end{equation} \]

where \(\mLambda\) is a diagonal matrix containing the \(M \ll N\) leading eigenvalues, and \(\rmU\) is a matrix containing the corresponding \(M\) eigenvectors, each of size \(N\).

Storing this approximation only costs us the \(N \times M\) matrix \(\rmU\), as well as the \(M\) leading eigenvalues. Further, performing the matrix operations required for GP inference comes with reduced computational cost. For instance, we can perform the matrix inversion required when calculating the posterior predictive using the matrix inversion lemma 2: see my derivation and discussion of this in my previous article 2[2] as:

\[\begin{equation} (\rmK_{XX} + \sigma^{2}\rmI_{N})^{-1} \approx \sigma^{-2}\rmI_{N} + \sigma^{-2}\rmU(\sigma^{2}\mLambda^{-1} + \rmU^{\top}\rmU)^{-1}\rmU^{\top}, \end{equation} \]

which can be done in \(O(NM^{2})\) time. We can also re-write the determinant required when computing the marginal likelihood using the matrix determinant lemma as

\[\begin{equation} \det (K_{XX} + \sigma^{2}\rmI_{N}) \approx \det(\mLambda)\det(\sigma^{2}\mLambda^{-1} + \rmU^{\top}\rmU), \end{equation} \]

which now also takes \(O(NM^{2})\) time. Crucially, since we get to choose the size of \(M\) (and, by extension, the quality of the approximation) these operations now only grow linearly in the size of the training dataset.

Unfortunately, computing the SVD required to write Equation \(\ref{eq:low_rank_approx}\) would take \(O(N^{3})\) time, which defeats the point of the above.

The Nyström approximation instead considers a subset \(\rmZ\) of \(M\) datapoints, and allows us to find a rank-\(M\) approximation to the full rank-\(N\) Gram matrix with only \(O(M^{3})\) up-front cost as

\[\begin{equation} \label{eq:nystrom_approx} \rmK_{XX} \approx \rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX}, \end{equation} \]

where \(\rmK_{XX} = k(\rmX, \rmX)\), \(\rmK_{XZ} \rmK_{ZX}^{\top} = k(\rmX, \rmZ)\) and \(\rmK_{ZZ} = k(\rmZ, \rmZ)\).

To generalise the above to finding low-rank approximations of Gram matrices for arbitrary inputs \(\rmA\) and \(\rmB\), we use the notation

\[\begin{equation} \label{eq:general_nystrom} \rmQ_{AB} \doteq \rmK_{AZ}\rmK_{ZZ}^{-1}\rmK_{ZB}. \end{equation} \]

The approximation methods discussed below will present different schemes for selecing the \(M\)-dimensional matrix of points \(\rmZ\); which needn’t be columns of the training data but could rather be any set of representative vectors which approximate the Gram matrix well when used in Equation \(\ref{eq:general_nystrom}\).

In this section, we consider approximation methods which aim to summarize the large, possibly redundant training dataset using a small set of points which we can condition on directly instead of the data. This provides us with computationally cheaper inference, with the downside that we now risk potentially overfitting the training data or incurring inaccuracies in both the predictive mean and uncertainty.

An alternative way to understand these methods is that we approximate the GP prior. Consider a generic GP prior over the latent \(f \sim \gG\gP\big(m(\cdot), k(\cdot, \cdot)\big)\), \(N\) observed input locations \(\rmX \in \R^{N \times D}\) an arbitrary number of test points \(\rmX_{\star}\) as well as a set of \(M\) inducing points of pseudo-inputs \(\rmZ\) which summarise the observations. The corresponding latent function values are \(\rvf_{X} = f(\rmX)\), \(\rvf_{\star} = f(\rmX_{\star})\) and \(\rvf_{Z} = f(\rmZ)\).

The exact prior (i.e. making no approximation) is

\[\begin{align} p(\rvf_{\star}, \rvf_{X}) &= \int p(\rvf_{\star}, \rvf_{X}, \rvf_{Z})d\rvf_{Z} = \int p(\rvf_{\star}, \rvf_{X} \vert \rvf_{Z})p(\rvf_{Z})d\rvf_{Z} \\ &= \gN\left(\begin{bmatrix}\rvm_{X} \\ \rvm_{\star}\end{bmatrix}, \begin{bmatrix}\rmK_{XX} & \rmK_{X\star} \\ \rmK_{\star X} & \rmK_{XX}\end{bmatrix}\right), \end{align} \]

which follows from the marginalisation property of a partitioned Gaussian. Note that henceforth, we will make the common modelling assumption that we have a zero-mean GP prior.

The key assumption in this class of approximation methods is that \(\rvf_{Z}\) acts as a sufficient statistic (i.e. a ‘complete summary’) for the data (Quinonero-Candela et al., 2005), allowing us to predict \(\rvf_{\star}\) using \(\rvf_{Z}\) rather than \(\rvf_{X}\). Put otherwise, the GP prior approximation is

\[\begin{align} p(\rvf_{\star}, \rvf_{X}, \rvf_{Z}) &= p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z})p(\rvf_{X} \vert \rvf_{Z})p(\rvf_{Z}) \\ &\approx p(\rvf_{\star} \vert \rvf_{Z})p(\rvf_{X} \vert \rvf_{Z})p(\rvf_{Z}). \end{align} \]

We can now interpret the conditionals for \(\rvf_{X}\) and \(\rvf_{\star}\) as a case of noiseless GP regression, using \(\rvf_{Z}\) as the training data. That is, we form the training conditional as

\[\begin{equation} p(\rvf_{X} \vert \rvf_{Z}) = \gN(\rvf_{X} \vert \rmK_{XZ}\rmK_{ZZ}^{-1}\rvf_{Z}, \rmK_{XX} - \underbrace{\rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX}}_{\rmQ_{XX}}) \end{equation} \]

and the test conditional as

\[\begin{equation} p(\rvf_{\star} \vert \rvf_{Z}) = \gN(\rvf_{\star} \vert \rmK_{\star Z}\rmK_{ZZ}^{-1}\rvf_{Z}, \rmK_{\star \star} - \underbrace{\rmK_{\star Z}\rmK_{ZZ}^{-1}\rmK_{Z\star}}_{\rmQ_{\star \star}}). \end{equation} \]

The following sections give an overview of variations on the above scheme, with modifications introduced to provide computational advantages. Not all of these schemes are equally wise, with the final FITC scheme perhaps being the wisest and most commonly used in practice.

For each scheme, we form the approximate Gaussian prior \(q(\rvf_{X}, \rvf_{\star}) = \int q(\rvf_{X} \vert \rvf_{Z})q(\rvf_{\star} \vert \rvf_{Z})p(\rvf_{Z}) d\rvf_{Z}\), after which we apply the usual Gaussian conditioning equations (see this section of my previous article as a reference.)

Further, each scheme has an initial training cost of \(O(M^{3} + NM^{2})\), and subsequently each test case incurrs an \(O(M)\) cost to predict the mean, and \(O(M^{2})\) cost for the predictive variance.

In this approximation scheme (Quinonero-Candela et al., 2005) (sometimes also called the subset of regressors approximation (Smola et al., 2000)) we set the covariance of each of the inducing conditionals to \(0\):

\[\begin{align} p_{\text{DIC}}(\rvf_{X} \vert \rvf_{Z}) &= \gN(\rvf_{X} \vert \rmK_{XZ}\rmK_{ZZ}^{-1}\rvf_{Z}, \mathbf{0}) \\ p_{\text{DIC}}(\rvf_{\star} \vert \rvf_{Z}) &= \gN(\rvf_{\star} \vert \rmK_{\star Z}\rmK_{ZZ}^{-1}\rvf_{Z}, \mathbf{0}). \end{align} \]

The prior approximation comes to

\[\begin{equation} q_{\text{DIC}}(\rvf_{X}, \rvf_{\star}) = \gN\left(\mathbf{0}, \begin{bmatrix}\rmQ_{XX} & \rmQ_{X\star} \\ \rmQ_{\star X} & \rmQ_{\star, \star}\end{bmatrix}\right), \end{equation} \]

and the predictive distribution follows from the regular GP prediction equations as

\[\begin{align} q_{\text{DIC}}(\rvf_{\star} \vert \rvy) &= \gN(\rvf_{\star} \vert \rmQ_{\star X}(\rmQ_{XX} + \sigma^{2}\rmI_{N})^{-1}\rvy, \rmQ_{\star, \star} - \rmQ_{\star X}(\rmQ_{XX} + \sigma^{2}\rmI_{N})^{-1}\rmQ_{X\star}) \\ &= \gN(\rvf_{\star} \vert \sigma^{-2}\rmK_{\star Z}\mSigma\rmK_{Z X}\rvy, \rmK_{\star Z}\mSigma\rmK_{Z\star}), \end{align} \]

where we have defined

\[\begin{equation} \label{eq:sigma_abbrev} \mSigma \doteq (\sigma^{-2}\rmK_{ZX}\rmK_{XZ} + \rmK_{ZZ})^{-1}. \end{equation} \]

Using the DIC approximation on our toy regression problem yields the following predictions, where we are plotting the inducing point input location \(\rmZ\) using the vertical black lines. Here, we have just linearly spaced 30 such inducing points between the smallest and largest training point.

With this toy example, the approximation in the vacinity of the data is actually rather good. However, in realistic datasets, particularly with high-dimensional data, it is not as simple to linearly space the inducing points and achieve good coverage. In particular, the DIC approximation suffers from the predictive uncertainty going to \(0\) away from the inducing points: a problem that is better visualised with a randomly selected set of inducing points:

To remedy the issues with the predictive uncertainty of the DIC approximation scheme away from \(\rmZ\), the deterministic training conditional, as the name suggests, only assumes that the training conditional is deterministic (Seeger et al., 2003)

\[\begin{equation} p_{\text{DTC}}(\rvf_{X} \vert \rvf_{Z}) = \gN(\rvf_{X} \vert \rmK_{XZ}\rmK_{ZZ}^{-1}\rvf_{Z}, \mathbf{0}), \end{equation} \]

which is identical to the DIC case, however we allow the covariance of the test conditional to be exact:

\[\begin{equation} p_{\text{DTC}}(\rvf_{\star} \vert \rvf_{Z}) = \gN(\rvf_{\star} \vert \rmK_{\star Z}\rmK_{ZZ}^{-1}\rvf_{Z}, \rmK_{\star \star} - \rmK_{\star Z}\rmK_{ZZ}^{-1}\rmK_{Z\star}). \end{equation} \]

The joint prior is now

\[\begin{equation} q_{\text{DTC}}(\rvf_{X}, \rvf_{\star}) = \gN\left(\mathbf{0}, \begin{bmatrix}\rmQ_{XX} & \rmQ_{X\star} \\ \rmQ_{\star X} & \rmK_{\star, \star}\end{bmatrix}\right), \end{equation} \]

and the predictive distribution is:

\[\begin{align} q_{\text{DTC}}(\rvf_{\star} \vert \rvy) &= \gN(\rvf_{\star} \vert \rmQ_{\star X}(\rmQ_{XX} + \sigma^{2}\rmI_{N})^{-1}\rvy, \rmK_{\star \star} - \rmQ_{\star X}(\rmQ_{XX} + \sigma^{2}\rmI_{N})^{-1}\rmQ_{X\star}) \\ &= \gN(\rvf_{\star} \vert \sigma^{-2}\rmK_{\star Z}\mSigma\rmK_{ZX}\rvy, \rmK_{\star \star} - \rmQ_{\star \star} + \rmK_{\star Z}\mSigma\rmK_{Z\star}), \end{align} \]

where we have used the same abbreviation for \(\mSigma\) defined in Equation \(\ref{eq:sigma_abbrev}\).

Plotting the predictions with linearly spaced inducing inputs shows that we have indeed fixed the understated predictive uncertainty:

and repeating the above with a random set of inducing locations \(\rmZ\) shows that the predictive uncertainty remains sensible in this case—at least from a visual inspection:

We can however observe undue fluctuations in the predictions, particularly when the inducing locations \(\rmZ\) lie far from the data \(\rmX\), which can be attributed to the deterministic training conditional assumption.

While it would be appealing to rectify the issues with the DTC approach by removing the determinism from the training conditional, this would blow up the computational cost. As a compromise, we can relax the complete determinism assumption from the training conditional by imposing a diagonal structure on the covariance matrix which ensures that it is still easy to invert. This fully-independent training conditional (FITC) approximation (Snelson et al., 2005) corresponds to a factorisation of the training conditional where the training points are independent from each other:

\[\begin{align} p_{\text{FITC}}(\rvf_{X} \vert \rvf_{Z}) &= \prod_{n=1}^{N}p(\ervf_{n} \vert \rvf_{Z}) \\ &= \gN\big(\rvf_{X} \vert \rmK_{XZ}\rmK_{ZZ}^{-1}\rvf_{Z}, \text{diag}(\rmK_{XX} - \rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX})\big). \end{align} \]

The test conditional is exactly as in DTC:

\[\begin{equation} p_{\text{FITC}}(\rvf_{\star} \vert \rvf_{Z}) = \gN(\rvf_{\star} \vert \rmK_{\star Z}\rmK_{ZZ}^{-1}\rvf_{Z}, \rmK_{\star \star} - \rmK_{\star Z}\rmK_{ZZ}^{-1}\rmK_{Z\star}). \end{equation} \]

This leaves us with the joint prior of the form

\[\begin{equation} q_{\text{FITC}}(\rvf_{X}, \rvf_{\star}) = \gN\left(\mathbf{0}, \begin{bmatrix}\rmQ_{XX} - \text{diag}(\rmQ_{XX} - \rmK_{XX}) & \rmQ_{X\star} \\ \rmQ_{\star X} & \rmK_{\star \star}\end{bmatrix}\right). \end{equation} \]

The predictive distribution for a single point is:

\[\begin{equation} \label{eq:fitc_predictive} q_{\text{FITC}}(\ervf_{\star} \vert \rvy) = \gN(\ervf_{\star} \vert \rvk_{\star Z}\mSigma\rmK_{ZX}\mLambda^{-1}\rvy, \ervk_{\star star} - \ervq_{\star \star} + \rvk_{\star Z}\mSigma \rvk_{Z\star}) \end{equation} \]

where we have defined

\[\begin{align*} \mLambda &\doteq \text{diag}(\rmK_{XX} - \rmQ_{XX} + \sigma^{2}\rmI_{N}) \\ \mSigma &\doteq (\rmK_{ZZ} + \rmK_{ZX}\mLambda^{-1}\rmK_{XZ})^{-1}. \end{align*} \]

For a batch of test points, we simply treat them as conditionally independent and multiply Equation \(\ref{eq:fitc_predictive}\) together.

Repeating the prediction task for our toy dataset, we can observe good predictive accuracy, with reasonable uncertainty throughout:

In the following we observe that inducing points placed far from the training data are no longer as susceptible to distorting the predictions:

The approach of the inducing point methods described above was to approximate the GP prior, and then apply the usual GP predictive equations to make predictions. The approach described in this section will take the other approach: keeping the exact GP prior, while approximating the posterior.

We will derive the variational free energy (VFE) approach of (Titsias, 2009) using finite index sets, considering the points at the training inputs \(\rmX\), the inducing points \(\rmZ\) and all other points \(\rmX_{\star}\). This will simplify the notation, without deviating too far from the infinite-dimensional case which can be recovered with a bit more mathematical care (Matthews, 2016). The following draws from the expositions of Bui et al. (2014), van der Wilk (2019) and Murphy (2023).

Denote as \(\rvf = \begin{bmatrix}\rvf_{\star}, \rvf_{X}, \rvf_{Z}\end{bmatrix}\) the concatenation of the latent function values, evaluated at the training inputs \(\rvf = f(\rmX)\), inducing points \(\rvf_{Z} = f(\rmZ)\) and test locations \(\rvf_{\star} = f(\rmX_{\star})\). These are initially distributed according to our GP prior \(p(\rvf) \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big)\). The likelihood factorises such that it only depends on the GP at locations \(\rmX\): \(p(\rvy \vert \rvf_{X}) = \prod^{N}_{n=1}p(\ervy_{n} \vert f(\rvx_{n}))\).

In the VFE approach, we attempt to optimise the parameters \(\vphi\) of an approximate posterior \(q_{\vphi}(\rvf)\) so as to minimise the KL divergence between this and the true posterior:

\[\argmin_{\vphi}\KL[q_{\vphi}(\rvf) \Vert p(\rvf \vert \rvy)]. \]

The key assumption, that will result in computational savings, is that we factorise the approximate posterior as

\[q_{\vphi}(\rvf) = p(\rvf_{\star}, \rvf_{X} \vert \rvf_{Z})q_{\vphi}(\rvf_{Z}). \]

Since the only connection to the training observations \(\rvy\) is through \(\rvf_{Z}\), the inducing points act as a summary of the training data. Further, predictions for \(\rvf_{X}\) or \(\rvf_{\star}\) are made only through their dependence on \(\rvf_{Z}\), an \(M\)-dimensional vector, rather than their dependence on each other, which in the case of \(\rvf_{X}\) is an \(N \gg M\) dimensional vector; realising computational savings.

Using Bayes rule, we can inspect the form of the posterior over the finite-dimensional marginal \(\rvf\):

\[\begin{align} p(\rvf_{\star}, \rvf_{X}, \rvf_{Z} \vert \rvy) &= \frac{p(\rvy \vert \rvf_{\star}, \rvf_{X}, \rvf_{Z})p(\rvf_{\star}, \rvf_{X}, \rvf_{Z})}{p(\rvy)} \\ &= \frac{p(\rvy \vert \rvf_{X})p(\rvf_{X} \vert \rvf_{Z})p(\rvf_{Z})}{p(\rvy)}p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z}), \end{align} \]

where we have suggestively factored out the \(p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z})\) term on the right. This is because after we factorise out the same term in the approximate posterior \(q_{\phi}(\rvf)\), this common factor cancels out when we write the KL between these two finite-dimensional posteriors:

\[\begin{align} \KL[q_{\phi}&(\rvf_{\star}, \rvf_{X}, \rvf_{Z}) \Vert p(\rvf_{\star}, \rvf_{X}, \rvf_{Z} \vert \rvy)] \\ &= \KL[p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z}) p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) \Vert p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z})p(\rvf_{X}, \rvf_{Z} \vert \rvy)] \\ &= \int q_{\phi}(\rvf_{\star}, \rvf_{X}, \rvf_{Z}) \log \frac{\cancel{p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z})} p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z})}{\cancel{p(\rvf_{\star} \vert \rvf_{X}, \rvf_{Z})}p(\rvf_{X}, \rvf_{Z} \vert \rvy)} d\rvf_{\star}\ d\rvf_{X}\ d\rvf_{Z} \\ &= \KL[p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) \Vert p(\rvf_{X}, \rvf_{Z} \vert \rvy)]. \end{align} \]

This is a convenient, and perhaps even intuitive result: stating that regardless of how many additional points \(\rmX_{\star}\) we choose to evaluate the GP at and their location, this does not influence the value of the KL and hence has no bearing on our approximate posterior (i.e. its variational parameters).

As our objective, we now seek to maximise the KL divergence between the simplified posteriors (omitting \(\rvf_{\star}\) terms) wrt. the variational parameters, \(\phi\):

\[\begin{align} &\KL[p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) \Vert p(\rvf_{X}, \rvf_{Z} \vert \rvy)] \label{eq:reduced_kl} \\ &= \int p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) \log \frac{p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z})}{p(\rvf_{X}, \rvf_{Z} \vert \rvy)} d\rvf_{X} d\rvf_{Z} \\ &= \int p(\rvf_{X} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) \log \frac{\cancel{p(\rvf_{X} \vert \rvf_{Z})}q_{\phi}(\rvf_{Z})p(\rvy)}{\cancel{p(\rvf_{X} \vert \rvf_{Z})}p(\rvf_{Z})p(\rvy \vert \rvf_{X})} d\rvf_{X} d\rvf_{Z} \label{eq:vfe_bayes} \\ &= \int q_{\vphi}(\rvf_{Z})\log \frac{q_{\vphi}(\rvf_{Z})}{p(\rvf_{Z})}d\rvf_{Z} - \int p(\rvf_{X} \vert \rvf_{Z})q_{\vphi}(\rvf_{Z}) \log p(\rvy \vert \rvf_{X})d\rvf_{X} d\rvf_{Z} + \log p(\rvy) \\ &= \KL[q_{\vphi}(\rvf_{Z}) \Vert p(\rvf_{Z})] - \E_{p(\rvf_{X} \vert \rvf_{Z})q_{\vphi}(\rvf_{Z})}\big[\log p(\rvy \vert \rvf_{X})\big] + \log p(\rvy), \end{align} \]

where to get to Equation \(\ref{eq:vfe_bayes}\) we have applied Bayes rule. In Equation \(\ref{eq:vfe_bayes}\) we also observe that both the numerator (corresponding to the approximate process) and denomenator (corresponding tot he true process) contain the \(p(\rvf_{X} \vert \rvf_{Z})\) term which is responsible for the \(O(N^{3})\) matrix operations; allowing us to cancel it.

Since the KL divergence in Equation \(\ref{eq:reduced_kl}\) is non-negative, we can rearrange the above into a lower bound on the log-evidence term:

\[\begin{equation} \label{eq:vfe_objective} \log p(\rvy) \ge \E_{p(\rvf_{X} \vert \rvf_{Z})q_{\vphi}(\rvf_{Z})}\big[\log p(\rvy \vert \rvf_{X})\big] - \KL[q_{\vphi}(\rvf_{Z}) \Vert p(\rvf_{Z})], \end{equation} \]

giving us the canonical variational free energy or ELBO objective that we want to maximise wrt. the variational parameters.

Inspecting the distributions involved, we note that

  • the prior, \(p(\rvf_{Z})\) is of course Gaussian distributed \(\gN\big(\rvf_{Z} \vert m(\rmZ), k(\rmZ, \rmZ)\big)\).
  • it is common to select the approximate posterior \(q_{\vphi}(\rvf_{Z})\) to be Gaussian \(q_{\vphi}(\rvf_{Z}) = \gN(\rvf_{Z} \vert \rvm, \rmS)\), with variational parameters \(\vphi = \{\rvm, \rmS\}\).
  • in some cases, for instance in GP regression, the likelihood \(p(\rvy \vert \rvf_{X})\) may also be Gaussian. In many other cases however, for instance in classification problems or when a Gaussian-noise assumption is unreasonable, the likelihood will not be Gaussian.

If all three distributions are Gaussian, then we can maximise the VFE objective of Equation \(\ref{eq:vfe_objective}\) in closed form. If not, we may fall back to iterative gradient-based optimisers.

The variational objective in Equation \(\ref{eq:vfe_objective}\) has two main terms: the KL term, and the expected log-likelihood term. The former, being the KL between two Gaussians, can be found in closed form using standard identities (Petersen et al., 2012):

\[\begin{align} &\hspace{1.1em}\KL[q_{\phi}(\rvf_{Z}) \Vert p(\rvf_{Z})] \\ &= \KL[\gN(\rvf_{Z} \vert \rvm, \rmS) \Vert \gN(\rvf_{Z} \vert \vmu_{z}, \rmK_{Z,Z})] \\ &= \frac{1}{2}\left[\Tr(\rmK^{-1}_{ZZ}\rmS) + (\vmu_{Z} - \rvm)^{\top}\mSigma_{ZZ}^{-1}(\vmu_{Z} - \rvm) - M + \log \frac{\det \rmK_{ZZ}}{\det\rmS} \right] \end{align} \]

Before we can write the log-likelihood term, we need the induced posterior over latent function values at the training points, \(q(\rvf_{X} \vert \rvm, \rmS)\). We can first find the prior term \(p(\rvf_{X} \vert \rvf_{Z}, \rmX, \rmZ)\) by treating it as a case of noiseless GP regression 3: see this section of my previous article on Gaussian processes 3[3] , following which we can apply Bayes rule for linear Gaussian systems 4: which I derive in this section) of my previous article on the Gaussian 4[4] :

\[\begin{align} &\hspace{1.2em}q(\rvf_{X} \vert \rvm, \rmS) \nonumber \\ &= \int p(\rvf_{X} \vert \rvf_{Z}, \rmX, \rmZ) q(\rvf_{Z}\vert \rvm, \rmS) \nonumber \\ &= \int \gN\big(\rvf_{X} \vert \vmu_{X} + \rmK_{XZ}\rmK_{ZZ}^{-1}(\rvf_{Z} - \vmu_{Z}), \rmK_{XX} - \rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX}\big) \nonumber \\ &\hspace{2.5em}\gN(\rvf_{Z} \vert \rvm, \rmS) d\rvf_{Z} \\ &= \gN\big(\rvf_{X} \vert \vmu_{X} + \rmK_{XZ}\rmK_{ZZ}^{-1}(\rvm - \vmu_{Z}), \rmK_{XX} - \rmK_{XZ}\rmK_{ZZ}^{-1}(\rmK_{ZZ} - \rmS)\rmK_{ZZ}^{-1}\rmK_{ZX})\big) \label{eq:full_induced_posterior} \\ &\doteq \gN(\rvf_{X}, \tilde{\vmu}, \tilde{\mSigma}). \end{align} \]

To keep things concise, let us use a zero-mean GP prior (i.e. \(m(\rmX) = \vmu_{X} = \mathbf{0}\)), which will shorten Equation \(\ref{eq:full_induced_posterior}\).

When the likelihood is Gaussian, \(p(\rvy \vert \rvf_{X}) = \gN(\rvy \vert \rvf_{X}, \sigma^{2}\rmI)\), the expected log-likelihood term in the VFE objective takes the following form:

\[\begin{align} &\hspace{1.2em}\E_{q_{\phi}(\rvf_{X})}\left[\log p(\rvy \vert \rvf_{X})\right] = \E_{\gN(\rvf_{X} \vert \tilde{\vmu}, \tilde{\mSigma})}\left[\log p(\rvy \vert \rvf_{X})\right] \\ &= \int \gN(\rvf_{X} \vert \tilde{\vmu}, \tilde{\mSigma}) \left[-\frac{N}{2} \log (2\pi \sigma^{2}) - \frac{1}{2\sigma^{2}}(\rvy - \rvf_{X})^{\top}\rmI(\rvy - \rvf_{X})\right]d\rvf_{X} \\ &= \int \gN(\rvf_{X} \vert \tilde{\vmu}, \tilde{\mSigma}) \left[-\frac{N}{2}\log (2\pi \sigma^{2}) - \frac{1}{2\sigma^{2}} \Tr(\rvy \rvy^{\top} - 2\rvy \rvf_{X}^{\top} + \rvf_{X}\rvf_{X}^{\top})\right] d\rvf_{X} \label{eq:trace_trick} \\ &= -\frac{N}{2}\log (2\pi \sigma^{2}) - \frac{1}{2\sigma^{2}} \Tr(\rvy \rvy^{\top} - 2\rvy \tilde{\vmu}^{\top} + \tilde{\vmu}\tilde{\vmu}^{\top} + \tilde{\mSigma}) \label{eq:gaussian_expectation} \\ &= -\frac{1}{2\sigma^{2}}\Tr(\tilde{\mSigma}) + \log \gN(\rvy \vert \tilde{\vmu}, \sigma^{2}\rmI), \end{align} \]

where in Equation \(\ref{eq:trace_trick}\) we have used the fact that for a vector \(\rvx\) and matrix \(\rmA\), the trace 5: this is sometimes referred to as the trace trick 5[5] allows us to write \(\rvx^{\top}\rmA\rvx = \Tr(\rvx^{\top}\rmA\rvx) = Tr(\rvx\rvx^{\top}\rmA) = \Tr(\rmA\rvx\rvx^{\top})\), and in Equation \(\ref{eq:gaussian_expectation}\) we note that the expectation of \(\rvf_{X}\) under \(\gN(\rvf_{X} \vert \tilde{\vmu}, \tilde{\mSigma})\) is merely \(\tilde{\vmu}\) while \(\E[\rvf_{X}\rvf_{X}^{\top}] = \E[\rvf_{X}]\E[\rvf_{X}]^{\top} + \Cov(\rvf_{X}) = \tilde{\vmu}\tilde{\vmu}^{\top} + \tilde{\mSigma}\).

Hence, the concrete form of the VFE objective / ELBO with all the distributions substituted in is:

\[\begin{align} \gL(\vphi) &= \log \gN(\rvy \vert \rmK_{XZ}\rmK_{ZZ}^{-1}\rvm, \sigma^{2}\rmI_{N}) - \frac{1}{2\sigma^{2}}\Tr(\rmK_{XZ}\rmK_{ZZ}^{-1}\rmS\rmK^{-1}_{ZZ}\rmK_{ZX}) \\ &- \frac{1}{2\sigma^{2}}\Tr(\rmK_{XX} - \rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX}) - \KL[q_{\phi}(\rvf_{Z}) \Vert p(\rvf_{Z})]. \end{align} \]

We can now compute the gradients of the above, set them to 0, and solve for the parameters \(\rvm\) and \(\rmS\). To do this, we note the following results (Opper et al., 2009)

\[\begin{align} \frac{\partial}{\partial \mu} \E_{\gN(x \vert \mu, \sigma^{2})}\big[h(x)\big] &= \E_{\gN(x \vert \vmu, \sigma^{2})}\left[\frac{\partial}{\partial x}h(x)\right] \\ \frac{\partial}{\partial \sigma^{2}}\E_{\gN(x \vert \mu, \sigma^{2})}\big[h(x)\big] &= \frac{1}{2}\E_{\gN(x \vert \mu, \sigma^{2})}\left[\frac{\partial^{2}}{\partial x^{2}}h(x)\right], \end{align} \]

and we apply these with the factorised likelihood \(\log p(\ervy_{n} \vert \ervf_{n})\) substituted for \(h(x)\). It can then be shown (Murphy, 2023) that

\[\begin{align} \nabla_{m}\gL(\phi) &= \sigma^{-2}\rmK_{ZZ}^{-1}\rmK_{ZX}\rvy - \mLambda\rvm \\ \nabla_{\rmS}\gL(\phi) &= \frac{1}{2}\rmS^{-1} - \frac{1}{2}\mLambda, \end{align} \]


\[\mLambda = \sigma^{-2}\rmK^{-1}_{ZZ}\rmK_{ZX}\rmK_{XZ}\rmK_{ZZ}^{-1} + \rmK_{ZZ}^{-1}. \]

Setting the derivatives above to \(\mathbf{0}\) and solving gives the optimal variational parameters

\[\begin{align} \rmS &= \mLambda^{-1} \\ \rvm &= \sigma^{-2}\mLambda^{-1}\rmK_{ZZ}^{-1}\rmK_{ZX}\rvy. \end{align} \]

Thus, when the likelihood is Gaussian and the GP prior mean is \(\mathbf{0}\), the lower bound on the log marginal likelihood comes to

\[\log p(\rvy) \ge \log \gN(\rvy \vert \mathbf{0}, \rmK_{XZ}\rmK^{-1}_{ZZ}\rmK_{ZX} + \sigma^{2}\rmI_{N}) - \frac{1}{2\sigma^{2}}\Tr(\rmK_{XX} - \rmK_{XZ}\rmK_{ZZ}^{-1}\rmK_{ZX}). \]

If the likelihood is not Gaussian, we can maximise the VFE objective using stochastic optimisation methods.

With the (approximate) posterior \(q_{\phi}(\rvf_{Z})\) in hand, we can now turn our attention to making predictions at new test points. Using the standard rules for Gaussian conditioning, it’s straightforward to show that the predictive distribution at some test locations \(\rmX_{\star}\) is found as

\[\begin{align} p(\rvf_{\star}) &\approx \int p(\rvf_{\star} \vert \rvf_{Z})q_{\phi}(\rvf_{Z}) d\rvf_{Z} \\ &= \gN(\rmK_{\star Z}\rmK_{ZZ}^{-1}\rvm, \rmK_{\star \star} - \rmK_{\star Z}\rmK_{ZZ}^{-1}(\rmK_{ZZ} - \rmS)\rmK_{ZZ}^{-1}\rmK_{Z\star}). \end{align} \]

Applying this to our toy problem results in good predictions: