A Second Introduction to Gaussian Processes for Function Approximation

One does not simply get GPs in one sitting. This is particularly true if you are an engineer or computer scientist and haven’t taken a course on stochastic processes or anything of that nature.

My goal with this second 0: As a first introduction, this may make for slightly tedious reading: too thin on the motivation and thick in the details. For more easy-going introductions, see this talk from David MacKay, this one from Rich Turner or the introductory section of the GPML book. ^0[0] introduction to GPs is to provide a self-contained treatment of GPs for the reader who’s sat through a first course or lecture on the subject, yet still feels like they’re missing the full picture.

There are a few essential preliminaries to be familiar with, without which the going can be rather tough. 1: I made this mistake when first learning about GPs: I was able to apply the formulae in a rote manner and code up the model. Yet I didn’t feel I understood why the equations worked. ^1[1] These include several Gaussian identities, the definition of a stochastic process, among a few others. Happily, these preliminaries aren’t difficult to grasp, and we will cover them as they come up 2: If you’d like to cover them preemptively, then begin by taking a read through my previous article on the Gaussian distribution, followed by the this article’s Appendix covering the measure theory. ^2[2] . With a good hold on these key ideas, the rest of the GP framework follows almost effortlessly.

Gaussian Processes

The terminology of measure theory proves to be rather valuable in succinctly expressing the key ideas behind GPs. 3: Often, introductions to GPs will avoid this for the sake of accessibility; choosing analogies to parametric function approximation and other visualisations instead. While intuitive and accessible, they might lack the precision that can be afforded with just a little bit of basic measure theory. ^3[3]

If the words probability space or Borel sigma algebra don’t mean very much to you, then take a look through the Appendix before proceeding, in which I give a quick back-alley treatment of the subject, covering just the bits you need to know.

To establish notation, we begin with the definition of a stochastic process.

Stochastic Processes

Definition (Stochastic process). Consider a probability space $\gS = (\Omega, \gF, \P)$ and a measurable space $(E, \gE)$. A stochastic process $f$ is a collection of $E$-valued random variables, all defined on the same probability space $\gS$, where this collection is indexed by some set $\gX$.

Note: to avoid unnecessary abstraction at this point, you may wish to substitute real-valued random variables; $(E, \gE) = (\R, \gB(\R))$ and $d$-dimensional real-valued indices $\gX = \R^{d}$ every time these come up; since these will closely resemble the GP regression case later.

Let $\ervf_{x}$ or $f(x)$ denote the random variable in the collection $f$ indexed by $x$. That is, by the definition of a random variable, $f(x)$ is itself a function $f(x) : \Omega \to E$ of some sample $\omega \in \Omega$. Outcomes of a stochastic process are individually functions of both the index of the random variable $x \in \gX$, as well as a sample $\omega \in \Omega$. Hence, it is common to write a stochastic process as the following mapping:

\[f: \gX \times \Omega \to E. \]

In this sense, the fact that $f$ accepts an index parameter $x$, in addition to the sample parameter $\omega$ that is ordinarily the only parameter to a random variable, allows us to see $f$ as a random function or a distribution over functions. The result is that we can sample functions from our stochastic process just as we would sample values from a random variable.

To do this, we fix some $\omega \in \Omega$ at which we will evaluate all the random variables in our collection $f$ that we decide to look at (i.e. that we decide to index). That is, the sampled function is

\[f(\cdot, \omega): \gX \to E, \]

where by selecting some $\omega$, we have ‘fixed’ the random function 4: In practice, when using stochastic processes, we almost never fix some $\omega$ unless we are plotting sampled functions from the process. Instead, it is usually more interesting to consider the distribution (and moments) of the function points, as well as evaluating the likelihood of new function points under the random function. ^4[4] , which now behaves like any ordinary deterministic function. The indices $x \in \gX$ can be interpreted more directly as ‘function inputs’ which are mapped to single points in the output space $E$.

A stochastic process is characterised by the joint probability distribution of $f$ on arbitrary finite subsets of $\gX$. That is, if $\rmX_{n} \subseteq \gX$ is a finite $n$-element vector of indices, then we obtain the following random vector

\[\rvf_{\rmX_{n}} = (\ervf_{x_{1}}, \ldots, \ervf_{x_{n}}) = \big(f(x_{1}), \ldots, f(x_{n})\big) = f(\rmX_{n}). \]

More formally, we refer to these as finite-dimensional distributions.

Definition (Finite-dimensional distribution). Consider a probability space $(\Omega, \gF, \P)$, measurable space $(E, \gE)$ and a stochastic process $f: \gX \times \Omega \to E$. The finite-dimensional distributions (FDDs) of $f$ are the (joint, $n$-dimensional) probability distributions $\P^{f}_{x_{1}, \ldots, x_{n}}$ on the product space $E^{n}$ defined by

\[\P^{f}_{x_{1}, \ldots, x_{n}}(S) \doteq \P\Big(\big\{\omega \in \Omega\ :\ \big(f(x_{1}, \omega), \ldots, f(x_{n}, \omega)\big) \in S\big\}\Big), \]

where $S \in E^{n}$ and $n \in \mathbb{N}$.

The Gaussian process

Example (Gaussian process). A Gaussian process is a stochastic process, whose finite-dimensional distributions are multivariate Gaussian.

That is, the FDDs of the Gaussian process are

\[\P^{f}_{x_{1}, \ldots, x_{n}}(S) \doteq \gN(S; \vmu, \mSigma), \]

where $S \in \R^{n}$ and the Gaussian is parametrised by a mean vector $\vmu \in \R^{n}$ and a covariance matrix $\mSigma \in \R^{n\times n}$.

While the above is sufficiently abstract to entertain a variety of Gaussian processes (e.g. over indices $\gX$ which are graphs or strings), for most straightforward applications of GP regression, the following will suffice:

We consider real-valued random variables, that is $(E, \gE) = (\R, \gB(\R))$ in which case each random variable $\ervf_{x}$ indexed by some $x \in \gX$ maps some element of the sample space to the reals: $\ervf_{x}: \Omega \to \R$.
The collection of random variables that makes up the stochastic process can be denoted $\{f(x)\ :\ x\ \in \gX\}$, where in GP regression we often have $\gX = \R^{n}$, for $n \ge 1$. In this case, since $\gX$ is uncountable, then so is this collection of random variables, and we call the stochastic process a continuous stochastic process.

Modelling Functions with GPs

For the purposes of modelling functions with Gaussian processes 5: or any stochastic process for that matter… ^5[5] , we interpret the index set $\gX$ as the domain or inputs to our function.

For instance, for time-series prediction the indices may be a set of time values we have in a training dataset, as well as time values we’d like to make predictions at. For Bayesian optimisation of an ML model’s hyperparameters, the indices may be the set of hyperparameters we’ve already tried, as well as the ones we’re interested in trying next. For plotting a 2d function on a graph, the indices are the real values along the x-axis that we’d like to plot the function at.

More generally, the indices are the vector-valued covariates for our prediction problem 6: It is easiest to think about regression for now, although we will see later how GPs can be used for classification by changing the likelihood. ^6[6] for which we have training observations 7: Note that we needn’t have observed any observations to use a GP; indeed Bayesian methods shine for how well they do a-priori and when we have observed very few data points ^7[7] as well as those for which we’d like to make predictions (obtain a predictive distribution).

Single Sampled Function

To see how the GP corresponds to a collection of random variables at the points of interest (i.e. train and test points) which are in some sense consistent with one another, we will build up to this one index at a time, working with a simple 1d function $f: \R \to \R$, having already fixed some value in the sample space, $\omega_{1} \in \Omega$ 8: The effect of this $\omega_{1}$ is to ‘pin down’ the random function, which we may now think of as a deterministic function, whose points can be described by a Gaussian with a particular mean and covariance. ^8[8] .

Note that in this section we are only interested in seeing how GPs can be used to represent functions. We leave the form of $f$ unspecified, as well as how we actually find the parameters of the GP’s FDD; which will be the subject of the next section.

Picking an initial index value $\rvx_{1} \in \gX$ gives a single element subset of the GP’s infinite collection of real-valued random variables, $\{\ervf_{\rvx_{1}}\} \subset \{\ervf_{\rvx}\ :\ \rvx\in \gX\}$. The random variable in this singleton set is univariate Gaussian

\[\begin{equation}\label{eq:univariate_gaussian} \ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1}). \end{equation} \]

Having already fixed an element of the sample space, $\omega_{1}$, we can evaluate this random variable $\ervf_{\rvx_{1}} : \Omega \to \R$ at $\omega_{1}$ to get a real value which we may plot as the first function point / ‘pixel’ of our sampled random function:

Adding a second index $\rvx_{2}$ to the index set, $\{\rvx_{1}, \rvx_{2}\} \in \gX$, we now obtain a two-element set of random variables $\{\ervf_{\rvx_{1}}, \ervf_{\rvx_{2}}\} \subset \{\ervf_{\rvx}\ :\ \rvx\in \gX\}$. By the definition of a GP as a stochastic process with Gaussian FDDs, these two random variables form a Gaussian random vector distributed as

\[\begin{bmatrix}\ervf_{\rvx_{1}} & \ervf_{\rvx_{2}}\end{bmatrix}^{\top} = \gN(\vmu, \mSigma), \]

for some $\vmu \in \R^{2}$ and $\mSigma \in \R^{2\times 2}$, whose exact form we will cover in the next section. For now, note that the fact that $\ervf_{\rvx_{1}}$ and $\ervf_{\rvx_{2}}$ are elements of the same joint Gaussian distribution is the sense in which the predictions at $\rvx_{1}$ and $\rvx_{2}$ are consistent. Marginalising 9: Marginalising a Gaussian simply reduces to picking the relevant sub-vector and sub-matrix of the mean and covariance. See this section in my previous article on the Gaussian for more on this. ^9[9] this 2-dimensional FDD further gives $\ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1})$ and $\ervf_{\rvx_{2}} \sim \gN(\mu_{2}, \Sigma_{2})$, where the former is exactly the univariate Gaussian from Equation $\ref{eq:univariate_gaussian}$.

Once again, having already fixed $\omega_{1} \in \Omega$, we can evaluate both $\ervf_{\rvx_{1}}$ and $\ervf_{\rvx_{2}}$ at $\omega_{1}$ which gives us two real-valued function points / ‘pixels’ which lie on the same sampled function line:

This does not offer us a very high ‘resolution’ for plotting our function. Indexing using 100 additional points, spaced from $[-10, 10]$, evaluating the resulting 100-dimensional Gaussian random vector at $\omega_{1}$ and plotting these points lets us see the rest of the sampled function more clearly:

Hence indexing results in a Gaussian random vector, and sampling from this gives us consistent realisations of the random function evaluated at the indices / inputs.

Distributions over Functions

If we now refrain from picking a single element of the sample space $\Omega$, we can consider the distribution over functions.

Working up to this once more from a single index, the univariate Gaussian FDD at $\rvx_{1}$ is $\ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1})$. This is just a univariate Gaussian distribution, which we will suggestively plot vertically on the function plot, along with five sampled function points from this univariate Gaussian:

Continuing as we did before by adding a second index to the index set, $\rvx_{2}$, this yields a second marginal $\ervf_{\rvx_{2}}$ of the now 2-dimensional joint, which we will also plot vertically at the corresponding index location $\rvx_{2}$. Under the density, we have also plotted 5 samples using the same $\{\omega_{1}, \ldots, \omega_{5}\}$ as were used to draw the samples at $\rvx_{1}$.

Continuing in this way by adding more and more indices to the index set, we can draw out the full function lines 10: Note that when using GPs, we never actually do this unless we’re trying to visualise the GP. Instead, we only need to add the training points and test points to the index set. ^10[10] for our five elements of the sample space $\{\omega_{1}, \ldots, \omega_{5}\}\subset \Omega$, for some index / design matrix $\rmX \in \R^{n\times 1}$.

You’ll notice that we stopped visualising the distribution over functions at the various indices using vertical Gaussian densities at each input location, since this quickly becomes messy. Visualising the distribution by drawing a bunch of functions (as we have done above, using 5 samples) is equally messy. Rather, to represent the broadness of these vertical Gaussians at various inputs and communicate our uncertainty about the function values at those points, it is common to plot $\mu_{\rvx} \pm 2\sigma_{\rvx}$ for all $\rvx \in \rmX$.

For the GP prior that we have been sampling from in the plots above (i.e. the GP without conditioning any training data) this is usually fairly boring; with a constant mean and standard deviation / uncertainty at all points, which correspond to how we have defined the prior (see next section).

Note that I am still (albeit faintly) plotting samples drawn from the process for illustration. The opacity of a function line is proportional to its likelihood under our GP prior; those that deviate from the mean are fainter than those that lie close to the mean.

Once we have conditioned on training data however, we may observe varying degrees of uncertainty in our random function at different points away from the data.

In general, to define a Gaussian it suffices to define its first two moments: the mean and covariance. For any $n$ indices $\rmX = \{\rvx_{i} \in \gX\}_{i=1}^{n}$, the resulting FDD is a $n$-dimensional Gaussian, parametrised by $\mu \in \R^{n}$ and $\mSigma \in \R^{n\times n}$. In the next section, we will see how to obtain these parameters for a given GP prior, a training dataset, and some test points of interest.

Defining the GP Prior

When we place a Gaussian process prior on a function that we’re interested in modelling, we define a way of obtaining the first two moments of the FDD; in other words, we define a mean function $m: \gX \to \R$ and a positive definite covariance (or kernel) function $k: \gX \times \gX \to \R$.

This is convenient since we often don’t know which indices $\rmX$ we’ll choose to look at when choosing the GP prior 11: i.e. where our training points will land, and where we will need to evaluate the function at test time ^11[11] . With these two functions, we can postpone the calculation of the FDD’s first two moments until we obtain the indices we need.

Supposing that our indices or covariates live in the space $\gX = \R^{d}$, and we select / observe $n$ of them $\rmX \in \R^{n \times d}$, the resulting prior FDD can be found by the element-wise evaluation of these functions at each of the indicies; giving us our mean vector $m(\rmX) = \vmu \in \R^{n}$ and covariance or Gram matrix $\rmK$, where $\ermK_{i, j} = k(\rvx_{i}, \rvx_{k})$. The positive definite constraint on the kernel function says that the Gram matrices it produces satisfy $\rvv^{\top}\rmK\rvv \gt 0$, which is required to use these as covariance matrices. Drawing $n$-dimensional samples from $\gN(\vmu, \rmK)$ results in (prior) function samples, as plotted in the penultimate digram above.

The mean and covariance functions are the main levers we can pull to influence the expression of our prior beliefs about the function of interest. In most cases, using a zero-mean function $m(\rvx) \doteq \mathbf{0}$ is a reasonable choice 12: although this is not always the case; a non-zero mean functions can be extremely useful, if not essential, in certain settings such as Bayesian optimisation, time-series predictions and deep Gaussian processes ^12[12] leaving the characterisation of the function down to the choice of kernel function. Different kernel functions can influence things such as the smoothness or periodicity of the resulting functions—for a nice taxonomy of some common kernels, see this page by David Duvenaud.

Some popular kernels include the following, where $r = |\rvx - \rvx'|$ represents a pairwise distance between points 13: See this Fragment for more on computing pairwise distances between datapoints. ^13[13]

Kernel	$k(\rvx, \rvx')$
Squared Exponential	$\sigma^{2} \exp \big(-\frac{r^{2}}{2\ell^{2}}\big)$
Rational Quadratic	$\sigma^{2}\left(1 \frac{r^{2}}{2\alpha \ell^{2}}\right)^{-\alpha}$
Matérn 3/2	$\left(1 + \frac{\sqrt{3}r}{\ell}\right)\exp\left(-\frac{\sqrt{3}r}{\ell}\right)$
Matérn 5/2	$\left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^{2}}{3\ell^{2}}\right)\exp\left(-\frac{\sqrt{5}r}{\ell}\right)$
ArcCosine (Cho et al., 2009)	$\sin \theta + (\pi - \theta)\cos\theta$

Note that for the ArcCos kernel, we have $\cos \theta = \frac{\rvx \cdot \rvx'}{\vert \rvx \vert \vert \rvx' \vert}$. The $\sigma$ and $\ell$ terms are hyperparameters which influence the properties of each kernel; we discuss how to set these later.

In summary, we define a Gaussian process prior $f: \R^{n} \times \Omega \to \R$, which we denote as

\[f \sim \gG\gP(m, k), \]

by defining the functions $m(\cdot)$ and $k(\cdot, \cdot)$, which in turn determine the first two moments of the resulting FDDs:

\[\begin{align*} m(\rvx) &= \E[f(\rvx)], \\ k(\rvx, \rvx') &= \E\big[\big(f(\rvx) - \mu(\rvx)\big)^{\top}\big(f(\rvx') - \mu(\rvx')\big)\big]. \end{align*} \]

Conditioning on Data

Defining a GP prior and drawing samples from it is not usually very useful in itself: rather we seek to make predictions about new inputs by updating or conditioning our beliefs about the latent function values on observed (training) data points. In this section, we will see how we can obtain the updated parameters of the FDD and parametrise the posterior predictive / posterior process; as was done to produce the last plot in the previous section.

For clarity of exposition, we will first assume a noiseless regression setup 14: These might come up when modelling the output of a computer simulation, or something for which we have extremely precise measurements. ^14[14] .

Consider a training set of $n$ input and function-value pairs, $\gD = \{(\rvx_{i}, f_{i})\}_{i=1}^{n}$, whose $d$-dimensional inputs we collectively denote as $\rmX \in \R^{n \times d}$ and where the function points are denoted as $\rvf = f(\rmX) \in \R^{n}$. Further consider a set of $m$ new indices / test points, which we denote $\rmX_{\star} \in \R^{m\times d}$ at which we’d like to make predictions $\rvf_{\star} = f(\rmX_{\star})$.

With these input points defined we can compute the GP’s FDD at $\{\rmX, \rmX_{\star}\}$ by evaluating the mean and covariance functions at these input points. This gives us a joint distribution over the function values at the training and test locations $p(\rvf, \rvf_{\star})$. Writing these parameters as a partitioned mean vector and covariance matrix, we obtain the distribution over the Gaussian random vector $\begin{bmatrix}\rvf, \rvf_{\star}\end{bmatrix}$ as

\[\begin{bmatrix}\rvf \\ \rvf_{\star} \end{bmatrix} \sim \gN\left(\begin{bmatrix}\vmu \\ \vmu_{\star}\end{bmatrix}, \begin{bmatrix}\rmK & \rmK_{\star} \\ \rmK_{\star}^{\top} & \rmK_{\star\star}\end{bmatrix}\right), \]

where $\rmK = k(\rmX, \rmX) \in \R^{n\times n}$, $\rmK_{\star} = k(\rmX, \rmX_{\star}) \in \R^{n\times m}$, $\vmu_{\star} = m(\rmX_{\star}) \in \R^{m}$ and so forth.

Given this joint model, we now note that while we know the function values at the training inputs $\rvf$ (these are in our training set), we don’t know the function values at the test points $\rvf_{\star}$—this is what we would like to infer.

To make predictions, we therefore seek the predictive distribution of function values at the test points, conditioned on the observed training dataset, $p(\rvf_{\star} \vert \rvf)$ 15: or, using more explicit notation, $p(\rvf_{\star} \vert \gD)$ or even $p(\rvf_{\star} \vert \rmX_{\star}, \rvf, \rmX)$. However, it is common to leave out the indices for notational brevity, with the understanding that $\rvf \doteq f(\rmX)$. ^15[15] .

Owing to the convenient properties of the Gaussian, this Gaussian conditional distribution has a closed-form solution, which I give a full derivation of in this section of my previous article on the Gaussian distribution. If you would find it satisfying to know how GPs work end-to-end, I encourage you to read through the proof. Otherwise, I have provided the relevant results below which you may use ad-hoc:

Conditional of a Joint Gaussian Distribution

For a partitioned Gaussian joint $\rvx \sim \gN(\vmu, \mSigma)$, where

\[\rvx = \begin{bmatrix}\rvx_{1} \\ \rvx_{2}\end{bmatrix}\ \ \ \vmu = \begin{bmatrix}\vmu_{1} \\ \vmu_{2} \end{bmatrix} \ \ \ \mSigma = \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}, \]

and, necessarily, $\mSigma_{12} = \mSigma_{21}^{\top}$, the distribution of $\rvx_{2}$ conditioned on $\rvx_{1}$ is

\[\begin{align} p(\rvx_{2} \vert \rvx_{1}) &= \gN(\rvx_{2}; \vmu_{2 \vert 1}, \mSigma_{2\vert 1}) \nonumber \\ \vmu_{2 \vert 1} &= \vmu_{2} + \mSigma_{21}\mSigma_{11}^{-1}(\rvx_{1} - \vmu_{1}) \label{eq:conditional_mean} \\ \mSigma_{2 \vert 1} &= \mSigma_{22} - \mSigma_{21}\mSigma_{11}^{-1}\mSigma_{12} = (\mSigma / \mSigma_{11}) \label{eq:conditional_cov}. \end{align} \]

For other related results, Petersen et al. (2012) provides an excellent reference.

Hence the Gaussian random vector $\rvf_{\star}$ conditioned on our training points is distributed as $p(\rvf_{\star} \vert \rmX_{\star}, \rvf, \rmX) = \gN(\rvf_{\star}; \vmu_{\star \vert \rmX}, \mSigma_{\star \vert \rmX})$ with parameters

\[\begin{align} \vmu_{\star \vert \rmX} &= \vmu_{\star} + \rmK_{\star}^{\top}\rmK^{-1}(\rvf - \vmu) \\ \mSigma_{\star\vert \rmX} &= \rmK_{\star\star} - \rmK_{\star}^{\top}\rmK^{-1}\rmK_{\star}, \end{align} \]

where we have used Equations $\ref{eq:conditional_mean}$ and $\ref{eq:conditional_cov}$ for the mean and covariance, respectively.

These are the equations that can be used 16: I refrain from saying ‘that were used’, since in practice these equations prove to be numerically unstable. I discuss some methods for practical inversion of kernel matrices in this section of another article on Bayesian linear regression. ^16[16] to plot the predictive GP conditioned on some training data:

Notice how the mean of the posterior perfectly interpolates the training data, and the predictive uncertainty ‘pinches’ around the training points. This is an artefact of the noiseless modelling assumption. In most cases when modelling observed phenomena this is unrealistic, and this may further have undesired effects on the predicted function. For instance if two similar inputs have differing observations (e.g. due to corrupting observation noise on a sensor), then the predicted function (distribution) will oscillate wildly around that point to match the data:

Noise-Corrupted Observations

To rectify the above, we now assume that instead of observing noiseless function values $f(\rvx_{n})$ in our training set, we rather observe noise corrupted versions thereof; $\ervy_{n} = f(\rvx_{n}) + \epsilon_{n}$, for $\epsilon_{n} \sim \gN(0, \sigma^{2}_{y})$.

Note that this unimodal, Gaussian noise modelling assumption is required in the exact GP setting to remain tractable, however in a subsequent article we will see how we can relax this constraint and use different likelihoods.

Due to our assumption that the noise has zero mean (i.e. is centred at the latent function values $\rvf$) and i.i.d. (i.e. it has a diagonal covariance), incorporating this modelling assumption is fairly straightforward; only requiring us to update the diagonal entries of the covariance matrix at the training locations 17: we don’t update the test locations since we are concerned with inferring the latent function values $\rvf_{\star}$, not noisy versions thereof! ^17[17] :

\[\Cov[\rvy \vert \rmX] = \rmK + \sigma^{2}_{y}\rmI_{n}. \]

Hence, our model for the joint density of our observations (which we now denote $\rvy$) and the latent function values at the test inputs $\rvf_{\star}$ becomes:

\[\begin{equation} \begin{bmatrix}\rvy \\ \rvf_{\star}\end{bmatrix} \sim \gN\left( \begin{bmatrix}\vmu \\ \vmu_{\star} \end{bmatrix}, \begin{bmatrix}\rmK + \sigma^{2}_{y}\rmI & \rmK_{\star} \\ \rmK_{\star}^{\top} & \rmK_{\star\star}\end{bmatrix} \right). \end{equation} \]

Conditioning on the observations $\rvy$ proceeds exactly as in the noiseless case (Equations $\ref{eq:conditional_mean}$ and $\ref{eq:conditional_cov}$) with $\rmK + \sigma^{2}_{y}\rmI$ substituted for $\rmK$, giving the posterior predictive distribution in the noisy case as:

\[\begin{align} p(\rvf_{\star} \vert \rmX_{\star}, \rvy, \rmX) &= \gN(\rvf; \vmu_{\star \vert \rmX}, \mSigma_{\star \vert \rmX}) \\ \vmu_{\star \vert \rmX} &= \vmu_{\star} + \rmK_{\star}^{\top}(\rmK + \sigma_{y}^{2}\rmI)^{-1}(\rvy - \vmu) \\ \mSigma_{\star \vert \rmX} &= \rmK_{\star\star} - \rmK_{\star}^{\top}(\rmK + \sigma_{y}^{2}\rmI)^{-1}\rmK_{\star}. \end{align} \]

Setting the noise level to $\sigma_{y}^{2} = 0.2$, we can now see that the mean function values now merely pass close to the training datapoints, without the hard requirement to interpolate through them:

Due to this behaviour, these simple, single-layer GPs are often referred to as smoothing devices.

Appendix A: The Measure Theory You Need To Know (and no more)

This Appendix collects the bits of basic measure theory which are useful for understanding GPs. It reads like a list of definitions, which you should aim to commit to memory.

Definition (Sample Space). A sample space $\Omega$ is the set of all outcomes of a probabilistic experiment. More generally, a measurable space, commonly denoted $E$, is the set of things which we are interested in measuring. A sample space is a measurable space 18: The former arising when talking about probability, with the latter arising more generally in measure theory. ^18[18] .

Example. For a single toss of a coin, the sample space is $\Omega = \{\omega_{1}, \omega_{2}\} = \{H, T\}$.

Sigma Algebras

Definition (Sigma-algebra, event space). These are the same thing, with the latter being a probabilistic interpretation of the former. A $\sigma$-algebra $\gE$ is a collection of ‘nice’ (i.e. not infinite, tractable) subsets of a measurable space. If the measurable space is finite and discrete, then its power set $\gP(E)$ can be used as a valid sigma algebra. In the probabilistic interpretation, for a finite sample space $\Omega$, the event space (which we usually denote $\gF$) can be taken as a subset of $\gP(\Omega)$; elements of which are called events. Elements of a sigma-algebra must follow the following regularity conditions:

$\E \in \gE$; the sample space itself is an event (the sure event)
$A \in \gE \implies A^{c} \in \gE$; it is closed under complementation (the complement of an event is an event)
$A_{1}, A_{2}, \dots \in \gE \implies \bigcup_{i=1}^{\infty} A_{i} \in \gE$; it is closed under countable unions (the union of countably many events is an event)

If a measurable space $E$ is countable, then this is nice because we can always find the power set $\gE = \gP(E)$ and the measure can be defined on all possible subsets. However, for ‘bigger’ spaces, we have to be more careful, since the set of all subsets if often too large to work with. For such inconvenient, infinite event spaces, we usually define $\sigma$-algebras in terms of a smaller set, known as the generating sets.

Definition (Generator of a $\sigma$-algebra). Let $E$ be a set, and $\gA \subseteq \gP(E)$ be a collection of subsets of $E$. We define

\[\sigma(\gA) = \{A \subseteq E\ :\ A \in \gA\text{ for all $\sigma$-algebras $\gE$ that contain $\gA$}\}. \]

Put otherwise, $\sigma(\gA)$ is the smallest sigma algebra that contains $\gA$. This is known as the $\sigma$-algebra generated by $\gA$.

Definition (Borel $\sigma$-algebra). Consider a real measurable space $E = \R$ and the generator $\gA = \{U \subseteq \R\ :\ U\text{ is open}\}$. Then $\sigma(\gA)$ is known as the Borel $\sigma$-algebra, which is not the set of all subsets of $\R$ (which would be an uncountably infinite object, and hence hard to measure).

We can perhaps more intuitively, and entirely equivalently, define this as $\bar{\gA} = \{(a, b)\ :\ a < b,\ a, b \in \Q\}$—the set of tuples of ordered rational numbers. Then $\sigma(\bar{\gA})$ is also the Borel $\sigma$-algebra. We denote the Borel $\sigma$-algebra as $\gB(\R)$.

Measures

Definition (Measurable space). A pair $(E, \gE)$ of a set and its $\sigma$-algebra (for instance a sample space $\Omega$ and an event space $\gF$) is called a measurable space. This is because we can define a function that ‘measures’ the size of the elements of $\gF$. In fact, any element of a sigma algebra $B \in \gF$ is called a measurable set.

Definition (Measure). A measure on a measurable space $(E, \gE)$ is a function $\mu : \gE \to [0, +\infty)$, such that

$\mu(\emptyset) = 0$
It follows countable additivity: for any disjoint sequence $(A_{n})$ in $\gE$, then

\[\mu\left(\bigcup_{n}A_{n}\right) = \sum_{n=1}^{\infty}\mu(A_{n}). \]

Example (Mass function). Let $E$ be any countable set, and $\gE = \gP(E)$ its power set (i.e. the set of all subsets of $E$). A mass function is any function $m : E \to [0, +\infty)$. We can define a measure by setting

\[\mu(A) = \sum_{x \in A}m(x). \]

In particular, if we put $m(x) = 1$ for all $x \in E$, then we obtain the counting measure.

Definition (Probability measure). Let $(\Omega, \gF)$ be a measurable space with the property that $\mu(\Omega) = 1$. Then we call $\mu$ a probability measure and usually denote it as $\P$ instead of $\mu$. Note that $\P$ must also satisfy the additional property that it maps to $[0, 1]$ rather than $[0, +\infty)$.

Definition (Lebesgue measure). The Lebesgue measure is the unique Borel measure $\mu$ on $\R$ with $\mu([a, b]) = b-a$. Note that the Lebesgue measure is not a finite measure, since $\mu(\R) = \infty$, however it is a $\sigma$-finite measure (defined below).

We often denote the Lebesgue measure as $\lambda$ (with its $n$-dimensional generalisation on $\gB(\R^{n})$ denoted $\lambda^{n}$).

Definition ($\sigma$-finite measure). Let $(E, \gE)$ be a measurable space, and $\mu$ a measure. We say that $\mu$ is $\sigma$-finite if there exists a sequence $(E_{n})$ in $\gE$ such that $\bigcup_{n}E_{n} = E$, and $\mu(E_{n}) < \infty$ for all $n$.

—

Notice how the measure takes elements of $\gF$ and not elements of the sample space: a common example of this is the probability space $(\R, \gB(\R), \P)$, where $\gB(\R)$ is a Borel sigma-algebra and $\P$ is the Lebesgue measure. We always evaluate a probability density function on an interval (i.e. an element of $\gB(\R)$) and never directly on a real number!

Probability

Definition (Probability space). A probability space is a triple $(\Omega, \gF, \P)$, composed of

the sample space $\Omega$,
an event space $\gF$ which is a $\sigma$-algebra on $\Omega$, and
a probability measure $\P$ on $(\Omega, \gF)$.

Definition (Random variable). Consider two measurable spaces: the first is the pair of a sample space and event space $(\Omega, \gF)$, and the other is an arbitrary pair of a set (a measurable space) and its sigma algebra $(E, \gE)$. A random variable (also called a measurable function 19: Note that when we talk about a measurable function, we will more often use the notation $f$ rather than $X$. ^19[19] ) is a function $X: \Omega \to E$ which maps elements in the sample space to elements in the arbitrary set $E$, such that for any $\gE$-measurable set $B \in \gE$, the preimage $X^{-1}(B)$ is an $\gF$-measurable set:

\[X^{-1}(B) = \{\omega \in \Omega\ :\ X(\omega) \in B\} \in \gF\ \ \forall B \in \gE. \]

$\blacksquare$

Intuitively, this is useful because the actual sample space and event space of our experiments, for instance $E$ and $\gE$ might be tricky to work with, and we’d rather use alternative ones that we’re more comfortable using such as $\Omega$ and $\gF$ (which, concretely, might be $\R$ and $\gB(\R)$). The property above guarantees that there is some sort of correspondence between the events in our original event space $\gF$ and our transformed event space $\gE$.

On terminology, we often say that an $E$-valued random variable is a measurable function $X: \Omega \to E$.

Example. If $E = \R$ and $\gE = \gB(\R)$ a real-valued random variable (sometimes $\R$-valued random variable) assigns a real number to each $\omega \in \Omega$. More generally, we might take $E = \R^{n}$ and $\gE = \gB(\R^{n})$, where the random variable now maps elements of the sample space to an $n$-dimensional Euclidean vector.

Definition (Probability distribution / law). Suppose we have a probability space $(\Omega, \gF, \P)$ allowing us to assign probabilities to events in $\gF$. Suppose that we also have another measurable space $(E, \gE)$ for which we don’t yet have a probability measure to measure events in $\gE$. A probability distribution (also a law or even the pushforward of a measure, denoted $\mu_{x} = \P \circ X^{-1}$ or $X_{\#}\P$, respectively) is an image measure: the composition of the inverse of the measurable function defining the random variable $X$ with the probability measure $\P$. It is effectively a mapping from sets in $\gE$ into $[0, 1]$:

\[X_{\#}\P = (\P \circ X^{-1}): \gE \to [0, 1]. \]

The key idea is that, given some set (read event) in $\gE$, we can use a random variable $X$ to find the pre-image of this set in the event space $\gF$, and then measure this set via the probability measure $\P$ defined on this set. The existence of the random variable $X: \Omega \to \gE$ allows us to measure events $A \in \gE$ not in the original probability space.

Using more probabilistic notation, we would write

\[\P(X \in A) = \mu_{x}(A) = \P(X^{-1}(A)) = \P(\{\omega \in \Omega\ :\ X(\omega) \in A\}). \]

The above works well for countable $E$ (that is, discrete random variables). For uncountable sample spaces, such as when $E = \R$, then $\mu_{X}$ is determined by its values on the intervals $(-\infty, y]$. We set

\[F_{X}(x) = \mu_{X}\big((-\infty, x]\big) = \P(X \le x), \]

where $F_{X}(\cdot)$ is the distribution function of the random variable $X$.

One common example of this is the measurable space $(E, \gE) = \big(\R^{n}, \gB(\R^{n})\big)$, where we have

\[X_{\#}\P:\gB(\R^{n}) \to [0, 1]. \]

Definition (Probability density function). Consider the measurable space $(\R^{n}, \gB(\R^{n})\big)$, and suppose that we have the usual Lebesgue measure $\lambda^{n}$ on this measurable space. This allows us to assign the notion of ‘size’ to each set $A \in \gB(\R^{n})$ based on the likelihood of it happening. To do this, we assign a density to every point in $\R^{n}$, and to compute the new measure of $A$ according to $X_{\#}\P$, we simply take the average of its density at all points in $A$, according to the Lebesgue measure:

\[X_{\#}\P(A) = \int_{A}dX_{\#}\P = \int_{A}fd\lambda^{n}\ \ \forall A \in \gB(\R^{n}). \]

BibTeX Citation

@article{robeyns2021introgps
    author={Maxime Robeyns},
    title={A Second Introduction to Gaussian Processes for Function Approximation},
    year={2023},
    url={https://maximerobeyns.com/second_intro_gps},
}

Find me on Twitter.

References

Kernel Methods for Deep Learning

Cho, Youngmin and Saul, Lawrence (2009)

https://papers.nips.cc/paper/2009/hash/5751ec3e9a4feab575962e78e006250d-Abstract.html

The Matrix Cookbook

Petersen, K. B. and Pedersen, M. S. (2012)

http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html

Footnotes

[0]

As a first introduction, this may make for slightly tedious reading: too thin on the motivation and thick in the details. For more easy-going introductions, see this talk from David MacKay, this one from Rich Turner or the introductory section of the GPML book.

↩︎

[1]

I made this mistake when first learning about GPs: I was able to apply the formulae in a rote manner and code up the model. Yet I didn’t feel I understood why the equations worked.

↩︎

[2]

If you’d like to cover them preemptively, then begin by taking a read through my previous article on the Gaussian distribution, followed by the this article’s Appendix covering the measure theory.

↩︎

[3]

Often, introductions to GPs will avoid this for the sake of accessibility; choosing analogies to parametric function approximation and other visualisations instead. While intuitive and accessible, they might lack the precision that can be afforded with just a little bit of basic measure theory.

↩︎

[4]

In practice, when using stochastic processes, we almost never fix some $\omega$ unless we are plotting sampled functions from the process. Instead, it is usually more interesting to consider the distribution (and moments) of the function points, as well as evaluating the likelihood of new function points under the random function.

↩︎

[5]

or any stochastic process for that matter…

↩︎

[6]

It is easiest to think about regression for now, although we will see later how GPs can be used for classification by changing the likelihood.

↩︎

[7]

Note that we needn’t have observed any observations to use a GP; indeed Bayesian methods shine for how well they do a-priori and when we have observed very few data points

↩︎

[8]

The effect of this $\omega_{1}$ is to ‘pin down’ the random function, which we may now think of as a deterministic function, whose points can be described by a Gaussian with a particular mean and covariance.

↩︎

[9]

Marginalising a Gaussian simply reduces to picking the relevant sub-vector and sub-matrix of the mean and covariance. See this section in my previous article on the Gaussian for more on this.

↩︎

[10]

Note that when using GPs, we never actually do this unless we’re trying to visualise the GP. Instead, we only need to add the training points and test points to the index set.

↩︎

[11]

i.e. where our training points will land, and where we will need to evaluate the function at test time

↩︎

[12]

although this is not always the case; a non-zero mean functions can be extremely useful, if not essential, in certain settings such as Bayesian optimisation, time-series predictions and deep Gaussian processes

↩︎

[13]

See this Fragment for more on computing pairwise distances between datapoints.

↩︎

[14]

These might come up when modelling the output of a computer simulation, or something for which we have extremely precise measurements.

↩︎

[15]

or, using more explicit notation, $p(\rvf_{\star} \vert \gD)$ or even $p(\rvf_{\star} \vert \rmX_{\star}, \rvf, \rmX)$. However, it is common to leave out the indices for notational brevity, with the understanding that $\rvf \doteq f(\rmX)$.

↩︎

[16]

I refrain from saying ‘that were used’, since in practice these equations prove to be numerically unstable. I discuss some methods for practical inversion of kernel matrices in this section of another article on Bayesian linear regression.

↩︎

[17]

we don’t update the test locations since we are concerned with inferring the latent function values $\rvf_{\star}$, not noisy versions thereof!

↩︎

[18]

The former arising when talking about probability, with the latter arising more generally in measure theory.

↩︎

[19]

Note that when we talk about a measurable function, we will more often use the notation $f$ rather than $X$.

↩︎

Article

Published

Discussion

Contents