One does not simply *get* GPs in one sitting. This is particularly true
if you are an engineer or computer scientist and haven’t taken a course on
stochastic processes or anything of that nature.

My goal with this *second*[^0] introduction to GPs is to provide a self-contained
treatment of GPs for the reader who’s sat through a first course or
lecture on the subject, yet still feels like they’re missing the full picture.

[^0]: As a first introduction, this may make for slightly tedious reading: too thin on the motivation and thick on the details. For more easy-going introductions, see this talk from David MacKay, this one from Rich Turner or the introductory section of the GPML book.

There are a few essential preliminaries to be familiar with, without which
the going can be rather tough.[^1]
These include several Gaussian identities and the definition of a stochastic
process, among a few others.

[^1]: I made this mistake when first learning about GPs: I was able to apply the formulae in a rote manner and code up the model. Yet I didn’t feel I understood why the equations worked.
Happily, these preliminaries aren’t difficult to grasp, and we will cover them
as they come up.[^2] With a good hold on these key ideas, the rest of the GP
framework follows almost effortlessly.

[^2]: If you’d like to cover them preemptively, then begin by taking a read through my previous article on the Gaussian distribution, followed by this article’s Appendix covering the measure theory.

## Gaussian Processes

The terminology of measure theory proves to be rather valuable in succinctly
expressing the key ideas behind GPs.[^3]

[^3]: Often, introductions to GPs will avoid this for the sake of accessibility, choosing analogies to parametric function approximation and other visualisations instead. While intuitive and accessible, these might lack the precision that can be afforded with just a little bit of basic measure theory.

If the words *probability space* or *Borel sigma algebra* don’t mean very much to you,
then take a look through the Appendix before
proceeding, in which I give a quick back-alley treatment of the subject, covering just
the bits you need to know.

To establish notation, we begin with the definition of a stochastic process.

### Stochastic Processes

**Definition** (Stochastic process)**.** Consider a probability space \(\gS = (\Omega, \gF, \P)\) and a measurable space \((E, \gE)\). A *stochastic process* \(f\)
is a collection of \(E\)-valued random variables, all defined on the same
probability space \(\gS\), where this collection is indexed by some set \(\gX\).

*Note: to avoid unnecessary abstraction at this point, you may wish to
substitute real-valued random variables, \((E, \gE) = (\R, \gB(\R))\), and
\(d\)-dimensional real-valued indices, \(\gX = \R^{d}\), every time these come up,
since these will closely resemble the GP regression case later.*

Let \(\ervf_{x}\) or \(f(x)\) denote the random variable in the collection \(f\)
indexed by \(x\). That is, by the definition of a random variable, \(f(x)\) is
itself a function \(f(x) : \Omega \to E\) of some sample \(\omega \in \Omega\).
Each *outcome* of a stochastic process is a function of both the
index of the random variable \(x \in \gX\) and a sample \(\omega \in \Omega\). Hence, it is common to write a stochastic process as the following
mapping:

\[f: \gX \times \Omega \to E. \]

In this sense, \(f\) is a *random function* or a *distribution*
over functions. We can sample functions from our stochastic process just as we
would sample values from a random variable.

To do this, we fix some \(\omega \in \Omega\) at which we will evaluate all the random variables in our collection \(f\) that we decide to look at (i.e. that we decide to index). That is, the sampled function is

\[f(\cdot, \omega): \gX \to E, \]

where by selecting some \(\omega\), we have ‘fixed’ the random function,[^4] which now behaves like any ordinary deterministic
function. The indices \(x \in \gX\) can be interpreted more directly as ‘function
inputs’ which are mapped to single points in the output space \(E\).

[^4]: In practice, when using stochastic processes, we almost never fix some \(\omega\) unless we are plotting sampled functions from the process. Instead, it is usually more interesting to consider the distribution (and moments) of the function points, as well as evaluating the likelihood of new function points under the random function.
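To make the ‘fixing \(\omega\)’ idea concrete, here is a minimal sketch using a toy stochastic process, a random sinusoid (the example and all names are my own illustrative choices, not from the text):

```python
import numpy as np

# Toy stochastic process: f(x, ω) = A(ω) · sin(x + φ(ω)), where the
# randomness ω amounts to the draws A ~ N(0, 1) and φ ~ Uniform(0, 2π).
rng = np.random.default_rng(0)

def sample_function(rng):
    """Fixing ω corresponds to drawing (A, φ) once; the returned object
    is then an ordinary deterministic function of the index x."""
    A = rng.normal()
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return lambda x: A * np.sin(x + phi)

f = sample_function(rng)           # f(·, ω): X → E, now deterministic
xs = np.linspace(-10.0, 10.0, 5)   # five indices at which to evaluate f
ys = f(xs)                         # five consistent function values
```

Calling `f` again at the same indices returns the same values: having fixed \(\omega\), all randomness is gone.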

A stochastic process is characterised by the joint probability distribution of \(f\) on arbitrary finite subsets of \(\gX\). That is, if \(\rmX_{n} \subseteq \gX\) is a finite \(n\)-element vector of indices, then we obtain the following random vector

\[\rvf_{\rmX_{n}} = (\ervf_{x_{1}}, \ldots, \ervf_{x_{n}}) = \big(f(x_{1}), \ldots, f(x_{n})\big) = f(\rmX_{n}). \]

More formally, we refer to these as finite-dimensional distributions.

**Definition** (Finite-dimensional distribution)**.** Consider a probability
space \((\Omega, \gF, \P)\), measurable space \((E, \gE)\) and a stochastic process
\(f: \gX \times \Omega \to E\). The *finite-dimensional distributions*
(FDDs) of \(f\) are the (joint, \(n\)-dimensional) probability distributions
\(\P^{f}_{x_{1}, \ldots, x_{n}}\) on the product space \(E^{n}\) defined by

\[\P^{f}_{x_{1}, \ldots, x_{n}}(S) \doteq \P\Big(\big\{\omega \in \Omega\ :\ \big(f(x_{1}, \omega), \ldots, f(x_{n}, \omega)\big) \in S\big\}\Big), \]

where \(S\) is a measurable subset of \(E^{n}\) and \(n \in \mathbb{N}\).

### The Gaussian process

**Example** (Gaussian process)**.** A *Gaussian process* is a stochastic
process, whose finite-dimensional distributions are multivariate Gaussian.

That is, the FDDs of the Gaussian process are

\[\P^{f}_{x_{1}, \ldots, x_{n}}(S) \doteq \gN(S; \vmu, \mSigma), \]

where \(S\) is a measurable subset of \(\R^{n}\), and the Gaussian is parametrised by a mean vector \(\vmu \in \R^{n}\) and a covariance matrix \(\mSigma \in \R^{n\times n}\).

While the above is sufficiently abstract to entertain a variety of Gaussian processes (e.g. over indices \(\gX\) which are graphs or strings), for most straightforward applications of GP regression, the following will suffice:

- We consider real-valued random variables, that is \((E, \gE) = (\R, \gB(\R))\) in which case each random variable \(\ervf_{x}\) indexed by some \(x \in \gX\) maps some element of the sample space to the reals: \(\ervf_{x}: \Omega \to \R\).
- The collection of random variables that makes up the stochastic process can be
denoted \(\{f(x)\ :\ x \in \gX\}\), where in GP regression we often have
\(\gX = \R^{d}\), for \(d \ge 1\). In this case, since \(\gX\) is uncountable,
so too is this collection of random variables, and we call the stochastic
process a *continuous* stochastic process.

### Modelling Functions with GPs

For the purposes of modelling functions with Gaussian processes,[^5] we interpret the index set \(\gX\) as
the *domain* or inputs to our function.

[^5]: or any stochastic process for that matter…

For instance, for time-series prediction the indices may be a set of time values we have in a training dataset, as well as time values we’d like to make predictions at. For Bayesian optimisation of an ML model’s hyperparameters, the indices may be the set of hyperparameters we’ve already tried, as well as the ones we’re interested in trying next. For plotting a 1d function on a graph, the indices are the real values along the x-axis at which we’d like to plot the function.

More generally, the indices are the vector-valued covariates for our
prediction problem[^6] for which we have training observations,[^7] as well as those for which we’d like to make predictions
(obtain a predictive *distribution*).

[^6]: It is easiest to think about regression for now, although we will see later how GPs can be used for classification by changing the likelihood.
[^7]: Note that we needn’t have observed any data to use a GP; indeed, Bayesian methods shine for how well they perform a priori and when we have observed very few data points.

#### Single Sampled Function

To see how the GP corresponds to a collection of random variables at the
points of interest (i.e. train and test points) which are in some sense
*consistent* with one another, we will build up to this one index at a time,
working with a simple 1d function \(f: \R \to \R\),
having already fixed some value in the sample space, \(\omega_{1} \in \Omega\).[^8]

[^8]: The effect of this \(\omega_{1}\) is to ‘pin down’ the random function, which we may now think of as a deterministic function, whose points can be described by a Gaussian with a particular mean and covariance.

*Note that in this section we are only interested in seeing how GPs can be
used to represent functions. We leave the form of \(f\) unspecified, as well as
how we actually find the parameters of the GP’s FDD, which will be
the subject of the next section.*

Picking an initial index value \(\rvx_{1} \in \gX\) gives a single element subset of the GP’s infinite collection of real-valued random variables, \(\{\ervf_{\rvx_{1}}\} \subset \{\ervf_{\rvx}\ :\ \rvx\in \gX\}\). The random variable in this singleton set is univariate Gaussian

\[\begin{equation}\label{eq:univariate_gaussian} \ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1}). \end{equation} \]

Having already fixed an element of the sample space, \(\omega_{1}\), we can evaluate this random variable \(\ervf_{\rvx_{1}} : \Omega \to \R\) at \(\omega_{1}\) to get a real value which we may plot as the first function point / ‘pixel’ of our sampled random function:

Adding a second index \(\rvx_{2}\) to the index set, \(\{\rvx_{1}, \rvx_{2}\} \subset \gX\), we now obtain a two-element set of random variables \(\{\ervf_{\rvx_{1}}, \ervf_{\rvx_{2}}\} \subset \{\ervf_{\rvx}\ :\ \rvx\in \gX\}\). By the definition of a GP as a stochastic process with Gaussian FDDs, these two random variables form a Gaussian random vector distributed as

\[\begin{bmatrix}\ervf_{\rvx_{1}} & \ervf_{\rvx_{2}}\end{bmatrix}^{\top} \sim \gN(\vmu, \mSigma), \]

for some \(\vmu \in \R^{2}\) and \(\mSigma \in \R^{2\times 2}\), whose exact form
we will cover in the next section. For now, note that the fact that
\(\ervf_{\rvx_{1}}\) and \(\ervf_{\rvx_{2}}\) are elements of the same joint
Gaussian distribution is the sense in which the predictions at \(\rvx_{1}\) and
\(\rvx_{2}\) are *consistent*. Marginalising[^9] this 2-dimensional FDD further gives
\(\ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1})\) and \(\ervf_{\rvx_{2}} \sim \gN(\mu_{2}, \Sigma_{2})\), where the former is exactly the univariate Gaussian
from Equation \(\ref{eq:univariate_gaussian}\).

[^9]: Marginalising a Gaussian simply reduces to picking the relevant sub-vector and sub-matrix of the mean and covariance. See this section in my previous article on the Gaussian for more on this.

Once again, having already fixed \(\omega_{1} \in \Omega\), we can evaluate both \(\ervf_{\rvx_{1}}\) and \(\ervf_{\rvx_{2}}\) at \(\omega_{1}\) which gives us two real-valued function points / ‘pixels’ which lie on the same sampled function line:

This does not offer us a very high ‘resolution’ for plotting our function. Indexing using 100 additional points, evenly spaced over \([-10, 10]\), evaluating the resulting 100-dimensional Gaussian random vector at \(\omega_{1}\) and plotting these points lets us see the rest of the sampled function more clearly:

Hence indexing results in a Gaussian random vector, and sampling from this gives us consistent realisations of the random function evaluated at the indices / inputs.
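A minimal sketch of this procedure, assuming a zero mean and a squared-exponential covariance (kernels are only introduced properly in the section on priors, so that choice is an assumption here):

```python
import numpy as np

# Index the process at 100 points, form the 100-dimensional Gaussian FDD,
# and evaluate one draw (one fixed ω, realised here by a fixed seed).
rng = np.random.default_rng(1)

xs = np.linspace(-10.0, 10.0, 100)                   # indices / inputs
mu = np.zeros_like(xs)                               # mean vector
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)  # covariance matrix
K += 1e-9 * np.eye(len(xs))                          # jitter for stability

f_sample = rng.multivariate_normal(mu, K)            # one realisation of f(X)
```

Each call with a fresh seed corresponds to a different \(\omega\), and hence a different sampled function.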

#### Distributions over Functions

If we now refrain from picking a single element of the sample space \(\Omega\),
we can consider the *distribution* over functions.

Working up to this once more from a single index, the univariate Gaussian FDD at \(\rvx_{1}\) is \(\ervf_{\rvx_{1}} \sim \gN(\mu_{1}, \Sigma_{1})\). This is just a univariate Gaussian distribution, which we will suggestively plot vertically on the function plot, along with five sampled function points from this univariate Gaussian:

Continuing as we did before by adding a second index to the index set, \(\rvx_{2}\), this yields a second marginal \(\ervf_{\rvx_{2}}\) of the now 2-dimensional joint, which we will also plot vertically at the corresponding index location \(\rvx_{2}\). Under the density, we have also plotted 5 samples using the same \(\{\omega_{1}, \ldots, \omega_{5}\}\) as were used to draw the samples at \(\rvx_{1}\).

Continuing in this way by adding more and more indices to the index set, we can
draw out the full function lines[^10] for
our five elements of the sample space \(\{\omega_{1}, \ldots, \omega_{5}\}\subset \Omega\), for some index / design matrix \(\rmX \in \R^{n\times 1}\).

[^10]: Note that when using GPs, we never actually do this unless we’re trying to visualise the GP. Instead, we only need to add the training points and test points to the index set.

You’ll notice that we stopped visualising the distribution over functions at the various indices using vertical Gaussian densities at each input location, since this quickly becomes messy. Visualising the distribution by drawing a bunch of functions (as we have done above, using 5 samples) is equally messy. Rather, to represent the broadness of these vertical Gaussians at various inputs and communicate our uncertainty about the function values at those points, it is common to plot \(\mu_{\rvx} \pm 2\sigma_{\rvx}\) for all \(\rvx \in \rmX\).

For the GP *prior* that we have been sampling from in the plots above
(i.e. the GP without conditioning on any training data) this is usually
fairly boring: a constant mean and standard deviation / uncertainty at all
points, corresponding to how we have defined the prior (see the next section).

Once we have conditioned on training data however, we may observe varying degrees of uncertainty in our random function at different points away from the data.

In general, to define a Gaussian it suffices to define its first two moments: the mean and covariance. For any \(n\) indices \(\rmX = \{\rvx_{i} \in \gX\}_{i=1}^{n}\), the resulting FDD is an \(n\)-dimensional Gaussian, parametrised by \(\vmu \in \R^{n}\) and \(\mSigma \in \R^{n\times n}\). In the next section, we will see how to obtain these parameters for a given GP prior, a training dataset, and some test points of interest.

## Defining the GP Prior

When we place a *Gaussian process prior* on a function that we’re interested
in modelling, we define a way of obtaining the first two moments of the
FDD; in other words, we define a mean *function* \(m: \gX \to \R\) and
a positive definite covariance (or kernel) *function* \(k: \gX \times \gX \to \R\).

This is convenient since we often don’t know which indices \(\rmX\) we’ll choose
to look at when choosing the GP prior.[^11] With
these two functions, we can postpone the calculation of the FDD’s first
two moments until we obtain the indices we need.

[^11]: i.e. where our training points will land, and where we will need to evaluate the function at test time.

Supposing that our indices or covariates live in the space \(\gX = \R^{d}\), and
we select / observe \(n\) of them, \(\rmX \in \R^{n \times d}\), the resulting
*prior* FDD can be found by the element-wise evaluation of these functions at
each of the indices, giving us our mean vector \(m(\rmX) = \vmu \in \R^{n}\)
and covariance or *Gram* matrix \(\rmK\), where \(\ermK_{i, j} = k(\rvx_{i}, \rvx_{j})\). The positive definite constraint on the kernel function says
that the Gram matrices it produces satisfy \(\rvv^{\top}\rmK\rvv \gt 0\) for all
\(\rvv \neq \mathbf{0}\), which is required for these to be valid covariance matrices.
Drawing \(n\)-dimensional samples from \(\gN(\vmu, \rmK)\) results
in (prior) function samples, as plotted in the penultimate diagram above.

The mean and covariance functions are the main levers we can pull to influence
the expression of our prior beliefs about the function of interest.
In most cases, using a zero-mean function \(m(\rvx) \doteq \mathbf{0}\) is a reasonable
choice,[^12] leaving
the characterisation of the function down to the choice of kernel function.
Different kernel functions can influence things such as the smoothness or
periodicity of the resulting functions—for a nice taxonomy of some common
kernels, see this page by
David Duvenaud.

[^12]: although this is not always the case; a non-zero mean function can be extremely useful, if not essential, in certain settings such as Bayesian optimisation, time-series prediction and deep Gaussian processes.

Some popular kernels include the following, where \(r = |\rvx - \rvx'|\)
represents a pairwise distance between points.[^13]

[^13]: See this Fragment for more on computing pairwise distances between datapoints.

Kernel | \(k(\rvx, \rvx')\) |
---|---|
Squared Exponential | \(\sigma^{2} \exp \big(-\frac{r^{2}}{2\ell^{2}}\big)\) |
Rational Quadratic | \(\sigma^{2}\left(1 + \frac{r^{2}}{2\alpha \ell^{2}}\right)^{-\alpha}\) |
Matérn 3/2 | \(\left(1 + \frac{\sqrt{3}r}{\ell}\right)\exp\left(-\frac{\sqrt{3}r}{\ell}\right)\) |
Matérn 5/2 | \(\left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^{2}}{3\ell^{2}}\right)\exp\left(-\frac{\sqrt{5}r}{\ell}\right)\) |
ArcCosine (Cho et al., 2009) | \(\sin \theta + (\pi - \theta)\cos\theta\) |

Note that for the ArcCos kernel, we have \(\cos \theta = \frac{\rvx \cdot \rvx'}{\vert \rvx \vert \vert \rvx' \vert}\). The \(\sigma\) and \(\ell\) terms are hyperparameters which influence the properties of each kernel; we discuss how to set these later.
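As a sketch, the squared-exponential and Matérn 3/2 kernels from the table might be implemented as follows (vectorised over two sets of points; the function names and default hyperparameters are my own choices):

```python
import numpy as np

def sq_exp(X1, X2, sigma=1.0, ell=1.0):
    """Squared exponential: k(x, x') = sigma^2 exp(-r^2 / (2 ell^2))."""
    r2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma ** 2 * np.exp(-0.5 * r2 / ell ** 2)

def matern32(X1, X2, ell=1.0):
    """Matérn 3/2: k(x, x') = (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    r = np.sqrt(np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1))
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)

X = np.random.default_rng(0).normal(size=(5, 2))  # five points in R^2
K = sq_exp(X, X)                                  # 5×5 Gram matrix
```

A useful sanity check for any kernel implementation is that the resulting Gram matrix is symmetric with no negative eigenvalues (up to numerical jitter).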

In summary, we define a Gaussian process prior \(f: \gX \times \Omega \to \R\), which we denote as

\[f \sim \gG\gP(m, k), \]

by defining the functions \(m(\cdot)\) and \(k(\cdot, \cdot)\), which in turn determine the first two moments of the resulting FDDs:

\[\begin{align*} m(\rvx) &= \E[f(\rvx)], \\ k(\rvx, \rvx') &= \E\big[\big(f(\rvx) - m(\rvx)\big)\big(f(\rvx') - m(\rvx')\big)\big]. \end{align*} \]

### Conditioning on Data

Defining a GP prior and drawing samples from it is not usually very
useful in itself: rather we seek to make predictions about new inputs by
updating or *conditioning* our beliefs about the latent function values on
observed training data.
In this section, we will see how we can obtain the updated parameters of the
FDD and parametrise the *posterior* predictive / posterior process; as
was done to produce the last plot in the previous section.

For clarity of exposition, we will first assume a noiseless regression
setup.[^14]

[^14]: These might come up when modelling the output of a computer simulation, or something for which we have extremely precise measurements.

Consider a training set of \(n\) (*input, function-value*) pairs, \(\gD = \{(\rvx_{i}, f_{i})\}_{i=1}^{n}\), whose \(d\)-dimensional inputs we collectively denote as
\(\rmX \in \R^{n \times d}\) and where the function points are denoted as \(\rvf = f(\rmX) \in \R^{n}\). Further consider a set of \(m\) new indices / test points,
which we denote \(\rmX_{\star} \in \R^{m\times d}\) at which we’d like to
make predictions \(\rvf_{\star} = f(\rmX_{\star})\).

With these input points defined we can compute the GP’s FDD at \(\{\rmX, \rmX_{\star}\}\) by evaluating the mean and covariance functions at these input points. This gives us a joint distribution over the function values at the training and test locations \(p(\rvf, \rvf_{\star})\). Writing these parameters as a partitioned mean vector and covariance matrix, we obtain the distribution over the Gaussian random vector \(\begin{bmatrix}\rvf, \rvf_{\star}\end{bmatrix}\) as

\[\begin{bmatrix}\rvf \\ \rvf_{\star} \end{bmatrix} \sim \gN\left(\begin{bmatrix}\vmu \\ \vmu_{\star}\end{bmatrix}, \begin{bmatrix}\rmK & \rmK_{\star} \\ \rmK_{\star}^{\top} & \rmK_{\star\star}\end{bmatrix}\right), \]

where \(\rmK = k(\rmX, \rmX) \in \R^{n\times n}\), \(\rmK_{\star} = k(\rmX, \rmX_{\star}) \in \R^{n\times m}\), \(\vmu_{\star} = m(\rmX_{\star}) \in \R^{m}\) and so forth.

Given this joint model, we now note that while we know the function values at the training inputs \(\rvf\) (these are in our training set), we don’t know the function values at the test points \(\rvf_{\star}\)—this is what we would like to infer.

To make predictions, we therefore seek the *predictive distribution* of
function values at the test points, conditioned on the observed training
dataset, \(p(\rvf_{\star} \vert \rvf)\).[^15]

[^15]: or, using more explicit notation, \(p(\rvf_{\star} \vert \gD)\) or even \(p(\rvf_{\star} \vert \rmX_{\star}, \rvf, \rmX)\). However, it is common to leave out the indices for notational brevity, with the understanding that \(\rvf \doteq f(\rmX)\).

Owing to the convenient properties of the Gaussian, this Gaussian conditional distribution has a closed-form solution, which I give a full derivation of in this section of my previous article on the Gaussian distribution. If you would find it satisfying to know how GPs work end-to-end, I encourage you to read through the proof. Otherwise, I have provided the relevant results below which you may use ad-hoc:

### Conditional of a Joint Gaussian Distribution

For a partitioned Gaussian joint \(\rvx \sim \gN(\vmu, \mSigma)\), where

\[\rvx = \begin{bmatrix}\rvx_{1} \\ \rvx_{2}\end{bmatrix}\ \ \ \vmu = \begin{bmatrix}\vmu_{1} \\ \vmu_{2} \end{bmatrix} \ \ \ \mSigma = \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}, \]

and, necessarily, \(\mSigma_{12} = \mSigma_{21}^{\top}\), the distribution of \(\rvx_{2}\) conditioned on \(\rvx_{1}\) is

\[\begin{align} p(\rvx_{2} \vert \rvx_{1}) &= \gN(\rvx_{2}; \vmu_{2 \vert 1}, \mSigma_{2\vert 1}) \nonumber \\ \vmu_{2 \vert 1} &= \vmu_{2} + \mSigma_{21}\mSigma_{11}^{-1}(\rvx_{1} - \vmu_{1}) \label{eq:conditional_mean} \\ \mSigma_{2 \vert 1} &= \mSigma_{22} - \mSigma_{21}\mSigma_{11}^{-1}\mSigma_{12} = (\mSigma / \mSigma_{11}) \label{eq:conditional_cov}. \end{align} \]

For other related results, Petersen *et al.* (2012) provides an excellent
reference.

Hence the Gaussian random vector \(\rvf_{\star}\) conditioned on our training points is distributed as \(p(\rvf_{\star} \vert \rmX_{\star}, \rvf, \rmX) = \gN(\rvf_{\star}; \vmu_{\star \vert \rmX}, \mSigma_{\star \vert \rmX})\) with parameters

\[\begin{align} \vmu_{\star \vert \rmX} &= \vmu_{\star} + \rmK_{\star}^{\top}\rmK^{-1}(\rvf - \vmu) \\ \mSigma_{\star\vert \rmX} &= \rmK_{\star\star} - \rmK_{\star}^{\top}\rmK^{-1}\rmK_{\star}, \end{align} \]

where we have used Equations \(\ref{eq:conditional_mean}\) and \(\ref{eq:conditional_cov}\) for the mean and covariance, respectively.
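These two equations can be sketched in code as follows. A zero prior mean and a squared-exponential kernel are my assumptions here, and `np.linalg.solve` stands in for the explicit matrix inverse:

```python
import numpy as np

def kernel(A, B, ell=1.0):
    """Squared-exponential kernel on scalar inputs (illustrative choice)."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

X = np.array([-4.0, -1.0, 2.0])        # training inputs
f = np.sin(X)                          # noiseless function values f(X)
Xs = np.linspace(-5.0, 5.0, 50)        # test inputs

K = kernel(X, X) + 1e-9 * np.eye(len(X))   # K (with jitter for stability)
Ks = kernel(X, Xs)                          # K_*  (n × m)
Kss = kernel(Xs, Xs)                        # K_**

# With a zero prior mean, mu and mu_* vanish from the equations:
# mu_{*|X} = K_*^T K^{-1} f   and   Sigma_{*|X} = K_** - K_*^T K^{-1} K_*
mu_post = Ks.T @ np.linalg.solve(K, f)
cov_post = Kss - Ks.T @ np.linalg.solve(K, Ks)
```

Evaluating the posterior mean at the training inputs themselves reproduces \(\rvf\) exactly, which is the interpolation behaviour of the noiseless model.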

These are the equations that can be used[^16] to plot the predictive GP
conditioned on some training data:

[^16]: I refrain from saying ‘*that were used*’, since in practice these equations prove to be numerically unstable. I discuss some methods for practical inversion of kernel matrices in this section of another article on Bayesian linear regression.

Notice how the mean of the posterior perfectly interpolates the training data, and the predictive uncertainty ‘pinches’ around the training points. This is an artefact of the noiseless modelling assumption. In most cases when modelling observed phenomena this is unrealistic, and this may further have undesired effects on the predicted function. For instance if two similar inputs have differing observations (e.g. due to corrupting observation noise on a sensor), then the predicted function (distribution) will oscillate wildly around that point to match the data:

### Noise-Corrupted Observations

To rectify the above, we now assume that instead of observing noiseless function values \(f(\rvx_{n})\) in our training set, we rather observe noise corrupted versions thereof; \(\ervy_{n} = f(\rvx_{n}) + \epsilon_{n}\), for \(\epsilon_{n} \sim \gN(0, \sigma^{2}_{y})\).

Note that this unimodal, Gaussian noise modelling assumption is required for the exact GP setting to remain tractable; in a subsequent article, however, we will see how we can relax this constraint and use different likelihoods.

Due to our assumptions that the noise has zero mean (i.e. is centred at the latent
function values \(\rvf\)) and is i.i.d. (i.e. has a diagonal covariance),
incorporating this modelling assumption is fairly straightforward, only
requiring us to update the diagonal entries of the covariance matrix at the
training locations:[^17]

[^17]: we don’t update the test locations since we are concerned with inferring the latent function values \(\rvf_{\star}\), not noisy versions thereof!

\[\Cov[\rvy \vert \rmX] = \rmK + \sigma^{2}_{y}\rmI_{n}. \]

Hence, our model for the joint density of our observations (which we now denote \(\rvy\)) and the latent function values at the test inputs \(\rvf_{\star}\) becomes:

\[\begin{equation} \begin{bmatrix}\rvy \\ \rvf_{\star}\end{bmatrix} \sim \gN\left( \begin{bmatrix}\vmu \\ \vmu_{\star} \end{bmatrix}, \begin{bmatrix}\rmK + \sigma^{2}_{y}\rmI & \rmK_{\star} \\ \rmK_{\star}^{\top} & \rmK_{\star\star}\end{bmatrix} \right). \end{equation} \]

Conditioning on the observations \(\rvy\) proceeds exactly as in the noiseless case (Equations \(\ref{eq:conditional_mean}\) and \(\ref{eq:conditional_cov}\)) with \(\rmK + \sigma^{2}_{y}\rmI\) substituted for \(\rmK\), giving the posterior predictive distribution in the noisy case as:

\[\begin{align} p(\rvf_{\star} \vert \rmX_{\star}, \rvy, \rmX) &= \gN(\rvf_{\star}; \vmu_{\star \vert \rmX}, \mSigma_{\star \vert \rmX}) \\ \vmu_{\star \vert \rmX} &= \vmu_{\star} + \rmK_{\star}^{\top}(\rmK + \sigma_{y}^{2}\rmI)^{-1}(\rvy - \vmu) \\ \mSigma_{\star \vert \rmX} &= \rmK_{\star\star} - \rmK_{\star}^{\top}(\rmK + \sigma_{y}^{2}\rmI)^{-1}\rmK_{\star}. \end{align} \]

Setting the noise level to \(\sigma_{y}^{2} = 0.2\), we can now see that the mean
function values now merely pass *close* to the training datapoints, without the
hard requirement to interpolate through them:
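The noisy-case equations differ from the noiseless ones only in the \(\sigma^{2}_{y}\rmI\) term. A sketch, again assuming a zero prior mean and a squared-exponential kernel (my illustrative choices):

```python
import numpy as np

def kernel(A, B, ell=1.0):
    """Squared-exponential kernel on scalar inputs (illustrative choice)."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
X = np.array([-4.0, -1.0, -0.9, 2.0])                # note two nearby inputs
y = np.sin(X) + rng.normal(scale=0.2, size=X.shape)  # noisy observations
Xs = np.linspace(-5.0, 5.0, 50)                      # test inputs

sigma_y2 = 0.2
Ky = kernel(X, X) + sigma_y2 * np.eye(len(X))        # K + sigma_y^2 I
Ks = kernel(X, Xs)

mu_post = Ks.T @ np.linalg.solve(Ky, y)              # passes *close* to the data
cov_post = kernel(Xs, Xs) - Ks.T @ np.linalg.solve(Ky, Ks)
```

Unlike the noiseless case, the posterior mean no longer interpolates the targets exactly: the noise term shrinks the predictions towards the prior mean, smoothing over the two conflicting nearby observations.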

Due to this behaviour, these simple, single-layer GPs are often referred
to as *smoothing devices*.

## Appendix A: The Measure Theory You Need To Know (and no more)

This Appendix collects the bits of basic measure theory which are useful for understanding GPs. It reads like a list of definitions, which you should aim to commit to memory.

**Definition** (Sample Space)**.** A *sample space* \(\Omega\) is the set of all
outcomes of a probabilistic experiment. More generally, a *measurable space*,
commonly denoted \(E\), is the set of things which we are interested in
measuring. A sample space is a measurable space.[^18]

[^18]: The former arises when talking about probability, while the latter arises more generally in measure theory.

**Example.** For a single toss of a coin, the sample space is \(\Omega = \{\omega_{1}, \omega_{2}\} = \{H, T\}\).

### Sigma Algebras

**Definition** (Sigma-algebra, event space)**.** These are the same thing, with
the latter being a probabilistic interpretation of the former. A
\(\sigma\)-algebra \(\gE\) is a collection of ‘well-behaved’ subsets of a measurable
space (those to which we can consistently assign a measure). If the measurable
space is finite and discrete,
then its power set \(\gP(E)\) can be used as a valid sigma algebra. In the
probabilistic interpretation, for a finite sample space \(\Omega\), the event
space (which we usually denote \(\gF\)) can be taken as a subset of \(\gP(\Omega)\),
elements of which are called *events*. The elements of a sigma-algebra must satisfy
the following conditions:

- \(E \in \gE\); the whole set (in probabilistic terms, the sample space) is an event (the *sure event*)
- \(A \in \gE \implies A^{c} \in \gE\); it is closed under complementation (the complement of an event is an event)
- \(A_{1}, A_{2}, \dots \in \gE \implies \bigcup_{i=1}^{\infty} A_{i} \in \gE\); it is closed under countable unions (the union of countably many events is an event)

If a measurable space \(E\) is countable, then this is nice because we can always
take the power set \(\gE = \gP(E)\) and the measure can be defined on all
possible subsets. However, for ‘bigger’ spaces, we have to be more careful,
since the set of all subsets is often too large to work with. For such
inconvenient, infinite event spaces, we usually define \(\sigma\)-algebras in
terms of a smaller set, known as a *generating set*.
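For a finite set, the three closure properties above can be checked exhaustively; a small sketch using the power set of a two-element sample space (the example is mine, not the text’s):

```python
from itertools import chain, combinations

def powerset(E):
    """All subsets of E, as frozensets."""
    s = list(E)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))}

E = frozenset({'H', 'T'})   # sample space of a single coin toss
F = powerset(E)             # candidate sigma-algebra: the full power set

assert E in F                                  # contains the sure event
assert all(E - A in F for A in F)              # closed under complements
assert all(A | B in F for A in F for B in F)   # closed under (finite) unions
```

For a finite space, countable unions reduce to finite unions, so these checks cover all three axioms.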

**Definition** (Generator of a \(\sigma\)-algebra)**.** Let \(E\) be a set, and
\(\gA \subseteq \gP(E)\) be a collection of subsets of \(E\). We define

\[\sigma(\gA) = \{A \subseteq E\ :\ A \in \gE\text{ for all $\sigma$-algebras $\gE$ that contain $\gA$}\}. \]

Put otherwise, \(\sigma(\gA)\) is the smallest sigma algebra that contains \(\gA\).
This is known as the \(\sigma\)-algebra *generated* by \(\gA\).

**Definition** (Borel \(\sigma\)-algebra)**.** Consider a real measurable space
\(E = \R\) and the generator \(\gA = \{U \subseteq \R\ :\ U\text{ is open}\}\).
Then \(\sigma(\gA)\) is known as the *Borel \(\sigma\)-algebra*. Note that this is *not*
the set of all subsets of \(\R\): the full power set is too large to support a well-behaved measure.

We can perhaps more intuitively, and entirely equivalently, define this via \(\bar{\gA} = \{(a, b)\ :\ a < b,\ a, b \in \Q\}\)—the set of open intervals with rational endpoints. Then \(\sigma(\bar{\gA})\) is also the Borel \(\sigma\)-algebra. We denote the Borel \(\sigma\)-algebra as \(\gB(\R)\).

### Measures

**Definition** (Measurable space)**.** A pair \((E, \gE)\) of a set and a
\(\sigma\)-algebra on it (for instance a sample space \(\Omega\) and an event space
\(\gF\)) is called a *measurable space*. This is because we can define a function
that ‘measures’ the size of the elements of \(\gE\). In fact, any element \(B \in \gE\)
of a sigma algebra is called a *measurable set*.

**Definition** (Measure)**.** A *measure* on a measurable space \((E, \gE)\) is a
function \(\mu : \gE \to [0, +\infty]\), such that

- \(\mu(\emptyset) = 0\)
- It is countably additive: for any disjoint sequence \((A_{n})\) in \(\gE\),

\[\mu\left(\bigcup_{n}A_{n}\right) = \sum_{n=1}^{\infty}\mu(A_{n}). \]

**Example** (Mass function)**.** Let \(E\) be any countable set, and \(\gE = \gP(E)\) its power set (i.e. the set of all subsets of \(E\)). A *mass function*
is any function \(m : E \to [0, +\infty)\). We can define a measure by setting

\[\mu(A) = \sum_{x \in A}m(x). \]

In particular, if we put \(m(x) = 1\) for all \(x \in E\), then we obtain the *counting measure*.
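A sketch of this construction for a small finite set (all names are illustrative):

```python
# Measure induced by a mass function on a countable set:
# mu(A) = sum over x in A of m(x); with m(x) = 1 this is the counting measure.
E = {'a', 'b', 'c', 'd'}
m = {x: 1.0 for x in E}        # mass function m(x) = 1 for all x

def mu(A):
    """Measure of a subset A of E induced by the mass function m."""
    return sum(m[x] for x in A)
```

With this definition, `mu(set())` is `0` and `mu` is additive over disjoint subsets, exactly as the measure axioms require.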

**Definition** (Probability measure)**.** Let \((\Omega, \gF)\) be a measurable
space, and let \(\mu\) be a measure on it with the property that \(\mu(\Omega) = 1\). Then we call \(\mu\) a
*probability measure* and usually denote it \(\P\) instead of \(\mu\). Note that
\(\P\) consequently maps to \([0, 1]\) rather than \([0, +\infty]\).

**Definition** (Lebesgue measure)**.** The *Lebesgue measure* is the unique
Borel measure \(\mu\) on \(\R\) with \(\mu([a, b]) = b-a\). Note that the Lebesgue
measure is not a finite measure, since \(\mu(\R) = \infty\); however, it is a
\(\sigma\)-finite measure (defined below).

We often denote the Lebesgue measure as \(\lambda\) (with its \(n\)-dimensional generalisation on \(\gB(\R^{n})\) denoted \(\lambda^{n}\)).

**Definition** (\(\sigma\)-finite measure)**.** Let \((E, \gE)\) be a measurable
space, and \(\mu\) a measure. We say that \(\mu\) is \(\sigma\)-finite if there
exists a sequence \((E_{n})\) in \(\gE\) such that \(\bigcup_{n}E_{n} = E\), and
\(\mu(E_{n}) < \infty\) for all \(n\).

Notice how the measure takes elements of the \(\sigma\)-algebra and not elements of the underlying set: a common example of this is the probability space \((\R, \gB(\R), \P)\), where \(\gB(\R)\) is the Borel \(\sigma\)-algebra and \(\P\) is a probability measure defined, say, via a density with respect to the Lebesgue measure. We always evaluate such a measure on a measurable set such as an interval (i.e. an element of \(\gB(\R)\)) and never directly on a single real number!

### Probability

**Definition** (Probability space)**.** A *probability space* is a triple
\((\Omega, \gF, \P)\), composed of

- the sample space \(\Omega\),
- an event space \(\gF\) which is a \(\sigma\)-algebra on \(\Omega\), and
- a probability measure \(\P\) on \((\Omega, \gF)\).

**Definition** (Random variable)**.** Consider two measurable spaces: the first
is the pair of a sample space and event space \((\Omega, \gF)\), and the other is
an arbitrary pair of a set and its sigma algebra \((E, \gE)\). A *random
variable* (also called a *measurable function*
19: Note that when we talk about
a measurable function, we will more often use the notation \(f\) rather than
\(X\).
^{19[19]}
) is a function \(X: \Omega \to E\) which maps elements in the sample space
to elements in the arbitrary set \(E\), such that for any \(\gE\)-measurable set \(B \in \gE\), the preimage \(X^{-1}(B)\) is an \(\gF\)-measurable set:

\[X^{-1}(B) = \{\omega \in \Omega\ :\ X(\omega) \in B\} \in \gF\ \ \forall B \in \gE. \]

\(\blacksquare\)

Intuitively, this is useful because the actual sample space and event space of our experiment, \(\Omega\) and \(\gF\), might be tricky to work with, and we’d rather use alternatives that we’re more comfortable with, such as \(E\) and \(\gE\) (which, concretely, might be \(\R\) and \(\gB(\R)\)). The property above guarantees that there is a correspondence between the events in our original event space \(\gF\) and our transformed event space \(\gE\).
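The preimage condition can be checked by brute force on a small finite example. The following Python sketch is an illustrative assumption, not from the text: \(\Omega = \{1, \dots, 6\}\) (a die roll) with \(\gF = \gP(\Omega)\), \(E = \{0, 1\}\) with \(\gE = \gP(E)\), and \(X(\omega) = \omega \bmod 2\) (the parity of the roll).

```python
from itertools import chain, combinations

def powerset(s):
    """All subsets of s, as frozensets: a sigma-algebra on a finite set."""
    s = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))}

Omega = {1, 2, 3, 4, 5, 6}   # sample space: outcomes of a die roll
F = powerset(Omega)          # event space: the full power set
E = {0, 1}                   # target set: parity of the roll
E_sigma = powerset(E)        # sigma-algebra on E

X = lambda omega: omega % 2  # candidate random variable

# X is measurable iff every preimage X^{-1}(B) lies in F.
measurable = all(
    frozenset(w for w in Omega if X(w) in B) in F
    for B in E_sigma)
print(measurable)            # True: X is a valid random variable
```

Since \(\gF\) here is the full power set, *every* function out of \(\Omega\) is measurable; the check only becomes restrictive when \(\gF\) is a coarser sigma-algebra.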

On terminology, we often say that an *\(E\)-valued random variable* is a
measurable function \(X: \Omega \to E\).

**Example.** If \(E = \R\) and \(\gE = \gB(\R)\), a *real-valued random variable*
(sometimes an *\(\R\)-valued random variable*) assigns a real number to each \(\omega \in \Omega\). More generally, we might take \(E = \R^{n}\) and \(\gE = \gB(\R^{n})\), where the random variable now maps elements of the sample space
to an \(n\)-dimensional Euclidean vector.

**Definition** (Probability distribution / law)**.** Suppose we have a
probability space \((\Omega, \gF, \P)\) allowing us to assign probabilities to
events in \(\gF\). Suppose that we also have another measurable space \((E, \gE)\)
for which we don’t yet have a probability measure to measure events in \(\gE\). A
*probability distribution* (also a *law* or even the *pushforward* of a
measure, denoted \(\mu_{X} = \P \circ X^{-1}\) or \(X_{\#}\P\), respectively) is an
image measure: the composition of the inverse of the measurable function
defining the random variable \(X\) with the probability measure \(\P\). It is
effectively a mapping from sets in \(\gE\) into \([0, 1]\):

\[X_{\#}\P = (\P \circ X^{-1}): \gE \to [0, 1]. \]

The key idea is that, given some set (read *event*) \(B \in \gE\), we can use a
random variable \(X\) to find the pre-image of this set in the event space \(\gF\),
and then measure that pre-image via the probability measure \(\P\) defined on
\((\Omega, \gF)\). The existence of the random variable \(X: \Omega \to E\) thus
allows us to measure events \(B \in \gE\) that lie outside the original probability space.

Using more probabilistic notation, we would write

\[\P(X \in A) = \mu_{X}(A) = \P(X^{-1}(A)) = \P(\{\omega \in \Omega\ :\ X(\omega) \in A\}). \]
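To make the pushforward concrete, here is a minimal discrete sketch in Python; the uniform die-roll probability space and parity map are illustrative assumptions, not from the text:

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}                 # sample space: a fair die
P = lambda A: Fraction(len(A), len(Omega)) # uniform probability measure
X = lambda omega: omega % 2                # parity random variable

def pushforward(B):
    """mu_X(B) = P(X^{-1}(B)) = P({omega : X(omega) in B})."""
    return P({omega for omega in Omega if X(omega) in B})

print(pushforward({1}))   # P(roll is odd) = 3/6 = 1/2
```

The event \(\{1\}\) lives in \(\gE\), yet we measure it by pulling it back to \(\{1, 3, 5\} \in \gF\) and applying \(\P\), exactly as in the equation above.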

The above works well for countable \(E\) (that is, discrete random variables). For uncountable \(E\), such as \(E = \R\), \(\mu_{X}\) is determined by its values on the intervals \((-\infty, x]\). We set

\[F_{X}(x) = \mu_{X}\big((-\infty, x]\big) = \P(X \le x), \]

where \(F_{X}(\cdot)\) is the *distribution function* of the random variable \(X\).

One common example of this is the measurable space \((E, \gE) = \big(\R^{n}, \gB(\R^{n})\big)\), where we have

\[X_{\#}\P:\gB(\R^{n}) \to [0, 1]. \]

**Definition** (Probability density function)**.** Consider the measurable
space \(\big(\R^{n}, \gB(\R^{n})\big)\), equipped with the usual Lebesgue
measure \(\lambda^{n}\). The Lebesgue measure assigns a notion of ‘size’ to each
set \(A \in \gB(\R^{n})\); we would instead like to measure sets according to how
likely they are. To do this, we assign a density \(f\) to every point in
\(\R^{n}\), and to compute the measure of \(A\) according to \(X_{\#}\P\), we
simply integrate this density over \(A\) with respect to the Lebesgue measure:

\[X_{\#}\P(A) = \int_{A}dX_{\#}\P = \int_{A}fd\lambda^{n}\ \ \forall A \in \gB(\R^{n}). \]
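As a numerical sanity check of this identity, the sketch below (an assumed example, using the standard normal density on \(\R\)) approximates \(\int_{A} f\, d\lambda\) over \(A = [-1, 1]\) with a Riemann sum, and compares the result against the exact value obtained from the normal distribution function:

```python
import math

# Standard normal density f: an assumed example of a density with
# respect to the Lebesgue measure on R.
f = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def pushforward(a, b, n=100_000):
    """Approximate (X_# P)([a, b]) = integral of f over [a, b]
    via a midpoint Riemann sum with n subintervals."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Exact value via the normal distribution function F_X = Phi.
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

approx = pushforward(-1.0, 1.0)
exact = Phi(1.0) - Phi(-1.0)
print(abs(approx - exact) < 1e-6)   # the integral matches the CDF
```

The agreement reflects \(\mu_{X}\big((a, b]\big) = F_{X}(b) - F_{X}(a)\): integrating the density over a set recovers the pushforward measure of that set.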