One does not simply get GPs in one sitting. This is particularly true if you are an engineer or computer scientist and haven’t taken a course on stochastic processes or anything of that nature.
My goal with this second introduction to GPs is to provide a self-contained treatment for the reader who’s sat through a first course or lecture on the subject, yet still feels like they’re missing the full picture. (As a first introduction, this may make for slightly tedious reading: too thin on the motivation and too thick on the details. For more easy-going introductions, see this talk from David MacKay, this one from Rich Turner, or the introductory section of the GPML book.)
There are a few essential preliminaries to be familiar with, without which the going can be rather tough. (I made this mistake when first learning about GPs: I was able to apply the formulae in a rote manner and code up the model, yet I didn’t feel I understood why the equations worked.) These include several Gaussian identities and the definition of a stochastic process, among a few others. Happily, these preliminaries aren’t difficult to grasp, and we will cover them as they come up. (If you’d like to cover them preemptively, then begin by reading my previous article on the Gaussian distribution, followed by this article’s Appendix covering the measure theory.) With a good hold on these key ideas, the rest of the GP framework follows almost effortlessly.
Gaussian Processes
The terminology of measure theory proves rather valuable for succinctly expressing the key ideas behind GPs. (Introductions to GPs often avoid it for the sake of accessibility, choosing analogies to parametric function approximation and other visualisations instead. While intuitive and accessible, these can lack the precision that a little basic measure theory affords.)
If the words probability space or Borel sigma algebra don’t mean very much to you, then take a look through the Appendix before proceeding, in which I give a quick back-alley treatment of the subject, covering just the bits you need to know.
To establish notation, we begin with the definition of a stochastic process.
Stochastic Processes
Definition (Stochastic process). Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, an arbitrary index set $\mathcal{X}$, and a measurable target space $(E, \mathcal{E})$. A stochastic process is a collection of $E$-valued random variables $\{f_x : \Omega \to E\}_{x \in \mathcal{X}}$, all defined on the same probability space. (Note: to avoid unnecessary abstraction at this point, you may wish to substitute real-valued random variables; that is, take $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.)

Let $f$ denote the whole collection $\{f_x\}_{x \in \mathcal{X}}$. Each $f_x$ is a function of the outcome $\omega \in \Omega$, so we may equally view $f$ as a function of two arguments, $f(x, \omega)$. In this sense, the fact that $f$ depends on both an index and an outcome means we can view the process as a random function.

To do this, we fix some $\omega \in \Omega$ and consider the deterministic map $x \mapsto f(x, \omega)$, where by selecting some index $x \in \mathcal{X}$ we read off a single function value $f_x(\omega) \in E$.

A stochastic process is characterised by the joint probability distribution of arbitrary finite subsets of its random variables.
More formally, we refer to these as finite-dimensional distributions.
Definition (Finite-dimensional distribution). Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a stochastic process $\{f_x\}_{x \in \mathcal{X}}$ defined on it. For any finite collection of indices $x_1, \ldots, x_N \in \mathcal{X}$, the finite-dimensional distribution (FDD) of the process at these indices is the joint distribution of the random vector

$$\mathbf{f} = (f_{x_1}, \ldots, f_{x_N})^\top,$$

where $N \in \mathbb{N}$ is arbitrary.
The Gaussian process
Example (Gaussian process). A Gaussian process is a stochastic process, whose finite-dimensional distributions are multivariate Gaussian.
That is, the FDDs of the Gaussian process are

$$p(f_{x_1}, \ldots, f_{x_N}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\mu} \in \mathbb{R}^N$ and $\boldsymbol{\Sigma} \in \mathbb{R}^{N \times N}$ are the mean vector and covariance matrix for the chosen indices $x_1, \ldots, x_N \in \mathcal{X}$, for any $N \in \mathbb{N}$.
While the above is sufficiently abstract to entertain a variety of Gaussian processes (e.g. over indices drawn from arbitrary sets $\mathcal{X}$), in this article we make the following simplifying assumptions:

- We consider real-valued random variables, that is, $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, in which case each random variable indexed by some $x \in \mathcal{X}$ maps an element of the sample space to the reals: $f_x : \Omega \to \mathbb{R}$.
- The collection of random variables that makes up the stochastic process can be denoted $\{f_x\}_{x \in \mathcal{X}}$, where in GP regression we often have $\mathcal{X} = \mathbb{R}^D$, for $D \in \mathbb{N}$. In this case, since $\mathbb{R}^D$ is uncountable, so is this collection of random variables, and we call the stochastic process a continuous stochastic process.
Modelling Functions with GPs
For the purposes of modelling functions with Gaussian processes (or any stochastic process, for that matter), we interpret the index set $\mathcal{X}$ as the input domain of the function we wish to model: each index $x \in \mathcal{X}$ is an input location, and the random variable $f_x$ represents the unknown function’s value at that input.
For instance, for time-series prediction the indices may be a set of time values we have in a training dataset, as well as time values we’d like to make predictions at. For Bayesian optimisation of an ML model’s hyperparameters, the indices may be the set of hyperparameters we’ve already tried, as well as the ones we’re interested in trying next. For plotting a 2d function on a graph, the indices are the real values along the x-axis that we’d like to plot the function at.
More generally, the indices are the vector-valued covariates for our prediction problem (it is easiest to think about regression for now, although we will see later how GPs can be used for classification by changing the likelihood) for which we have training observations (note that we needn’t have observed any data to use a GP; indeed, Bayesian methods shine a priori, when we have observed very few data points), as well as those for which we’d like to make predictions (obtain a predictive distribution).
Single Sampled Function
To see how the GP corresponds to a collection of random variables at the points of interest (i.e. train and test points), which are in some sense consistent with one another, we will build up to this one index at a time, working with a simple 1d function $f : \mathbb{R} \to \mathbb{R}$.
Note that in this section we are only interested in seeing how GPs can be used to represent functions. We leave the form of the mean and covariance of these Gaussians unspecified until the section on defining the GP prior.
Picking an initial index value $x_1$, we index a single random variable $f_{x_1}$ from the process. Having already fixed an element of the sample space, $\omega \in \Omega$, this gives us a single deterministic function value $f_{x_1}(\omega)$, which we can plot at the input location $x_1$.
Adding a second index $x_2$, for some $x_2 \neq x_1$, we index a second random variable $f_{x_2}$. Once again, having already fixed $\omega$, we obtain a second function value $f_{x_2}(\omega)$, consistent with the first since both come from the same underlying outcome $\omega$.
This does not offer us a very high ‘resolution’ for plotting our function.
Indexing using 100 additional points, spaced evenly along the input axis, begins to trace out the full function.
Hence indexing results in a Gaussian random vector, and sampling from this gives us consistent realisations of the random function evaluated at the indices / inputs.
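As a concrete sketch of this in NumPy (not the article’s own code; a squared exponential kernel with unit hyperparameters and a zero mean are assumed here, and the kernel choice is discussed later):

```python
import numpy as np

def sq_exp(xa, xb, lengthscale=1.0, variance=1.0):
    """Assumed kernel: squared exponential, k = sigma^2 exp(-r^2 / (2 l^2))."""
    sq_dists = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Index the process at 100 input points: the FDD is a 100-dimensional Gaussian.
xs = np.linspace(-3.0, 3.0, 100)
mean = np.zeros_like(xs)                         # zero-mean prior
cov = sq_exp(xs, xs) + 1e-10 * np.eye(len(xs))   # small jitter for stability

# Fixing the seed plays the role of fixing omega in the sample space:
# we obtain one consistent realisation of the function at all 100 indices.
rng = np.random.default_rng(0)
f_sample = rng.multivariate_normal(mean, cov)
print(f_sample.shape)  # (100,)
```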
Distributions over Functions
If we now refrain from picking a single element of the sample space $\omega \in \Omega$, then each index gives us a genuine random variable rather than a single realisation, and the process describes a distribution over functions.
Working up to this once more from a single index, the univariate Gaussian FDD at $x_1$, $p(f_{x_1})$, can be visualised as a vertical Gaussian density placed at the input location $x_1$.
Continuing as we did before by adding a second index to the index set, we obtain a bivariate Gaussian FDD over $(f_{x_1}, f_{x_2})$, whose covariance term couples the two function values.
Continuing in this way by adding more and more indices to the index set, we can draw out the full function lines for our five elements of the sample space, $\omega_1, \ldots, \omega_5$. (Note that when using GPs, we never actually do this unless we’re trying to visualise the GP. Instead, we only need to add the training points and test points to the index set.)
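In code, drawing these five function lines amounts to sampling five vectors from the same FDD. A sketch (with an assumed squared exponential kernel; each row of the result stands in for one element $\omega_i$ of the sample space):

```python
import numpy as np

def sq_exp(xa, xb, lengthscale=1.0, variance=1.0):
    """Assumed kernel: squared exponential."""
    sq_dists = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

xs = np.linspace(-3.0, 3.0, 100)
K = sq_exp(xs, xs) + 1e-10 * np.eye(len(xs))

# Each row corresponds to one omega_i: five consistent functions drawn
# from the same 100-dimensional FDD.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=5)
print(samples.shape)  # (5, 100)
```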
You’ll notice that we stopped visualising the distribution over functions at
the various indices using vertical Gaussian densities at each input location,
since this quickly becomes messy. Visualising the distribution by drawing a
bunch of functions (as we have done above, using 5 samples) is equally messy.
Rather, to represent the broadness of these vertical Gaussians at various inputs and communicate our uncertainty about the function values at those points, it is common to plot the marginal mean along with a shaded region extending (usually) two standard deviations either side of it.
For the GP prior that we have been sampling from in the plots above (i.e. the GP without conditioning on any training data), this is usually fairly boring: a constant mean and a constant standard deviation (uncertainty) at all points, corresponding to how we have defined the prior (see next section).
Note that I am still (albeit faintly) plotting samples drawn from the process for illustration. The opacity of a function line is proportional to its likelihood under our GP prior; those that deviate from the mean are fainter than those that lie close to the mean.
Once we have conditioned on training data however, we may observe varying degrees of uncertainty in our random function at different points away from the data.
In general, to define a Gaussian it suffices to define its first two moments: the mean and covariance. For any finite set of indices, then, all we need is a way of obtaining the corresponding mean vector and covariance matrix of the FDD.
Defining the GP Prior
When we place a Gaussian process prior on a function that we’re interested in modelling, we define a way of obtaining the first two moments of the FDD; in other words, we define a mean function $\mu(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$ which, evaluated at any finite set of indices, return the mean vector and covariance matrix of the corresponding FDD.
This is convenient since we often don’t know which indices we will be interested in ahead of time; with the mean and covariance functions in hand, we can compute the FDD’s parameters for any finite index set on demand. Supposing that our indices or covariates live in the space $\mathcal{X}$, the mean function is a map $\mu : \mathcal{X} \to \mathbb{R}$ and the kernel a map $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
The mean and covariance functions are the main levers we can pull to influence
the expression of our prior beliefs about the function of interest.
In most cases, using a zero-mean function, $\mu(x) = 0$, is a sensible default (we can always centre our data), leaving the kernel to express our prior beliefs about the function.
Some popular kernels include the following, where $r = \lVert x - x' \rVert$ denotes the distance between two inputs, $\ell$ a lengthscale parameter, $\sigma^2$ a variance parameter and $\alpha > 0$ a shape parameter for the Rational Quadratic:

Kernel | $k(x, x')$ |
---|---|
Squared Exponential | $\sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$ |
Rational Quadratic | $\sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}$ |
Matérn 3/2 | $\sigma^2 \left(1 + \frac{\sqrt{3}\,r}{\ell}\right)\exp\left(-\frac{\sqrt{3}\,r}{\ell}\right)$ |
Matérn 5/2 | $\sigma^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp\left(-\frac{\sqrt{5}\,r}{\ell}\right)$ |
ArcCosine (Cho et al., 2009) | $\frac{\sigma^2}{\pi}\,\lVert x\rVert\,\lVert x'\rVert\left(\sin\theta + (\pi - \theta)\cos\theta\right)$ |
Note that for the ArcCosine kernel, we have $\theta = \cos^{-1}\left(\frac{x^\top x'}{\lVert x \rVert \, \lVert x' \rVert}\right)$, the angle between the two inputs.
In summary, we define a Gaussian process prior, written $f \sim \mathcal{GP}(\mu, k)$, by defining the functions $\mu(\cdot)$ and $k(\cdot, \cdot)$.
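As a sketch of how a couple of the kernels in the table might be implemented (hypothetical helper names; unit hyperparameters assumed), note that any valid kernel must produce a symmetric, positive semi-definite covariance (Gram) matrix:

```python
import numpy as np

def sq_exp(r, ell=1.0, var=1.0):
    """Squared Exponential: sigma^2 * exp(-r^2 / (2 l^2))."""
    return var * np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, ell=1.0, var=1.0):
    """Matern 3/2: sigma^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    z = np.sqrt(3.0) * r / ell
    return var * (1.0 + z) * np.exp(-z)

def gram(kernel, xs):
    """Gram (covariance) matrix with K[i, j] = k(|x_i - x_j|)."""
    r = np.abs(xs[:, None] - xs[None, :])
    return kernel(r)

xs = np.linspace(-2.0, 2.0, 20)
K = gram(matern32, xs)

# Any valid kernel yields a symmetric, positive semi-definite matrix.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8
```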
Conditioning on Data
Defining a GP prior and drawing samples from it is not usually very useful in itself: rather we seek to make predictions about new inputs by updating or conditioning our beliefs about the latent function values on observed (training) data points. In this section, we will see how we can obtain the updated parameters of the FDD and parametrise the posterior predictive / posterior process; as was done to produce the last plot in the previous section.
For clarity of exposition, we will first assume a noiseless regression setup. (These come up when modelling the output of a computer simulation, or anything for which we have extremely precise measurements.)
Consider a training set of $N$ input–output pairs $\{(x_n, f_n)\}_{n=1}^N$, with the inputs collected into $\mathbf{X}$ and the function values into $\mathbf{f} \in \mathbb{R}^N$, along with $M$ test inputs $\mathbf{X}_*$ at which we’d like to make predictions.

With these input points defined, we can compute the GP’s FDD at the concatenated train and test inputs:

$$p\left(\begin{bmatrix}\mathbf{f} \\ \mathbf{f}_*\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_{\mathbf{X}} \\ \boldsymbol{\mu}_{*}\end{bmatrix}, \begin{bmatrix}\mathbf{K}_{\mathbf{X}\mathbf{X}} & \mathbf{K}_{\mathbf{X}*} \\ \mathbf{K}_{*\mathbf{X}} & \mathbf{K}_{**}\end{bmatrix}\right),$$

where $[\boldsymbol{\mu}_{\mathbf{X}}]_n = \mu(x_n)$, $[\mathbf{K}_{\mathbf{X}\mathbf{X}}]_{nm} = k(x_n, x_m)$, and the remaining blocks are defined analogously for the test inputs.
Given this joint model, we now note that while we know the function values $\mathbf{f}$ at the training inputs $\mathbf{X}$, the function values $\mathbf{f}_*$ at the test inputs remain unknown.
To make predictions, we therefore seek the predictive distribution of the function values at the test points, conditioned on the observed training dataset: $p(\mathbf{f}_* \mid \mathbf{f})$.
Owing to the convenient properties of the Gaussian, this Gaussian conditional distribution has a closed-form solution, which I give a full derivation of in this section of my previous article on the Gaussian distribution. If you would find it satisfying to know how GPs work end-to-end, I encourage you to read through the proof. Otherwise, I have provided the relevant results below which you may use ad-hoc:
Conditional of a Joint Gaussian Distribution
For a partitioned Gaussian joint

$$p\left(\begin{bmatrix}\mathbf{a} \\ \mathbf{b}\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b\end{bmatrix}, \begin{bmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{bmatrix}\right),$$

the conditional is

$$p(\mathbf{a} \mid \mathbf{b}) = \mathcal{N}\left(\boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{b} - \boldsymbol{\mu}_b),\; \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right),$$

and, necessarily, $\boldsymbol{\Sigma}_{ba} = \boldsymbol{\Sigma}_{ab}^\top$.
For other related results, Petersen et al. (2012) provides an excellent reference.
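These identities are easy to sanity-check numerically. The sketch below (illustrative only; a random SPD covariance) verifies the conditional covariance against the known identity that it equals the inverse of the corresponding block of the precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)   # a random SPD covariance

# Partition into a = first two dims, b = last two dims.
Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]

# Conditional covariance from the identity above (a Schur complement).
cond_cov = Saa - Sab @ np.linalg.solve(Sbb, Sba)

# It should equal the inverse of the top-left block of the precision.
Lam = np.linalg.inv(Sigma)
assert np.allclose(cond_cov, np.linalg.inv(Lam[:2, :2]))
```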
Hence the Gaussian random vector $\mathbf{f}_*$, conditioned on the training function values, is distributed as

$$p(\mathbf{f}_* \mid \mathbf{f}) = \mathcal{N}\left(\boldsymbol{\mu}_* + \mathbf{K}_{*\mathbf{X}} \mathbf{K}_{\mathbf{X}\mathbf{X}}^{-1}(\mathbf{f} - \boldsymbol{\mu}_{\mathbf{X}}),\; \mathbf{K}_{**} - \mathbf{K}_{*\mathbf{X}} \mathbf{K}_{\mathbf{X}\mathbf{X}}^{-1} \mathbf{K}_{\mathbf{X}*}\right),$$

where we have applied the conditional Gaussian identity above with $\mathbf{a} = \mathbf{f}_*$ and $\mathbf{b} = \mathbf{f}$, and where $\mathbf{K}_{\mathbf{X}\mathbf{X}}$ is the train–train kernel matrix, $\mathbf{K}_{*\mathbf{X}}$ the test–train cross-covariance, and so on.
These are the equations that can be used (I refrain from saying ‘that were used’, since in practice these equations prove to be numerically unstable; I discuss some methods for practical inversion of kernel matrices in this section of another article on Bayesian linear regression) to plot the predictive GP conditioned on some training data:
Notice how the mean of the posterior perfectly interpolates the training data, and the predictive uncertainty ‘pinches’ around the training points. This is an artefact of the noiseless modelling assumption. In most cases when modelling observed phenomena this is unrealistic, and this may further have undesired effects on the predicted function. For instance if two similar inputs have differing observations (e.g. due to corrupting observation noise on a sensor), then the predicted function (distribution) will oscillate wildly around that point to match the data:
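A minimal sketch of these noiseless posterior equations (not the article’s own code; toy data, a zero-mean prior and a unit squared exponential kernel are assumed, and a Cholesky factorisation stands in for the explicit inverse for numerical stability):

```python
import numpy as np

def sq_exp(xa, xb, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

def gp_posterior(X, f, Xs, jitter=1e-10):
    """Noiseless GP posterior mean and covariance at test inputs Xs."""
    K = sq_exp(X, X) + jitter * np.eye(len(X))   # K_XX (+ jitter)
    Ks = sq_exp(X, Xs)                           # K_X*
    Kss = sq_exp(Xs, Xs)                         # K_**
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # = K_XX^{-1} f
    V = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha            # zero-mean prior assumed
    cov = Kss - V.T @ V
    return mean, cov

X = np.array([-2.0, 0.0, 1.5])     # toy training inputs
f = np.sin(X)                      # noiseless observations
mu_star, cov_star = gp_posterior(X, f, np.linspace(-3, 3, 50))

# The posterior mean interpolates the data, and the variance 'pinches'
# to (near) zero at the training inputs.
mu_tr, cov_tr = gp_posterior(X, f, X)
```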
Noise-Corrupted Observations
To rectify the above, we now assume that instead of observing noiseless function values $f_n = f(x_n)$, we observe noise-corrupted versions $y_n = f(x_n) + \epsilon_n$, where the $\epsilon_n \sim \mathcal{N}(0, \sigma_n^2)$ are i.i.d. Gaussian.
Note that this unimodal, Gaussian noise modelling assumption is required for the exact GP setting to remain tractable; in a subsequent article, however, we will see how we can relax this constraint and use different likelihoods.
Due to our assumption that the noise has zero mean (i.e. the observations are centred at the latent function values $f(x_n)$), the joint mean is unchanged; the noise merely adds a diagonal term $\sigma_n^2 \mathbf{I}$ to the covariance of the observed values.
Hence, our model for the joint density of our observations (which we now denote $\mathbf{y}$) and the test function values becomes

$$p\left(\begin{bmatrix}\mathbf{y} \\ \mathbf{f}_*\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_{\mathbf{X}} \\ \boldsymbol{\mu}_{*}\end{bmatrix}, \begin{bmatrix}\mathbf{K}_{\mathbf{X}\mathbf{X}} + \sigma_n^2 \mathbf{I} & \mathbf{K}_{\mathbf{X}*} \\ \mathbf{K}_{*\mathbf{X}} & \mathbf{K}_{**}\end{bmatrix}\right).$$

Conditioning on the observations $\mathbf{y}$ as before gives the predictive distribution

$$p(\mathbf{f}_* \mid \mathbf{y}) = \mathcal{N}\left(\boldsymbol{\mu}_* + \mathbf{K}_{*\mathbf{X}} (\mathbf{K}_{\mathbf{X}\mathbf{X}} + \sigma_n^2 \mathbf{I})^{-1}(\mathbf{y} - \boldsymbol{\mu}_{\mathbf{X}}),\; \mathbf{K}_{**} - \mathbf{K}_{*\mathbf{X}} (\mathbf{K}_{\mathbf{X}\mathbf{X}} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_{\mathbf{X}*}\right).$$
Setting the noise level $\sigma_n^2$ to a small positive value, the predictive mean no longer interpolates the observations exactly; instead it smooths through them, retaining non-zero predictive uncertainty at the training points:
Due to this behaviour, these simple, single-layer GPs are often referred to as smoothing devices.
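A sketch of the noisy predictive equations (illustrative, made-up data; zero-mean prior and unit squared exponential hyperparameters assumed). The only change from the noiseless case is the $\sigma_n^2 \mathbf{I}$ term added to the train–train block, and it reproduces the smoothing behaviour described above when two nearby inputs carry conflicting observations:

```python
import numpy as np

def sq_exp(xa, xb, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

def gp_posterior_noisy(X, y, Xs, noise_var=0.1):
    """GP predictive mean/cov under Gaussian observation noise."""
    K = sq_exp(X, X) + noise_var * np.eye(len(X))   # K_XX + sigma_n^2 I
    Ks = sq_exp(X, Xs)
    Kss = sq_exp(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - V.T @ V

# Two nearby inputs with conflicting observations: rather than
# oscillating wildly to interpolate both, the noisy model smooths.
X = np.array([0.0, 0.05])
y = np.array([1.0, -1.0])
mean, cov = gp_posterior_noisy(X, y, np.array([0.025]))
```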
Appendix A: The Measure Theory You Need To Know (and no more)
This Appendix collects the bits of basic measure theory which are useful for understanding GPs. It reads like a list of definitions, which you should aim to commit to memory.
Definition (Sample space). A sample space $\Omega$ is the set of all possible outcomes of a random experiment.
Example. For a single toss of a coin, the sample space is $\Omega = \{H, T\}$.
Sigma Algebras
Definition (Sigma-algebra, event space). These are the same thing, with the latter being a probabilistic interpretation of the former. A $\sigma$-algebra $\mathcal{F}$ on a set $\Omega$ is a collection of subsets of $\Omega$ satisfying:

- $\Omega \in \mathcal{F}$; the sample space itself is an event (the sure event)
- if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$; it is closed under complementation (the complement of an event is an event)
- if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$; it is closed under countable unions (the union of countably many events is an event)
If a measurable space $(\Omega, \mathcal{F})$ is used to model a random experiment, we call the elements of $\mathcal{F}$ events.
Definition (Generator of a $\sigma$-algebra). Given a collection $\mathcal{A}$ of subsets of $\Omega$, the $\sigma$-algebra generated by $\mathcal{A}$, written $\sigma(\mathcal{A})$, is the smallest $\sigma$-algebra on $\Omega$ containing $\mathcal{A}$. Put otherwise, $\sigma(\mathcal{A})$ is the intersection of all $\sigma$-algebras on $\Omega$ that contain $\mathcal{A}$.
Definition (Borel $\sigma$-algebra). The Borel $\sigma$-algebra on $\mathbb{R}$, written $\mathcal{B}(\mathbb{R})$, is the $\sigma$-algebra generated by the open subsets of $\mathbb{R}$. We can perhaps more intuitively, and entirely equivalently, define this as the $\sigma$-algebra generated by the open intervals $(a, b)$, for $a < b$.
Measures
Definition (Measurable space). A pair $(\Omega, \mathcal{F})$, consisting of a set $\Omega$ and a $\sigma$-algebra $\mathcal{F}$ on it, is called a measurable space.
Definition (Measure). A measure on a measurable space $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, \infty]$ such that:

- $\mu(\emptyset) = 0$, and
- it satisfies countable additivity: for any disjoint sequence $A_1, A_2, \ldots$ in $\mathcal{F}$, we have $\mu\big(\bigcup_{i=1}^{\infty} A_i\big) = \sum_{i=1}^{\infty} \mu(A_i)$.
Example (Mass function). Let $\Omega$ be a countable set, and assign to each $\omega \in \Omega$ a mass $m(\omega) \in [0, \infty]$. Then $\mu(A) = \sum_{\omega \in A} m(\omega)$ defines a measure on $(\Omega, 2^{\Omega})$. In particular, if we put $m(\omega) = 1$ for every $\omega$, we obtain the counting measure, $\mu(A) = |A|$.
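As a quick illustration (using a made-up biased coin), a mass function on a countable sample space induces a measure by summation:

```python
from fractions import Fraction

# Masses for a hypothetical biased coin; mu(A) = sum of m(w) over w in A.
m = {"H": Fraction(2, 3), "T": Fraction(1, 3)}

def mu(event):
    """Measure of an event (a subset of the sample space)."""
    return sum(m[w] for w in event)

# Finite additivity over disjoint events, and total mass one: this
# particular measure is in fact a probability measure.
assert mu({"H", "T"}) == mu({"H"}) + mu({"T"}) == 1
assert mu(set()) == 0
```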
Definition (Probability measure). Let $(\Omega, \mathcal{F})$ be a measurable space. A probability measure on it is a measure $\mathbb{P}$ with total mass one: $\mathbb{P}(\Omega) = 1$.
Definition (Lebesgue measure). The Lebesgue measure is the unique Borel measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ assigning each interval its length: $\lambda([a, b]) = b - a$ for all $a \le b$. We often denote the Lebesgue measure as $\lambda$ (or $dx$ when it appears inside an integral).
Definition (Measure space). A measure space is a triple $(\Omega, \mathcal{F}, \mu)$, where $(\Omega, \mathcal{F})$ is a measurable space and $\mu$ is a measure on it. Notice how the measure takes elements of $\mathcal{F}$ (events), not elements of $\Omega$ (individual outcomes), as its arguments.
Probability
Definition (Probability space). A probability space is a triple $(\Omega, \mathcal{F}, \mathbb{P})$ consisting of:

- the sample space $\Omega$,
- an event space $\mathcal{F}$, which is a $\sigma$-algebra on $\Omega$, and
- a probability measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$.
Definition (Random variable). Consider two measurable spaces: the first is the pair of a sample space and event space, $(\Omega, \mathcal{F})$, and the second is a target space $(E, \mathcal{E})$. A random variable is a measurable function $X : \Omega \to E$; that is, one for which the preimage of every measurable set is an event: $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{E}$.
Intuitively, this is useful because the actual sample space and event space of our experiments, for instance $\Omega = \{H, T\}$ for a coin toss, may be awkward to work with directly; the random variable maps outcomes into a more convenient space, such as the reals, where we can do arithmetic.
On terminology, we often say that an $E$-valued random variable $X : \Omega \to E$ is ($\mathcal{F}/\mathcal{E}$-)measurable.
Example. If $\Omega = \{H, T\}$ for a single coin toss, we might define the real-valued random variable $X$ with $X(H) = 1$ and $X(T) = 0$.
Definition (Probability distribution / law). Suppose we have a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a random variable $X : \Omega \to E$ into a measurable space $(E, \mathcal{E})$. The probability distribution (or law) of $X$ is the pushforward measure $P_X = \mathbb{P} \circ X^{-1}$ on $(E, \mathcal{E})$.

The key idea is that, given some set (read: event) $B$ in $\mathcal{E}$, we measure the probability of the set of outcomes in $\Omega$ that $X$ maps into $B$: $P_X(B) = \mathbb{P}(X^{-1}(B))$.

Using more probabilistic notation, we would write $P_X(B) = \mathbb{P}(X \in B)$.
The above works well for countable sample spaces, where the law can be specified by the probability of each individual outcome. For uncountable spaces, where individual outcomes typically have probability zero, we instead specify the law through a density with respect to a reference measure. One common example of this is the measurable space $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with the Lebesgue measure as the reference measure.
Definition (Probability density function). Consider the measurable space $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ equipped with the Lebesgue measure $\lambda$. A probability density function for a law $P_X$ on this space is a non-negative function $p$ satisfying $P_X(B) = \int_B p(x) \, d\lambda(x)$ for every $B \in \mathcal{B}(\mathbb{R})$.