

The Venerable Gaussian

A gentle overview of some essential Gaussian identities and derivations.

July 16, 2023

London, UK


The Gaussian distribution underpins a broad range of methods in data analysis. This is often due to its remarkable analytical properties, which make working with it relatively easy.

To use it effectively, however, one often has to trudge through a fair bit of linear algebra and memorise some identities. This article aims to collect several of these in one place; it is intended as both an initial exposition and a reference.

We first introduce the univariate Gaussian in Section 1 and highlight some of its properties before generalising to the multivariate case in Section 1.2. We then derive the matrix inversion lemma and matrix determinant lemma in Section 2, which sets us up for a treatment of Gaussian joint distributions in Section 3 in which we derive the equations for the conditional of a joint Gaussian distribution. Finally, we conclude by looking at linear Gaussian systems in the multivariate case in Section 4.

Rather than giving the formula for the Gaussian distribution straight away, we will work up to it, pointing out some of its useful properties as we go along. For clarity, we’ll begin with the univariate case before generalising to the full Gaussian distribution.

At its core, a Gaussian function is the exponential of a negative quadratic:

\[\begin{equation}\label{eq:gaussian_function} f(x) = e^{-\frac{x^{2}}{2}}, \end{equation} \]

where we have divided by \(2\) for mathematical convenience later. Plotting this function gives us the familiar bell curve:

This function is symmetric around the peak, which lies at \(x=0\) in the expression above, and quickly tapers off for large \(\vert x \vert \gg 1\). In fact, the Gaussian is notable for how quickly it tapers off when far from the mode: the distribution is said to have light tails, meaning that there is not much support in the tails of the distribution far from the mean.

The expression above is not a valid density function however, since it is not normalised (it does not integrate to 1). In order to find the proper normalisation term, we must compute the Gaussian integral:

\[I_{1} \doteq \int_{-\infty}^{\infty} e^{-\frac{x^{2}}{2}}dx, \]

where the \(1\) in the subscript of \(I_{1}\) indicates that the corresponding distribution has unit variance—we will later denote more general Gaussian integrals for distributions with variance \(\sigma^{2}\) as \(I_{\sigma^{2}}\).

This integral is not trivial to compute, but there is a neat trick to do it: by squaring the integral, we can change the variables to polar coordinates \((x, y) = (r \cos \theta, r \sin \theta)\), which will result in two slightly simpler elementary integrals. (There are actually several approaches, with the approach using polar coordinates, due to Poisson, arguably being the most straightforward. For another, more geometric take on this integral, see this blog.)

We therefore square the integral, relabelling the integration variable in each of the two resulting factors as the dummy variables \(\alpha\) and \(\beta\), giving:

\[I^{2}_{1} = \left(\int_{-\infty}^{\infty}e^{-\frac{x^{2}}{2}}dx\right)^{2} = \int_{-\infty}^{\infty}e^{-\frac{\alpha^{2}}{2}}d\alpha \int_{-\infty}^{\infty}e^{-\frac{\beta^{2}}{2}}d\beta = \iint_{-\infty}^{\infty}e^{-\frac{1}{2}(\alpha^{2} + \beta^{2})}d\alpha d\beta. \]

We can now interpret the above using polar coordinates, and so substitute the integral measure \(d\alpha d\beta = rdrd\theta\) while also recalling that \(r = \sqrt{\alpha^{2} + \beta^{2}}\), which leaves us with two elementary integrals to compute:

\[\begin{align*} I^{2}_{1} = \iint_{-\infty}^{\infty}e^{-\frac{1}{2}(\alpha^{2} + \beta^{2})} d\alpha d\beta &= \int_{0}^{2\pi}d\theta \int^{\infty}_{0} r e^{-\frac{r^{2}}{2}}dr \\ &= 2\pi \int^{\infty}_{0} r e^{-\frac{r^{2}}{2}}dr = 2\pi \left[-e^{-\frac{r^{2}}{2}}\right]_{r=0}^{r=\infty} \\ &= 2\pi. \end{align*} \]

Remembering to take the square root again, the solution to the Gaussian integral comes to:

\[\begin{equation} \label{eq:std_gaussian_integral} I_{1} = \int^{\infty}_{-\infty}e^{-\frac{x^{2}}{2}}dx = \sqrt{2\pi}. \end{equation} \]
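If you would like to convince yourself of this numerically, the following minimal sketch does so, assuming NumPy and SciPy are available (the use of scipy.integrate.quad is just one convenient choice):

```python
import numpy as np
from scipy.integrate import quad

# Numerically integrate exp(-x^2 / 2) over the real line and
# compare against the closed-form result sqrt(2 * pi).
value, abs_err = quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)

print(value, np.sqrt(2 * np.pi))   # both approximately 2.5066
assert np.isclose(value, np.sqrt(2 * np.pi))
```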

This is the normalisation term that we need, in order to turn the Gaussian function of Equation \(\ref{eq:gaussian_function}\) into a Gaussian probability density function (PDF) with zero mean and unit variance. Hence, dividing Equation \(\ref{eq:gaussian_function}\) by \(\sqrt{2\pi}\), we get the (univariate) standard normal distribution:

\[\begin{equation} \label{eq:univariate_standard_gaussian} \gN(\ervx; 0, 1) \doteq \frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}. \end{equation} \]

To generalise this to Gaussian distributions with different positive variances \(\sigma^{2} \gt 0\), we can divide the \(x^{2}\) term in the Gaussian function by \(\sigma^{2}\), giving \(f_{\sigma^{2}}(x) = \exp(-\frac{x^{2}}{2\sigma^{2}})\). However, we now need to re-calculate the normalisation term. Fortunately, since the variance does not depend on the integration variable \(x\), we can factor it out of the integral by making the \(u\)-substitution \(u = \frac{x}{\sqrt{\sigma^{2}}}\), and then make use of our result from Equation \(\ref{eq:std_gaussian_integral}\):

\[I_{\sigma^{2}} \doteq \int_{-\infty}^{\infty}e^{-\frac{x^{2}}{2\sigma^{2}}}dx = \sqrt{\sigma^{2}}\int_{-\infty}^{\infty}e^{-\frac{u^{2}}{2}}du = \sqrt{2\pi \sigma^{2}}. \]

Dividing the Gaussian function by this new normalisation term, the more general Gaussian density function with variance \(\sigma^{2}\) is therefore (it is conventional to denote the variance as \(\sigma^{2}\), which lets us easily recover the standard deviation or ‘scale’ of \(x\) as \(\sigma\))

\[\gN(\ervx; 0, \sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{x^{2}}{2\sigma^{2}}}. \]

Plotting this new Gaussian density function, the bell curve is still symmetric around \(x=0\); however, \(\sigma^{2}\) now characterises its breadth, and the density tapers off for large \(\vert x \vert \gg \sigma\).

Generalising this further, we can translate the mean of the distribution (intuitively: shift the centre of the bell) by \(\mu \in \R\), which gives the full definition of a univariate Gaussian density:

Definition (Univariate Gaussian). The univariate Gaussian distribution is defined as

\[\begin{equation} \label{eq:univariate_gaussian} \gN(\ervx; \mu, \sigma^{2}) \doteq \frac{1}{\sqrt{2\pi}\sigma}\exp \left(-\frac{(x - \mu)^2}{2\sigma^{2}}\right), \end{equation} \]

where \(\mu\in\R\) is the mean of the distribution of \(\ervx\) (and coincidentally also the mode of the distribution), \(\sigma^{2} \in \R\) is the variance of \(\ervx\), and \(\sigma\) is the standard deviation of \(\ervx\).
\(\blacksquare\)

The fact that the centre value \(\mu\) is the mean of the distribution can be seen by taking the expectation of the above:

\[\begin{align*} \E[\ervx] \doteq \int_{-\infty}^{\infty} \gN(x; \mu, \sigma^{2}) x\ dx &= \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty}e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}x\ dx \\ &= \frac{1}{I_{\sigma^{2}}}\int_{-\infty}^{\infty}e^{-\frac{\omega^{2}}{2\sigma^{2}}}(\mu + \omega)d\omega \\ &= \frac{\mu I_{\sigma^{2}}}{I_{\sigma^{2}}} + \frac{1}{I_{\sigma^{2}}}\int_{-\infty}^{\infty}\left(e^{-\frac{\omega^{2}}{2\sigma^{2}}}\omega\right)d\omega \\ &= \mu, \end{align*} \]

where on the second line above we made the substitution \(\omega \doteq x - \mu\). On the last line we noticed that the integrand of the second term is odd with respect to the sign flip of the integration variable \(\omega \leftrightarrow -\omega\), and hence it integrates to 0. In other words, the negative bits, when \(\omega < 0\), cancel out the positive bits, when \(\omega > 0\).

As some final basic properties of the univariate Gaussian, we note that the variance (of any distribution, not just the Gaussian) is the expected squared distance of samples drawn from the distribution to its mean, \(\mu\). That is, for samples \(\{\ervx_{i}\}_{i=1}^{N} \sim X = \gN(\mu, \sigma^2)\), we have

\[\begin{align*} \Var(X) &= \E\left[(X - \mu)^2\right] = \E[X^2] - 2\mu\E[X] + \mu^2 \\ &= \E[X^2] - \mu^2 \\ &\approx \frac{1}{N}\sum_{i=1}^{N} \ervx_{i}^2 - \mu^{2}, \end{align*} \]

where the first line follows from the linearity of expectation, and the second from the definition that \(\E[X] = \mu\).
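These moments are easy to sanity-check by sampling; here is a small sketch, assuming NumPy and using arbitrarily chosen example values for \(\mu\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0          # arbitrary example parameters
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

# The empirical moments should approach mu and sigma^2 as N grows.
print(x.mean())                      # ~1.5
print(np.mean(x**2) - x.mean()**2)   # ~4.0, i.e. sigma^2
```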

As alluded to above when mentioning how quickly the Gaussian function tapers off for values far from the mean, the Gaussian distribution is particularly unsuitable for modelling populations drawn from heavy-tailed distributions, in which some events occur very far from the mean of the distribution. To see this, consider the following diagram, which shows the probability mass placed in each consecutive one-standard-deviation ‘bin’ away from the mean:

Only 0.27% of the probability mass lies over three standard deviations away from the mean of the Gaussian distribution!

Finally, it is worth pointing out that we have symmetry in the random variable \(\ervx\) and the mean \(\mu\):

\[\gN(\ervx; \mu, \sigma^{2}) = \gN(\mu; \ervx, \sigma^{2}). \]

Here we will take a brief look at a simple case of a univariate linear Gaussian system to gain some intuition about working with Gaussians. For the multivariate case, see Section 4.

We might use a linear Gaussian system when we have a parameter of interest, \(\ervx\), which is possibly hidden or latent. We are interested in modelling the value of this parameter, but we can only make noisy observations of another value \(\ervy\), which we hypothesise to be linearly related to \(\ervx\). We can formalise this as follows:

Definition (Univariate linear Gaussian system). A linear Gaussian system arises when a Gaussian conditional distribution is centred at (some linear function of) another Gaussian distributed random variable. For clarity of exposition, in this section we will use an identity function; see Section 4 for the more general case. Thus, we have

\[\begin{align*} \label{eq:univariate_linear_gaussian_sys} p(\ervx) &= \gN(\ervx; \mu, \sigma^2) \\ p(\ervy \vert \ervx) &= \gN(\ervy; \ervx, \nu^2). \end{align*} \]

Having observed \(\ervy\), we are usually interested in inferring \(p(\ervx \vert \ervy)\); that is, applying Bayes’ rule to find the posterior. The fact that the Gaussian remains closed under multiplication (products of exponentiated quadratics remain exponentiated quadratics; we will derive a more general expression for the product of two Gaussian distributions later) allows us to deduce the following:

\[\begin{align} \label{eq:gaussian_posterior} p(\ervx \vert \ervy) &= \frac{p(\ervx)p(\ervy \vert \ervx)}{\int p(\ervx) p(\ervy \vert \ervx) d\ervx} = \frac{p(\ervx, \ervy)}{p(\ervy)} \\ &= \gN(\ervx; m, s^2), \end{align} \]

where

\[\begin{align*} s^2 &\doteq (\sigma^{-2} + \nu^{-2})^{-1}, \\ m &\doteq \big(\sigma^{-2}\mu + \nu^{-2}\ervy\big)s^{2}. \end{align*} \]


\(\blacksquare\)

Proof. We can omit the normalising constants in the Gaussians, since these can be factored out of both the integral in the denominator of Eq. \(\ref{eq:gaussian_posterior}\) as well as the product in the numerator, at which point they will cancel out.

Thus, we take

\[p(\ervx) \propto \exp\left(-\frac{1}{2}\frac{(\ervx-\mu)^{2}}{\sigma^{2}}\right), \]

and

\[p(\ervy \vert \ervx) \propto \exp\left(-\frac{1}{2}\frac{(\ervy - \ervx)^{2}}{\nu^{2}}\right). \]

To proceed, note that \(p(\ervx \vert \ervy) \propto p(\ervx, \ervy) = p(\ervy \vert \ervx)p(\ervx)\), where the final product is the exponential of a negative quadratic. By writing this quadratic as a perfect square, we can more easily match terms and recover the mean and variance of the Gaussian \(p(\ervx \vert \ervy)\); \(m\) and \(s^{2}\).

Completing the Square

Recall that to ‘complete the square’ for some quadratic expression in \(x\) means to add a term into this expression that lets us write it as a perfect square plus some constant terms.

That is, using the fact that \((x + \alpha)^{2} = x^{2} + 2\alpha x + \alpha^{2}\), we can match terms to identify a value of the squared constant \(\alpha^{2}\) which, after adding it in, will let us create a perfect square:

\[\begin{align*} ax^{2} + bx + c &= a\left(x^{2} + \frac{b}{a}x + \frac{c}{a}\right) \\ &= a\left(\left(x^{2} + \frac{b}{a}x + \left(\frac{b}{2a}\right)^{2}\right) + \frac{c}{a} - \left(\frac{b}{2a}\right)^{2}\right) \\ &= a\left(\left(x + \frac{b}{2a}\right)^{2} - \frac{b^{2}}{4a^{2}} + \frac{c}{a}\right) \\ &= a\left(x + \frac{b}{2a}\right)^{2} - \frac{b^{2}}{4a} + c, \end{align*} \]

where we used \(2\alpha x = \frac{b}{a}x\), implying \(\alpha = \frac{b}{2a}\) on the second line. The particular choice of rearrangement on the first and last lines is arbitrary, but leads to clean formulae that we can re-use as identities.

Multiplying these two exponentials gives \(\exp\left(-\frac{1}{2}\left(\frac{(\ervx-\mu)^{2}}{\sigma^{2}} + \frac{(\ervx - \ervy)^{2}}{\nu^{2}}\right)\right)\). Since we want to recover the first two moments required to specify the resulting Gaussian, \(\exp\left(-\frac{1}{2}\frac{(\ervx - m)^{2}}{s^{2}}\right)\), writing the perfect square in the form \(a\left(x + \frac{b}{2a}\right)^{2}\) makes it trivial to read off the values of \(m\) and \(s^{2}\). Proceeding,

\[\begin{align*} &\frac{(\ervx - \mu)^{2}}{\sigma^{2}} + \frac{(\ervx - \ervy)^{2}}{\nu^{2}} \\ &= \frac{\ervx^{2} - 2\ervx\mu + \mu^{2}}{\sigma^{2}} + \frac{\ervx^{2} - 2\ervx\ervy + \ervy^{2}}{\nu^{2}} \\ &= \left(\frac{1}{\sigma^{2}} + \frac{1}{\nu^{2}}\right)\ervx^{2} - 2\left(\frac{\mu}{\sigma^{2}} + \frac{\ervy}{\nu^{2}}\right) \ervx + \text{const} \\ &= \underbrace{\left(\frac{1}{\sigma^{2}} + \frac{1}{\nu^{2}}\right)}_{s^{-2}}\left(\ervx - \underbrace{\left(\frac{\mu}{\sigma^{2}} + \frac{\ervy}{\nu^{2}}\right) \left(\frac{1}{\sigma^{2}} + \frac{1}{\nu^{2}}\right)^{-1}}_{m}\right)^{2} + \text{const}, \end{align*} \]

giving the variance and mean of \(p(\ervx \vert \ervy)\) as

\[\begin{align*} s^{2} &= (\sigma^{-2} + \nu^{-2})^{-1} \\ m &= (\mu \sigma^{-2} + \ervy \nu^{-2})s^{2}. \end{align*} \]


\(\square\)
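As a quick numerical sanity check of this result (not part of the derivation), the following sketch, assuming NumPy/SciPy and some arbitrary example values for \(\mu\), \(\sigma^{2}\), \(\nu^{2}\) and \(\ervy\), verifies that \(\gN(\ervx; m, s^{2})\) is proportional to \(p(\ervx)p(\ervy \vert \ervx)\):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 0.5, 2.0   # prior p(x) = N(mu, sigma^2); example values
nu2 = 0.3               # observation noise variance
y = 1.7                 # an observed value

# Posterior parameters from the completed square.
s2 = 1.0 / (1.0 / sigma2 + 1.0 / nu2)
m = (mu / sigma2 + y / nu2) * s2

# N(x; m, s2) should be proportional to N(x; mu, sigma2) * N(y; x, nu2),
# i.e. their ratio should not depend on x.
xs = np.linspace(-3.0, 3.0, 7)
posterior = norm.pdf(xs, loc=m, scale=np.sqrt(s2))
unnormalised = (norm.pdf(xs, loc=mu, scale=np.sqrt(sigma2))
                * norm.pdf(y, loc=xs, scale=np.sqrt(nu2)))
ratios = posterior / unnormalised
assert np.allclose(ratios, ratios[0])   # constant ratio, equal to 1 / p(y)
```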

Most interesting applications of the Gaussian require us to jointly model two or more random variables. Hence, we will now consider the multivariate Gaussian (which we will simply refer to as the Gaussian).

Definition (Gaussian). The Gaussian distribution is a joint probability density over multiple variables. If the number of variables of interest is \(D\), then we work with the \(D\)-dimensional random vector \(\rvx\), and the Gaussian PDF is defined as

\[\gN(\rvx \vert \vmu, \mSigma) = (2\pi)^{-D/2}\vert \mSigma \vert^{-1/2} \exp\left(-\frac{1}{2}(\rvx - \vmu)^{\top}\mSigma^{-1}(\rvx - \vmu)\right), \]

where we now have a mean vector \(\vmu\), as well as a covariance matrix \(\mSigma\). The notation \(\vert \mSigma \vert\) denotes the determinant of the matrix \(\mSigma\).


\(\blacksquare\)

If \(D = 1\), so that \(\mSigma = \sigma^{2}\) is a scalar, then we can see that we recover the univariate Gaussian of Equation \(\ref{eq:univariate_gaussian}\) exactly.

Recall that in the univariate case, we had a quadratic expression in the exponent; in the multivariate case we now have a quadratic form:

\[\Delta^{2} = (\rvx - \vmu)^{\top}\mSigma^{-1}(\rvx - \vmu). \]

For \(D > 1\) and \(\mSigma\) still the identity, the quantity \(\Delta\) is just the Euclidean distance (or \(\ell_2\) norm) between the random variable and the mean.

For more general covariance matrices, the quadratic form corresponds to the Mahalanobis distance between the data vector \(\rvx\) and the mean vector \(\vmu\). Intuitively, the Mahalanobis distance corresponds to the Euclidean distance in a transformed coordinate system: shifted by \(\vmu\), rotated by some rotation matrix \(\rmU\), and scaled along each rotated axis by the corresponding eigenvalue of the covariance.

The rotation matrix \(\rmU\) comes from the eigendecomposition of the covariance matrix: \(\mSigma = \rmU\mLambda\rmU^{\top}\). Here \(\rmU\) is the orthonormal matrix of eigenvectors (i.e. with \(\rmU^{\top}\rmU = \rmI\)) and \(\mLambda\) is the diagonal matrix of eigenvalues. (You might notice that this form looks slightly different from the usual eigendecomposition, where we have a matrix inverse rather than a transpose on the last term: this is because the covariance matrix is real and symmetric, so its eigenvectors can be chosen to be orthonormal, giving \(\rmU^{-1} = \rmU^{\top}\).)

The covariance matrix’s inverse can then be written (using the fact that \((\rmA\rmB)^{-1} = \rmB^{-1}\rmA^{-1}\)) as

\[\mSigma^{-1} = \rmU^{-\top}\mLambda^{-1}\rmU^{-1} = \rmU\mLambda^{-1}\rmU^{\top} = \sum_{i=1}^{D}\frac{1}{\lambda_{i}}\rvu_{i}\rvu_{i}^{\top}, \]

where \(\rvu_{i}\rvu_{i}^{\top}\) is the outer product of the \(i\)th eigenvector. Substituting this into the quadratic form in the exponent of the Gaussian, we get

\[\begin{align*} (\rvx - \vmu)^{\top}\mSigma^{-1}(\rvx - \vmu) &= (\rvx - \vmu)^{\top} \left(\sum_{i=1}^{D}\frac{1}{\lambda_{i}}\rvu_{i}\rvu_{i}^{\top}\right)(\rvx - \vmu) \\ &= \sum_{i=1}^{D}\frac{1}{\lambda_{i}}(\rvx - \vmu)^{\top}\rvu_{i}\rvu_{i}^{\top}(\rvx - \vmu) = \sum_{i=1}^{D}\frac{y_{i}^{2}}{\lambda_{i}}, \end{align*} \]

where we have defined the scalar \(y_{i} \doteq \rvu_{i}^{\top}(\rvx - \vmu)\). We may interpret \(\{y_i\}\) as a new coordinate system, defined by the orthonormal basis vectors \(\rvu_{i}\) that are shifted and rotated wrt. the original \(\ervx_{i}\) coordinates.
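To make this concrete, the following sketch (assuming NumPy; the random covariance construction is just for illustration) checks that the quadratic form computed directly agrees with \(\sum_{i} y_{i}^{2} / \lambda_{i}\) in the rotated coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)      # a random positive definite covariance
mu = rng.standard_normal(D)
x = rng.standard_normal(D)

# Eigendecomposition Sigma = U Lambda U^T (eigh since Sigma is symmetric).
lam, U = np.linalg.eigh(Sigma)

# Mahalanobis (squared) distance computed directly ...
direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# ... and via the rotated coordinates y_i = u_i^T (x - mu).
y = U.T @ (x - mu)
via_eig = np.sum(y**2 / lam)

assert np.isclose(direct, via_eig)
```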

Recalling that the equation of an ellipse (corresponding to \(D=2\)) is

\[\frac{y_{1}^{2}}{\lambda_{1}} + \frac{y^{2}_{2}}{\lambda_{2}} = 1, \]

we can see that contours of equal probability density for a Gaussian lie along ellipses. This is often used as a way to visualise high-dimensional Gaussians, where we pick pairs of eigenvectors, usually with large eigenvalues, and plot the resulting ellipse.

Note that in order for the Gaussian to be well defined, we need all the eigenvalues of the covariance matrix to be strictly positive—if they weren’t, we would not be able to normalise the distribution. A suitable covariance matrix with positive eigenvalues is said to be positive definite, and it is common when working with Gaussians to check for the size of these eigenvalues: lots of values close to 0 may indicate numerical instabilities.

We have so far been using the first two moments of a Gaussian distribution, the mean \(\vmu\) and covariance \(\mSigma\), as its parameters. However, the Gaussian is part of the exponential family of distributions, meaning that we can define a Gaussian in an alternative form—the information form—using the so-called natural or canonical parameters.

Many procedures and manipulations involving Gaussians simplify vastly when using the information form of the Gaussian, as opposed to the moment form. For instance certain Gaussian identities, such as the product or KL divergence between two Gaussians, are much simpler to derive in information form. The information form is also helpful in other situations such as when updating beliefs over variables in Bayesian networks, or solving linear Gaussian systems.

We can obtain the natural parameters of the Gaussian by rearranging the moment form. Denoting the normalising terms as \(p \doteq (2\pi)^{-D/2}\vert \mSigma \vert^{-1/2}\), and expanding the brackets in the quadratic form, we get:

\[\begin{align*} \gN(\rvx; \vmu, \mSigma) &= p \cdot \exp\left(-\frac{1}{2}(\rvx - \vmu)^{\top}\mSigma^{-1}(\rvx - \vmu)\right) \\ &= \exp\left( \log p - \frac{1}{2}\vmu^{\top}\mSigma^{-1}\vmu + \rvx^{\top}\mSigma^{-1}\vmu - \frac{1}{2}\rvx^{\top}\mSigma^{-1}\rvx \right) \\ &\doteq \exp\left(g + \rvx^{\top}\rvh - \frac{1}{2}\rvx^{\top}\rmK\rvx \right) \end{align*} \]

where we have defined

\[\begin{equation} \label{eq:natural_parameters} \rmK \doteq \mSigma^{-1},\hspace{2em}\rvh \doteq \mSigma^{-1}\vmu,\hspace{2em}g \doteq \log p - \frac{1}{2} \vmu^{\top}\rmK\vmu. \end{equation} \]

\(\rmK\) is referred to as the precision matrix, and is also commonly denoted \(\mLambda\). \(\rvh\) is referred to as the precision-adjusted mean.

For the Gaussian, there is always a one-to-one mapping between the moments and the natural parameters. We can go back to the moment form by inverting the definitions above:

\[\begin{equation}\label{eq:natural_to_moment} \mSigma = \rmK^{-1},\hspace{2em}\vmu = \rmK^{-1}\rvh,\hspace{2em}p=\exp\left(g+\frac{1}{2}\vmu^{\top}\mSigma^{-1}\vmu\right). \end{equation} \]
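The following sketch, assuming NumPy (the helper names moment_to_natural and natural_to_moment are my own), performs this round trip numerically:

```python
import numpy as np

def moment_to_natural(mu, Sigma):
    """Moment parameters -> natural parameters (K, h, g)."""
    K = np.linalg.inv(Sigma)
    h = K @ mu
    D = mu.shape[0]
    log_p = -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1])
    g = log_p - 0.5 * mu @ K @ mu
    return K, h, g

def natural_to_moment(K, h):
    """Natural parameters -> moment parameters (mu, Sigma)."""
    Sigma = np.linalg.inv(K)
    mu = Sigma @ h
    return mu, Sigma

rng = np.random.default_rng(2)
D = 3
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # random positive definite covariance
mu = rng.standard_normal(D)

K, h, g = moment_to_natural(mu, Sigma)
mu2, Sigma2 = natural_to_moment(K, h)
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
```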

The KL divergence between two multivariate Gaussian distributions is given by

\[\begin{align*} \KL[&\gN(\rvx \vert \vmu_{1}, \mSigma_{1})\Vert \gN(\rvx \vert \vmu_{2}, \mSigma_{2})] \\ &= \frac{1}{2}\left[\Tr(\mSigma_{2}^{-1}\mSigma_{1}) + (\vmu_{2} - \vmu_{1})^{\top}\mSigma^{-1}_{2}(\vmu_{2} - \vmu_{1}) - D + \log \left(\frac{\det(\mSigma_{2})}{\det(\mSigma_{1})}\right)\right]. \end{align*} \]

In the scalar case, this reduces to

\[\KL[\gN(\ervx \vert \mu_{1}, \sigma_{1}) \Vert \gN(\ervx \vert \mu_{2}, \sigma_{2})] = \log \frac{\sigma_{2}}{\sigma_{1}} + \frac{\sigma^{2}_{1} + (\mu_{1} - \mu_{2})^{2}}{2\sigma_{2}^{2}} - \frac{1}{2}. \]
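A small sketch of these formulae, assuming NumPy (gaussian_kl is my own helper name), checking that the multivariate expression reduces to the scalar one when \(D = 1\):

```python
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """KL[N(mu1, Sigma1) || N(mu2, Sigma2)] for multivariate Gaussians."""
    D = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.trace(Sigma2_inv @ Sigma1)
        + diff @ Sigma2_inv @ diff
        - D
        + np.linalg.slogdet(Sigma2)[1]
        - np.linalg.slogdet(Sigma1)[1]
    )

# Check the D = 1 special case against the scalar formula above.
mu1, s1 = 0.3, 1.2    # arbitrary example values
mu2, s2 = -0.5, 0.8
kl_multi = gaussian_kl(np.array([mu1]), np.array([[s1**2]]),
                       np.array([mu2]), np.array([[s2**2]]))
kl_scalar = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
assert np.isclose(kl_multi, kl_scalar)
```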

Evaluating a Gaussian density involves inverting a covariance matrix, as well as computing its determinant (for most real-world applications, we compute the Cholesky decomposition of \(\mSigma\), which lets us re-use our computational work in both the determinant and the matrix inversion calculation; the partitioned matrix inverse, however, has many useful corollaries, and is thus useful for understanding Gaussian manipulations):

\[\gN(\rvx; \vmu, \mSigma) = (2\pi)^{-D/2}\vert \mSigma \vert^{-1/2} \exp \left(-\frac{1}{2}(\rvx - \vmu)^{\top} \mSigma^{-1} (\rvx - \vmu) \right). \]
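As an aside, here is a minimal sketch of the Cholesky route mentioned above, assuming NumPy/SciPy (gaussian_logpdf is my own helper name), with the result checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def gaussian_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) using a single Cholesky factorisation of Sigma."""
    D = mu.shape[0]
    chol, lower = cho_factor(Sigma, lower=True)
    diff = x - mu
    maha = diff @ cho_solve((chol, lower), diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))    # log |Sigma| from the factor
    return -0.5 * (D * np.log(2 * np.pi) + log_det + maha)

rng = np.random.default_rng(3)
D = 5
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # random positive definite covariance
mu, x = rng.standard_normal(D), rng.standard_normal(D)

assert np.isclose(gaussian_logpdf(x, mu, Sigma),
                  multivariate_normal(mean=mu, cov=Sigma).logpdf(x))
```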

Often, our problem gives this covariance matrix some structure; in particular, we can often partition it as follows:

\[\mSigma = \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}. \]

We will assume that the sub-matrices \(\mSigma_{11}\) and \(\mSigma_{22}\) are themselves invertible—in many contexts this is not an inconvenience since these sub-matrices are themselves covariance matrices (e.g. for a prior or likelihood).

To find \(\mSigma^{-1}\), we can begin by block-diagonalising it; that is, making \(\mSigma_{12}\) and \(\mSigma_{21}\) zero, so that the resulting block-diagonal matrix has two square main-diagonal blocks (of the same size as \(\mSigma_{11}\) and \(\mSigma_{22}\), respectively), with all off-diagonal blocks being zero matrices. To zero-out the top-right block, we left-multiply \(\mSigma\) by a matrix \(\mathbf{A}\) chosen such that the new top-right block is \(\mathbf{A}_{11}\mSigma_{12} + \mathbf{A}_{12}\mSigma_{22} = \mathbf{0}\), while leaving the bottom row of blocks unchanged. We must therefore set \(\rmA_{11} = \rmI\) and \(\rmA_{12} = -\mSigma_{12}\mSigma_{22}^{-1}\):

\[\underbrace{\begin{bmatrix} \mathbf{I} & -\mathbf{\Sigma_{12}}\mSigma_{22}^{-1} \\ \mathbf{0} & \mathbf{I}\end{bmatrix}}_{\mathbf{A}} \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix} = \begin{bmatrix} \mSigma_{11} - \mSigma_{12}\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{0} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}. \]

Similarly, to zero-out the bottom-left block, we right-multiply the result by a matrix \(\mathbf{B}\) chosen such that the new bottom-left block is \(\mSigma_{21}\rmB_{11} + \mSigma_{22}\rmB_{21} = \mathbf{0}\), obtaining our block-diagonal matrix, which we will call \(\mathbf{\Xi}\):

\[\begin{bmatrix} \mSigma_{11} - \mSigma_{12}\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{0} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix} \underbrace{ \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{I} \end{bmatrix}}_{\mathbf{B}} = \underbrace{\begin{bmatrix} \mSigma_{11} - \mSigma_{12}\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{0} \\ \mathbf{0} & \mSigma_{22}\end{bmatrix}}_{\mathbf{\Xi}}. \]

In summary, we have \(\mathbf{A}\mSigma\mathbf{B} = \mathbf{\Xi}\). Each of these matrices is individually invertible, so taking the inverse of both sides gives:

\[\mathbf{B}^{-1}\mSigma^{-1}\mathbf{A}^{-1} = \mathbf{\Xi}^{-1}, \]

and rearranging gives us an expression for \(\mSigma^{-1}\):

\[\mSigma^{-1} = \mathbf{B}\mathbf{\Xi}^{-1}\mathbf{A}. \]

Schur Complements

At this point, to lighten notation, we should introduce the Schur complement of a block matrix. For a block matrix

\[\mathbf{M} = \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}, \]

the Schur complement of \(\mathbf{M}\) wrt \(\mathbf{D}\), denoted \(\mathbf{M}/\mathbf{D}\), is defined as

\[\mathbf{M}/\mathbf{D} \doteq \mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C}. \]

Similarly,

\[\mathbf{M}/\mathbf{A} \doteq \mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B}. \]

Substituting our expressions for \(\mathbf{A}\), \(\mathbf{B}\) and \(\mathbf{\Xi}\), we get the following expression for the inverse:

\[\begin{align} \mSigma^{-1} &= \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{I} \end{bmatrix} \begin{bmatrix} (\mSigma/\mSigma_{22})^{-1} & \mathbf{0} \\ \mathbf{0} & \mSigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mSigma_{12}\mSigma_{22}^{-1} \\ \mathbf{0} & \mathbf{I}\end{bmatrix} \nonumber \\ &= \begin{bmatrix} (\mSigma/\mSigma_{22})^{-1} & \mathbf{0} \\ -\mSigma_{22}^{-1}\mSigma_{21}(\mSigma/\mSigma_{22})^{-1} & \mSigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mSigma_{12}\mSigma_{22}^{-1} \\ \mathbf{0} & \mathbf{I}\end{bmatrix} \nonumber \\ &= \begin{bmatrix} (\mSigma/\mSigma_{22})^{-1} & -(\mSigma/\mSigma_{22})^{-1}\mSigma_{12}\mSigma_{22}^{-1} \\ -\mSigma_{22}^{-1}\mSigma_{21}(\mSigma/\mSigma_{22})^{-1} & \mSigma_{22}^{-1} +\mSigma_{22}^{-1}\mSigma_{21}(\mSigma/\mSigma_{22})^{-1}\mSigma_{12}\mSigma_{22}^{-1} \end{bmatrix} \label{eq:block_inv_1}. \end{align} \]

Note that had we done the block-diagonalisation the other way (zeroing-out the bottom-left block first) we would have obtained an expression in terms of \((\mSigma/\mSigma_{11})\):

\[\begin{equation} \label{eq:block_inv_2} \mSigma^{-1} = \begin{bmatrix} \mSigma_{11}^{-1} + \mSigma^{-1}_{11}\mSigma_{12}(\mSigma/\mSigma_{11})^{-1}\mSigma_{21}\mSigma_{11}^{-1} & -\mSigma_{11}^{-1}\mSigma_{12}(\mSigma/\mSigma_{11})^{-1} \\ -(\mSigma/\mSigma_{11})^{-1}\mSigma_{21}\mSigma_{11}^{-1} & (\mSigma/\mSigma_{11})^{-1} \end{bmatrix}. \end{equation} \]
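Before moving on, it is easy to check Equation \(\ref{eq:block_inv_1}\) numerically; a minimal sketch assuming NumPy, with a randomly generated positive definite \(\mSigma\):

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 3, 2
A = rng.standard_normal((n1 + n2, n1 + n2))
Sigma = A @ A.T + (n1 + n2) * np.eye(n1 + n2)   # random positive definite matrix

S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]

# Schur complement of Sigma with respect to S22, and its inverse.
schur = S11 - S12 @ np.linalg.inv(S22) @ S21
schur_inv, S22_inv = np.linalg.inv(schur), np.linalg.inv(S22)

# Assemble the partitioned inverse and compare with a direct inversion.
block_inverse = np.block([
    [schur_inv,                  -schur_inv @ S12 @ S22_inv],
    [-S22_inv @ S21 @ schur_inv, S22_inv + S22_inv @ S21 @ schur_inv @ S12 @ S22_inv],
])
assert np.allclose(block_inverse, np.linalg.inv(Sigma))
```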

This soup of matrices above, giving the equation for the inverse of a partitioned matrix, might not immediately seem very helpful; however, there are a few useful corollaries that we can draw from them. One example is the matrix inversion lemma:

Matrix Inversion Lemma

Consider a general partitioned matrix

\[\mathbf{M} = \begin{bmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{bmatrix}, \]

where we assume that the submatrices \(\mathbf{A}\) and \(\mathbf{D}\) are invertible. The matrix inversion lemma states that

\[(\mathbf{M}/\mathbf{D})^{-1} = \mathbf{A}^{-1} + \mathbf{A}^{-1}\mathbf{B}(\mathbf{M}/\mathbf{A})^{-1}\mathbf{C}\mathbf{A}^{-1}, \]

or equivalently (expanding Schur complements):

\[\begin{equation} \label{eq:matrix_inversion} (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} = \mathbf{A}^{-1} + \mathbf{A}^{-1}\mathbf{B}(\mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{C}\mathbf{A}^{-1}, \end{equation} \]

which follows by equating the top-left blocks of Equations \(\ref{eq:block_inv_1}\) and \(\ref{eq:block_inv_2}\), respectively. Alternatively, equating the top-right blocks gives the related formula:

\[\begin{equation} \label{eq:matrix_inversion_alternate} (\rmA - \rmB\rmD^{-1}\rmC)^{-1}\rmB\rmD^{-1} = \rmA^{-1}\rmB(\rmD - \rmC\rmA^{-1}\rmB)^{-1}. \end{equation} \]

We can now consider some examples where the matrix inversion lemma is useful in machine learning applications.

Example (Woodbury identity). Suppose that the matrix \(\rmM\) is partitioned such that:

  • \(\rmA = \mSigma\) is an \(N \times N\) diagonal matrix,
  • \(\rmB = \rmC^\top = \rmX \in \R^{N\times D}\), where we will assume that \(N \gg D\), and
  • \(\rmD^{-1} = -\rmI\).

Substituting these terms into Equation \(\ref{eq:matrix_inversion}\) of the matrix inversion lemma gives us

\[\begin{equation} \label{eq:woodbury} (\mSigma + \rmX\rmX^\top)^{-1} = \mSigma^{-1} - \mSigma^{-1}\rmX(\rmI + \rmX^{\top}\mSigma^{-1}\rmX)^{-1}\rmX^{\top}\mSigma^{-1}. \end{equation} \]

The term \(\mSigma + \rmX\rmX^{\top}\) corresponds to a low-rank update of the matrix \(\mSigma\) (low-rank since we have assumed that \(D\) is very small compared to \(N\); it may even be \(D = 1\)). The left hand side shows a naive re-inversion of this new matrix, which will take \(O(N^{3})\) time. However, if we keep a previously computed \(\mSigma^{-1}\) in memory, applying the right hand side of Equation \(\ref{eq:woodbury}\) only requires inverting a \(D \times D\) matrix, costing \(O(D^{3})\) rather than \(O(N^{3})\).

Crucially this might make the difference between an algorithm which is impractical to run on even the largest clusters, and one which is readily run on commodity hardware.
\(\blacksquare\)
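A quick numerical check of Equation \(\ref{eq:woodbury}\), assuming NumPy and some arbitrary choices of \(N\) and \(D\):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 200, 3
Sigma = np.diag(rng.uniform(1.0, 2.0, size=N))   # N x N diagonal matrix
X = rng.standard_normal((N, D))                  # low-rank factor, D << N

Sigma_inv = np.diag(1.0 / np.diag(Sigma))        # cheap to invert (diagonal)

# Right hand side of the Woodbury identity: only a D x D inverse is needed.
inner = np.linalg.inv(np.eye(D) + X.T @ Sigma_inv @ X)
woodbury = Sigma_inv - Sigma_inv @ X @ inner @ X.T @ Sigma_inv

# Compare against the naive O(N^3) inversion of the updated matrix.
assert np.allclose(woodbury, np.linalg.inv(Sigma + X @ X.T))
```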

The Sherman-Morrison formula is another useful example.

Example (Sherman-Morrison formula for rank one updates). This is just a special case of the above where \(D = 1\), corresponding to performing a rank-one update to an inverse matrix—this is common when data is coming in incrementally, and we’d like to re-use computation. This time, let our matrix \(\rmM\) be partitioned as follows:

  • \(\rmA\) is some large matrix for which we already know the inverse \(\rmA^{-1}\),
  • \(\rmB = \rvu\) (a column vector),
  • \(\rmC = \rvv^\top\) (a row vector), and
  • \(D = -1\) (a single scalar value). We then have

\[\begin{align*} (\rmA + \rvu\rvv^{\top})^{-1} &= \rmA^{-1} + \rmA^{-1}\rvu(-1 - \rvv^{\top}\rmA^{-1}\rvu)^{-1}\rvv^{\top}\rmA^{-1} \\ &= \rmA^{-1} - \frac{\rmA^{-1}\rvu\rvv^{\top}\rmA^{-1}}{1 + \rvv^{\top}\rmA^{-1}\rvu}. \end{align*} \]


\(\blacksquare\)

Another result that we can derive from the partitioned inverse formula is the matrix determinant lemma, which, when used in conjunction with the Sherman-Morrison formula, allows us to conveniently calculate low-rank updates to both a matrix inverse and a determinant.

Matrix Determinant Lemma

For a matrix \(\rmM\) partitioned as above, we have that

\[\begin{equation} \label{eq:matrix_determinant_lemma} \vert \rmA - \rmB \rmD^{-1}\rmC\vert = \vert \rmD - \rmC\rmA^{-1}\rmB\vert \vert \rmD^{-1}\vert \vert \rmA \vert. \end{equation} \]

Similar to our statement of the Sherman-Morrison formula, we can highlight a special case of the matrix determinant lemma for computing a rank-one update to a determinant, where \(\rmB = \rvu\), \(\rmC = \rvv^\top\) and \(D = -1\), giving

\[\vert \rmA + \rvu\rvv^\top \vert = (1 + \rvv^\top \rmA^{-1}\rvu)\vert \rmA \vert. \]
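Again, this is easy to verify numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(6)
D = 6
M = rng.standard_normal((D, D))
A = M @ M.T + D * np.eye(D)          # an invertible matrix
u, v = rng.standard_normal(D), rng.standard_normal(D)

# Rank-one determinant update: |A + u v^T| = (1 + v^T A^{-1} u) |A|.
lhs = np.linalg.det(A + np.outer(u, v))
rhs = (1.0 + v @ np.linalg.inv(A) @ u) * np.linalg.det(A)
assert np.isclose(lhs, rhs)
```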

A \(D\)-dimensional Gaussian random vector \(\mathbf{x} \sim \mathcal{N}(\vmu, \mSigma)\) can be partitioned into two sub-vectors \(\mathbf{x} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_{2}\end{bmatrix}^\top\) where \(\mathbf{x}_1 \in \mathbb{R}^N\), \(\mathbf{x}_{2}\in\mathbb{R}^M\) and \(N + M = D\).

In other words, we can interpret \(p(\mathbf{x})\) as the joint distribution \(p(\mathbf{x}_1, \mathbf{x}_2)\).

There are two common manipulations that we often want to perform on such a joint distribution:

  1. marginalising out one of the sub-vectors e.g. \(p(\mathbf{x}_1) = \int p(\mathbf{x}_1, \mathbf{x}_2) d\mathbf{x}_2\),
  2. computing a conditional e.g. \(p(\mathbf{x}_1 \vert \mathbf{x}_2)\).

For multivariate Gaussians, both of these can be done in closed form.

Computing the marginal is particularly easy. Partitioning the parameters \(\vmu, \mSigma\) of our Gaussian random vector just as we partitioned \(\mathbf{x}\), we get:

\[\begin{equation} \label{eq:partitioned_gaussian_params} \vmu = \begin{bmatrix}\vmu_1 \\ \vmu_2 \end{bmatrix}, \hspace{20pt} \mSigma = \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}, \hspace{20pt} \mathbf{\Lambda} \doteq \mSigma^{-1} = \begin{bmatrix} \mathbf{\Lambda}_{11} & \mathbf{\Lambda}_{12} \\ \mathbf{\Lambda}_{21} & \mathbf{\Lambda}_{22} \end{bmatrix}, \end{equation} \]

where \(\vmu_1 \in \mathbb{R}^N\), \(\mSigma_{12} \in \mathbb{R}^{N\times M}\), and so on. The marginals can then be found by simply reading off the corresponding parameters. Do also note that the partitions of \(\mLambda\) do not correspond straightforwardly to the partitions of \(\mSigma\); for instance, \(\mLambda_{11}\) is not merely the inverse of \(\mSigma_{11}\).

Marginal of a Joint Gaussian

Given a Gaussian random vector \(\vx \sim \gN(\vmu, \mSigma)\), the marginals are

\[\begin{align*} p(\mathbf{x}_1) &= \mathcal{N}(\mathbf{x}_1 \vert \vmu_1, \mSigma_{11}) \\\\ p(\mathbf{x}_2) &= \mathcal{N}(\mathbf{x}_2 \vert \vmu_2, \mSigma_{22}). \end{align*} \]

In other words, we simply have to select the relevant sub-vectors and sub-matrices of the joint Gaussian’s parameters \(\vmu\) and \(\mSigma\), respectively, to parametrise the marginals.

Finding the conditional is slightly more involved. We state the result here before proving it.

Conditional of a Joint Gaussian Distribution

For a Gaussian joint \(p(\rvx_1, \rvx_2)\), the posterior conditional is given by

\[\begin{align} p(\mathbf{x}_1 \vert \mathbf{x}_2) &= \mathcal{N}(\mathbf{x}_1 \vert \vmu_{1 \vert 2}, \mSigma_{1\vert 2}) \\ \vmu_{1\vert 2} &= \vmu_1 + \mSigma_{12}\mSigma^{-1}_{22}(\mathbf{x}_2 - \vmu_2) \\ &= \vmu_1 - \mathbf{\Lambda}^{-1}_{11}\mathbf{\Lambda}_{12}(\mathbf{x}_2 - \vmu_2) \label{eq:conditional_mean_precision_1} \\ &= \mSigma_{1\vert 2}\big(\mathbf{\Lambda}_{11}\vmu_1 - \mathbf{\Lambda}_{12}(\mathbf{x}_2 - \vmu_2)\big) \label{eq:conditional_mean_precision_2} \\ \mSigma_{1\vert 2} &= \mSigma_{11} - \mSigma_{12} \mSigma^{-1}_{22}\mSigma_{21} = (\mSigma/\mSigma_{22}) \label{eq:conditional_cov}. \end{align} \]

Proof idea: The Gaussian that we’re after has the form

\[\mathcal{N}(\mathbf{x}_1 \vert \vmu_{1\vert 2}, \mSigma_{1\vert 2}) = (2\pi)^{-d_1/2}\vert \mSigma_{1\vert 2}\vert^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}_1 - \vmu_{1\vert 2})^\top \mSigma_{1\vert 2}^{-1}(\mathbf{x}_{1} - \vmu_{1\vert 2})\right). \]

In particular, the log-density of the above is a quadratic in both \(\rvx_1\) and \(\rvx_2\) (the \(\vmu_{1\vert 2}\) term must in general make reference to \(\rvx_2\), since \(\rvx_1\) is not independent of \(\rvx_2\)), and the parameters we need to define \(p(\rvx_1\vert \rvx_2)\) (i.e. \(\vmu_{1\vert 2}\) and \(\mSigma_{1\vert 2}\)) appear in this quadratic.

Factorising the joint, \(\log p(\mathbf{x}_1, \mathbf{x}_2) = {\color{MidnightBlue}{\log p(\mathbf{x}_1 \vert \mathbf{x}_2)}} + \color{RedOrange}{\log p(\mathbf{x}_2)}\), we see that the first term on the right is the sum of a quadratic in both \(\rvx_1\) and \(\rvx_2\), and the second is a quadratic in \(\rvx_2\) only (ignoring constants and linear terms). Hence we can read off the parameters we need (the first two moments of the posterior, namely \(\color{MidnightBlue}{\vmu_{1\vert 2}}\) and \(\color{MidnightBlue}{\mSigma_{1\vert 2}}\)) from this first quadratic, discarding the rest.

Proof. We begin by explicitly writing the log-density of the joint:

\[\log p(\rvx_1, \rvx_2) = -\frac{1}{2}\log \vert \mSigma \vert - \frac{1}{2} \underbrace{ \begin{bmatrix}\mathbf{x}_1 - \vmu_1 \\ \mathbf{x}_2 - \vmu_2 \end{bmatrix}^\top \begin{bmatrix}\mSigma_{11} & \mSigma_{12} \\ \mSigma_{21} & \mSigma_{22}\end{bmatrix}^{-1} \begin{bmatrix}\mathbf{x}_1 - \vmu_1 \\ \mathbf{x}_2 - \vmu_2\end{bmatrix}}_{Q} + \text{const}, \]

and henceforth refer to this quadratic form as \(Q\). Since we want to end up with a term in \(\color{RedOrange}{\mathbf{x}_2}\), \(\color{RedOrange}{\vmu_{2}}\) and \(\color{RedOrange}{\mSigma_{22}}\) only, we block-diagonalise the partitioned matrix \(\mSigma\) (exactly as we did while deriving the partitioned matrix inverse formulae above); making sure to keep \(\mSigma_{22}\) on its own:

\[\begin{align*} Q &= \begin{bmatrix}\mathbf{x}_1 - \vmu_1 \\ \mathbf{x}_2 - \vmu_2 \end{bmatrix}^\top \begin{bmatrix}\mathbf{I} & \mathbf{0} \\ -\mSigma_{22}^{-1}\mSigma_{21} & \mathbf{I}\end{bmatrix} \begin{bmatrix}(\mSigma/\mSigma_{22})^{-1} & \mathbf{0} \\ \mathbf{0} & \mSigma_{22}^{-1} \end{bmatrix} \begin{bmatrix}\mathbf{I} & -\mSigma_{12}\mSigma_{22}^{-1} \\ \mathbf{0} & \mathbf{I}\end{bmatrix} \begin{bmatrix}\mathbf{x}_1 - \vmu_1 \\ \mathbf{x}_2 - \vmu_2\end{bmatrix} \\ &= \begin{bmatrix} \color{MidnightBlue}{\mathbf{x}_1 - \vmu_1 - \mSigma_{12}\mSigma^{-1}_{22}(\mathbf{x}_2 - \vmu_2)} \\ \color{RedOrange}{\mathbf{x}_2 - \vmu_2}\end{bmatrix}^\top \begin{bmatrix}\color{MidnightBlue}{(\mSigma/\mSigma_{22})^{-1}} & \mathbf{0} \\ \mathbf{0} & \color{RedOrange}{\mSigma_{22}^{-1}} \end{bmatrix} \begin{bmatrix}\color{MidnightBlue}{\mathbf{x}_1 - \vmu_1 - \mSigma_{12}\mSigma^{-1}_{22}(\mathbf{x}_2 - \vmu_2)} \\ \color{RedOrange}{\mathbf{x}_2 - \vmu_2}\end{bmatrix}. \end{align*} \]

At this point, since the notation is already running over the lines, we will pre-emptively substitute \(\color{MidnightBlue}{\vmu_{1\vert 2} \doteq \vmu_1 + \mSigma_{12}\mSigma^{-1}_{22}(\mathbf{x}_2 - \vmu_2)}\). This should not be too much of a leap of faith, since we arrive at the factorised form with just a few more multiplications:

\[\begin{align*} Q &= \begin{bmatrix} \color{MidnightBlue}{\mathbf{x}_1 - \vmu_{1\vert 2}} \\ \color{RedOrange}{\mathbf{x}_2 - \vmu_2}\end{bmatrix}^\top \begin{bmatrix}\color{MidnightBlue}{(\mSigma/\mSigma_{22})^{-1}} & \mathbf{0} \\ \mathbf{0} & \color{RedOrange}{\mSigma_{22}^{-1}} \end{bmatrix} \begin{bmatrix}\color{MidnightBlue}{\mathbf{x}_1 - \vmu_{1 \vert 2}} \\ \color{RedOrange}{\mathbf{x}_2 - \vmu_2}\end{bmatrix} \\ &= \begin{bmatrix} \color{MidnightBlue}{\big(\mathbf{x}_1 - \vmu_{1\vert 2}\big)^\top(\mSigma/\mSigma_{22})^{-1}} & \color{RedOrange}{(\mathbf{x}_2 - \vmu_2)^\top\mSigma^{-1}_{22}}\end{bmatrix} \begin{bmatrix}\color{MidnightBlue}{\mathbf{x}_1 - \vmu_{1\vert 2}} \\ \color{RedOrange}{\mathbf{x}_2 - \vmu_2}\end{bmatrix} \\ &= {\color{MidnightBlue}{(\mathbf{x}_1 - \vmu_{1\vert 2})^\top(\mSigma/\mSigma_{22})^{-1}(\mathbf{x}_1 - \vmu_{1\vert 2})}} + {\color{RedOrange}{(\mathbf{x}_2 - \vmu_2)^\top\mSigma^{-1}_{22}(\mathbf{x}_2 - \vmu_2)}}. \\ \end{align*} \]

With the expression in the form we need, we can now just read off the covariance term for the posterior \(\mSigma_{1 \vert 2}\), which comes to the Schur complement of \(\mSigma\) wrt. \(\mSigma_{22}\):

\[\mSigma_{1\vert 2} = (\mSigma/\mSigma_{22}) \doteq \mSigma_{11} - \mSigma_{12}\mSigma_{22}^{-1}\mSigma_{21}, \]

and we already have \(\vmu_{1\vert 2}\) from the substitution made above.

While we omitted the linear and constant terms above (we only worked with the exponential terms in the factorisation of \(p(\vx_1 \vert \vx_2)\)), we can check our value for \(\mSigma_{1\vert 2}\) using the terms we omitted. Substituting \(\vert \mSigma \vert\) with the matrix determinant lemma (Equation \(\ref{eq:matrix_determinant_lemma}\)), we get

\[\begin{align*} (2\pi)^{(d_1 + d_2)/2}\vert \mSigma \vert^{1/2} &= (2\pi)^{(d_1 + d_2)/2}(\vert \mSigma / \mSigma_{22}\vert \vert \mSigma_{22}\vert)^{1/2} \\ &= {\color{MidnightBlue}{(2\pi)^{\frac{d_1}{2}}\vert \mSigma / \mSigma_{22}\vert^{\frac{1}{2}}}} {\color{RedOrange}{(2\pi)^{\frac{d_2}{2}}\vert \mSigma_{22}\vert^{\frac{1}{2}}}}, \end{align*} \]

as required.
\(\square\)
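The boxed result above can also be checked numerically, by verifying the factorisation \(\log p(\rvx_1, \rvx_2) = \log p(\rvx_1 \vert \rvx_2) + \log p(\rvx_2)\) at an arbitrary point; a sketch assuming NumPy/SciPy, with randomly generated parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(7)
n1, n2 = 2, 3
D = n1 + n2
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # random positive definite joint covariance
mu = rng.standard_normal(D)

mu1, mu2 = mu[:n1], mu[n1:]
S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]

x = rng.standard_normal(D)        # an arbitrary point (x1, x2)
x1, x2 = x[:n1], x[n1:]

# Conditional parameters from the boxed result.
S22_inv = np.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2)
Sigma_cond = S11 - S12 @ S22_inv @ S21

# Check log p(x1, x2) = log p(x1 | x2) + log p(x2).
log_joint = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)
log_cond = multivariate_normal(mean=mu_cond, cov=Sigma_cond).logpdf(x1)
log_marg = multivariate_normal(mean=mu2, cov=S22).logpdf(x2)
assert np.isclose(log_joint, log_cond + log_marg)
```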

As we saw above, when working with Gaussians we can often sidestep the need to do any explicit integration by instead considering the quadratic form in the exponent of the Gaussian we are after, and matching coefficients to find its parameters.

This turns out to be a useful pattern that is worth dwelling on. Considering the partitioning in Equation \(\ref{eq:partitioned_gaussian_params}\), the quadratic form in question is:

\[\begin{align} -\frac{1}{2}&(\rvx - \vmu)^{\top}\mSigma^{-1}(\rvx - \vmu) = \nonumber \\ &-\frac{1}{2}{\color{MidnightBlue}(\rvx_{1} - \vmu_1)^\top}\mLambda_{11}{\color{MidnightBlue}(\rvx_1 - \vmu_1)} - \frac{1}{2}{\color{MidnightBlue}(\rvx_1 - \vmu_1)^\top} \mLambda_{12}{\color{RedOrange}(\rvx_2 - \vmu_2)} \nonumber \\ &-\frac{1}{2}{\color{RedOrange}(\rvx_2 - \vmu_2)^\top} \mLambda_{21}{\color{MidnightBlue}(\rvx_1 - \vmu_1)} - \frac{1}{2}{\color{RedOrange}(\rvx_2 - \vmu_2)^\top} \mLambda_{22}{\color{RedOrange}(\rvx_2 - \vmu_2)} \label{eq:general_quadratic_form}. \end{align} \]

As an example of how we can use this, consider the above as a function of \(\rvx_1\). This would correspond to the quadratic form in the exponent of \(p(\rvx_1 \vert \rvx_2)\), and we could obtain the parameters of this Gaussian by inspection, as we did above (this is often referred to as ‘completing the square’).

The term in the exponent of a general Gaussian distribution, \(\gN(\rvx \vert \vmu, \mSigma)\), is:

\[\begin{equation} \label{eq:gaussian_exponent} -\frac{1}{2}(\rvx - \vmu)^{\top} \mSigma^{-1}(\rvx - \vmu) = -\frac{1}{2} \rvx^{\top} \mSigma^{-1}\rvx + \rvx^{\top}\mSigma^{-1}\vmu + \text{const}, \end{equation} \]

and by merely equating the general quadratic form to the right hand side of the above, we can equate the matrix of coefficients of the second order term in \(\rvx\) to the inverse covariance \(\mSigma^{-1}\), and the coefficient of the linear term in \(\rvx\) to \(\mSigma^{-1}\vmu\), from which we can isolate \(\vmu\).

Returning to our example, we can derive the alternative forms for \(\vmu_{1\vert 2}\), the mean of \(p(\rvx_1 \vert \rvx_2)\), that we gave in terms of the precision in Equations \(\ref{eq:conditional_mean_precision_1}\) and \(\ref{eq:conditional_mean_precision_2}\). Treating \(\rvx_2\) as a constant conditioning term, we can pick out all the terms that are second order in \(\rvx_1\) in Equation \(\ref{eq:general_quadratic_form}\) to get

\[-\frac{1}{2}\rvx_1^\top \mLambda_{11}\rvx_1, \]

from which we can immediately conclude that the covariance of \(p(\rvx_1 \vert \rvx_2)\) is:

\[\mSigma_{1 \vert 2} = \mLambda_{11}^{-1}. \]

Now for the mean that we were originally after (remember, we needed to find \(\mSigma^{-1}\) first to recover the mean \(\vmu\) from the coefficient of the linear term): if we consider all the terms in Equation \(\ref{eq:general_quadratic_form}\) that are linear in \(\rvx_1\), we get (using the fact that \(\mLambda_{21}^{\top} = \mLambda_{12}\))

\[\rvx_1^\top{\color{MidnightBlue}\left(\mLambda_{11}\vmu_1 - \mLambda_{12}(\rvx_2 - \vmu_2)\right)} \doteq \rvx_1^\top {\color{MidnightBlue}\mSigma^{-1}_{1\vert 2}\vmu_{1\vert 2}}. \]

Hence, left-multiplying the coefficients on both sides by \(\mSigma_{1\vert 2}\) gives us the conditional mean in the form of Equations \(\ref{eq:conditional_mean_precision_1}\) and \(\ref{eq:conditional_mean_precision_2}\) that we were after:

\[\begin{align*} \vmu_{1\vert 2} &= \mSigma_{1 \vert 2}\big(\mLambda_{11}\vmu_1 - \mLambda_{12}(\rvx_2 - \vmu_2)\big) \\ &= \vmu_1 - \mLambda_{11}^{-1}\mLambda_{12}(\rvx_2 - \vmu_2). \end{align*} \]


\(\square\)

Of course, applying the Schur complement identities for the inverse of a partitioned matrix derived previously lets us write the precision matrices \(\mLambda_{11}\) and \(\mLambda_{12}\) in the above in terms of the covariance matrices, recovering the result from the previous section.

One problem that arises in many contexts is the following: suppose that we have two random variables. The first is a Gaussian random variable \(\rvx \in \R^{m}\), that might explain or cause something of interest. The second, \(\rvy \in \R^{d}\), is a Gaussian-noise corrupted observation of the thing of interest. If we also postulate that \(\rvx\) and the thing of interest are related by a linear function, then we have the following setup:

\[\begin{align*} p(\rvx) &= \gN(\rvx\vert \vmu, \mLambda^{-1}) \\ p(\rvy \vert \rvx) &= \gN(\rvy \vert \rmA\rvx + \rvb, \rmL^{-1}), \end{align*} \]

for some \(\rmA\in\R^{d\times m}\), \(\rvb\in\R^{d}\), and covariance \(\rmL^{-1}\) which is independent of \(\rvx\). This is called a linear Gaussian system.

We’re often interested in reversing this relationship: namely to ask, given a noisy observation \(\rvy\) of the thing I’m interested in, what are some probable values of \(\rvx\) that might have caused this observation?

It turns out that the Gaussian is particularly amenable to this sort of analysis, giving a convenient closed-form solution:

Bayes Rule for Linear Gaussian Systems

Given a linear Gaussian system, the conditional \(p(\rvx \vert \rvy)\) is given by

\[\begin{align*} p(\rvx \vert \rvy) &= \gN(\rvx \vert \vmu_{x\vert y}, \mSigma_{x\vert y}) \\ \mSigma_{x\vert y} &= \big(\mLambda + \rmA^\top \rmL\rmA\big)^{-1} \\ \vmu_{x\vert y} &= \mSigma_{x\vert y} \big(\rmA^{\top}\rmL(\rvy - \rvb) + \mLambda \vmu\big) \end{align*} \]

Moreover, the marginal \(p(\rvy)\) is:

\[\begin{align*} p(\rvy) &= \int_{\rvx} p(\rvy \vert \rvx) p(\rvx) d\rvx \\ &= \gN(\rvy \vert \rmA \vmu + \rvb, \rmL^{-1} + \rmA \mLambda^{-1}\rmA^{\top}). \end{align*} \]

Proof idea: Since the prior and likelihood are both Gaussian, the joint \(p(\rvx, \rvy) = p(\rvx)p(\rvy \vert \rvx)\) is also a Gaussian whose mean and covariance parameters can be partitioned into the elements relevant to \(p(\rvx)\) and those relevant for \(p(\rvy \vert \rvx)\). We therefore first want to find the partitioned parameters of this joint.

From there, obtaining the parameters for \(p(\rvy)\) is a trivial case of reading off the relevant terms in the partitioned mean and covariance (see the marginalisation result above), while for \(p(\rvx \vert \rvy)\) we can complete the square and inspect the coefficients, as outlined in the previous section, to find the terms we need.

Proof. We begin by writing out the log of the joint, focusing on terms that depend on \(\rvx\) or \(\rvy\):

\[\begin{align} \log p(\rvx, \rvy) &= \log p(\rvx) + \log p(\rvy \vert \rvx) \nonumber \\ &= -\frac{1}{2}(\rvx - \vmu)^{\top}\mLambda(\rvx - \vmu) \nonumber \\ &\hspace{1em} -\frac{1}{2}(\rvy - \rmA\rvx - \rvb)^{\top}\rmL (\rvy - \rmA\rvx - \rvb) + \mathrm{const} \label{eq:log_joint}. \end{align} \]

We can verify that the joint is indeed Gaussian since this is a quadratic function of the components of \(\begin{bmatrix}\rvx & \rvy\end{bmatrix}^{\top}\). For conciseness later, let us define \(\rvz \doteq \begin{bmatrix}\rvx & \rvy\end{bmatrix}^{\top}\), where \(p(\rvz) = \gN(\rvz\vert \vmu_{\rvz}, \mSigma_{\rvz})\).

We first focus on finding the precision of this joint, which will be the coefficient of the second order term of \(\begin{bmatrix}\rvx & \rvy\end{bmatrix}^{\top}\). Expanding the brackets in the above, while keeping only the second order terms gives us:

\[\begin{align*} -\frac{1}{2}&\rvx^{\top}\big(\mLambda + \rmA^{\top}\rmL\rmA\big)\rvx - \frac{1}{2}\rvy^{\top}\rmL\rvy + \frac{1}{2}\rvy^{\top}\rmL\rmA\rvx + \frac{1}{2}\rvx^{\top}\rmA^{\top}\rmL\rvy \\ &= -\frac{1}{2}\begin{bmatrix}\rvx \\ \rvy \end{bmatrix}^{\top}\underbrace{\begin{bmatrix}\mLambda + \rmA^{\top}\rmL\rmA & -\rmA^{\top}\rmL \\ -\rmL \rmA & \rmL\end{bmatrix}}_{\rmR} \begin{bmatrix}\rvx \\ \rvy \end{bmatrix}. \end{align*} \]

Hence \(\rmR\) is the precision of \(p(\rvx, \rvy)\). To apply our results for the conditional of a joint Gaussian however, we need to invert the precision to find the covariance. Using the formula for the inverse of a partitioned matrix we derived earlier (Equation \(\ref{eq:block_inv_1}\)), we get

\[\begin{align*} \rmR^{-1} &= \begin{bmatrix}\mLambda^{-1} & \mLambda^{-1}\rmA^{\top} \\ \rmA\mLambda^{-1} & \rmL^{-1} + \rmA\mLambda^{-1}\rmA^{\top}\end{bmatrix} \\ &= \mSigma_{\rvz}. \end{align*} \]

To find the partitioned mean of the joint, we identify the linear terms in Equation \(\ref{eq:log_joint}\) giving

\[\rvx^{\top}\mLambda \vmu - \rvx^{\top}\rmA^{\top}\rmL\rvb + \rvy^{\top}\rmL\rvb = \begin{bmatrix}\rvx \\ \rvy\end{bmatrix}^{\top}\begin{bmatrix}\mLambda \vmu - \rmA^{\top}\rmL\rvb \\ \rmL \rvb\end{bmatrix}. \]

Recall that the linear term in the Gaussian exponent of the joint \(p(\rvz)\) is \(\rvz^{\top}\mSigma_{\rvz}^{-1}\vmu_{\rvz}\) (c.f. Equation \(\ref{eq:gaussian_exponent}\)). Matching terms,

\[\begin{align*} \mSigma_{\rvz}^{-1}\vmu_{\rvz} &= \begin{bmatrix}\mLambda \vmu - \rmA^{\top}\rmL\rvb \\ \rmL \rvb\end{bmatrix} \\ \vmu_{\rvz} &= \rmR^{-1} \begin{bmatrix}\mLambda \vmu - \rmA^{\top}\rmL\rvb \\ \rmL \rvb\end{bmatrix} \\ \end{align*} \]

Substituting our value for \(\rmR^{-1}\), we obtain

\[\vmu_{\rvz} = \begin{bmatrix}\vmu \\ \rmA\vmu + \rvb\end{bmatrix}. \]

With the partitioned parameters of the joint in hand, we can now trivially find the parameters of the marginal distribution \(p(\rvy)\) selecting the corresponding partitions:

\[\begin{align*} \E[\rvy] &= \rmA \vmu + \rvb \\ \Cov[\rvy] &= \rmL^{-1} + \rmA \mLambda^{-1}\rmA^{\top}. \end{align*} \]

For the parameters of the conditional \(p(\rvx \vert \rvy)\), we refer to Equations \(\ref{eq:conditional_mean_precision_2}\) and \(\ref{eq:conditional_cov}\), to obtain

\[\begin{align*} \E[\rvx \vert \rvy] &= (\mLambda + \rmA^{\top}\rmL\rmA)^{-1}\big(\rmA^{\top}\rmL(\rvy - \rvb) + \mLambda \vmu\big) \\ \Cov[\rvx \vert \rvy] &= (\mLambda + \rmA^{\top}\rmL\rmA)^{-1}. \end{align*} \]


\(\square\)
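To close, here is a numerical sanity check of the linear Gaussian system results above, assuming NumPy/SciPy and arbitrary example parameters: it verifies Bayes' rule, \(\log p(\rvx \vert \rvy) = \log p(\rvx) + \log p(\rvy \vert \rvx) - \log p(\rvy)\), using the stated posterior and marginal.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(8)
m, d = 3, 2
mu = rng.standard_normal(m)
Lam = np.eye(m) * 2.0                       # prior precision Lambda (example)
A = rng.standard_normal((d, m))
b = rng.standard_normal(d)
L = np.eye(d) * 4.0                         # likelihood precision L (example)

y = rng.standard_normal(d)                  # a (fake) observation

# Posterior and marginal parameters from the boxed result.
Sigma_post = np.linalg.inv(Lam + A.T @ L @ A)
mu_post = Sigma_post @ (A.T @ L @ (y - b) + Lam @ mu)
marg_mean = A @ mu + b
marg_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Bayes' rule should hold at any x: log p(x|y) = log p(x) + log p(y|x) - log p(y).
for x in rng.standard_normal((5, m)):
    lhs = multivariate_normal(mean=mu_post, cov=Sigma_post).logpdf(x)
    rhs = (multivariate_normal(mean=mu, cov=np.linalg.inv(Lam)).logpdf(x)
           + multivariate_normal(mean=A @ x + b, cov=np.linalg.inv(L)).logpdf(y)
           - multivariate_normal(mean=marg_mean, cov=marg_cov).logpdf(y))
    assert np.isclose(lhs, rhs)
```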