

Piecewise Linear Log-Likelihood Estimation

Or “PL3E” for short; a versatile likelihood defined by a product of piecewise-linear log-likelihood functions.

August 13, 2023

London, UK


Consider a density estimation problem for some univariate quantity of interest, \(\ervy \in \R\). That is, given some contextual information \(\rvx \in \R^{d}\), we wish to find the conditional distribution \(p(\ervy \vert \rvx)\).

This is a very general setting. For instance, if a component of \(\rvx\) is time \(t\), then we might interpret a sequence of \((\rvx, \ervy)\) values as a time series, and a \(p(\ervy \vert \rvx_{t_{\star}})\) prediction for some future time \(t_{\star}\) as a probabilistic forecast. Alternatively, \(\rvx\) might be the states (and actions) in a reinforcement learning problem, where \(\ervy\) corresponds to the returns from that state and \(p(\ervy \vert \rvx)\) is the return distribution. Elements of \(\rvx\) might alternatively be latent variables, in which case \(p(\ervy \vert \rvx)\) might be interpreted as the generator / decoder in a latent variable model.

Given the broadness of this problem, it isn’t surprising that a similarly broad set of solutions has been proposed for estimating this distribution. For time series prediction, we might use a Gaussian process, in which case the predictive distribution \(p(\ervy \vert \rvx_{t_{\star}})\) will be a Gaussian. In the distributional RL setting, we might use quantile regression (Dabney et al., 2017), expectile regression (Kostrikov et al., 2021) or a discrete / histogram distribution (Salimans et al., 2016; Bellemare et al., 2017). For the generator in a latent variable model, we might use the VAE approach for parametrising a Gaussian (Kingma et al., 2014). More generally, a number of neural density estimators have been proposed which give parametric density estimates \(\P_{\vphi}(\ervy \vert \rvx)\), of which the VAE is an example. Others include mixture density networks (Bishop, 1994), energy-based models (Teh et al., 2003), GANs (Goodfellow et al., 2014), MADE (Germain et al., 2015), normalising flows (Rezende et al., 2015) and diffusion models (Sohl-Dickstein et al., 2015), among numerous others.

These are all suited to different things. Some differentiators to consider include: the estimator’s effectiveness in high dimensions (as measured by computational cost and the ability to model the high-dimensional distribution itself); its ability to model distributions of varying ‘complexity’ (for instance, attributes such as asymmetry, heavy-tailedness, multi-modality and correlations among elements); whether it defines an exact density or only some statistics of the distribution; and how quickly we can sample from the learned distribution (often a large contributor to sampling speed is whether sampling can proceed in parallel or must be carried out autoregressively).

In this post, I describe an experimental neural density estimation method for estimating arbitrary univariate distributions \(\P(\ervy \vert \rvx)\) (that is, with an arbitrary number of modes, light/heavy tails, asymmetry, etc.). We begin by reviewing the motivating setting of quantile regression, before outlining the main idea in Section 2. We then describe some closed-form properties of the distribution—the partition function, mean and variance, distribution and quantile function—and show a simple regression example using this likelihood. We conclude by discussing a PyTorch implementation.

The proposed method will somewhat resemble quantile regression (Koenker et al., 1978) and the likelihood can be understood as a generalisation thereof. Hence, if only to point out this relationship, we will quickly review QR here. The uninterested reader may skip to the next section for the main method.

Suppose we have a real-valued random variable \(Y\) with distribution function \(F_{Y}(y) = \P(Y \le y)\). For \(\tau \in (0, 1)\), the \(\tau\)th quantile of \(Y\) is the smallest element in the support of the distribution for which the distribution function (CDF) is no less than \(\tau\). For instance, common quantiles include the median, corresponding to \(\tau = 0.5\), the first quartile corresponding to \(\tau = 0.25\), or the first percentile at \(\tau = 0.01\).

More generally, for a chosen quantile level \(\tau \in (0, 1)\), the quantile function \(Q_{Y}: (0, 1) \to \R\), equivalent to the inverse distribution function, is:

\[Q_{Y}(\tau) = F^{-1}_{Y}(\tau) = \inf \{y\ :\ F_{Y}(y) \ge \tau\}. \]

Rather than having a single fixed distribution over \(Y\), we might, as in the setup above, have a vector of covariates, which naturally leads to the conditional quantile function:

\[Q_{Y}(\tau \vert \rvx) = \inf \{y\ :\ \P(Y \le y \mid X = \rvx) \ge \tau\}. \]

Definition (Quantile Regression). For some selected quantile \(\tau \in (0, 1)\), a training dataset \(\gD = \{(\rvx_{i}, \ervy_{i})\}^{N}_{i=1}\) and some model (for our purposes, this will almost always be a neural network, although variants such as Bayesian neural networks and other estimators may be used) of the conditional quantiles \(Q_{Y}(\tau \vert \rvx) = f_{\vtheta}(\rvx; \tau) + \epsilon\), parametrised by \(\vtheta \in \R^{d}\) and with corrupting observation noise \(\epsilon\), the task of quantile regression is to estimate the parameters \(\vtheta\) by minimising the expected quantile regression loss, \(\rho_{\tau}(u)\)

\[\begin{align} \rho_{\tau}(u) &= \begin{cases}u(\tau - 1)&\text{if } u < 0 \\ u\tau &\text{otherwise}\end{cases} \\ &= u\big(\tau - \mathbb{I}(u < 0)\big) \label{eq:qr_loss}, \end{align} \]

for \(u = y_{i} - f_{\vtheta}(\rvx_{i}; \tau)\).
\(\blacksquare\)

In other words, assuming we have i.i.d. data, we wish to find

\[\begin{equation} \label{eq:qr_objective} \vtheta = \argmin_{\hat{\vtheta} \in \Theta} \sum^{N}_{i=1}\rho_{\tau}\big(y_{i} - f_{\hat{\vtheta}}(\rvx_{i}, \tau)\big). \end{equation} \]
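For concreteness, here is a minimal PyTorch sketch of this loss; the names `pinball_loss` and `model` are illustrative rather than any fixed API.

```python
import torch

def pinball_loss(u: torch.Tensor, tau: float) -> torch.Tensor:
    """Quantile regression ('pinball') loss rho_tau(u) = u * (tau - I(u < 0)),
    applied elementwise to residuals u = y - f_theta(x; tau)."""
    return u * (tau - (u < 0).float())

# Example usage (hypothetical `model` and data tensors):
# preds = model(x)                                  # estimates of the tau-th quantile
# loss = pinball_loss(y - preds, tau=0.9).mean()    # the objective above, averaged over the data
```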

Due to the connections to the proposed method later, we will dwell on the form of \(\rho_{\tau}\) a little longer.

In supervised learning, we can change the statistic of the distribution of \(Y\) that is being predicted by our function approximator \(f_{\vtheta}(\rvx)\) by changing the loss function / the likelihood we maximise. Optimising the usual squared error loss / Gaussian likelihood, \(\argmin_{\hat{\vtheta}\in \Theta}\E_{\gD}\left[\big(y - f(\rvx, \hat{\vtheta})\big)^{2}\right]\), leads to an estimator \(f_{\vtheta}\) for the mean. Similarly, when using an absolute deviation loss / Laplace likelihood, \(\argmin_{\hat{\vtheta}\in \Theta}\E_{\gD}\left[\vert y - f(\rvx, \hat{\vtheta})\vert\right]\), we get an estimator for the median, or \(0.5\)th quantile.

This median estimator is equivalent (up to a scaling constant of \(0.5\)) to Equation \(\ref{eq:qr_objective}\) with \(\rho_{0.5}\). To see this, observe that an ‘over-estimate’, \(y_{i} < f_{\vtheta}(\rvx_{i})\), leading to a negative residual \(u = y_{i} - f_{\vtheta}(\rvx_{i}) < 0\), will result in the product \(-0.5 u\) in Equation \(\ref{eq:qr_loss}\); hence implementing a (scaled) absolute error. ‘Under-estimates’ \(y_{i} > f_{\vtheta}(\rvx_{i})\) leading to positive \(u\) will simply be scaled by \(0.5\).

By changing \(\tau\), we optimise a skewed absolute deviations loss / asymmetric Laplace likelihood:

Intuitively, the median is the cut point / element that partitions the support into two sets with equal probabilities: it makes sense that positive and negative residuals are weighted equally. When predicting the \(\tau = 0.75\) quantile however, we want one set to contain \(3/4\) of the probability mass, and the other to contain \(1/4\). Hence, under-estimates (positive residuals) should be three times as costly as over-estimates (negative residuals), and the skewed loss with \(\tau = 0.75\) achieves this. The resulting parameters \(\vtheta\) yield a predicted value \(\hat{y} = f_{\vtheta}(\rvx; 0.75)\) which is smaller than samples from \(Y\) one quarter of the time, and larger than \(Y\) samples three quarters of the time.

Finally, for a more thorough verification that this loss works, we can write the expected quantile regression loss, for some estimate of the \(\tau\)th quantile \(\hat{y} = f_{\vtheta}(\rvx; \tau)\), as:

\[\begin{align*} \E[\rho_{\tau}(Y - \hat{y})] &= \int^{\infty}_{-\infty}\rho_{\tau}(y - \hat{y})dF_{Y}(y) \\ &= (\tau - 1) \int_{-\infty}^{\hat{y}} (y - \hat{y}) dF_{Y}(y) + \tau \int_{\hat{y}}^{\infty}(y - \hat{y})dF_{Y}(y). \end{align*} \]

The above is convex, hence we may differentiate it and solve for the value of \(\hat{y}\) at which the loss is minimised. Proceeding using Leibniz’s rule, we get:

\[\begin{align} \frac{d}{d\hat{y}}\E[\rho_{\tau}(Y - \hat{y})] &= (\tau - 1)\left((\hat{y} - \hat{y}) + \int_{-\infty}^{\hat{y}}\frac{\partial}{\partial\hat{y}}(y - \hat{y})d F_{Y}(y)\right) \\ &+ \tau \left(-(\hat{y} - \hat{y}) + \int_{\hat{y}}^{\infty}\frac{\partial}{\partial\hat{y}}(y - \hat{y})d F_{Y}(y)\right) \nonumber \\ &= (1 - \tau)F_{Y}(\hat{y}) - \tau (1 - F_{Y}(\hat{y})) \\ &= F_{Y}(\hat{y}) - \tau \nonumber, \end{align} \]

and so we verify that \(\hat{y} = F^{-1}_{Y}(\tau)\) as required when the derivative is \(0\).
\(\square\)

As mentioned above, we may view the quantile regression loss minimisation as an analogue to maximising an asymmetric Laplace distribution.

The usual Laplace density, parametrised by a location \(\mu\) and scale \(\sigma\) is

\[\begin{equation} \label{eq:laplace_density} f_{\text{L}}(x; \mu, \sigma) = \frac{1}{2\sigma}\exp\left(- \frac{\vert x-\mu \vert}{\sigma}\right). \end{equation} \]

Maximising this likelihood, \(\vtheta = \argmax_{\hat{\vtheta}}\prod^{N}_{i=1}f_{\text{L}}\big(y_{i}; f_{\hat{\vtheta}}(\rvx_{i}), 1\big)\), where the product comes from the i.i.d. data assumption, and where we have arbitrarily set \(\sigma = 1\), leads to the median estimator \(f_{\vtheta}\).

We may introduce asymmetry into the Laplace distribution with an asymmetry parameter \(\tau \in (0, 1)\) and the following modification to the density (Kozubowski, 1999):

\[\begin{align} f_{\text{ALD}}(x; \mu, \sigma, \tau) &= \frac{\tau(1-\tau)}{\sigma}\exp\left(-\frac{x-\mu}{\sigma}\big(\tau - \mathbb{I}(x - \mu < 0)\big)\right) \\ &\propto \exp \left(-\rho_{\tau}(x - \mu)\right) \label{eq:prop_ald}, \end{align} \]

which we can see is proportional to the negative exponential of the quantile regression loss, \(\rho_{\tau}\).
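As a small sketch, this log-density can be written directly in terms of \(\rho_{\tau}\); `ald_log_density` is a name of my choosing, not an existing API.

```python
import math
import torch

def ald_log_density(x: torch.Tensor, mu: torch.Tensor, sigma: float, tau: float) -> torch.Tensor:
    """Log of the asymmetric Laplace density: log(tau (1 - tau) / sigma) - rho_tau(x - mu) / sigma."""
    u = x - mu
    rho = u * (tau - (u < 0).float())   # the quantile regression loss rho_tau
    return math.log(tau * (1.0 - tau)) - math.log(sigma) - rho / sigma
```

Maximising this log-likelihood in \(\mu\) is, as noted, equivalent to minimising the quantile regression loss.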

Values of \(\tau < 0.5\) skew the Laplace distribution to the right (that is, giving it a heavy right tail), while values of \(\tau > 0.5\) skew the distribution to the left:

Recall that we are interested in characterising the full density of interest, and not merely a few statistics (quantiles). The QR loss above treats \(\tau\) as a fixed hyperparameter of the loss function, thus selecting a single quantile to estimate. However, the full potential of this framework lies not in single-\(\tau\) fits, but in simultaneously finding \(Q_{Y}(\tau \vert \rvx)\), \(\forall \tau \in (0, 1)\). Indeed, if one has the distribution function / quantile function, then this fully describes the distribution.

It is however not trivial to simply specify \(\tau\) as an additional network input, since without adding some additional constraint on the network architecture, there is no guarantee that the network outputs will be monotonic. That is, we would need some mechanism to ensure that \(f_{\vtheta}(\rvx, \tau_{1}) < f_{\vtheta}(\rvx, \tau_{2})\) for all \(\tau_{1} < \tau_{2}\).

To see this, suppose that we naively pick a set of \(M\) different \(\tau\) values of interest, and fit \(M\) separate estimators. This is better than a single \(\tau\)-fit, and still lets us somewhat characterise the distribution at some \(\rvx\). For our dataset \(\gD = \{(\rvx_{i}, \ervy_{i})\}_{i=1}^{N}\) of assumed i.i.d. values, each individual likelihood is:

\[p(\gD \vert \vtheta_{m}, \tau_{m}) = \prod_{i=1}^{N}\frac{\tau_{m}(1-\tau_{m})}{\sigma} \exp \left(-\frac{\rho_{\tau_{m}}(y_{i} - f_{\vtheta_{m}}(\rvx_{i}))}{\sigma}\right), \]

for \(m \in \{1, \ldots, M\}\).

Training \(M = 20\) such estimators on some synthetic data, for \(\tau\) set to the midpoints of \(21\) linearly spaced points between 0 and 1, and plotting each of the \(f_{\vtheta_{m}}(\cdot, \tau_{m})\) estimators on linearly spaced \(x\) values gives the following:

We can see that the individual model outputs are not monotonic in \(\tau\): there is no sharing of information between the individually trained estimators. We might try to improve the state of things with a joint prediction of the \(M\) quantiles using a single model \(\rvy = f_{\vtheta}(\rvx, \vtau)\), where element \(\ervy_{j}\) of the output corresponds to \(\ervy_{j} = Q_{Y}(\tau_{j} \vert \rvx)\). The joint likelihood is now

\[\begin{equation} \label{eq:joint_likelihood} p(\gD \vert \vtheta) = \prod_{i=1}^{N} \prod_{j=1}^{M} \frac{\tau_{j}(1-\tau_{j})}{\sigma}\exp \left(-\frac{\rho_{\tau_{j}}\big(y_{i} - [f_{\vtheta}(\rvx_{i}, \vtau)]_{j}\big)}{\sigma}\right). \end{equation} \]

Repeating the above exercise, this results in slightly more coherent quantile estimates for each test \(\rvx\); however, there is still no mechanism to guarantee that the outputs will be monotonic (indeed, they aren’t everywhere in the plot, with a few lines crossing over each other).

One possible architectural modification to ensure monotonicity is outlined in Unconstrained Monotonic Neural Networks (Wehenkel et al., 2019), where a strictly positive activation (such as a shifted ELU) is used at the penultimate network layer, and the output function value is the integral of this result.

However, such methods can be costly and impose undue restrictions on the function approximator. Further, even if we did solve this problem, we would still only be characterising a few statistics of the distribution, and not a closed form density. This necessitates what Rowland et al. (2019) term an imputation strategy, to map the vector of statistics (e.g. the vector of \(M\) quantile estimates) to a distribution with those statistics for downstream tasks.

Additionally, the assumption of fixed quantiles might also be needlessly restrictive. For instance, it might be sensible to vary the \(\tau_{j}\) locations to place more around regions of interest (for instance around the mode of the distribution, or along one of the tails) as is proposed in implicit quantile networks (Dabney et al., 2018).

We now come to the main idea, with the following definition of the likelihood.

Definition (Piecewise-Linear Log-Likelihood). Let the distribution over the observed dataset be the following factorised density (where we factorise across the \(N\) i.i.d. datapoints, as well as the \(M\) piecewise-linear likelihood components):

\[\begin{equation} \label{eq:pl3e_likelihood} p(\gD \vert \vtheta) = \prod^{N}_{j=1} \frac{1}{Z(\vphi^{(j)})} \prod^{M}_{i=1} \exp\big(\ell_{\vphi_{i}^{(j)}}(y_{j})\big), \end{equation} \]

where \(N\) is the number of training data points, \(M\) is set to some positive integer, \(Z(\vphi^{(j)})\) is a normalising constant (written by analogy to the partition function in an exponential family), the parameters \(\vphi^{(j)} = \{\alpha_{i}, \beta_{i}, f_{i}\}_{i=1}^{M}\) are the outputs of a neural network \(\vphi^{(j)} = f_{\vtheta}(\rvx_{j})\) for the \(j\)th data point, and \(\ell_{\vphi_{i}^{(j)}}\) is a piecewise-linear function parametrised by the \(i\)th three-element subset of \(\vphi^{(j)}\):

\[\begin{equation} \label{eq:log_likelihood_component} \ell_{\vphi_{i}^{(j)}}(y) \doteq -(y - f_{i})\big(\alpha_{i}\Theta(y - f_{i}) - \beta_{i}\Theta(f_{i} - y)\big), \end{equation} \]

where \(\Theta\) denotes the Heaviside step function.
\(\blacksquare\)

For clarity, let’s omit the factorisation over the entire dataset, and just consider the form of the log-likelihood for a single \((\rvx, y)\) datapoint:

\[\begin{equation} \label{eq:log_likelihood} \log \P(y \vert \rvx) = \sum_{i=1}^{M}\ell_{\vphi_{i}}(y) - \log Z(\vphi),\hspace{2em} \text{for } \vphi = f_{\vtheta}(\rvx). \end{equation} \]
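As a minimal sketch (assuming batched parameters of shape \((\ldots, M)\), and deferring the \(-\log Z(\vphi)\) term to the following section), the sum of components can be computed as follows; `unnormalised_log_density` is an illustrative name.

```python
import torch

def unnormalised_log_density(y: torch.Tensor, f: torch.Tensor,
                             alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """sum_i ell_{phi_i}(y), i.e. the log-density above before subtracting log Z(phi).
    Shapes: y (...,); f, alpha, beta (..., M)."""
    u = y.unsqueeze(-1) - f                                        # y - f_i
    ell = -u * (alpha * (u > 0).float() - beta * (u < 0).float())  # the component ell_{phi_i}(y)
    return ell.sum(-1)
```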

Comparing the negative log likelihood components \(-\ell_{\phi}(y)\) to the quantile regression loss of Equation \(\ref{eq:qr_loss}\), visualised in Section 1.1, we see that our loss function (components) generalise the quantile regression loss in that:

  • the left and right gradients needn’t sum to 1
  • the discontinuity in the loss needn’t be centred at 0
  • we have multiple of them in the loss.

Here are \(M=3\) arbitrary negative log-likelihood / loss components for some randomly selected \(\vphi = \{f_{i}, \alpha_{i}, \beta_{i}\}_{i=1}^M\).

Exponentiating, the individual likelihood components resemble asymmetric ‘Laplace-like’ distributions:

We have written the normalising term in Equation \(\ref{eq:log_likelihood}\) as a log partition function in a nod to the exponential family. Indeed, at first glance it seems like this likelihood should be part of this family which would afford us some convenient properties. However due to the variable \(f_{i}\) terms, we cannot express the term inside the exponent as an inner product between some sufficient statistics and natural parameters. One solution might be to fix the \(f_{i}\) values according to some spacing or schedule, which we might update using some EM procedure. I’ll leave this for future work.

To ensure that the above is a valid likelihood, we must compute the normalising term, \(Z(\vphi)\). With some constraints on the values of the parameters \(\valpha\) and \(\vbeta\), as well as an assumption that the \(f_{i}\) are ordered; that is \(f_{i} \le f_{i+1}\), we can compute this normalising function in closed form.

We proceed by writing the log-density in Equation \(\ref{eq:log_likelihood}\), which we denote \(\gL \doteq \log \P(y \vert \rvx)\), in piecewise form before integrating. Observe that we have \(M+1\) segments, which we may handle separately. The log-likelihood on the \(j\)th segment, where \(y \in (f_{j}, f_{j+1}]\), is found by summing the individual piecewise-linear components evaluated at such a \(y\):

\[\begin{align} \gL(f_{j} < y \le f_{j+1}) &= -\sum_{i=1}^{j}\alpha_{i}(y - f_{i}) + \sum_{i=j+1}^{M}\beta_{i}(y - f_{i}) - \log Z(\vphi) \nonumber \\ &= \left(\sum_{i=j+1}^{M}\beta_{i} - \sum_{i=1}^{j}\alpha_{i}\right)y + \left(\sum_{i=1}^{j}\alpha_{i}f_{i} - \sum_{i=j+1}^{M}\beta_{i}f_{i}\right) - \log Z(\vphi) \nonumber \\ &\doteq a_{j}y + b_{j} - \log Z(\vphi), \end{align} \]

where for notational brevity, we have defined the coefficients for the \(j\)th interval (the terms in the parentheses) as \(a_{j}\) and \(b_{j}\).

We must also consider the leftmost segment, where \(y \in (-\infty, f_{1}]\)

\[\begin{align} \gL(-\infty < y \le f_{1}) &= \sum_{i=1}^{M}\beta_{i}(y - f_{i}) - \log Z(\vphi) \nonumber \\ &= \underbrace{\left(\sum_{i=1}^{M}\beta_{i}\right)}_{a_{0}}y + \underbrace{\left(-\sum_{i=1}^{M}\beta_{i}f_{i}\right)}_{b_{0}} - \log Z(\vphi) \label{eq:leftmost_segment}, \end{align} \]

as well as the rightmost segment for \(y \in (f_{M}, \infty)\)

\[\begin{align} \gL(f_{M} < y < \infty) &= -\left(\sum_{i=1}^{M}\alpha_{i}(y - f_{i})\right) - \log Z(\vphi) \nonumber \\ &= \underbrace{\left(-\sum_{i=1}^{M}\alpha_{i}\right)}_{a_{M}}y + \underbrace{\left(\sum_{i=1}^{M}\alpha_{i}f_{i}\right)}_{b_{M}} - \log Z(\vphi) \label{eq:rightmost_segment}, \end{align} \]

where we can see that \(a_{0}\), \(b_{0}\) as well as \(a_{M}\) and \(b_{M}\) are just special cases of the definitions of \(a_{j}\) and \(b_{j}\) given above.

We may now find the normalising term of a distribution parametrised by \(\vphi = \{\vf, \valpha, \vbeta\}\) by summing the integrals of the PDF between each \(f_{j}\), giving special consideration to the integrals at the endpoints from \(-\infty\) to \(f_{1}\), as well as from \(f_{M}\) to \(\infty\).

Beginning with the simpler case where \(f_{j} < y \le f_{j+1}\), we have

\[\begin{align} \int_{f_{j}}^{f_{j+1}}dy\ e^{a_{j}y + b_{j}} &= \left[\frac{1}{a_{j}}e^{a_{j}y + b_{j}}\right]_{f_{j}}^{f_{j+1}} \nonumber \\ &= \frac{1}{a_{j}}e^{b_{j}}\left(e^{a_{j}f_{j+1}} - e^{a_{j}f_{j}}\right). \end{align} \]

Note

It is possible, although hopefully unlikely, that some \(a_{j} = 0\). For non-probabilistic function approximators \(f_{\vtheta}(\rvx)\), we may resolve this by adding some small dithering term \(\epsilon \sim \gN(0, 10^{-n})\) for \(n \gg 0\). If we have a probabilistic model, e.g. a BNN, then we can try to re-sample \(\vtheta\) and get a new \(a_{j}\).

For the first edge case where \(y \in (-\infty, f_{1}]\), we have

\[\begin{align} \lim_{\xi \to -\infty} \int_{\xi}^{f_{1}}dy\ e^{a_{0}y + b_{0}} &= \lim_{\xi \to -\infty}\left[\frac{1}{a_{0}}e^{a_{0}y + b_{0}}\right]_{\xi}^{f_{1}} \nonumber \\ &= \lim_{\xi \to -\infty}\frac{1}{a_{0}}e^{b_{0}}\left(e^{a_{0}f_{1}} - e^{a_{0}\xi}\right) \nonumber \\ &= \frac{1}{a_{0}}e^{a_{0}f_{1} + b_{0}} \label{assumption:1} \tag{A1}, \end{align} \]

where the final equality above (\ref{assumption:1}) holds so long as \(a_{0} > 0\). In order to satisfy this condition, recall that in the evaluation of the leftmost segment above (Equation \(\ref{eq:leftmost_segment}\)) we had \(a_{0} = \sum_{i=1}^{M}\beta_{i}\). In particular, observe that \(\beta_{1}\) only features in the calculations for \(a_{0}\) and \(b_{0}\). Hence, we may treat \(\beta_{1}\) as a ‘free parameter’ which we use to ensure that \(a_{0} > 0\). Writing the constraint as \(\beta_{1} + \sum_{i=2}^{M}\beta_{i} > 0\), we can therefore set \(\beta_{1} = -\sum_{i=2}^{M}\beta_{i} + \epsilon^{+}\), for some small positive \(\epsilon^{+}\) ensuring numerical stability.

Similarly, for the segment where \(y \in (f_{M}, \infty)\), we have

\[\begin{align} \lim_{\xi \to \infty}\int_{f_{M}}^{\xi}dy\ e^{a_{M}y + b_{M}} &= \lim_{\xi\to\infty}\left[\frac{1}{a_{M}}e^{a_{M}y + b_{M}}\right]_{f_{M}}^{\xi} \nonumber \\ &= \lim_{\xi \to \infty}\frac{1}{a_{M}}e^{b_{M}}\left(e^{a_{M}\xi} - e^{a_{M}f_{M}}\right) \nonumber \\ &= -\frac{1}{a_{M}}e^{a_{M}f_{M} + b_{M}} \label{assumption:2} \tag{A2}, \end{align} \]

with the assumption in (\ref{assumption:2}) now being that \(a_{M} < 0\). As above, from Equation \ref{eq:rightmost_segment} we observe that \(a_{M} = -\sum^{M}_{i=1} \alpha_{i}\), with \(\alpha_{M}\) now being the ‘free parameter’, since it appears only in the calculations for \(a_{M}\) and \(b_{M}\). Setting \(\alpha_{M} = -\sum_{i=1}^{M-1}\alpha_{i} + \epsilon^{+}\) makes the assumption hold.
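Putting these two constraints together with the ordering assumption on the \(f_{i}\), one possible (purely illustrative) way of mapping raw network outputs to valid parameters is sketched below; the name `constrain` and the choice of sorting to enforce the ordering are mine.

```python
import torch

def constrain(raw_f: torch.Tensor, raw_alpha: torch.Tensor,
              raw_beta: torch.Tensor, eps: float = 1e-3):
    """Map unconstrained outputs (each of shape (..., M)) to parameters with
    f_1 <= ... <= f_M, a_0 = eps > 0 and a_M = -eps < 0."""
    f, _ = torch.sort(raw_f, dim=-1)                              # ordered f_i
    beta_1 = eps - raw_beta[..., 1:].sum(-1, keepdim=True)        # beta_1 = -sum_{i>=2} beta_i + eps
    alpha_M = eps - raw_alpha[..., :-1].sum(-1, keepdim=True)     # alpha_M = -sum_{i<M} alpha_i + eps
    beta = torch.cat([beta_1, raw_beta[..., 1:]], dim=-1)
    alpha = torch.cat([raw_alpha[..., :-1], alpha_M], dim=-1)
    return f, alpha, beta
```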

We may now explicitly write down the normalising term / partition function \(Z(\vphi)\) as the sum of these integrals:

\[\begin{align} Z(\vphi) &= \int_{-\infty}^{f_{1}}dy\ e^{a_{0}y+b_{0}} + \sum_{j=1}^{M-1}\int_{f_{j}}^{f_{j+1}}dy\ e^{a_{j}y + b_{j}} + \int_{f_{M}}^{\infty}dy\ e^{a_{M}y + b_{M}} \\ &= \frac{1}{a_{0}}e^{a_{0}f_{1} + b_{0}} + \sum_{j=1}^{M-1}\frac{1}{a_{j}}e^{b_{j}}\left(e^{a_{j}f_{j+1}} - e^{a_{j}f_{j}}\right) - \frac{1}{a_{M}}e^{a_{M}f_{M} + b_{M}}. \end{align} \]

Implementation Note

To implement this efficiently, observe that the coefficients \(a_{j}\) and \(b_{j}\) are all built from the same underlying partial sums. Hence a simple dynamic programming scheme which pre-computes these partial sums reduces the time to compute the normalising term from \(O(M^{2})\) for a naive implementation to \(O(M)\).

That said, while this is nice and all, the size of \(M\) is never very large, and on modern computers this optimisation is rather inconsequential, particularly when compared to the cost of running the big neural network \(f_{\vtheta}(\rvx)\) that produces \(\vphi\).
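Still, for completeness, here is a sketch of this partial-sum idea in PyTorch; the helper names `coefficients` and `log_partition` are my own, and a careful implementation would do the final accumulation in log-space (e.g. via `torch.logsumexp`) for numerical stability.

```python
import torch

def coefficients(f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor):
    """All a_j, b_j for j = 0..M in O(M) via partial sums. Inputs have shape (..., M)."""
    zeros = f.new_zeros(f.shape[:-1] + (1,))
    csum_alpha = torch.cat([zeros, alpha.cumsum(-1)], dim=-1)           # sum_{i<=j} alpha_i
    csum_alpha_f = torch.cat([zeros, (alpha * f).cumsum(-1)], dim=-1)   # sum_{i<=j} alpha_i f_i
    csum_beta = torch.cat([zeros, beta.cumsum(-1)], dim=-1)
    csum_beta_f = torch.cat([zeros, (beta * f).cumsum(-1)], dim=-1)
    rsum_beta = beta.sum(-1, keepdim=True) - csum_beta                  # sum_{i>j} beta_i
    rsum_beta_f = (beta * f).sum(-1, keepdim=True) - csum_beta_f        # sum_{i>j} beta_i f_i
    a = rsum_beta - csum_alpha
    b = csum_alpha_f - rsum_beta_f
    return a, b                                                         # both of shape (..., M+1)

def log_partition(f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """log Z(phi), summing the per-segment integrals derived above."""
    a, b = coefficients(f, alpha, beta)
    aj, bj = a[..., 1:-1], b[..., 1:-1]                                 # interior segments j = 1..M-1
    interior = (torch.exp(bj) * (torch.exp(aj * f[..., 1:])
                                 - torch.exp(aj * f[..., :-1])) / aj).sum(-1)
    left = torch.exp(a[..., 0] * f[..., 0] + b[..., 0]) / a[..., 0]
    right = -torch.exp(a[..., -1] * f[..., -1] + b[..., -1]) / a[..., -1]
    return torch.log(left + interior + right)
```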

One of the benefits of this simple piecewise linear log-likelihood is that the integrals under the resulting distribution are relatively easy to compute. This allows us to find closed form expressions for various moments of the distribution.

For instance, the mean can be found by \(\E_{Y}[y] = \int_{-\infty}^{\infty}y f_{Y}(y) dy\). This integral is relatively long and un-enlightening (I have put it in the appendix). The expression for the mean comes to:

\[\begin{align} \E_{Y}[y] &= \int_{-\infty}^{\infty}y f_{Y}(y)dy \nonumber \\ &= \frac{1}{Z(\vphi)}\Bigg(\frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a^{2}_{0}} \nonumber \\ &+\sum_{j=1}^{M-1}\frac{e^{b_{j}}\big(e^{a_{j}f_{j+1}}(a_{j}f_{j+1}-1) - e^{a_{j}f_{j}}(a_{j}f_{j}-1)\big)}{a^{2}_{j}} \nonumber \\ &-\frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}}\Bigg). \label{eq:mean} \end{align} \]

The variance, obtained from the second moment of the distribution through a similarly long integral which is also in the appendix, comes to

\[\begin{align} \Var(Y) &= \int_{-\infty}^{\infty}y^{2}f_{Y}(y)dy - \E_{Y}[y]^{2}\nonumber \\ &= \frac{1}{Z(\vphi)}\Bigg(\frac{e^{a_{0}f_{1} + b_{0}}\big(a_{0}^{2}f_{1}^{2} - 2a_{0}f_{1} + 2\big)}{a_{0}^{3}} \nonumber \\ &+ \sum_{j=1}^{M-1} \frac{e^{b_{j}}\big(e^{a_{j}f_{j+1}}(a_{j}^{2}f_{j+1}^{2} - 2a_{j}f_{j+1} + 2) - e^{a_{j}f_{j}}(a^{2}_{j}f^{2}_{j} - 2a_{j}f_{j} + 2)\big)}{a^{3}_{j}} \nonumber \\ &-\frac{e^{a_{M}f_{M}+b_{M}}\big(a^{2}_{M}f^{2}_{M} - 2a_{M}f_{M} + 2\big)}{a^{3}_{M}}\Bigg) - \E_{Y}[y]^{2}. \end{align} \]
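Assuming the `coefficients` and `log_partition` helpers sketched earlier, these two moments translate fairly directly into code:

```python
import torch

def mean_and_variance(f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor):
    """Closed-form E[Y] and Var(Y) from the two displays above."""
    a, b = coefficients(f, alpha, beta)
    Z = torch.exp(log_partition(f, alpha, beta))

    def g1(a_, x):  # antiderivative pattern for E[Y]:   (a x - 1) / a^2
        return (a_ * x - 1) / a_**2

    def g2(a_, x):  # antiderivative pattern for E[Y^2]: (a^2 x^2 - 2 a x + 2) / a^3
        return (a_**2 * x**2 - 2 * a_ * x + 2) / a_**3

    a0, b0, f1 = a[..., 0], b[..., 0], f[..., 0]
    aM, bM, fM = a[..., -1], b[..., -1], f[..., -1]
    aj, bj = a[..., 1:-1], b[..., 1:-1]
    fj, fj1 = f[..., :-1], f[..., 1:]

    def moment(g):
        left = torch.exp(a0 * f1 + b0) * g(a0, f1)
        interior = (torch.exp(bj) * (torch.exp(aj * fj1) * g(aj, fj1)
                                     - torch.exp(aj * fj) * g(aj, fj))).sum(-1)
        right = -torch.exp(aM * fM + bM) * g(aM, fM)
        return (left + interior + right) / Z

    m1, m2 = moment(g1), moment(g2)
    return m1, m2 - m1**2
```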

We can use this likelihood to do regression. For example, consider the following data from Fanaee-T et al. (2013) which shows the number of daily rentals for a bike sharing scheme.

We can observe a number of fluctuations due to seasonality, along with a broader upward trend, with differences in variance throughout. While the distribution appears to be mostly unimodal, it provides a reasonably interesting application.

Fitting a small neural network using maximum likelihood results in the following:

In the above figure, we compute the distribution for \(100\) equally spaced dates along the x axis. We plot the mean of these distributions, along with the 0.05th, 0.15th, 0.25th and 0.35th quantiles in the shaded regions to characterise the shape of the distribution at each point. The equations for the mean and quantiles (inverse CDF) are derived below.

Notably, the learned distributions do a good job of placing probability mass away from the mean to account for the probability of outlier rental counts; for instance on sunny days or when there’s an event in town. It should not be surprising that the distributions show heavy tails, due to the likelihood’s similarity to the Laplace distribution.

We can use the inverse of the distribution function (i.e. the quantile function) to map samples drawn from the unit interval \(U[0, 1]\) onto the support of the distribution. This method of drawing samples from a distribution is known as inverse transform sampling.

To do inverse transform sampling with our proposed density, we must first find an expression for the distribution function \(F_{Y}(y)\), which we can then invert to find the quantile function \(F^{-1}_{Y}(\tau) = Q_{Y}(\tau)\).

Recall that a distribution function is found by integrating the density from \(-\infty\) to \(y\):

\[F_{Y}(y) = \int_{-\infty}^{y}dt\ f_{Y}(t). \]

Since we have already evaluated this integral when computing the expression for the normalising constant, \(Z(\vphi)\), we may re-use the integrals under the segments \((f_{j}, f_{j+1})\) where \(f_{j+1} < y\). We must then add the integral for the last (partial) segment—noting that this may in fact be the first segment if \(y < f_{1}\).

More formally, let the integer \(K\) for some evaluation point \(y\) be the index of the segment containing \(y\):

\[K \doteq \max \left( \{0\} \cup \{\hat{K} \in \{1, \ldots, M\}\ :\ f_{\hat{K}} \le y\}\right). \]

Now,

\[\begin{align} F_{Y}(y) &= \int_{-\infty}^{y}dt\ f_{Y}(t)\nonumber \\ &= \frac{1}{Z(\vphi)} \begin{cases} \int_{-\infty}^{y}dt\ e^{a_{0}t + b_{0}} &\text{if } y < f_{1} \\ \int_{-\infty}^{f_{1}}dt\ e^{a_{0}t + b_{0}} + \sum_{j=1}^{K-1}\int_{f_{j}}^{f_{j+1}}dt\ e^{a_{j}t + b_{j}} + \int_{f_{K}}^{y}dt\ e^{a_{K}t + b_{K}}& \text{otherwise} \end{cases} \nonumber \\ &= \frac{1}{Z(\vphi)} \begin{cases} \frac{1}{a_{0}}e^{a_{0}y + b_{0}} &\text{if } y < f_{1} \\ \frac{1}{a_{0}}e^{a_{0}f_{1} + b_{0}} + \sum_{j=1}^{K-1}\frac{1}{a_{j}}e^{b_{j}}\big(e^{a_{j}f_{j+1}} - e^{a_{j}f_{j}}\big) + \frac{1}{a_{K}}e^{b_{K}}\big(e^{a_{K}y} - e^{a_{K}f_{K}}\big) &\text{otherwise} \end{cases} \nonumber \\ &= \frac{1}{Z(\vphi)}\Bigg(A + \sum_{j=1}^{K-1}\frac{1}{a_{j}}e^{b_{j}}\big(e^{a_{j}f_{j+1}} - e^{a_{j}f_{j}}\big) + \frac{1}{a_{K}}e^{b_{K}}\big(e^{a_{K}y} - e^{a_{K}f_{K + \mathbb{I}[K=0]}}\big)\Bigg) \label{eq:CDF} \end{align} \]

where we have defined \(A \doteq \frac{1}{a_{0}}e^{a_{0}f_{1} + b_{0}}\) for notational brevity. The terms \(a_{j}\) and \(b_{j}\) are defined for \(j \in \{0, \ldots, M\}\) as they were in the previous section; for completeness, these were

\[\begin{align*} a_{j} &= \sum_{i=j+1}^{M}\beta_{i} - \sum_{i=1}^{j}\alpha_{i} \\ b_{j} &= \sum_{i=1}^{j}\alpha_{i}f_{i} - \sum_{i = j + 1}^{M}\beta_{i}f_{i}. \end{align*} \]
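As a sketch, and again re-using the hypothetical `coefficients` and `log_partition` helpers from earlier, the cumulative sum `S` below plays the role of the \(\sum_{j=1}^{K-1}\) term in Equation \ref{eq:CDF}:

```python
import torch

def cdf(y: torch.Tensor, f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """F_Y(y) from the display above. Shapes: y (...,); f, alpha, beta (..., M)."""
    a, b = coefficients(f, alpha, beta)                   # (..., M+1)
    Z = torch.exp(log_partition(f, alpha, beta))
    # Segment index K containing y (K = 0 when y < f_1).
    K = (f <= y.unsqueeze(-1)).sum(-1)
    # S[k] = sum_{j=1}^{k} (1/a_j) e^{b_j} (e^{a_j f_{j+1}} - e^{a_j f_j}), for k = 0..M-1.
    aj, bj = a[..., 1:-1], b[..., 1:-1]
    interior = torch.exp(bj) * (torch.exp(aj * f[..., 1:]) - torch.exp(aj * f[..., :-1])) / aj
    S = torch.cat([f.new_zeros(f.shape[:-1] + (1,)), interior.cumsum(-1)], dim=-1)
    A = torch.exp(a[..., 0] * f[..., 0] + b[..., 0]) / a[..., 0]
    idx = (K - 1).clamp(min=0)                            # K - 1, or 0 when K = 0
    aK = a.gather(-1, K.unsqueeze(-1)).squeeze(-1)
    bK = b.gather(-1, K.unsqueeze(-1)).squeeze(-1)
    fK = f.gather(-1, idx.unsqueeze(-1)).squeeze(-1)      # f_{K + I[K=0]}
    SK = S.gather(-1, idx.unsqueeze(-1)).squeeze(-1)      # sum_{j=1}^{K-1}, zero when K <= 1
    last = torch.exp(bK) * (torch.exp(aK * y) - torch.exp(aK * fK)) / aK
    return (A + SK + last) / Z
```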

Inverting the CDF in Equation \ref{eq:CDF} gives us an analytical expression for the quantile function. We will define a new summation limit (previously \(K\)) for some input point \(\tau \in [0, 1]\) as:

\[L \doteq \max \left(\{0\} \cup \{\hat{L} \in \{1, \ldots, M\}\ :\ F_{Y}(f_{\hat{L}}) \le \tau \}\right). \]

Now inverting the CDF gives us the quantile function:

\[\begin{equation} \label{eq:ICDF} F^{-1}_{Y}(\tau) = \frac{\log\left(\left(\tau Z(\vphi) - A - \sum_{j=1}^{L-1}\frac{1}{a_{j}}e^{b_{j}}\big(e^{a_{j}f_{j+1}} - e^{a_{j}f_{j}}\big)\right)a_{L}e^{-b_{L}} + e^{a_{L}f_{L + \mathbb{I}[L=0]}}\right)}{a_{L}}. \end{equation} \]

As usual, the normal constraints pertaining to invertibility persist: if the distribution function \(F_{Y}\) is not strictly increasing at some point (i.e. the distribution has density \(0\) there) then it is not invertible. We may also get a zero-valued coefficient \(a_{j}\) for \(j \le L\) if the number of components \(M\) is large compared to the complexity of the distribution being modelled. As before, these cases are relatively uncommon and we can handle them at runtime by adding some small perturbation \(\epsilon \sim \gN(0, 10^{-n})\), \(n \gg 0\).
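Here is a corresponding sketch of the quantile function: the segment index \(L\) is found by evaluating the CDF at the knots \(f_{k}\), and mapping uniform draws through the result implements the inverse transform sampling used below. As before, `coefficients` and `log_partition` are the hypothetical helpers sketched earlier.

```python
import torch

def icdf(tau: torch.Tensor, f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Quantile function F_Y^{-1}(tau). Shapes: tau (...,); f, alpha, beta (..., M)."""
    a, b = coefficients(f, alpha, beta)
    Z = torch.exp(log_partition(f, alpha, beta))
    aj, bj = a[..., 1:-1], b[..., 1:-1]
    interior = torch.exp(bj) * (torch.exp(aj * f[..., 1:]) - torch.exp(aj * f[..., :-1])) / aj
    S = torch.cat([f.new_zeros(f.shape[:-1] + (1,)), interior.cumsum(-1)], dim=-1)
    A = torch.exp(a[..., 0] * f[..., 0] + b[..., 0]) / a[..., 0]
    # F_Y(f_k) = (A + S_{k-1}) / Z for k = 1..M locates the segment index L.
    F_at_knots = (A.unsqueeze(-1) + S) / Z.unsqueeze(-1)
    L = (F_at_knots <= tau.unsqueeze(-1)).sum(-1)
    idx = (L - 1).clamp(min=0)
    aL = a.gather(-1, L.unsqueeze(-1)).squeeze(-1)
    bL = b.gather(-1, L.unsqueeze(-1)).squeeze(-1)
    fL = f.gather(-1, idx.unsqueeze(-1)).squeeze(-1)      # f_{L + I[L=0]}
    SL = S.gather(-1, idx.unsqueeze(-1)).squeeze(-1)
    return torch.log((tau * Z - A - SL) * aL * torch.exp(-bL) + torch.exp(aL * fL)) / aL

# Inverse transform sampling: samples = icdf(torch.rand(f.shape[:-1]), f, alpha, beta)
```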

With this closed form expression for the quantile function in hand, we can proceed to sample values from the distribution using inverse transform sampling. The figure below shows a fitted density on multi-modal test data:

Here is the corresponding density function:

Drawing samples \(\vtau \sim U[0, 1]\) and mapping these through the quantile function in Equation \ref{eq:ICDF} allows us to sample from the distribution:

Finally, for some applications it may be convenient to find the mode(s) of the distribution. With the PL3E distribution, these will always appear at one of the \(f_{j}\) locations. In particular, we may identify a mode wherever the density at the neighbouring points \(f_{j-1}\) and \(f_{j+1}\) is smaller than at \(f_{j}\). In other words, \(f_{j}\) can be identified as a mode if

\[\begin{align*} f_{Y}(f_{j}) &> f_{Y}(f_{j-1}) \\ a_{j}f_{j} + b_{j} &> a_{j-1}f_{j-1} + b_{j-1} \end{align*} \]

and

\[\begin{align*} f_{Y}(f_{j}) &> f_{Y}(f_{j+1}) \\ a_{j}f_{j} + b_{j} &> a_{j+1}f_{j+1} + b_{j+1}, \end{align*} \]

where we arrive at the above by taking the logarithm (a monotonic function). Note that for the endpoints \(f_{1}\) and \(f_{M}\), we only need the second and first of these comparisons to hold, respectively, since there is no \(f_{0}\) or \(f_{M+1}\) to compare against and the outer slopes satisfy \(a_{0} > 0\) and \(a_{M} < 0\) by construction.
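In code, and again assuming the `coefficients` helper from earlier, this amounts to comparing the (unnormalised) log-densities \(a_{j}f_{j} + b_{j}\) at neighbouring knots, with the endpoints only needing their single available comparison.

```python
import torch

def mode_mask(f: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Boolean mask over the knots f_j (shape (..., M)) indicating which are modes."""
    a, b = coefficients(f, alpha, beta)
    logd = a[..., 1:] * f + b[..., 1:]        # unnormalised log f_Y(f_j), j = 1..M
    true_pad = torch.ones_like(logd[..., :1], dtype=torch.bool)
    higher_than_left = torch.cat([true_pad, logd[..., 1:] > logd[..., :-1]], dim=-1)
    higher_than_right = torch.cat([logd[..., :-1] > logd[..., 1:], true_pad], dim=-1)
    return higher_than_left & higher_than_right
```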

The figure below shows the location of the modes on a density fitted to synthetic data.

Here we evaluate the integral \(\E_{Y}[y] = \int_{-\infty}^{\infty}y f_{Y}(y)\ dy\), where (up to the normalising constant \(Z(\vphi)\)) the density function is of the form

\[f_{Y}(y) = e^{a_{j}y + b_{j}}, \]

with \(j \in \{0, \ldots, M\}\) such that \(f_{j} < y \le f_{j+1}\), using the conventions \(f_{0} = -\infty\) and \(f_{M+1} = \infty\).

We begin by evaluating the indefinite integral \(\int x e^{ax + b}\ dx\). Letting \(u = ax + b\), we have

\[\begin{align*} \int x e^{ax + b}dx &= \frac{1}{a}\int \frac{u - b}{a}e^{u}du \\ &= \frac{1}{a^{2}}\int e^{u}(u-b)du \\ &= \frac{1}{a^{2}}\int e^{u}u - be^{u}du \\ &= \frac{1}{a^{2}}\int e^{u}u\ du - \frac{b}{a^{2}}\int e^{u}du. \end{align*} \]

We can integrate the first term by parts, setting \(f = u\) and \(dg = e^{u}\,du\). Now,

\[\begin{align*} \frac{1}{a^{2}} \int e^{u}u\ du - \frac{b}{a^{2}} \int e^{u}du &= \frac{1}{a^{2}}\left(e^{u}u - \int e^{u}\ du\right) - \frac{b}{a^{2}}\int e^{u} du \\ &= \frac{e^{u}u}{a^{2}} + \left(-\frac{b}{a^{2}} - \frac{1}{a^{2}}\right)\int e^{u}\ du \\ &= \frac{e^{u}u}{a^{2}} - \frac{(b+1)e^{u}}{a^{2}} + c \\ &= \frac{e^{ax + b}(ax + b)}{a^{2}} - \frac{(b+1)e^{ax + b}}{a^{2}} + c \\ &= \frac{e^{ax + b}(ax - 1)}{a^{2}} + c. \end{align*} \]
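As a quick sanity check of this antiderivative, we can differentiate it symbolically and confirm that we recover the integrand (a sketch using sympy):

```python
import sympy as sp

x, a, b = sp.symbols('x a b', real=True)
antiderivative = sp.exp(a * x + b) * (a * x - 1) / a**2
# d/dx of the claimed antiderivative should equal the integrand x * e^{a x + b}.
assert sp.simplify(sp.diff(antiderivative, x) - x * sp.exp(a * x + b)) == 0
```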

Having found this indefinite integral for individual segments of the distribution, we can now evaluate the definite integral over each segment. As previously, we consider three cases: for the first, where \(-\infty < y < f_{1}\), we have

\[\begin{align*} \lim_{\xi \to -\infty}\int_{\xi}^{f_{1}}y f_{Y}(y)\ dy &= \lim_{\xi \to -\infty}\int_{\xi}^{f_{1}} y e^{a_{0}y + b_{0}}dy \\ &= \lim_{\xi \to -\infty}\left[\frac{e^{a_{0}y + b_{0}}(a_{0}y - 1)}{a^{2}_{0}}\right]_{\xi}^{f_{1}} \\ &= \lim_{\xi \to -\infty} \frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a^{2}_{0}} - \frac{e^{a_{0}\xi + b_{0}}(a_{0}\xi - 1)}{a^{2}_{0}} \\ &= \frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a^{2}_{0}} - \lim_{\xi \to -\infty}\left(\frac{e^{a_{0}\xi + b_{0}}a_{0}\xi}{a^{2}_{0}} - \frac{e^{a_{0}\xi + b_{0}}}{a^{2}_{0}}\right) \\ &= \frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a^{2}_{0}} - \lim_{\xi \to -\infty}\frac{\xi}{e^{-(a_{0}\xi + b_{0})}a_{0}}, \end{align*} \]

where the final step above follows from the fact that \(a_{0} > 0\). Now, by L’Hôpital’s rule,

\[\lim_{\xi \to -\infty} \frac{\frac{d}{d\xi}\xi}{\frac{d}{d\xi}\left(e^{-(a_{0}\xi + b_{0})}a_{0}\right)} = \lim_{\xi \to -\infty} \frac{1}{-a_{0}^{2}e^{-(a_{0}\xi + b_{0})}} = 0, \]

since, once again, \(a_{0} > 0\). Hence, for the first integral, we get

\[\begin{equation} \label{eq:first_mean_integral} \lim_{\xi \to -\infty}\int_{\xi}^{f_{1}}y f_{Y}(y) dy = \frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a_{0}^{2}}. \end{equation} \]

For the second case where \(f_{j} \le y < f_{j+1}\) for some \(j \in \{1, \ldots, M-1\}\), we have

\[\begin{align*} \int_{f_{j}}^{f_{j+1}}y f_{Y}(y)dy &= \left[\frac{e^{a_{j}y + b_{j}}(a_{j}y - 1)}{a^{2}_{j}}\right]^{f_{j+1}}_{f_{j}} \\ &= \frac{e^{a_{j}f_{j+1} + b_{j}}(a_{j}f_{j+1} - 1)}{a^{2}_{j}} - \frac{e^{a_{j}f_{j} + b_{j}}(a_{j}f_{j} - 1)}{a_{j}^{2}} \\ &= \frac{e^{b_{j}}\big(e^{a_{j}f_{j+1}}(a_{j}f_{j+1} - 1) - e^{a_{j}f_{j}}(a_{j}f_{j} - 1)\big)}{a^{2}_{j}}. \end{align*} \]

Finally, for the last case where \(f_{M} \le y < \infty\), we have

\[\begin{align*} \lim_{\xi \to \infty}\int_{f_{M}}^{\xi}y f_{Y}(y)\ dy &= \lim_{\xi \to \infty}\left[\frac{e^{a_{M}y + b_{M}}(a_{M}y - 1)}{a^{2}_{M}}\right]^{\xi}_{f_{M}} \\ &= \lim_{\xi \to \infty} \frac{e^{a_{M}\xi + b_{M}}(a_{M}\xi - 1)}{a^{2}_{M}} - \frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}} \\ &= \lim_{\xi \to \infty} \frac{e^{a_{M}\xi + b_{M}}a_{M}\xi}{a^{2}_{M}} - \frac{e^{a_{M}\xi + b_{M}}}{a^{2}_{M}} - \frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}} \\ &= \lim_{\xi \to \infty} \frac{e^{a_{M}\xi + b_{M}}a_{M}\xi}{a^{2}_{M}} - \frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}} \\ &= -\frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}}. \end{align*} \]

where the penultimate line follows from the fact that \(a_{M} < 0\), and the last is arrived at by L’Hôpital’s rule, as above.

Multiplying by the reciprocal of the normalising constant, we obtain the expression for the expected value:

\[\begin{align*} \E_{Y}[y] &= \frac{1}{Z(\vphi)}\Bigg(\frac{e^{a_{0}f_{1} + b_{0}}(a_{0}f_{1} - 1)}{a^{2}_{0}} \\ &+\sum_{j=1}^{M-1}\frac{e^{b_{j}}\big(e^{a_{j}f_{j+1}}(a_{j}f_{j+1}-1) - e^{a_{j}f_{j}}(a_{j}f_{j}-1)\big)}{a^{2}_{j}} \\ &-\frac{e^{a_{M}f_{M} + b_{M}}(a_{M}f_{M} - 1)}{a^{2}_{M}}\Bigg). \end{align*} \]


\(\square\)