These notes are structured as an IPython/Jupyter notebook tutorial written in Python 3, and depend only on the numpy and scipy libraries. [download as: .ipynb , .html , .pdf ]

Applying Gaussian process models to hippocampal grid cell data

Gaussian Processes (GPs) generalize the idea of multivariate Gaussian distributions to distributions over functions. In neuroscience, they can be used to estimate how the firing rate of a neuron varies as a function of other variables (e.g. to track retinal waves). Lately, we've been using Gaussian processes to describe the firing rate map of hippocampal grid cells.

We review Bayesian inference and Gaussian processes, explore applications of Gaussian processes to analyzing grid cell data, and finally construct a GP model of the log-rate that accounts for the Poisson noise in spike count data. Along the way, we discuss fast approximations for these methods, like kernel density estimation, or approximating GP inference using convolutions.

Introduction

First, we briefly review Bayesian inference for multivariate Gaussian variables and Gaussian processes. Then, we construct some synthetic spike-count observations, similar to what one might see in hippocampal grid cells. We then review how to estimate the underlying firing rate map using kernel density estimation, and discuss some regularization choices when data are limited.

Bayesian inference in multivariate Gaussian distributions

Loosely, Gaussian processes can be viewed as "really big" multivariate Gaussian distributions, with infinitely many variables. It's helpful to review Bayesian inference for multivariate Gaussian variables before continuing.

Consider estimating some jointly Gaussian variables $z$ from observations $y$. Bayes' rule states that the posterior distribution $\Pr(z|y)$ is proportional to our prior, $\Pr(z)$, times the likelihood of observing $y$ given $z$, $\Pr(y|z)$:

\begin{equation} \Pr(z|y) \propto \Pr(y|z) \Pr(z). \end{equation}

Consider a case where both $\Pr(y|z)$ and $\Pr(z)$ are Gaussian:

\begin{equation} \begin{aligned} \Pr(z) &= \mathcal N( \mu_0, \Sigma_0 ) \\ \Pr(y|z) &= \mathcal N( z, \Sigma_\epsilon ) \end{aligned} \end{equation}

We can estimate $\Pr(z|y)$ using Bayes' rule, by multiplying these two probability distributions (and normalizing the result to integrate to one).

\begin{equation} \begin{aligned} \Pr(y|z) \Pr(z) &\propto \exp\left[-\tfrac 1 2 (z-\mu_0)^\top\Sigma_0^{-1}(z-\mu_0)\right] \exp\left[-\tfrac 1 2 (z-y)^\top\Sigma_\epsilon^{-1}(z-y)\right] \\&\propto \exp\left\{-\tfrac 1 2 \left[ z^\top\Sigma_0^{-1}z -2z^\top\Sigma_0^{-1}\mu_0 + z^\top\Sigma_\epsilon^{-1}z -2z^\top\Sigma_\epsilon^{-1}y \right] \right\} \\&= \exp\left\{-\tfrac 1 2 \left[ z^\top(\Sigma_0^{-1}+\Sigma_\epsilon^{-1})z -2z^\top(\Sigma_0^{-1}\mu_0+\Sigma_\epsilon^{-1}y) \right] \right\} \\&= \exp\left\{-\tfrac 1 2 \left[ z^\top\Sigma^{-1}z -2z^\top\Sigma^{-1}\mu \right] \right\} \\&\propto \exp\left\{-\tfrac 1 2 (z-\mu)^\top\Sigma^{-1}(z-\mu) \right\} \end{aligned} \end{equation}

This product of two multivariate Gaussian distributions is also a multivariate Gaussian distribution, $\hat z \sim \mathcal N(\mu, \Sigma)$, with covariance and mean:

\begin{equation} \begin{aligned} \Sigma &= \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1} \right]^{-1} \\ \mu &= \Sigma \left[\Sigma_0^{-1}\mu_0 + \Sigma_\epsilon^{-1}y \right] \end{aligned} \end{equation}

In other textbooks or tutorials, you might also see this written as

\begin{equation} \begin{aligned} \Sigma &= \Sigma_0 - \Sigma_0[\Sigma_0 + \Sigma_\epsilon]^{-1} \Sigma_0 \\ \mu &= \mu_0 + \Sigma_0\left[\Sigma_\epsilon + \Sigma_0 \right]^{-1}(y-\mu_0). \\ \end{aligned} \end{equation}

Both forms are equivalent, and are related to each other by applying the Sherman–Morrison–Woodbury matrix inversion lemma.

Gaussian process regression

Gaussian processes are commonly used to estimate a smooth underlying trend from noisy observations. Peter Roelants' notes on Gaussian processes are a clear and detailed introduction.

Consider a GP regression problem for learning $y=f(\mathbf x)$, where $\mathbf x=\{x_1,x_2\}$ are coordinates in 2D. Here, our prior over functions is specified not by a mean and covariance, but by a mean function $m(\mathbf x)$ and a two-point correlation function $\kappa(\mathbf x,\mathbf x')$, called a kernel. These functions accept a set of points as input and return a mean vector and covariance matrix evaluated at those points.

For the regression problem, we'd like to learn a model of $y=f(\mathbf x)$ given some initial observations $\text Y_1=\{y_{1,1},..,y_{1,n}\}$ at locations $\text X_1=\{\mathbf x_{1,1},..,\mathbf x_{1,n}\}$.

GP regression builds a posterior distribution over possible functions $f(\mathbf x)$, given our prior (mean and kernel), and these observations.

For any finite collection of points $\text X_2=\{\mathbf x_{2,1},..,\mathbf x_{2,m}\}$, we can evaluate the GP posterior over the corresponding outputs $y_2 = f(\mathbf x_2)$, $\text Y_2=\{y_{2,1},..,y_{2,m}\}$.

\begin{equation} \begin{aligned} y_2 &\sim \mathcal N(\mu,\Sigma) \\ \mu &= \mu_2 + \Sigma_{12}^\top[\Sigma_{11}+\Sigma_{\epsilon}]^{-1} (y_1 - \mu_1) \\ \Sigma &= \Sigma_{22} - \Sigma_{12}^\top [\Sigma_{11}+\Sigma_{\epsilon}]^{-1} \Sigma_{12}, \end{aligned} \end{equation}

where the means and covariances are computed according to the prior mean and kernel, $\mu_i{=}m(\mathbf x_i)$ and $\Sigma_{ij}{=}\kappa(\mathbf x_i,\mathbf x_j)$, respectively. The observation noise model $\Sigma_{\epsilon}$ is typically assumed to be i.i.d. Gaussian noise with variance $\xi^2$, i.e. $\Sigma_{\epsilon} = \xi^2 I$, although we'll explore some other options here.

To connect GP regression to the Bayesian update for multivariate normal variables, consider sampling both the data and the posterior over the same set of points $\text X_0=\{\mathbf x_1,..,\mathbf x_L\}$, i.e. $\text X_1 = \text X_2 = \text X_0$.

In this case, $\mu_1 = \mu_2 = \mu_0$ and $\Sigma_{11} = \Sigma_{12} = \Sigma_{22} = \Sigma_0$, and the GP regression simplifies to:

\begin{equation} \begin{aligned} \Sigma &= \Sigma_0 - \Sigma_0 [\Sigma_0+\Sigma_{\epsilon}]^{-1} \Sigma_0 \\ \mu &= \mu_0 + \Sigma_0[\Sigma_0+\Sigma_{\epsilon}]^{-1} (y_1 - \mu_0), \end{aligned} \end{equation}

This is identical to the posterior distribution for a multivariate Gaussian model we discussed earlier. Indeed, if the data consist of Gaussian observations over a set of points, and you evaluate the posterior at these same locations, there is no difference between Gaussian Process regression and Bayesian inference using multivariate Gaussian variables.

Exploring Gaussian Process methods in grid cell data

First, let's set up our Python environment in the notebook.
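A minimal version of that setup might look like the following (the exact imports in the original notebook may differ):

```python
# Assumed setup; the original notebook may import additional helpers.
import numpy as np
import scipy.linalg
import scipy.sparse.linalg
import matplotlib.pyplot as plt   # used only for plotting

np.random.seed(0)   # make the synthetic data reproducible
L = 100             # linear size of the L x L spatial grid used below
```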

Simulating some data

Let's generate some fake grid cell data. We'll simulate an $L\times L$ spatial grid, and define a periodic grid-like firing intensity. Real data are always a bit messy, so we'll make the arena irregularly shaped, model some background rate fluctuations, and include non-uniform sampling of the grid (maybe the rat visits some locations more than others).

We also add a bit of zero padding around the data. This allows us to apply convolution kernels using circular convolution, without mixing up data from opposite sides. This will be useful later, since circular convolution can be computed very quickly using the Fast Fourier Transform (FFT). More generally, we might also want to mask out parts of the space if e.g. the rat was exploring an arena with something other than a square shape.

We summarize the data in terms of two $L\times L$ arrays: $N$, which counts the number of times the rat visits each location, and $K$, which counts the total number of spikes observed in each location. Spikes are sampled as a conditionally-Poisson process with rate $\lambda$ equal to the intensity at each location.
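As a rough sketch of this binning step (the function and variable names here are illustrative placeholders, not the notebook's own):

```python
import numpy as np

# Hypothetical binning step: `positions` is a (T,2) array of coordinates
# scaled to [0,1), `spikes` a length-T vector of spike counts per time bin.
def bin_session(positions, spikes, L):
    ix = np.clip((positions * L).astype(int), 0, L - 1)
    N = np.zeros((L, L))   # number of visits to each spatial bin
    K = np.zeros((L, L))   # number of spikes observed in each spatial bin
    np.add.at(N, (ix[:, 0], ix[:, 1]), 1)
    np.add.at(K, (ix[:, 0], ix[:, 1]), spikes)
    return N, K
```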

Let's plot things.

Estimating rate in each bin

The simplest way to estimate the rate at each location is to simply divide the number of observed spikes $K$ by the number of visits $N$ to each location. This is a very noisy estimate (below, left), and is undefined when $N$ is zero.

$$ \hat\lambda = \frac K N $$

It's tempting to add a little ad-hoc regularization to handle the $N=0$ case gracefully, for example $\hat\lambda = {K}/({N+\tfrac 1 2} )$. Tricks like this might seem arbitrary (and perhaps wrong), but can be more formally motivated via Bayesian statistics.

We assume that each time the rat visits a location with intensity $\lambda$, we observe $y$ spikes, which are Poisson distributed:

\begin{equation} \Pr(y|\lambda) = \frac {\lambda^y e^{-\lambda}} {\Gamma(y+1)} \end{equation}

This gives us a likelihood for estimating $\lambda$ given $y$. The gamma distribution is the conjugate prior for Poisson rates, with shape parameter $\alpha$ and rate parameter $\beta$:

\begin{equation} \Pr(\lambda|\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} {\lambda^{\alpha-1} e^{-\beta\lambda}}. \end{equation}

We can write the likelihood for the rate $\lambda$ given a single count observation $y$, dropping factors that do not depend on $\lambda$, as:

$$ \Pr(y|\lambda) \propto {\lambda^{y} e^{-\lambda}}. $$

To combine the observations from $N$ visits to a location, $y_t\in\{y_1,..,y_N\}$, take the product of the likelihoods for each observation. This reduces to a simplified expression in terms of the total number of spikes $K=\sum_t y_t$:

\begin{equation} \begin{aligned} \Pr(y_1,..,y_N|\lambda)&\propto \textstyle\prod_t \lambda^{y_t} e^{-\lambda}= \lambda^K e^{-N\lambda}. \end{aligned} \end{equation}

To assign a rate estimate to locations with missing data, we can define a Bayesian prior for $\lambda$. This regularizes bins that have limited data, reducing variance at the expense of increased bias. For regularization strength $\rho>0$, we set $\Pr(\lambda)\sim\operatorname{Gamma}(\alpha_0,\beta_0)$, with $\beta_0=\rho$ and $\alpha_0=\rho(\mu-1)+1$, where $\mu$ is the overall average firing rate of the neuron, regardless of location. This leads to a posterior distribution of:

\begin{equation} \begin{aligned} \Pr(\lambda|y_1,..,y_N) &\propto \lambda^{K}e^{-N\lambda} \cdot [\lambda^{\alpha_0-1} e^{-\beta_0\lambda}] \\&= \lambda^{K+\rho(\mu-1)}e^{-(N+\rho)\lambda}. \end{aligned} \end{equation}

This gives a gamma-distributed posterior with $\alpha = K+\rho(\mu-1)+1$ and $\beta=N+\rho$. The posterior mean, $\alpha/\beta$, is a regularized estimator $\hat\lambda_\mu$ of the rate:

\begin{equation} \hat\lambda_\mu = \frac{K+\rho(\mu-1)+1}{N+\rho}. \end{equation}

This is biased toward higher rates due to the +1 in the numerator. Using the mode $(\alpha-1)/\beta$ is another option, which lacks this bias:

\begin{equation} \hat\lambda_\text{mode} = \frac{K+\rho\mu}{N+\rho}. \end{equation}

One can interpolate between these mean-based and mode-based regularizers with another parameter $\gamma\in[0,1]$, where $\gamma=0$ corresponds to the mode-based estimator, and $\gamma=1$ to the mean-based estimator:

\begin{equation} \hat\lambda = \frac{K+\rho(\mu-\gamma)+\gamma}{N+\rho}. \end{equation}

We use $\gamma=0.5$ and $\rho=1.3$ as the default here.
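In code, this per-bin estimator is a one-liner (a sketch; here `mu` stands for the neuron's mean spikes-per-visit, e.g. `K.sum()/N.sum()`):

```python
import numpy as np

def regularized_rate(K, N, mu, rho=1.3, gamma=0.5):
    # Posterior point estimate interpolating between the gamma mean and mode.
    return (K + rho * (mu - gamma) + gamma) / (N + rho)
```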

Even with regularization, estimating the rate directly in each bin is far too noisy to be useful. Why go through all this trouble to define a principled way to regularize counts for single bins then? These regularized rate estimators provide a principled way to define how a rate estimator should behave when data are limited, and can be incorporated into better estimators that pool data from adjacent bins. Next, we explore a simple way to pool data from adjacent bins using kernel density smoothing.

Estimating rate by smoothing

Estimating rate via Kernel Density Estimation (KDE)

The simplest way to pool data is to average the spike counts over nearby regions. We'll use a Gaussian blur here. The 2D Gaussian blur is a separable filter, so we can compute it using two 1D Gaussian blurs, one in each direction. It can also be computed quickly using the Fast Fourier Transform (FFT). This amounts to Kernel Density Estimation (KDE).

In our case, we must also account for the nonuniform sampling of space. The rat visits some locations more than others. The solution is to smooth the spike counts $K$ and location visits $N$ separately, and then estimate the rate.

\begin{equation} \begin{aligned} \hat\lambda_\text{kde} &= \frac {\kappa \otimes K} {\kappa \otimes N} \\ \kappa(\mathbf x,\mathbf x') &= \exp\left[ -\tfrac {(\mathbf x-\mathbf x')^2}{2\sigma^2}\right] \end{aligned} \end{equation}

If we want to use the regularized rate estimator defined earlier, we should normalize our smoothing kernel $\kappa(\mathbf x,\mathbf x')$ to unit height. This accounts for the fact that smoothing pools multiple observations, and so increases the certainty of our rate estimate relative to the prior.

\begin{equation} \begin{aligned} \hat\lambda_\text{kde} &= \frac {\kappa \otimes K+\rho(\mu-\gamma)+\gamma} {\kappa \otimes N+\rho} \end{aligned} \end{equation}
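A sketch of this smoothed, regularized estimator, assuming the maps are already zero-padded and the kernel is stored with its peak at index (0,0) so that circular FFT convolution keeps everything aligned:

```python
import numpy as np

def unit_height_gaussian(L, sigma):
    d = np.minimum(np.arange(L), L - np.arange(L))   # circular distance from 0
    k1 = np.exp(-0.5 * (d / sigma) ** 2)
    return np.outer(k1, k1)                          # separable 2D kernel, peak = 1

def blur(x, kern):
    # circular convolution via the FFT
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kern)))

def kde_rate(K, N, sigma, mu, rho=1.3, gamma=0.5):
    kern = unit_height_gaussian(K.shape[0], sigma)
    return (blur(K, kern) + rho * (mu - gamma) + gamma) / (blur(N, kern) + rho)
```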

For analyzing the underlying grid, we might also want to remove large-scale variations in rate across the arena. We can estimate a background rate also via Gaussian smoothing, and divide out this rate to get a normalized estimate of how rate changes with location.

Inspecting the data

Kernel density smoothing yields a good estimate of the rate map, but we need to decide how much to blur the spike count data. We can pick $\sigma$ in a principled way by examining the autocorrelation of the data.

We can calculate the 2D autocorrelation efficiently using the FFT. To focus on fluctuations around the mean rate, we should first subtract any constant component.
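A sketch of this calculation using the Wiener–Khinchin theorem (assuming the rate map is an $L\times L$ array):

```python
import numpy as np

def autocorrelation_2d(rate):
    z = rate - rate.mean()                 # remove the constant component
    P = np.abs(np.fft.fft2(z)) ** 2        # power spectrum
    acorr = np.real(np.fft.ifft2(P)) / z.size
    return np.fft.fftshift(acorr)          # put zero lag at the center
```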

We can collapse this 2D autocorrelation down to 1D by averaging it as a function of radial distance. This radial autocorrelation has a large peak at zero lag, but also several smaller peaks due to the periodic tuning curve.

We can estimate the grid spacing based on the location of the first non-zero-lag peak. Here, we use sinc interpolation computed via FFT to find the location of the peak that corresponds to the grid spacing.

For grid cells, the 2D autocorrelation should show a hexagon, which reflects the three sinusoidal components that make up the periodic grid tiling (below, left).

Once we have the grid spacing $P$, we can define the scales for smoothing. We want to smooth as much as possible, but not so much that we erase the underlying grid. A Gaussian with $\sigma = P/\pi$ is a good heuristic for the largest acceptable smoothing radius. For subtracting the background, we use $\sigma_{bg} = 2.5\cdot P/\pi$.

Smoothing with Gaussian Process regression

KDE smoothing is ok, but we can do better. Gaussian Process (GP) regression provides a flexible way to handle missing data, and also lets us encode more assumptions about the spatial correlations in the underlying rate map.

Let's start by implementing smoothing using GP regression. Recall the formula for the GP posterior mean:

\begin{equation} \begin{aligned} \mu &= \left[\Sigma_0^{-1}+\Sigma_\epsilon^{-1}\right]^{-1}\left[\Sigma_\epsilon^{-1}y+\Sigma_0^{-1}\mu_0\right] \end{aligned} \end{equation}

If we set the prior means to zero, this simplifies to:

\begin{equation} \begin{aligned} \hat\lambda_\text{gp} = \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1}\right]^{-1}\Sigma_\epsilon^{-1}y \end{aligned} \end{equation}

To set up our GP regression problem, we need to define the prior covariance $\Sigma_0$ and measurement precisions $\Sigma_\epsilon^{-1}$. For now, we assume uniform measurement noise of $\sigma_\epsilon^2$ per time-point.

We work with the binned spike counts $K$, and the total number of times the rat visits each location $N$. Binning lumps the $N$ observations at a location together into a single estimate. Points with more visits have less error. Bins lacking data ($N=0$) can be removed, or handled gracefully if we work with precision rather than variance. Precision $\tau$ is the reciprocal of variance, $\tau=1/\sigma^2$ (in the multivariate case: the inverse of the covariance matrix). For $N$ measurements with noise $\sigma_\epsilon^2$, the precision is $\tau = N/\sigma_\epsilon^2$. We therefore define the precision matrix of the observations as

\begin{equation} \begin{aligned} \Sigma^{-1}_\epsilon &= \operatorname{diag}[\tau_\epsilon] \\ \tau_\epsilon &= \frac{\operatorname{vec}[N]}{\sigma_\epsilon^2} \end{aligned} \end{equation}

where $\operatorname{diag}[]$ denotes constructing a diagonal matrix from a vector, and $\operatorname{vec}[]$ denotes unravelling the $L{\times}L$ array into a length $L^2$ vector.

We construct the prior covariance matrix $\Sigma_0$ by evaluating the kernel $\kappa(\mathbf x,\mathbf x')$ for all pairs of bins. Here, we configure the kernel heuristically: we pick an arbitrary smoothing radius, and scale the kernel height to match the estimated variance of the rate map.

Here, we explore a grid size of $L=100$. For larger problems, the prior covariance might not fit in memory. As we'll see shortly, it is enough to define a routine that computes the product of the prior covariance (or its inverse) with a vector.

The numerical stability of our GP regression will be poor if our prior covariance has small eigenvalues. The eigenvalues of our covariance correspond to the coefficients of the Fourier transform of our kernel, so we can "repair" our kernel by setting too-small eigenvalues to a small positive value.

To solve our GP regression problem, we use the following form for the posterior mean. This form avoids inverting the prior covariance, so it's more numerically stable.

\begin{equation} \begin{aligned} \mu &= \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1}\right]^{-1} \Sigma_\epsilon^{-1} y \\&= \left[\Sigma_0^{-1} + \operatorname{diag}[\tau_\epsilon]\right]^{-1} \operatorname{diag}[\tau_\epsilon] y \\&= \left[\Sigma_0 \operatorname{diag}[\tau_\epsilon] + I\right]^{-1} \Sigma_0 \operatorname{diag}[\tau_\epsilon]y \end{aligned} \end{equation}

We apply a few optimizations for speed.

Finally, we use scipy.sparse.linalg.minres to solve the linear system

\begin{equation} \begin{aligned} \mu &= A^{-1}\mathbf v \\ A&=\Sigma_0 \operatorname{diag}[\tau_\epsilon] + I \\ \mathbf v&=\Sigma_0 \operatorname{diag}[\tau_\epsilon]\,y . \end{aligned} \end{equation}

Minres stands for "minimum residual", and is a type of Krylov subspace solver. It works by re-phrasing $\mu = A^{-1}\mathbf v$ as:

\begin{equation} \mu = \underset{\mu}{\operatorname{argmin}} \| A\mu - \mathbf v\|^2 \end{equation}

And then minimizing this error ("residual"). We do not need to explicitly construct $A$ (which is memory intensive), but instead supply a function that calculates $A\mu$ for any given vector $\mu$. This can be evaluated efficiently as element-wise multiplications and convolutions implemented via FFT.
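A minimal sketch of this solver. Here `kern` is assumed to hold the prior kernel as an $L\times L$ array with its peak at index (0,0), and `tau` the per-bin precisions $N/\sigma_\epsilon^2$. Note that $A$ is only symmetric when $\tau$ is constant, so if minres struggles to converge, a general Krylov solver such as scipy.sparse.linalg.lgmres is a drop-in alternative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def gp_posterior_mean(y, tau, kern):
    shape, L2 = y.shape, y.size
    Kf = np.fft.fft2(kern)          # eigenvalues of the circulant prior Sigma_0
    def S0(v):                      # apply Sigma_0 to a flattened vector via FFT
        return np.real(np.fft.ifft2(Kf * np.fft.fft2(v.reshape(shape)))).ravel()
    # A v = Sigma_0 diag(tau) v + v, evaluated with FFTs and element-wise products
    A = LinearOperator((L2, L2), matvec=lambda v: S0(tau.ravel() * v) + v,
                       dtype=float)
    b = S0(tau.ravel() * y.ravel())  # right-hand side Sigma_0 diag(tau) y
    mu, info = minres(A, b)
    return mu.reshape(shape)
```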

Sometimes GP regression reduces to convolution

It seems like GP regression yields similar results to kernel density estimation. Can we relate these two operations? Recall the solution for the GP posterior:

\begin{equation} \begin{aligned} \mu = \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1}\right]^{-1}\Sigma_\epsilon^{-1}y. \end{aligned} \end{equation}

The prior $\Sigma_0$ is a positive semi-definite matrix, so it can be written in terms of the eigenvalue decomposition

\begin{equation} \Sigma_0 = \text F \operatorname{diag}[\tilde k] \text F^{-1}, \end{equation}

where $\tilde k$ is a vector of eigenvalues and $\text F$ is a unitary basis. If $\Sigma_\epsilon^{-1}$ can also be diagonalized by $\text F$, i.e. $\Sigma_\epsilon^{-1} = \text F \operatorname{diag}[\tilde\tau]\, \text F^{-1}$, then the posterior mean simplifies to

\begin{equation} \begin{aligned} \mu = \operatorname F \operatorname{diag}\left[\tfrac {\tilde k \tilde \tau} {\tilde k \tilde \tau + 1}\right] \operatorname F^{-1} y. \end{aligned} \end{equation}

In the special case that all measurements have noise $\sigma_\epsilon^2$, the precision matrix $\Sigma_\epsilon^{-1} = I/\sigma_\epsilon^2$ is proportional to the identity, and the GP posterior reduces to:

\begin{equation} \begin{aligned} \mu = \text F \operatorname{diag}\left[\tfrac {\tilde k} {\tilde k + \sigma_\epsilon^2}\right] \text F^{-1} y. \end{aligned} \end{equation}

When the GP is evaluated on a regularly-spaced grid, the eigenspace $\text F$ is Fourier space, and $\text F$ is the (unitary) Fourier transform. The above matrix operations can therefore be computed as a convolution ($\otimes$) with the kernel $g(\mathbf x,\mathbf x')$:

\begin{equation} \begin{aligned} g(\mathbf x,\mathbf x') &= \text F^{-1} \operatorname{diag}\left[\tfrac {\tilde k} {{\tilde k} + \sigma_\epsilon^2}\right] \\ \mu&\approx g \otimes y. \end{aligned} \end{equation}

For large measurement error $\sigma_\epsilon^2\gg\tilde k$, the kernel $g(\mathbf x,\mathbf x')$ is approximately proportional to the prior kernel $\kappa$. When the measurement error is small $\sigma_\epsilon^2\ll\tilde k$, convolution with $g(\mathbf x,\mathbf x')$ approximates the identity transformation.

This highlights that sometimes filtering the observations with a convolution kernel gives you something almost as good as a GP regression. This is much simpler, and is often good enough.
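A sketch of this shortcut, assuming uniform measurement noise $\sigma_\epsilon^2$ and a circulant prior built from a kernel stored with its peak at index (0,0):

```python
import numpy as np

def gp_smooth_by_convolution(y, kern, sigma_eps2):
    Kf = np.real(np.fft.fft2(kern))        # prior eigenvalues k~
    gain = Kf / (Kf + sigma_eps2)          # per-frequency shrinkage k~/(k~+sigma^2)
    return np.real(np.fft.ifft2(gain * np.fft.fft2(y)))
```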

Better priors

So far, we've only used GP regression with a Gaussian prior. When analyzing data from grid cells, the real power of GP regression lies in being able to encode the knowledge that the grid should be periodic into the GP prior kernel.

To construct a periodic prior, we estimate the autocorrelation from a perfect grid. To avoid assuming any particular orientation, we make the kernel radially symmetric. To avoid inferring long-range interactions where none exist, we taper the kernel to look only at the local neighborhood.

We adapt the kernel to the observed statistics of the spike count data by scaling the zero-lag peak in the kernel to match an estimate of the variance in the rate.

The zero-lag autocorrelation of the data reflects the sum of the true variance in the underlying rates, plus the average measurement noise.

To remove the contribution from the measurement noise, we estimate the zero-lag variance by fitting a quadratic polynomial to the correlation at nearby, nonzero lags.

This prior encodes the assumption that the observed spike counts have a periodic underlying structure, and leads to better recovery of the grid fields.

Heuristic approximation of Poisson noise

Neuronal spiking is typically treated as conditionally Poisson, which means its variance should be proportional to the firing rate. Let's explore a heuristic way to incorporate a Poisson noise assumption into our GP regressions. Earlier, we discussed how the gamma distribution could serve as a conjugate prior for Poisson count data. We can also use a gamma distribution to model the measurement uncertainty from a collection of Poisson observations, and incorporate this model of uncertainty into our GP regression.

The variance of a $\operatorname{Gamma}(\alpha,\beta)$ distribution is $\sigma^2=\alpha/\beta^2$. The regularized rate estimator given $K$ spikes in $N$ visits to a given location yields a gamma distribution with $\alpha = K+\rho(\mu-\gamma)+1$ and $\beta=N+\rho$. The variance, then, is

\begin{equation} \sigma^2_\epsilon=\tfrac{\alpha}{\beta^2}=\tfrac{K+\rho(\mu-\gamma)+1}{(N+\rho)^2} \end{equation}

Performance for this model of the error is mixed: it can work better than assuming constant error when data are limited, but sometimes performs worse than simply assuming uniform variance equal to the neuron's average firing rate. We discuss a more principled way to handle Poisson noise in the next section.

Log-Gaussian Cox Processes

So far, we've been using GP regression to estimate the underlying rate. This works, but entails some heuristic decisions about how to model measurement noise. Can we do better?

We can get an even better model of the data by fitting a log-Gaussian Cox process model to the binned count observations. This places a Gaussian process prior on the logarithm of the intensity, $\ln(\lambda)$, and assumes that spike count observations are conditionally Poisson:

\begin{equation} \begin{aligned} y&\sim\operatorname{Poisson}(\lambda) \\ \lambda &= \exp(\mathbf w^\top \mathbf x) \\ \mathbf w&\sim \mathcal N(0,\Sigma_0) \end{aligned} \end{equation}

Above, $\mathbf w$ are the log-rates that we want to infer, and $\mathbf x$ is an indicator vector which is 1 for the rat's current binned location and zero otherwise.

Recall that the probability of observing spike count $y$ given rate $\lambda$, for Poisson-distributed spike counts, is:

\begin{equation} \Pr(y|\lambda) = \frac {\lambda^y e^{-\lambda}} {\Gamma(y+1)} \end{equation}

We work with log-probability for numerical stability. The log-probability of observing spike count $y$ given rate $\lambda$, for Poisson-distributed spike counts, is:

\begin{equation} \ln\Pr(y|\lambda) = y \ln \lambda -\lambda + \text{const.} \end{equation}

We estimate a posterior distribution on $\mathbf w$ by multiplying our Gaussian process prior by this Poisson likelihood, for all $T$ time points.

The maximum a posteriori estimate

We find $\mathbf w$ that maximizes the posterior probability of the observed spike counts. This is the Maximum A Posteriori (MAP) estimator. For this, we need only calculate the posterior log-probability up to a constant. By convention, we work with the negative log-posterior so that finding the MAP is a minimization problem.

The negative log-posterior $\mathcal L=-\ln\Pr(\mathbf w|\text Y,\text X)$ (up to a constant), given observations $\text Y=\{y_1,..,y_T\}$ at locations $\text X=\{\mathbf x_1,..,\mathbf x_T\}$, is:

\begin{equation} \begin{aligned} \mathcal L &= \tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w - \textstyle\sum_{t=1}^T [y_t \ln(\lambda_t) - \lambda_t] + \text{const.} \\ \lambda_t &= \exp(\mathbf w^\top \mathbf x_t) \end{aligned} \end{equation}

We bin the data into $r\in{1..R}$ spatial regions. Each site $r$ has $n_r$ visits in which we observe $k_r$ spikes. Define $\bar y_r=k_r/n_r$ as the empirical rate in each region. We can rewrite the sum over all timepoints in the log likelihood, as a sum over all spatial regions:

\begin{equation} \begin{aligned} \textstyle\sum_{t=1}^T y_t \ln(\lambda_t) - \lambda_t &= \textstyle\sum_{r=1}^R n_r [ \bar y_r \ln(\lambda_r) - \lambda_r ] \end{aligned} \end{equation}

The negative log-posterior can then be written as:

\begin{equation} \begin{aligned} \mathcal L = \tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w +\textstyle\sum_{r=1}^R n_r [\lambda_r - \bar y_r \ln(\lambda_r)] +\text{const.} \end{aligned} \end{equation}

Written as a sum over bins like this, with $\lambda_r = \exp(\mathbf w^\top \mathbf x_r) = \exp(\mathbf w_r)$, the negative log-posterior simplifies to:

\begin{equation} \begin{aligned} \mathcal L = \tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w +\textstyle\sum_{r=1}^R n_r [\exp(\mathbf w_r) - \bar y_r \mathbf w_r] +\text{const.} \end{aligned} \end{equation}
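As a concrete sketch, this objective and its gradient might be implemented as follows, with everything flattened to length-$R$ vectors; `Sigma0_inv_dot` stands for any routine that applies $\Sigma_0^{-1}$ (e.g. via the FFT), and bins with $n_r=0$ are assumed to have $\bar y_r$ set to zero so that they drop out:

```python
import numpy as np

def lgcp_loss(w, n, ybar, Sigma0_inv_dot):
    # negative log-posterior, up to an additive constant
    lam = np.exp(w)
    return 0.5 * w @ Sigma0_inv_dot(w) + np.sum(n * (lam - ybar * w))

def lgcp_grad(w, n, ybar, Sigma0_inv_dot):
    # gradient of the negative log-posterior with respect to w
    lam = np.exp(w)
    return Sigma0_inv_dot(w) + n * (lam - ybar)
```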

We find the MAP by minimizing the above as a function of $\mathbf w$. This can be solved via gradient descent. However, I found that most of the optimization routines in Scipy's minimize function performed poorly, either crashing or failing to terminate. Scipy's conjugate gradient method performed the best, but achieved poor error tolerance. Instead, we can build our own Newton-Raphson solver.

Finding the maximum a posteriori using Newton-Raphson

Newton-Raphson solves a linear system on each iteration. Each iteration takes about the same amount of time as solving a single GP regression problem.

(Indeed, one can view each stage of Newton-Raphson as its own GP regression problem. This is the idea behind the Iteratively Reweighted Least Squares (IRLS) approach to fitting Generalized Linear Models (GLMs). The Gaussian process model used here can be viewed as a Poisson GLM with the GP prior acting as a regularizer. Lieven Clement has a good introduction on IRLS.)

Each iteration of Newton-Raphson updates the parameters as

\begin{equation} \mathbf w_{i+1} = \mathbf w_i - \mathbf H^{-1} \mathbf J, \end{equation}

where $\mathbf J = \nabla\mathcal L$ and $\mathbf H = \nabla\nabla^\top\mathcal L$ are the Jacobian (gradient) and Hessian (curvature) of our negative log-posterior at the current parameter estimate $\mathbf w_i$.

To apply Newton-Raphson we need to calculate the Hessian matrix and Jacobian vector. We can express these as a sum of a contribution from the log-prior and log-likelihood.

The negative log-prior is $\tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w$ (up to a constant). We can express its contribution to the Hessian and Jacobian in terms of vector derivatives in $\mathbf w$:

\begin{equation} \begin{aligned} \mathbf J_0 &= \nabla_\mathbf w [\tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w] \\&= \tfrac 1 2 [\Sigma_0^{-1}\mathbf w + \Sigma_0^{-\top} \mathbf w] \\&= \Sigma_0^{-1} \mathbf w \\ \mathbf H_0 &= \nabla\nabla^\top_\mathbf w [\tfrac 1 2 \mathbf w^\top \Sigma_0^{-1} \mathbf w] \\&= \Sigma_0^{-1} \end{aligned} \end{equation}

For the negative log-likelihood $\ell=\textstyle\sum_{r=1}^R n_r[\exp(\mathbf w_r)-\bar y_r\mathbf w_r]$, only the rate $\lambda_r$ contributes to the corresponding derivatives in $w_r$:

\begin{equation} \begin{aligned} \partial_{w_r}\ell &= n_r [\exp(\mathbf w_r)-\bar y_r] = n_r [\lambda_r-\bar y_r] \\ \partial^2_{w_r} \ell &= n_r \exp(\mathbf w_r) = n_r \lambda_r. \end{aligned} \end{equation}

These can be written in vector form as

\begin{equation} \begin{aligned} \mathbf J_\ell& = N\circ(\lambda-\bar y) \\ \mathbf H_\ell&=\operatorname{diag}[N\circ\lambda], \end{aligned} \end{equation}

where $\circ$ denotes element-wise multiplication and $N=\{n_1,..,n_R\}$, $\lambda=\{\lambda_1,..,\lambda_R\}$, and $\bar y=\{\bar y_1,..,\bar y_R\}$ are column vectors of the number of visits per bin, the current estimated rates, and the empirical rates, respectively.

The Jacobian and Hessian can be written as:

\begin{equation} \begin{aligned} \mathbf J &= \mathbf J_0+\mathbf J_\ell= \Sigma_0^{-1}\mathbf w+N\circ(\lambda-\bar y) \\ \mathbf H&= \mathbf H_0+\mathbf H_\ell= \Sigma_0^{-1}+\operatorname{diag}[N\circ\lambda] \end{aligned} \end{equation}

The Newton-Raphson update is then given by

\begin{equation} \begin{aligned} \mathbf w_{n+1} &= \mathbf w_{n} -\mathbf H^{-1} \mathbf J \\&= \mathbf w_{n} - \left[ \Sigma_0^{-1}+\operatorname{diag}[N\circ\lambda] \right] ^{-1} \left[ \Sigma_0^{-1}\mathbf w_{n}+ N\circ(\lambda-\bar y) \right] \end{aligned} \label{eq:lgcpnr1} \end{equation}
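For small grids, this update can be written directly with dense matrices (a sketch; for larger grids the linear solve is replaced by minres with FFT-based operators, as discussed in the notes below):

```python
import numpy as np

def lgcp_map_newton(w0, n, ybar, Sigma0_inv, iters=20):
    w = w0.copy()
    for _ in range(iters):
        lam = np.exp(w)
        J = Sigma0_inv @ w + n * (lam - ybar)     # Jacobian of the loss
        H = Sigma0_inv + np.diag(n * lam)         # Hessian of the loss
        w = w - np.linalg.solve(H, J)             # Newton-Raphson step
    return w
```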

Note: (pre) conditioning

As in GP regression, this problem can be numerically unstable if $\Sigma_0$ has small eigenvalues. One can mitigate this by multiplying both the Hessian and Jacobian on the left by $\Sigma_0$ (i.e. preconditioning).

\begin{equation} \begin{aligned} \mathbf w_{n+1} &= \mathbf w_{n} - \left[ \Sigma_0\operatorname{diag}[N\circ\lambda] + I \right] ^{-1} \left\{ \mathbf w_{n}+\Sigma_0 [N\circ(\lambda-\bar y)] \right\} \end{aligned} \end{equation}

However, when $\Sigma_0$ is invertible, it is faster in practice to use form $\eqref{eq:lgcpnr1}$, and pass an operator that computes $f(\mathbf v)=\Sigma_0\mathbf v$ to the preconditioner argument of minres.

Note: when to use a separate bias term?

Sometimes, you might want to separate out the average log-rate, and parameterize the LGCP as $\Theta=(\mathbf w,\beta)$; $\lambda=\exp(\mathbf w+\beta)$. Why? We want to avoid placing prior assumptions on the average firing rate of the neuron. The average rate therefore corresponds to a direction in our Gaussian process that is entirely unconstrained, i.e. has infinite variance in the prior.

Since GP regression is linear, it suffices to subtract the average rate from the observations before inferring the firing rate map, and add it back afterwards. The average log-rate is less straightforward to handle in LGCP regression, so it must be inferred along with the weights during optimization.

Here, we limit small eigenvalues of $\Sigma_0$, so that $\Sigma_0^{-1}$ is well-defined. We then zero-out the uniform (average) rate component of $\Sigma_0^{-1}$ by setting the DC term in its Fourier transform to zero.

However, in other applications $\Sigma_0$ might be sufficiently low-rank that computing even a regularized $\Sigma_0^{-1}$ is impractical or inaccurate. In this case, one can avoid inverting $\Sigma_0$ by treating the unconstrained mean as an additional bias parameter $\beta$ that is unaffected by the prior.

Note: Iteratively Reweighted Least-Squares (IRLS)

The Iteratively Reweighted Least-Squares (IRLS) approach recasts each Newton-Raphson iteration as solving a new GP regression problem. Rewrite the Newton-Raphson iteration as:

\begin{equation} \begin{aligned} \mathbf w_{n+1} &=\mathbf w_{n}-\mathbf H^{-1}\left[\Sigma_0^{-1}\mathbf w_{n}+N\circ(\lambda-\bar y)\right] \\&=\mathbf H^{-1}\left[\mathbf H\mathbf w_{n}-\Sigma_0^{-1}\mathbf w_{n}- N\circ(\lambda-\bar y) \right] \\&=\mathbf H^{-1}\left[\left(\Sigma_0^{-1}+\operatorname{diag}[N\circ\lambda]\right)\mathbf w_{n}-\Sigma_0^{-1}\mathbf w_{n}-N\circ(\lambda-\bar y)\right] \\&=\mathbf H^{-1}\operatorname{diag}[N\circ\lambda]\left[\mathbf w_{n}-(1-\bar y/\lambda)\right] \end{aligned} \end{equation}

Recall the formula for the GP posterior is $\mu= \Sigma \left[\Sigma_0^{-1}\mu_0 + \Sigma_\epsilon^{-1}y \right]$ and $\Sigma = \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1} \right]^{-1}$. Matching terms, we get:

\begin{equation} \begin{aligned} \Sigma_0^{-1} &= \Sigma_0^{-1} \\ \mu_0 &= 0 \\ \Sigma_\epsilon^{-1} &= \operatorname{diag}[N\circ\lambda] \\ y &= \mathbf w_n - (1-\bar y/\lambda) = \mathbf w_n + \bar y/\lambda - 1 \end{aligned} \end{equation}

This confirms that estimating the LGCP posterior has similar complexity to GP regression.

Note: initializing a prior for log-Gaussian inference

So far, we have defined a Poisson observation model and log-posterior loss function for a log-Gaussian point-process model of the grid cell. We also need to initialize a sensible prior for the weights $\mathbf w$, which will correspond to our estimates of $\ln\lambda$. We'll use the same periodic kernel from earlier, but normalize the kernel height to the variance of the log-rate, estimated via KDE.

Note: Hessian-vector products

We can gain the benefit of Krylov subspace methods in LGCP regression by defining a routine to compute the product of the Hessian with a vector. Writing $\nu = N\circ\lambda$ and letting $\tilde\kappa$ denote the Fourier-domain eigenvalues of the prior kernel:

\begin{equation} \begin{aligned} \mathbf H \mathbf v &= \Sigma_0^{-1}\mathbf v+\operatorname{diag}[\nu]\mathbf v \\ &= \text F^{-1} \left[\tfrac 1 {\tilde\kappa} \circ\text F\,\mathbf v\right] +\nu\circ\mathbf v \end{aligned} \end{equation}
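A sketch of such a routine, assuming `Kf` holds the repaired (strictly positive) Fourier eigenvalues of the prior kernel and `nu` is the vector $N\circ\lambda$ at the current Newton step:

```python
import numpy as np

def make_hessian_vector_product(Kf, nu, shape):
    def hvp(v):
        V = np.fft.fft2(v.reshape(shape))
        prior = np.real(np.fft.ifft2(V / Kf)).ravel()   # Sigma_0^{-1} v via FFT
        return prior + nu * v                           # + diag(nu) v
    return hvp
```

The returned function can be wrapped in a scipy.sparse.linalg.LinearOperator and passed to minres.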

Subtracting the background

We can modify our log-Gaussian Cox process model to subtract the background rate by adding the estimated background log-rate as an offset during inference.

Convolution approximation

We can also approximate the log-Gaussian model as a convolution. This amounts to treating only the first iteration of Newton-Raphson as if it were a GP regression problem, and replacing the per-bin measurement noise with its average. A single Newton-Raphson step for the weights alone is:

\begin{equation} \begin{aligned} \mathbf w_{i+1} &= \mathbf w_{i} - [\Sigma_0^{-1} + \Sigma_\varepsilon^{-1}]^{-1} \left[ \Sigma_0^{-1} \mathbf w_{i} + N\circ(\lambda - \bar y) \right], \end{aligned} \end{equation}

where $\Sigma_\varepsilon = \operatorname{diag}[N\circ\lambda]^{-1}$. Now, approximate $\Sigma_\varepsilon\approx\sigma^2_\epsilon I$ where $\sigma^2_\epsilon=\langle(N\circ\lambda)^{-1}\rangle$:

\begin{equation} \begin{aligned} \hat {\mathbf w} &\approx \mathbf w_0 - [\Sigma_0^{-1} + \tfrac 1 {\sigma^2_\epsilon} I]^{-1} \left[ \Sigma_0^{-1} \mathbf w_{0} + N\circ(\lambda-\bar y) \right] \\ &= \mathbf w_0 - \text F \tfrac {\sigma^2_\epsilon} {\tilde k+\sigma^2_\epsilon} \text F^{-1} \mathbf w_0 - \text F \tfrac {\sigma^2_\epsilon \tilde k} {\tilde k+\sigma^2_\epsilon} \text F^{-1} N\circ(\lambda-\bar y) \end{aligned} \end{equation}

This implies an approximate solution in terms of two convolutions:

\begin{equation} \begin{aligned} \hat {\mathbf w} \approx \mathbf w_0 &- g\otimes\{\mathbf w_0 + \kappa\otimes[N\circ(\lambda-\bar y)]\} \\ g(\mathbf x,\mathbf x') &= \text F^{-1} \left[ \tfrac {\sigma^2_\epsilon} {\tilde k+\sigma^2_\epsilon} \right] \end{aligned} \end{equation}

where $\kappa(\mathbf x,\mathbf x')$ is the GP prior and $\tilde k$ is its Fourier transform. This can be calculated almost instantaneously, and differs from the MAP in this example by only a few percent.
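A sketch of this approximation, assuming `w0` is the initial log-rate map, `kern` the prior kernel (peak at index (0,0)), and `ybar` set to zero wherever `n` is zero:

```python
import numpy as np

def lgcp_convolution_update(w0, n, ybar, kern):
    lam = np.exp(w0)
    Kf = np.real(np.fft.fft2(kern))                     # prior eigenvalues k~
    visited = n * lam > 0
    sig2 = np.mean(1.0 / (n[visited] * lam[visited]))   # sigma_eps^2 = <1/(N*lam)>
    conv = lambda F, x: np.real(np.fft.ifft2(F * np.fft.fft2(x)))
    inner = w0 + conv(Kf, n * (lam - ybar))             # w0 + kappa (*) [N(lam-ybar)]
    return w0 - conv(sig2 / (Kf + sig2), inner)         # subtract g (*) {...}
```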

Interim summary

Almost done!

Estimating confidence intervals around peaks

One question that might be nagging you at this point is: should we believe the inferred rate maps? When we see a peak (a "grid field"), is this real, or just a noisy fluctuation? For this, it is useful to generate some sort of confidence bounds or other summary of uncertainty in the inferred grid map.

First, let's find our peaks

Both GP regression and the LGCP model provide estimates of the posterior covariance, which encodes the uncertainty in our posterior mean (or mode).

For GP regression, the covariance is

\begin{equation} \begin{aligned} \Sigma_\text{post} &= \left[\Sigma_0^{-1} + \Sigma_\epsilon^{-1} \right]^{-1} \end{aligned} \end{equation}

For the LGCP model, we can use the Laplace approximation to model the uncertainty in our MAP estimate $\hat{\mathbf w}$. This models the posterior covariance as the inverse of the Hessian, evaluated at $\hat{\mathbf w}$.

Intuitively, directions with higher curvature in our loss function are more constrained, and so have lower posterior variance. Conversely, unconstrained directions have low curvature, and therefore large posterior variance.

\begin{equation} \begin{aligned} \Sigma_\text{post} &= \mathbf H_{\mathbf w}^{-1} = \left[\Sigma_0^{-1}+\operatorname{diag}[N\circ\hat\lambda]\right]^{-1} \end{aligned} \end{equation}

Incidentally, the curvature of the observation likelihood in the LGCP model, $\operatorname{diag}[N\circ\hat\lambda]$, is equivalent to the measurement precision $\Sigma_\epsilon^{-1}$ in GP regression. This highlights that the LGCP model is simply finding $\hat\lambda$ such that the observed measurement errors are consistent with a Poisson error model, $\sigma_\varepsilon^2 = \hat\lambda$.

In these notes, GP regression inferred a posterior distribution on $\lambda$, and the LGCP model inferred a posterior distribution on $\ln\lambda$. For generality, we'll refer to both of these in terms of a generic GP posterior for a function $f(\mathbf x)$, and denote by $\mathbf f$ the vector that results from evaluating $f$ over our discrete $L\times L$ grid.

Using the Laplace approximation to calculate uncertainty in peak location

The posterior covariance describes a distribution over different possible rate maps. We can denote this as the posterior mean (or mode) $\mu(\mathbf x)$, plus some fluctuations $\epsilon(\mathbf x)$:

\begin{equation} \begin{aligned} f(\mathbf x)&=\mu(\mathbf x) + \epsilon(\mathbf x) \\ \epsilon(\mathbf x)&\sim\mathcal{GP}\left[0,\Sigma_\text{post}(\mathbf x,\mathbf x')\right]. \end{aligned} \end{equation}

The fluctuations $\epsilon(\mathbf x)$ represent the uncertainty in our smoothed rate map. We are interested in how much these fluctuations might shift a local maximum in the rate map at location $\mathbf x_0$.

If $\mu(\mathbf x)$ is our inferred posterior mean (or mode), then a perturbation $\epsilon(\mathbf x)$ changes the rate map to $\mu(\mathbf x) + \epsilon(\mathbf x)$. If perturbations are small relative to the height of our peak, they will move the inferred local maximum by an amount $\Delta \mathbf x_0$.

One can calculate $\Delta \mathbf x_0$ given $\epsilon(\mathbf x)$ by considering a second-order Taylor expansion of our rate map as a function of location $\mathbf x$. The slope is zero at $\mathbf x_0$, since we are at a local maximum. The shift $\Delta \mathbf x_0$ is therefore governed by the second-order term:

\begin{equation} \Delta \mathbf x_0 = - {\mathbf H^{\mathbf x}_\mu}^{-1}\mathbf J^{\mathbf x}_\epsilon. \end{equation}

Above, $\mathbf J^{\mathbf x}_\epsilon = \nabla_{\mathbf x} \epsilon(\mathbf x_0)$ is the slope of our perturbation $\epsilon(\mathbf x)$ at $\mathbf x_0$. The larger this is, the more the peak moves. The size of the shift is also controlled by the curvature of our rate map at the peak, $\mathbf H^{\mathbf x}_\mu=\nabla_{\mathbf x}\nabla_{\mathbf x}^\top\mu(\mathbf x_0)$. More curved directions shift less, i.e. sharper peaks are more difficult to move.

We are interested in summarizing the overall uncertainty in the location of a peak. This is captured by the covariance $\Sigma_{\Delta\mathbf x_0}=\langle\Delta\mathbf x_0 (\Delta\mathbf x_0)^\top\rangle$:

\begin{equation} \begin{aligned} \Sigma_{\Delta\mathbf x_0} &=\langle\Delta\mathbf x_0 (\Delta\mathbf x_0)^\top\rangle \\&= {\mathbf H^{\mathbf x}_\mu}^{-1} \left< \mathbf J^{\mathbf x}_\epsilon {\mathbf J^{\mathbf x}_\epsilon}^\top \right> {\mathbf H^{\mathbf x}_\mu}^{-1} \\&= {\mathbf H^{\mathbf x}_\mu}^{-1} \left< \nabla_{\mathbf x} \epsilon(\mathbf x_0) \epsilon(\mathbf x_0)^\top \nabla_{\mathbf x}^\top \right> {\mathbf H^{\mathbf x}_\mu}^{-1} \\&= {\mathbf H^{\mathbf x}_\mu}^{-1} \left[ \nabla_{\mathbf x} \Sigma_\text{post}(\mathbf x_0,\mathbf x_0) \nabla_{\mathbf x}^\top \right] {\mathbf H^{\mathbf x}_\mu}^{-1} \end{aligned} \end{equation}

The term $\nabla_{\mathbf x} \Sigma(\mathbf x_0,\mathbf x_0) \nabla_{\mathbf x}^\top$ reflects the distribution of the gradient of our rate map at $\mathbf x_0$. We need to calculate $\Sigma_\text{post}$ to get this.

Building a low-rank model of the posterior variance

A $200\times200$ spatial grid has $4\times10^4$ regions, and a covariance matrix with $1.6\times10^9$ entries. This requires several gigabytes to store. Furthermore, matrix operations generally scale as $\mathcal{O}(n^3)$, where in our case $n=L\times L$. In general, then, it is infeasible to construct the posterior covariance or to perform exact calculations with it.

The solution is to approximate the posterior covariance in a low-rank subspace

$$ \begin{aligned} \Sigma_{\text{post}}\approx QQ^\top, \end{aligned} $$

where $Q$ is an $L^2\times M$ matrix, with $M\ll L^2$.

Selecting a random basis is common. In our case, we use the discrete cosine transform of the 2D spatial domain. This can be computed rapidly using the FFT. Additionally, the prior $\Sigma_0$ is diagonal in this basis, and we can construct our low-rank approximation by throwing away any components in which $\Sigma_0$ has small eigenvalues.

Armed with this low-rank approximation, we can now calculate

$$ \begin{aligned} \Sigma_{\Delta\mathbf x_0} &= {\mathbf H^{\mathbf x}_\mu}^{-1} \left[ \nabla_{\mathbf x} \Sigma_\text{post}(\mathbf x_0,\mathbf x_0) \nabla_{\mathbf x}^\top \right] {\mathbf H^{\mathbf x}_\mu}^{-1} \\&= {\mathbf H^{\mathbf x}_\mu}^{-1} \left[ \nabla_{\mathbf x}QQ^\top \nabla_{\mathbf x}^\top \right] {\mathbf H^{\mathbf x}_\mu}^{-1} \end{aligned} $$

The derivatives $\nabla_{\mathbf x}$ can then be calculated numerically using the discrete derivative. The curvature at our peak can be calculated as $\mathbf H^{\mathbf x}_\mu=\nabla_{\mathbf x}\nabla_{\mathbf x}^\top f(\mathbf x_0)\approx \mathbf D_{\mathbf x_0}\mathbf D_{\mathbf x_0}^\top \mathbf f$. We can calculate this quickly for all points at once by convolving our discrete difference operators with the data using FFT.

Once we have calculated $\Sigma_{\Delta\mathbf x_0}$ for a given peak, we can estimate an ellipsoidal confidence region for how much that peak location might shift, given our posterior uncertainty.
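A sketch of this calculation for a single peak at integer grid location `(i, j)`, given the posterior mean `mu` as an $L\times L$ array and a low-rank factor `Q` reshaped to shape (L, L, M) so that $\Sigma_\text{post}\approx QQ^\top$:

```python
import numpy as np

def peak_shift_covariance(mu, Q, i, j):
    # Curvature of the posterior mean at the peak (central differences)
    dxx = mu[i+1, j] - 2*mu[i, j] + mu[i-1, j]
    dyy = mu[i, j+1] - 2*mu[i, j] + mu[i, j-1]
    dxy = (mu[i+1, j+1] - mu[i+1, j-1] - mu[i-1, j+1] + mu[i-1, j-1]) / 4
    H = np.array([[dxx, dxy], [dxy, dyy]])
    # Covariance of the gradient of the posterior fluctuations at the peak
    G = np.stack([(Q[i+1, j] - Q[i-1, j]) / 2,
                  (Q[i, j+1] - Q[i, j-1]) / 2])      # shape (2, M)
    Hinv = np.linalg.inv(H)
    return Hinv @ (G @ G.T) @ Hinv                   # Sigma_{delta x0}
```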

Overall, this enables reasonably fast approximate confidence intervals on the peak locations.

Use sampling to assess the probability of a peak

We can also sample from the posterior distribution, and test how many of these samples have a certain property, e.g. having a peak at location $\mathbf x_0$.

In sum, we have illustrated the following workflow for analyzing firing rate maps for hippocampal grid cells:

  1. Bin spikes into a histogram of the total number of spikes and visits to each region
  2. Use autocorrelation to estimate the grid scale
  3. Use an idealized grid to set a prior for the log rate
  4. Infer log-rate using kernel density estimation (KDE)
  5. Use this KDE estimate as initializer for LGCP regression
  6. Heuristically fit a log-Gaussian Cox process model using convolution
  7. Identify grid-field centers as local maxima in the inferred rate map
  8. Use the Laplace approximation to fit confidence regions for grid field centers
  9. Sample from the GP posterior to estimate the probability of a grid field center in each region

Putting it all together

In this document, we discussed several ways to infer the underlying tuning of hippocampal grid cells. The methods outlined here will also work more generally, for any neuron tuned to a 2D feature space.

Histogram-based estimators are easy, and work if low spatial resolution is acceptable. However, they need a lot of data to return a meaningful result. It's also hard to define notions of confidence for histograms.

Kernel density estimators (KDEs) pool data from nearby regions. They are more efficient than histograms, especially if one optimizes the kernel bandwidth to match the underlying variations in neuronal tuning. For moderate amounts of data, KDE estimators are fast, and return a reasonable estimate of the rate map.

Gaussian process (GP) regression is more statistically efficient than KDE, and also estimates posterior covariance. This enables one to estimate confidence bounds for the inferred rate maps. GP regression requires solving a large linear system, but this can be accelerated using the minimum residual algorithm and calculating matrix products using the FFT. Sometimes, GP regression can be approximated by a convolution.

Log-Gaussian Cox process (LGCP) regression is a generalization of GP regression. It infers the log-firing rate under a Poisson noise assumption, and returns good estimates from limited data. The computational cost of LGCP regression is only slightly higher than GP regression.

Overall, we illustrated several approaches for applying Gaussian-process methods to hippocampal grid cell data. We covered kernel density estimation, Gaussian process regression, and log-Gaussian Cox process regression. Throughout, we discussed practical issues necessary to achieve good performance, like using the FFT when possible, choosing numerically stable forms of the equations, and fast approximations based on convolution. Ultimately, we derived an FFT-based approximation to log-Gaussian Cox process regression. This provides an approach to analyzing grid cell data that is both statistically and computationally efficient.

Appendix: numerical considerations in Gaussian Process (GP) and Log-Gaussian Cox Process (LGCP) regression

Tricks for faster linear algebra

We use several tricks to speed up large matrix calculations.

Matrix inversion and linear system solving using Cholesky factorization

If a covariance matrix is non-singular (has no zero eigenvalues), then it is positive definite. This means it has a Cholesky factorization. The default behavior of scipy.linalg.cholesky is to return an upper-triangular matrix $Q$ for which:

\begin{equation} \begin{aligned} Q &= \operatorname{chol}[\Sigma] \\ \Sigma &= Q^\top Q \end{aligned} \end{equation}

The routine scipy.linalg.lapack.dtrtri will invert an upper-triangular matrix quickly. We can leverage this to invert the matrix $\Sigma$ efficiently.

\begin{equation} \begin{aligned} \Sigma^{-1} = (Q^{\top}Q)^{-1} = Q^{-1} Q^{-\top} \end{aligned} \end{equation}

We can also use the Cholesky factorization to quickly compute $\Sigma^{-1}\mathbf u$:

\begin{equation} \begin{aligned} \Sigma^{-1}\mathbf u &= Q^{-1} Q^{-\top} \mathbf u \end{aligned} \end{equation}

The routine scipy.linalg.solve_triangular solves triangular systems quickly. Passing trans='T' solves $Q^\top \mathbf v = \mathbf u$, i.e. computes $\mathbf v = Q^{-\top}\mathbf u$; a second call without trans then gives $Q^{-1}\mathbf v = \Sigma^{-1}\mathbf u$.
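A sketch of both operations (assuming $\Sigma$ is positive definite):

```python
import numpy as np
import scipy.linalg

def cholesky_inverse(Sigma):
    Q = scipy.linalg.cholesky(Sigma)             # upper triangular, Sigma = Q^T Q
    Qinv, info = scipy.linalg.lapack.dtrtri(Q)   # fast triangular inverse
    return Qinv @ Qinv.T                         # Sigma^{-1} = Q^{-1} Q^{-T}

def cholesky_solve(Sigma, u):
    Q = scipy.linalg.cholesky(Sigma)
    v = scipy.linalg.solve_triangular(Q, u, trans='T')   # v = Q^{-T} u
    return scipy.linalg.solve_triangular(Q, v)           # Q^{-1} v = Sigma^{-1} u
```

scipy.linalg.cho_factor and cho_solve package the same pattern.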

Cholesky decomposition and inversion of a triangular matrix both have $\mathcal O(n^3)$ complexity, similar to standard matrix multiplication and inversion, but with better constant factors. Once the Cholesky factor is computed, solving a linear system has complexity $\mathcal O(n^2)$. While this does not change the asymptotic complexity, constant factors are important. An $L\times L$ grid has $L^2$ regions, and a covariance matrix over $L^2$ elements has $L^4$ entries. A matrix operation that has $n^3$ complexity therefore costs $\mathcal O(L^6)$ in terms of the linear dimension of our 2D grid.

Multiplication using the Fast Fourier Transform (FFT)

Typically, our prior covariance kernel $\kappa(\mathbf x,\mathbf x')$ is translationally invariant. When evaluated on a periodic grid, the resulting prior $\Sigma_0$ is circulant. For circulant matrices, matrix inversion and multiplication can be done in $n\,\lg(n)$ time using the FFT, where $n$ is the number of points in our spatial grid. Denote the forward Fourier transform as the operator $\text F$, and the inverse Fourier transform as $\text F^{-1}$. Define the Fourier transform of our prior kernel as

\begin{equation} \begin{aligned} \tilde \kappa(\omega) &= \text F\, \kappa(\Delta \mathbf x). \end{aligned} \end{equation}

The Fourier transform coefficients $\tilde \kappa$ are the eigenvalues of $\Sigma_0$. If we choose the unitary definition of the Fourier transform, $\text F^{-1}=\text F^\dagger$, then $\text F$ is the eigenbasis of $\Sigma_0$:

\begin{equation} \begin{aligned} \Sigma_0 &= \text F\,\operatorname{diag}[\tilde\kappa]\,\text F^\dagger. \end{aligned} \end{equation}

The inverse $\Sigma_0^{-1}$ can therefore be calculated as

\begin{equation} \begin{aligned} \Sigma_0^{-1} &= \text F\, \operatorname{diag}\left[1/{\tilde\kappa}\right]\,\text F^\dagger. \end{aligned} \end{equation}

The product $\Sigma_0 \mathbf u$ can be calculated via the FFT using the convolution theorem:

\begin{equation} \begin{aligned} \Sigma_0 \mathbf u &= \kappa \otimes \mathbf u = \text F^\dagger\left[\tilde \kappa\circ{\operatorname F\mathbf u}\right], \end{aligned} \end{equation}

where $\circ$ is element-wise multiplication. Likewise, if all $\tilde\kappa$ are greater than zero, then the linear system $\Sigma_0^{-1}\mathbf u$ can be solved as

\begin{equation} \begin{aligned} \Sigma_0^{-1} \mathbf u &= \text F^\dagger\left[\tfrac{1}{\tilde \kappa}\circ{\operatorname F\mathbf u}\right]. \end{aligned} \end{equation}
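A sketch of both operations for a circulant prior stored as an $L\times L$ kernel with its peak at index (0,0):

```python
import numpy as np

def circulant_multiply(kern, u):
    # Sigma_0 u = kappa (*) u, computed with the convolution theorem
    return np.real(np.fft.ifft2(np.fft.fft2(kern) * np.fft.fft2(u)))

def circulant_solve(kern, u, min_eig=1e-6):
    # Sigma_0^{-1} u, after clipping tiny eigenvalues for numerical stability
    Kf = np.maximum(np.real(np.fft.fft2(kern)), min_eig)
    return np.real(np.fft.ifft2(np.fft.fft2(u) / Kf))
```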

Multiplication by a diagonal matrix is element-wise multiplication

If $D=\operatorname{diag}[\mathbf d]$ is a diagonal matrix, multiplying $D$ by a vector $\mathbf u$ is simply an element-wise product:

\begin{equation} \begin{aligned} D\mathbf u &= \mathbf d\circ\mathbf u, \end{aligned} \end{equation}

And the matrix product $AD$ can be computed as a column-wise product

\begin{equation} \begin{aligned} D\mathbf u &= \mathbf d\circ\mathbf u \\ A D &= A \circ ({\tt1}\,\mathbf d^\top), \end{aligned} \end{equation}

where $\tt1$ is a column vector of ones.
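A two-line numerical check of these identities:

```python
import numpy as np

d, u, A = np.random.rand(4), np.random.rand(4), np.random.rand(4, 4)
assert np.allclose(np.diag(d) @ u, d * u)            # D u = d o u
assert np.allclose(A @ np.diag(d), A * d[None, :])   # A D scales columns of A by d
```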

Four ways to do Gaussian process regression

Form (a)

In most textbooks or tutorials you'll see the posterior mean written as:

\begin{equation} \begin{aligned} y_2 &\sim \mathcal N(\mu,\Sigma) \\ \mu &= \mu_2 + \Sigma_{12}^\top[\Sigma_{11}+\Sigma_{\epsilon}]^{-1} (y_1 - \mu_1) \\ \Sigma &= \Sigma_{22} - \Sigma_{12}^\top [\Sigma_{11}+\Sigma_{\epsilon}]^{-1} \Sigma_{12}, \end{aligned} \end{equation}

where $\Sigma_{11}$ and $\Sigma_{22}$ are the prior covariance of the observation and output points, respectively; $\Sigma_{12}$ is the cross-covariance of the observation and output points; $\Sigma_{\epsilon}$ is the measurement noise covariance; $\mu_2$ and $\mu_1$ are the prior mean at the observation and output points, respectively, and $y_1$ are the observations.

When observations are sparse, this form is computationally efficient, because $\Sigma_{11}$ and $\Sigma_{\epsilon}$ are small, and the expression $\Sigma_{12}^\top [\Sigma_{11}+\Sigma_{\epsilon}]^{-1} \Sigma_{12}$ is low-rank. This form is especially convenient when updating a GP model with a single observation, in which case it reduces to a rank-1 update:

\begin{equation} \begin{aligned} \mu &= \mu_2 + \tfrac{1}{\sigma^2_{11}+\sigma^2_{\epsilon}}\Sigma_{12}^\top (y_1 - \mu_1) \\ \Sigma &= \Sigma_{22} - \tfrac{1}{\sigma^2_{11}+\sigma^2_{\epsilon}} \Sigma_{12}^\top \Sigma_{12}. \end{aligned} \end{equation}

However, in our application we have an extended time-series where a rat visits each location many times. There are many more observations than output points, and $\Sigma_{11}$ and $\Sigma_{\epsilon}$ are large matrices. Binning together nearby measurements reduces the complexity. However, coverage of the space is quite dense. $\Sigma_{11}$ and $\Sigma_{\epsilon}$ are almost as large as $\Sigma_{22}$, even excluding bins with no observations. We therefore assume that the observations and output points are evaluated on a common grid, in which case the equations take the form:

\begin{equation} \begin{aligned} \mu &= \mu_0 + \Sigma_0^\top[\Sigma_0+\Sigma_{\epsilon}]^{-1} (y - \mu_0) \\ \Sigma &= \Sigma_0 - \Sigma_0^\top [\Sigma_0+\Sigma_{\epsilon}]^{-1} \Sigma_0, \end{aligned} \tag{a} \end{equation}

where $\Sigma_0$ and $\mu_0$ are the prior covariance and mean, $\Sigma_\epsilon$ is the measurement error covariance, and $y$ are the measurements.

This form has the following useful properties:

Problem: This form is unsuitable if $\Sigma_\epsilon^{-1}$ is singular. In order to leverage the FFT for matrix calculations, we need to perform the GP regression over a regular and periodic grid. To keep opposite ends of the space from interacting, we need to add as many zeros to the edge of our data as our kernel is wide. This means that, inherently, some bins will be missing data. Missing observations can be represented as zero precision (inverse variance), so $\Sigma_\epsilon^{-1}$ can be constructed, but it will not be invertible, so $\Sigma_\epsilon$ does not exist.

Form (b)

It's also common to encounter the following form, when deriving the posterior mean for the product of two multivariate Gaussian distributions:

\begin{equation} \begin{aligned} \mu &= \mu_0 + \left[\Sigma_0^{-1}+\Sigma_\epsilon^{-1}\right]^{-1} \Sigma_\epsilon^{-1} (y-\mu_0). \end{aligned} \tag{b} \end{equation}

This form has the following useful properties

Problem: This form is unsuitable if $\Sigma_0$ is singular, and is numerically unstable if $\Sigma_0$ is ill-conditioned. So, it requires regularization to ensure that no eigenvalue of $\Sigma_0$ is too small. For larger problems, it may require so much regularization that the prior $\Sigma_0$ is altered substantially, affecting the accuracy of the inferred posterior mean. Additionally, computing $\Sigma_0^{-1}$ via FFT requires adding zero padding to handle the circular boundary conditions correctly. This padding cannot be removed before solving the subsequent linear system without introducing boundary artifacts, so this can lead to a slightly larger linear system to solve. Since the time complexity of the solve is $\mathcal O(L^6)$ in the linear dimension $L$ of the 2D grid, this can lead to a significant slowdown.

Form (c)

We can also pull $\Sigma_0$ out of form (b), and solve for the posterior mean as:

\begin{equation} \begin{aligned} \mu &=\mu_0 + \left[\Sigma_0\Sigma_\epsilon^{-1} + I\right]^{-1} \left[ \Sigma_0\Sigma_\epsilon^{-1} (y-\mu_0) \right]. \\ \end{aligned} \tag{c} \end{equation}

This form has the following useful properties:

Problem: The matrix $\Sigma_0\Sigma_\epsilon^{-1} + I$ will not be symmetric in general, so we cannot use Cholesky factorization to solve the linear system here. This can lead to significant slow-downs for larger systems. However! Once $\Sigma_0\Sigma_\epsilon^{-1} y$ is computed via FFT, one can remove the zero-padding before solving the linear system, which reduces the problem size. Since computing circulant matrix operations using FFT requires as much zero-padding as our prior kernel is wide, this reduction in problem size can be significant. Since solving the linear system scales as $\mathcal O(L^6)$ for an $L\times L$ spatial grid, a small reduction in $L$ is much more important than the constant-factor improvement provided by Cholesky decomposition.

Additionally, for translationally-invariant kernels on a regular grid, the matrix-vector product $[\Sigma_0\Sigma_\epsilon^{-1} + I]\mathbf u$ can be calculated quickly via FFT. This can be used with a Krylov-subspace solver like minres, and can be orders of magnitude faster than other approaches.

Form (d)

In the special case that the measurement error covariance is constant, $\Sigma_\epsilon = \sigma^2_\epsilon I$, then GP regression reduces to a convolution, and can be evaluated using the Fourier transform thanks to the convolution theorem:

\begin{equation} \begin{aligned} \mu &= \mu_0 + \text F^{-1}\,\left[ \left(\frac{\tilde\kappa}{\tilde\kappa+\sigma^2_\epsilon}\right)\circ\text F\,(y-\mu_0)\right], \end{aligned} \tag{d} \end{equation}

where $\text F$ is the Fourier transform.

Problem: The regression must be on a periodic domain in order to compute $\text F$ quickly using the FFT. This can cause measurements at opposite sides of the spatial grid to influence each other. This can be solved by zero-padding. However, this form assumes the same $\sigma^2_\epsilon$ for all observations, so the zero padding will also be treated as a $y=0$ observation with error $\sigma^2_\epsilon$. This will lead to some tapering toward zero at the boundaries. Alternatively, one may use reflected boundary conditions, copying a mirrored version of $y$ into the padded space. This reflected boundary condition reduces artifacts, but is not identical to solving the original, un-padded GP regression problem.

Note: if $\Sigma_\epsilon$ is not constant, but changes slowly relative to the scale of the prior kernel, then one may also decompose a larger problem into smaller ones that tile the space.

When to use which?

Use form (a) when

Use form (b) when

Use form (c) when

Use form (d) when