### Difference of covariances method

Neural responses in higher cortical areas seem rather elusive. Three observations about these responses are largely responsible for this seeming elusiveness (I will focus on the first two observations in this post): i) they display mixed selectivity, meaning that they respond to a combination of task variables, such as the identity of the stimuli presented on a given trial, the decision made on that trial, the context or the rule in effect on that trial etc., rather than to individual task variables alone; ii) they are very heterogeneous, meaning that the particular combination of task variables a neuron responds to can be wildly different from the combination that the next neuron responds to; iii) neurons can be responsive to task-irrelevant variables, such as the elapsed time during the delay period of a working memory task.

However, this apparent complexity may be misleading. A simple model where the responses of individual neurons are just random, linear mixtures of a small number of underlying response components that themselves do not display mixed selectivity or heterogeneity may be able to explain the apparent complexity of responses at the single neuron level. Indeed, it turns out that this simple model does a pretty good job of explaining the responses of individual neurons, at least for neurons in prefrontal cortex during the delay period of working memory tasks (see, e.g., Machens, Romo & Brody, 2010). This implies that population dynamics may be low-dimensional in these situations.

Let’s be a little more concrete. Suppose we suspect that there are essentially two task  variables neurons respond to, $s$ and $d$ (for stimulus and decision respectively, although they can be anything). The claim is then that the responses of neurons can be represented as follows (Machens, 2010):

$\mathbf{r}(t,s,d) = \mathbf{a}_1 z_1(t,s) + \mathbf{a}_2 z_2(t,d) + \mathbf{c} + \mathbf{n}(t)$

where $\mathbf{r}$ is the vector of neural responses, $z_1$ and $z_2$ are the underlying response components (same for all neurons, no heterogeneity) that depend only on $s$ and $d$ respectively (no mixing), and $\mathbf{a}_1$ and $\mathbf{a}_2$ are vectors of mixing coefficients for the neurons. $\mathbf{n}(t)$ is a noise vector that I will ignore for the rest of this post. As mentioned before, it is assumed that the mixing coefficients of neurons are random and independent of each other so that $\mathbf{a}_1$ and $\mathbf{a}_2$ are orthogonal on average, i.e. $\langle \mathbf{a}_1^T \mathbf{a}_2 \rangle = 0$ (the angular brackets denote an average). Note that if this model gives an accurate description of neural responses, it means that although there are $N$ neurons, trajectories of neural responses actually live in a two-dimensional subspace (spanned by $\mathbf{a}_1$ and $\mathbf{a}_2$) rather than in the full $N$-dimensional space. This is what we mean by ‘low-dimensional population dynamics.’ Of course the noise term might push the trajectories out of this two-dimensional subspace (assuming it’s sufficiently large), but I am ignoring the noise term here.

I should note here that population dynamics in higher cortical areas does not always seem to be low-dimensional. There is some evidence for high-dimensional dynamics in similar tasks as well (e.g. Harvey, Coen & Tank, 2012; Pastalkova, Itskov, Amarasingham & Buzsaki, 2008; Baeg, Kim, Huh, Mook-Jung, Kim, Jung, 2003). I will hopefully write about possible causes and consequences of this discrepancy in a later post.

Now, returning to our equation above, $\mathbf{r}(t,s,d)$ is observable. We can record the responses of neurons during the delay period and sort the responses according to the time $t$ into the delay period, the stimulus $s$ presented prior to the delay and the decision $d$ made after the delay. Then the question is: given only $\mathbf{r}(t,s,d)$, how can we recover the underlying unmixed response components $z_1$, $z_2$ and the mixing coefficients $\mathbf{a}_1$, $\mathbf{a}_2$? A simple suggestion is to use principal component analysis (PCA). Unfortunately, PCA does not work in this case, because although it would recover the two-dimensional subspace spanned by $\mathbf{a}_1$ and $\mathbf{a}_2$, in general it would not recover $\mathbf{a}_1$ and $\mathbf{a}_2$ themselves. Instead, PCA finds the orthogonal directions $\mathbf{u}_1$ and $\mathbf{u}_2$ along which variation in $\mathbf{r}(t,s,d)$ is maximal. Mathematically,  $\mathbf{u}_1$ and $\mathbf{u}_2$ are the eigenvectors of the covariance matrix $C$ with the largest eigenvalues. The covariance matrix is given by:

$C = \langle (\mathbf{r}(t,s,d) - \mathbf{r})(\mathbf{r}(t,s,d) - \mathbf{r})^T\rangle_{t,s,d}$

where $\mathbf{r}$ is the mean responses of neurons averaged over $t, s, d$. The eigenvector basis $U = [\mathbf{u}_1, \mathbf{u}_2]$ maximizes $tr(U^T C U)$ which is the total variance of the responses in the subspace. However, there is no reason to expect the directions of maximum total variance to coincide neatly with the directions of maximum variance due to changes in $s$ or changes in $d$ alone. To illustrate this, consider the following simple example:

In this example, there are two neurons whose responses are represented by $r_1$ and $r_2$. We assume that $s$ and $d$ are binary variables that can take the values 0 and 1. For $s =1, d = 1$, the responses of the neurons are shown by the point A, for $s =1, d = 0$ by point B, for $s =0, d = 0$, by point C and for $s = 0, d = 1$ by point D. Thus, intuitively, most of the response variation along the y-axis is caused by changing $s$ and most of the variation along the x-axis is caused by changing $d$.

The first and the second eigenvectors of the covariance matrix of these four points are represented by $u_1$ and $u_2$ respectively. When projected onto this eigenvector basis, the total variance of the data points is maximized, irrespective of whether the variance is caused by changes in $s$ or by changes in $d$. However, the two orthogonal directions that maximize the variance due to changes in $s$ alone and the variance due to changes in $d$ alone are $a_1$ and $a_2$ (unit vectors along the x- and y-axis respectively). So, the two bases differ.

In order to find the direction that maximizes the variance due to changes in $s$ alone, Machens (2010) suggests that we maximize $tr(U_1^T C_s U_1)$ where $C_s$ is the covariance of responses due to changes in $s$ alone:

$C_s = \langle (\mathbf{r}(t,s,d) - \mathbf{r}(t,d))(\mathbf{r}(t,s,d) - \mathbf{r}(t,d))^T\rangle_{t,s,d}$

where $\mathbf{r}(t,d) = \langle \mathbf{r}(t,s,d)\rangle_s$ is the vector of mean responses marginalized over $s$. Similarly, to find the direction that maximizes the variance due to changes in $d$ alone, we maximize $tr(U_2^T C_d U_2)$ where $C_d$ is the covariance of responses due to changes in $d$ alone:

$C_d = \langle (\mathbf{r}(t,s,d) - \mathbf{r}(t,s))(\mathbf{r}(t,s,d) - \mathbf{r}(t,s))^T\rangle_{t,s,d}$

where $\mathbf{r}(t,s) = \langle \mathbf{r}(t,s,d)\rangle_d$. Combining the two into a single objective, we can maximize $tr(U_1^T C_s U_1) + tr(U_2^T C_d U_2)$ with the additional constraint that $U_1$ and $U_2$ be orthogonal to each other. It turns out that this optimization problem is equivalent to finding the eigenvectors of the difference matrix, $C_s - C_d$ (hence the name difference of covariances), so it can be solved very efficiently (see the Appendix in Machens, 2010). Applied to our problem above, it can be shown that this method returns $\mathbf{a}_1$ and $\mathbf{a}_2$ as solutions to $U_1$ and $U_2$ respectively. Hence, unlike PCA, it recovers the correct mixing coefficients in our problem.

Below, I applied this method to a toy problem with 100 neurons whose responses were simulated according to:

$\mathbf{r}(t,s,d) = \mathbf{a}_1 s \exp(- (t-400)^2 / 900) + \mathbf{a}_2 d \exp(- (t-600)^2 / 900)$

where $s$ took five possible values from 0 to 25 and $d$ took two possible values, either -10 or 10 (Matlab code available here). The mixing coefficients were all randomly drawn from a standard normal distribution. Note that in this example, $z_1(t,s) = s \exp(- (t-400)^2 / 900)$ and $z_2(t,d) = d\exp(- (t-600)^2 / 900)$. The following is a picture of the time-course of the responses of three selected neurons for all combinations of $s$ and $d$ values. Trajectories live in a two-dimensional subspace spanned by $\mathbf{a}_1$ and $\mathbf{a}_2$.

Temporal evolution of the responses of three selected neurons. Red lines show trajectories for $d = 10$; green lines show trajectories for $d = -10$. Different shades of color correspond to different s values. The black circle marks the (0,0,0) point.

The following figure shows the underlying response functions $z_1(t,s)$ returned by PCA and by the difference of covariances method. The difference of covariances method recovers the correct unmixed response function nearly perfectly, whereas the function returned by PCA still mixes $s$-dependence with $d$-dependence.

The response functions $z_1(t,s)$ found by PCA and the difference of covariances method. Different colors correspond to different $s$ values. x-axis shows the time.

The results for the other response function $z_2(t,d)$ are shown below. Again, the difference of covariances method recovers the correct unmixed function nearly perfectly, whereas the function returned by PCA still mixes $s$-dependence with $d$-dependence. In general, PCA and the difference of covariances method return different solutions when there are temporal correlations between $z_1(t,s)$ and $z_2(t,d)$ or when the variance generated by changes in $s$ is similar to the variance generated by changes in $d$.

The response functions $z_2(t,d)$ found by PCA and the difference of covariances method. Different colors correspond to different $d$ values. x-axis shows the time.

I should finally note that there is a more recent, and apparently better, version of this method that works for nonlinearly mixed components (not just for linearly mixed components) and also provides a probabilistic interpretation of the method: see this paper by Brendel, Romo & Machens (2011).

An interesting question to think about is why there is mixing in neural responses in the first place. It has recently been suggested that mixed responses are useful for flexible, context-dependent computations (e.g. Rigotti, Rubin, Wang & Fusi, 2010; Rigotti, Barak, Warden, Wang, Miller & Fusi, 2013; Mante, Sussillo, Shenoy & Newsome, 2013). However, mixed responses are not always desirable. I currently do a little bit of work on this myself. For example, mixing responses to multiple stimuli turns out to be a really bad idea, if the task is to estimate those stimuli simultaneously. I am hoping to write more about this in the next couple of months. An interesting question is to determine, as precisely as one can hope, the types of tasks for which mixed responses are useful versus those for which they are not.