Computational neuroscience and machine learning

## Month: February, 2014

### The beauty of circulant matrices (and of Toeplitz matrices, to a lesser extent)

A circulant matrix is a matrix where each row (column) is a cyclic shift of the preceding row (column):

$C = \begin{bmatrix} c_0 & c_1 & c_2 & \cdots & c_{n-1} \\ c_{n-1} & c_0 & c_1 & \cdots & c_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_1 & c_2 & c_3 & \cdots & c_0 \end{bmatrix}$

Note that a single row (or column) completely determines the entire matrix.

The beauty of circulant matrices lies in the fact that they are all diagonalized in the Fourier basis, that is, every circulant matrix $C$ can be decomposed as:

$C = U \Psi U^{\ast}$

where $U$ is the unitary discrete Fourier transform matrix, $U^{\ast}$ is its conjugate transpose and $\Psi$ is the diagonal matrix of eigenvalues of $C$. The eigenvalues can be found by taking the discrete Fourier transform of the first row of $C$.

Here are two immediate consequences of the above fact:

1. The inverse of a circulant matrix $C$ is also circulant. Indeed, if $C = U \Psi U^{\ast}$, then $C^{-1} = U \Psi^{-1} U^{\ast}$.
2. The sum or product of two circulant matrices is also circulant: if $C = U \Psi U^{\ast}$ and $D = U \Phi U^{\ast}$, then $C+D = U (\Psi +\Phi) U^{\ast}$ and $CD = DC = U \Psi \Phi U^{\ast}$.

So, it’s extremely easy to work with circulant matrices. Unfortunately, these properties do not generalize to Toeplitz matrices, i.e. matrices with constant diagonals. However, it turns out that if one thinks of a sequence of Toeplitz matrices $T_n$ defined by the $2n - 1$ entries $t_{-n+1:n-1}$, provided certain conditions on $T_n$ are satisfied, one can find (or construct) a corresponding sequence of circulant matrices $C_n$ such that $C_n$ and $T_n$ are asymptotically equivalent, i.e. roughly speaking they behave similarly in the limit of large $n$ and thus properties similar (but not identical, for example $T_n$ will not diagonalize in the Fourier basis even for large $n$) to the ones mentioned above for circulant matrices also hold for $T_n$.

This result may be relevant to computational neuroscience, because in some cases, one wants to perform operations on covariance matrices (for example, inverting the covariance matrix of neural responses for Fisher information calculations) that are Toeplitz, but not necessarily circulant: for example, when correlations between neurons decay as a function of the difference between their preferred stimuli in non-circular stimulus dimensions (e.g. spatial frequency or spatial location). The fact that $C_n$ and $T_n$ are only asymptotically equivalent (i.e., in the large $n$ limit) is not a problem, because that’s the limit one is interested in in most cases in computational neuroscience (the limit of a large number of neurons).

### Note on the Hopfield model

To simplify is the greatest form of sophistication. -Leonardo da Vinci

I wrote a brief review on some basic properties of the Hopfield model. It is available here. I really like the Hopfield model, because I think one should always work with the simplest model one can get away with and the Hopfield model is certainly one of the simplest models one can use to study collective phenomena in neural networks. Of course, one cannot explain everything in terms of Hopfield-type models, but surprisingly, in a lot of cases, the level of abstraction provided by the Hopfield turns out to beĀ  sufficient to qualitatively explain interesting phenomena. For example, here is a nice paper that explains, among other things, why spontaneous activity in the cortex should have the same structure as stimulus-evoked activity as has been observed experimentally, using a very simple Hopfield-type model.

Also, there is a large and rich body of theoretical results on the Hopfield model and its variants, thanks in large measure to the efforts of physicists. So, people who want to study an interesting collective behavior in neural networks can quickly build on an already existing literature.

### Hierarchical probabilistic inference for experimental psychologists

On a gross scale, I think of the sensory cortices (not the entire brain, just the sensory cortices) as highly sophisticated, hierarchical probabilistic inference engines that are adapted to the statistical structure of the natural world we live in. This view may be loosely (and rather cumbersomely) called the HelmholtzLee & MumfordFriston theory of sensory cortices and has both perceptual and neural implications. What are some general consequences of viewing sensory cortices in this way? Below, I will try to review some general properties of probabilistic inference in hierarchical models that tend to hold regardless of the detailed structure of the model and discuss some perceptual implications of these properties without going into potential neural implications (which I may do in a later post). I have found that most experimental psychologists are unfamiliar with thinking in terms of hierarchical probabilistic models, which I find unfortunate, because as I will argue below, thinking in these terms can help them understand a vast array of diverse phenomena within a coherent, unified framework. So if at least one thing in this post is potentially interesting, useful, helpful etc. to at least one experimental psychologist reading this post, it will have achieved its objective.

Figure 1. Probabilistic inference in a simple hierarchical graphical model.

First a brief non-technical overview of inference in hierarchical probabilistic models. The figure above illustrates how probabilistic inference works within the context of a toy probabilistic graphical model. In this toy example (Figure 1A), we assume that stimuli $s$ in the environment belong to one of three categories $C=0,1,2$. Category $C=0$ corresponds to a uniform distribution over a certain range, whereas categories $C=1$ and $C=2$ are normal distributions with different means. A priori probabilities of the three categories are assumed to be the same in the environment (Figure 1B). A priori probability of the stimulus, $p(s)$, is thus an equal proportion mixture of two normal distributions and a uniform distribution, and has a bimodal shape (Figure 1C). Let’s assume that the observer has already learned an accurate prior model $p(C,s)$ of the environment through prior experience with it. Suppose now that the observer is presented with stimuli from this environment. The observation process is assumed to be stochastic due to internal noise on the part of the observer. Thus, the observer does not have access to the actual stimulus value $s$, but rather a noisy measurement of it denoted by $x$. Given a noisy measurement $x$, the observer tries to infer both the actual stimulus $s$, as well as its category $C$. This is achieved through probabilistic inference, combining the prior distribution $p(s,C)=p(s|C)p(C)$ with the likelihood $p(x|s,C) = p(x|s)$ to compute a posterior distribution over $s$ and $C$. The posterior distribution can then be used to make point estimates of $s$ and $C$. $C$ and $s$ are called unobservable or latent variables (because they are not directly observable), whereas $x$ is the only observable variable in the model.

Crucially, the quality of the posterior distributions depends on the quality of the measurements $x$. Figures D-F and G-I show the posterior distributions and the likelihoods under low-noise and high-noise conditions respectively for three different noisy measurements (represented by different colors). In the low-noise condition, the measurements are precise (as demonstrated by the narrow likelihoods in Figure F). This leads to precise posterior distributions over both $s$ (Figure E) and $C$ (Figure D). When the noise is high, on the other hand, the measurements become less precise (Figure I), which makes the posterior distributions over both $s$ and $C$ less precise (Figures G-H). In addition to their dependence on the quality of the observable variables, let’s highlight two other important features of the posterior distributions shown in Figures G-H. First, for observations $x$ that are closer to the high probability regions under the prior, the posteriors $p(s|x)$ are more precise (compare, for instance, the red distribution with the blue and the green distributions in Figure H). This means that high probability regions under the prior are represented better. Although this is also true for the low-noise condition, it becomes more prominent under the high-noise condition. Secondly, the noise affects lower level variables more than the higher level variables. This can be seen by comparing the effects of higher noise on the posterior distributions over $s$ and $C$ respectively. Although higher noise makes both of these posteriors less precise, it has a more dramatic effect on the posterior distribution over $s$.

Although Figure 1 presents only a toy model, three main intuitions gained from this model tend to generalize to more complex probabilistic generative models: namely, (i) that measurement noise degrades the quality of all posterior distributions in the model, (ii) measurement noise affects the representations of lower level variables more than those of higher-level variables and (iii) the representation of higher probability regions under the prior (i.e. stimuli that are a priori more likely) is enhanced relative to the representation of low probability regions.

Now, the perceptual implications of these properties. I think these three simple, intuitive properties can explain (at least qualitatively) a vast amount of literature. Property (i) explains the effects of stimulus manipulations such as changing the contrast or presentation duration of a stimulus. Property (ii) explains why it is in general easier to perceive, recognize or remember the “gist” of an image, such as the category of a scene or an object (a beach scene, a chair, a car etc.) than lower-level details about it (e.g. Brady, Konkle, Alvarez & Oliva, 2008). Property (iii) explains why natural stimuli are in general encoded, processed, recognized, remembered, perceived etc. better than unnatural stimuli (e.g. Parraga, Troscianko & Tolhurst, 2000; Biederman, 1972).

These are all basic, across-the-board effects that would influence the processing of stimuli in all types of tasks and they can all be explained in a simple and straightforward way as consequences of hierarchical probabilistic inference in a generative model that is adapted to the statistical structure of the natural world, without invoking any extraneous (and in many cases dubious) mechanisms such as attention. There is nothing experimental psychologists like more than invoking attention to explain something. For example, it has been suggested that statistical regularities capture or guide attention, leading to enhanced perceptual performance (e.g. in visual search tasks). But is it really necessary to invoke attention here? Property (iii) suggests that, other things being equal, more frequently encountered stimuli will automatically be represented better. So, maybe the reason for enhanced perceptual performance for statistical regularities is just that a chunk of the cortex (however small it may be) is allocated to encoding stuff that displays statistical regularities versus none for random stuff (or less dramatically, relatively more of the cortex is allocated to encoding stuff that displays statistical regularities).