### Hierarchical probabilistic inference for experimental psychologists

On a gross scale, I think of the sensory cortices (not the entire brain, just the sensory cortices) as highly sophisticated, hierarchical probabilistic inference engines that are adapted to the statistical structure of the natural world we live in. This view may be loosely (and rather cumbersomely) called the Helmholtz-Lee & Mumford-Friston theory of sensory cortices, and it has both perceptual and neural implications. What are some general consequences of viewing sensory cortices in this way? Below, I will try to review some general properties of probabilistic inference in hierarchical models that tend to hold regardless of the detailed structure of the model, and discuss some perceptual implications of these properties without going into potential neural implications (which I may do in a later post). I have found that most experimental psychologists are unfamiliar with thinking in terms of hierarchical probabilistic models, which I find unfortunate because, as I will argue below, thinking in these terms can help them understand a vast array of diverse phenomena within a coherent, unified framework. So if at least one thing in this post is interesting, useful or helpful to at least one experimental psychologist reading it, it will have achieved its objective.

Figure 1. Probabilistic inference in a simple hierarchical graphical model.

First, a brief non-technical overview of inference in hierarchical probabilistic models. The figure above illustrates how probabilistic inference works within the context of a toy probabilistic graphical model. In this toy example (Figure 1A), we assume that stimuli $s$ in the environment belong to one of three categories $C=0,1,2$. Category $C=0$ corresponds to a uniform distribution over a certain range, whereas categories $C=1$ and $C=2$ are normal distributions with different means. A priori probabilities of the three categories are assumed to be the same in the environment (Figure 1B). The a priori probability of the stimulus, $p(s)$, is thus an equal-proportion mixture of two normal distributions and a uniform distribution, and has a bimodal shape (Figure 1C). Let’s assume that the observer has already learned an accurate prior model $p(s,C)$ of the environment through prior experience with it. Suppose now that the observer is presented with stimuli from this environment. The observation process is assumed to be stochastic due to internal noise on the part of the observer. Thus, the observer does not have access to the actual stimulus value $s$, but only to a noisy measurement of it, denoted by $x$. Given a noisy measurement $x$, the observer tries to infer both the actual stimulus $s$ and its category $C$. This is achieved through probabilistic inference, combining the prior distribution $p(s,C)=p(s|C)p(C)$ with the likelihood $p(x|s,C) = p(x|s)$ to compute a posterior distribution over $s$ and $C$. The posterior distribution can then be used to make point estimates of $s$ and $C$. $C$ and $s$ are called unobservable or latent variables (because they are not directly observable), whereas $x$ is the only observable variable in the model.
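For readers who prefer to see it in symbols, the inference step described above is just Bayes' rule applied to the joint prior and the likelihood. Nothing new is assumed here beyond what is stated in the text; the primed variables $s'$ and $C'$ are only dummy notation for the normalizing sum and integral.

$$
p(s, C \mid x) = \frac{p(x \mid s)\, p(s \mid C)\, p(C)}{\sum_{C'} \int p(x \mid s')\, p(s' \mid C')\, p(C')\, ds'}
$$

The posteriors over the individual latent variables then follow by marginalizing the other one out:

$$
p(s \mid x) = \sum_{C} p(s, C \mid x), \qquad p(C \mid x) = \int p(s, C \mid x)\, ds
$$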

Crucially, the quality of the posterior distributions depends on the quality of the measurements $x$. Figures 1D-F and 1G-I show the posterior distributions and the likelihoods under low-noise and high-noise conditions respectively, for three different noisy measurements (represented by different colors). In the low-noise condition, the measurements are precise (as demonstrated by the narrow likelihoods in Figure 1F). This leads to precise posterior distributions over both $s$ (Figure 1E) and $C$ (Figure 1D). When the noise is high, on the other hand, the measurements become less precise (Figure 1I), which makes the posterior distributions over both $s$ and $C$ less precise (Figures 1G-H). In addition to their dependence on the quality of the observable variables, let’s highlight two other important features of the posterior distributions shown in Figures 1G-H. First, for observations $x$ that are closer to the high-probability regions under the prior, the posteriors $p(s|x)$ are more precise (compare, for instance, the red distribution with the blue and the green distributions in Figure 1H). This means that high-probability regions under the prior are represented better. Although this is also true for the low-noise condition, it becomes more prominent under the high-noise condition. Second, the noise affects lower-level variables more than higher-level variables. This can be seen by comparing the effects of higher noise on the posterior distributions over $s$ and $C$ respectively. Although higher noise makes both of these posteriors less precise, it has a more dramatic effect on the posterior distribution over $s$.
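To make the comparison between noise conditions concrete, here is a minimal numerical sketch of the toy model that computes the posteriors on a grid. Everything numeric in it is an assumption made for illustration, since the post does not give the values behind Figure 1: the stimulus range, the category means and standard deviation, the Gaussian form of the measurement noise, and the two noise levels. The summary measures (entropy of $p(C|x)$ and standard deviation of $p(s|x)$) are likewise just one convenient, and rather crude, way of quantifying "precision".

```python
import numpy as np

# Hypothetical parameters: none of these values come from the post.
S_GRID = np.linspace(-10.0, 10.0, 2001)    # grid over stimulus values s
DS = S_GRID[1] - S_GRID[0]                 # grid spacing for numerical integration
PRIOR_C = np.array([1 / 3, 1 / 3, 1 / 3])  # equal a priori category probabilities
MEANS = (-3.0, 3.0)                        # assumed means of categories C=1 and C=2
CATEGORY_SD = 1.0                          # assumed common SD of the two normals


def normal_pdf(v, mean, sd):
    """Density of a normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def prior_s_given_c():
    """Rows are p(s | C) for C = 0, 1, 2, evaluated on S_GRID."""
    uniform = np.full_like(S_GRID, 1.0 / (S_GRID[-1] - S_GRID[0]))
    return np.vstack([
        uniform,
        normal_pdf(S_GRID, MEANS[0], CATEGORY_SD),
        normal_pdf(S_GRID, MEANS[1], CATEGORY_SD),
    ])


def posterior(x, noise_sd):
    """Return p(C | x) and p(s | x) on S_GRID, assuming Gaussian measurement noise."""
    likelihood = normal_pdf(x, S_GRID, noise_sd)                # p(x | s)
    joint = prior_s_given_c() * PRIOR_C[:, None] * likelihood   # unnormalized p(s, C | x)
    joint /= joint.sum() * DS                                   # normalize over s and C
    post_c = joint.sum(axis=1) * DS                             # integrate out s
    post_s = joint.sum(axis=0)                                  # sum out C
    return post_c, post_s


def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def posterior_sd(post_s):
    """Standard deviation of a density defined on S_GRID."""
    mean = (S_GRID * post_s).sum() * DS
    return float(np.sqrt((((S_GRID - mean) ** 2) * post_s).sum() * DS))


# Compare an assumed low-noise and high-noise condition for one measurement.
for noise_sd in (0.5, 3.0):
    post_c, post_s = posterior(x=2.0, noise_sd=noise_sd)
    print(noise_sd, np.round(post_c, 3), entropy_bits(post_c), posterior_sd(post_s))
```

Changing `x` and `noise_sd` in the loop lets you check for yourself how the two posteriors respond to the position of the measurement relative to the prior and to the amount of noise. A grid is used instead of closed-form expressions only because it keeps the sketch short and works unchanged if you swap in a different prior.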

Although Figure 1 presents only a toy model, three main intuitions gained from this model tend to generalize to more complex probabilistic generative models: namely, (i) that measurement noise degrades the quality of all posterior distributions in the model, (ii) that measurement noise affects the representations of lower-level variables more than those of higher-level variables, and (iii) that the representation of higher-probability regions under the prior (i.e. stimuli that are a priori more likely) is enhanced relative to the representation of low-probability regions.

Now, the perceptual implications of these properties. I think these three simple, intuitive properties can explain (at least qualitatively) a vast number of findings in the literature. Property (i) explains the effects of stimulus manipulations such as changing the contrast or presentation duration of a stimulus, both of which can be thought of as changing the quality of the measurement. Property (ii) explains why it is in general easier to perceive, recognize or remember the “gist” of an image, such as the category of a scene or an object (a beach scene, a chair, a car etc.), than lower-level details about it (e.g. Brady, Konkle, Alvarez & Oliva, 2008). Property (iii) explains why natural stimuli are in general encoded, processed, recognized, remembered and perceived better than unnatural stimuli (e.g. Parraga, Troscianko & Tolhurst, 2000; Biederman, 1972).

These are all basic, across-the-board effects that would influence the processing of stimuli in all types of tasks. They can all be explained in a simple and straightforward way as consequences of hierarchical probabilistic inference in a generative model that is adapted to the statistical structure of the natural world, without invoking any extraneous (and in many cases dubious) mechanisms such as attention. There is nothing experimental psychologists like more than invoking attention to explain something. For example, it has been suggested that statistical regularities capture or guide attention, leading to enhanced perceptual performance (e.g. in visual search tasks). But is it really necessary to invoke attention here? Property (iii) suggests that, other things being equal, more frequently encountered stimuli will automatically be represented better. So maybe the reason for enhanced perceptual performance for statistical regularities is just that a chunk of the cortex (however small it may be) is allocated to encoding stuff that displays statistical regularities, whereas none is allocated to random stuff (or, less dramatically, relatively more of the cortex is allocated to encoding stuff that displays statistical regularities).