Researchers have developed methods to deal with this degeneracy problem and thereby overcome training difficulties in RNNs. One of the best known is identity initialization of the recurrent weight matrix. Others have proposed constraining the weight matrix to remain orthogonal throughout training, rather than orthogonalizing it at initialization only. The logic behind both methods is that, since orthogonal transformations are isometries of Euclidean space, applying a cascade of such transformations does not degenerate the metric (by “degeneration” here, I mean the collapse of the metric along the overwhelming majority of directions in the input space and the astronomical expansion of the metric along the few remaining directions). This is guaranteed in the linear case and, again, one hopes and expects (with some justification) that things are not all that different in the nonlinear case. In other words, a sequence of orthogonal transformations propagates vectors through Euclidean space without distortion, i.e. without changing their norms or the distances between them.

This is all true and fine; however, this analysis ignores a crucial factor in training neural networks, namely the effect of *noise*. Noise comes in both through the stochasticity of SGD and sometimes through direct noise injection (as in Dropout) for regularization purposes. It is hard to precisely characterize the noise that arises from SGD, but let us assume for simplicity that it is additive, so that what we propagate in the end is some kind of “signal + noise”. Now, although it is true that orthogonal transformations propagate the signal without distortion, they also propagate the noise without distortion. But, ultimately, we probably want a transformation that maximizes something like the signal-to-noise ratio (SNR) of the propagated signal + noise. It is then no longer obvious that orthogonal transformations are optimal for this purpose: one can, for example, imagine transformations that amplify the signal more than they amplify the noise (hence distorting both the signal and the noise), thus yielding a better SNR than an orthogonal transformation.

And indeed it turns out that for linear systems with additive Gaussian noise, one can mathematically show that the optimal transformations (in the sense of maximizing the total SNR of the propagated signal + noise) are *not orthogonal*. In fact, one can say something even stronger: any optimal transformation has to be non-normal (a normal matrix is a unitarily diagonalizable matrix; all orthogonal matrices are normal, but the reverse is not true). This is the main result of this beautiful and insightful paper by Surya Ganguli and colleagues. Perhaps the simplest example of an optimal transformation in this sense is a feedforward chain: $W_{ij} = \delta_{i,j+1}$, where $\delta$ is the Kronecker delta function. This particular example maximizes the total SNR through a mechanism known as transient amplification: it exponentially amplifies the norm of its input transiently before the norm eventually decays to zero.

This brings me to the main message of this post: *the commonly used orthogonal initializations for recurrent neural networks are likely suboptimal because of the often neglected effect of noise*. Further evidence for this claim comes from looking at trained recurrent connectivity matrices in tasks that require memory. In this work (currently under review), we have shown that the trained recurrent connectivity matrices in such tasks always end up non-normal, with a feedforward structure hidden in the recurrent connectivity, even when they are initialized with an approximately normal matrix. How non-normal the trained matrices end up depends on a wide range of factors, and investigating those factors was the main motivation for our paper. So, initializing RNNs with a non-normal matrix could be a useful inductive bias for these networks.

In ongoing work, I have been investigating the merits of various non-normal initialization schemes for nonlinear RNNs. One non-normal initialization scheme that seems to work quite well (and is very easy to implement) combines an identity matrix (or a scaled identity matrix) with a chain structure (which Ganguli et al. showed to be optimal for a linear model with additive Gaussian noise). I hope to share more details on these results in the coming weeks. Another open question at this point is whether non-normal initialization schemes are also useful for the more commonly used gated recurrent architectures such as LSTMs and GRUs. These often behave very differently from vanilla recurrent networks, so I am not sure whether non-normal dynamics will be as useful in these architectures as it is in vanilla RNNs.
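As a rough sketch of the kind of scheme I mean (the function name and the specific scale values `alpha` and `beta` are placeholders of my own, not tuned recommendations):

```python
import numpy as np

def identity_plus_chain_init(n, alpha=0.5, beta=1.0):
    """Initialize an n x n recurrent weight matrix as a scaled identity
    plus a feedforward chain on the subdiagonal. The chain term makes
    the matrix non-normal (W W^T != W^T W)."""
    W = alpha * np.eye(n)
    # Chain structure: unit j feeds forward into unit j+1 only.
    W += beta * np.diag(np.ones(n - 1), k=-1)
    return W

W = identity_plus_chain_init(100)
# A matrix is normal iff it commutes with its transpose;
# the chain term breaks this property.
commutator = W @ W.T - W.T @ W
print(np.abs(commutator).max() > 0)  # True: W is non-normal
```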

Coming up with the right inductive biases for a particular type of task (or types of tasks) is not always straightforward and it is, in my mind, one of the things that make machine learning a creative enterprise. Here, by the “right inductive biases”, I mean inductive biases that (i) only exploit the structure in the problem (or problems) we are interested in and nothing more or less, but (ii) are also flexible enough that if the same model is applied to a problem that doesn’t display the relevant structure exactly, the model doesn’t break down disastrously (some “symbol”-based neural machines may suffer from such fragility).

In this post, I’d like to briefly highlight two really nice recent papers that introduce very simple inductive biases that enable neural networks to train faster and generalize better in particular types of problems.

The first one is from Uber AI: An intriguing failing of convolutional neural networks and the CoordConv solution. In this paper, the authors first observe that state-of-the-art convolutional networks fail quite badly at tasks that require spatial coordinate transformations, for example, changing from Cartesian coordinates to image-based coordinates or vice versa (e.g. given the Cartesian coordinates $(x, y)$, draw a square of a certain size centered at $(x, y)$). This may not be too surprising, since convolutional networks are explicitly designed to be translation-invariant, hence to ignore any spatial information, but the authors correctly note that ignoring spatial information completely (being rigidly translation-invariant) may not always be advisable (this may lead to failures of the type mentioned in (ii) above). It is much better to provide the model with the spatial information and let it figure out for itself how translation-invariant it needs to be in any particular task. This is exactly what the authors do. Specifically, they provide the spatial information in an explicit format through additional (fixed) channels that represent the Cartesian coordinates of each “pixel”. For image-based tasks, one thus needs only two additional channels, representing the $x$ and $y$ coordinates of each pixel. Pictorially, their scheme, which they call *CoordConv*, looks like this (Figure 3 in the paper):

That’s basically it. If the task at hand is highly translation-invariant, the model can learn to set the weights coming from those two Cartesian coordinate channels to small values; if the task at hand requires precise spatial information, on the other hand, the model can learn to utilize those channels appropriately. NLP people may recognize the conceptual similarity of this scheme to the positional encodings of items in sequence-based tasks. For the NLP people, we may thus summarize their contribution by saying that they extend the positional encoding idea from the temporal domain (in sequence-based tasks) to the spatial domain (in image-based tasks). It’s always a good idea to think about such exchanges between different domains!
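The coordinate-channel construction itself is only a few lines. A minimal sketch (the function name is mine; coordinates are normalized to $[-1, 1]$, as in the paper):

```python
import numpy as np

def add_coord_channels(images):
    """Append two fixed channels holding the normalized x and y
    coordinates of each pixel, as in the CoordConv scheme.

    images: array of shape (batch, height, width, channels)."""
    b, h, w, _ = images.shape
    # Coordinate values scaled to [-1, 1], broadcast over the batch.
    ys = np.linspace(-1.0, 1.0, h).reshape(1, h, 1, 1)
    xs = np.linspace(-1.0, 1.0, w).reshape(1, 1, w, 1)
    y_chan = np.tile(ys, (b, 1, w, 1))
    x_chan = np.tile(xs, (b, h, 1, 1))
    return np.concatenate([images, x_chan, y_chan], axis=-1)

imgs = np.zeros((8, 32, 32, 3))
print(add_coord_channels(imgs).shape)  # (8, 32, 32, 5)
```

A subsequent ordinary convolution over this augmented input is then free to use or ignore the two extra channels.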

The authors then go on to demonstrate that introducing a few of these *CoordConv* layers in standard architectures improves performance in a diverse range of tasks (but *not* in *all* tasks), including object detection, GAN training and Atari playing.

The second paper I’d like to highlight, called Neural Arithmetic Logic Units, starts from the observation that generic neural network architectures cannot generalize well in numerical tasks requiring arithmetic operations such as addition, multiplication etc., even though they may successfully fit any given training data in such tasks (and sometimes they cannot even achieve that). The authors of this paper introduce very simple, elegant and easy-to-implement inductive biases that enable generic models (LSTMs and MLPs) to extrapolate from training data much better in such tasks. The basic idea is to “nudge” standard neural network operations (linear combination, pointwise nonlinearity etc.) to behave like arithmetic operators. For instance, for addition, they parametrize a dense weight matrix as:

$$W = \tanh(\hat{W}) \odot \sigma(\hat{M})$$

where $\odot$ denotes elementwise multiplication and $\sigma$ is the sigmoid nonlinearity. In the saturated regime, this parametrization encourages $W$ to have entries in $\{-1, 0, 1\}$, and so a linear combination using this kind of $W$, i.e. $Wx$, tends to behave like an addition or subtraction of its inputs (without scaling). In light of the preceding discussion, it is important to note here again that the model does not *force* this kind of behavior, but rather merely *facilitates* it.

As an inductive bias for multiplication, they use an exponentiated sum of logs:

$$m = \exp(W \log(|x| + \epsilon))$$

using the same matrix $W$ as above. This (approximately) expresses the multiplication of the elements in $x$. A linear combination of these addition and multiplication operations, gated by a sigmoid unit (the combined unit is called a NALU in the paper), can then function as either an addition or a multiplication operation (which can be learned as appropriate). One can then stack these operations to express, in principle, arbitrarily complex arithmetic operations.
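Putting the two paths together, here is a sketch of a single NALU forward pass in NumPy (parameter names follow the paper loosely; the saturated parameter values below are hand-picked for illustration, not learned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nalu(x, W_hat, M_hat, G, eps=1e-7):
    """Forward pass through a single NALU cell (sketch).

    W = tanh(W_hat) * sigmoid(M_hat) is nudged toward entries in
    {-1, 0, 1}; a = W x is the add/subtract path; m = exp(W log(|x|+eps))
    is the multiply path; a sigmoid gate g mixes the two."""
    W = np.tanh(W_hat) * sigmoid(M_hat)
    a = W @ x                                 # additive path
    m = np.exp(W @ np.log(np.abs(x) + eps))   # multiplicative path
    g = sigmoid(G @ x)                        # learned gate
    return g * a + (1.0 - g) * m

# Saturated parameters selecting pure addition of the two inputs:
x = np.array([3.0, 5.0])
W_hat = np.full((1, 2), 10.0)   # tanh(10) ~ 1
M_hat = np.full((1, 2), 10.0)   # sigmoid(10) ~ 1
G = np.full((1, 2), 10.0)       # gate ~ 1 -> additive path
print(nalu(x, W_hat, M_hat, G))  # ~ [8.0], i.e. 3 + 5
```

Flipping the gate parameters to large negative values selects the multiplicative path instead, giving approximately `3 * 5 = 15`.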

This beautiful, simple idea apparently works fantastically well! I was quite impressed by the results in the paper. However, I would have liked to see (i) some results with more complex arithmetic operations than they report in the paper and also (ii) some results with tasks that do not have a strong arithmetic component to gauge how strong the introduced arithmetic inductive biases are. Again, the idea is to see whether, or how badly, the model fails when faced with a task without a strong arithmetic component. Ideally, we would hope that the model does not fail too badly in such cases.

**Note:** I will collect, and report here, examples of inductive biases, like the ones I discussed in this post, that I encounter in the literature, with brief descriptions of the bias introduced, how it is supposed to work and what kinds of problem it is intended to be applied to. To facilitate this, I tagged this post with the tag inductive biases and I will file similar posts under the same tag in the future.

There is a well-known construction of real numbers from rational numbers that defines real numbers as equivalence classes of Cauchy sequences of rational numbers. Two Cauchy sequences of rational numbers are counted as equal if their difference converges to zero in the Euclidean metric (choosing other metrics here gives rise to some interesting, alternative number systems, e.g. *p*-adic numbers).

In an analogous fashion, one can define a new number system starting from the reals this time. This new number system, called the *hyperreal numbers* (${}^*\mathbb{R}$), is defined in terms of equivalence classes of sequences of real numbers. Two sequences of real numbers, $(a_n)$ and $(b_n)$, are considered equal (hence represent the same hyperreal number) iff the set of indices where they agree, $\{n : a_n = b_n\}$, is *quasi-big*. The technical definition of a quasi-big set is somewhat complicated, but the most important properties of quasi-big sets can be listed as follows:

- No finite set is quasi-big.
- If $A$ and $B$ are quasi-big sets, then so is $A \cap B$.
- If $A$ is quasi-big and $A \subseteq B$, then $B$ is also quasi-big.
- For any set $A$, either $A$ or its complement is quasi-big.

Given this definition of hyperreal numbers, the following properties can be established relatively straightforwardly:

- $\mathbb{R} \subset {}^*\mathbb{R}$, i.e. the hyperreals strictly contain the reals (we map any real $r$ to the equivalence class containing the constant sequence $(r, r, r, \ldots)$).
- ${}^*\mathbb{R}$ contains an *infinitesimal*, i.e. a number $\epsilon$ such that $\epsilon > 0$, yet $\epsilon < r$ for every positive real number $r$ (to see this, consider the sequence $(1, \tfrac{1}{2}, \tfrac{1}{3}, \ldots)$).
- In a formal language $L$ that is expressive enough to develop the entire calculus in, any given sentence is true with respect to $\mathbb{R}$ iff it is true with respect to ${}^*\mathbb{R}$. This is an extremely useful property that can be derived from Łoś’s Theorem.

At this point, you might be (and you should be!) wondering whether Properties 2 and 3 above are consistent with each other. Property 2 says there’s an infinitesimal number in ${}^*\mathbb{R}$, but we know that there’s no infinitesimal number in $\mathbb{R}$. So, how is it possible that every sentence true in $\mathbb{R}$ is also true in ${}^*\mathbb{R}$? The answer is that the language $L$ mentioned in Property 3, although powerful enough to allow us to do calculus, is a rather restricted language. In particular, it doesn’t allow us to *define* what a real number is. This is because the definition of a real number crucially relies on the completeness axiom. The completeness axiom cannot be expressed in $L$, because it requires talking about sets of numbers, something that turns out not to be possible in the language $L$. So, Property 2 cannot be expressed in the language $L$, hence $L$ is not powerful enough to distinguish the hyperreals from the reals.

Here are some further useful properties of hyperreal numbers:

- We mentioned above that there is at least one infinitesimal hyperreal. Are there more? Yes, definitely! There are in fact infinitely many of them: for, if $\epsilon$ is an infinitesimal and $r$ is a real number, then $r\epsilon$ is also infinitesimal. Moreover, if $\epsilon$ and $\delta$ are infinitesimals, then so are $\epsilon + \delta$ and $\epsilon\delta$.
- We say that a hyperreal is *infinite* iff it is either greater than all real numbers or smaller than all real numbers. A hyperreal is *finite* iff it is not infinite. It is easy to show that if $\epsilon$ is infinitesimal, $1/\epsilon$ is an infinite hyperreal.
- ${}^*\mathbb{R}$ contains infinite integers. The proof is very easy, using Property 3 above. We know that for any $x$ in $\mathbb{R}$, there is an integer greater than $x$ (note that one can express this statement in the language $L$, because “$n$ is an integer”, unlike “$x$ is a real”, *can* be defined in $L$). Since this is true for $\mathbb{R}$, it must be true for ${}^*\mathbb{R}$ as well. In particular, there must be an integer greater than $1/\epsilon$. But we just observed that $1/\epsilon$ is infinite, hence that integer must be infinite too! In fact, it immediately follows that there must be infinitely many infinite integers.
- We say that two hyperreals $x$ and $y$ are *infinitely close*, denoted by $x \approx y$, if $x - y$ is infinitesimal or zero.
- Let’s say that a hyperreal $x$ is *nonstandard* if $x$ is not real. Then, it’s easy to show that for any real $r$ and nonstandard $x$, $r + x$ (or $rx$, if $r \neq 0$) is nonstandard.
- If $x$ is any finite hyperreal, then there exists a unique real infinitely close to $x$.

One can develop the entire calculus using hyperreal numbers instead of the reals. This is sometimes very useful, as some results turn out to be much easier to state and prove in ${}^*\mathbb{R}$ than in $\mathbb{R}$, and we know, by Property 3 above, that we won’t ever be led astray by doing so. Just to give a few examples: in ${}^*\mathbb{R}$, we can define a function $f$ to be *continuous* at $x$ iff $y \approx x$ implies $f(y) \approx f(x)$. In $\mathbb{R}$, on the other hand, we would have to invoke the notion of a *limit* to define continuity, which in turn is defined in terms of a well-known $\epsilon$-$\delta$-type argument.

Using hyperreals, derivatives are also very easy to define. The derivative of a function $f$ at $x$ is defined as:

$$f'(x) = \operatorname{st}\left(\frac{f(x + \epsilon) - f(x)}{\epsilon}\right)$$

where $\epsilon$ is an infinitesimal and $\operatorname{st}(\cdot)$ denotes the standard part, i.e. the unique real infinitely close to its (finite) argument (it’s easy to check that the definition doesn’t depend on which infinitesimal is chosen). Again, the definition of a derivative in standard analysis would involve limits.

Taking the derivatives of specific functions is equally easy, since we can manipulate infinitesimals just like we manipulate real numbers (again by Property 3 above).
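For instance, for $f(x) = x^2$, the computation is pure algebra followed by taking the standard part (the unique real infinitely close to a finite hyperreal):

```latex
\begin{aligned}
\frac{f(x + \epsilon) - f(x)}{\epsilon}
  &= \frac{x^2 + 2x\epsilon + \epsilon^2 - x^2}{\epsilon}
   = 2x + \epsilon, \\
f'(x) &= \operatorname{st}(2x + \epsilon) = 2x,
\end{aligned}
```

since $\epsilon$ is infinitesimal and hence $2x + \epsilon \approx 2x$. No limits are needed anywhere.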

One important difference between $\mathbb{R}$ and ${}^*\mathbb{R}$ is that $\mathbb{R}$ is complete (in fact, the construction of the reals as equivalence classes of Cauchy sequences of rationals, which was mentioned at the beginning of this post, is known as the completion of the rationals with respect to the Euclidean metric), whereas ${}^*\mathbb{R}$ is not complete. As mentioned before, this is not inconsistent with Property 3 above, because completeness cannot be expressed in the language $L$.

**Bonus:** I recently discovered this video of a great lecture by Terry Tao on ultraproducts. The ultraproduct is the technical concept I glossed over above when defining *quasi-big*ness. Most of the lecture is quite accessible and I recommend it to anybody who wants to learn more about this topic.

There is a fair amount of empirical observation in the literature on what makes a neural network generalize better or worse. These empirical observations typically report significant correlations between the generalization performance of models on a test set and some measure of their noise robustness or degeneracy. For example, Morcos et al. (2018) observe that models that generalize better are less sensitive to perturbations in randomly selected directions in the network’s activity space. Similarly, Novak et al. (2018) observe that models that generalize better have smaller mean Jacobian norms than models that don’t generalize well. The Jacobian norm at a given point can be considered an estimate of the average sensitivity of the model to perturbations around that point. So, similar to Morcos et al.’s observation, this result amounts to a correlation between the model’s generalization performance and a particular measure of its noise robustness. Wu et al. (2017) observe that models that generalize better have smaller Hessian norms (Frobenius) than models that don’t generalize well. The Hessian norm can be considered a measure of the total degeneracy of the model, i.e. the flatness of the loss landscape around the solution. Hence this result is in line with earlier hypotheses (by Hochreiter & Schmidhuber and by Hinton & Van Camp) about a possible connection between the generalization capacity of neural networks and the local flatness of the loss landscape. For two-layer shallow networks, Wu et al. also establish a link between the noise robustness of the models as measured by their mean Jacobian norm and the local flatness of the loss landscape as measured by the Hessian norm (in a nutshell, under some fairly plausible assumptions, the mean Jacobian norm can be upper-bounded by the Hessian norm; see Corollary 1 in the paper).

Although these observations are all informative, they do not provide a rigorous quantitative explanation for why heavily over-parametrized neural networks generalize as well as they do in practice. Such an explanation would take the form of a rigorous quantitative bound on the generalization error of a model given what we know about the model’s various properties. As I mentioned above, standard theoretical tools have so far not been able to yield non-vacuous, i.e. meaningful, bounds on generalization error for models operating in the highly over-parametrized regime.

In this post, I’d like to highlight a recent paper by Wenda Zhou and colleagues that was able to derive non-vacuous generalization error bounds for ImageNet-scale models for the first time. Following prior work by Dziugaite & Roy, their starting point is PAC-Bayes theory. They leverage the compressibility of the trained deep neural nets to derive a non-vacuous (but still very loose) PAC-Bayes bound on the generalization error of large-scale image recognition models.

Roughly speaking, PAC-Bayes theory gives bounds on the generalization error of the following form (up to constants): $\sqrt{\frac{\mathrm{KL}(Q \| P) + \log(m/\delta)}{m}}$, where $m$ is the number of training points, $Q$ is a data-dependent posterior distribution over the “hypotheses” (think of a distribution over neural network parameters) specific to a learning algorithm and training data, and $P$ is a data-independent prior distribution over the hypotheses. The main challenge in coming up with useful bounds in this framework is finding a good prior $P$ that would minimize the $\mathrm{KL}$-divergence between the prior and the posterior in the bound. This basically requires guessing what kinds of structure are favored by the training algorithm, training data and model class we happen to use (previously, Dziugaite & Roy had directly minimized the PAC-Bayes bound with respect to *the posterior distribution* $Q$ instead of guessing an appropriate prior, assuming a Gaussian form with diagonal covariance for the posterior, and came up with non-vacuous and fairly tight bounds for MNIST models. Although this is a clever idea, it obscures which properties of trained neural nets specifically lead to better generalization).

The main idea in the paper by Wenda Zhou and colleagues is to choose a prior $P$ such that greater probability mass is assigned to models with short code length, i.e. more compressible models. This implicitly expresses the hypothesis that neural networks trained with SGD on real-world datasets are more likely to yield compressible solutions. When we choose $P(h) \propto 2^{-|h|}$ and take $Q$ to be a point mass at $\hat{h}$, where $|h|$ denotes the number of bits required to describe the model $h$ according to some pre-specified compression scheme and $\hat{h}$ denotes the compressed version of the trained model, we get a bound like the following (Theorem 4.1 simplified):

$$\sqrt{\frac{|\hat{h}| \log 2 + \log(m/\delta)}{m}}$$
This means that the more compressible the model is, the better the generalization error bound will be. The authors then refine this bound by making the compression scheme more explicit and by incorporating the noise robustness of the trained networks into the bound. I won’t go into those details here, but definitely do check out the paper if you are interested. They then apply a state-of-the-art compression method (based on dynamic network surgery) to pre-trained deep networks to instantiate the bound. This gives non-vacuous but still very loose bounds for models trained on both MNIST and ImageNet: for ImageNet, for example, the top-1 error bound they obtain is still far above the actual error their compressed model achieves on the validation data.
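As a back-of-the-envelope illustration of how such a compression-based bound scales (this uses only the generic square-root form discussed above, not the exact statement of Theorem 4.1, and the bit count and dataset size below are hypothetical):

```python
import numpy as np

def compression_bound(bits, m, delta=0.05):
    """Rough PAC-Bayes-style bound for a point-mass posterior on a model
    describable in `bits` bits, against a prior assigning mass ~ 2^-bits:
    the KL term is then ~ bits * ln(2)."""
    kl = bits * np.log(2.0)
    return np.sqrt((kl + np.log(m / delta)) / m)

# Hypothetical numbers: a model compressed to 300 kilobits, trained
# on 1.2M examples (ImageNet-scale). Halving the code length shrinks
# the bound by roughly a factor of sqrt(2).
print(compression_bound(bits=300_000, m=1_200_000))
print(compression_bound(bits=150_000, m=1_200_000))
```

The point of the sketch is just that the bound is driven almost entirely by the ratio of compressed code length to dataset size, which is why better compression translates directly into tighter bounds.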

So, although it is encouraging to get a non-vacuous bound for an ImageNet model for the first time, there is still a frustratingly large gap between the actual generalization performance of the model and the derived bound, i.e. the bounds are nowhere near tight. Why is this? There are two possibilities as I see it:

- The authors’ approach of using a PAC-Bayes bound based on compression is on the right track, however the current compression algorithms are not as good as they can be, hence the bounds are not as tight as they can be. If true, this would be a practically exciting possibility, since it suggests that a few orders of magnitude better compression can be achieved with better compression methods currently not yet available.
- Just exploiting compression (or compression + noise robustness) will not be enough to construct a tight PAC-Bayes bound. We have to incorporate more specific properties of the trained neural networks that generalize well in order to come up with a better prior $P$.

The second possibility would imply the existence of models that have similar training errors and are compressible to a similar extent, yet have quite different generalization errors. I’m not aware of any reports of such models, but it would be interesting to test this hypothesis.

Here, $A$ is an $N \times M$ matrix containing the log-probability of each word in the vocabulary given each context in the dataset, where $N$ is the number of all distinct contexts in the dataset and $M$ is the vocabulary size; $H$ is an $N \times d$ matrix containing the embedding space representations of all distinct contexts in the dataset, where $d$ is the dimensionality of the embedding space; and $W$ is an $M \times d$ matrix containing the softmax weights.

In typical applications, $d$ is a few orders of magnitude smaller than $M$. This means that while the right-hand side of the above equation can be full-rank, the left-hand side is rank-deficient: its rank is at most $d \ll M$ (we’re assuming that $N$ is larger than $M$ and $d$, which is usually the case). This means that the model is not expressive enough to capture $A$.

I think it is actually more appropriate to frame the preceding discussion in terms of the distribution of singular values rather than the ranks of matrices. This is because $A$ can be full-rank but still have a large number of small singular values, in which case the softmax bottleneck would presumably not be a serious problem (there would be a good low-rank approximation to $A$). So, the real question is: what is the proportion of *near-zero*, or small, singular values of $A$? Similarly, the proportion of near-zero singular values of the left-hand side, that is, the degeneracy of the model, is equally important. It could be argued that as long as the model has enough non-zero singular values, it shouldn’t matter how many near-zero singular values it has, because it can just adjust those non-zero singular values appropriately during training. But this is not really the case. From an optimization perspective, near-zero singular values are as bad as zero singular values: they both restrict the effective capacity of the model (see this previous post for a discussion of this point).

Framed in this way, we can see that the softmax bottleneck is really a special case of a more general degeneracy problem that can arise even when we’re not mapping from a relatively low dimensional space to a relatively high dimensional space and even when the nonlinearity is not softmax. I will now illustrate this with a simple example from a fully-connected (dense) layer with relu units.

In this example, we assume that the input and output spaces have the same dimensionality, so there is no “bottleneck” due to a dimensionality difference between the input and output spaces. Mathematically, the layer is described by the equation $y = f(Wx)$, where $f$ is relu, and we ignore the biases for simplicity. The weights $W$ are drawn according to the standard Glorot uniform initialization scheme. We assume that the inputs $x$ are standard normal variables and we calculate the average singular values of the Jacobian of the layer, $\partial y / \partial x$, over a bunch of inputs. The result is shown by the blue line (labeled “Dense”) below:

The singular values decrease roughly linearly up to the middle singular value, but drop sharply to zero after that. This is caused by the saturation of approximately half of the output units. Of course, the degeneracy here is not as dramatic as in the softmax bottleneck case, where the overwhelming majority of the singular values would have been degenerate (as opposed to roughly half in this example), but this is just a single layer, and degeneracy can increase sharply in deeper models (again see this previous post for a discussion of this point).
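This qualitative behavior is easy to reproduce. A minimal sketch (my own, not the exact setup behind the figure): the Jacobian of $y = f(Wx)$ at $x$ is $\mathrm{diag}(\mathbb{1}[Wx > 0])\, W$, i.e. $W$ with the rows of the saturated units zeroed out, so roughly half of its singular values vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Glorot-uniform initialization for a square dense layer.
limit = np.sqrt(6.0 / (n + n))
W = rng.uniform(-limit, limit, size=(n, n))

def relu_jacobian_singvals(W, x):
    """Singular values of the Jacobian of y = relu(W x) at x.
    The Jacobian is diag(1[Wx > 0]) @ W."""
    active = (W @ x > 0).astype(float)
    J = active[:, None] * W    # zero out rows of saturated units
    return np.linalg.svd(J, compute_uv=False)

# Average singular value spectrum over standard-normal inputs.
svals = np.mean(
    [relu_jacobian_singvals(W, rng.standard_normal(n)) for _ in range(20)],
    axis=0,
)
# Roughly half of the units saturate on any given input, so a large
# fraction of the averaged spectrum is (near) zero.
print((svals < 1e-10).mean())
```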

The remaining lines in this figure show the average singular values of the Jacobian for a mixture-of-experts layer that’s directly inspired by, and closely related to, the mixture-of-softmaxes model proposed by the authors to deal with the softmax bottleneck problem. This mixture-of-experts layer is defined mathematically as follows:

$$y = \sum_{k=1}^{K} g_k(x)\, f_k(x)$$

where $K$ denotes the number of experts, $g_k(x)$ represents the gating model for the $k$-th expert and $f_k(x)$ is the $k$-th expert model (a similar mixture-of-experts layer was recently used in this paper from Google Brain). The mixture-of-softmaxes model proposed in the paper roughly corresponds to the case where both $g_k$ and $f_k$ are softmax functions (with the additional difference that in their model the input is first transformed through a linear combination plus tanh nonlinearity).

The figure above shows that this mixture-of-experts model effectively deals with the degeneracy problem (just like the mixture-of-softmaxes model effectively deals with the softmax bottleneck problem). Intuitively, this is because when we add a number of matrices that are each individually degenerate, the resulting matrix is less likely to be degenerate (assuming, of course, that the degeneracies of the different matrices are not “correlated” in some sense, e.g. caused by the same columns). Consistent with this intuition, we see in the above figure that using more experts (larger ) makes the model better conditioned.
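A minimal sketch of such a layer (relu experts and a softmax gate are my illustrative choices here; the mixture-of-softmaxes model uses softmax experts instead):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_dense(x, expert_Ws, gate_W):
    """Mixture-of-experts dense layer: y = sum_k g_k(x) * f_k(x),
    with relu experts f_k(x) = relu(W_k x) and softmax gates g(x)."""
    g = softmax(gate_W @ x)                         # (K,) gate weights
    experts = [np.maximum(W @ x, 0.0) for W in expert_Ws]
    return sum(gk * fk for gk, fk in zip(g, experts))

rng = np.random.default_rng(1)
n, K = 64, 4
expert_Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(K)]
gate_W = rng.standard_normal((K, n)) / np.sqrt(n)
y = moe_dense(rng.standard_normal(n), expert_Ws, gate_W)
print(y.shape)  # (64,)
```

Since each expert saturates a *different* random half of its units, the summed Jacobian is much less likely to share null directions across experts, which is the degeneracy-breaking intuition described above.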

However, it should be emphasized that the mixture-of-experts layer (hence the mixture-of-softmaxes layer) likely has additional benefits beyond just breaking the degeneracy in the model. This can be seen by observing that setting the gates to be constant, e.g. $g_k(x) = 1/K$, already effectively breaks the degeneracy:

I haven’t yet run detailed benchmarks, but it seems highly unlikely to me that this version with constant gates would perform as well as the full mixture-of-experts layer with input-dependent gating. This suggests that, in addition to breaking the degeneracy, the mixture-of-experts layer implements other useful inductive biases (e.g. input-dependent, differential weighting of the experts in different parts of the input space), so the success of this layer **cannot** be explained entirely in terms of degeneracy breaking. The same comment applies to the mixture-of-softmaxes model too.

Finally, I would like to point out that the recently introduced non-local neural networks can also be seen as a special case of the mixture-of-experts architecture. In that paper, a non-local layer was defined as follows:

$$y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j)\, g(x_j)$$

with input-dependent gating function $f(x_i, x_j)$ (normalized by $C(x)$) and expert model $g(x_j)$. The canonical applications for this model are image-based, so the layers are actually two-dimensional (a position dimension, represented by the indices $i$, $j$, and a feature dimension, implicitly represented by the vectors $x_i$, $x_j$, etc.), hence the inductive biases implemented by this model are not exactly the same as in the flat dense case above, but the analogy suggests that part of the success of this model may be explained by degeneracy breaking as well.

**Note:** I wrote a fairly general Keras layer implementing the mixture-of-experts layer discussed in this post. It is available here. Preliminary tests with small-scale problems suggest that this layer actually works much better than a generic dense layer, so I am making the code available as a Keras layer in the hope that other people will be encouraged to explore its benefits (and its potential weaknesses). ~~I am currently working on a convolutional version of the mixture-of-experts layer.~~ The convolutional mixture-of-experts models have now been implemented and uploaded to the GitHub repository, together with some examples illustrating how to use the dense and convolutional mixture-of-experts layers.

For me, the most surprising result in the paper was the model’s apparent ability to reproduce some very subtle aspects of actual grid cell phenomenology in the brain, particularly the emergence of discrete spatial scales with geometric ordering. It is not often that one gets such discrete structures when one trains neural networks on any task. The default pattern I would have predicted would rather be continuous variation in the spatial scale of grid cells, i.e. something like a uniform or normal distribution over spatial scales.

Here are some other (more negative) thoughts on the paper:

**1)** I found the paper frustratingly short on explanations. Yes, hexagonal grids seem to emerge in the network, but why? Some form of regularization seems to be necessary to get grid-like representations in the network, but again why? Yes, grid-like representation of space seems to be more beneficial than a place field like representation in some navigation problems in RL, again why? Unfortunately, they don’t even put forward any plausible hypotheses as to why these results are observed.

**2)** An ICLR paper this year reported somewhat similar (but less detailed) results on the emergence of a grid-like representation of space in trained recurrent nets. There are some important differences between the setups of the two papers; for instance, the ICLR paper trains the networks on the *x*, *y* coordinates of the agent’s position, whereas in the DeepMind paper, the training signal is a population code (place field) representation of space (plus a similar population code representation of head direction). The emergent grid in the ICLR paper was a square grid, I think, so they weren’t able to explain the hexagonal grid or the discrete, geometrically-ordered spatial scales, as far as I know. This difference between the training signals thus seems to be important in explaining the different properties of the emergent grid. I think it is really important to understand which features of the training setup or the architecture are necessary to get the biologically more realistic hexagonal grids with geometrically spaced scales: is the linear read-out important? Is the network architecture important? What kind of training signal is necessary for the emergence of a hexagonal grid? Etc.

**3)** The paper also claims that when provided as input to a standard RL agent (A3C), periodic grid code representations of self-location and goal location are more beneficial than place field representations of the same variables. Again, not enough control experiments were run to determine why this is the case, but there are at least two possibilities, as far as I see: (i) it could be that a grid code affords a “better” representation of space in some sense; (ii) it could be that this effect has nothing to do with the representational power of a grid code, but is just a training effect (i.e. this kind of representation just happens to work better when training the policy network in A3C with SGD).

I find (i) unlikely: as the grid cells themselves were trained to match place cell responses with a linear read-out, they can’t really be a better representation of space than the place cells; it can’t be the case that grid cells provide more reliable information about space than place cells, for example, because place cells are *the ground truth* here. It may be argued that a grid code contains more *easily decodable information* than a place cell code, but I don’t see how this could be possible (if anything I would actually expect the opposite; this was, in fact, one of the messages of this paper by Sreenivasan & Fiete: simply put, although a grid code can encode a lot of information, decoding that information is not straightforward).

This leaves (ii) as the likely (and less exciting) explanation. One possible reason why a grid code would be better for training the policy network could be that the activations in the grid code are less sparse than activations in a place cell code (a straightforward prediction is that the performance of the place cell network should improve with wider place fields). A crucial control sorely missing in this section was using a smooth, random, non-periodic basis (sort of like the basis that emerges in the network when it’s trained without regularization). If that works as well as the grid code, which is what I suspect, it would suggest that there’s nothing special about periodic multi-scale codes like the grid code: similar results would hold for any sufficiently dense and smooth code (a less exciting conclusion than the one the paper currently seems to draw).

**4)** In the RL experiments, the grid-cell network wasn’t trained end-to-end; different components of the network were trained with different objectives (for example, see Extended Data Fig. 5). Do grid cells still emerge if the entire network is trained end-to-end with the reward maximization (RL) objective? Alternatively, let’s assume that we plug a pre-trained grid-cell module into a policy network and then fine-tune it with the RL objective. Are the grid cell representations in the grid cell module stable in the face of this fine-tuning? If not, that would be problematic for the claims of the paper.

**5)** This is really a more minor point compared to the first four points above. In training the grid-like cells in the network, the training objective they use is to match the responses of some place cells (plus some head-direction cells) which are assumed to have developed already. I think it would be much nicer and more realistic to have an unsupervised objective that would lead to the emergence of both grid cells and place cells. In justifying this choice, the authors note that place cells seem to develop earlier than grid cells; coming up with an unsupervised objective that would reproduce this developmental order would therefore be desirable.

The really nice thing about this model is that by properly setting **only two** hyper-parameters, it is possible to significantly improve the accuracy of a pre-trained model at test time. For example, here are some error rates on the CIFAR-10 and CIFAR-100 test sets:

where ResNet32 is the baseline model without a cache component, ResNet32-Cache3 is a model that linearly combines the predictions of the cache component and the baseline model (with a mixture weight determining the contribution of the cache component), and ResNet32-Cache3-CacheOnly is a model that uses the predictions of the cache component only.
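As a rough sketch of how such a cache component might work (this is my own minimal reconstruction, not the exact implementation used in the experiments; the dot-product similarity, the softmax sharpness `theta`, and the function names are all assumptions), the cache stores feature embeddings of training items together with their labels, classifies a test item by a similarity-weighted vote over the memory, and the result is mixed with the baseline model’s prediction:

```python
import numpy as np

def cache_predict(test_feat, mem_feats, mem_labels, n_classes, theta=1.0):
    """Cache component: similarity-weighted vote over memorized training items.
    mem_feats: (n_items, d) stored feature embeddings; mem_labels: their classes."""
    sims = mem_feats @ test_feat                # similarity of the test item to each memory
    w = np.exp(theta * (sims - sims.max()))     # softmax weights (numerically stable)
    w /= w.sum()
    p = np.zeros(n_classes)
    for wi, yi in zip(w, mem_labels):           # accumulate weighted votes per class
        p[yi] += wi
    return p

def combined_predict(p_model, p_cache, lam):
    """Linear combination of baseline and cache predictions; lam is the mixture weight."""
    return (1.0 - lam) * p_model + lam * p_cache
```

Setting `lam = 0` recovers the baseline model and `lam = 1` gives the cache-only model; the two hyper-parameters mentioned above would correspond to things like the mixture weight and the sharpness of the similarity kernel.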

I also found that a cache component substantially improves the robustness of baseline models against adversarial attacks. Here are the classification accuracies of the same three models above on adversarial images generated from the CIFAR-10 test set by four different attack methods applied to the baseline ResNet32 model:

Interestingly, the cache-only model is more robust against adversarial perturbations than the other models. It is intuitively easy to understand why cache models improve robustness against adversarial examples: they effectively extend the range in the input space over which the model behaves similarly to the way it behaves near training points. Recent work has shown that neural networks behave more regularly near training points than elsewhere in the input space (for example, as measured by the Jacobian norm or the number of “linear regions” in the neighborhood of a given point).

Indeed, I confirmed this by looking at the Jacobian of the models at the test points:

Both cache models reduce the mean Jacobian norm over the baseline model (a). The two cache models, however, have somewhat different characteristics: the linear-combination model (ResNet32-Cache3) has the smallest first singular value but slightly increased lower-order singular values compared to the baseline model, whereas the cache-only model reduces the singular values (and the Jacobian norms per trial) more consistently. This pattern appears to be related to the differential generalization behavior of the two models: the linear-combination cache has the higher test accuracy, but the cache-only model is more robust against adversarial perturbations. So, the conjecture is that within-sample generalization performance is mostly determined by the first singular value (or the first few singular values), whereas the out-of-sample generalization performance, e.g. sensitivity of the models against adversarial perturbations, is also significantly affected by the lower-order singular values.
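For reference, the Jacobian statistics discussed here can be estimated in a few lines; below is a minimal finite-difference sketch (the function `f` is a placeholder for the network’s input-to-output mapping, and in practice one would use autograd rather than finite differences):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian of f: R^n -> R^m at the point x."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps                   # perturb one input dimension at a time
        J[:, i] = (f(xp) - y0) / eps
    return J

def jacobian_stats(f, x):
    """Frobenius norm and singular values of the Jacobian at a test point."""
    J = numerical_jacobian(f, x)
    return np.linalg.norm(J), np.linalg.svd(J, compute_uv=False)
```

Averaging the returned norm and singular values over a set of test points gives the per-model statistics compared above.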

One potential problem with this approach is that it becomes computationally prohibitive to unroll, and bake into the computation graph, more than a relatively small number of optimization steps, compared to the number of optimization steps or iterations one would normally perform for training the model. One might then worry that parameters or hyper-parameters optimized over this shorter horizon might not generalize well to longer horizons (the longer-horizon problem is ultimately the problem we are interested in).
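To make the setup concrete, here is a toy sketch of the unrolled objective on a scalar quadratic, with finite differences standing in for backpropagation through the unrolled graph (the quadratic loss and all names here are mine, purely for illustration):

```python
import numpy as np

def unrolled_loss(lr, theta0, h, noise, sigma):
    """Final loss after len(noise) unrolled SGD steps on L = 0.5*h*theta^2,
    with fixed per-step gradient noise samples in `noise`."""
    theta = theta0
    for eps in noise:
        theta = theta - lr * (h * theta + sigma * eps)  # noisy gradient step
    return 0.5 * h * theta**2

def meta_grad(lr, theta0, h, noise, sigma, eps=1e-6):
    """Gradient of the short-horizon objective w.r.t. the learning rate."""
    return (unrolled_loss(lr + eps, theta0, h, noise, sigma)
            - unrolled_loss(lr - eps, theta0, h, noise, sigma)) / (2 * eps)
```

The worry is precisely that the learning rate minimizing this T-step objective for small T need not be a good learning rate for the much longer horizon actually used at training time.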

A new ICLR paper by Yuhuai Wu, Mengye Ren and colleagues argues that such worries are justified, at least for a certain class of meta-learning problems. They consider the problem of meta-learning the learning rate and momentum schedules in neural network training. In the first part of the paper, they approach this problem analytically by considering a noisy quadratic loss function. For this particular loss function, it turns out that one can estimate both the optimal learning rate and momentum schedules (using Theorem 1 in the paper), as well as the “greedy-optimal” learning rate and momentum schedules that minimize the one-step ahead loss function at each time step (using Theorem 2). The authors show that although the greedy-optimal strategy outperforms the optimal schedule initially during training, its final performance is substantially worse than that of the optimal strategy. The reason appears to be that the greedy-optimal strategy reduces the learning rate prematurely to be able to decrease the loss along high-curvature directions in the loss landscape quickly enough.
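For intuition, in the scalar version of the noisy quadratic (loss $0.5\,h\,\theta^2$ with gradient noise of standard deviation $\sigma$), the greedy one-step-optimal learning rate has a simple closed form; note this is my own toy derivation for the one-dimensional case, not the paper’s Theorem 2, which handles the full multivariate problem:

```python
def greedy_lr(theta, h, sigma):
    """One-step-optimal learning rate for the scalar noisy quadratic
    L = 0.5*h*theta^2 with gradient g = h*theta + sigma*eps, eps ~ N(0, 1).
    Minimizes E[L_next] = 0.5*h*((1 - a*h)**2 * theta**2 + a**2 * sigma**2) over a."""
    return h * theta**2 / (h**2 * theta**2 + sigma**2)
```

With no noise this gives $1/h$, i.e. jumping straight to the optimum in one step; but once $\theta^2$ becomes comparable to the noise floor, the greedy rate decays roughly like $\theta^2$, which mirrors the premature learning-rate reduction described above.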

In the second part of the paper, a gradient-based hyper-parameter meta-learning method is shown to suffer from a similar short-horizon bias: meta-learning the hyper-parameters controlling the learning rate schedule (i.e. the initial learning rate and a hyper-parameter controlling the learning rate decay over time) with a small number of unrolled optimization steps, i.e. over a short horizon, again leads to learning rates that are reduced much too quickly compared to the optimal schedule, resulting in severely suboptimal final performance.

Here are some thoughts and questions inspired by this work:

- In the inner loop of their meta-learning problems, the authors only consider vanilla SGD with momentum. It is a bit unfortunate that they don’t consider the Adam algorithm (or similar algorithms) in the inner loop. I suspect that using Adam in the inner loop might ameliorate the short-horizon bias significantly, because Adam is supposed to be less sensitive to hyper-parameter choices (one might then object, with some justification, that this would be a case where meta-learning is not really needed).
- Because the bias reported in the paper seems to be consistent, i.e. always in the direction of reduced learning rates, a quick fix one could imagine would be to introduce a prior that would counteract this bias by, for example, discouraging small learning rates. I’m not quite sure what the best prior to use would be in this context and I don’t know how well this would work in practice compared to the optimal learning rate schedules, but it seems like a straightforward fix that could be worth trying.
- As the authors note, the short-horizon bias becomes less significant (or might even disappear) when the optimization problem is either deterministic or well-conditioned. We have previously argued that skip connections reduce degeneracies in neural networks, thereby making the optimization problem better-conditioned. It’s likely that other methods such as batch normalization have similar effects. This suggests that more realistic architectures using skip connections and batch normalization might be less prone to the short-horizon problem uncovered in this paper. It would be interesting to investigate this.
- It would also be interesting to see if other kinds of meta-learning problems, for example, meta-learning a good initialization for the model parameters (as in the MAML algorithm) suffer from similar short-horizon biases.

Their approach to this question is thoroughly (and admirably) empirical. By designing various versions of the basic game that eliminate different kinds of structure in the game, they are able not only to identify, but also to quantify the relative importance of, various regularities or meaningful structures in the game for human players. The idea is that if the removal of a particular meaningful structure or regularity in the game severely affects the learning performance of human players, then that must be an important regularity that humans rely on heavily in learning the game. Conversely, if the removal of a meaningful structure or regularity does not affect the learning performance of human players, it is reasonable to assume that that regularity is not an important part of the prior knowledge that human players bring to the learning of the new game.

Using this logic, the authors show that (i) knowledge about how to interact with objects (e.g. one can climb a ladder by pressing the up key as opposed to zigzagging left and right, one can jump over monsters by pressing the up and right (or left) keys etc.), (ii) knowledge about object affordances (e.g. platforms support walking and ladders support climbing) and object semantics (e.g. one has to avoid fires and angry-looking monsters), (iii) the concept of an object as distinct from the background and a prior that expects visually similar things to behave similarly, are all parts of the prior knowledge that people utilize (in the order of increasing importance) in learning a new game.

One might naively think that it should be a great idea to build all this prior knowledge in our neural networks (if only we knew how to do that!). That would be expected to greatly increase the sample efficiency of neural networks when they encounter similar problems. However, one has to be careful here, since building in any kind of prior knowledge always comes at a cost: namely, if the problems encountered by the model do not conform to the assumed prior, the prior, rather than improving the sample efficiency, may in fact worsen it. The authors are (again, admirably) very clear about this potential drawback of prior knowledge (which they illustrate with a simple example video game):

> However, being equipped with strong prior knowledge can sometimes lead to constrained exploration that might not be optimal in all environments… Thus, while incorporating prior knowledge in RL agents has many potential benefits, future work should also consider challenges regarding under-constrained exploration in certain kinds of settings.

Building in too many human priors could also make models vulnerable to a sort of “adversarial attack” where an adversary might design “unnatural” versions of a task that would be difficult to solve for a model with a lot of human priors (similar to the games created in the paper), a sort of inverse-CAPTCHA. A less constrained model with fewer human priors would be less vulnerable to such attacks. Besides, although human priors (especially more general ones such as the concept of an object or the visual similarity prior) are probably broadly useful in vision- or language-related problems, increasingly many problems we face today in applied sciences are not obviously amenable to any kind of useful human prior: problems in bioinformatics or molecular design come to mind, for example. Trying to prematurely incorporate human priors in our models (whatever they might be) may hinder performance in such cases. Furthermore, human priors (especially more specific ones such as those that arise in *naive physics* or in *naive psychology*) are often plain wrong, in which case incorporating them in our models doesn’t make any sense.

This is where the importance of meta-learning comes in, I think. Rather than trying to build in some assumptions *a priori*, in a lot of cases, *meta-learning* some prior knowledge or assumptions that would be broadly useful for a particular class of problems would be a more promising approach. Although there has recently been some interesting work in this direction, the prior knowledge or assumptions meta-learned in these works are quite primitive, e.g. meta-learning a good initialization. Meta-learning more substantial prior knowledge (for example, in the form of a model architecture) very much remains an interesting open problem.

At least in some cases, however, one might want to have more flexible embeddings that are optimized for a particular task, or set of tasks. Moreover, capturing the meaning of a word by a single vector may not be entirely appropriate either. Some words, for example, do not have a clear meaning on their own, but rather they have what we might call *operator semantics*: they act to modify the meanings of other words they are attached to in certain ways. Adverbs, for instance, usually function in this way. It seems more appropriate to model this aspect of meaning in terms of operators (e.g. matrices) rather than vectors. This is, in a nutshell, what is proposed in Socher et al. (2012).

Each word is assigned both a vector representation capturing its usual “content” semantics and a matrix representation capturing its operator semantics. Given two words represented by the tuples $(a, A)$ and $(b, B)$, their composition is represented by the pair $(c, C)$ where:

$$c = g\left(W \begin{bmatrix} Ba \\ Ab \end{bmatrix}\right)$$

and

$$C = W_M \begin{bmatrix} A \\ B \end{bmatrix}$$

Here, $W$ and $W_M$ are parameters shared across all compositions and optimized for a particular task, e.g. sentiment prediction.
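Concretely, this matrix-vector composition can be sketched in a few lines (a minimal numpy sketch; the dimensions and the choice of tanh for the nonlinearity are illustrative):

```python
import numpy as np

def compose(a, A, b, B, W, WM):
    """Matrix-vector composition in the style of Socher et al. (2012):
    each word is a (vector, matrix) pair; the composed phrase is the pair (c, C).
    W and WM are the shared composition parameters (shapes (d, 2d))."""
    c = np.tanh(W @ np.concatenate([B @ a, A @ b]))  # each word's matrix acts on the other's vector
    C = WM @ np.vstack([A, B])                       # combine the two operator matrices
    return c, C
```

Since the composed pair $(c, C)$ is again a vector-matrix pair, the same operation can be applied recursively along a parse tree.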

The assumption of parameter sharing across all compositions in this model is somewhat problematic, since Adv-Adj compositions do not really work the same way as NP-VP compositions, for instance. So, a more flexible semantic composition model that distinguishes between different types of composition could potentially be more powerful.

Secondly, it isn’t really necessary to capture the operator semantics of a word with a matrix. Although they use low-rank matrices for this purpose, it’s possible to do this even more efficiently. All that is needed is to have separate parts of a word vector represent the content vs. operator semantics (and possibly other kinds of semantics) and to have them interact in a particular, constrained way with the content and operator parts of another word vector. For example, we can have a vector representation of a word like $u = [u_c; u_o]$, where the first part of the vector, $u_c$, captures the content semantics, and the second part, $u_o$, captures the operator semantics. After composing two words, $u = [u_c; u_o]$ and $v = [v_c; v_o]$, we get the following content and operator parts:

$$c_c = g\left(W \begin{bmatrix} f(v_o, u_c) \\ f(u_o, v_c) \end{bmatrix}\right)$$

and

$$c_o = W_M \begin{bmatrix} u_o \\ v_o \end{bmatrix}$$

where we replaced the matrix-vector products with a more general function $f$, which can be a shallow MLP, for instance. The important point is to note that $f$ is encapsulated in the sense that it only allows the content part of a word vector to interact with the operator part of another vector (just as in the original matrix-vector product model above).
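A minimal sketch of this encapsulated composition scheme (all names, sizes, and the particular one-layer-MLP form of the interaction function are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # size of each half (content / operator) of a word vector

# A shallow MLP standing in for the interaction function f(operator, content):
W1 = 0.1 * rng.standard_normal((d, 2 * d))
def f(op, content):
    return np.tanh(W1 @ np.concatenate([op, content]))

W  = 0.1 * rng.standard_normal((d, 2 * d))  # mixes the two interaction terms
WM = 0.1 * rng.standard_normal((d, 2 * d))  # combines the two operator halves

def compose(u, v):
    """Encapsulated composition: f only sees the content part of one word
    and the operator part of the other, mirroring the matrix-vector model."""
    uc, uo = u[:d], u[d:]
    vc, vo = v[:d], v[d:]
    c_content = np.tanh(W @ np.concatenate([f(vo, uc), f(uo, vc)]))
    c_operator = WM @ np.concatenate([uo, vo])
    return np.concatenate([c_content, c_operator])
```

The output is again a vector with content and operator halves, so, as in the matrix-vector model, the composition can be applied recursively, but now the operator semantics costs only half a word vector rather than a full matrix.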
