### Simple inductive biases to make neural networks train faster and generalize better: two case studies

Perhaps the most important factor determining how quickly a neural network trains and how well it generalizes beyond the range of data it receives during training is the inductive biases inherent in its architecture. If the inductive biases embodied in the architecture match the kind of data the network receives, that can enable it to both train much faster and generalize much better. A well-known example in this regard is the convolutional architecture of the modern neural network models for vision tasks. The convolutional layers in these models implement the assumption (or the expectation) that the task that the model attempts to solve is more or less translation invariant (i.e. a given feature, of any complexity, can appear anywhere in the image). A more recent example is the relational inductive biases implemented in relational neural networks. Mechanistically, this is usually implemented with an inner-product like mechanism (sometimes also called attention) that computes an inner-product like measure between different parts of the input (e.g. as in this paper) or with a more complex MLP-like module with shared parameters (e.g. as in this paper). This inductive bias expresses the expectation that interactions between features (of any complexity) are likely to be important in solving the task that the model is being applied to. This is clearly the case for obviously relational VQA tasks such CLEVR, but may be true even in less obvious cases such as the standard ImageNet classification task (see the results in this paper).

Coming up with the right inductive biases for a particular type of task (or types of tasks) is not always straightforward and it is, in my mind, one of the things that make machine learning a creative enterprise. Here, by the “right inductive biases”, I mean inductive biases that (i) only exploit the structure in the problem (or problems) we are interested in and nothing more or less, but (ii) are also flexible enough that if the same model is applied to a problem that doesn’t display the relevant structure exactly, the model doesn’t break down disastrously (some “symbol”-based neural machines may suffer from such fragility).

In this post, I’d like to briefly highlight two really nice recent papers that introduce very simple inductive biases that enable neural networks to train faster and generalize better in particular types of problems.

The first one is from Uber AI: An intriguing failing of convolutional neural networks and the CoordConv solution. In this paper, the authors first observe that state of the art convolutional networks fail quite badly in tasks that require spatial coordinate transformations, for example, changing from Cartesian coordinates to image-based coordinates or vice versa (e.g. given the Cartesian coordinates $(x,y)$, draw a square of a certain size centered at $(x,y)$). This may not be too surprising, since convolutional networks are explicitly designed to be translation-invariant, hence to ignore any spatial information, but the authors correctly note that ignoring spatial information completely (being rigidly translation-invariant) may not always be advisable (this may lead to failures of the type mentioned in (ii) above). It is rather much better to provide the model with the spatial information and let it figure out itself how much translation-invariant it needs to be in any particular task. This is exactly what the authors do. Specifically, they provide the spatial information in an explicit format through additional (fixed) channels that represent the Cartesian coordinates of each “pixel”. For image-based tasks, one thus needs only two additional channels, representing the $x$ and $y$ coordinates of each pixel. Pictorially, their scheme, which they call CoordConv, looks like this (Figure 3 in the paper):

That’s basically it. If the task at hand is highly translation-invariant, the model can learn to set the weights coming from those two Cartesian coordinate channels to small values; if the task at hand requires precise spatial information, on the other hand, the model can learn to utilize those channels appropriately. NLP people may recognize the conceptual similarity of this scheme to the positional encodings of items in sequence-based tasks. For the NLP people, we may thus summarize their contribution by saying that they extend the positional encoding idea from the temporal domain (in sequence-based tasks) to the spatial domain (in image-based tasks). It’s always a good idea to think about such exchanges between different domains!

The authors then go on to demonstrate that introducing a few of these CoordConv layers in standard architectures improves performance in a diverse range of tasks (but not in all tasks), including object detection, GAN training and Atari playing.

The second paper I’d like to highlight, called Neural Arithmetic Logic Units, starts from the observation that generic neural network architectures cannot generalize well in numerical tasks requiring arithmetic operations such addition, multiplication etc., even when they may successfully fit any given training data in such tasks (and sometimes they cannot even achieve that). The authors of this paper introduce very simple, elegant and easy-to-impement inductive biases that enable generic models (LSTMs and MLPs) to extrapolate from training data much better in such tasks. The basic idea is to “nudge” standard neural network operations (linear combination, pointwise nonlinearity etc.) to behave like arithmetic operators. For instance, for addition, they parametrize a dense weight matrix as:

$\mathbf{W} = \tanh(\mathbf{V}) \circ \sigma(\mathbf{M})$

where $\circ$ denotes elementwise multiplication, and $\sigma(\cdot)$ is the sigmoid nonlinearity. In the saturated regime, this parametrization encourages $\mathbf{W}$ to have entries, $-1, 0, 1$, and so a linear combination using this kind of $\mathbf{W}$, i.e. $\mathbf{W}\mathbf{x}$, tends to behave like an addition or subtraction of its inputs (without scaling). In light of the preceding discussion, it is important to note here again that the model does not force this kind of behavior, but rather it merely facilitates it.

As an inductive bias for multiplication, they use the exponentiated sum of logs formulation:

$\exp \mathbf{W} (\log (\mathbf{x} + \epsilon))$

using the same matrix $\mathbf{W}$ as above. This (approximately) expresses the multiplication of the elements in $\mathbf{x}$. A linear combination of these addition and multiplication operations gated by a sigmoid unit (called a NALU in the paper) then can function as either an addition or a multiplication operation (which can be learned as appropriate). One can then stack these operations to express, in principle, arbitrarily complex arithmetic operations.

This beautiful, simple idea apparently works fantastically well! I was quite impressed by the results in the paper. However, I would have liked to see (i) some results with more complex arithmetic operations than they report in the paper and also (ii) some results with tasks that do not have a strong arithmetic component to gauge how strong the introduced arithmetic inductive biases are. Again, the idea is to see whether, or how badly, the model fails when faced with a task without a strong arithmetic component. Ideally, we would hope that the model does not fail too badly in such cases.

Note: I will collect, and report here, examples of inductive biases, like the ones I discussed in this post, that I encounter in the literature, with brief descriptions of the bias introduced, how it is supposed to work and what kinds of problem it is intended to be applied to. To facilitate this, I tagged this post with the tag inductive biases and I will file similar posts under the same tag in the future.