### Holographic reduced representations and associative LSTMs

Vector space representations of words (also known as word embeddings) are familiar. By mapping words onto a vector space equipped with a metric, we capture the similarity relationships between them, reduce their dimensionality (in contrast to a high-dimensional representation such as one-hot encoding), and enable their integration into larger, end-to-end differentiable models. How can we achieve the same thing for compositional objects? There are various ways of doing this, each with its own strengths and weaknesses. In an earlier post, I discussed a particular method that relies on tensor products to construct representations of compositional objects. However, this method quickly runs into a dimensionality problem: the number of dimensions required to represent a compositional object with $n$ constituents grows exponentially with $n$. Moreover, two objects with different numbers of constituents live in different spaces, hence are not directly comparable. In this post, I would like to discuss an alternative proposal first introduced by Tony Plate, called holographic reduced representations (HRR), that deals with both of these problems.

In this proposal, an association between two items, $\mathbf{a}$ and $\mathbf{b}$, is represented by their circular convolution. Because convolution corresponds to product in the Fourier domain, we can express this as the inverse transform of the elementwise multiplication of the discrete Fourier transforms of the vectors: $\mathbf{a} \circledast \mathbf{b} = \mathcal{F}^{-1}([r_1s_1 \exp(i(\phi_1 + \chi_1)), \ldots, r_ns_n \exp(i(\phi_n + \chi_n))])$

with $\tilde{\mathbf{a}} = [r_1 \exp(i \phi_1), \ldots, r_n \exp(i \phi_n)]$ representing the discrete Fourier transform of $\mathbf{a}$ and $\tilde{\mathbf{b}} = [s_1 \exp(i \chi_1), \ldots, s_n \exp(i \chi_n)]$ representing the discrete Fourier transform of $\mathbf{b}$. It is easy to see that the circular convolution of two vectors, unlike their tensor product, successfully deals with the dimensionality problem: the circular convolution of two vectors has the same dimensionality as the vectors themselves. Moreover, circular convolution behaves very similarly to scalar multiplication, making the manipulation of the items extremely easy and efficient. For example, circular convolution is both commutative and associative, just like scalar multiplication.

If we want to represent multiple associations, we can just add them up via vector addition: $\mathbf{c} = \mathbf{a}_1 \circledast \mathbf{b}_1 + \ldots + \mathbf{a}_k\circledast \mathbf{b}_k$

To retrieve an item $\mathbf{b}_1$ associated with an item $\mathbf{a}_1$ (which can be thought of as the “key” associated with $\mathbf{b}_1$), we take the circular convolution with the inverse of $\mathbf{a}_1$, where the inverse of a vector $\mathbf{a}$ is defined through $\tilde{\mathbf{a}^{-1}} = [s_1^{-1} \exp(-i \chi_1), \ldots, s_n^{-1} \exp(-i \chi_n)]$: $\mathbf{a}_1^{-1} \circledast \mathbf{c} = \mathbf{a}_1^{-1} \circledast (\mathbf{a}_1 \circledast \mathbf{b}_1 + \ldots + \mathbf{a}_k\circledast \mathbf{b}_k) = \mathbf{b}_1 + noise$

Here, the $noise$ grows linearly with the number of items (or associations) in memory and will be small if the keys are chosen sufficiently orthogonally (in the paper, only random normal keys are considered; however, orthogonal keys would be better, since although any pair of random normal keys are likely to be orthogonal, linear dependencies between a number of such keys also become highly likely; orthogonal keys, on the other hand, eliminate such linear dependencies as well).

Various different types of compositional objects can be readily represented in this framework. Here are some examples given in the paper.

Representing a sequence

There are multiple ways to represent a sequence like $\mathbf{a}\mathbf{b}\mathbf{c}$ in this scheme. We can, for instance, do something like: $\mathbf{a} + \mathbf{a} \circledast \mathbf{b} + \mathbf{a} \circledast\mathbf{b} \circledast \mathbf{c}$.

Representing a stack

To represent a stack with items $\mathbf{x}_1, \ldots, \mathbf{x}_k$ with $\mathbf{x}_1$ on top, we can do $\mathbf{s} = \mathbf{x}_1 + \mathbf{p} \circledast \mathbf{x_2} + \ldots + \mathbf{p}^n \circledast \mathbf{x}_n$ where $\mathbf{p}$ is an arbitrary vector. Then, we can define push and pop operations on this stack as follows: $push(\mathbf{s}, \mathbf{x}) = \mathbf{x} + \mathbf{p} \circledast \mathbf{s}$                      (push item $\mathbf{x}$ on top of stack $\mathbf{s}$) $top(\mathbf{s}) = clean - up(\mathbf{s})$                      (return the top item) $pop(\mathbf{s}) = (\mathbf{s} - top(\mathbf{s})) \circledast \mathbf{p}^{-1}$                      (pop the top item)

Chunking of sequences

To represent a long sequence like $\mathbf{abcdefgh}$, we can chunk it into smaller sequences as follows: $\mathbf{s}_{abc} = \mathbf{a} + \mathbf{a} \circledast \mathbf{b} + \mathbf{a} \circledast \mathbf{b} \circledast \mathbf{c}$ $\mathbf{s}_{de} = \mathbf{d} + \mathbf{d} \circledast \mathbf{e}$ $\mathbf{s}_{fgh} = \mathbf{f} + \mathbf{f} \circledast \mathbf{g} + \mathbf{f} \circledast \mathbf{g} \circledast \mathbf{h}$ $\mathbf{s}_{abcdefgh} = \mathbf{s}_{abc} + \mathbf{s}_{abc} \circledast \mathbf{s}_{de} + \mathbf{s}_{abc} \circledast \mathbf{s}_{de} \circledast \mathbf{s}_{fgh}$

Variable binding

Binding of $\mathbf{a}$ to variable $\mathbf{x}$ and binding of $\mathbf{b}$ to variable $\mathbf{y}$ can be achieved with: $\mathbf{t} = \mathbf{x}\circledast \mathbf{a} + \mathbf{y} \circledast \mathbf{b}$.

Frame (slot/filler) structures

The thematic structure of a sentence like “Mark ate the fish” can be represented by: $\mathbf{s}_1 = \mathbf{eat} + \mathbf{agt}_{eat} \circledast \mathbf{mark} + \mathbf{obj}_{eat} \circledast \mathbf{thefish}$,

where $\mathbf{agt}_{eat}$ and $\mathbf{obj}_{eat}$ represent the thematic roles of “agent” and “object” for the eat frame, respectively.

Recursive frames

A sentence can fill a slot in another sentence. For instance, the thematic structure of the sentence “Hunger caused Mark to eat” can be represented by: $\mathbf{s}_2= \mathbf{cause} + \mathbf{agt}_{cause}\circledast \mathbf{hunger} + \mathbf{obj}_{cause}\circledast \mathbf{s}_1$.

Representing types, tokens, features

Each token of a given type can be assigned a unique identifier. For instance, $\mathbf{mark} = \mathbf{being} + \mathbf{person} + \mathbf{id}_{mark}$, where $\mathbf{id}_{mark}$ denotes the unique identifier for “Mark”.

Associative LSTMs

Danihelka et al. (2016) use an HRR memory within a standard LSTM architecture to introduce an inductive bias in the model toward learning classic data structure algorithms like stacks and queues. They call this new model an “associative LSTM”. In the associative LSTM model, input and output key vectors, $r_i$ and $r_o$, are introduced as linear functions of the inputs and hidden states (analogous to gates in standard LSTMs, but without the nonlinearity). The cell state is then updated as follows: $c_t = g_f \odot c_{t-1} + r_i \circledast (g_i \odot u)$

and the output of the model is computed through: $h_t = g_o \odot \phi(r_o \circledast c_t)$

Here $\phi(\cdot)$ is a new nonlinearity they introduce. The only difference from the standard LSTM equations is the convolution with the input and output memory keys, $r_i$ and $r_o$, in these two equations. They also use multiple copies of each item in memory (with different keys) to reduce the noise arising from interference, but I will not go into the details of how exactly this is done (the above equations are for a version that uses a single copy). They show that the associative LSTM works better than standard LSTMs and a few other LSTM variants in several tasks designed to be naturally solvable in terms of classic data structure algorithms such as queues and stacks. They claim that the associative LSTM performs better because it implements these data structure algorithms. For example, commenting on the superior performance of the associative LSTM in a task that requires the prediction of closing tags in a sequence of nested random XML-like tags, they write:

We hypothesise that Associative LSTM succeeds at this task by using associative lookup to implement multiple queues to hold tag names from different nesting levels.

However, they do not provide any evidence (or at least, nowhere near enough evidence) for this hypothesis. It is possible that the associative LSTM does not explicitly implement anything like a queue, but  solves the task in an opaque, black-boxy way just like a standard LSTM (for example, an alternative hypothesis would be that the extra multiplication introduced by the convolution operation makes the model effectively deeper, increasing its expressive capacity). I had a similar concern with the “fast weights” paper as well. I think, in general, when people introduce a new architecture, they should be more rigorous about the hypothesized mechanism through which the new model is supposed to improve performance over baseline models.

In addition to analyzing the behavior of the model in simple, well-controlled setups that require the implementation of a particular type of algorithm, another way to probe for a hypothesized mechanism would be to look at the generalization pattern of the model. For example, if the model genuinely learned an addition algorithm (another task considered in the paper), it should be able to generalize well beyond the training data it received: e.g. if the model is trained with at-most-10-digit numbers, it should be able to generalize to much longer numbers (ideally arbitrarily long numbers). This approach was beautifully exemplified in the original neural Turing machines paper, for instance.

Another clue that the model is indeed utilizing the intended inductive biases (in this case, toward implementing classic data structure algorithms) is the speed of training: such a model usually trains much faster than a more generic model without the right inductive biases. Although the associative LSTM indeed trains much faster than more generic LSTM variants, the number of iterations required for convergence still seems somewhat large (usually tens or sometimes hundreds of thousands of iterations).

Advertisements