Compositional embeddings, again

by Emin Orhan

Continuing on the topic of the last post, I would like to briefly discuss yet another proposal for learning vector space embeddings for compositional objects. In the last post, I discussed a method that only used vector addition and circular convolution (which really boils down to elementwise multiplication, when everything is expressed in the Fourier domain) to construct vector space embeddings of compositional objects. Although this kind of embedding satisfies certain properties that we might want in a desirable vector space embedding (e.g. mapping objects with different structures onto the same space), it’s unlikely that it would be the optimal embedding when these vectors are used in the context of a particular task such as sentiment prediction. This is because this kind of embedding does not have any learnable parameters and hence is independent of the task.

At least in some cases, however, one might want to have more flexible embeddings that are optimized for a particular task, or set of tasks. Moreover, capturing the meaning of a word by a single vector may not be entirely appropriate either. Some words, for example, do not have a clear meaning on their own, but rather they have what we might call operator semantics: they act to modify the meanings of other words they are attached to in certain ways. Adverbs, for instance, usually function in this way. It seems more appropriate to model this aspect of meaning in terms of operators (e.g. matrices) rather than vectors. This is, in a nutshell, what is proposed in Socher et al. (2012).

Each word is assigned both a vector representation capturing its usual “content” semantics and a matrix representation capturing its operator semantics. Given two words represented by the tuples (a,A) and (b,B), their composition is represented by the pair (p,P) where:

p = g\Big(W \begin{bmatrix} Ba \\ Ab \end{bmatrix}\Big)      and      P = W_{\mathrm{M}} \begin{bmatrix} A \\ B \end{bmatrix}

Here, W and W_{\mathrm{M}} are parameters shared across all compositions and optimized for a particular task, e.g. sentiment prediction.

The assumption of parameter sharing across all compositions in this model is somewhat problematic, since Adv-Adj compositions do not really work the same way as NP-VP compositions, for instance. So, a more flexible semantic composition model that distinguishes between different types of composition could potentially be more powerful.

Secondly, it isn’t really necessary to capture the operator semantics of a word with a matrix. Although they use low-rank matrices for this purpose, it’s possible to do this even more efficiently. All that is needed is to have separate parts of a word vector represent the content vs. operator semantics (and possibly other kinds of semantics) and to have them interact in a particular, constrained way with the content and operator parts of another word vector. For example, we can have a vector representation of a word like: [a, \alpha] where the first part of the vector a captures the content semantics, and the second part \alpha captures the operator semantics. After composing two words, [a, \alpha] and [b, \beta], we get the following content and operator parts:

p = g\Big(W \begin{bmatrix} f(a,\beta) \\ f(b,\alpha) \end{bmatrix}\Big)      and      \pi = W_{\mathrm{M}} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}

where we replaced the matrix-vector products Ab with a more general function f(a,\beta), which can be a shallow MLP, for instance. The important point is to note that f(\cdot) is encapsulated in the sense that it only allows the content part of a word vector to interact with the operator part of another vector (just as in the original matrix-vector product model above).