Constraining neural networks

by Emin Orhan

Fully-connected layers with arbitrary connectivity are commonly used in neural networks (both feed-forward and recurrent), but is the full power of an unconstrained, all-to-all connectivity matrix really needed for the network to perform well? A lot of empirical evidence suggests that trained neural networks are highly compressible, suggesting that unconstrained connectivity may not be necessary. If we are going to do away with unconstrained connectivity matrices, what should we replace them with? Here are a few recent suggestions from the literature (you can check out the original papers for detailed performance comparisons, but the basic message is that these methods reduce computational or memory requirements of neural networks, or improve their training, with minimal performance loss):

1) Instead of an unconstrained, arbitrary N\times D matrix \mathbf{W}, Yang et al. (2014) suggest matrices of the following form (called the adaptive fastfood transform):

\mathbf{W} = \mathbf{SHG\Pi H B}                                                                        (*)

where \mathbf{S}, \mathbf{G}, \mathbf{B} are diagonal matrices, \mathbf{\Pi} is a random permutation matrix, and \mathbf{H} is the Hadamard matrix. Here \mathbf{\Pi} and \mathbf{H} are fixed, so one only learns the diagonal matrices \mathbf{S}, \mathbf{G}, \mathbf{B}. This decomposition has O(N) parameters and requires O(N\log D) operations to compute, as opposed to the much less efficient O(ND) parameters and O(ND) operations required for an unconstrained matrix. The particular matrix structure in (*) was introduced by Le et al. (2013) in earlier work, but in their case \mathbf{S}, \mathbf{G}, \mathbf{B} were set randomly and left untrained. In that work, they showed that feature maps computed using this random non-adaptive transformation correspond to the Gaussian RBF kernel in expectation, or more precisely \mathbf{E}_{S,G,B,\Pi}[\overline{\phi(x)}^\top\phi(x^{\prime})] = \exp(-\frac{|| x-x^{\prime}||}{2\sigma^2}) where the features are \phi_j(x) = \frac{1}{\sqrt{n}}\exp(i [\mathbf{W}x]_j) with \mathbf{W} as in (*).

2) Moczulski et al. (2015) propose connectivity matrices of the following form:

\mathbf{W} = \prod_{k=1}^K \mathbf{A}_k \mathbf{F} \mathbf{D}_k \mathbf{F}^{-1}                                                                      (**)

where \mathbf{A}_k and \mathbf{D}_k are diagonal matrices and \mathbf{F} is the discrete Fourier transform matrix. This has O(KN) parameters and requires O(KN\log N) operations to compute, which is an improvement over the O(N^2) parameters and O(N^2) operations required to compute a fully-connected unconstrained linear layer (with identical input and output dimensions, N), assuming of course K is not O(N). In the real version they ultimately prefer, \mathbf{F} is replaced by the discrete cosine transform matrix (of type II) \mathbf{C}.

A justifiable worry at this point is whether we are losing a lot of expressive power when we constrain our connectivity matrices in this way. The authors mention a theoretical guarantee assuring that almost all matrices \mathbf{M} \in \mathbb{C}^{N \times N} can be factored as \mathbf{M} = [ \prod_{i=1}^{N-1} \mathbf{D}_{2i-1} \mathbf{R}_{2i} ] \mathbf{D}_{2N-1} with \mathbf{D}_{2i-1} diagonal and \mathbf{R}_{2i} circulant, and this is precisely in the form (**) above. However, I’m not sure how useful this guarantee is, because K is O(N) in this result, hence we lose our computational savings in this limit.

3) A very nice idea has recently been proposed for constraining the connectivity of a recurrent neural network: Arjovsky et al. (2015) propose constraining the connectivity to be unitary (orthogonal in the real case). The advantage of a unitary connectivity matrix is that it exactly preserves the norm of a vector it is applied to, hence avoiding the vanishing or exploding gradient problems in recurrent neural network training. What particular unitary structure should we choose for the connectivity matrix? They suggest the following structure:

\mathbf{W} = \mathbf{D}_3 \mathbf{R}_2 \mathbf{F}^{-1} \mathbf{D}_2 \mathbf{\Pi} \mathbf{R}_1 \mathbf{F} \mathbf{D}_1                                     (***)

where \mathbf{D}_1, \mathbf{D}_2, \mathbf{D}_3 are diagonal matrices, \mathbf{R}_1, \mathbf{R}_2 are reflection matrices, \mathbf{\Pi} is a permutation matrix and \mathbf{F} is the discrete Fourier transform matrix. The authors mention that this particular structure was based on trial and error essentially. It is important to note that all the matrices in (***), except for the permutation matrix, have complex-valued entries.

In summary, constraints are generally useful for reducing the complexity of neural networks (think convnets), making training easier and more efficient, and making the models more interpretable. However, finding the right set of constraints to impose on neural networks is not always easy: imposing too many constraints or the wrong set of constraints might unnecessarily limit the representational power of the network, imposing too few constraints might reduce the benefits that would otherwise be obtained with the right set of constraints.