Severely Theoretical

Computational neuroscience and machine learning

Optimal synaptic memory consolidation

One of the fundamental challenges facing the brain is a trade-off between what we might loosely call learning and memory. On the one hand, we want our synapses to be highly plastic so that we can learn from our experiences quickly. On the other hand, we want our memories to be stable so that they are not easily overwritten by the constant barrage of experiences that we face every day and this seems to require rigid synapses. So, what is the optimal way to strike the balance between these two conflicting requirements?

Building on a series of previous papers, Benna & Fusi (2016) address this question in the context of a highly simplified model of synaptic modifications. In this model, the current value of a synapse is assumed to be a linear superposition of all the past modifications made to that synapse, weighted by a function of the time since that modification:

w_a(t) = \sum_{t^\prime} \Delta w_a (t^\prime) r(t-t^\prime)

Here a indexes the synapse, \Delta w_a (t^\prime) is the synaptic modification made at time t^\prime and r(t-t^\prime) is the weighting function that determines the effect of a modification made at time t^\prime on the current value of the synapse. Another important assumption is that the modifications are unit-magnitude, equal-probability depression or potentiation events, i.e. \Delta w_a(t^\prime)=\pm 1, that arrive at the synapse at a constant speed and are uncorrelated across time. It is assumed that there are N such synapses in total, uncorrelated with each other.

To formalize the plasticity-rigidity trade-off, they define a signal-to-noise ratio that quantifies how well we can decode a past synaptic modification event from the current values of the synapses. The signal is defined as the mean overlap between the current values of the synapses and the past synaptic modification pattern:

\mathcal{S}_{t^\prime}(t) \equiv \frac{1}{N}  \langle \sum_{a=1}^N w_a(t) \Delta w_a(t^\prime)  \rangle

and the noise is defined as the standard deviation of this overlap:

\mathcal{N}_{t^\prime}(t) \equiv \sqrt{ \frac{1}{N^2} \langle (\sum_{a=1}^N w_a(t) \Delta w_a(t^\prime) )^2 \rangle - \mathcal{S}^{2}_{t^\prime}(t) }

where the angle brackets denote averages over the stochastic modifications. With the assumptions made, it is then straightforward to derive that the signal-to-noise ratio for a memory inducted at time t^\prime is proportional to:

SNR_{t^\prime} (t) \propto r(t-t^\prime) / \sqrt{\sum_{l: t_l<t} r(t-t_l)^2}

Heuristically, we can see from this equation that the slowest decaying r(t) we can afford before the variance term diverges is r(t) \approx 1 / \sqrt{t} (to see this, note that plugging this in the denominator gives the harmonic series). One can make this argument more formal by writing down an objective (e.g. the area under the SNR(t) curve above an arbitrary threshold), optimizing with respect to r(t) and finding out that r(t) \approx 1 / \sqrt{t} indeed gives the correct answer, but I won’t do it here. This is the first main result of the paper (and to me the more important one). One can show that a system that displays the 1/\sqrt{t} decay can achieve the almost extensive N / \log N memory capacity (capacity is defined as the time at which the SNR(t) drops to an arbitrary fixed value, which they take to be 1 in the paper).

The rest of the paper is devoted to coming up with a hand-crafted, tractable dynamical system (which can be interpreted as describing the internal dynamics of a complex synapse model) that would display the desired 1/\sqrt{t} decay. The solution they come up with is based on the heat equation:

\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}

where D \propto g/C is the diffusion coefficient (where C is known as the heat capacity and g is the conductivity). Recall that the solution to the heat equation at x=0 already displays the required 1/\sqrt{t} decay. However, this naive solution is not good enough, because any discretized version of it would require too many variables (\sim \sqrt{N} variables) to achieve the 1/\sqrt{t} decay over a sufficiently long time. The “patch” they offer for this problem is to use an inhomogeneous heat equation with an exponentially increasing heat capacity: i.e. C(x) = \exp(\beta x) and an exponentially decreasing conductivity g(x) = \exp(-\beta x). This has the effect of slowing down the diffusion along the x direction and requires only \sim \log N variables to implement the desired 1\sqrt{t} decay in a discretized version.

Although I find this model informative for elucidating the kinds of mechanisms required for achieving extensive memory capacity in an efficient way under the studied scenario (e.g. multiple variables with exponentially differing time-scales interacting with each other), I have two basic issues with the paper. First, the assumptions and simplifications they make to render the problem tractable also make it somewhat uninteresting from a practical viewpoint: uncorrelated events arriving at uncorrelated synapses is not a very practically relevant scenario. In the studied scenario, there’s also no accounting of the importance of a synapse for prior experiences (cf. Kirkpatrick et al., 2017). Incidentally, Kirkpatrick et al. (2017) show that any synapse that optimizes their elastic weight consolidation objective (which combines the loss in the current episode and deviation from the current value of the synapse weighted by the importance of that synapse for prior episodes) automatically satisfies the optimal 1/\sqrt{t} decay under a scenario similar to that studied in Benna & Fusi (2016). This result suggests that this decay might naturally fall out of other objectives that the synapse would have to optimize.

Secondly, the desire to come up with a tractable model also restricts the authors to a linear model (i.e the heat equation). But, there is no reason synapses should be linear. Nonlinear models might, in fact, perform better. There’s not much room for improvement over the linear model proposed in the paper, since both the 1/\sqrt{t} decay and the nearly extensive memory capacity are optimal under the studied scenario, but perhaps the \log N discrete variables required in the linear model could be improved upon in a nonlinear model, or perhaps the robustness of the model could be improved with a nonlinear model. So, an alternative approach would be to model the synapse as a nonlinear dynamical system (i.e. an RNN), and optimize its parameters (one can think of more sophisticated variations on this idea to learn more interpretable solutions). I find this approach very useful, because (i) the hand-crafted solutions we humans come up with usually tend to be highly special (e.g. an attractor with all eigenvalues equal to 1), whereas nature prefers more generic solutions, (ii) with a hand-crafted model, we can rarely beat or even match the performance of an optimized nonlinear model even in relatively simple problems, so such models are useful for delineating the contours of what is achievable and for giving us valuable hints about more general principles.


Why is it hard to train deep neural networks? Degeneracy, not vanishing gradients, is the key

In this post, I will try to address a common misunderstanding about the difficulty of training deep neural networks. It seems to be a widely held belief that this difficulty is mainly, if not completely, due to the vanishing (and/or exploding) gradients problem. “Vanishing gradients” refers to the gradient norms becoming exponentially smaller for parameters deeper in the network. Smaller gradients mean parameters changing ever so slowly, and so learning gets stuck until the gradients become large enough, which could take exponentially long. This idea goes back, at least, to Bengio et al. (1994) and still seems to be everybody’s favorite explanation for why it is hard to train deep neural networks.

Let’s first consider a simple scenario: a deep linear network being trained to learn a linear mapping. Of course, deep linear networks aren’t interesting from a computational perspective, but Saxe et al. (2013) showed that learning dynamics in such networks can still be informative about the learning dynamics in nonlinear networks. So, let’s start  with this simple scenario. Here’s the learning curve and the initial gradient norms (before any training) for a 30-layer network (errorbars are standard errors over 10 independent runs).


I will shortly explain why these results are labeled “Fold 0” in the figure. The gradients here are with respect to layer activations (gradients with respect to parameters behave similarly). The network weights are initialized with the standard 1/\sqrt{n} initialization. The training loss decreases rapidly at first, but then quickly asymptotes at a suboptimal value. The gradients certainly don’t vanish (or explode), at least initially! They do become smaller as training progresses, but that is to be expected and it’s not clear at all that the gradients here are “too small” in any sense:


To show that convergence to a suboptimal solution here doesn’t have anything to do with the size of the gradient norms per se, I will now introduce a manipulation that will increase the gradient norms, but will worsen the performance. Here it is (in blue):


So, what did I do? I simply changed the initialization in a pretty minimal way. Each initial weight matrix in the original networks is a 64 \times 64 matrix (initialized with the standard 1/\sqrt{n} initialization). In the networks shown in blue here, I just copied the first half of each initial weight matrix to the second half (the initial weight matrix is “folded” once, so that’s why I denote these as “Fold 1” networks). This reduces the rank of the initial weight matrices, and makes them more degenerate. Note that this manipulation is done only on the initial weight matrices, so no other constraint is imposed on learning once the training gets started. Here’s how the gradient norms look after the first few epochs:


So, I introduced a manipulation that increased the size of the gradient norms overall, yet the performance got significantly worse. Conversely, I will now introduce another manipulation that will shrink the size of the gradient norms, yet will improve performance substantially. Here it is (in green):


As you may have guessed from the label, this new manipulation initializes the weight matrices to be orthogonal. Let’s recall that orthogonal matrices are the least degenerate matrices among matrices of a fixed (Frobenius) norm, where degeneracy can be measured in different ways, for example, as the fraction of singular values smaller than a given constant. Here’s how the gradients look after the first few epochs in this case:


So, what’s going on? If the size of the gradient norms per se isn’t responsible for training difficulties, then what is? The answer is that it’s the degeneracy of the model that, by and large, determines the training performance. Why is degeneracy harmful for training performance? Intuitively, the reason is that learning slows down substantially along degenerate directions in the parameter space, hence degeneracies reduce the effective dimensionality of the model. So, you might think that you’re fitting a model with \sim 130\mathrm{K} parameters, as in the above examples, but in reality you’re effectively fitting a model with substantially fewer degrees of freedom because of the degeneracies. So, the problem with the “Fold 0” and “Fold 1” networks above was that although the gradient norms are fine, the network’s available degrees of freedom contribute extremely unevenly to those norms: while a handful of degrees of freedom (non-degenerate ones) explain almost all of it, the vast majority (degenerate ones) don’t contribute anything at all (a conceptually helpful, but not strictly accurate, way of thinking about this may be to imagine only a few of the hidden units in each layer changing their activity in response to different inputs, while the vast majority just responding the same way independently of the input).

This reduction in the effective degrees of freedom can be quite substantial, as in the 1/\sqrt{n}-initialized random matrices above. As shown by Saxe et al. (2013), the product of such matrices becomes increasingly degenerate as the number of matrices multiplied (i.e. network depth) increases. Here’s an example with 1-layer, 10-layer and 100-layer networks respectively (adapted from Saxe et al. (2013)):

saxe_plotAs depth increases, the singular values of the product matrix become increasingly concentrated around 0, except for a vanishingly small fraction of singular values that become arbitrarily large. This result isn’t just relevant for linear networks. A similar thing happens in nonlinear networks too: as depth increases, the responses of the hidden units in a given layer become increasingly lower-dimensional, i.e. increasingly degenerate. In fact, this “degeneration process” can proceed much more quickly with depth in nonlinear networks with hard-saturating boundaries as in ReLU networks.

A nice visualization of this degeneration process is presented in a paper by Duvenaud et al. (2014):

drawing-1As depth increases, the input space (shown on the top left) gets distorted into increasingly thin filaments with only a single direction (orthogonal to the filament) at each point in the input space affecting the network’s response (it may be a bit hard to extrapolate from this figure how input spaces of more than two dimensions will behave under a similar mapping, but it turns out they become “hyper-pancakey” locally, i.e. there is a single direction at each point, orthogonal to the pancake surface, the network is sensitive to). Along that sensitive direction, the network in fact becomes hyper-sensitive to variations.

Finally, I can’t resist mentioning my own paper (with Xaq Pitkow) at this point. In this paper, through a series of experiments, we argue that the degeneracy problem I discussed in this post severely afflicts training in deep nonlinear networks, and that one of the ways (and quite possibly the single most important way) in which skip connections help training in deep networks is through breaking such degeneracies. I suspect that other methods like batch normalization or layer normalization that also help training in deep networks also work at least partly through a similar degeneracy-breaking mechanism, in addition to any other potentially independent mechanisms such as reducing the internal covariate shift as originally proposed. It is well-known, for example, that divisive normalization is an effective way of decorrelating the responses of hidden units, which in turn can be seen as a degeneracy-breaking mechanism.

Update (1/5/18): I should also mention this important recent paper by Pennington, Schoenholz & Ganguli. Orthogonal initialization completely eliminates degeneracies in linear networks, but not in nonlinear networks. In this paper, they provide a method to calculate the entire singular value distribution of the Jacobian of a nonlinear network and show that a depth-independent, non-degenerate singular value distribution can be achieved, with careful initialization, in networks with hard-tanh nonlinearities, but not in ReLU networks. The empirical results show that networks with depth-independent, non-degenerate singular value distributions train orders of magnitude faster than networks whose singular value distributions become wider (higher variance) and more degenerate with depth. This is a powerful demonstration of the importance of eliminating degeneracies and controlling the entire singular value distribution of a network, not just its mean.

Introduction to hyperbolic geometry and hyperbolic embeddings

This week, I gave a tutorial on hyperbolic geometry and discussed this paper that proposes embedding symbolic data in a hyperbolic space, rather than in Euclidean space. In this post, I will try to summarize my presentation. I closely followed the book Hyberbolic Geometry by James W. Anderson in my presentation, which is an extremely well-written introductory textbook on the subject. I highly recommend this book to anyone who is interested in learning about hyperbolic geometry.


Euclid axiomatized plane geometry more than two millennia ago. He came up with five axioms from which all of plane geometry could be derived:

  1. Given two points in the plane, a unique line can be drawn passing through those points.
  2. Any line segment can be extended indefinitely in either direction.
  3. Given any line segment, a circle can be drawn with its center on one of the end-points of the line segment and its radius equal to the length of the line segment.
  4. Any two right angles are congruent.
  5. Parallel postulate: “If a straight line falling on two straight lines make the interior angles on the same side less than two right angles, the two straight lines, if produced indefinitely, meet on that side on which are the angles less than the two right angles.”

There are equivalent formulations of the parallel postulate, perhaps the most famous of which is the Playfair’s axiom: “Given a line and a point not on it, at most one line parallel to the given line can be drawn through the point.”

Generations of mathematicians after Euclid thought the parallel postulate suspiciously complicated and hoped that it could somehow be derived from the first four axioms. However, all efforts to do so failed. We now know that the parallel postulate is independent of the remaining axioms of plane geometry, meaning that the axiomatic system consisting of the first four axioms and the negation of the parallel postulate is a consistent system. In fact, even ancient Greeks were familiar with a perfectly fine model of a plane geometry that violates the parallel postulate: the surface of the sphere, i.e. S^2. If points are interpreted as points and lines as great circles on the surface of the sphere, the parallel postulate is violated, because any two great circles on the surface of the sphere have to intersect, hence they are not parallel. However, the problem with this model is that it also violates the second axiom: line segments cannot be extended indefinitely, since great circles are closed curves. So, it was not taken seriously as a plausible model of plane geometry.

At the beginning of the 19th century, people started to come up with model geometries that violate the parallel postulate, while satisfying all the remaining axioms of Euclid. Gauss, Bolyai and Lobachevsky were the first mathematicians to come up with such a model. However, we will be dealing mostly with two models introduced later by Poincaré. The first one is called the upper half-plane model.

The upper half-plane model, \mathbb{H}

The upper half-plane is the set of complex numbers with positive imaginary part: \mathbb{H}=\{ z \in \mathbb{C} | \mathrm{Im}(z)>0\}. The points in \mathbb{H} are defined to be the usual Euclidean points, and lines are defined to be either half-lines perpendicular to the real axis, or half-circles with centers on the real axis (image source):

Image Section61

Circles in \mathbb{H} are Eucliden circles (but not necessarily with the same center or radius). As an exercise, you can check that when points, lines and circles in \mathbb{H} are interpreted in this way, all axioms of Euclid are satisfied except for the parallel postulate: show that in \mathbb{H} there is an infinite number of parallel lines to a given line passing through a given point not on the line.

This model, by itself, does not come equipped with a metric. We have to define one for it. It turns out that there is a fairly natural way of assigning a metric to \mathbb{H}. The idea will be to first find a group of transformations in \mathbb{H} that leave hyperbolic lines invariant, i.e. that map hyperbolic lines to hyperbolic lines in \mathbb{H}. We will then require our metric to be invariant under these transformations. It turns out that this requirement uniquely determines the metric (up to a trivial scale factor).

The group \mathrm{Mob}(\mathbb{H})

The upper half-plane model is thought of as embedded in the extended complex plane, \overline{\mathbb{C}} = \mathbb{C} \cup \{\infty\}, consisting of the complex plane and a single point at infinity.

In \overline{\mathbb{C}}, we define a circle to be either a Euclidean circle in \mathbb{C} or a Euclidean line in \mathbb{C} plus the point at infinity, \infty. We denote the group of homeomorphisms of \overline{\mathbb{C}} taking circles in \overline{\mathbb{C}} to circles in \overline{\mathbb{C}} by \mathrm{Homeo}^C(\overline{\mathbb{C}}). Our first job is to exactly characterize this group. For this purpose, we first define the Möbius transformations.

Definition: A Möbius transformation is a function m: \overline{\mathbb{C}} \rightarrow \overline{\mathbb{C}} of the following form:

m(z) = \frac{az+b}{cz+d},  where a,b,c,d \in \mathbb{C} and ad - bc \neq 0.

These transformations correspond to compositions of simple translations, rotations, dilations and inversions in the complex plane. Here’s a nice video visualizing how these transformations deform a square in the complex plane:

The set of all Möbius transformations form a group denoted by \mathrm{Mob}^+.

Definition: The group generated by \mathrm{Mob}^+ and complex conjugation, c(z)=\bar{z}, is called the general Möbius group and denoted by \mathrm{Mob}.

A fundamental result then tells us that the general Möbius group is precisely the group of homeomorphisms of \overline{\mathbb{C}} taking circles to circles in \overline{\mathbb{C}}, i.e. \mathrm{Homeo}^C(\overline{\mathbb{C}}):

Theorem: \mathrm{Mob} = \mathrm{Homeo}^C(\overline{\mathbb{C}}).

Another important property of \mathrm{Mob} is that elements of \mathrm{Mob} are conformal homeomorphisms, i.e. they preserve angles.

We will concentrate on a particular subgroup of \mathrm{Mob} that preserves the upper half-plane:

Definition: \mathrm{Mob}(\mathbb{H}) = \{ m \in  \mathrm{Mob} | m(\mathbb{H})= \mathbb{H} \}.

It can be shown that every element of \mathrm{Mob}(\mathbb{H}) takes hyperbolic lines to hyperbolic lines in \mathbb{H}. In this precise sense, we can think of \mathrm{Mob}(\mathbb{H}) as some sort of a generator for the upper half-plane model.

We can express every element of \mathrm{Mob}(\mathbb{H}) explicitly either as:

m(z) = \frac{az+b}{cz+d}, where a, b, c, d \in \mathbb{R} and ad-bc=1

or as:

m(z) = \frac{a\bar{z}+b}{c\bar{z}+d}, where a, b, c, d are purely imaginary and, again, ad-bc=1.

Length and distance in \mathbb{H}

We are now ready to define a line element and a distance metric in \mathbb{H}. As mentioned above, we will do this by requiring that whatever line element and distance metric we choose for \mathbb{H} should be invariant under the action of the group \mathrm{Mob}(\mathbb{H}). It turns out that this requirement uniquely determines the line element and the distance metric (up to a trivial scalar factor).

Let’s first review a few basic definitions. Let f: [a,b] \rightarrow \mathbb{R}^2 define a path in the Euclidean plane. The Euclidean length of f is defined as:

length(f) \equiv \int_{a}^b \sqrt{x^\prime(t)^2 + y^\prime(t)^2} dt

or if the Euclidean plane is considered as the complex plane, this can be equivalently expressed as:

length(f) = \int_{a}^b |f^\prime(t)| dt \equiv \int_f |dz|

where |dz| is called the Euclidean line element. A more general line element can be defined by scaling the Euclidean line element by a scale factor, \rho(z), that could depend on z: i.e. \rho(z)|dz|. We define the length of f with respect to this new line element as:

length_{\rho}(f) \equiv \int_f  \rho(z)|dz| = \int_{a}^b \rho(f(t))|f^\prime(t)| dt

As mentioned above, our strategy will be to determine a \rho(z) for \mathbb{H} by requiring the lengths of all paths in \mathbb{H} to be preserved under the action of the group \mathrm{Mob}(\mathbb{H}), i.e. length_{\rho}(f)=length_{\rho}(\gamma \circ f) for all \gamma \in \mathrm{Mob}(\mathbb{H}) and all f. This requirement results in (up to a constant multiplicative factor):

\rho(z) = \frac{1}{\mathrm{Im}(z)}

Therefore, the hyperbolic length of a path in \mathbb{H} can be measured as:

length_{\mathbb{H}}(f) = \int_f \frac{1}{\mathrm{Im}(z)} |dz| = \int_a^b \frac{1}{\mathrm{Im}(f(t))} |f^\prime(t)|dt

Example: Consider f(t)=it defined in the interval [a,b]. Then, length_\mathrm{H}(f)=\int_a^b \frac{1}{t}dt = \log(\frac{b}{a}). We see that  the length blows up as a\rightarrow 0.

Now, we can define a distance metric in \mathbb{H} based on our definition of length_{\mathrm{H}}. The idea is to find a \gamma \in  \mathrm{Mob}(\mathbb{H}) that maps the hyperbolic line passing through z_1 and z_2 to the imaginary axis so that \gamma(z_1)=\mu i and \gamma(z_2)=\lambda i for some \mu and \lambda (it is easy to show that we can always find such a transformation and that, if there are multiple such transformations, the metric defined below does not depend on which one is chosen). From the example above, we know how to measure the lengths of line segments on the imaginary axis.

When we do this, we find the following formula for measuring the hyperbolic distance between two arbitrary points z_1, z_2 \in \mathbb{H}:

d_{\mathbb{H}}(z_1, z_2) = | \log | \frac{y_2(x_1-c-r)}{y_1(x_2-c-r)}  |  |

where (x_1,y_1) and (x_2,y_2) are the coordinates of z_1 and z_2, respectively, and c and r the center and the radius of the hyperbolic line passing through z_1 and z_2.

One can show that the group \mathrm{Mob}(\mathbb{H}) corresponds exactly to the group of isometries of \mathbb{H}: i.e. \mathrm{Isom}(\mathbb{H}, d_{\mathbb{H}}) = \mathrm{Mob}(\mathbb{H}).

One can also show that there are two qualitatively distinct types of parallel lines in \mathbb{H}: parallel lines that asymptote to a common point at infinity and parallel lines that do not asymptote to a common point at infinity (called ultraparallel lines). For the former ones, it can be shown that d_{\mathbb{H}}(l_0, l_1)=0, whereas for the latter case, d_{\mathbb{H}}(l_0, l_1)>0. Therefore, ultraparallel lines behave more like parallel lines in Euclidean space.

Poincaré disk model, \mathbb{D}

Another model of hyperbolic geometry is the Poincaré disk model. This model is defined inside the unit disk in \mathbb{C}: \mathbb{D}=\{ z \in \mathbb{C}| |z|<1 \}. A hyperbolic line in \mathbb{D} is the image under m^{-1} of a hyperbolic line in \mathbb{H}, where m: \mathbb{D} \rightarrow \mathbb{H} is an element of \mathrm{Mob}. This implies that hyperbolic lines in \mathbb{D} are Euclidean arcs perpendicular to the unit circle S^1 (image source):

Now, we can do the same things with this model that we did with the upper half-plane model, namely find a scaling function \rho(z) that makes the lengths of paths invariant under the group \mathrm{Mob}(\mathbb{D}), then define lengths and distances based on this \rho(z). When we do that, we get the following expressions:


length_{\mathbb{D}}(f) = \int_f \frac{2}{1-|z|^2}|dz|

Example: Consider f: [0,r] \rightarrow \mathbb{D}, f(t)=t. Then, length(f)=\int_0^r \frac{2}{1-t^2}dt=\log(\frac{1+r}{1-r}). Again, we see that the length blows up as r \rightarrow 1.

Example: It can be shown that the length of a hyperbolic circle in \mathbb{D} with hyperbolic center 0 and hyperbolic radius s is 2\pi \sinh(s).

This line element leads to the following distance metric in \mathbb{D}, d_{\mathbb{D}}: \mathbb{D} \times \mathbb{D} \rightarrow \mathbb{R}:

d_{\mathbb{D}}(z_1,z_2) = \mathrm{arccosh} ( \frac{2|z_1-z_2|^2}{(1-|z_1|^2)(1-|z_2|^2)}+1 )

This same formula holds in higher dimensional versions of the Poincaré disk model as well.

Hyperbolic Area

Similarly to the length of a path in \mathbb{H}, we define the area of a set X \subset \mathbb{H} by:

area_{\mathbb{H}}(X) \equiv \int_{X} \frac{1}{\mathrm{Im}(z)^2}dxdy

Just like length and distance, area_{\mathbb{H}} is also invariant under the action of the group \mathrm{Mob}(\mathbb{H}).

In the Poincaré disk model, area is defined as (using Cartesian coordinates):

area_{\mathbb{D}}(X) = \int_X \frac{4}{(1-|z|^2)^2}dxdy = \int_X \frac{4}{(1-x^2 -y^2)^2}dxdy

It can be shown that the area of a hyperbolic disk in \mathbb{D} with center 0 and radius s is 4\pi \sinh^2(\frac{s}{2}). This implies that for a disk with center 0 and radius s, the ratio \frac{length}{area} \rightarrow 1 as s \rightarrow \infty. This is very unlike the Euclidean case, where \frac{length}{area}\rightarrow 0 as s \rightarrow \infty. Intuitively, lengths become as big as areas for sufficiently large shapes in hyperbolic space (similar results hold for higher dimensional hyperbolic spaces).


This is a truly amazing result that says that for a hyperbolic triangle P with interior angles \alpha, \beta, \gamma, the area is given by area_{\mathbb{H}}(P)=\pi - (\alpha+\beta+\gamma).

In fact, this result generalizes to hyperbolic polygons: if P is a hyperbolic polygon with interior angles \alpha_1, \ldots, \alpha_n, then the area of P is given by:

area_{\mathbb{H}}(P) = (n-2)\pi - \sum_{k=1}^n \alpha_k

Another big difference from the Euclidean case is that in the Euclidean plane, there is only one regular n-gon (up to translation, rotation and dilation) and its interior angle is \frac{n-2}{n}\pi. The situation is completely different in the hyperbolic plane: for all n\geq 3 and each \alpha \in (0,\frac{n-2}{n}\pi), there is a compact regular hyperbolic n-gon whose interior angle is \alpha.

Higher-dimensional hyperbolic spaces

The two-dimensional models already make apparent many strange features of hyperbolic spaces. Things can get even stranger in higher dimensional hyperbolic spaces. However, instead of studying these spaces formally, we can try to get a feel for the strange properties of higher dimensional hyperbolic spaces by exploring the three-dimensional hyperbolic space, \mathbb{H}^3. Here‘s a software tool for exploring \mathbb{H}^3. You can find some background knowledge and information about how to use this tool here. The tool here models the {4,3,6}-tessellation of \mathbb{H}^3 (the notation here means the shape of the cells tiling the space is a cube, whose Schläfli symbol is {4,3}, and there are six of these cells around each edge). This is not the only possible tessellation of \mathbb{H}^3; in fact, it is not even the only possible cubic tessellation of \mathbb{H}^3 (this is again a drastic difference from the Euclidean 3-space where there is essentially only one cubic tessellation of the space). There’s, in fact, an infinite number of such tessellations (a fact closely related to the result mentioned above that says that there are regular hyperbolic n-gons with any interior angle \alpha as long as \alpha \in (0,\frac{n-2}{n}\pi)). I strongly encourage you to explore these different tessellations of \mathbb{H}^3 on Wikipedia, which has amazing visualizations of these tessellations.

Hyperbolic embeddings

Finally, I would like to briefly discuss this paper by Nickel and Kiela that proposes embedding data (especially symbolic or graph-structured data) in hyperbolic spaces (see this previous post of mine for a brief review of why one might want to use embeddings in continuous spaces to represent symbolic data). The idea is really simple: we first define an objective that encourages semantically or functionally related items to have closer representations (and conversely semantically or functionally unrelated items to have more distant representations) in the embedding space, where the embedding space is a hyperbolic space (for example, the Poincaré ball model, i.e. the higher dimensional version of the Poincaré disk model reviewed above) and distance is measured by hyperbolic distance. For instance, one of the objectives they use in the paper is the following loss function:

\mathcal{L}(\Theta) = \sum_{(u,v) \in \mathcal{D}} \log \frac{\exp(-d_{\mathbb{D}}(u,v))}{\sum_{v^\prime \in \mathcal{N}(u)} \exp(-d_{\mathbb{D}}(u,v^\prime))}

where \mathcal{D} denotes the set of semantically or functionally related items that are connected by an edge in the dataset, \mathcal{N}(u) denotes the set of items unrelated to u, and d_{\mathbb{D}} denotes the hyperbolic distance between the embeddings of the items in the Poincaré ball model.

We then just do gradient descent on this loss function to optimize the positions of the items in the embedding space. There are two subtleties to take care of while doing gradient descent: first, because we are working with the Poincaré ball model, we have to make sure that the positions stay within the unit ball during optimization. This can be achieved by applying a projection operator after each update:


where \epsilon > 0 ensures numerical stability.

Secondly, because we are working in a curved space, taking its intrinsic curvature into account during optimization would be more efficient. This means that we have to use the Riemannian gradient rather than the Euclidean gradient for gradient descent. These two gradients are related to each other by the inverse of the metric, which is itself given by the square of the line element scaling factor \rho(z) described above. This then leads to the following update rule:


Here, \eta_t is the learning rate, \nabla_E is the Euclidean gradient and the term (1-||\theta_t||^2)^2/4 is just the inverse of the metric, which in turn is just the square of the scaling factor \rho(z) = 2/(1-|z|^2) derived above for the Poincaré disk model.

The authors show that hyperbolic embeddings lead to better performance in reconstruction and link prediction tasks, usually with a few orders of magnitude smaller number of embedding dimensions than a Euclidean embedding! That’s an impressive improvement. Intuitively, hyperbolic spaces of a given dimension contain a lot “more space” than the corresponding Euclidean space, so we need a lot fewer dimensions to accurately capture the semantic or functional relationships between the items. That said, I have a few concerns about hyperbolic embeddings:

  • Although it may be true that hyperbolic spaces are better suited than Euclidean spaces to capture the semantic or functional relationships in tree-structured data, there is, in general, no reason to assume a priori that the underlying space has to be exactly hyperbolic. I think a better approach would be to learn the appropriate metric, rather than assuming it. One can still initialize it close to a hyperbolic metric, but making it flexible would allow us to capture possible deviations from hyperbolicity.
  • In the paper, they use relatively simple tasks to demonstrate the advantages of hyperbolic embeddings. In practice, one may want to use these embeddings as inputs to some deep and large model being trained on a more challenging task. One may also want to learn such hyperbolic embeddings jointly with the parameters of the deep model in an end-to-end fashion. I’m not quite sure how well hyperbolic embeddings would behave in such a context. The fact that distances blow up very rapidly in hyperbolic spaces as one approaches the boundary suggests that such end-to-end training of a deep and large model on hyperbolic embeddings may be riddled with serious optimization difficulties. Again, making the metric flexible could force the model to find and use a more “learnable” metric.


A note on fast weights: do they really work as advertised?

In fields where one cannot prove one’s claims mathematically, it is immensely important to be as rigorous as possible. Among other things, this implies considering all plausible alternative hypotheses to one’s own hypothesis and making sure that the results cannot be explained by those alternative hypotheses.  People working in experimental fields are usually quite careful about this. Machine learning is also, by and large, an experimental field (in the sense that usually one cannot prove one’s claims with mathematical rigour), so similar standards of experimental rigour should apply there as well.

I was reminded of this when I was re-reading the “fast weights” paper by Ba, Hinton et al. In this paper, they extend the standard vanilla recurrent neural network architecture with some form of Hebbian short-term synaptic plasticity. This Hebbian connectivity maintains a dynamically changing short-term memory of the recent history of the activities of the units in the network. They call this Hebbian connectivity “fast weights” as opposed to the standard “slow” recurrent connectivity.

If I haven’t made a mistake, the equation governing the model can be condensed into the following form (I’m assuming that their inner loop of Hebbian plasticity is unrolled for one step only, i.e. S=1):

\mathbf{h}_{t+1} = f\Big( \mathcal{LN}\Big[\mathbf{W}_h \mathbf{h}_t + \mathbf{W}_x \mathbf{x}_t + \big(\eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top}\big) f(\mathbf{W}_h \mathbf{h}_t + \mathbf{W}_x \mathbf{x}_t) \Big] \Big)

where \mathcal{LN}[\cdot] refers to layer normalization. The fast weight matrix here is given by the Hebbian matrix \eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top}.

The authors show that this model performs better than a number of other recurrent models in a couple of tasks. They seem to think that the Hebbian fast weight matrix is crucial (and mainly responsible) for the superior performance of the model. However, there are two obvious controls that are sorely missing in the paper and that makes a fair assessment of the contribution of fast weights difficult. First, the model has more depth than the standard architectures (note the nested f(\cdot)s in the equation above), so maybe there’s nothing special about the Hebbian fast weight matrix, rather the advantage is mainly due to the added depth. This could be tested by replacing the Hebbian matrix by another matrix, e.g. a random, fixed matrix or the identity matrix. In this connection, it’s interesting to observe that the authors chose this fairly complicated architecture rather than simpler ways of incorporating a Hebbian component to the recurrent connectivity, for example something like this:

\mathbf{h}_{t+1} = f\Big( \mathcal{LN} \Big[ \big[\mathbf{W}_{h} + \eta \sum_{\tau=1}^{\tau=t-1}\lambda^{t-\tau-1}\mathbf{h}_\tau \mathbf{h}_\tau^{\top} \big] \mathbf{h}_{t} + \mathbf{W}_{x} \mathbf{x}_{t} \Big] \Big)

which doesn’t increase the effective depth and would in fact be closer to the biological idea of short-term synaptic plasticity (this, in fact, appears to be how fast weights were originally defined in Hinton’s old “fast weights” paper). This may be because they tried something simple like this and it didn’t work well, so they found the added complexity necessary. But then it would be important to understand why fast weights don’t work in this simple context and require the more complicated machinery.

Secondly, layer normalization by itself improves the performance of vanilla recurrent networks, so we know that part of the improvement is due to normalization, again the question is how much. The necessary control to run here is a vanilla recurrent network with layer normalization.

I’m currently working on running these (and some other) controls, so I should hopefully have some answers to these questions soon.

No, information bottleneck (probably) doesn’t open the “black-box” of deep neural networks

This paper has been making the rounds recently. The paper was posted on arxiv in March, but for some reason seems to have attracted a lot attention recently due to a talk by the second author.

The paper claims to discover two distinct phases of stochastic gradient descent (SGD): error minimization and representation compression. During the error minimization phase, training (and test) error goes down substantially, whereas during the representation compression phase, SGD basically pumps noise into the model and washes out the irrelevant bits of information about the input (irrelevant for the task at hand). The transition between these two phases seems to precisely coincide with a transition in the magnitudes of the means and variances of the gradients: the error minimization phase is dominated by the mean gradients and the compression phase is dominated by the variances of the gradients.

The authors formalize these ideas in terms of information, so error minimization becomes maximizing the mutual information between the labels Y and the representation T, i.e. I(Y; T) and representation compression becomes minimizing the mutual information between the input X and the representation T, i.e. I(X; T). The authors seem to imply that the compression phase is essential to understanding successful generalization in deep neural networks.

Unfortunately, I think information is not the right tool to use if our goal is to understand generalization in deep neural networks. Information is just too general a concept to be of much use in this context. I think a satisfactory explanation of generalization in deep networks would have to be both architecture- and data-dependent (see this and this). I have several specific issues with the paper:

1) They have results from a toy model only. It’s not clear if their results generalize to “realistic” networks and datasets. The talk mentions some results from a convolutional net trained on MNIST, but larger networks and more complex datasets (e.g. CIFAR-10 or CIFAR-100 at the very least) have to be tested in order to be sure that the results are general.

2) Networks right before they enter the compression phase already achieve good generalization performance (in fact, the transition to the compression phase seems to take place at a point where the test error starts to asymptote). This makes it questionable whether compression can really explain the bulk of the generalization performance. Relatedly, the compression phase seems to take place over an unrealistically long period. In practical problems, networks are rarely, if ever, trained for that long, which again questions the applicability of these results to more realistic settings.

3) I tried to replicate the drift vs. diffusion dominated phases of SGD claimed in the paper, but was not successful. Unfortunately, the paper is not entirely clear (another general weakness of the paper: lots of details are left out) about how they generate the crossing gradient mean vs std plot (Fig. 4) –what does STD(\nabla W) even mean?–. So, I did something that seemed most reasonable to me, namely computed the norms of the mean and standard deviation (std) of gradients (across a number of mini-batches) at each epoch. Comparing these two should give an indication of the relative size of the mean gradient compared to the size of the noise in the gradient. The dynamics seems to be always diffusion (noise) dominated with no qualitatively distinct phases during training:


The network here is a simple fully-connected network with 3 hidden layers (128 units in each hidden layer) trained on MNIST (left) or CIFAR-100 (right). At the beginning of each epoch, the gradient means and stds were calculated from 100 estimates of the gradient (from 100 minibatches of size 500). The gradients are gradients with respect to the first layer weights. I didn’t normalize the gradients by the norm of the weights, but this would just scale the mean and std of the gradients the same way, so it shouldn’t change the qualitative pattern.

4) As long as the feedforward transformations are invertible, the networks never really lose any information either about the input or about the labels. So, they have to do something really ad hoc like adding noise or discretizing the activities to mold the whole thing into something that won’t give nonsensical results from an information-theoretic viewpoint. To me, this just shows the unsuitability of information-theoretic notions for understanding neural networks. In the paper, they use bounded tanh units in the hidden layers. To be able to make sense of information-theoretic quantities, they discretize the activations by dividing the activation range [-1,1] into equal-length bins. But then the whole thing basically depends on the discretization and it’s not even clear how this would work for unbounded ReLU units.


Elliptic tales 2

In this second part, we will be concerned with elliptic curves specifically, rather than arbitrary polynomial curves. The main idea will be to introduce an abelian group structure on the points of an elliptic curve, then we will be able to study the curve algebraically. But, first a brief recap of group theory.

Suppose S is a subset of the abelian group (G,+). We define the subgroup of G generated by S to be the group:

\langle S \rangle = \{n_1 x_1 + \ldots + n_k x_k| k\geq 0, n_i \in \mathbb{Z}, x_i \in S \}

with the group operation + borrowed from G (you can check that (\langle S \rangle, +) is indeed a group). G is called a finitely generated abelian group if G = \langle S \rangle for some finite subset S.

x is called a torsion element of G if x has order n for some integer n \geq 1, i.e. nx=0 for some n \geq 1, but mx \neq 0 for all smaller m. If x is not a torsion element, we say that x has infinite order. The set of torsion elements of G make up the torsion subgroup of G, with the group operation again borrowed from G (again, you check that this is indeed a group).

A theorem then tells us that each subgroup of a finitely generated group is also finitely generated and in particular, its torsion subgroup is also finitely generated, hence is finite.

Another important concept is rank. Suppose G is a finitely generated abelian group and H is torsion subgroup. The rank of G is defined to be the smallest integer r such that G can be generated by r elements along with all elements of H.

Now, we’re ready to take up elliptic curves and define an abelian group structure over them. We first define an elliptic curve over a field K, denoted E(K), to be a nonsingular cubic curve containing at least one point with K-coordinates. We call this point \mathcal{O}. It isn’t a trivial assumption to assume that such a point exists. For example, it isn’t at all obvious that a given homogeneous cubic polynomial with rational coefficients should have a solution with rational coordinates.

We define the abelian group operation on E(K) geometrically. Here’s how: let P and Q be two points on E. We define P+Q as follows: first let L be the line connecting P and Q. This line intersects E at a third point R. Then draw the line L^\prime connecting  R and \mathcal{O}. This line intersects the curve at a third point and that third point is defined to be the point P+Q. Pictorially, the construction looks like this (the curve E is shown in red):


Some amount of work is needed to show that this construction does indeed define an abelian group (for example, you can check that \mathcal{O} is the identity element of this group), but we will not do it here. Note that we’re also assuming that E(K) is non-singular. The case of singular cubic equations requires special handling, but the group structure in this case turns out to be isomorphic to one of a small number of much simpler groups (again we will not do this reduction here), so in a sense, the non-singular case is the interesting case. Whether an elliptic curve E is singular or not can be determined from its discriminant, \Delta_E, which is an easy-to-compute algebraic function of the coefficients describing the curve. The curve is singular if and only if \Delta_E=0.

We will later be interested in elliptic curves over finite fields such as \mathbf{F}_p, i.e. E(\mathbf{F}_p), and especially in the size of E(\mathbf{F}_p), which we denote by N_p. A theorem due to Hasse states that:

Theorem (Hasse): The number N_p satisfies -2\sqrt{p} \leq p+1-N_p \leq 2\sqrt{p}.

The number p+1-N_p turns out to be so important that it gets its own name, a_p, so Hasse’s theorem can be restated as saying that |a_p| \leq 2\sqrt{p}.

We will also be interested in  elliptic curves over rationals, E(\mathbf{Q}), and we can say a few things about the group structure of E(\mathbf{Q}):

1) A theorem due to Mordell says that E(\mathbf{Q}) is a finitely generated abelian group. This implies that the torsion subgroup E(\mathbf{Q})_{\mathrm{tors}} is also finitely generated, hence is finite.

2) A theorem due to Mazur says that if P is a point of order n in E(\mathbf{Q})_{\mathrm{tors}}, then either 1 \leq n \leq 10 or n=12. I fact, E(\mathbf{Q})_{\mathrm{tors}} has a very simple structure: either E(\mathbf{Q})_{\mathrm{tors}} is a cyclic group of order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or 12; or E(\mathbf{Q})_{\mathrm{tors}} is generated by two elements A_1 and A_2, where the order of A_1 is 2, 4, 6, or 8, and the order of A_2 is 2.

3) A theorem due to Nagell and Lutz says that for an elliptic curve E described by the equation y^2 = x^3 + a_2 x^2 + a_4 x + a_6, with a_2, a_4, a_6 integers, if P is an element of E(\mathbf{Q})_{\mathrm{tors}}, then both x_1 and y_1 are integers and either y_1=0, in which case 2P=\mathcal{O}, or else y_1 divides the discriminant \Delta_E. This theorem allows us to enumerate all torsion points of E relatively easily.

4) It turns out to be much more difficult to say something about the rank of E(\mathbf{Q}). It is conjectured that there is no maximum rank for E(\mathbf{Q}), but this remains extremely difficult to prove. It is also believed that a random elliptic curve, i.e. an equation of the form y^2 = x^3 + Ax + B, where A and B are random integers, should have rank 0 or 1 with high probability.

Elliptic tales 1

I’ve been reading Elliptic Tales, a fantastic book on elliptic curves by Avner Ash and Robert Gross, and I’d like to summarize what I’ve learned from the book in three long-ish posts corresponding to the three parts of the book. This post is the first one in this series.

We are going to consider polynomial equations and their degrees. There are two different ways to define the degree of a polynomial equation and of the curve it describes. The first definition is purely algebraic: the algebraic degree of a polynomial is the largest degree of any monomial appearing in the polynomial: e.g. the algebraic degree of x^3y^2z + x^2y^2z+xz^3 is 6, because the monomial with the largest degree, i.e. x^3y^2z, has degree 6. The second notion of degree is purely geometric: informally, we want to say that the geometric degree of a curve described by a polynomial equation f(x,y,\ldots)=0 is the number of intersection points it has with an arbitrary line L. Of course, the problem is that this is not well-defined, because the number of intersection points depends on the line L we pick! Consider the figure below (adapted from Figure 1.6 in the book):


The line K intersects the parabola G at two points, the lines M and L intersect G at one point only and the line N doesn’t intersect G at all!

The uncreative and uninteresting way to deal with this nuisance is to define the geometric degree to be the maximum number of intersection points we could get by varying the line L. But instead we’re going to be bold and demand that the geometric degree of a curve to be the same regardless of the line L we choose! It turns out that this can be done and is the source of a lot of rich, beautiful and exciting mathematics. But, enforcing this demand will require us to reconsider which number system to adopt, and to carefully redefine what we mean by an “intersection point” and how to count them.

It turns out that we need to do three things: 1) define our polynomials over an algebraically closed field of numbers; 2) switch from the Euclidean plane to the projective plane; 3) count intersection points by their “multiplicities”. I will now explain what each of these means.

1) F is an algebraically closed field means that any non-constant polynomial whose coefficients are in F has a root in F. We can see immediately that the reals \mathbb{R} is not algebraically closed, since e.g. the polynomial x^2 + 1 has no roots in  \mathbb{R}. However, the fundamental theorem of algebra tells us that the complex numbers \mathbb{C} is algebraically closed. It is intuitively clear why we need to work with an algebraically closed field, hence why e.g. \mathbb{R} won’t do for us: we want to say that algebraically x^2 + 1 has degree 2, but geometrically x^2+1=0 does not even make sense in \mathbb{R}.

Switching to an algebraically closed field helps us deal with cases such as the line N in the above figure. Although this line does not intersect the parabola in the real plane, it does intersect the parabola in the complex plane at exactly two points.

2) In the projective plane, we describe points by three coordinates instead of the usual two: (x:y:z). Setting z=1 gives us the usual (“finite”) plane. Points with z=0 correspond to “points at infinity”. The coordinates (x:y:z) and (\lambda x: \lambda y: \lambda z) with \lambda\neq 0 describe the same point. So, the points at infinity form a single line at infinity that can be parametrized by (1:t:0) or by (t:1:0) where t can be any real number.

Polynomial curves in the usual “finite” plane might get points at infinity attached to them in the projective plane. To find out what specific points at infinity get attached to a particular curve, we need to define the homogenization of a polynomial: for a polynomial of degree d \geq 1 in two variables, f(x,y), this means multiplying each monomial in f(x,y) by an appropriate power of z so that all monomials have the same degree d. For example, the homogenization of f(x,y)=y-x^2 is F(x,y,z)=yz - x^2. Now, to find out which points at infinity get attached to the curve described by the polynomial equation F(x,y,z)=0, we simply set z=0: y \cdot 0 - x^2 =0 to which the solutions are (0:y:0) with y \neq 0. By the property above, these describe a single point which we might as well denote by (0:1:0).

Why do we need this homogenization of polynomials and these points at infinity? Intuitively, it is to deal with cases such as the line L in the figure above, which intersects the parabola at a single point, whereas we desire every line to intersect the parabola at exactly two points. The second intersection point turns out to be a point at infinity: namely (0:1:0). This point at infinity belongs to both L and the parabola, as you can easily check.

3) Finally, we have to count intersection points by their multiplicity. Suppose a given curve C and a line L intersect at a point P. Suppose further that the line is parametrized by the parameter t such that P corresponds to t=0. Then t=0 is a root of the polynomial g(t) obtained by plugging the parametric form of L in the polynomial describing C. If the degree of this root is k, we say that the intersection point has multiplicity k and express it by I(C,L,P)=k. This definition was a mouthful, so let’s do an example: suppose our curve is defined by the polynomial f(x,y)=x^2 + y + 4 and the line L is parametrized by x = 2t+1 and y=t-5. If we plug these in f(x,y), we get g(t)=5t+4t^2=(t-0)(4t+5). The root t=0 appears with multiplicity 1 in g(t), so we say that the intersection point corresponding to t=0, i.e. P=(1,-5) has multiplicity 1.

When can the multiplicity of an intersection point be larger than 1? This can happen either when the line is tangent to the curve at the intersection point, or when the intersection point is singular, i.e. the tangent is not well-defined at the intersection point, i.e. the partial derivatives all vanish at the intersection point. Incidentally, because we are dealing with polynomials only, we can define the derivative of a polynomial purely procedurally (i.e. the derivative of f(x)=a_0+a_1 x\ldots+a_n x^n is defined to be f^\prime (x) = a_1 + \ldots + n x^{n-1}), without having to worry about whether the standard analytic definition in terms of limits even makes sense in a given field. Ditto for partial derivatives.

It is again clear why we need to count intersection points by their multiplicity: it is to deal with cases such as the line M in the above figure: this line is tangent to the parabola, hence intersects it at a single point only, but the intersection has multiplicity 2, as you can easily check.

We are now ready to state Bézout’s theorem which is the culmination of our desire to make the algebraic degree and the geometric degree of polynomial equations or curves to always match each other. Suppose C and D are two projective curves that intersect at a finite number of points: P_1, P_2, \ldots, P_n. Then Bézout’s theorem states that the product of the degrees of the homogeneous polynomials defining C and D exactly matches what we call the global intersection multiplicity \mathcal{I}(C,D)=\sum_k I(C,D,P_k), i.e. the sum of the multiplicities of all intersection points. The fact that any line intersects a curve C at the same number of points and that this number is equal to the algebraic degree of the polynomial describing C are immediate consequences of Bézout’s theorem.


Recurrence as an efficient way to achieve depth in neural networks

We’ve recently discussed this paper on “feedback networks” by Zamir et al. in a journal club. The paper temporalizes static image recognition tasks and shows that recurrent nets that are trained to do the task at all times perform quite well even early on during inference and naturally implement a coarse-to-fine classification strategy where early outputs of the network correspond to coarse classifications that get refined over time.  I really like the paper overall. I especially appreciate and applaud their efforts to probe and understand how recurrence changes the nature of the representations learned in deep networks. I basically have two criticisms of the paper. The first is mostly terminological: I think the use of the term “feedback” in feedback networks is misleading. What they rather mean is just “recurrence” (as in recurrence in vanilla recurrent neural networks or in LSTMs), whereas “feedback” implies something more specific, i.e. a set of connections functionally and architecturally (architecturally=having a different structure) distinct from feedforward and lateral (within-layer) connections. The second criticism is that, again unlike what the title implies, the crucial manipulation of the paper has actually nothing to do with the architecture of the network per se, but how it is trained: specifically whether the outputs of the intermediate layers are also trained to do the task or not. This is the unambiguous conclusion of the results reported in the crucial Table 4. This particular training method can be implemented in any type of network be it purely feedforward or recurrent.

This second point made me think about what actual advantages (or disadvantages) a recurrent architecture specifically bestows on a neural net and I realized something that I perhaps knew only implicitly: recurrence is an efficient way of achieving depth in a neural network. It’s well known that a recurrent net unrolled in time is equivalent to a feedforward network with weight sharing across layers. So, a recurrent net achieves depth d with N neurons and N^2 synapses, whereas a feedforward network achieves the same depth with dN neurons and dN^2 synapses. Of course, the recurrent net is less expressive than the corresponding feedforward net due to weight sharing across layers, but both the feedback network paper and an earlier paper by Liao & Poggio show that one doesn’t lose much in terms of performance by sacrificing this extra bit of expressivity. Intriguingly, this could also explain why even in highly visual animals such as ourselves and other primates, one doesn’t find very deep visual cortices. Instead, one finds only a handful of hierarchical visual areas (~5 in primates), but lots of recurrence both within the same area and across areas. This then raises the opposite question: if recurrence is so efficient, why isn’t the whole visual cortex, or even the entire brain, a fully recurrent net? I suspect that there is an interesting trade-off between expressivity and efficiency and our visual cortices might be striking a balance between the two. But, fleshing out this idea requires some work.

Backpropagation works fine with random feedback weights

Neural networks are trained with the backpropagation algorithm. The backpropagation algorithm has a nice and simple interpretation in terms of backpropagating errors from the top layer to the lower layers of the network during training. For example, in the case of multi-layer feedforward linear networks, the update for a particular weight between layers l and l+1 can be written as:

\Delta W_{ij} \propto \gamma^\top W_{l+1}^\top W_{l+2}^\top \ldots W_{L-1}^\top \mathbf{e}

Here W_m are the forward weights, \mathbf{e} is the error vector and \gamma is a vector that depends only on the activities of the pre- and post-synaptic neurons of this particular connection. We can easily interpret this expression as the backpropagation of the error vector \mathbf{e} from the top of the network to the rest of it through feedback weights W_m^\top. Interpreting it this way has some biological appeal, since the brain is known to contain an abundance of feedback projections whose function is rather unclear. However, we have a problem. In order to make this idea work, we need to assume that feedback projections are always symmetric to the feed-forward projections, which is not particularly realistic for the brain. So, perhaps the first thing one would try is to see how important it is to maintain this precise symmetry between the forward and the backward projections. For example, what if we replace W_m^\top in the backward projections by some random and fixed matrices B_m. This is precisely what they do in this recent paper by Lillicrap et al. (2016). This is such a simple and beautiful idea that I’m surprised nobody tried it before!

Amazingly, this works just fine. In fact, it seems to work slightly better than standard backprop in most cases they consider in the paper. This could be because the algorithm with fixed random feedback has elements of second-order optimization (in particular Gauss-Newton), the authors argue. Formally, they show that in linear networks, the error feedback projected by fixed, random feedback weights is always a positive scalar multiple of the error feedback that would be sent by the pseudo-inverse of the forward weights, W^+. The significance of this is that if the feedback weights exactly matched the pseudo-inverse, W^+, they would be implementing an approximation to the Gauss-Newton method. Of course this argument is heuristic: the algorithm doesn’t really implement a second-order descent algorithm; it doesn’t even correspond to minimizing any loss function as the dynamics it describes is non-conservative, as the authors note.

It’s interesting to note that during learning forward weights come to resemble (but not exactly match) the pseudo-inverse of feedback weights (more than they come to resemble the transpose of feedback weights). Had the forward weights exactly matched the pseudo-inverse of feedback weights, that would mean that you don’t really need to train the networks, because the pseudo-inverse of a random matrix is a random matrix! So, the network seems to try to satisfy two conditions at the same time: on the one hand it tries to align the forward weights to the pseudo-inverse of feedback weights so that the feedback weights can backpropagate useful error signals, but on the other hand, it tries to do the task which requires that the feed-forward weights be non-random (and indeed orthogonal to any random matrix on average). So, there seems to be an interesting trade-off here. Of course, this “the-network-tries-to” language is a bit misleading, because as mentioned above, the dynamics doesn’t really correspond exactly to minimizing any objective, but effectively that seems to be what’s happening.

Like all good papers, this paper opens up a lot of interesting questions. Here are some I’m particularly interested in:

  • Although the authors’ efforts to analyze the algorithm are laudable, it would be nice to have a better, more rigorous understanding of what exactly the algorithm is doing.
  • Except for the spiking network example, they ignore any nonlinearities in the backward projection path (even in non-linear networks). This is not biologically plausible. It would be interesting to explore what happens when non-linearities in the backward pathway are included.
  • Can we somehow do better than the completely random feedback matrices B? What is the optimal fixed feedback matrix?
  • Most importantly, feedback connections are plastic too. Does learning become even faster with plastic feedback weights?

Logic in vector spaces

Symbolic and distributional representations in AI have complementary virtues. Symbolic representations are useful for reasoning, are easier to interpret and enable abstraction; whereas distributional representations can deal with noise and ambiguity, are easier to implement in neural networks and hence enable learning. So, how can we implement both representations in a common format so that we can make use of the desirable properties of both types of representation?

Grefenstette (2013) shows a simple way of mapping predicate logic (plus some rudimentary quantification) to tensor calculus. In this mapping, the truth values are assigned specific vectors: \top=[1 \; 0]^\top and \bot=[0 \; 1]^\top; domain objects are represented by one-hot vectors; unary predicates are assigned rank 2 tensors (i.e. matrices) that lead to the correct truth value when contracted with the vectors corresponding to the domain objects. Similarly, n-ary relations are represented by rank n+1 tensors that lead to the correct truth values when contracted with n-tuples of objects in the domain.

To give an example from the paper, suppose there are two objects in the domain represented by \mathbf{d}_1 and \mathbf{d}_2. Everybody loves everybody in this domain, except \mathbf{d}_2 doesn’t love \mathbf{d}_1. Then, the binary “loves” relation is represented by the rank 3 tensor:

\mathbf{T}^{loves} = \top \otimes \mathbf{d}_1 \otimes \mathbf{d}_1 + \top \otimes \mathbf{d}_1 \otimes \mathbf{d}_2 + \top \otimes \mathbf{d}_2 \otimes \mathbf{d}_2 + \bot \otimes \mathbf{d}_2 \otimes \mathbf{d}_1

Negation is modeled as the swap matrix:

\mathbf{T}^\neg = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}

and, because they are binary, the other logical operators can be modeled as rank 3 tensor. For simple (non-embedded) quantification, one first turns predicates into vectors (with domain objects satisfying the predicate having a 1 and the other objects having a 0 in the vector) and checks if \mathbf{X} = min(\mathbf{X},\mathbf{Y}) for “All Xs are Ys” and checks if |\mathbf{X}|>0 for “There exists X”, with \mathbf{X} and \mathbf{Y} representing the vectorized representations of the predicates X and Y, respectively.

One can improve upon the representations of domain objects and predicates in this scheme by learning low-dimensional embeddings for them. This approach is proposed in Rocktaeschel et al. (2014). The idea here is to minimize an objective like the following:

\displaystyle \min_{[d], [P]\; \forall d \in \mathcal{D}, \; \forall P \in \mathcal{P}} \sum_{\mathcal{F} \in K} || [\mathcal{F}]-\top ||_2

where [d] and [P] are low-dimensional embeddings of the domain objects and predicates and \mathcal{F} ranges over the facts in a given knowledge base K that we want to capture. One can easily minimize this objective by stochastic gradient descent.

The advantage of such low-dimensional embeddings is that they enable generalization of inferences made about domain objects or predicates to similar objects or predicates. They are also more efficient than the potentially very high-dimensional one-hot type representations.