Computational neuroscience and machine learning

### Elliptic tales 2

In this second part, we will be concerned with elliptic curves specifically, rather than arbitrary polynomial curves. The main idea will be to introduce an abelian group structure on the points of an elliptic curve, then we will be able to study the curve algebraically. But, first a brief recap of group theory.

Suppose $S$ is a subset of the abelian group $(G,+)$. We define the subgroup of $G$ generated by $S$ to be the group:

$\langle S \rangle = \{n_1 x_1 + \ldots + n_k x_k| k\geq 0, n_i \in \mathbb{Z}, x_i \in S \}$

with the group operation $+$ borrowed from $G$ (you can check that $(\langle S \rangle, +)$ is indeed a group). $G$ is called a finitely generated abelian group if $G = \langle S \rangle$ for some finite subset $S$.

$x$ is called a torsion element of $G$ if $x$ has order $n$ for some integer $n \geq 1$, i.e. $nx=0$ for some $n \geq 1$, but $mx \neq 0$ for all smaller $m$. If $x$ is not a torsion element, we say that $x$ has infinite order. The set of torsion elements of $G$ make up the torsion subgroup of $G$, with the group operation again borrowed from $G$ (again, you check that this is indeed a group).

A theorem then tells us that each subgroup of a finitely generated group is also finitely generated and in particular, its torsion subgroup is also finitely generated, hence is finite.

Another important concept is rank. Suppose $G$ is a finitely generated abelian group and $H$ is torsion subgroup. The rank of $G$ is defined to be the smallest integer $r$ such that $G$ can be generated by $r$ elements along with all elements of $H$.

Now, we’re ready to take up elliptic curves and define an abelian group structure over them. We first define an elliptic curve over a field $K$, denoted $E(K)$, to be a nonsingular cubic curve containing at least one point with $K$-coordinates. We call this point $\mathcal{O}$. It isn’t a trivial assumption to assume that such a point exists. For example, it isn’t at all obvious that a given homogeneous cubic polynomial with rational coefficients should have a solution with rational coordinates.

We define the abelian group operation on $E(K)$ geometrically. Here’s how: let $P$ and $Q$ be two points on $E$. We define $P+Q$ as follows: first let $L$ be the line connecting $P$ and $Q$. This line intersects $E$ at a third point $R$. Then draw the line $L^\prime$ connecting  $R$ and $\mathcal{O}$. This line intersects the curve at a third point and that third point is defined to be the point $P+Q$. Pictorially, the construction looks like this (the curve $E$ is shown in red):

Some amount of work is needed to show that this construction does indeed define an abelian group (for example, you can check that $\mathcal{O}$ is the identity element of this group), but we will not do it here. Note that we’re also assuming that $E(K)$ is non-singular. The case of singular cubic equations requires special handling, but the group structure in this case turns out to be isomorphic to one of a small number of much simpler groups (again we will not do this reduction here), so in a sense, the non-singular case is the interesting case. Whether an elliptic curve $E$ is singular or not can be determined from its discriminant, $\Delta_E$, which is an easy-to-compute algebraic function of the coefficients describing the curve. The curve is singular if and only if $\Delta_E=0$.

We will later be interested in elliptic curves over finite fields such as $\mathbf{F}_p$, i.e. $E(\mathbf{F}_p)$, and especially in the size of $E(\mathbf{F}_p)$, which we denote by $N_p$. A theorem due to Hasse states that:

Theorem (Hasse): The number $N_p$ satisfies $-2\sqrt{p} \leq p+1-N_p \leq 2\sqrt{p}$.

The number $p+1-N_p$ turns out to be so important that it gets its own name, $a_p$, so Hasse’s theorem can be restated as saying that $|a_p| \leq 2\sqrt{p}$.

We will also be interested in  elliptic curves over rationals, $E(\mathbf{Q})$, and we can say a few things about the group structure of $E(\mathbf{Q})$:

1) A theorem due to Mordell says that $E(\mathbf{Q})$ is a finitely generated abelian group. This implies that the torsion subgroup $E(\mathbf{Q})_{\mathrm{tors}}$ is also finitely generated, hence is finite.

2) A theorem due to Mazur says that if $P$ is a point of order $n$ in $E(\mathbf{Q})_{\mathrm{tors}}$, then either $1 \leq n \leq 10$ or $n=12$. I fact, $E(\mathbf{Q})_{\mathrm{tors}}$ has a very simple structure: either $E(\mathbf{Q})_{\mathrm{tors}}$ is a cyclic group of order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or 12; or $E(\mathbf{Q})_{\mathrm{tors}}$ is generated by two elements $A_1$ and $A_2$, where the order of $A_1$ is 2, 4, 6, or 8, and the order of $A_2$ is 2.

3) A theorem due to Nagell and Lutz says that for an elliptic curve $E$ described by the equation $y^2 = x^3 + a_2 x^2 + a_4 x + a_6$, with $a_2, a_4, a_6$ integers, if $P$ is an element of $E(\mathbf{Q})_{\mathrm{tors}}$, then both $x_1$ and $y_1$ are integers and either $y_1=0$, in which case $2P=\mathcal{O}$, or else $y_1$ divides the discriminant $\Delta_E$. This theorem allows us to enumerate all torsion points of $E$ relatively easily.

4) It turns out to be much more difficult to say something about the rank of $E(\mathbf{Q})$. It is conjectured that there is no maximum rank for $E(\mathbf{Q})$, but this remains extremely difficult to prove. It is also believed that a random elliptic curve, i.e. an equation of the form $y^2 = x^3 + Ax + B$, where $A$ and $B$ are random integers, should have rank 0 or 1 with high probability.

Advertisements

### Elliptic tales 1

I’ve been reading Elliptic Tales, a fantastic book on elliptic curves by Avner Ash and Robert Gross, and I’d like to summarize what I’ve learned from the book in three long-ish posts corresponding to the three parts of the book. This post is the first one in this series.

We are going to consider polynomial equations and their degrees. There are two different ways to define the degree of a polynomial equation and of the curve it describes. The first definition is purely algebraic: the algebraic degree of a polynomial is the largest degree of any monomial appearing in the polynomial: e.g. the algebraic degree of $x^3y^2z + x^2y^2z+xz^3$ is 6, because the monomial with the largest degree, i.e. $x^3y^2z$, has degree 6. The second notion of degree is purely geometric: informally, we want to say that the geometric degree of a curve described by a polynomial equation $f(x,y,\ldots)=0$ is the number of intersection points it has with an arbitrary line $L$. Of course, the problem is that this is not well-defined, because the number of intersection points depends on the line $L$ we pick! Consider the figure below (adapted from Figure 1.6 in the book):

The line $K$ intersects the parabola $G$ at two points, the lines $M$ and $L$ intersect $G$ at one point only and the line $N$ doesn’t intersect $G$ at all!

The uncreative and uninteresting way to deal with this nuisance is to define the geometric degree to be the maximum number of intersection points we could get by varying the line $L$. But instead we’re going to be bold and demand that the geometric degree of a curve to be the same regardless of the line $L$ we choose! It turns out that this can be done and is the source of a lot of rich, beautiful and exciting mathematics. But, enforcing this demand will require us to reconsider which number system to adopt, and to carefully redefine what we mean by an “intersection point” and how to count them.

It turns out that we need to do three things: 1) define our polynomials over an algebraically closed field of numbers; 2) switch from the Euclidean plane to the projective plane; 3) count intersection points by their “multiplicities”. I will now explain what each of these means.

1) $F$ is an algebraically closed field means that any non-constant polynomial whose coefficients are in $F$ has a root in $F$. We can see immediately that the reals $\mathbb{R}$ is not algebraically closed, since e.g. the polynomial $x^2 + 1$ has no roots in  $\mathbb{R}$. However, the fundamental theorem of algebra tells us that the complex numbers $\mathbb{C}$ is algebraically closed. It is intuitively clear why we need to work with an algebraically closed field, hence why e.g. $\mathbb{R}$ won’t do for us: we want to say that algebraically $x^2 + 1$ has degree 2, but geometrically $x^2+1=0$ does not even make sense in $\mathbb{R}$.

Switching to an algebraically closed field helps us deal with cases such as the line $N$ in the above figure. Although this line does not intersect the parabola in the real plane, it does intersect the parabola in the complex plane at exactly two points.

2) In the projective plane, we describe points by three coordinates instead of the usual two: $(x:y:z)$. Setting $z=1$ gives us the usual (“finite”) plane. Points with $z=0$ correspond to “points at infinity”. The coordinates $(x:y:z)$ and $(\lambda x: \lambda y: \lambda z)$ with $\lambda\neq 0$ describe the same point. So, the points at infinity form a single line at infinity that can be parametrized by $(1:t:0)$ or by $(t:1:0)$ where $t$ can be any real number.

Polynomial curves in the usual “finite” plane might get points at infinity attached to them in the projective plane. To find out what specific points at infinity get attached to a particular curve, we need to define the homogenization of a polynomial: for a polynomial of degree $d \geq 1$ in two variables, $f(x,y)$, this means multiplying each monomial in $f(x,y)$ by an appropriate power of $z$ so that all monomials have the same degree $d$. For example, the homogenization of $f(x,y)=y-x^2$ is $F(x,y,z)=yz - x^2$. Now, to find out which points at infinity get attached to the curve described by the polynomial equation $F(x,y,z)=0$, we simply set $z=0$: $y \cdot 0 - x^2 =0$ to which the solutions are $(0:y:0)$ with $y \neq 0$. By the property above, these describe a single point which we might as well denote by $(0:1:0)$.

Why do we need this homogenization of polynomials and these points at infinity? Intuitively, it is to deal with cases such as the line $L$ in the figure above, which intersects the parabola at a single point, whereas we desire every line to intersect the parabola at exactly two points. The second intersection point turns out to be a point at infinity: namely $(0:1:0)$. This point at infinity belongs to both $L$ and the parabola, as you can easily check.

3) Finally, we have to count intersection points by their multiplicity. Suppose a given curve $C$ and a line $L$ intersect at a point $P$. Suppose further that the line is parametrized by the parameter $t$ such that $P$ corresponds to $t=0$. Then $t=0$ is a root of the polynomial $g(t)$ obtained by plugging the parametric form of $L$ in the polynomial describing $C$. If the degree of this root is $k$, we say that the intersection point has multiplicity $k$ and express it by $I(C,L,P)=k$. This definition was a mouthful, so let’s do an example: suppose our curve is defined by the polynomial $f(x,y)=x^2 + y + 4$ and the line $L$ is parametrized by $x = 2t+1$ and $y=t-5$. If we plug these in $f(x,y)$, we get $g(t)=5t+4t^2=(t-0)(4t+5)$. The root $t=0$ appears with multiplicity $1$ in $g(t)$, so we say that the intersection point corresponding to $t=0$, i.e. $P=(1,-5)$ has multiplicity $1$.

When can the multiplicity of an intersection point be larger than $1$? This can happen either when the line is tangent to the curve at the intersection point, or when the intersection point is singular, i.e. the tangent is not well-defined at the intersection point, i.e. the partial derivatives all vanish at the intersection point. Incidentally, because we are dealing with polynomials only, we can define the derivative of a polynomial purely procedurally (i.e. the derivative of $f(x)=a_0+a_1 x\ldots+a_n x^n$ is defined to be $f^\prime (x) = a_1 + \ldots + n x^{n-1}$), without having to worry about whether the standard analytic definition in terms of limits even makes sense in a given field. Ditto for partial derivatives.

It is again clear why we need to count intersection points by their multiplicity: it is to deal with cases such as the line $M$ in the above figure: this line is tangent to the parabola, hence intersects it at a single point only, but the intersection has multiplicity $2$, as you can easily check.

We are now ready to state Bézout’s theorem which is the culmination of our desire to make the algebraic degree and the geometric degree of polynomial equations or curves to always match each other. Suppose $C$ and $D$ are two projective curves that intersect at a finite number of points: $P_1$, $P_2$, $\ldots$, $P_n$. Then Bézout’s theorem states that the product of the degrees of the homogeneous polynomials defining $C$ and $D$ exactly matches what we call the global intersection multiplicity $\mathcal{I}(C,D)=\sum_k I(C,D,P_k)$, i.e. the sum of the multiplicities of all intersection points. The fact that any line intersects a curve $C$ at the same number of points and that this number is equal to the algebraic degree of the polynomial describing $C$ are immediate consequences of Bézout’s theorem.

### Recurrence as an efficient way to achieve depth in neural networks

We’ve recently discussed this paper on “feedback networks” by Zamir et al. in a journal club. The paper temporalizes static image recognition tasks and shows that recurrent nets that are trained to do the task at all times perform quite well even early on during inference and naturally implement a coarse-to-fine classification strategy where early outputs of the network correspond to coarse classifications that get refined over time.  I really like the paper overall. I especially appreciate and applaud their efforts to probe and understand how recurrence changes the nature of the representations learned in deep networks. I basically have two criticisms of the paper. The first is mostly terminological: I think the use of the term “feedback” in feedback networks is misleading. What they rather mean is just “recurrence” (as in recurrence in vanilla recurrent neural networks or in LSTMs), whereas “feedback” implies something more specific, i.e. a set of connections functionally and architecturally (architecturally=having a different structure) distinct from feedforward and lateral (within-layer) connections. The second criticism is that, again unlike what the title implies, the crucial manipulation of the paper has actually nothing to do with the architecture of the network per se, but how it is trained: specifically whether the outputs of the intermediate layers are also trained to do the task or not. This is the unambiguous conclusion of the results reported in the crucial Table 4. This particular training method can be implemented in any type of network be it purely feedforward or recurrent.

This second point made me think about what actual advantages (or disadvantages) a recurrent architecture specifically bestows on a neural net and I realized something that I perhaps knew only implicitly: recurrence is an efficient way of achieving depth in a neural network. It’s well known that a recurrent net unrolled in time is equivalent to a feedforward network with weight sharing across layers. So, a recurrent net achieves depth $d$ with $N$ neurons and $N^2$ synapses, whereas a feedforward network achieves the same depth with $dN$ neurons and $dN^2$ synapses. Of course, the recurrent net is less expressive than the corresponding feedforward net due to weight sharing across layers, but both the feedback network paper and an earlier paper by Liao & Poggio show that one doesn’t lose much in terms of performance by sacrificing this extra bit of expressivity. Intriguingly, this could also explain why even in highly visual animals such as ourselves and other primates, one doesn’t find very deep visual cortices. Instead, one finds only a handful of hierarchical visual areas (~5 in primates), but lots of recurrence both within the same area and across areas. This then raises the opposite question: if recurrence is so efficient, why isn’t the whole visual cortex, or even the entire brain, a fully recurrent net? I suspect that there is an interesting trade-off between expressivity and efficiency and our visual cortices might be striking a balance between the two. But, fleshing out this idea requires some work.

### Backpropagation works fine with random feedback weights

Neural networks are trained with the backpropagation algorithm. The backpropagation algorithm has a nice and simple interpretation in terms of backpropagating errors from the top layer to the lower layers of the network during training. For example, in the case of multi-layer feedforward linear networks, the update for a particular weight between layers $l$ and $l+1$ can be written as:

$\Delta W_{ij} \propto \gamma^\top W_{l+1}^\top W_{l+2}^\top \ldots W_{L-1}^\top \mathbf{e}$

Here $W_m$ are the forward weights, $\mathbf{e}$ is the error vector and $\gamma$ is a vector that depends only on the activities of the pre- and post-synaptic neurons of this particular connection. We can easily interpret this expression as the backpropagation of the error vector $\mathbf{e}$ from the top of the network to the rest of it through feedback weights $W_m^\top$. Interpreting it this way has some biological appeal, since the brain is known to contain an abundance of feedback projections whose function is rather unclear. However, we have a problem. In order to make this idea work, we need to assume that feedback projections are always symmetric to the feed-forward projections, which is not particularly realistic for the brain. So, perhaps the first thing one would try is to see how important it is to maintain this precise symmetry between the forward and the backward projections. For example, what if we replace $W_m^\top$ in the backward projections by some random and fixed matrices $B_m$. This is precisely what they do in this recent paper by Lillicrap et al. (2016). This is such a simple and beautiful idea that I’m surprised nobody tried it before!

Amazingly, this works just fine. In fact, it seems to work slightly better than standard backprop in most cases they consider in the paper. This could be because the algorithm with fixed random feedback has elements of second-order optimization (in particular Gauss-Newton), the authors argue. Formally, they show that in linear networks, the error feedback projected by fixed, random feedback weights is always a positive scalar multiple of the error feedback that would be sent by the pseudo-inverse of the forward weights, $W^+$. The significance of this is that if the feedback weights exactly matched the pseudo-inverse, $W^+$, they would be implementing an approximation to the Gauss-Newton method. Of course this argument is heuristic: the algorithm doesn’t really implement a second-order descent algorithm; it doesn’t even correspond to minimizing any loss function as the dynamics it describes is non-conservative, as the authors note.

It’s interesting to note that during learning forward weights come to resemble (but not exactly match) the pseudo-inverse of feedback weights (more than they come to resemble the transpose of feedback weights). Had the forward weights exactly matched the pseudo-inverse of feedback weights, that would mean that you don’t really need to train the networks, because the pseudo-inverse of a random matrix is a random matrix! So, the network seems to try to satisfy two conditions at the same time: on the one hand it tries to align the forward weights to the pseudo-inverse of feedback weights so that the feedback weights can backpropagate useful error signals, but on the other hand, it tries to do the task which requires that the feed-forward weights be non-random (and indeed orthogonal to any random matrix on average). So, there seems to be an interesting trade-off here. Of course, this “the-network-tries-to” language is a bit misleading, because as mentioned above, the dynamics doesn’t really correspond exactly to minimizing any objective, but effectively that seems to be what’s happening.

Like all good papers, this paper opens up a lot of interesting questions. Here are some I’m particularly interested in:

• Although the authors’ efforts to analyze the algorithm are laudable, it would be nice to have a better, more rigorous understanding of what exactly the algorithm is doing.
• Except for the spiking network example, they ignore any nonlinearities in the backward projection path (even in non-linear networks). This is not biologically plausible. It would be interesting to explore what happens when non-linearities in the backward pathway are included.
• Can we somehow do better than the completely random feedback matrices $B$? What is the optimal fixed feedback matrix?
• Most importantly, feedback connections are plastic too. Does learning become even faster with plastic feedback weights?

### Logic in vector spaces

Symbolic and distributional representations in AI have complementary virtues. Symbolic representations are useful for reasoning, are easier to interpret and enable abstraction; whereas distributional representations can deal with noise and ambiguity, are easier to implement in neural networks and hence enable learning. So, how can we implement both representations in a common format so that we can make use of the desirable properties of both types of representation?

Grefenstette (2013) shows a simple way of mapping predicate logic (plus some rudimentary quantification) to tensor calculus. In this mapping, the truth values are assigned specific vectors: $\top=[1 \; 0]^\top$ and $\bot=[0 \; 1]^\top$; domain objects are represented by one-hot vectors; unary predicates are assigned rank 2 tensors (i.e. matrices) that lead to the correct truth value when contracted with the vectors corresponding to the domain objects. Similarly, $n$-ary relations are represented by rank $n+1$ tensors that lead to the correct truth values when contracted with $n$-tuples of objects in the domain.

To give an example from the paper, suppose there are two objects in the domain represented by $\mathbf{d}_1$ and $\mathbf{d}_2$. Everybody loves everybody in this domain, except $\mathbf{d}_2$ doesn’t love $\mathbf{d}_1$. Then, the binary “loves” relation is represented by the rank 3 tensor:

$\mathbf{T}^{loves} = \top \otimes \mathbf{d}_1 \otimes \mathbf{d}_1 + \top \otimes \mathbf{d}_1 \otimes \mathbf{d}_2 + \top \otimes \mathbf{d}_2 \otimes \mathbf{d}_2 + \bot \otimes \mathbf{d}_2 \otimes \mathbf{d}_1$

Negation is modeled as the swap matrix:

$\mathbf{T}^\neg = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$

and, because they are binary, the other logical operators can be modeled as rank 3 tensor. For simple (non-embedded) quantification, one first turns predicates into vectors (with domain objects satisfying the predicate having a 1 and the other objects having a 0 in the vector) and checks if $\mathbf{X} = min(\mathbf{X},\mathbf{Y})$ for “All Xs are Ys” and checks if $|\mathbf{X}|>0$ for “There exists X”, with $\mathbf{X}$ and $\mathbf{Y}$ representing the vectorized representations of the predicates X and Y, respectively.

One can improve upon the representations of domain objects and predicates in this scheme by learning low-dimensional embeddings for them. This approach is proposed in Rocktaeschel et al. (2014). The idea here is to minimize an objective like the following:

$\displaystyle \min_{[d], [P]\; \forall d \in \mathcal{D}, \; \forall P \in \mathcal{P}} \sum_{\mathcal{F} \in K} || [\mathcal{F}]-\top ||_2$

where $[d]$ and $[P]$ are low-dimensional embeddings of the domain objects and predicates and $\mathcal{F}$ ranges over the facts in a given knowledge base $K$ that we want to capture. One can easily minimize this objective by stochastic gradient descent.

The advantage of such low-dimensional embeddings is that they enable generalization of inferences made about domain objects or predicates to similar objects or predicates. They are also more efficient than the potentially very high-dimensional one-hot type representations.

### Constraining neural networks

Fully-connected layers with arbitrary connectivity are commonly used in neural networks (both feed-forward and recurrent), but is the full power of an unconstrained, all-to-all connectivity matrix really needed for the network to perform well? A lot of empirical evidence suggests that trained neural networks are highly compressible, suggesting that unconstrained connectivity may not be necessary. If we are going to do away with unconstrained connectivity matrices, what should we replace them with? Here are a few recent suggestions from the literature (you can check out the original papers for detailed performance comparisons, but the basic message is that these methods reduce computational or memory requirements of neural networks, or improve their training, with minimal performance loss):

1) Instead of an unconstrained, arbitrary $N\times D$ matrix $\mathbf{W}$, Yang et al. (2014) suggest matrices of the following form (called the adaptive fastfood transform):

$\mathbf{W} = \mathbf{SHG\Pi H B}$                                                                        (*)

where $\mathbf{S}$, $\mathbf{G}$, $\mathbf{B}$ are diagonal matrices, $\mathbf{\Pi}$ is a random permutation matrix, and $\mathbf{H}$ is the Hadamard matrix. Here $\mathbf{\Pi}$ and $\mathbf{H}$ are fixed, so one only learns the diagonal matrices $\mathbf{S}$, $\mathbf{G}$, $\mathbf{B}$. This decomposition has $O(N)$ parameters and requires $O(N\log D)$ operations to compute, as opposed to the much less efficient $O(ND)$ parameters and $O(ND)$ operations required for an unconstrained matrix. The particular matrix structure in (*) was introduced by Le et al. (2013) in earlier work, but in their case $\mathbf{S}$, $\mathbf{G}$, $\mathbf{B}$ were set randomly and left untrained. In that work, they showed that feature maps computed using this random non-adaptive transformation correspond to the Gaussian RBF kernel in expectation, or more precisely $\mathbf{E}_{S,G,B,\Pi}[\overline{\phi(x)}^\top\phi(x^{\prime})] = \exp(-\frac{|| x-x^{\prime}||}{2\sigma^2})$ where the features are $\phi_j(x) = \frac{1}{\sqrt{n}}\exp(i [\mathbf{W}x]_j)$ with $\mathbf{W}$ as in (*).

2) Moczulski et al. (2015) propose connectivity matrices of the following form:

$\mathbf{W} = \prod_{k=1}^K \mathbf{A}_k \mathbf{F} \mathbf{D}_k \mathbf{F}^{-1}$                                                                      (**)

where $\mathbf{A}_k$ and $\mathbf{D}_k$ are diagonal matrices and $\mathbf{F}$ is the discrete Fourier transform matrix. This has $O(KN)$ parameters and requires $O(KN\log N)$ operations to compute, which is an improvement over the $O(N^2)$ parameters and $O(N^2)$ operations required to compute a fully-connected unconstrained linear layer (with identical input and output dimensions, $N$), assuming of course $K$ is not $O(N)$. In the real version they ultimately prefer, $\mathbf{F}$ is replaced by the discrete cosine transform matrix (of type II) $\mathbf{C}$.

A justifiable worry at this point is whether we are losing a lot of expressive power when we constrain our connectivity matrices in this way. The authors mention a theoretical guarantee assuring that almost all matrices $\mathbf{M} \in \mathbb{C}^{N \times N}$ can be factored as $\mathbf{M} = [ \prod_{i=1}^{N-1} \mathbf{D}_{2i-1} \mathbf{R}_{2i} ] \mathbf{D}_{2N-1}$ with $\mathbf{D}_{2i-1}$ diagonal and $\mathbf{R}_{2i}$ circulant, and this is precisely in the form (**) above. However, I’m not sure how useful this guarantee is, because $K$ is $O(N)$ in this result, hence we lose our computational savings in this limit.

3) A very nice idea has recently been proposed for constraining the connectivity of a recurrent neural network: Arjovsky et al. (2015) propose constraining the connectivity to be unitary (orthogonal in the real case). The advantage of a unitary connectivity matrix is that it exactly preserves the norm of a vector it is applied to, hence avoiding the vanishing or exploding gradient problems in recurrent neural network training. What particular unitary structure should we choose for the connectivity matrix? They suggest the following structure:

$\mathbf{W} = \mathbf{D}_3 \mathbf{R}_2 \mathbf{F}^{-1} \mathbf{D}_2 \mathbf{\Pi} \mathbf{R}_1 \mathbf{F} \mathbf{D}_1$                                     (***)

where $\mathbf{D}_1$, $\mathbf{D}_2$, $\mathbf{D}_3$ are diagonal matrices, $\mathbf{R}_1$, $\mathbf{R}_2$ are reflection matrices, $\mathbf{\Pi}$ is a permutation matrix and $\mathbf{F}$ is the discrete Fourier transform matrix. The authors mention that this particular structure was based on trial and error essentially. It is important to note that all the matrices in (***), except for the permutation matrix, have complex-valued entries.

In summary, constraints are generally useful for reducing the complexity of neural networks (think convnets), making training easier and more efficient, and making the models more interpretable. However, finding the right set of constraints to impose on neural networks is not always easy: imposing too many constraints or the wrong set of constraints might unnecessarily limit the representational power of the network, imposing too few constraints might reduce the benefits that would otherwise be obtained with the right set of constraints.

### Large deviations

Here is a nice theorem about Gaussian tails. Suppose $Y_n = \sum_{i=1}^n X_{i,n}$ where $X_{i,n}$ are mutually independent random variables with mean $\mathrm{E}[X_{i,n}] = 0$ and variance $\mathrm{Var}[X_{i,n}] = \sigma_{i,n}^2$. Denote $\mathrm{Var}[Y_n] = \sigma_n^2 = \sum_{i=1}^n \sigma_{i,n}^2$. We consider the probability $\mathrm{P}[Y_n \geq a_n \sigma_n]$ as $\lim_{n\rightarrow \infty} \lambda_n \equiv \frac{a_n}{\sigma_n} = 0$, that is, $\sigma_n$ goes to infinity faster than $a_n$. Assume further that:

$\mathrm{E}[\exp(\lambda_n X_{i,n})] \leq \exp((1+o(1)) \lambda_n^2 \sigma_{i,n}^2 / 2)$

A simple sufficient condition for this last equation to hold is that the $X_{i,n}$ be bounded, i.e. that there be a constant $K$ such that $|X_{i,n}|\leq K$ for all $i$ and $n$. Then the following theorem holds (the proof is elementary and can be found in this book):

Theorem: $\mathrm{P}[Y_n \geq a_n \sigma_n] \leq \exp(- (1+o(1))a_n^2 / 2)$

This is exactly the tail behavior of a Gaussian random variable. Hence, this theorem shows that under somewhat general conditions, tails of sums of independent random variables asymptotically behave like Gaussian tails.

### Impressive use of the determinant

A spanning tree of a graph $G$ is a connected acyclic subgraph of $G$ that has the same vertex set as $G$, i.e. it is a tree that covers all the vertices of $G$. For a given graph $G$, how many spanning trees are there?

Take the Laplacian of the graph $G$, remove the last column and the last row of the Laplacian, call this new matrix $L^{-}$. Now, the amazing result is that the number of spanning trees of $G$ is exactly given by the determinant of $L^{-}$. This result is known as the Kirchoff’s matrix tree theorem. Its proof is non-obvious (as you might have guessed) and can be found here, together with some other examples of impressive uses of the determinant.

### Differential geometry notes 7

Parametric representation of a point set M in three-dimensional Euclidean space:

$\mathbf{x}(u^1,u^2) = \big( x_1(u^1,u^2), x_2(u^1,u^2),x_3(u^1,u^2) \big)$

where $u^1, u^2$ are real-valued variables defined in a simply-connected, bounded domain $B$.

We make the following assumptions about the parametric representations:

(1) $\mathbf{x}(u^1,u^2)$ is of class $r \geq 1$ in B. Each point of $M$ corresponds to just one ordered pair $(u^1,u^2)$ in $B$.

(2) The Jacobian is of rank 2 in $B$:

$J$ $= \Large \begin{pmatrix} \frac{\partial x_1}{\partial u_1} & \frac{\partial x_1}{\partial u_2} \\ \frac{\partial x_2}{\partial u_1} & \frac{\partial x_2}{\partial u_2} \\ \frac{\partial x_3}{\partial u_1} & \frac{\partial x_3}{\partial u_2} \end{pmatrix}$

A representation satisfying these assumptions is called an allowable representation.

Partial derivatives will be denoted with the following notation:

$\Large \mathbf{x}_{\alpha} = \frac{\partial \mathbf{x}}{\partial u^{\alpha}}$ and $\Large \mathbf{x}_{\alpha \beta} = \frac{\partial^2 \mathbf{x}}{\partial u^{\alpha} \partial u^{\beta}}$.

If $J$ is of rank $R<2$ for a particular $(u^1,u^2)$, the corresponding point is called a singular point with respect to the representation and it is called singular if it is singular with respect to every allowable representation.

By imposing a transformation $u^{\alpha} = u^{\alpha}(\bar{u}^1,\bar{u}^2)$ (with $\alpha = 1, 2$), we can obtain a new parametric representation $\mathbf{x}(\bar{u}^1,\bar{u}^2)$ of $M$. The transformation has to satisfy the following assumptions:

(1) The domain $\bar{B}$ of the transformation includes $B$.

(2) $u^{\alpha}$ are of class $r\geq 1$ everywhere in $\bar{B}$ and are one-to-one.

(3) The Jacobian $D$ is different from zero everywhere in $\bar{B}$:

$D$ $= \Large \frac{\partial (u^1,u^2)}{\partial (\bar{u}^1,\bar{u}^2)} = \begin{vmatrix} \frac{\partial u^1}{\partial \bar{u}^1} & \frac{\partial u^1}{\partial \bar{u}^2} \\ \frac{\partial u^2}{\partial \bar{u}^1} & \frac{\partial u^2}{\partial \bar{u}^2} \end{vmatrix}$

A transformation satisfying these assumptions is called an allowable coordinate transformation. As in the theory of curves, allowable coordinate transformations define equivalence classes of allowable parametric representations.

Definition 24.1: A point set $S$ in $\mathbb{R}^3$ which can be represented by the allowable representations of an equivalence class is called a portion of a surface. $u^1$ and $u^2$ are called the coordinates on $S$. The curves $u^1 = const$ and $u^2 = const$ are called the coordinate curves of the $u^1 u^2$ coordinate system.

Definition 24.2: A union $U$ of portions of surfaces is called a surface if every two portions $S$, $S^{\prime}$ of $U$ can be joined by finitely many portions $S = S_1, \ldots, S_n = S^{\prime}$ such that the intersection of two subsequent portions $S_i$, $S_{i+1}$ is a portion of a surface.

### Differential geometry notes 6

Natural equations of a curve: Is it possible to characterize a curve in a manner independent of the choice of coordinates, except for the position of the curve in space, that is, to within congruent transformations? For example, we have seen that $s$, $\kappa$ and $\tau$ are quantities of this sort (i.e. independent of the choice of coordinates). The following theorem, called the fundamental theorem of the theory of curves, answers this question.

Theorem 20.1: Let $\kappa(s)$ and $\tau(s)$ be continuous functions of a real variable $s$ defined in an interval $I: 0\leq s \leq a$. Then there exists one and only one arc $\mathbf{x}(s)$ of a curve, determined up to a direct congruent transformation, with arc length $s$ whose curvature and torsion are given by the functions $\kappa(s)$ and $\tau(s)$.

Involutes and evolutes: Tangents to a curve $\mathbf{x}(s)$ generate a surface called the tangent surface. The tangent surface can be represented as:

$\mathbf{y}(s,u) = \mathbf{x}(s) + u \mathbf{t}(s)$

Involutes of a curve are curves on the tangent surface that are orthogonal to the generating tangents. Thus the involutes can be represented as:

$\mathbf{z}(s) = \mathbf{x}(s) + u(s)\mathbf{t}(s)$

where $u(s)$ is chosen such that the orthogonality condition is satisfied, i.e. $\mathbf{\dot{z}} \cdot \mathbf{t} = 0$. Because:

$\dot{\mathbf{z}} = \mathbf{t} + \frac{d}{ds}(u \mathbf{t})$ $= \mathbf{t} + u \kappa \mathbf{p} + \dot{u} \mathbf{t}$

it follows that $1 + \dot{u}=0$ or $u(s) = c-s$ where $c$ is constant. Thus, the involutes can be expressed as:

$\mathbf{z}(s) = \mathbf{x}(s) + (c-s) \mathbf{t}(s)$

If $C$ is an involute of $C^*$, then $C^*$ is called an evolute of $C$. It can be shown that the evolutes of a curve $\mathbf{x}(s)$ can be represented in the form:

$\mathbf{y}(s) = \mathbf{x}(s) + \rho(s) \big[ \mathbf{p}(s) + \mathbf{b}(s) \cot \alpha(s) \big]$

where $\alpha(s) = \int_0^s \tau(\sigma) d\sigma + k^*$ and $k^*$ is a constant.