In my last post, I mentioned meta-learning as a promising approach for learning useful inductive biases in neural networks. In meta-learning, several steps of a standard optimization problem are unrolled and “baked” into the computation graph. One then performs backpropagation over the entire computation graph to optimize particular parameters or hyper-parameters of interest, for example, hyper-parameters controlling the learning rate or momentum schedules or the initial parameters of the network.
One potential problem with this approach is that it becomes computationally prohibitive to unroll, and bake into the computation graph, more than a relatively small number of optimization steps, compared to the number of optimization steps or iterations one would normally perform for training the model. One might then worry that parameters or hyper-parameters optimized over this shorter horizon might not generalize well to longer horizons (the longer horizon problem is ultimately the problem we are interested in).
A new ICLR paper by Yuhuai Wu, Mengye Ren and colleagues argues that such worries are justified, at least for a certain class of meta-learning problems. They consider the problem of meta-learning the learning rate and momentum schedules in neural network training. In the first part of the paper, they approach this problem analytically by considering a noisy quadratic loss function. For this particular loss function, it turns out that one can estimate both the optimal learning rate and momentum schedules (using Theorem 1 in the paper), as well as the “greedy-optimal” learning rate and momentum schedules that minimize the one-step ahead loss function at each time step (using Theorem 2). The authors show that although the greedy-optimal strategy outperforms the optimal schedule initially during training, its final performance is substantially worse than that of the optimal strategy. The reason appears to be that the greedy-optimal strategy reduces the learning rate prematurely to be able to decrease the loss along high-curvature directions in the loss landscape quickly enough.
In the second part of the paper, a gradient-based hyper-parameter meta-learning method is shown to suffer from a similar short-horizon bias where meta-learning of the hyper-parameters controlling the learning rate schedule (i.e. the initial learning learning rate and a hyper-parameter controlling the learning rate decay over time) with a small number of unrolled optimization steps, i.e. over a short horizon, again leads to learning rates that are reduced much too quickly compared to the optimal schedule, resulting in severely suboptimal final performances.
Here are some thoughts and questions inspired by this work:
- In the inner loop of their meta-learning problems, the authors only consider vanilla SGD with momentum. It is a bit unfortunate that they don’t consider the Adam algorithm (or similar algorithms) in the inner loop. I suspect that using Adam in the inner loop might ameliorate the short-horizon bias significantly, because Adam is supposed to be less sensitive to hyper-parameter choices (one might then object, with some justification, that this would be a case where meta-learning is not really needed).
- Because the bias reported in the paper seems to be consistent, i.e. always in the direction of reduced learning rates, a quick fix one could imagine would be to introduce a prior that would counteract this bias by, for example, discouraging small learning rates. I’m not quite sure what the best prior to use would be in this context and I don’t know how well this would work in practice compared to the optimal learning rate schedules, but it seems like a straightforward fix that could be worth trying.
- As the authors note, the short-horizon bias becomes less significant (or might even disappear) when the optimization problem is either deterministic or well-conditioned. We have previously argued that skip connections reduce degeneracies in neural networks, thereby making the optimization problem better-conditioned. It’s likely that other methods such as batch normalization have similar effects. This suggests that more realistic architectures using skip connections and batch normalization might be less prone to the short-horizon problem uncovered in this paper. It would be interesting to investigate this.
- It would also be interesting to see if other kinds of meta-learning problems, for example, meta-learning a good initialization for the model parameters (as in the MAML algorithm) suffer from similar short-horizon biases.