### Deep learning can make more use of available data

This is just a short post on something I’ve been thinking about lately. The argument is often made that deep learning needs stronger, better priors, usually in the form of architectural improvements. I’m not necessarily against this idea, but in this post I’d like to make the complementary case that even with current architectures and training algorithms, deep learning can probably make more use of the available data, i.e. it can squeeze more juice out of the data it already has. Why do I think so, and how can deep learning achieve this? A couple of reasons:

- Argument from *cold posteriors*: in Bayesian neural networks, it has been empirically observed that the best predictive performance is obtained not with the actual posterior, but with “cold posteriors”, which correspond to artificially manipulated posteriors that overcount the effect of the data and undercount the effect of the (usually generic) prior. Conversely, this suggests that current techniques in deep learning may be undercounting the potential of the data, given that one has to resort to an artificial boosting of its effect in Bayesian neural networks.

- Argument from slow and brittle convergence to “natural” solutions: there is some interesting theoretical work suggesting that in some simplified problems, standard deep learning techniques converge to what I would consider the “natural” solutions, but the convergence is painfully slow and brittle. Let me give two examples. Soudry et al. (2018) show that in logistic regression with separable data, gradient descent converges to the max-margin solution (which can be considered the natural solution for this type of problem), but convergence is extremely slow, i.e. something like $O(1/\log t)$, and brittle in the sense that it doesn’t hold for some popular adaptive gradient descent algorithms like Adam. Ji & Telgarsky (2019) show a similar result for logistic regression problems with non-separable data, but the convergence here is again extremely slow, i.e. the rate of convergence *in direction* to the max-margin solution is again $O(1/\log t)$. On the other hand, it is clear that convergence to the max-margin solution in these problems can be significantly sped up with simple data-dependent initialization schemes. In a similar vein, some prior works have suggested that important generalization properties of neural networks, such as their ability to generalize compositionally, are extremely sensitive to initialization, again implying that starting from a data-agnostic, generic initialization may not be optimal.

- Argument from empirical Bayes: how can deep learning make more use of the available data? A straightforward idea is the one I mentioned in the previous point, i.e. using a data-dependent initialization scheme (I gave a simple example of this kind of scheme in a previous post). This approach is reminiscent of the empirical Bayes method in Bayesian statistics, which underlies a whole host of beautiful and surprising results like the Stein phenomenon. The basic idea in empirical Bayes is to refuse to assume a non-informative, generic prior for the variables of interest (for a neural network, these could be the parameters, for instance) and to instead estimate this prior from the data. You can see that this idea accords nicely with a data-dependent initialization scheme for neural network parameters. Empirical Bayes enjoys some appealing theoretical performance guarantees compared to common alternatives like maximum likelihood, which suggests that similar improvements may hold for data-dependent initialization schemes for neural networks as well.
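To make the cold-posterior point concrete, here is a minimal sketch in a tiny conjugate Normal–Normal model (not a Bayesian neural network): raising the likelihood to a power `lam > 1` overcounts the data relative to the prior, while `lam = 1` recovers the exact Bayesian posterior. All names and parameter values here are illustrative, not from any particular paper.

```python
import numpy as np

def tempered_posterior(x, mu0, sigma0, sigma, lam):
    """Normal prior N(mu0, sigma0^2), Normal likelihood with known
    noise scale sigma, and the likelihood raised to the power `lam`.
    lam > 1 boosts the data's effect; lam = 1 is the exact posterior."""
    n = len(x)
    prior_prec = 1.0 / sigma0**2
    lik_prec = lam * n / sigma**2   # tempering scales the data term only
    post_var = 1.0 / (prior_prec + lik_prec)
    post_mean = post_var * (prior_prec * mu0 + lik_prec * np.mean(x))
    return post_mean, post_var

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=20)   # data centered far from the prior mean 0
exact = tempered_posterior(x, mu0=0.0, sigma0=1.0, sigma=1.0, lam=1.0)
cold = tempered_posterior(x, mu0=0.0, sigma0=1.0, sigma=1.0, lam=5.0)
# The "cold" posterior mean sits closer to the sample mean than the exact
# posterior mean, and its variance is smaller: the data is overcounted.
```

The closed-form update makes the mechanism transparent: tempering multiplies the likelihood precision by `lam`, pulling the posterior mean toward the sample mean and shrinking the posterior variance.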
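As a toy illustration of the empirical Bayes idea (the classic Normal-means setting behind the Stein phenomenon, not a neural network), the sketch below estimates the prior variance from the data itself and shrinks the observations accordingly; over repeated trials this typically beats the maximum likelihood estimate in mean squared error. The constants are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials = 50, 200
tau2_true = 0.5   # prior variance, treated as unknown by the estimator

mse_mle, mse_eb = 0.0, 0.0
for _ in range(trials):
    theta = rng.normal(0.0, np.sqrt(tau2_true), size=d)  # theta_i ~ N(0, tau^2)
    x = theta + rng.normal(size=d)                        # X_i ~ N(theta_i, 1)
    # Empirical Bayes: since E[X_i^2] = tau^2 + 1, estimate the prior
    # variance from the data, then shrink toward the prior mean (0).
    tau2_hat = max((x @ x) / d - 1.0, 0.0)
    theta_eb = (tau2_hat / (tau2_hat + 1.0)) * x
    mse_mle += ((x - theta) ** 2).mean()      # MLE: theta_hat = x
    mse_eb += ((theta_eb - theta) ** 2).mean()
```

The analogy to data-dependent initialization: instead of committing to a fixed, generic prior (initialization), one cheap pass over the data picks a better one, and the gain shows up as a uniformly smaller risk.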