Posted on: 2018-07-04, in Category: research, tags: dropout regularization machine_learning neural_networks

Dropout and Regularization

Dropout

The first mention of dropout is in (Hinton et al. 2012), which frames it as a way of preventing co-adaptation of feature detectors in neural networks. Dropout was applied successfully in (Krizhevsky et al. 2012), after which it gained widespread popularity. It was shown to be effective in recurrent neural networks for the first time in (Zaremba et al. 2014).

Biological and Historical Context

Historically, pruning was an effective way to prevent overfitting in neural networks (LeCun et al. 1990; Hassibi and Stork 1993). These methods used ideas from perturbation theory: they estimate how much the error changes when a weight is removed, using second-order derivatives (the Hessian), and prune the weights that change it least.
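
To make the perturbation-theory argument concrete, here is roughly how Optimal Brain Damage reasons about it: expand the change in training error to second order around the trained weights, assume training has converged (so the gradient term vanishes) and the Hessian is approximately diagonal, and the cost of deleting weight $w_i$ collapses to a simple saliency score:

$\delta E \approx \sum_i g_i\,\delta w_i + \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2 + \ldots \qquad\Longrightarrow\qquad s_i = \frac{1}{2} h_{ii} w_i^2 \quad (\text{setting } g_i \approx 0,\ \delta w_i = -w_i)$

Weights with the smallest saliency are pruned first; Optimal Brain Surgeon (Hassibi and Stork 1993) drops the diagonal assumption and works with the full inverse Hessian.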

Biological motivation for dropout and other sparsity-inducing methods can be found in human brain development (Bear et al. 2020, 704–707), where a large fraction of neurons and synaptic connections is eliminated as the brain matures; the process is known as programmed cell death.

There are also theories and evidence of a correlation between sparsity in the brain and intelligence or expertise (Hänggi et al. 2014; brogliato2014sparse; Genç et al. 2018).

Dropout for Regularization in Deep Neural Networks

A decent introduction to regularization theory can be found in (Haykin 2009, ch. 7). A tutorial with a more linear-algebraic formulation, instead of the function-analytic perspective, is (Neumaier 1998). A thorough introduction to the general theory of regularization can be found in (Engl et al. 1996). A simpler approach to regularization, in terms of controlling the curvature of the fitted function, can be seen in the theory of additive models; a good introductory text for that is (Wood 2006).

The paper we had discussed was (Wang and Manning 2013). The equivalence of dropout to L2 regularization can be seen in (Srivastava et al. 2014). As we'd discussed, all norm penalties are some form of Tikhonov regularization.
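
To spell that out, the generic Tikhonov formulation adds a quadratic penalty through a matrix $\Gamma$, and with $\Gamma = I$ it reduces to the familiar L2 weight decay:

$\min_{w}\ \lVert y - Xw\rVert_2^2 + \lambda \lVert \Gamma w\rVert_2^2 \qquad\xrightarrow{\ \Gamma = I\ }\qquad \min_{w}\ \lVert y - Xw\rVert_2^2 + \lambda \lVert w\rVert_2^2$

Other choices of $\Gamma$ (e.g. a difference operator) penalize curvature rather than magnitude, which is the additive-models view mentioned above.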

(Wang and Manning 2013) used a Gaussian approximation to the Bernoulli dropout noise to compute the updates $\Delta w$ analytically, thereby speeding up training. However, in practice their method doesn't scale well to deeper networks. Gal and Ghahramani have developed a sounder model with a Gaussian process approximation (Gal and Ghahramani 2015). I'll try to read that soon.
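
As a rough sketch of the idea (not their implementation), the Bernoulli-masked pre-activation of a unit, $\sum_i m_i w_i x_i$ with $m_i \sim \mathrm{Bernoulli}(p)$, can be replaced by a Gaussian with the same first two moments, so the dropout noise is integrated out instead of sampled:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_preactivations(w, x, p, n_samples=100_000):
    """Monte Carlo reference: sample Bernoulli keep-masks and compute sum_i m_i w_i x_i."""
    masks = rng.random((n_samples, x.size)) < p       # keep each input with probability p
    return masks @ (w * x)                            # one pre-activation per sampled mask

def gaussian_approx_preactivations(w, x, p, n_samples=100_000):
    """Fast-dropout style: moment-matched Gaussian in place of the Bernoulli sum."""
    mean = p * np.sum(w * x)                          # E[sum_i m_i w_i x_i]
    var = p * (1 - p) * np.sum((w * x) ** 2)          # Var[...] for independent masks
    return rng.normal(mean, np.sqrt(var), size=n_samples)

w, x = rng.normal(size=200), rng.normal(size=200)
mc = mc_dropout_preactivations(w, x, 0.5)
ga = gaussian_approx_preactivations(w, x, 0.5)
print(mc.mean(), ga.mean())   # means should agree closely
print(mc.std(), ga.std())     # and so should the standard deviations
```

As far as I understand, Wang and Manning then push this Gaussian through the nonlinearity analytically (or with a few samples), which is where the speedup over sampling many dropout masks comes from.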

Noise and Regularization

The addition of noise to input data was proven equivalent to Tikhonov regularization by Bishop (Bishop 1995). An interesting article that instead adds noise to the gradients is (Neelakantan et al. 2015). Similar perturbations around local minima have long been a common technique for escaping poor solutions found by greedy methods.
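
A minimal sketch of the gradient-noise trick, assuming a plain SGD loop; if I recall correctly the paper anneals the noise variance as $\sigma_t^2 = \eta/(1+t)^\gamma$, which is what the schedule below mimics (the function and parameter names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_with_gradient_noise(w, grad_fn, lr=0.1, eta=0.3, gamma=0.55, steps=1000):
    """SGD where zero-mean Gaussian noise with a decaying variance is added to each gradient."""
    for t in range(steps):
        g = grad_fn(w)
        sigma = np.sqrt(eta / (1.0 + t) ** gamma)    # annealed noise scale
        g_noisy = g + rng.normal(0.0, sigma, size=g.shape)
        w = w - lr * g_noisy
    return w

# Toy usage: noisy gradient descent on a simple quadratic bowl.
grad = lambda w: 2.0 * w
print(sgd_with_gradient_noise(np.ones(5), grad))
```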

Random subspace search isn't usually seen as a similar method, since it is more of a feature selection technique, but it too is an effective regularizer. First discussed in (Ho 1998), the idea is to build an ensemble of decision trees, each trained on a random subset of the features. Breiman later combined it with bootstrap aggregation (bagging) to create Random Forests (Breiman 2001).
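
A small sketch of the random subspace method, assuming scikit-learn's DecisionTreeClassifier as the base learner; every tree only ever sees its own random subset of the features, and predictions are combined by majority vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

class RandomSubspaceEnsemble:
    """Each tree is trained on a random subset of the features (Ho 1998 style sketch)."""

    def __init__(self, n_trees=25, n_sub_features=5):
        self.n_trees = n_trees
        self.n_sub_features = n_sub_features
        self.members = []                      # list of (feature_indices, fitted_tree)

    def fit(self, X, y):
        for _ in range(self.n_trees):
            feats = rng.choice(X.shape[1], self.n_sub_features, replace=False)
            tree = DecisionTreeClassifier().fit(X[:, feats], y)
            self.members.append((feats, tree))
        return self

    def predict(self, X):
        votes = np.array([tree.predict(X[:, feats]) for feats, tree in self.members])
        return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote for 0/1 labels

# Usage on toy binary data with 20 features.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomSubspaceEnsemble().fit(X, y)
print(model.predict(X[:5]))
```

Random Forests additionally bootstrap the training rows and re-draw the candidate feature subset at every split rather than once per tree.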

$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$

Since for dropout the expected pre-activation over an input vector and the corresponding weights reduces to $\sum_i p_i x_i w_i$, dropout can be seen as either zeroing out $x_i$ according to a mask drawn from a Bernoulli distribution with keep probability $p_i$, or equivalently zeroing out $w_i$. The relation to random subspace methods follows immediately.
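
A quick numerical check of that equivalence, under the usual single-linear-unit view: masking the inputs and masking the corresponding weights with the same Bernoulli draw give the same pre-activation, and its expectation is $p\sum_i x_i w_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=10)      # input vector
w = rng.normal(size=10)      # corresponding weights
p = 0.8                      # keep probability

mask = (rng.random(10) < p).astype(float)     # Bernoulli(p) keep-mask

drop_inputs  = np.dot(w, mask * x)            # zero out inputs x_i
drop_weights = np.dot(mask * w, x)            # zero out weights w_i
print(np.isclose(drop_inputs, drop_weights))  # True: the two views coincide

# In expectation the pre-activation is p * sum_i x_i w_i, which is why
# inference-time dropout simply rescales (or training uses mask / p).
mc = np.mean([np.dot(w, (rng.random(10) < p) * x) for _ in range(100_000)])
print(p * np.dot(w, x), mc)                   # should be close
```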

Random projections (Candes and Tao 2006) are a different idea that sounds similar, but further reading is required on that.

References

[1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[4] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[5] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural Information Processing Systems, 1993, pp. 164–171.
[6] M. Bear, B. Connors, and M. A. Paradiso, Neuroscience: Exploring the Brain. Jones & Bartlett Learning, 2020, pp. 704–707.
[7] J. Hänggi, K. Brütsch, A. M. Siegel, and L. Jäncke, “The architecture of the chess player's brain,” Neuropsychologia, vol. 62, pp. 152–162, 2014.
[8] E. Genç, C. Fraenz, C. Schlüter, P. Friedrich, R. Hossiep, M. C. Voelkle, J. M. Ling, O. Güntürkün, and R. E. Jung, “Diffusion markers of dendritic density and arborization in gray matter predict differences in intelligence,” Nature Communications, vol. 9, no. 1, p. 1905, 2018.
[9] S. S. Haykin, Neural Networks and Learning Machines, vol. 3. Upper Saddle River, NJ, USA: Pearson, 2009.
[10] A. Neumaier, “Solving ill-conditioned and singular linear systems: A tutorial on regularization,” SIAM Review, vol. 40, no. 3, pp. 636–666, 1998.
[11] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, vol. 375. Springer Science & Business Media, 1996.
[12] S. Wood, Generalized Additive Models: An Introduction with R. CRC Press, 2006.
[13] S. Wang and C. Manning, “Fast dropout training,” in International Conference on Machine Learning, 2013, pp. 118–126.
[14] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[15] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Insights and applications,” in ICML Workshop, 2015.
[16] C. M. Bishop, “Training with noise is equivalent to Tikhonov regularization,” Neural Computation, vol. 7, no. 1, pp. 108–116, 1995.
[17] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, “Adding gradient noise improves learning for very deep networks,” arXiv preprint arXiv:1511.06807, 2015.
[18] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, 1998.
[19] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[20] E. J. Candes and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.