Overparameterization for SGD under Deep Learning Practice

In this project, we address the fundamental question of “How much should we overparameterize a neural network (NN)?” with a focus on generalization and common practices in deep learning (DL) such as SGD, nonsmooth activations, and implicit/explicit regularization. For smooth activations and gradient descent, we have established the current best scaling of the number of parameters for fully trained shallow NNs under standard initialization schemes [1].

To address the grand challenges of DL, we need to understand its fundamental limits and design the underlying architecture, i.e., the NN over which the learning algorithm is applied. Our recent research has established the current best scaling of the number of parameters for fully trained shallow NNs [1].

First-order methods with random initialization can consistently find a global minimum, even with randomized labels [2]. Demystifying this observation is of central interest to deep learning. Recently, a line of research [1, 3–5] has suggested that this empirical success can be explained by the overparameterization of neural networks, whose number of parameters exceeds the number of training samples n. In particular, gradient descent converges linearly to a global optimum in a number of problems with models that have wide hidden layers [3–5]. Despite these remarkable results, the key question “How much should we overparameterize a neural network?” remains open even for the toy example of two-layer neural networks. On the one hand, it is widely accepted that, for two-layer neural networks, the number of parameters should grow linearly with n (e.g., [5]). On the other hand, theoretical results either require many more parameters or are established under restrictive settings.
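
As a purely illustrative picture of this regime, the short PyTorch sketch below fits n randomly labeled points with an overparameterized two-layer ReLU network trained by full-batch gradient descent; the hidden width m, and hence the parameter count, is exactly the quantity whose required scaling with n is in question. The dimensions, width, learning rate, and step count are arbitrary choices for illustration, not the settings analyzed in [1, 3–5].

    # Illustrative sketch only: fit n randomly labeled points with an
    # overparameterized two-layer ReLU network via full-batch gradient descent.
    import torch

    torch.manual_seed(0)
    n, d, m = 100, 10, 2000                          # samples, input dim, hidden width (m*d >> n)
    X = torch.randn(n, d)
    y = 2 * torch.randint(0, 2, (n, 1)).float() - 1  # randomized +/-1 labels, no signal

    model = torch.nn.Sequential(                     # shallow NN with standard (default) initialization
        torch.nn.Linear(d, m),
        torch.nn.ReLU(),
        torch.nn.Linear(m, 1),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)  # full batch => plain gradient descent
    loss_fn = torch.nn.MSELoss()

    for step in range(2000):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        if step % 500 == 0:
            print(f"step {step}: training loss {loss.item():.4f}")

    # With sufficient width, the training loss is typically driven close to zero
    # even though the labels carry no signal, echoing the observation in [2].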

While the overparameterization of NNs has been studied extensively in terms of convergence (optimization error) [1–5], their generalization to unseen examples remains an open problem. The generalization error depends on the design of the learning algorithm and the characteristics of the loss function. Unlike generalization analyses based on the complexity of the hypothesis class, stability analysis focuses on how a learning algorithm explores the hypothesis class [6]. The two main stability notions in the literature are uniform stability and on-average stability [6–9]. Using on-average algorithmic stability, we determine how many parameters are sufficient for SGD to achieve a sufficiently small generalization error for NNs under standard initialization schemes.
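
To make the stability route concrete, recall the standard replace-one identity linking on-average stability to generalization (the notation below is generic, in the spirit of [8, 9], and not taken verbatim from [1] or [6]): let S = (z_1, ..., z_n) be i.i.d. training examples, let S^{(i)} denote S with z_i replaced by an independent copy, let A(S) be the (randomized) output of SGD, and let F and F_S be the population and empirical risks of a loss \ell. Then

    \mathbb{E}\bigl[F(A(S)) - F_S(A(S))\bigr]
      \;=\; \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\Bigl(\ell\bigl(A(S^{(i)}); z_i\bigr) - \ell\bigl(A(S); z_i\bigr)\Bigr)\right].

The right-hand side is the on-average stability of SGD; bounding it in terms of the width (number of parameters) of the NN under standard initialization is the type of guarantee pursued in this project.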

This project is available to a master's student with a strong theoretical background. If interested, please contact Associate Professor Ali Ramezani-Kebrya for details. Students should be familiar with optimization, probability, and concentration inequalities.

[1] Chaehwan Song*, Ali Ramezani-Kebrya*, Thomas Pethick, Armin Eftekhari, and Volkan Cevher. Subquadratic overparameterization for shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[2] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

[3] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning (ICML), 2018.

[4] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2019.

[5] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1:84–105, 2020.

[6] Ali Ramezani-Kebrya, Ashish Khisti, and Ben Liang. On the generalization of stochastic gradient descent with momentum. arXiv preprint arXiv:1809.04564, 2021.

[7] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research (JMLR), 2:499–526, 2002.

[8] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research (JMLR), 11:2635–2670, 2010.

[9] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), 2016.

Keywords: understanding deep learning, generalization, neural networks

Supervisor(s)

Ali Ramezani-Kebrya

Scope (credits)

60