ICML 2020: My Visits and Understandings

Paper ID: 5664 - Analyzing Neural Network Architecture on Training Performance. (Google)

Presentation: ICML 2020 (virtual)
Paper - https://proceedings.icml.cc/static/paper_files/icml/2020/5664-Paper.pdf
“This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.”

Authors: Karthik Abinav Sankararaman (University of Maryland); Soham De (DeepMind)*; Zheng Xu (University of Maryland); W. Ronny Huang (University of Maryland); Tom Goldstein (University of Maryland, College Park)

Using SGD on today's neural networks, the variance of the stochastic gradients near the minimizer is small, which leads to fast convergence (standard convergence theory). Two questions follow: 1. Why is constant-learning-rate SGD effective on neural network models? 2. How do the network architecture and initialization affect this?

To answer these questions:

1. They analysed a condition called gradient confusion that affects the convergence of SGD.
2. This is a simple measure of the amount of disagreement between two stochastic gradients.
3. Gradient confusion helps establish relationships between network depth, layer width and training performance.
4. They formulate gradient confusion via a bound η (eta). For an input i, the stochastic gradient of the loss is computed; the same is done for another input j; the inner product of the two gradients then says how much they agree on which direction to move the parameters in an SGD update. So the gradient confusion measure captures how much stochastic gradients from different training samples disagree with each other (this is related to, but not the same as, gradient variance). For example, if the inputs to a linear network are orthogonal, i.e. x_i^T x_j = 0, then the gradient confusion is also 0, which tells us that the SGD update for example i does not interfere with the update for example j, and so we would expect fast convergence. When gradient confusion is small, SGD converges fast; faster convergence rates can also be obtained by assuming that the variance of the gradients is bounded. (A minimal sketch of this measure is given right after this list.)
5. Working through the analysis, their conclusions are:

  • Training gets harder with increased depth (higher gradient confusion)
  • Training gets easier with increased width (lower gradient confusion)
    Previous results suggest the following for training deep nets without increasing the width:
  • Orthogonal weight initializations (for linear networks)
  • Residual connections and batch normalization (recent work also suggests that the combination of batch norm and skip connections can help train very deep networks)
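
The gradient confusion measure from point 4 boils down to checking the inner product between the stochastic gradients of two individual samples. Below is a minimal sketch of that check in JAX; this is my own illustration rather than the authors' code, and the tiny MLP, the initialization and the two samples are all made-up placeholders.

```python
import jax
import jax.numpy as jnp

# Illustrative two-layer MLP; all shapes, data and names are made up for this sketch.
def predict(params, x):
    w1, b1, w2, b2 = params
    h = jax.nn.relu(x @ w1 + b1)
    return h @ w2 + b2

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
width = 64
params = (
    jax.random.normal(k1, (10, width)) / jnp.sqrt(10.0),  # standard 1/sqrt(fan_in) init
    jnp.zeros(width),
    jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
    jnp.zeros(1),
)

# Two individual training samples i and j.
x_i, y_i = jax.random.normal(k3, (1, 10)), jnp.ones((1, 1))
x_j, y_j = jax.random.normal(k4, (1, 10)), -jnp.ones((1, 1))

# Per-sample stochastic gradients of the loss w.r.t. all parameters.
g_i = jax.grad(loss)(params, x_i, y_i)
g_j = jax.grad(loss)(params, x_j, y_j)

def flatten(tree):
    """Concatenate all parameter gradients into one flat vector."""
    return jnp.concatenate([jnp.ravel(leaf) for leaf in jax.tree_util.tree_leaves(tree)])

# The gradient confusion bound eta asks that <g_i, g_j> >= -eta for all pairs:
# a strongly negative inner product means the two samples pull the parameters
# in conflicting directions, which slows constant-step-size SGD down.
inner = jnp.dot(flatten(g_i), flatten(g_j))
print("inner product of the two per-sample gradients:", float(inner))
```

If the two inputs were orthogonal in a linear model, this inner product would be exactly zero, matching the orthogonal-inputs example in point 4.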

Read a similar line of work just after this (ICML 2020, ICML 2019):

1. Increasing the width of DNNs results in more regular behaviour and makes them easier to understand.
2. A number of recent results have shown that DNNs that are allowed to become infinitely wide converge to another, simpler class of models called Gaussian processes.
3. In this limit, complicated phenomena (like Bayesian inference or the gradient descent dynamics of a convolutional neural network) boil down to simple linear algebra equations.
4. Deep neural networks induce simple input/output maps as they become infinitely wide.
5. As the width of a neural network increases, the distribution of outputs over different random instantiations of the network becomes Gaussian.
6. In the work "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", the authors show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite-width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters (a small sketch of this idea follows).
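
The linearized model in point 6 is just a first-order Taylor expansion of the network around its initial parameters, f_lin(θ, x) = f(θ0, x) + J_θ f(θ0, x) · (θ − θ0), and jax.jvp computes the Jacobian-vector product term directly. The sketch below is my own illustration under assumed shapes and widths, with a small random perturbation standing in for a few SGD steps; it only checks that a wide network's output stays close to its linearization after a small parameter change, not the full training-dynamics result from the paper.

```python
import jax
import jax.numpy as jnp

# Illustrative one-hidden-layer network; all shapes and names are made up.
def predict(params, x):
    w1, b1, w2, b2 = params
    h = jax.nn.relu(x @ w1 + b1)
    return h @ w2 + b2

key = jax.random.PRNGKey(1)
k1, k2, k3, k4 = jax.random.split(key, 4)
width = 4096  # the wider the network, the closer it stays to its linearization
params0 = (
    jax.random.normal(k1, (8, width)) / jnp.sqrt(8.0),
    jnp.zeros(width),
    jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
    jnp.zeros(1),
)

x = jax.random.normal(k3, (16, 8))

def f_lin(params, x):
    """First-order Taylor expansion of the network around params0."""
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    f0, jvp_out = jax.jvp(lambda p: predict(p, x), (params0,), (delta,))
    return f0 + jvp_out

# A small random perturbation of the parameters stands in for a few SGD steps.
new_params = jax.tree_util.tree_map(
    lambda p, k: p + 1e-2 * jax.random.normal(k, p.shape),
    params0,
    tuple(jax.random.split(k4, 4)),
)

# For a wide network, the linearization error is small compared with the
# size of the output change caused by the parameter update itself.
change = jnp.max(jnp.abs(predict(new_params, x) - predict(params0, x)))
err = jnp.max(jnp.abs(predict(new_params, x) - f_lin(new_params, x)))
print("max output change:", float(change), "| max linearization error:", float(err))
```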

Google Blog - ICML 2020: Google AI Blog
Unpublished ICML 2019: https://openreview.net/forum?id=SkGT6sRcFX

Talks and Tutorials Attended:

  1. Representation Learning Without Labels - Deepmind (Tutorial): https://icml.cc/virtual/2020/tutorial/5751 Slides: Local PC
  2. Adversarial Neural Pruning with Latent Vulnerability Suppression (Paper - Poster Session): https://icml.cc/virtual/2020/poster/5877
  3. Uncertainty and Robustness in DL - Website
  4. Why do adversarial examples feel like bugs? - Justin Gilmer (Google Brain) Slides and Talk
Written on July 15, 2020