introduction

Neural networks are becoming larger and larger, using up to billions of parameters. Researchers have started to quantify the effort of training these large-scale models in $$$ amounts on cloud computing platforms, and even in tons of carbon emissions. The common understanding used to be that overparametrized networks have more capacity but are also more prone to overfitting the training data. Recent studies have shown that overparametrization, in fact, acts as a regularizer and leads to improved generalization performance that no other regularization technique can achieve [1].

After training, however, large parts of such large-scale models can be pruned away without harming the accuracy of the model. Pruning techniques date back to 1990 and LeCun et al.'s paper on optimal brain damage [2]. The motivation for pruning is to reduce the model size and, with it, the memory requirements, inference time, and energy consumption. One pruning technique is magnitude pruning, which prunes the weights with the lowest magnitude, and therefore the lowest effect on the network output [3].

Before the lottery ticket hypothesis (LTH) [4], the common experience was that pruned architectures were harder to train from scratch. Now, the LTH states that certain subnetworks can be trained to match or even outperform the accuracy of the original, unpruned network. The key idea is to iteratively train a network and prune its parameters until only a small fraction of the parameters remains. In each iteration, the surviving weights are reset to their values at initialization. The resulting subnetwork can then be trained in comparable time to match the accuracy of the original network. These subnetworks are referred to as winning tickets.

The Lottery Ticket Hypothesis. A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. [4]

In the LTH paper, the authors found winning tickets that are only 10-20% of the size of their dense counterparts. At moderate pruning levels, i.e., with more parameters remaining, the winning tickets even achieved higher test accuracies than the original networks.

Why is this important? The LTH suggests that it would not be necessary to train a full model if we could identify winning tickets early in the training process. If this were possible, it could save us wallets of $$$ and tons of carbon emissions.

thought experiment: sum of two inputs

To get an intuition for the LTH, let's consider the simple task of computing the sum of two inputs, $y = x_1 + x_2$. We want to approximate this ground truth with a two-layer, linear neural net with $h$ hidden units and no bias.

exemplary neural net with a winning ticket

For humans, a winning ticket for the sum of two inputs is easy to determine. Such a winning ticket would be $w^{(1)}_{k,1} = w^{(1)}_{k,2} = a$ and $w^{(2)}_{k} = 1/a$ for some hidden unit $k$ and some $a \neq 0$, with all remaining weights being zero. This winning ticket would even generalize outside of the training data domain, as it actually computes the real sum of its two inputs.

No matter how large we choose the hidden layer size $h$, our winning ticket will always consist of three nonzero weights. Thus, we can prune all but those three weights without harming accuracy. An alternative would be to pass each input value through its own hidden unit in the first layer and sum them up in the second layer, which would need four weights in total. When we start training with a mask for those three (or four) nonzero connections, the network will eventually learn the correct weights.
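
To make this concrete, here is a minimal sketch (assuming PyTorch; the hidden size and the constant $a$ are arbitrary illustrative choices) of such a hand-crafted three-weight winning ticket:

```python
import torch

h = 200          # hidden layer size; any value works
a = 0.5          # any nonzero constant works

# weight matrices of a two-layer linear net without bias: (2 -> h) and (h -> 1)
W1 = torch.zeros(h, 2)
W2 = torch.zeros(1, h)

# hand-crafted winning ticket: route both inputs through hidden unit 0
W1[0, 0] = a
W1[0, 1] = a
W2[0, 0] = 1.0 / a

x = torch.tensor([[3.0, 4.0]])   # inputs far outside any training range
y = x @ W1.t() @ W2.t()
print(y)                         # prints tensor([[7.]]): the exact sum
```

Because the masked network computes the sum exactly, it generalizes to arbitrary inputs, just as argued above.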

how to identify winning tickets

To show that winning tickets exist, Frankle and Carbin [4] employ the following procedure, which they label iterative magnitude pruning (a minimal code sketch follows the list):

  1. Initialize the model with parameters $\theta_0$ along with a mask $m$ set to all ones.
  2. Train the (masked) model for $j$ iterations.
  3. Prune the $p\%$ lowest-magnitude parameters and update $m$ accordingly.
  4. Reset the surviving parameters to their values in $\theta_0$ and fix all other parameters to zero.
  5. Repeat from step 2 unless a stopping criterion on sparsity or validation accuracy is met.
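
A minimal sketch of this loop might look as follows (assuming PyTorch; `train` is a placeholder for the reader's own routine that trains the model while keeping masked weights at zero, and `p` is the per-round pruning fraction):

```python
import copy
import torch

def iterative_magnitude_pruning(model, train, p=0.2, rounds=10):
    """Sketch of IMP: train, prune the lowest-magnitude weights, reset, repeat."""
    theta_0 = copy.deepcopy(model.state_dict())          # step 1: remember the initialization
    masks = {n: torch.ones_like(w) for n, w in model.named_parameters()}

    for _ in range(rounds):
        train(model, masks)                              # step 2: train the masked model
        for name, w in model.named_parameters():         # step 3: prune p of the surviving weights
            alive = masks[name].bool()
            k = int(p * alive.sum())
            if k == 0:
                continue
            threshold = w[alive].abs().kthvalue(k).values
            masks[name][w.abs() <= threshold] = 0.0
        model.load_state_dict(theta_0)                   # step 4: reset survivors to their initial values
        for name, w in model.named_parameters():
            w.data *= masks[name]                        # ... and fix pruned weights to zero
    return masks                                         # step 5: here simply a fixed number of rounds
```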

The result is a subnetwork (given by the mask $m$) along with its initialization, which then receives a final training pass. In their experiments on image classification [4], the authors compare the accuracy of these winning tickets against the whole model and against random tickets. Random tickets share the same structure but are re-initialized randomly before the final training pass. The main result is that winning tickets consistently lead to higher scores than random tickets, and can also match or even outperform the full model.

So, random tickets share the same structure as winning tickets, yet winning tickets yield higher scores. This means that the initialization values are crucial to the winning tickets' success. The more parameters we have, the more rolls we get for the initialization values, and the more possibilities there are to combine a subset of good rolls into a sparse subnetwork.

Given this massive number of possibilities, how can it be that pruning finds the right subnetwork? The authors of the LTH paper conjecture that the optimizer already focuses on training the parameters of a well-initialized subnetwork. This may be an alternative explanation for why highly overparametrized networks generalize better.

iterative magnitude pruning

Comparison of pruning techniques
Figure taken from Han et al. 2015 [3]. Pruning of parameters that were trained with either L1 or L2 regularization. L2 is better than L1 as soon as the remaining parameters are retrained.

Frankle and Carbin make use of a pruning technique proposed by Han et al. [3] at NeurIPS 2015: prune the weights with the lowest magnitude and retrain the remaining ones. Han et al. have shown that, counterintuitively, magnitude pruning works better with L2 regularization as soon as the remaining weights are retrained. They further show that iterative pruning is preferable to one-shot pruning. In one-shot pruning, you conduct only a single training pass and then prune the weights, whereas in iterative pruning, you prune and retrain for several rounds until some criterion on sparsity or accuracy is met.

To validate the lottery ticket hypothesis, Frankle and Carbin need to find subnetworks that match the accuracy of the original nets when trained in isolation. That means the subnetworks cannot exploit the benefit of previous training rounds with the full model. Therefore, they modify the training and pruning procedure by resetting the (non-pruned) weights to their values at initialization.

During pruning, one can either prune the desired fraction of weights within each layer (local pruning), or put the weights of all layers into one pool and prune globally. In the LTH paper [4], the authors use local pruning for LeNet and Conv-2/4/6, while they use global pruning for the deeper models, ResNet-18 and VGG-19. The idea is that within deeper models, the weights of some layers might be more important to keep than those of others [5].
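
With PyTorch's built-in pruning utilities, the two variants could be expressed roughly as follows (a sketch; the model architecture and the 20% amount are arbitrary placeholders, and in practice you would use either the local or the global variant, not both in sequence):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
linear_layers = [m for m in model if isinstance(m, nn.Linear)]

# local pruning: remove the 20% lowest-magnitude weights of each layer separately
for layer in linear_layers:
    prune.l1_unstructured(layer, name="weight", amount=0.2)

# global pruning (the alternative): pool the weights of all layers and prune the
# lowest 20% overall, so some layers may end up much sparser than others
prune.global_unstructured(
    [(layer, "weight") for layer in linear_layers],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
```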

late resetting

Late resetting results
Figure taken from Frankle et al. 2019 [6]. With late resetting, winning tickets can be found that may even achieve higher scores than the original network for VGG-19 and ResNet-18.

Learning rate warmup can help to find winning tickets for deeper models [4]. In follow-up work, however, the authors introduced a different technique for dealing with deeper models: late resetting [6]. With late resetting, the surviving weights are reset to their values from early in the training process rather than strictly to their initial values. This makes learning rate warmup unnecessary and turns out to be especially important for finding winning tickets in deeper models.
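
In code, late resetting only changes which checkpoint the surviving weights are reset to. A sketch, assuming a standard PyTorch training loop, where `k` and the function name are illustrative:

```python
import copy

def train_and_checkpoint(model, data_loader, optimizer, loss_fn, k=500):
    """Usual training loop, but keep a copy of the weights after k iterations.

    During iterative pruning, the surviving weights are later reset to this
    early checkpoint (theta_k) instead of the strict initialization (theta_0)."""
    theta_k = None
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if step + 1 == k:
            theta_k = copy.deepcopy(model.state_dict())   # late-resetting checkpoint
    return theta_k
```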

winning tickets outside of the image domain

Is the lottery ticket phenomenon an artefact of supervised image classification with feed-forward convolutional nets, or does it generalize to other domains? Yu et al. [7] showed that winning tickets also exist in reinforcement learning and natural language processing architectures. Their experiments include classic control problems, Atari games, LSTMs, and Transformers, and they managed to find winning tickets for all of these architectures. This suggests that the LTH phenomenon is not restricted to supervised image classification but might be a general property of deep neural nets.

transferring winning tickets

So far, the procedure to identify winning tickets is still expensive, as it involves several full training passes. How can we still benefit from winning tickets? Can we transfer them to other tasks such that only a small fraction of the weights needs to be learned for the target task?

Several works have analyzed whether winning tickets are transferable across tasks within the image domain [8][5]. Both suggest that winning tickets are transferable across tasks; however, each makes use of a particular relaxation. Mehta [8] relaxes late resetting to using the best weights from anywhere in the training process on the source task, arguing that the purpose of transfer learning is to save training effort on the target task.

In the LTH paper [4], winning tickets are evaluated against random tickets that share the same structure but are re-initialized at random. The inductive bias of winning tickets, however, comes from both the initialization and the structure [5]. Comparing against randomly initialized tickets with the same structure is thus more challenging than comparing against random tickets whose mask is also drawn at random.

Morcos et al. [5] compare against random tickets that are not only randomly initialized but also randomly permuted. The authors argue that the inductive bias of winning tickets consists of both the initialization and the structure. They further observe that larger datasets lead to winning tickets that transfer better.
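
The two control conditions can be expressed per layer roughly as follows (a sketch; `mask` is assumed to be a binary tensor with the same shape as the layer's weight matrix):

```python
import torch

def random_ticket(mask, weight_shape):
    """Keep the winning ticket's structure (mask), but draw a fresh initialization."""
    reinit = torch.empty(weight_shape)
    torch.nn.init.kaiming_uniform_(reinit)   # any standard init scheme
    return mask, reinit

def permuted_random_ticket(mask, weight_shape):
    """Additionally destroy the structure by shuffling the mask within the layer."""
    permuted = mask.flatten()[torch.randperm(mask.numel())].reshape(mask.shape)
    reinit = torch.empty(weight_shape)
    torch.nn.init.kaiming_uniform_(reinit)
    return permuted, reinit
```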

what do winning tickets look like

Zhou et al. [9] have conducted a closer investigation of winning tickets. They show that a crucial element of the initialization is the sign of each weight. Furthermore, the authors develop the notion of supermasks: masks that lead to good accuracy even without further training. They claim that sparse subnetworks work particularly well when the initialization is close to its final form.
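
A supermask in this sense is simply a binary mask applied to the untrained, randomly initialized weights, evaluated without any training. A sketch (assuming PyTorch; `masks` is assumed to map parameter names to binary tensors):

```python
import torch

@torch.no_grad()
def evaluate_supermask(model, masks, data_loader):
    """Apply a binary mask to the *untrained* weights and measure accuracy directly."""
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])
    correct = total = 0
    for x, y in data_loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```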

pruning and dropout

Dropout is a well-known regularization method that encourages sparsity tolerance during training by setting a random fraction of weights or hidden units to zero. However, when pruning is applied after training, the set of pruned weights depends on a heuristic such as the magnitude of the weights. Gomez et al. [10] pursue the idea of improving the interaction of dropout and pruning: dropout could be targeted at the units that are likely to be pruned, i.e., those with low magnitude. In the paper, the authors analyze not only standard unit-dropout but also weight-dropout (aka DropConnect), which is even closer to the employed pruning techniques.
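
A heavily simplified sketch of the weight-dropout variant of this idea (my own simplification, not the authors' implementation): with probability `drop_rate`, drop weights only among the `targ_frac` lowest-magnitude entries of a layer.

```python
import torch

def targeted_weight_dropout(weight, targ_frac=0.5, drop_rate=0.5, training=True):
    """Dropout that is targeted at the weights most likely to be pruned later."""
    if not training:
        return weight
    k = int(targ_frac * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    is_candidate = weight.abs() <= threshold                      # low-magnitude weights
    drop = is_candidate & (torch.rand_like(weight) < drop_rate)   # drop only candidates
    return weight * (~drop).to(weight.dtype)
```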

pruning on-the-go

The holy grail of winning tickets is to identify them as early as possible in the training process. Dettmers and Zettlemoyer [11] propose a technique to identify winning tickets without the need for expensive retraining. They exploit the momentum of the gradients to determine how fast weights change during training and prune those that do not change much. Furthermore, the values of pruned weights are redistributed dynamically. The results show that this so-called sparse momentum technique outperforms their baselines for sparse learning.
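
A heavily simplified sketch of that idea (not the authors' sparse momentum algorithm, only a prune-and-regrow step as described above; all names are illustrative):

```python
import torch

def momentum_reallocation(weight, mask, momentum, frac=0.2):
    """Simplified sketch: prune barely-changing weights, regrow where momentum is large."""
    active = mask.bool()
    n = int(frac * active.sum())
    if n == 0:
        return mask
    # prune: among active weights, drop those whose momentum magnitude is smallest
    score = momentum.abs().clone()
    score[~active] = float("inf")
    drop = score.view(-1).topk(n, largest=False).indices
    mask.view(-1)[drop] = 0.0
    # regrow: among inactive positions, enable those with the largest momentum
    score = momentum.abs().clone()
    score[mask.bool()] = -float("inf")
    grow = score.view(-1).topk(n, largest=True).indices
    mask.view(-1)[grow] = 1.0
    weight.data.view(-1)[grow] = 0.0          # regrown weights start from zero
    return mask
```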

limitations

There are also studies that challenge the LTH. Gale et al. [12] conduct a large-scale comparison of sparse neural nets on machine translation with Transformers and image classification with ResNet-50. They confirm that naive magnitude pruning [3] is the best pruning technique among the ones compared. However, they report that the LTH approach fails to find winning tickets for these architectures. Liu et al. [13] show that, with a carefully selected learning rate, random tickets can perform as well as winning tickets. Neither work, however, used late resetting [6], which helps to find winning tickets especially in deep architectures. Another limitation is that the LTH only claims that winning tickets exist; it does not hand you winning tickets right away. To find a winning ticket, you still need to start with a large, dense model in the first place.

experiments on the sum-of-two-inputs example

Let’s implement the sum-of-two-inputs example from above. We use local pruning so that the second layer does not get pruned away completely. We begin with 200 hidden units and train for 10 epochs in each pruning round. We iteratively prune 25% of the weights (by magnitude) per round until only 2 weights are left in each layer. In each round, we use late resetting to the weights obtained after the first training iteration. We expect that the winning ticket will pass the two inputs through the first layer and sum them up in the second layer. We train on the interval [-1,1] and test on the interval [1,2]. We end up with the following results for the root mean squared error (RMSE).
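
A possible setup for the full model and the data could look as follows (a sketch, assuming PyTorch; the sample sizes, optimizer, and learning rate are arbitrary, and the pruning rounds would reuse the iterative magnitude pruning loop sketched earlier). The log that follows is the output of the actual experiment, not of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# two-layer linear net without bias: 2 inputs -> 200 hidden units -> 1 output
model = nn.Sequential(nn.Linear(2, 200, bias=False), nn.Linear(200, 1, bias=False))

# train on [-1, 1], test on [1, 2]; the target is the sum of the two inputs
x_train = torch.rand(4096, 2) * 2 - 1
x_test = torch.rand(1024, 2) + 1
y_train = x_train.sum(dim=1, keepdim=True)
y_test = x_test.sum(dim=1, keepdim=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(10):                               # 10 (full-batch) epochs per round
    optimizer.zero_grad()
    F.mse_loss(model(x_train), y_train).backward()
    optimizer.step()

rmse = F.mse_loss(model(x_test), y_test).sqrt().item()
print(f"Full-model RMSE: {rmse:.2e}")
```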

Pruning 0.25 weights => 4 weights still active ~= 0.67%
Stopping criterion met.
This is your winning ticket:
        Layer 0
        [60,0]: 0.9429353475570679
        [60,1]: 0.9429353475570679
        Layer 1
        [0,60]: 1.06050705909729
        [0,101]: -0.4135688245296478
Winning ticket RMSE: 6.870159915858438e-05
This is a random ticket:
        Layer 0
        [60,0]: -0.8950394988059998
        [60,1]: -0.8950406908988953
        Layer 1
        [0,60]: -1.1172559261322021
        [0,101]: 0.008170465007424355
Reinit Random ticket RMSE: 7.364383157981381e-05
This is a permuted random ticket:
        Layer 0
        [2,0]: 0.15243083238601685
        [2,1]: -0.40985623002052307
        Layer 1
        [0,60]: -0.002661715727299452
        [0,101]: 0.03881140798330307
Permute+Reinit Random ticket RMSE: 6.60740392345907
Full-model RMSE: 7.780414107555133e-06

We observe that iterative magnitude pruning yields a winning ticket that corresponds to our human intuition (or at least comes close to it). Note that the [0,101] weight in the second layer is irrelevant, as it will only ever receive zero inputs. The winning ticket’s accuracy with 4 weights is on par with the accuracy of the full model with 600 weights. The randomly reinitialized ticket also succeeds in learning good weights (the inverted signs cancel each other out). In contrast, the locally permuted ticket contains a dead end and cannot learn anything.

What can we learn from implementing the “sum-of-two-inputs” toy example?

  1. We can verify our previous thought experiment: it is possible to find a winning ticket with only four weights that achieves an error comparable to that of the full 600-parameter model.
  2. For this simple task, the initialization of winning tickets might be less important than it is in other tasks.
  3. We see that a comparison with randomly permuted tickets is dangerous. The random permutation has led to a “dead end”, which may render the model untrainable.

conclusion

The lottery ticket hypothesis states that dense neural networks contain sparse subnetworks that can be trained in isolation to match the performance of the dense net. This phenomenon offers a novel interpretation of overparametrization: more parameters mean exponentially more draws in the lottery for a well-initialized subnetwork. To benefit from their existence, we need methods that identify winning tickets early, without training the full model at all. Some approaches already tackle this, while others focus on training methods that make neural networks more amenable to later pruning. If we could identify winning tickets early or transfer them to other domains, we would save substantial amounts of training effort. Winning tickets sometimes even outperform the original networks, which might have implications for our understanding of architectures and their initializations, and for how we design them. We can further confirm that iterative magnitude pruning succeeds in finding winning tickets that correspond to human intuition on a simple task.


references

  1. Arora, Sanjeev, Nadav Cohen, and Elad Hazan. “On the optimization of deep networks: Implicit acceleration by overparameterization.” ICML 2018. 

  2. LeCun, Yann, John S. Denker, and Sara A. Solla. “Optimal brain damage.” In Advances in neural information processing systems, pp. 598-605. 1990. 

  3. Han, Song, Jeff Pool, John Tran, and William Dally. “Learning both weights and connections for efficient neural network.” In Advances in neural information processing systems, pp. 1135-1143. 2015. 

  4. Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” ICLR 2019. 

  5. Morcos, Ari S., Haonan Yu, Michela Paganini, and Yuandong Tian. “One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers.” arXiv preprint arXiv:1906.02773 (2019). 

  6. Frankle, Jonathan, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. “The Lottery Ticket Hypothesis at Scale.” arXiv preprint arXiv:1903.01611 (2019). 

  7. Yu, Haonan, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. “Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP.” arXiv preprint arXiv:1906.02768 (2019). 

  8. Mehta, Rahul. “Sparse Transfer Learning via Winning Lottery Tickets.” arXiv preprint arXiv:1905.07785 (2019). 

  9. Zhou, Hattie, Janice Lan, Rosanne Liu, and Jason Yosinski. “Deconstructing lottery tickets: Zeros, signs, and the supermask.” arXiv preprint arXiv:1905.01067 (2019). 

  10. Gomez, Aidan N., Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. “Learning Sparse Networks Using Targeted Dropout.” arXiv preprint arXiv:1905.13678 (2019). 

  11. Dettmers, Tim, and Luke Zettlemoyer. “Sparse Networks from Scratch: Faster Training without Losing Performance.” arXiv preprint arXiv:1907.04840 (2019). 

  12. Gale, Trevor, Erich Elsen, and Sara Hooker. “The state of sparsity in deep neural networks.” arXiv preprint arXiv:1902.09574 (2019). 

  13. Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. “Rethinking the value of network pruning.” arXiv preprint arXiv:1810.05270 (2018).