This blog post is about my work, Sparse Networks from Scratch: Faster Training without Losing Performance, with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.
Why Sparse Learning?
A significant driver of progress in deep learning has been advances in computational resources. From 2010 to 2018 we saw an increase of 9700% in computational GPU performance. However, we can expect increases of just little more than 80% GPU performance in the next 5-8 years due to reaching the physical limits of semiconductor technology. What does a research world look like where we cannot make further improvements in computational power?
A glimpse of this comes from the natural language processing (NLP) community where pretrained language models like ELMO, GPT, BERT, GPT-2, Grover, and XL-Net dominate the entire field by outperforming other methods on most NLP tasks. These models are often rather simple: You train them on lots of documents, and the task is mainly to predict a word given a sequence of other words — a bit like doing a fill-in-the-blank puzzle. The catch? These models are so big that they take well in excess of 100 GPUs hours to train. This is particularly frustrating for academic researchers that want to understand these models but are unable to do so because they lack the computational resources that big companies have. To truly understand these massive pretrained language models, a primary goal should be to democratize the training of these models by developing more resourceful training procedures.
One way to achieve this is to look at the human brain for inspiration. The human brain consumes 1/10th of the energy of a GPU but is 10^9 times more powerful. What makes the brain so computational efficient? There are many reasons, but one reason is sparsity.
It has been found that the more neurons a primate brain has the fewer connections does the average neuron make with all other neurons (Herculano-Houzel et al., 2010). This is very much contrary to how we design deep neural networks, which is to connect every new neuron in a layer with all neurons in the previous layer. We already understand how to compress a fully trained dense network to a sparse network (Han et al., 2015), but there has been little work on how to do this successfully if one starts from a sparse network which we keep sparse during training. How do we do this?
Sparse Momentum: An Efficient Way to Train Sparse Networks
This section explains the sparse momentum algorithm from intutiton up to the full algorithm.
What is the Main Quality of Good Sparse Learning Algorithms?
In sparse learning, the most important thing is to use every single weight in a neural network as effectively as possible. If you define “effectiveness” as “reducing the error,” then we have an obvious perspective on how we can proceed. We need to find a measure which describes how effective a weight is at reducing the error and remove all weights which do not. Once we removed weights, we want to regrow new weights in locations which we think are promising at reducing the error in the future.
If we look at the gradient of the error with respect to the weight, we actually have precisely such a measure. However, if we look at successive gradients, we find that gradients can wildly oscillate. For example if you have a neural network which classifies handwritten digits 0 to 9 then a weight might be good at detecting a straight line at the top and it might help to reduce the error for the numbers 5, 7 but then it might not help or even be detrimental for numbers 0, 1, 2, 3, 6, 8, 9. Instead, a weight which detects a curvy pattern in the top right might help for 0, 2, 3, 8, 9 and as such we would expect that this weight reduces the error more consistently over time than the “straight line at the top” weight. How can we detect such promising weights in a neural network automatically?
Momentum: Finding Weights that Reduce the Error Consistently
If you take the north pole to be a local minimum and a compass needle the gradient towards the local minimum, then you can simulate stochastic gradient descent updates by shaking the compass wildly to spin the compass needle. With every time the needle passes the north pole it will slow down and line-up more and more with the north pole, however, due to the spin it will still “overshoot” that direction. So it might be unclear where the north pole is from two or three measurements while the needle is still moving back and forth. However, if you take the average directions — one time the needle is a bit to the left of the north pole, another time it is more to the right — then these deviations cancel out, and you will immediately get a direction which is very close to the real north pole.
This is the main idea behind the momentum optimization technique: We average successive gradients to get a better estimate of the direction of the local minimum. Similarly to the compass needle, which gets more and more accurate over time as it slows down, we want to weight more recent gradient directions in stochastic gradient descent more highly. One way to do this is to assign a weighted average where we assign a much larger weight to the current gradient and a small weight to the previous gradients — this is called exponential smoothing. Through exponential smoothing the gradients of the weight we receive a weighted gradient matrix — this matrix is the momentum matrix which gives momentum optimization its name. With this measure, we can identify which are the weights which reduce the error consistently.
Redistributing Weights: The Mean Momentum Magnitude of a Layer
From here we make the first important observation for our sparse momentum algorithm: If the momentum of a weight indicates how much it reduces the error consistently, then the mean momentum magnitude of all the weights in a layer should indicate how much each layer is reducing the error on average. We take the magnitude because two different weights might consistently go into a negative direction and a positive direction. By taking the mean momentum magnitude of layers, we can easily compare how effective the average weight in each layer is. This enables to say, for example, that a weight in a convolutional layer A is on average 1/3 as effective at reducing the error as the average weight in fully connected layer B, or vice versa. This method enables us to redistribute weights effectively: if we find “useless” weights, we now know precisely in which layer to put it — but where to put them exactly within a layer?
Which Weights Should be Removed? Where to Regrow them?
The next two problems are more straightforward: Which are the most useless weights? Where do we grow weights within a layer? The first problem is a common problem in neural network compression research, where one often prunes the weights with the smallest magnitude. Why does this make sense? If we assume all weights receive on average inputs of similar magnitude — a reasonable assumption if one uses batch normalization — then weights with small magnitudes make the smallest difference in the activation of a neuron. As such, removing them should change the predictive performance of our networks by the smallest amount.
Once we removed weights and redistributed them to weight-effective layers as measured by the mean momentum magnitude of a layer, we need to decide where exactly to grow them within a layer. One possible solution becomes apparent if we ask: “Which two unconnected neurons would reduce the error consistently if we connect them?” The answer to this question would again point to the momentum magnitude. This time, however, we want to look at the momentum magnitude of “missing” or zero-valued weights, that is, we want to look at those weights which have been excluded from training before. Thus we grow weights in locations where missing weights have the largest momentum magnitude. This completes the sparse momentum algorithm, which depicted in Figure 1.
The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repetitively while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks by using 20% of weights (80% sparsity). On CIFAR-10, we compare against Single-shot Network Pruning which is designed for simplicity and not performance — so it is not surprising that sparse momentum does better. However, what is interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (16 layers deep and very wide WRN) to dense performance levels with just 5% of weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm, we would be able to train the WRN 16-10 12x faster than the dense baseline — same performance levels while training 12x faster!
On ImageNet, we are not able to reach dense performance levels, which indicates that there is room to improve sparse momentum. However, we can demonstrate that sparse momentum has a clear lead compared to other methods that maintain sparse weights throughout training.
The main promise of sparse learning was to accelerate training — were we successful? Yes — and no. Sparse momentum accelerates training efficiently if we measure possible speedups for sparse convolution, but since sparse networks were only very recently used for training, no optimized sparse convolution algorithms exist for the GPU — at least not for the fine-grained sparse patterns of weights as exhibited by sparse momentum.
As such, we divide the speedups into two groups: Possible speedups which could be achieved if sparse convolution algorithms would exist, and speedups which we can achieve today with standard dense convolutional algorithms. How can dense convolutions help for sparse networks?
If we look at the sparsity pattern of our network, we have the case where a convolutional channel is entirely empty — a convolutional filter full of zeros! If this happens, we can remove the channel from the computation without changing the results of the convolution and thus gain speedups.
However, if we look at the speedups, we see there is a marked difference between sparse convolution and dense convolution speedups. This clearly shows the need for optimized sparse convolution algorithms for the GPU.
Why does Sparse Learning Work?
Some of our sparse networks trained with sparse momentum matched the performance levels of dense networks with just 5% of weights. What makes these 5% of weights so efficient that they can match a neural network with 20 times as many weights?
To look into this question, we looked at how the features of sparse networks compare to dense networks. Low-level features might include things like edge detectors. Mid-level features might be things like wheels, noses, eyes, paws. High-level features might be the “face” of a car, a cat face, a fridge door, and so forth.
To reduce features to numbers we look at convolutional channels — the equivalent to a “neuron” in a convolutional network — and how useful the channel is to classes in the dataset. Edge detectors should be useful to almost all classes in the dataset — in other words, they should have a low level of class-specialization. Mid-level features like eyes should be useful to—some classes such as cats, dogs, and humans. High-level features should be useful to a few selected classes — they are highly class-specialized.
What we find is that on average, sparse networks learn features which are useful to a broader range of classes — they learn more general features. This might be a possible explanation of why sparse networks can match the performance of dense networks with as few as 5% weights.
The Future of Sparse Learning
I believe sparse learning has a very bright future because (1) GPUs will stagnate in performance over the next years, (2) specialized processors for sparse workloads, Graphcore processors, are around the corner. Graphcore processors store an entire network in its 300 MB cache and accelerate it by a factor of roughly 100x. This means, if we can compress a network to 300 MB during training, then we will have 100x faster training overall. Training a ResNet-50 on ImageNet would then take only roughly 15 minutes using one Graphcore processor. With sparse learning, the 300 MB limit will be in reach without a problem.
My prediction is that the first research team that can train a sparse neural network on a Graphcore processor successfully will unlock an entirely new level of artificial intelligence.
Besides this, another challenge is to apply sparse learning algorithms to natural language processing (NLP). Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision — lots of work to do!
Try Sparse Momentum with Your Own Model in 10 Lines of Code!
To make sparse learning accessible to everyone I developed a sparse learning library which allows the easy application of existing algorithms like sparse momentum to your own models — it can be done in less than 10 lines of code. The library is also designed to make it very easy to add your own sparse learning methods. You find my sparse learning library on GitHub.
For questions, I prefer if you post them below if they are simple and straightforward. If you have a more formal question regarding our work that requires careful answers, you can post an the question as a GitHub issue — I will try to answer as timely as possible.
I thank Luke Zettlemoyer for feedback on an early draft of this blog post.
Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019.
Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages
Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J.H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. In Proceedings of the National Academy of Sciences of the United States of America, 107 44:19008—13.