This blog post is about my work, Sparse Networks from Scratch: Faster Training without Losing Performance, with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels *sparse learning*. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.

## Why Sparse Learning?

A significant driver of progress in deep learning has been advances in computational resources. From 2010 to 2018 we saw an increase of 9700% in computational GPU performance. However, we can expect increases of just little more than 80% GPU performance in the next 5-8 years due to reaching the physical limits of semiconductor technology. What does a research world look like where we cannot make further improvements in computational power?

A glimpse of this comes from the natural language processing (NLP) community where pretrained language models like ELMO, GPT, BERT, GPT-2, Grover, and XL-Net dominate the entire field by outperforming other methods on most NLP tasks. These models are often rather simple: You train them on lots of documents, and the task is mainly to predict a word given a sequence of other words — a bit like doing a fill-in-the-blank puzzle. The catch? These models are so big that they take well in excess of 100 GPUs hours to train. This is particularly frustrating for academic researchers that want to understand these models but are unable to do so because they lack the computational resources that big companies have. To truly understand these massive pretrained language models, a primary goal should be to democratize the training of these models by developing more resourceful training procedures.

One way to achieve this is to look at the human brain for inspiration. The human brain consumes 1/10th of the energy of a GPU but is 10^9 times more powerful. What makes the brain so computational efficient? There are many reasons, but one reason is* sparsity*.

It has been found that the more neurons a primate brain has the fewer connections does the average neuron make with all other neurons (Herculano-Houzel et al., 2010). This is very much contrary to how we design deep neural networks, which is to connect every new neuron in a layer with all neurons in the previous layer. We already understand how to compress a fully trained dense network to a sparse network (Han et al., 2015), but there has been little work on how to do this successfully if one starts from a sparse network which we keep sparse during training. How do we do this?

## Sparse Momentum: An Efficient Way to Train Sparse Networks

This section explains the sparse momentum algorithm from intutiton up to the full algorithm.

### What is the Main Quality of Good Sparse Learning Algorithms?

In sparse learning, the most important thing is to use every single weight in a neural network as effectively as possible. If you define “effectiveness” as “reducing the error,” then we have an obvious perspective on how we can proceed. We need to find a measure which describes how effective a weight is at reducing the error and remove all weights which do not. Once we removed weights, we want to regrow new weights in locations which we think are promising at reducing the error in the future.

If we look at the gradient of the error with respect to the weight, we actually have precisely such a measure. However, if we look at successive gradients, we find that gradients can wildly oscillate. For example if you have a neural network which classifies handwritten digits 0 to 9 then a weight might be good at detecting a straight line at the top and it might help to reduce the error for the numbers 5, 7 but then it might not help or even be detrimental for numbers 0, 1, 2, 3, 6, 8, 9. Instead, a weight which detects a curvy pattern in the top right might help for 0, 2, 3, 8, 9 and as such we would expect that this weight reduces the error more consistently over time than the “straight line at the top” weight. How can we detect such promising weights in a neural network automatically?

### Momentum: Finding Weights that Reduce the Error Consistently

If you take the north pole to be a local minimum and a compass needle the gradient towards the local minimum, then you can simulate stochastic gradient descent updates by shaking the compass wildly to spin the compass needle. With every time the needle passes the north pole it will slow down and line-up more and more with the north pole, however, due to the spin it will still “overshoot” that direction. So it might be unclear where the north pole is from two or three measurements while the needle is still moving back and forth. However, if you take the average directions — one time the needle is a bit to the left of the north pole, another time it is more to the right — then these deviations cancel out, and you will immediately get a direction which is very close to the real north pole.

This is the main idea behind the momentum optimization technique: We average successive gradients to get a better estimate of the direction of the local minimum. Similarly to the compass needle, which gets more and more accurate over time as it slows down, we want to weight more recent gradient directions in stochastic gradient descent more highly. One way to do this is to assign a weighted average where we assign a much larger weight to the current gradient and a small weight to the previous gradients — this is called exponential smoothing. Through exponential smoothing the gradients of the weight we receive a weighted gradient matrix — this matrix is the momentum matrix which gives momentum optimization its name. With this measure, we can identify which are the weights which reduce the error consistently.

### Redistributing Weights: The Mean Momentum Magnitude of a Layer

From here we make the first important observation for our sparse momentum algorithm: If the momentum of a weight indicates how much it reduces the error consistently, then the mean momentum magnitude of all the weights in a layer should indicate how much each layer is reducing the error on average. We take the magnitude because two different weights might consistently go into a negative direction and a positive direction. By taking the mean momentum magnitude of layers, we can easily compare how effective the average weight in each layer is. This enables to say, for example, that a weight in a convolutional layer A is on average 1/3 as effective at reducing the error as the average weight in fully connected layer B, or vice versa. This method enables us to redistribute weights effectively: if we find “useless” weights, we now know precisely in which layer to put it — but where to put them exactly within a layer?

### Which Weights Should be Removed? Where to Regrow them?

The next two problems are more straightforward: Which are the most useless weights? Where do we grow weights within a layer? The first problem is a common problem in neural network compression research, where one often prunes the weights with the smallest magnitude. Why does this make sense? If we assume all weights receive on average inputs of similar magnitude — a reasonable assumption if one uses batch normalization — then weights with small magnitudes make the smallest difference in the activation of a neuron. As such, removing them should change the predictive performance of our networks by the smallest amount.

Once we removed weights and redistributed them to weight-effective layers as measured by the mean momentum magnitude of a layer, we need to decide where exactly to grow them within a layer. One possible solution becomes apparent if we ask: “Which two unconnected neurons would reduce the error consistently if we connect them?” The answer to this question would again point to the momentum magnitude. This time, however, we want to look at the momentum magnitude of “missing” or zero-valued weights, that is, we want to look at those weights which have been excluded from training before. Thus we grow weights in locations where missing weights have the largest momentum magnitude. This completes the sparse momentum algorithm, which depicted in Figure 1.

## Results

The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repetitively while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks by using 20% of weights (80% sparsity). On CIFAR-10, we compare against Single-shot Network Pruning which is designed for simplicity and not performance — so it is not surprising that sparse momentum does better. However, what is interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (16 layers deep and very wide WRN) to dense performance levels with just 5% of weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm, we would be able to train a variety of networks to yield the same performance levels while training between 3.0-5.6x faster!

On ImageNet, we are not able to reach dense performance levels, which indicates that there is room to improve sparse momentum. However, we can demonstrate that sparse momentum has a clear lead compared to other methods that maintain sparse weights throughout training.

### Speedups

The main promise of sparse learning was to accelerate training — were we successful? Yes — and no. Sparse momentum accelerates training efficiently if we measure possible speedups for sparse convolution, but since sparse networks were only very recently used for training, no optimized sparse convolution algorithms exist for the GPU — at least not for the fine-grained sparse patterns of weights as exhibited by sparse momentum.

As such, we divide the speedups into two groups: Possible speedups which could be achieved if sparse convolution algorithms would exist, and speedups which we can achieve today with standard dense convolutional algorithms. How can dense convolutions help for sparse networks?

If we look at the sparsity pattern of our network, we have the case where a convolutional channel is entirely empty — a convolutional filter full of zeros! If this happens, we can remove the channel from the computation without changing the results of the convolution and thus gain speedups.

However, if we look at the speedups, we see there is a marked difference between sparse convolution and dense convolution speedups. This clearly shows the need for optimized sparse convolution algorithms for the GPU.

## Why does Sparse Learning Work?

Some of our sparse networks trained with sparse momentum matched the performance levels of dense networks with just 5% of weights. What makes these 5% of weights so efficient that they can match a neural network with 20 times as many weights?

To look into this question, we looked at how the features of sparse networks compare to dense networks. Low-level features might include things like edge detectors. Mid-level features might be things like wheels, noses, eyes, paws. High-level features might be the “face” of a car, a cat face, a fridge door, and so forth.

To reduce features to numbers we look at convolutional channels — the equivalent to a “neuron” in a convolutional network — and how useful the channel is to classes in the dataset. Edge detectors should be useful to almost all classes in the dataset — in other words, they should have a low level of class-specialization. Mid-level features like eyes should be useful to—some classes such as cats, dogs, and humans. High-level features should be useful to a few selected classes — they are highly class-specialized.

What we find is that on average, sparse networks learn features which are useful to a broader range of classes — they learn more general features. This might be a possible explanation of why sparse networks can match the performance of dense networks with as few as 5% weights.

## The Future of Sparse Learning

I believe sparse learning has a very bright future because (1) GPUs will stagnate in performance over the next years, (2) specialized processors for sparse workloads, Graphcore processors, are around the corner. Graphcore processors store an entire network in its 300 MB cache and accelerate it by a factor of roughly 100x. This means, if we can compress a network to 300 MB during training, then we will have 100x faster training overall. Training a ResNet-50 on ImageNet would then take only roughly 15 minutes using one Graphcore processor. With sparse learning, the 300 MB limit will be in reach without a problem.

My prediction is that the first research team that can train a sparse neural network on a Graphcore processor successfully will unlock an entirely new level of artificial intelligence.

Besides this, another challenge is to apply sparse learning algorithms to natural language processing (NLP). Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision — lots of work to do!

## Try Sparse Momentum with Your Own Model in 10 Lines of Code!

To make sparse learning accessible to everyone I developed a sparse learning library which allows the easy application of existing algorithms like sparse momentum to your own models — it can be done in less than 10 lines of code. The library is also designed to make it very easy to add your own sparse learning methods. You find my sparse learning library on GitHub.

### Questions?

For questions, I prefer if you post them below if they are simple and straightforward. If you have a more formal question regarding our work that requires careful answers, you can post an the question as a GitHub issue — I will try to answer as timely as possible.

#### Acknowledgements

I thank Luke Zettlemoyer for feedback on an early draft of this blog post.

### References

Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In *ICLR 2019*.

Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In *Advances in neural information processing systems*, pages

1135—1143.

Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J.H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. In *Proceedings of the National Academy of Sciences of the United States of America*, 107 44:19008—13.

Roberto says

I think that in order to avoid misunderstanding the following sentence is a key:

“but since sparse networks were only very recently used for training, no optimized sparse convolution algorithms exist for the GPU ”

So, everything stands on the “experimental” side and not actual sparse (very fast) training is performed at all. That does not mean that sparse is a powerful research (practical) line, it means that we have to be aware of the actual performance.

Tim Dettmers says

This is a very crucial point and it was difficult to make the right split between making people aware how much unmatched potential there is behind sparse networks compared to where we can get with current algorithms. Do you think this should have been clearer in introduction/abstract or is it fair as it is laid out right now?

Roberto says

Well, at least you mentioned 😉 There are different blogs, posts, papers… that does not mention at all the actual performance due to a lack of sparse implementations. Therefore the tentative reader should find your post very fair.

Thanks

Tim Dettmers says

Thank you, Roberto, that is good to hear. I would like to be honest and open, but it is sometimes not easy to find the right balance between being just that and making people excited about the possibilities of the future.

Rubén Chaves says

Good post, I agree on the power and need for sparse algorithms: faster training and inference time, less memory footprint, less energy consumption.

Do you know if there is any work being done about optimized sparse algorithms for the GPU?

Tim Dettmers says

I have been working on some sparse algorithms and I have a 16-bit sparse matrix algorithm which is much faster than cuBLAS, but I am not sure how useful the algorithm would be in its current state. My algorithm would not be suitable for sparse convolution via img2col + sparse matrix multiplication. I think to do really well for sparse convolution one needs dedicated CUDA kernels that exploit the repetitive nature of convolution. These sparse algorithms are difficult and their development is very time consuming, that makes them very risky to pursue from a research perspective — I am not sure if I would bet my research time on a sparse convolution algorithm for GPUs. If it is for Graphcore processors this might be a bit different story though. But first we would need to see if Graphcores survive or if they die like Intel’s Nervana processor — designing, producing and bringing new processors to market is a very risky endeavor.

Nikolay Abkairov says

usually, the problem with sparse calculations is not just an efficient core function for matrix sparse multiplication, but also efficiency of memory access – arbitrary sparseness doesn’t use cache lines efficiently – meaning that loading 1-2 bytes causes all of the adjacent 32 or 64 bytes to load at once. This is applicable both for CPUs and GPUs.

One way to mitigate this is to group pruned weights & remaining weights in rows or columns – so that each memory read would benefit several remaining weights at once. You can think of this as an intermediate mode between arbitrary sparseness (any weight may be pruned) and structured sparseness when a whole row or column is removed.

I’m curious to see how your algorithm may perform with grouped pruning. My guess is that it may slow down training, because of more coarse pruning – some weights in a group may still have good momentum, but we may still prune them just because all weights around them are not important – but I’m not sure if the same convergence/quality is even achievable for the same sparseness ratio but with grouping.

Tim Dettmers says

I experimented a little bit with group sparsity and it decreased performance. But this was during the initial stages of experimentation. Maybe I should give it another shot.

Regarding the cache line problem: You can avoid the problem by restructuring the matrix multiplication where you multiply the column times row (instead of row times column) and then accumulate results in cache. I have an algorithm like that and it works much better for deep learning, but I have not published the algorithm yet. The main problem for this algorithm is that it requires a fixed structure of matrix B (AB=C) and frequent synchronization. But both problems can be mitigated if you do it right.

Michael says

Hi Tim, there’s a discussion currently about making sparse ops for Pytorch, perhaps you could weigh in: https://github.com/pytorch/pytorch/issues/20402

Tim Dettmers says

Thank you, Michael, I was not aware of this discussion! Will look at in the detail later and chime in.

Anique Akhtar says

I am not sure how much of this is relevant to ‘sparse learning’ but we do have two implementations on GPUs in the 3D learning that uses some sort of sparse learning. I would be interested in knowing how these works are related to or different from your work:

1) https://arxiv.org/abs/1711.10275

Codebase:

https://github.com/facebookresearch/SparseConvNet

2) https://arxiv.org/abs/1904.08755

Tim Dettmers says

This looks very interesting — thank you for posting this! I will have a look at this and will get back to you in the next couple of days.

Tushar Jain says

Hi there, great post. Is the paper on arxiv as well?

Also, instead of momentum can something like the newly proposed “Impact on Loss” (https://twitter.com/alfcnz/status/1125532253390610433) of each param be used?

Tim Dettmers says

Thank you — I did not realize the link to the paper was missing! Here the link to our work on arxiv: https://arxiv.org/abs/1907.04840.

That is interesting work! Do you know if they released the work yet? I cannot find it on arxiv. One comment in the tweet also mentions that the work is similar to “Are All Layers Created Equal?” and what I can say is that we find similar results. Our momentum measure for each layer replicates the patterns in “Are All Layers Created Equal?” quite well. This might suggest that sparse momentum can find layers which are “critical” for learning.

Tushar Jain says

I don’ think that work is still out. And it’s interesting that your momentum measure replicates those patterns. Is there a more detailed analysis of the same?

kev says

Hi Tim,

The base dense architecture used for ImageNet experiments in Mostafa/Wang attained 74.9/92.4 Top1/Top5 accuracy. Your base architecture appears to attain 79.3/94.8 Top1/Top5, so it is a much stronger baseline. The amount of accuracy you lose compared to this architecture is actually more than what Mostafa/Wang lose compared to their baseline, so it is possible the accuracy gains are just coming from a better base architecture rather than a better sparse learning algorithm. Could you please comment on this? Thanks!

Tim Dettmers says

Very good point! The NeurIPS reviewers also noted that. The 79.3 Top 1 was actually misreported (multi-crop error). But we also used a different codebase which was supperior (cosine learning rate with warmup + label smoothing). Our baseline achieves 77.0% Top 1 (100% weights). I re-ran all the experiments on the dynamic sparse codebase to get comparable results and still attain state-of-the-art results with 72.3 and 74.2 for 10% and 20% weights.

kev says

Great! Good on you for checking.

Also I recommend you could also cite “DropBack”(https://www.sysml.cc/doc/2019/135.pdf, https://arxiv.org/abs/1806.06949) which is another sparse learning procedure.

Tim Dettmers says

Thanks for the link, I did not know about dropback. I will have a look at it and see if I can incorporate it into the next draft.

kev says

Also, do you also have the top5 numbers for the rerun experiments? Just curious.

Kelvin says

Hi Tim,

I wonder how your method performs when without tricks such as label smoothing and warmup? I think normally trained resenet50 should have a top-1 accuracy around 75~76. (90 epochs). Thanks

Tim Dettmers says

This was a weakness in the first draft of the paper. I will post the new version soon. The results without label smoothing/warmup/cosine schedule are 72.3 and 74.2 Top 1 for 10% and 20% weights. However, I ran this with 16-bit and it seems that this decreases performance (from some tests on CIFAR-10). I will rerun it with 32-bit just to make sure it is right.

User says

I feel as though your title and some of your claims are misleading, could you clarify please? Reading through the paper, you actually do lose performance on models that you test, including AlexNet, VGG16, and WRN as seen in Table 2, so how can you claim that this is faster training without losing performance? From what I understand, a lottery ticket is a subnetwork that performs strictly greater than or equal to the original performance of the network, which this doesn t seem to do. Although I do see that you have improved upon previous pruning attempts such as SNIP, this still doesn t mean that you are training sparse networks without losing performance. The accuracies are especially apparent in ImageNet tests, where your algorithm produces models that are anywhere from 2-6% worse, a significant margin! I may be reading the results incorrectly, and would love some clarification, but I don t think it s accurate or correct to state that you are not losing performance with this technique, especially when you say that you can find lottery tickets – something with a very clear definition – from scratch. Perhaps more work needs to be done in this technique, but this is just not true as it stands now.

Tim Dettmers says

Harsh but fair feedback. I got similar feedback from reviewers and the new draft addresses these problems. However, this blog post is currently outdated. I will probably update it sometime next week.

kevin says

Hi Tim,

In the new version of your paper on page 2 you say that “DSR requires some specific layers to be dense.” However, as with your method, it doesn’t actually require any layers to be dense. In their paper, they just set some layers to be dense because they are relatively small, but actually there is no reason in principle or implementation-wise that these layers cannot be dense – it was just a choice made in their paper for the experiments. So, I think this is an unfair claim as DSR can also work in what you term the “fully sparse” setting.

Tim Dettmers says

I think your point is fair. However, there is no data how DSR performs with these layers sparse. From my experiments, I know that these are the most important layers in the network, and it is unclear if the authors selected these to be dense because performance is bad otherwise, or because they have few parameters. The argument that these layers have few parameters is not always true. The last downsample convolutional layer for ResNet-50 on ImageNet is one of the largest (if not the largest) convolutional layer in terms of parameters. From these data, it is unclear to me if DSR would readily generalize to other networks where it is less clear which layers are most important. Thus I think the argument presented in our paper is fair.

Ivan says

Dear Tim,

Very exciting work! Has this results been published to any conferences? Will love to see the final version!

Tim Dettmers says

I submitted this to NeurIPS, but it has been rejected on the ground that I implemented too many changes and mentioned the lottery ticket hypothesis without further experiments comparing to it. However, the reviewers have been very positive overall. I will resubmit this work to ICLR.

Zhang says

This algorithm seems to require that you store momentum values for all of the weights, even the ones that are zero. This means even if you have sparse weights the momentum will be dense so the memory for that will not be reduced. But some earlier works like the one from Mostafa and Wang does not need to maintain these momentum values so both the network and the momentum could be sparse. Thus this one will use more memory. Please correct if I am misunderstanding.

Tim Dettmers says

That is correct. However, my work looks at training time and not memory. If you want to save memory, there are much much better ways of doing just that. If you want to save memory one should not really look at weights for a solution, but rather at activations. Gradient checkpointing and recomputation are the most effective techniques to save memory. They will save 20x more memory than sparse weights at no performance cost and minimal computational costs.

Wenjie says

Excellent piece of work. Would you please elaborate a bit more on your attempts on applying it in transformers, and NLP tasks in general? You briefly mentioned that “Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision”. Makes sense, but it doesn’t appear as unsurprising as it sounds. Thanks.

Tim Dettmers says

What I found is the following:

– One needs much more weights to come anywhere near dense baseline performance.

– There is much more variability regarding how parameters shift between layers over time, while the distribution is relatively static in computer vision after a couple of epochs.

– Having different rates of parameter redistribution was useful if one increases the rate slowly from input layer to output layer (one needs to build stable lower-level features before one can build good high level features?)

That is mostly it. I guess these observations stem from (1) computer vision has “static inputs” of predefined structure which is unchanging. The fixed structure in NLP are just one-hot vectors which have very little information; word embeddings themselves are highly variable. (2) Just having a high number of output classes complicates things. One sees the first sign of that if you compare CIFAR-10 and ImageNet. BERT uses two orders of magnitude more labels than ImageNet and has complex combinations between labels within a sequence.

Hope that helps!

Sachin says

Hi ! Thank you for describing the algorithm. Definitely looks interesting 🙂

I have a question though, about the regrowing strategy. Maybe I mis-understood it. Your picture suggests pruning down from 16 to 8 weights and then regrowing the network back to 16 weights. How are you then calling the network sparse ? Can you please clarify ?

Thank you!

Tim Dettmers says

I start with a sparse network, so not adding any weights will maintain its overall aggregate sparsity: Lets say you have 100 total weights, then I start with 16 weights and during the pruning stage, I remove 8 weights and regrow them elsewhere. Thus, in this case, I keep the same amount of weights, but because I started with 16% of weights to begin with, the network remains sparse.

John says

Hi Tim – do you know of any good benchmarks that compare algorithmic efficiency over the past few years? I’m thinking of something like ImageNet vs. a modern image recognition algorithm – how much of a speedup have we gotten purely from the algorithmic side of things, rather than hardware? I know there are some difficulties in making a perfect comparison, but it seems like the gist should be obtainable – you should at least be able to do something like “time to get ImageNet’s performance to convergence” vs. “time to get a modern algorithm to that performance level”. Thanks!

Tim Dettmers says

There are some benchmarks but they saturated a couple of years ago and you no longer see improvements. Here some benchmarks:

https://github.com/jcjohnson/cnn-benchmarks

https://github.com/soumith/convnet-benchmarks

https://github.com/baidu-research/DeepBench