Sparse Networks from Scratch: Faster Training without Losing Performance

This blog post is about my work with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.