<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: Sparse Networks from Scratch: Faster Training without Losing Performance	</title>
	<atom:link href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sat, 19 Sep 2020 16:32:27 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-77659</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 15 Sep 2020 05:03:10 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-77659</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-76185&quot;&gt;Chew Kok Wah&lt;/a&gt;.

I do not think Graphcore will be able to take a lot of market share from NVIDIA. The problem right now is that the most valuable models in the industry are very large transformers, which are difficult to train in the small memory of the Graphcore IPUs. Does that work on the 2nd-generation processors? It is unclear.

Graphcore&#039;s marketing is very confusing. At one point they say 450 GB of memory, at another 16 GB per IPU, then TB/s of bandwidth, and at yet another point DDR4 DIMMs.

To me, this sounds like they just added DDR4 DIMM modules addressable by the IPUs with a bandwidth of 20-40 GB/s. Will that be enough to be useful?

At those bandwidths, it might be possible to train large models, but the processor does not seem to offer an advantage over GPUs if you want to train very large transformers.

Otherwise, Graphcore processors are great for computer vision and also for high-frequency trading, and I can see them snagging a bit of market share from NVIDIA in those areas.

This is at least my best guess based on the data that is verifiable. My opinion could change quite a bit depending on what information becomes available over time.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-76185">Chew Kok Wah</a>.</p>
<p>I do not think Graphcore will be able to take a lot of market share from NVIDIA. The problem right now is that the most valuable models in the industry are very large transformers, which are difficult to train in the small memory of the Graphcore IPUs. Does that work on the 2nd-generation processors? It is unclear.</p>
<p>Graphcore&#8217;s marketing is very confusing. At one point they say 450 GB of memory, at another 16 GB per IPU, then TB/s of bandwidth, and at yet another point DDR4 DIMMs.</p>
<p>To me, this sounds like they just added DDR4 DIMM modules addressable by the IPUs with a bandwidth of 20-40 GB/s. Will that be enough to be useful?</p>
<p>At those bandwidths, it might be possible to train large models, but the processor does not seem to offer an advantage over GPUs if you want to train very large transformers.</p>
<p>Otherwise, Graphcore processors are great for computer vision and also for high-frequency trading, and I can see them snagging a bit of market share from NVIDIA in those areas.</p>
<p>This is at least my best guess based on the data that is verifiable. My opinion could change quite a bit depending on what information becomes available over time.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Chew Kok Wah		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-76185</link>

		<dc:creator><![CDATA[Chew Kok Wah]]></dc:creator>
		<pubDate>Wed, 12 Aug 2020 12:34:12 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-76185</guid>

					<description><![CDATA[With Graphcore&#039;s recently launched 2nd-generation AI chip, which claims 3x the performance and 10x the memory of an equivalently priced NVIDIA A100, do you see sparse learning finally seeing the light?
Do you think Graphcore will be able to take a sizable share of the training market from NVIDIA?

https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale
http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2020/07/mk2pricing.jpg]]></description>
			<content:encoded><![CDATA[<p>With Graphcore&#8217;s recently launched 2nd-generation AI chip, which claims 3x the performance and 10x the memory of an equivalently priced NVIDIA A100, do you see sparse learning finally seeing the light?<br />
Do you think Graphcore will be able to take a sizable share of the training market from NVIDIA?</p>
<p><a href="https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale" rel="nofollow ugc">https://www.graphcore.ai/posts/introducing-second-generation-ipu-systems-for-ai-at-scale</a><br />
<a href="http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2020/07/mk2pricing.jpg" rel="nofollow ugc">http://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2020/07/mk2pricing.jpg</a></p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-69198</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 26 Feb 2020 18:15:14 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-69198</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-68770&quot;&gt;John&lt;/a&gt;.

There are some benchmarks, but they saturated a couple of years ago and you no longer see improvements. Here are some benchmarks:

https://github.com/jcjohnson/cnn-benchmarks
https://github.com/soumith/convnet-benchmarks
https://github.com/baidu-research/DeepBench]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-68770">John</a>.</p>
<p>There are some benchmarks, but they saturated a couple of years ago and you no longer see improvements. Here are some benchmarks:</p>
<p><a href="https://github.com/jcjohnson/cnn-benchmarks" rel="nofollow ugc">https://github.com/jcjohnson/cnn-benchmarks</a><br />
<a href="https://github.com/soumith/convnet-benchmarks" rel="nofollow ugc">https://github.com/soumith/convnet-benchmarks</a><br />
<a href="https://github.com/baidu-research/DeepBench" rel="nofollow ugc">https://github.com/baidu-research/DeepBench</a></p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: John		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-68770</link>

		<dc:creator><![CDATA[John]]></dc:creator>
		<pubDate>Thu, 13 Feb 2020 22:48:41 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-68770</guid>

					<description><![CDATA[Hi Tim - do you know of any good benchmarks that compare algorithmic efficiency over the past few years? I&#039;m thinking of something like ImageNet vs. a modern image recognition algorithm - how much of a speedup have we gotten purely from the algorithmic side of things, rather than hardware? I know there are some difficulties in making a perfect comparison, but it seems like the gist should be obtainable - you should at least be able to do something like &quot;time to get ImageNet&#039;s performance to convergence&quot; vs. &quot;time to get a modern algorithm to that performance level&quot;. Thanks!]]></description>
			<content:encoded><![CDATA[<p>Hi Tim &#8211; do you know of any good benchmarks that compare algorithmic efficiency over the past few years? I&#8217;m thinking of something like ImageNet vs. a modern image recognition algorithm &#8211; how much of a speedup have we gotten purely from the algorithmic side of things, rather than hardware? I know there are some difficulties in making a perfect comparison, but it seems like the gist should be obtainable &#8211; you should at least be able to do something like &#8220;time to get ImageNet&#8217;s performance to convergence&#8221; vs. &#8220;time to get a modern algorithm to that performance level&#8221;. Thanks!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-64350</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 22 Oct 2019 01:28:52 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-64350</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-64163&quot;&gt;Sachin&lt;/a&gt;.

I start with a sparse network, so as long as no weights are added, the overall aggregate sparsity is maintained. Let&#039;s say the dense network would have 100 total weights; I start with 16 weights, and during the pruning stage I remove 8 weights and regrow 8 elsewhere. Thus, in this case, I keep the same number of weights, but because I started with 16% of the weights to begin with, the network remains sparse.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-64163">Sachin</a>.</p>
<p>I start with a sparse network, so as long as no weights are added, the overall aggregate sparsity is maintained. Let&#8217;s say the dense network would have 100 total weights; I start with 16 weights, and during the pruning stage I remove 8 weights and regrow 8 elsewhere. Thus, in this case, I keep the same number of weights, but because I started with 16% of the weights to begin with, the network remains sparse.</p>
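<p>For illustration, a minimal PyTorch-style sketch of one such prune-and-regrow step (the function name and shapes are made up for this example, and the full sparse momentum algorithm additionally redistributes weights across layers):</p>
<pre><code>import torch

def prune_and_regrow(weight, mask, momentum, prune_rate=0.5):
    """Keep the number of nonzero weights fixed, so overall sparsity never changes."""
    k = int(prune_rate * mask.sum().item())   # e.g. prune 8 of 16 weights

    # Prune: deactivate the k smallest-magnitude active weights.
    w = weight.abs().masked_fill(~mask, float('inf'))
    drop = torch.topk(w.view(-1), k, largest=False).indices
    mask.view(-1)[drop] = False

    # Regrow: enable the k inactive positions where the momentum
    # magnitude is largest, i.e. where the gradient is most persistent.
    m = momentum.abs().masked_fill(mask, float('-inf'))
    grow = torch.topk(m.view(-1), k, largest=True).indices
    mask.view(-1)[grow] = True

    # Regrown weights start at zero; pruned positions are zeroed out.
    weight.masked_fill_(~mask, 0.0)
    return weight, mask

# 100 total weights, 16 active: density stays at 16% after the step.
mask = torch.zeros(100, dtype=torch.bool)
mask[torch.randperm(100)[:16]] = True
mask = mask.view(10, 10)
weight = torch.randn(10, 10) * mask
momentum = torch.randn(10, 10)                # optimizer's momentum buffer
weight, mask = prune_and_regrow(weight, mask, momentum)
assert mask.sum().item() == 16
</code></pre>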
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-64344</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 22 Oct 2019 01:09:11 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-64344</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-63735&quot;&gt;Wenjie&lt;/a&gt;.

What I found is the following:
 - One needs many more weights to come anywhere near the dense baseline performance.
 - There is much more variability in how parameters shift between layers over time, whereas in computer vision the distribution is relatively static after a couple of epochs.
 - Having different rates of parameter redistribution was useful if one increases the rate slowly from the input layer to the output layer (one needs to build stable lower-level features before one can build good high-level features?).

That is mostly it. I guess these observations stem from two things: (1) computer vision has &quot;static inputs&quot; with a predefined, unchanging structure, while the only fixed structure in NLP is one-hot vectors, which carry very little information; word embeddings themselves are highly variable. (2) Just having a high number of output classes complicates things. One sees the first sign of this when comparing CIFAR-10 to ImageNet, and BERT uses two orders of magnitude more labels than ImageNet, with complex combinations between labels within a sequence.

Hope that helps!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-63735">Wenjie</a>.</p>
<p>What I found is the following:<br />
 &#8211; One needs many more weights to come anywhere near the dense baseline performance.<br />
 &#8211; There is much more variability in how parameters shift between layers over time, whereas in computer vision the distribution is relatively static after a couple of epochs.<br />
 &#8211; Having different rates of parameter redistribution was useful if one increases the rate slowly from the input layer to the output layer (one needs to build stable lower-level features before one can build good high-level features?).</p>
<p>That is mostly it. I guess these observations stem from two things: (1) computer vision has &#8220;static inputs&#8221; with a predefined, unchanging structure, while the only fixed structure in NLP is one-hot vectors, which carry very little information; word embeddings themselves are highly variable. (2) Just having a high number of output classes complicates things. One sees the first sign of this when comparing CIFAR-10 to ImageNet, and BERT uses two orders of magnitude more labels than ImageNet, with complex combinations between labels within a sequence.</p>
<p>Hope that helps!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Sachin		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-64163</link>

		<dc:creator><![CDATA[Sachin]]></dc:creator>
		<pubDate>Thu, 17 Oct 2019 02:13:10 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-64163</guid>

					<description><![CDATA[Hi ! Thank you for describing the algorithm. Definitely looks interesting :) 

I have a question though, about the regrowing strategy. Maybe I mis-understood it. Your picture suggests pruning down from 16 to 8 weights and then regrowing the network back to 16 weights. How are you then calling the network sparse ? Can you please clarify ? 

Thank you!]]></description>
			<content:encoded><![CDATA[<p>Hi! Thank you for describing the algorithm. Definitely looks interesting 🙂</p>
<p>I have a question, though, about the regrowing strategy. Maybe I misunderstood it. Your picture suggests pruning down from 16 to 8 weights and then regrowing the network back to 16 weights. How can you then call the network sparse? Can you please clarify?</p>
<p>Thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Wenjie		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-63735</link>

		<dc:creator><![CDATA[Wenjie]]></dc:creator>
		<pubDate>Mon, 07 Oct 2019 19:48:59 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-63735</guid>

					<description><![CDATA[Excellent piece of work. Would you please elaborate a bit more on your attempts at applying it to transformers, and to NLP tasks in general? You briefly mentioned that &quot;Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision&quot;. That makes sense, but it doesn&#039;t appear as unsurprising as it sounds. Thanks.]]></description>
			<content:encoded><![CDATA[<p>Excellent piece of work. Would you please elaborate a bit more on your attempts at applying it to transformers, and to NLP tasks in general? You briefly mentioned that &#8220;Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision&#8221;. That makes sense, but it doesn&#8217;t appear as unsurprising as it sounds. Thanks.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-62665</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sun, 22 Sep 2019 16:44:55 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-62665</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-62636&quot;&gt;Zhang&lt;/a&gt;.

That is correct. However, my work looks at training time, not memory. If you want to save memory, there are much better ways of doing that: one should not look at the weights for a solution, but rather at the activations. Gradient checkpointing and recomputation are the most effective techniques here; they will save 20x more memory than sparse weights at no performance cost and minimal computational cost.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-62636">Zhang</a>.</p>
<p>That is correct. However, my work looks at training time, not memory. If you want to save memory, there are much better ways of doing that: one should not look at the weights for a solution, but rather at the activations. Gradient checkpointing and recomputation are the most effective techniques here; they will save 20x more memory than sparse weights at no performance cost and minimal computational cost.</p>
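<p>As a rough illustration, a minimal PyTorch sketch of gradient checkpointing (the layer sizes and segment count here are arbitrary): only the activations at segment boundaries are kept, and the rest are recomputed during the backward pass.</p>
<pre><code>import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers whose activations would normally all be kept
# alive in memory until the backward pass.
model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
      for _ in range(24)]
)
x = torch.randn(32, 1024, requires_grad=True)

# Store activations only at 4 segment boundaries; recompute the rest
# during backward, trading a little compute for a lot of memory.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
</code></pre>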
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Zhang		</title>
		<link>https://timdettmers.com/2019/07/11/sparse-networks-from-scratch/comment-page-1/#comment-62636</link>

		<dc:creator><![CDATA[Zhang]]></dc:creator>
		<pubDate>Sun, 22 Sep 2019 05:25:53 +0000</pubDate>
		<guid isPermaLink="false">https://timdettmers.com/?p=774#comment-62636</guid>

					<description><![CDATA[This algorithm seems to require that you store momentum values for all of the weights, even the ones that are zero. This means that even if you have sparse weights, the momentum will be dense, so the memory for it will not be reduced. But some earlier works, like the one from Mostafa and Wang, do not need to maintain these momentum values, so both the network and the momentum can be sparse. Thus this one will use more memory. Please correct me if I am misunderstanding.]]></description>
			<content:encoded><![CDATA[<p>This algorithm seems to require that you store momentum values for all of the weights, even the ones that are zero. This means that even if you have sparse weights, the momentum will be dense, so the memory for it will not be reduced. But some earlier works, like the one from Mostafa and Wang, do not need to maintain these momentum values, so both the network and the momentum can be sparse. Thus this one will use more memory. Please correct me if I am misunderstanding.</p>
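<p>To make the concern concrete, a small PyTorch sketch (shapes and density arbitrary) of how SGD&#8217;s momentum buffer stays dense no matter how sparse the weights are:</p>
<pre><code>import torch

# A parameter that is ~90% zeros.
w = torch.nn.Parameter(torch.randn(1000, 1000))
with torch.no_grad():
    w.mul_(torch.rand_like(w).lt(0.1).float())
print(w.detach().ne(0).float().mean())   # ~0.1: the weights are sparse

opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)
w.sum().backward()
opt.step()

# The momentum buffer still has the full dense shape of the parameter.
print(opt.state[w]['momentum_buffer'].shape)   # torch.Size([1000, 1000])
</code></pre>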
]]></content:encoded>
		
			</item>
	</channel>
</rss>
