<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>
<channel>
	<title>
	Comments on: How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism	</title>
	<atom:link href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:42:15 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Kalyan		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63530</link>

		<dc:creator><![CDATA[Kalyan]]></dc:creator>
		<pubDate>Fri, 04 Oct 2019 01:21:26 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63530</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498&quot;&gt;Tim Dettmers&lt;/a&gt;.

Got it...Thanks]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498">Tim Dettmers</a>.</p>
<p>Got it&#8230;Thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 03 Oct 2019 13:40:38 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63498</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481&quot;&gt;Kalyan&lt;/a&gt;.

The idea is that if you increase the time delta between computing the weight updates for one layer and the next, then you have more time to synchronize the weights of the next layer during backpropagation. Thus you can hide more communication under gradient computation if you use optimization techniques that are more expensive to compute (Adam or, in this case, RMSProp).]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481">Kalyan</a>.</p>
<p>The idea is that if you increase the time delta between computing the weight updates for one layer and the next, then you have more time to synchronize the weights of the next layer during backpropagation. Thus you can hide more communication under gradient computation if you use optimization techniques that are more expensive to compute (Adam or, in this case, RMSProp).</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Kalyan		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481</link>

		<dc:creator><![CDATA[Kalyan]]></dc:creator>
		<pubDate>Thu, 03 Oct 2019 04:15:06 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63481</guid>

					<description><![CDATA[This part is not clear:
&quot;Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.&quot;
Exactly how do the gradients take the same time to be passed to each other?]]></description>
			<content:encoded><![CDATA[<p>This part is not clear:<br />
&#8220;Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.&#8221;<br />
Exactly how do the gradients take the same time to be passed to each other?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9588</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 10 Nov 2016 21:19:30 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-9588</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556&quot;&gt;Yogita&lt;/a&gt;.

Thank you for pointing out that error! I corrected the link&#039;s address and it should work now.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556">Yogita</a>.</p>
<p>Thank you for pointing out that error! I corrected the link&#8217;s address and it should work now.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Yogita		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556</link>

		<dc:creator><![CDATA[Yogita]]></dc:creator>
		<pubDate>Wed, 09 Nov 2016 09:48:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-9556</guid>

					<description><![CDATA[Your blog is really useful, but your link to the previous post is unavailable. Where can I read that post?]]></description>
			<content:encoded><![CDATA[<p>Your blog is really useful, but your link to the previous post is unavailable. Where can I read that post?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Masoud		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-8908</link>

		<dc:creator><![CDATA[Masoud]]></dc:creator>
		<pubDate>Fri, 07 Oct 2016 12:06:32 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-8908</guid>

					<description><![CDATA[Hi,

Thanks for your nice explanation.
Regarding the memory tiles section, I noticed the same degradation when using a batch size smaller than 32.
I am wondering if you have any reference that I can follow?]]></description>
			<content:encoded><![CDATA[<p>Hi,</p>
<p>Thanks for your nice explanation.<br />
Regarding the memory tiles section, I noticed the same degradation when using a batch size smaller than 32.<br />
I am wondering if you have any reference that I can follow?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Justin		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7940</link>

		<dc:creator><![CDATA[Justin]]></dc:creator>
		<pubDate>Tue, 30 Aug 2016 21:25:04 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7940</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932&quot;&gt;Bafu&lt;/a&gt;.

Yeah, I noticed the same thing. 

The equation listed in the article is incorrect, but the one Bafu listed isn&#039;t quite right either. For example, the equation in the article divides by 40 twice, which is wrong, and the last 1024 is a 102, which is also wrong. In addition, to convert from seconds to milliseconds, you have to multiply the numerator by 1000, not divide by 1000.

And as Bafu points out, you need to multiply the numerator by 8 to convert the weight matrix from bytes to bits, not divide by it. Bafu&#039;s equation missed converting from seconds to ms, which requires an additional *1000 in the numerator.

I believe the correct equation is:

(4 * 8 * 1000 * 1000 * 1000) / (40 * 1024 ^ 3)

= ~0.74ms]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932">Bafu</a>.</p>
<p>Yeah, I noticed the same thing. </p>
<p>The equation listed in the article is incorrect, but the one Bafu listed isn&#8217;t quite right either. For example, the equation in the article divides by 40 twice, which is wrong, and the last 1024 is a 102, which is also wrong. In addition, to convert from seconds to milliseconds, you have to multiply the numerator by 1000, not divide by 1000.</p>
<p>And as Bafu points out, you need to multiply the numerator by 8 to convert the weight matrix from bytes to bits, not divide by it. Bafu&#8217;s equation missed converting from seconds to ms, which requires an additional *1000 in the numerator.</p>
<p>I believe the correct equation is:</p>
<p>(4 * 8 * 1000 * 1000 * 1000) / (40 * 1024 ^ 3)</p>
<p>= ~0.74ms</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7212</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 04 Aug 2016 04:46:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7212</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177&quot;&gt;vinhomes riverside hai phong&lt;/a&gt;.

Thank you!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177">vinhomes riverside hai phong</a>.</p>
<p>Thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: vinhomes riverside hai phong		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177</link>

		<dc:creator><![CDATA[vinhomes riverside hai phong]]></dc:creator>
		<pubDate>Tue, 02 Aug 2016 13:05:47 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7177</guid>

					<description><![CDATA[Hello there, you have done an incredible job. I will definitely digg it and personally suggest it to my friends.

I&#039;m confident they&#039;ll benefit from this site.]]></description>
			<content:encoded><![CDATA[<p>Hello there, you have done an incredible job. I will definitely digg it and personally suggest it to my friends.</p>
<p>I&#8217;m confident they&#8217;ll benefit from this site.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Bafu		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932</link>

		<dc:creator><![CDATA[Bafu]]></dc:creator>
		<pubDate>Mon, 28 Mar 2016 08:00:20 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-2932</guid>

					<description><![CDATA[Hi Tim,

I like all your articles very much, and here is a little correction for the formula calculating the time to pass a weight matrix:

    passing time = weight matrix size / network bandwidth = (1000*1000*4*8) / (40*1024^3) = 0.75 (ms)

Thanks for sharing!]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,</p>
<p>I like all your articles very much, and here is a little correction for the formula calculating the time to pass a weight matrix:</p>
<p>    passing time = weight matrix size / network bandwidth = (1000*1000*4*8) / (40*1024^3) = 0.75 (ms)</p>
<p>Thanks for sharing!</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
