<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism	</title>
	<atom:link href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:42:01 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28307</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 10 Feb 2018 08:14:54 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-28307</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270&quot;&gt;Vineet Gundecha&lt;/a&gt;.

For model parallelism, the updates happen on each GPU independently.

For data parallelism, the updates are accumulated for each GPU and each GPU updates its own parameters (with the same update as any other GPU). This is a bit faster than doing the updates on the CPU.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270">Vineet Gundecha</a>.</p>
<p>For model parallelism, the updates happen on each GPU independently.</p>
<p>For data parallelism, the updates are accumulated for each GPU and each GPU updates its own parameters (with the same update as any other GPU). This is a bit faster than doing the updates on the CPU.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Vineet Gundecha		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270</link>

		<dc:creator><![CDATA[Vineet Gundecha]]></dc:creator>
		<pubDate>Fri, 09 Feb 2018 22:40:20 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-28270</guid>

					<description><![CDATA[Thank you for the informative post! 
I would like to know where the parameter update happens for model and data parallelism. 
In model parallelism, each GPU stores the gradients for the parameters that reside on that GPU. So, I guess the parameters are updated on the respective GPU accordingly. But for data parallelism, the gradients for each sub-batch are accumulated and the CPU performs the update?]]></description>
			<content:encoded><![CDATA[<p>Thank you for the informative post!<br />
I would like to know where the parameter update happens for model and data parallelism.<br />
In model parallelism, each GPU stores the gradients for the parameters that reside on that GPU. So, I guess the parameters are updated on the respective GPU accordingly. But for data parallelism, the gradients for each sub-batch are accumulated and the CPU performs the update?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20469</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 15:03:57 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20469</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466&quot;&gt;redfish&lt;/a&gt;.

I see, thanks for the detailed and clear reply, thanks a lot~]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466">redfish</a>.</p>
<p>I see, thanks for the detailed and clear reply, thanks a lot~</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20468</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 15:01:54 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20468</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466&quot;&gt;redfish&lt;/a&gt;.

See my new reply. Does this help? Let me know if it is still unclear!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466">redfish</a>.</p>
<p>See my new reply. Does this help? Let me know if it is still unclear!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20467</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:59:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20467</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090&quot;&gt;redfish&lt;/a&gt;.

Thanks for spotting the mistake, indeed it is a 128x250 matrix for C1 and C2 and they are then stacked.

The parameters come from both the forward (activation) and backward pass (errors):
128x500 (forward split-stack) + 128x500 (backward split-add) = 64000 + 64000 = 128000
128x1000 (forward split-add) + 128x250 (backward split-stack) = 128000 + 32000 = 160000

Hope that helps!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090">redfish</a>.</p>
<p>Thanks for spotting the mistake, indeed it is a 128&#215;250 matrix for C1 and C2 and they are then stacked.</p>
<p>The parameters come from both the forward (activation) and backward pass (errors):<br />
128&#215;500 (forward split-stack) + 128&#215;500 (backward split-add) = 64000 + 64000 = 128000<br />
128&#215;1000 (forward split-add) + 128&#215;250 (backward split-stack) = 128000 + 32000 = 160000</p>
<p>Hope that helps!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:58:31 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20466</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945&quot;&gt;Alok&lt;/a&gt;.

Hey Alok,
So did you figure out where the 128000/160000 come from?
I still have no idea, and this kind of model parallelism seems to be a manual partitioning of the model.

I would appreciate it if you shared your insight, thanks]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945">Alok</a>.</p>
<p>Hey Alok,<br />
So did you figure out where the 128000/160000 come from?<br />
I still have no idea, and this kind of model parallelism seems to be a manual partitioning of the model.</p>
<p>I would appreciate it if you shared your insight, thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20465</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:46:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20465</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945&quot;&gt;Alok&lt;/a&gt;.

This is correct, thanks for spotting this!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945">Alok</a>.</p>
<p>This is correct, thanks for spotting this!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Alok		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945</link>

		<dc:creator><![CDATA[Alok]]></dc:creator>
		<pubDate>Fri, 25 Aug 2017 08:55:19 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-19945</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-3453&quot;&gt;Tim Dettmers&lt;/a&gt;.

I think there is a small correction: the matrices C1 and C2 should be of size 128x250, since the matrix B (1000x500) is split into 2 matrices of 1000x250, and when these are multiplied with 128x1000, they produce 2 matrices of 128x250.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-3453">Tim Dettmers</a>.</p>
<p>I think there is a small correction: the matrices C1 and C2 should be of size 128&#215;250, since the matrix B (1000&#215;500) is split into 2 matrices of 1000&#215;250, and when these are multiplied with 128&#215;1000, they produce 2 matrices of 128&#215;250.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Sun, 13 Aug 2017 16:00:10 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-19090</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-7402&quot;&gt;Tim Dettmers&lt;/a&gt;.

Hey, thanks for those replies, but
how do you get results like 128000/160000?

Take the above as an example:
Standard: 128×1000 dot 1000×500 = 128×500
Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -&#062; stack matrices

As you mention in the reply above,
&quot;...model parallelism is to split your parameters among all nodes, so that you have two 1000×250 matricies B1 and B2. When you matrix multiply this with the input you will get two matricies C1 and C2 with dimensions 128×500. You stack these matricies to get the full output matrix of size 128×1000.&quot;

In my understanding:
A(128x1000) dot W(1000x500) can be split as
A dot B1(1000x250) = C1(128x250)
A dot B2(1000x250) = C2(128x250)
and stacking C1, C2 together gives a result of C(128x500).

I&#039;m not sure why C1/C2 have dimensions 128x500 in your reply.

And if my understanding is right (or maybe you&#039;re right),
I still don&#039;t know how model parallelism syncs through activations, or where the calculations of 128000/160000 come from.

Thanks a lot for helping me understand this!!
Appreciated.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-7402">Tim Dettmers</a>.</p>
<p>Hey, thanks for those replies, but<br />
how do you get results like 128000/160000?</p>
<p>Take the above as an example:<br />
Standard: 128×1000 dot 1000×500 = 128×500<br />
Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -&gt; stack matrices</p>
<p>As you mention in the reply above,<br />
&#8220;&#8230;model parallelism is to split your parameters among all nodes, so that you have two 1000×250 matricies B1 and B2. When you matrix multiply this with the input you will get two matricies C1 and C2 with dimensions 128×500. You stack these matricies to get the full output matrix of size 128×1000.&#8221;</p>
<p>In my understanding:<br />
A(128&#215;1000) dot W(1000&#215;500) can be split as<br />
A dot B1(1000&#215;250) = C1(128&#215;250)<br />
A dot B2(1000&#215;250) = C2(128&#215;250)<br />
and stacking C1, C2 together gives a result of C(128&#215;500).</p>
<p>I&#8217;m not sure why C1/C2 have dimensions 128&#215;500 in your reply.</p>
<p>And if my understanding is right (or maybe you&#8217;re right),<br />
I still don&#8217;t know how model parallelism syncs through activations, or where the calculations of 128000/160000 come from.</p>
<p>Thanks a lot for helping me understand this!!<br />
Appreciated.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16128</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 16 Jun 2017 18:11:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-16128</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16106&quot;&gt;SK06&lt;/a&gt;.

I assume with model splitting you mean splitting the model among multiple GPUs. This is also called model parallelism.
PyTorch supports this quite well: https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778
It does not seem that Caffe supports model parallelism yet: https://github.com/BVLC/caffe/issues/876]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16106">SK06</a>.</p>
<p>I assume with model splitting you mean splitting the model among multiple GPUs. This is also called model parallelism.<br />
PyTorch supports this quite well: <a href="https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778" rel="nofollow ugc">https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778</a><br />
It does not seem that Caffe supports model parallelism yet: <a href="https://github.com/BVLC/caffe/issues/876" rel="nofollow ugc">https://github.com/BVLC/caffe/issues/876</a></p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
