<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism	</title>
	<atom:link href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:42:01 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28307</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 10 Feb 2018 08:14:54 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-28307</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270&quot;&gt;Vineet Gundecha&lt;/a&gt;.

For model parallelism, the updates happen on each GPU independently.

For data parallelism, the updates are accumulated for each GPU and each GPU updates its own parameters (with the same update as any other GPU). This is a bit faster than doing the updates on the CPU.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270">Vineet Gundecha</a>.</p>
<p>For model parallelism, the updates happen on each GPU independently.</p>
<p>For data parallelism, the updates are accumulated for each GPU and each GPU updates its own parameters (with the same update as any other GPU). This is a bit faster than doing the updates on the CPU.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Vineet Gundecha		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-28270</link>

		<dc:creator><![CDATA[Vineet Gundecha]]></dc:creator>
		<pubDate>Fri, 09 Feb 2018 22:40:20 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-28270</guid>

					<description><![CDATA[Thank you for the informative post! 
I would like to know where the parameter update happens for model and data parallelism. 
In model parallelism, each GPU stores the gradients for the parameters that reside on that GPU. So, I guess the parameters are updated on the respective GPU accordingly. But for data parallelism, the gradients for each sub-batch are accumulated and the CPU performs the update?]]></description>
			<content:encoded><![CDATA[<p>Thank you for the informative post!<br />
I would like to know where the parameter update happens for model and data parallelism.<br />
In model parallelism, each GPU stores the gradients for the parameters that reside on that GPU. So, I guess the parameters are updated on the respective GPU accordingly. But for data parallelism, the gradients for each sub-batch are accumulated and the CPU performs the update?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20469</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 15:03:57 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20469</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466&quot;&gt;redfish&lt;/a&gt;.

I see, thanks for the detailed and clear reply, thanks a lot~]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466">redfish</a>.</p>
<p>I see, thanks for the detailed and clear reply, thanks a lot~</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20468</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 15:01:54 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20468</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466&quot;&gt;redfish&lt;/a&gt;.

See my new reply. Does this help? Let me know if it is still unclear!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466">redfish</a>.</p>
<p>See my new reply. Does this help? Let me know if it is still unclear!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20467</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:59:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20467</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090&quot;&gt;redfish&lt;/a&gt;.

Thanks for spotting the mistake, indeed it is a 128x250 matrix for C1 and C2 and they are then stacked.

The parameters come from both the forward (activation) and backward pass (errors):
128x500 (forward split-stack) + 128x500 (backward split-add) = 64000 + 64000 = 128000
128x1000 (forward split-add) + 128x250 (backward split-stack) = 128000 + 32000 = 160000

Hope that helps!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090">redfish</a>.</p>
<p>Thanks for spotting the mistake, indeed it is a 128&#215;250 matrix for C1 and C2 and they are then stacked.</p>
<p>The parameters come from both the forward (activation) and backward pass (errors):<br />
128&#215;500 (forward split-stack) + 128&#215;500 (backward split-add) = 64000 + 64000 = 128000<br />
128&#215;1000 (forward split-add) + 128&#215;250 (backward split-stack) = 128000 + 32000 = 160000</p>
<p>Hope that helps!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20466</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:58:31 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20466</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945&quot;&gt;Alok&lt;/a&gt;.

Hey Alok,
So did you figure out where the 128000/160000 come from?
I still have no idea, and this kind of model parallelism seems to be a manual partitioning of the model.

I would appreciate it if you shared your insight, thanks]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945">Alok</a>.</p>
<p>Hey Alok,<br />
So did you figure out where the 128000/160000 come from?<br />
I still have no idea, and this kind of model parallelism seems to be a manual partitioning of the model.</p>
<p>I would appreciate it if you shared your insight, thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-20465</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 01 Sep 2017 14:46:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-20465</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945&quot;&gt;Alok&lt;/a&gt;.

This is correct, thanks for spotting this!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945">Alok</a>.</p>
<p>This is correct, thanks for spotting this!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Alok		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19945</link>

		<dc:creator><![CDATA[Alok]]></dc:creator>
		<pubDate>Fri, 25 Aug 2017 08:55:19 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-19945</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-3453&quot;&gt;Tim Dettmers&lt;/a&gt;.

I think there is a small correction: the matrices C1 and C2 should be of size 128x250, since the matrix B (1000x500) is split into 2 matrices of 1000x250, and when these are multiplied with 128x1000, they produce 2 matrices of 128x250.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-3453">Tim Dettmers</a>.</p>
<p>I think there is a small correction: the matrices C1 and C2 should be of size 128&#215;250, since the matrix B (1000&#215;500) is split into 2 matrices of 1000&#215;250, and when these are multiplied with 128&#215;1000, they produce 2 matrices of 128&#215;250.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: redfish		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-19090</link>

		<dc:creator><![CDATA[redfish]]></dc:creator>
		<pubDate>Sun, 13 Aug 2017 16:00:10 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-19090</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-7402&quot;&gt;Tim Dettmers&lt;/a&gt;.

Hey, thanks for those replies, but
how do you get results like 128000/160000?

Take the above as an example:
Standard: 128×1000 dot 1000×500 = 128×500
Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -&#062; stack matrices

As you mention in the reply above,
&quot;...model parallelism is to split your parameters among all nodes, so that you have two 1000×250 matricies B1 and B2. When you matrix multiply this with the input you will get two matricies C1 and C2 with dimensions 128×500. You stack these matricies to get the full output matrix of size 128×1000.&quot;

In my understanding:
A(128x1000) dot W(1000x500) can be split as
A dot B1(1000x250) = C1(128x250)
A dot B2(1000x250) = C2(128x250)
and stacking C1, C2 together gives a result of C(128x500).

I&#039;m not sure why C1/C2 have dimensions 128x500 in your reply.

And if my understanding is right (or maybe you&#039;re right),
I still don&#039;t know how model parallelism syncs through activations, or where the calculations of 128000/160000 come from.

Thanks a lot for helping me understand this!!
Appreciated.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-7402">Tim Dettmers</a>.</p>
<p>Hey, thanks for those replies, but<br />
how do you get results like 128000/160000?</p>
<p>Take the above as an example:<br />
Standard: 128×1000 dot 1000×500 = 128×500<br />
Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -&gt; stack matrices</p>
<p>As you mention in the reply above,<br />
&#8220;&#8230;model parallelism is to split your parameters among all nodes, so that you have two 1000×250 matricies B1 and B2. When you matrix multiply this with the input you will get two matricies C1 and C2 with dimensions 128×500. You stack these matricies to get the full output matrix of size 128×1000.&#8221;</p>
<p>In my understanding:<br />
A(128&#215;1000) dot W(1000&#215;500) can be split as<br />
A dot B1(1000&#215;250) = C1(128&#215;250)<br />
A dot B2(1000&#215;250) = C2(128&#215;250)<br />
and stacking C1, C2 together gives a result of C(128&#215;500).</p>
<p>I&#8217;m not sure why C1/C2 have dimensions 128&#215;500 in your reply.</p>
<p>And if my understanding is right (or maybe you&#8217;re right),<br />
I still don&#8217;t know how model parallelism syncs through activations, or where the calculations of 128000/160000 come from.</p>
<p>Thanks a lot for helping me understand this!!<br />
Appreciated.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16128</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 16 Jun 2017 18:11:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=85#comment-16128</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16106&quot;&gt;SK06&lt;/a&gt;.

I assume with model splitting you mean splitting the model among multiple GPUs. This is also called model parallelism.
PyTorch supports this quite well: https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778
It does not seem that Caffe supports model parallelism yet: https://github.com/BVLC/caffe/issues/876]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/comment-page-1/#comment-16106">SK06</a>.</p>
<p>I assume with model splitting you mean splitting the model among multiple GPUs. This is also called model parallelism.<br />
PyTorch supports this quite well: <a href="https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778" rel="nofollow ugc">https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778</a><br />
It does not seem that Caffe supports model parallelism yet: <a href="https://github.com/BVLC/caffe/issues/876" rel="nofollow ugc">https://github.com/BVLC/caffe/issues/876</a></p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
