<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>
<channel>
	<title>
	Comments on: How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism	</title>
	<atom:link href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:42:15 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Kalyan		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63530</link>

		<dc:creator><![CDATA[Kalyan]]></dc:creator>
		<pubDate>Fri, 04 Oct 2019 01:21:26 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63530</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498&quot;&gt;Tim Dettmers&lt;/a&gt;.

Got it...Thanks]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498">Tim Dettmers</a>.</p>
<p>Got it&#8230;Thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63498</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 03 Oct 2019 13:40:38 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63498</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481&quot;&gt;Kalyan&lt;/a&gt;.

The idea is that if you increase the time delta between computing the weight updates for one layer and the next, then you have more time to synchronize the weights of the next layer during backpropagation. Thus you can hide more communication under gradient computation if you use optimization techniques that are more expensive to compute (Adam or, in this case, RMSProp).]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481">Kalyan</a>.</p>
<p>The idea is that if you increase the time delta between computing the weight updates for one layer and the next, then you have more time to synchronize the weights of the next layer during backpropagation. Thus you can hide more communication under gradient computation if you use optimization techniques that are more expensive to compute (Adam or, in this case, RMSProp).</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Kalyan		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-63481</link>

		<dc:creator><![CDATA[Kalyan]]></dc:creator>
		<pubDate>Thu, 03 Oct 2019 04:15:06 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-63481</guid>

					<description><![CDATA[This part is not clear:
&quot;Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.&quot;
Exactly how do the gradients take the same time to be passed to each other?]]></description>
			<content:encoded><![CDATA[<p>This part is not clear:<br />
&#8220;Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.&#8221;<br />
Exactly how do the gradients take the same time to be passed to each other?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9588</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 10 Nov 2016 21:19:30 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-9588</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556&quot;&gt;Yogita&lt;/a&gt;.

Thank you for pointing out that error! I corrected the link&#039;s address and it should work now.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556">Yogita</a>.</p>
<p>Thank you for pointing out that error! I corrected the link&#8217;s address and it should work now.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Yogita		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-9556</link>

		<dc:creator><![CDATA[Yogita]]></dc:creator>
		<pubDate>Wed, 09 Nov 2016 09:48:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-9556</guid>

					<description><![CDATA[Your blog is really useful, but your link to the previous post is unavailable. Where can I read that post?]]></description>
			<content:encoded><![CDATA[<p>Your blog is really useful, but your link to the previous post is unavailable. Where can I read that post?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Masoud		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-8908</link>

		<dc:creator><![CDATA[Masoud]]></dc:creator>
		<pubDate>Fri, 07 Oct 2016 12:06:32 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-8908</guid>

					<description><![CDATA[Hi,

Thanks for your nice explanation.
Regarding the memory tiles section, I noticed the same degradation when using a batch size smaller than 32.
I am wondering if you have any reference that I can follow?]]></description>
			<content:encoded><![CDATA[<p>Hi,</p>
<p>Thanks for your nice explanation.<br />
Regarding the memory tiles section, I noticed the same degradation when using a batch size smaller than 32.<br />
I am wondering if you have any reference that I can follow?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Justin		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7940</link>

		<dc:creator><![CDATA[Justin]]></dc:creator>
		<pubDate>Tue, 30 Aug 2016 21:25:04 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7940</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932&quot;&gt;Bafu&lt;/a&gt;.

Yeah, I noticed the same thing. 

The equation listed in the article is incorrect, but the one Bafu listed isn&#039;t quite right either. For example, the equation in the article divides by 40 twice, which is wrong, and the last 1024 is a 102, which is also wrong. In addition, to convert from seconds to milliseconds, you have to multiply the numerator by 1000, not divide by 1000.

And as Bafu points out, you need to multiply the numerator by 8 to convert the weight matrix from bytes to bits, not divide by it. Bafu&#039;s equation missed converting from seconds to ms, which requires an additional *1000 in the numerator.

I believe the correct equation is:

(4 * 8 * 1000 * 1000 * 1000) / (40 * 1024 ^ 3)

= ~0.74ms]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932">Bafu</a>.</p>
<p>Yeah, I noticed the same thing. </p>
<p>The equation listed in the article is incorrect, but the one Bafu listed isn&#8217;t quite right either. For example, the equation in the article divides by 40 twice, which is wrong, and the last 1024 is a 102, which is also wrong. In addition, to convert from seconds to milliseconds, you have to multiply the numerator by 1000, not divide by 1000.</p>
<p>And as Bafu points out, you need to multiply the numerator by 8 to convert the weight matrix from bytes to bits, not divide by it. Bafu&#8217;s equation missed converting from seconds to ms, which requires an additional *1000 in the numerator.</p>
<p>I believe the correct equation is:</p>
<p>(4 * 8 * 1000 * 1000 * 1000) / (40 * 1024 ^ 3)</p>
<p>= ~0.74ms</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7212</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 04 Aug 2016 04:46:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7212</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177&quot;&gt;vinhomes riverside hai phong&lt;/a&gt;.

Thank you!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177">vinhomes riverside hai phong</a>.</p>
<p>Thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: vinhomes riverside hai phong		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-7177</link>

		<dc:creator><![CDATA[vinhomes riverside hai phong]]></dc:creator>
		<pubDate>Tue, 02 Aug 2016 13:05:47 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-7177</guid>

					<description><![CDATA[Hello there, you have done an incredible job. I will definitely digg it and personally suggest it to my friends.

I&#039;m confident they&#039;ll benefit from this site.]]></description>
			<content:encoded><![CDATA[<p>Hello there, you have done an incredible job. I will definitely digg it and personally suggest it to my friends.</p>
<p>I&#8217;m confident they&#8217;ll benefit from this site.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Bafu		</title>
		<link>https://timdettmers.com/2014/10/09/deep-learning-data-parallelism/comment-page-1/#comment-2932</link>

		<dc:creator><![CDATA[Bafu]]></dc:creator>
		<pubDate>Mon, 28 Mar 2016 08:00:20 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=65#comment-2932</guid>

					<description><![CDATA[Hi Tim,

I like all your articles very much, and here is a little correction for the formula calculating the time to pass a weight matrix:

    passing time = weight matrix size / network bandwidth = (1000*1000*4*8) / (40*1024^3) = 0.75 (ms)

Thanks for sharing!]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,</p>
<p>I like all your articles very much, and here is a little correction for the formula calculating the time to pass a weight matrix:</p>
<p>    passing time = weight matrix size / network bandwidth = (1000*1000*4*8) / (40*1024^3) = 0.75 (ms)</p>
<p>Thanks for sharing!</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
