<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: How To Build and Use a Multi GPU System for Deep Learning	</title>
	<atom:link href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sat, 02 Jan 2021 09:18:32 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-83863</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 02 Jan 2021 09:18:32 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-83863</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239&quot;&gt;Sunil&lt;/a&gt;.

Hi Sunil,
everything should be alright in your case: No NVLink needed, and 8x lanes are fine for 2 GPUs - it should not be much slower (maybe 1-5%).]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239">Sunil</a>.</p>
<p>Hi Sunil,<br />
everything should be alright in your case: No NVLink needed, and 8x lanes are fine for 2 GPUs &#8211; it should not be much slower (maybe 1-5%).</p>
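<p>As a minimal sketch (assuming TensorFlow 2.x; the tiny Keras model is only a placeholder), the two cards can be used data-parallel over plain PCIe with nothing more than MirroredStrategy:</p>
<pre><code># Minimal sketch: data-parallel training across both GPUs, no NVLink needed.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # picks up all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                             # variables are mirrored on each GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset, epochs=10)  # each batch is split across the two GPUs
</code></pre>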
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Sunil		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239</link>

		<dc:creator><![CDATA[Sunil]]></dc:creator>
		<pubDate>Thu, 03 Dec 2020 17:13:51 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-82239</guid>

					<description><![CDATA[Hi Tim,

Thanks for your excellent articles; they are really helpful and show great expertise.

I&#039;m working on DL for medical signal processing and currently just using a GTX 1070, which is slowing me down significantly. I&#039;m looking at buying 2 RTX 3080s to run in parallel on a single machine. Do I need anything (like NVLink) to connect them, or can they just go straight onto the motherboard and let the software (Tensorflow) do the parallelisation? Secondly, my motherboard specs say it can run at x16 bandwidth in the first PCIe slot, but &quot;x8 or above is not recommended for VGA cards&quot; for the other slots even though they are called &quot;x16&quot; slots (it&#039;s a 28-lane CPU). Does this matter? Will it slow everything down if parallelised (i.e. will both cards need to work at x4 or x8?), or will it be impossible to parallelise?

Thanks in advance for your help!]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,</p>
<p>Thanks for your excellent articles; they are really helpful and show great expertise.</p>
<p>I&#8217;m working on DL for medical signal processing and currently just using a GTX 1070, which is slowing me down significantly. I&#8217;m looking at buying 2 RTX 3080s to run in parallel on a single machine. Do I need anything (like NVLink) to connect them, or can they just go straight onto the motherboard and let the software (Tensorflow) do the parallelisation? Secondly, my motherboard specs say it can run at x16 bandwidth in the first PCIe slot, but &#8220;x8 or above is not recommended for VGA cards&#8221; for the other slots even though they are called &#8220;x16&#8221; slots (it&#8217;s a 28-lane CPU). Does this matter? Will it slow everything down if parallelised (i.e. will both cards need to work at x4 or x8?), or will it be impossible to parallelise?</p>
<p>Thanks in advance for your help!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74659</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 03 Jul 2020 14:44:48 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-74659</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477&quot;&gt;Deterministic&lt;/a&gt;.

I would use the two systems separately and not connect them with each other (not worth it; a good interconnect is expensive and has little benefit). The Titan RTX fits in the same slot as the RTX 2080, but two of them next to each other overheat quickly — so maybe stick with the RTX 2080. You do not need NVLink for 2 GPUs; the normal transfer via PCIe is more than enough.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477">Deterministic</a>.</p>
<p>I would use the two systems separately and not connect them with each other (not worth it; a good interconnect is expensive and has little benefit). The Titan RTX fits in the same slot as the RTX 2080, but two of them next to each other overheat quickly — so maybe stick with the RTX 2080. You do not need NVLink for 2 GPUs; the normal transfer via PCIe is more than enough.</p>
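<p>As a quick sanity check, a minimal sketch (assuming PyTorch and two visible CUDA devices; the 256 MB size is just for illustration) that times a direct GPU-to-GPU copy over PCIe:</p>
<pre><code># Minimal sketch: time a 256 MB GPU-to-GPU copy over PCIe with PyTorch.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs two visible GPUs"

src = torch.randn(64 * 1024 * 1024, device="cuda:0")  # 64M floats = 256 MB
dst = torch.empty_like(src, device="cuda:1")

dst.copy_(src)                    # warm-up copy (driver and context init)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

start = time.time()
dst.copy_(src)                    # the timed device-to-device transfer
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - start

print(f"256 MB in {elapsed * 1000:.1f} ms, about {0.25 / elapsed:.1f} GB/s")
</code></pre>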
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Deterministic		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477</link>

		<dc:creator><![CDATA[Deterministic]]></dc:creator>
		<pubDate>Wed, 01 Jul 2020 12:04:37 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-74477</guid>

					<description><![CDATA[Hello Tim,
Congrats on your excellent articles! I would like your advice on a setup for deep learning with images.

I currently have 2 PCs with a GTX 1060 each and thought of replacing those with 2x 2080 Ti in each PC and making a cluster with them (potentially adding more later); I will also connect the GPUs with a fast direct link like you did, if it&#039;s worth it.

Some questions arise, though: is the Titan RTX a safer bet for the extra memory? I am not sure the Titan takes the same space and slots as the 2080... Also, NVLink between pairs of 2080s would double the memory, I think, although I don&#039;t know if it will work out of the box or needs specific libraries. What do you think?
Thanks!]]></description>
			<content:encoded><![CDATA[<p>Hello Tim,<br />
Congrats on your excellent articles! I would like your advice on a setup for deep learning with images.</p>
<p>I currently have 2 PCs with a GTX 1060 each and thought of replacing those with 2x 2080 Ti in each PC and making a cluster with them (potentially adding more later); I will also connect the GPUs with a fast direct link like you did, if it&#8217;s worth it.</p>
<p>Some questions arise, though: is the Titan RTX a safer bet for the extra memory? I am not sure the Titan takes the same space and slots as the 2080&#8230; Also, NVLink between pairs of 2080s would double the memory, I think, although I don&#8217;t know if it will work out of the box or needs specific libraries. What do you think?<br />
Thanks!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70474</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 04 Apr 2020 02:59:01 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-70474</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106&quot;&gt;Alam Noor&lt;/a&gt;.

This setup is too difficult or even impossible to use for a single task. What you can do is run two separate tasks, one on each of these machines.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106">Alam Noor</a>.</p>
<p>This setup is too difficult or even impossible to use for a single task. What you can do is run two separate tasks, one on each of these machines.</p>
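<p>As a hypothetical sketch (the CUDA_VISIBLE_DEVICES variable is standard; the rest is just illustration) of how each independent task can be pinned to one specific GPU:</p>
<pre><code># Hypothetical sketch: restrict a training job to a single GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"       # pick the card for this job

import tensorflow as tf                        # import AFTER setting the mask
print(tf.config.list_physical_devices("GPU"))  # should list exactly one GPU
</code></pre>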
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Alam Noor		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106</link>

		<dc:creator><![CDATA[Alam Noor]]></dc:creator>
		<pubDate>Tue, 24 Mar 2020 19:40:29 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-70106</guid>

					<description><![CDATA[Hi Tim,
I have two GPU machines, one with an RTX 2080 Ti and one with a GeForce RTX 2070. Now I want to use both GPUs for training object detection on big data. Can you please explain how I can configure them for one task? Please explain step by step; I will be thankful for your valuable comments and time. Thanks]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,<br />
I have two GPU machines, one with an RTX 2080 Ti and one with a GeForce RTX 2070. Now I want to use both GPUs for training object detection on big data. Can you please explain how I can configure them for one task? Please explain step by step; I will be thankful for your valuable comments and time. Thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: fusedentropy		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43136</link>

		<dc:creator><![CDATA[fusedentropy]]></dc:creator>
		<pubDate>Mon, 24 Sep 2018 17:12:29 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43136</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028&quot;&gt;Tim Dettmers&lt;/a&gt;.

Thank you for the reply.

I have the same suspicion: it must be possible but, for some reason, not supported at this time. Also, NVIDIA does not seem to want to entertain the idea.

I do know that the NVIDIA driver does &quot;disable&quot; the capabilities. One example is the failure to &quot;enable peer access&quot; via cudaDeviceEnablePeerAccess.

I&#039;ve been looking for a hack to get around the limitation. I even thought about using some daemon running on Ubuntu, which would be running inside Hyper-V on top of Windows 10. I would use some sort of event or interrupt to trigger the daemon to perform a deviceToDevice memcpy. Unfortunately, I have read that the NVIDIA drivers will not work with the Ubuntu system in this scenario.

At any rate, gotta keep on trekking in spite of the bumps in the road; this is part of what engineers do.

Cheers!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028">Tim Dettmers</a>.</p>
<p>Thank you for the reply.</p>
<p>I have the same suspicion: it must be possible but, for some reason, not supported at this time. Also, NVIDIA does not seem to want to entertain the idea.</p>
<p>I do know that the NVIDIA driver does &#8220;disable&#8221; the capabilities. One example is the failure to &#8220;enable peer access&#8221; via cudaDeviceEnablePeerAccess.</p>
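<p>Here is a minimal probe (a sketch assuming PyTorch; it only surfaces what the driver reports) of whether peer access is exposed between each pair of GPUs:</p>
<pre><code># Minimal sketch: ask the driver which GPU pairs allow P2P peer access.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} to GPU {j}: peer access {'available' if ok else 'blocked'}")
</code></pre>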
<p>I&#8217;ve been looking for a hack to get around the limitation. I even thought about using some daemon running on Ubuntu, which would be running inside Hyper-V on top of Windows 10. I would use some sort of event or interrupt to trigger the daemon to perform a deviceToDevice memcpy. Unfortunately, I have read that the NVIDIA drivers will not work with the Ubuntu system in this scenario.</p>
<p>At any rate, gotta keep on trekking in spite of the bumps in the road; this is part of what engineers do.</p>
<p>Cheers!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 21 Sep 2018 20:18:19 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43028</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020&quot;&gt;fusedentropy&lt;/a&gt;.

I am not familiar with the details on Windows, but from my personal experience, a lot of things are documented as &quot;not working&quot; under certain conditions when they actually are. This might be to make people buy cards that &quot;support features&quot;, such as Teslas, or potentially, in this case, to save NVIDIA from troubles that are difficult to anticipate: few users work with such systems in that way, so it is difficult to support them, and they want to avoid trouble with support requests by just saying it does not work — period. I think this might be going on here.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020">fusedentropy</a>.</p>
<p>I am not familiar with the details on Windows, but from my personal experience, a lot of things are documented as &#8220;not working&#8221; under certain conditions when they actually are. This might be to make people buy cards that &#8220;support features&#8221;, such as Teslas, or potentially, in this case, to save NVIDIA from troubles that are difficult to anticipate: few users work with such systems in that way, so it is difficult to support them, and they want to avoid trouble with support requests by just saying it does not work — period. I think this might be going on here.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: fusedentropy		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020</link>

		<dc:creator><![CDATA[fusedentropy]]></dc:creator>
		<pubDate>Fri, 21 Sep 2018 16:55:47 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43020</guid>

					<description><![CDATA[Something not mentioned is that, on Windows systems, GPUDirect only works when the GPU is in TCC mode. If you are going GPU-to-GPU, both GPUs need to be in TCC mode. This is also true for P2P.

NVIDIA insists that it is a limitation of Windows&#039; WDDM architecture. Frankly, I don&#039;t believe them. All that is needed for DMA is the physical memory address of both the src and the dst. The NVIDIA driver can easily get these values.

At that point, what the GPU does is a don&#039;t-care for the Windows OS. The GPU has so many schedulers that I am sure it could schedule a DMA of the memory, usually some multiple of the page size.

I am sure industry would love to be able to do DMA from their compute (TCC) GPUs to their display (WDDM) GPU after all their CUDA kernels have finished crunching the data, and then use OpenGL interop for display/rendering - all data stays on the GPUs! There is never a need to double-copy to host memory (GPU memory throughput is also much faster). Now add NVLink and you avoid PCIe traffic between GPUs as well! Awesome!

NVIDIA insists this is not possible due to Windows (WDDM) - Really?]]></description>
			<content:encoded><![CDATA[<p>Something not mentioned is that, on Windows systems, GPUDirect only works when the GPU is in TCC mode. If you are going GPU-to-GPU, both GPUs need to be in TCC mode. This is also true for P2P.</p>
<p>NVIDIA insists that it is a limitation of Windows&#8217; WDDM architecture. Frankly, I don&#8217;t believe them. All that is needed for DMA is the physical memory address of both the src and the dst. The NVIDIA driver can easily get these values.</p>
<p>At that point, what the GPU does is a don&#8217;t-care for the Windows OS. The GPU has so many schedulers that I am sure it could schedule a DMA of the memory, usually some multiple of the page size.</p>
<p>I am sure industry would love to be able to do DMA from their compute (TCC) GPUs to their display (WDDM) GPU after all their CUDA kernels have finished crunching the data, and then use OpenGL interop for display/rendering &#8211; all data stays on the GPUs! There is never a need to double-copy to host memory (GPU memory throughput is also much faster). Now add NVLink and you avoid PCIe traffic between GPUs as well! Awesome!</p>
<p>NVIDIA insists this is not possible due to Windows (WDDM) &#8211; Really?</p>
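<p>For anyone who wants to check where their cards stand, a small sketch (assuming the pynvml bindings; the driver-model query is Windows-only, and the exact return shape is an assumption worth verifying) that reports each GPU&#8217;s driver model:</p>
<pre><code># Sketch: report whether each GPU runs in WDDM or TCC driver mode (Windows only).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    current, pending = pynvml.nvmlDeviceGetDriverModel(handle)
    mode = "TCC" if current == pynvml.NVML_DRIVER_TCC else "WDDM"
    print(f"GPU {i}: driver model = {mode}")
pynvml.nvmlShutdown()
</code></pre>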
]]></content:encoded>
		
			</item>
	</channel>
</rss>
