<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: How To Build and Use a Multi GPU System for Deep Learning	</title>
	<atom:link href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sat, 02 Jan 2021 09:18:32 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-83863</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 02 Jan 2021 09:18:32 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-83863</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239&quot;&gt;Sunil&lt;/a&gt;.

Hi Sunil,
everything should be alright in your case: No NVLink needed, and 8x lanes are fine for 2 GPUs - it should not be much slower (maybe 1-5%).]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239">Sunil</a>.</p>
<p>Hi Sunil,<br />
everything should be alright in your case: No NVLink needed, and 8x lanes are fine for 2 GPUs &#8211; it should not be much slower (maybe 1-5%).</p>
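<p>As a minimal sketch (assuming TensorFlow 2.x; the tiny Keras model is only a placeholder), the two cards can be used data-parallel over plain PCIe with nothing more than MirroredStrategy:</p>
<pre><code># Minimal sketch: data-parallel training across both GPUs, no NVLink needed.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # picks up all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                             # variables are mirrored on each GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset, epochs=10)  # each batch is split across the two GPUs
</code></pre>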
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Sunil		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-82239</link>

		<dc:creator><![CDATA[Sunil]]></dc:creator>
		<pubDate>Thu, 03 Dec 2020 17:13:51 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-82239</guid>

					<description><![CDATA[Hi Tim,

Thanks for your excellent articles; they are really helpful and show great expertise.

I&#039;m working on DL for medical signal processing and currently just using a GTX 1070, which is slowing me down significantly. I&#039;m looking at buying 2 RTX 3080s to run in parallel on a single machine. Do I need anything (like NVLink) to connect them, or can they just go straight onto the motherboard and let the software (Tensorflow) do the parallelisation? Secondly, my motherboard specs say it can run at x16 bandwidth in the first PCIe slot, but &quot;x8 or above is not recommended for VGA cards&quot; for the other slots even though they are called &quot;x16&quot; slots (it&#039;s a 28-lane CPU). Does this matter? Will it slow everything down if parallelised (i.e. will both cards need to work at x4 or x8?), or will it be impossible to parallelise?

Thanks in advance for your help!]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,</p>
<p>Thanks for your excellent articles; they are really helpful and show great expertise.</p>
<p>I&#8217;m working on DL for medical signal processing and currently just using a GTX 1070, which is slowing me down significantly. I&#8217;m looking at buying 2 RTX 3080s to run in parallel on a single machine. Do I need anything (like NVLink) to connect them, or can they just go straight onto the motherboard and let the software (Tensorflow) do the parallelisation? Secondly, my motherboard specs say it can run at x16 bandwidth in the first PCIe slot, but &#8220;x8 or above is not recommended for VGA cards&#8221; for the other slots even though they are called &#8220;x16&#8221; slots (it&#8217;s a 28-lane CPU). Does this matter? Will it slow everything down if parallelised (i.e. will both cards need to work at x4 or x8?), or will it be impossible to parallelise?</p>
<p>Thanks in advance for your help!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74659</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 03 Jul 2020 14:44:48 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-74659</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477&quot;&gt;Deterministic&lt;/a&gt;.

I would use the two systems separately and not connect them with each other (not worth it; a good interconnect is expensive and has little benefit). The Titan RTX fits in the same slot as the RTX 2080, but two of them next to each other overheat quickly — so maybe stick with the RTX 2080. You do not need NVLink for 2 GPUs; the normal transfer via PCIe is more than enough.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477">Deterministic</a>.</p>
<p>I would use the two systems separately and not connect them with each other (not worth it; a good interconnect is expensive and has little benefit). The Titan RTX fits in the same slot as the RTX 2080, but two of them next to each other overheat quickly — so maybe stick with the RTX 2080. You do not need NVLink for 2 GPUs; the normal transfer via PCIe is more than enough.</p>
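<p>As a quick sanity check, a minimal sketch (assuming PyTorch and two visible CUDA devices; the 256 MB size is just for illustration) that times a direct GPU-to-GPU copy over PCIe:</p>
<pre><code># Minimal sketch: time a 256 MB GPU-to-GPU copy over PCIe with PyTorch.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs two visible GPUs"

src = torch.randn(64 * 1024 * 1024, device="cuda:0")  # 64M floats = 256 MB
dst = torch.empty_like(src, device="cuda:1")

dst.copy_(src)                    # warm-up copy (driver and context init)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

start = time.time()
dst.copy_(src)                    # the timed device-to-device transfer
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.time() - start

print(f"256 MB in {elapsed * 1000:.1f} ms, about {0.25 / elapsed:.1f} GB/s")
</code></pre>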
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Deterministic		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-74477</link>

		<dc:creator><![CDATA[Deterministic]]></dc:creator>
		<pubDate>Wed, 01 Jul 2020 12:04:37 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-74477</guid>

					<description><![CDATA[Hello Tim,
Congrats on your excellent articles! I would like your advice on a setup for deep learning with images.

I currently have 2 PCs with a GTX 1060 each and thought of replacing those with 2x 2080 Ti in each PC and making a cluster with them (potentially adding more later); I will also connect the GPUs with a fast direct link like you did, if it&#039;s worth it.

Some questions arise, though: is the Titan RTX a safer bet for the extra memory? I am not sure the Titan takes the same space and slots as the 2080... Also, NVLink between pairs of 2080s would double the memory, I think, although I don&#039;t know if it will work out of the box or needs specific libraries. What do you think?
Thanks!]]></description>
			<content:encoded><![CDATA[<p>Hello Tim,<br />
Congrats on your excellent articles! I would like your advice on a setup for deep learning with images.</p>
<p>I currently have 2 PCs with a GTX 1060 each and thought of replacing those with 2x 2080 Ti in each PC and making a cluster with them (potentially adding more later); I will also connect the GPUs with a fast direct link like you did, if it&#8217;s worth it.</p>
<p>Some questions arise, though: is the Titan RTX a safer bet for the extra memory? I am not sure the Titan takes the same space and slots as the 2080&#8230; Also, NVLink between pairs of 2080s would double the memory, I think, although I don&#8217;t know if it will work out of the box or needs specific libraries. What do you think?<br />
Thanks!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70474</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sat, 04 Apr 2020 02:59:01 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-70474</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106&quot;&gt;Alam Noor&lt;/a&gt;.

This setup is too difficult or even impossible to use for a single task. What you can do is run two separate tasks, one on each of these machines.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106">Alam Noor</a>.</p>
<p>This setup is too difficult or even impossible to use for a single task. What you can do is run two separate tasks, one on each of these machines.</p>
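<p>As a hypothetical sketch (the CUDA_VISIBLE_DEVICES variable is standard; the rest is just illustration) of how each independent task can be pinned to one specific GPU:</p>
<pre><code># Hypothetical sketch: restrict a training job to a single GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"       # pick the card for this job

import tensorflow as tf                        # import AFTER setting the mask
print(tf.config.list_physical_devices("GPU"))  # should list exactly one GPU
</code></pre>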
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Alam Noor		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-70106</link>

		<dc:creator><![CDATA[Alam Noor]]></dc:creator>
		<pubDate>Tue, 24 Mar 2020 19:40:29 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-70106</guid>

					<description><![CDATA[Hi Tim,
I have two GPU machines, one with an RTX 2080 Ti and one with a GeForce RTX 2070. Now I want to use both GPUs for training object detection on big data. Can you please explain how I can configure them for one task? Please explain step by step; I will be thankful for your valuable comments and time. Thanks]]></description>
			<content:encoded><![CDATA[<p>Hi Tim,<br />
I have two GPU machines, one with an RTX 2080 Ti and one with a GeForce RTX 2070. Now I want to use both GPUs for training object detection on big data. Can you please explain how I can configure them for one task? Please explain step by step; I will be thankful for your valuable comments and time. Thanks</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: fusedentropy		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43136</link>

		<dc:creator><![CDATA[fusedentropy]]></dc:creator>
		<pubDate>Mon, 24 Sep 2018 17:12:29 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43136</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028&quot;&gt;Tim Dettmers&lt;/a&gt;.

Thank you for the reply.

I have the same suspicion: it must be possible but, for some reason, not supported at this time. Also, NVIDIA does not seem to want to entertain the idea.

I do know that the NVIDIA driver does &quot;disable&quot; the capabilities. One example is the failure to &quot;enable peer access&quot; via cudaDeviceEnablePeerAccess.

I&#039;ve been looking for a hack to get around the limitation. I even thought about using some daemon running on Ubuntu, which would be running inside Hyper-V on top of Windows 10. I would use some sort of event or interrupt to trigger the daemon to perform a deviceToDevice memcpy. Unfortunately, I have read that the NVIDIA drivers will not work with the Ubuntu system in this scenario.

At any rate, gotta keep on trekking in spite of the bumps in the road; this is part of what engineers do.

Cheers!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028">Tim Dettmers</a>.</p>
<p>Thank you for the reply.</p>
<p>I have the same suspicion: it must be possible but, for some reason, not supported at this time. Also, NVIDIA does not seem to want to entertain the idea.</p>
<p>I do know that the NVIDIA driver does &#8220;disable&#8221; the capabilities. One example is the failure to &#8220;enable peer access&#8221; via cudaDeviceEnablePeerAccess.</p>
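<p>Here is a minimal probe (a sketch assuming PyTorch; it only surfaces what the driver reports) of whether peer access is exposed between each pair of GPUs:</p>
<pre><code># Minimal sketch: ask the driver which GPU pairs allow P2P peer access.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} to GPU {j}: peer access {'available' if ok else 'blocked'}")
</code></pre>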
<p>I&#8217;ve been looking for a hack to get around the limitation. I even thought about using some daemon running on Ubuntu, which would be running inside Hyper-V on top of Windows 10. I would use some sort of event or interrupt to trigger the daemon to perform a deviceToDevice memcpy. Unfortunately, I have read that the NVIDIA drivers will not work with the Ubuntu system in this scenario.</p>
<p>At any rate, gotta keep on trekking in spite of the bumps in the road; this is part of what engineers do.</p>
<p>Cheers!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43028</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 21 Sep 2018 20:18:19 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43028</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020&quot;&gt;fusedentropy&lt;/a&gt;.

I am not familiar with the details on Windows, but from my personal experience, a lot of things are documented as &quot;not working&quot; under certain conditions when they actually are. This might be to make people buy cards that &quot;support features&quot;, such as Teslas, or potentially, in this case, to save NVIDIA from troubles that are difficult to anticipate: few users work with such systems in that way, so it is difficult to support them, and they want to avoid trouble with support requests by just saying it does not work — period. I think this might be going on here.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020">fusedentropy</a>.</p>
<p>I am not familiar with the details on Windows, but from my personal experience, a lot of things are documented as &#8220;not working&#8221; under certain conditions when they actually are. This might be to make people buy cards that &#8220;support features&#8221;, such as Teslas, or potentially, in this case, to save NVIDIA from troubles that are difficult to anticipate: few users work with such systems in that way, so it is difficult to support them, and they want to avoid trouble with support requests by just saying it does not work — period. I think this might be going on here.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: fusedentropy		</title>
		<link>https://timdettmers.com/2014/09/21/how-to-build-and-use-a-multi-gpu-system-for-deep-learning/comment-page-1/#comment-43020</link>

		<dc:creator><![CDATA[fusedentropy]]></dc:creator>
		<pubDate>Fri, 21 Sep 2018 16:55:47 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.wordpress.com/?p=52#comment-43020</guid>

					<description><![CDATA[Something not mentioned is that, on Windows systems, GPUDirect only works when the GPU is in TCC mode. If you are going GPU-to-GPU, both GPUs need to be in TCC mode. This is also true for P2P.

NVIDIA insists that it is a limitation of Windows&#039; WDDM architecture. Frankly, I don&#039;t believe them. All that is needed for DMA is the physical memory address of both the src and the dst. The NVIDIA driver can easily get these values.

At that point, what the GPU does is a don&#039;t-care for the Windows OS. The GPU has so many schedulers that I am sure it could schedule a DMA of the memory, usually some multiple of the page size.

I am sure industry would love to be able to do DMA from their compute (TCC) GPUs to their display (WDDM) GPU after all their CUDA kernels have finished crunching the data, and then use OpenGL interop for display/rendering - all data stays on the GPUs! There is never a need to double-copy to host memory (GPU memory throughput is also much faster). Now add NVLink and you avoid PCIe traffic between GPUs as well! Awesome!

NVIDIA insists this is not possible due to Windows (WDDM) - Really?]]></description>
			<content:encoded><![CDATA[<p>Something not mentioned is that, on Windows systems, GPUDirect only works when the GPU is in TCC mode. If you are going GPU-to-GPU, both GPUs need to be in TCC mode. This is also true for P2P.</p>
<p>NVIDIA insists that it is a limitation of Windows&#8217; WDDM architecture. Frankly, I don&#8217;t believe them. All that is needed for DMA is the physical memory address of both the src and the dst. The NVIDIA driver can easily get these values.</p>
<p>At that point, what the GPU does is a don&#8217;t-care for the Windows OS. The GPU has so many schedulers that I am sure it could schedule a DMA of the memory, usually some multiple of the page size.</p>
<p>I am sure industry would love to be able to do DMA from their compute (TCC) GPUs to their display (WDDM) GPU after all their CUDA kernels have finished crunching the data, and then use OpenGL interop for display/rendering &#8211; all data stays on the GPUs! There is never a need to double-copy to host memory (GPU memory throughput is also much faster). Now add NVLink and you avoid PCIe traffic between GPUs as well! Awesome!</p>
<p>NVIDIA insists this is not possible due to Windows (WDDM) &#8211; Really?</p>
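<p>For anyone who wants to check where their cards stand, a small sketch (assuming the pynvml bindings; the driver-model query is Windows-only, and the exact return shape is an assumption worth verifying) that reports each GPU&#8217;s driver model:</p>
<pre><code># Sketch: report whether each GPU runs in WDDM or TCC driver mode (Windows only).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    current, pending = pynvml.nvmlDeviceGetDriverModel(handle)
    mode = "TCC" if current == pynvml.NVML_DRIVER_TCC else "WDDM"
    print(f"GPU {i}: driver model = {mode}")
pynvml.nvmlShutdown()
</code></pre>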
]]></content:encoded>
		
			</item>
	</channel>
</rss>
