Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or would buying a fast CPU be wasteful? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.
Over the years, I have built a total of 7 different deep learning workstations and, despite careful research and reasoning, I made my fair share of mistakes in selecting hardware parts. In this guide, I want to share the experience I gained over the years so that you do not make the same mistakes I did.
The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.
GPU
This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is the heart of deep learning applications – the improvement in processing speed is simply too large to ignore.
I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.
For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).
Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models that are twice as big with the same memory compared to GTX cards. As such, RTX cards have a memory advantage, and picking an RTX card and learning how to use 16-bit models effectively will carry you a long way (a short 16-bit training sketch follows the list below). In general, the requirements for memory are roughly the following:
- Research that is hunting state-of-the-art scores: >=11 GB
- Research that is hunting for interesting architectures: >=8 GB
- Any other research: 8 GB
- Kaggle: 4 – 8 GB
- Startups: 8 GB (but check the specific application area for model sizes)
- Companies: 8 GB for prototyping, >=11 GB for training
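To make the 16-bit advice concrete, here is a minimal mixed-precision training sketch using PyTorch's automatic mixed precision (torch.cuda.amp, available in PyTorch 1.6 and later); the tiny model and the random data are placeholders, not recommendations:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()            # rescales the loss so FP16 gradients do not underflow

inputs = torch.randn(32, 1024).cuda()           # placeholder mini-batch
targets = torch.randint(0, 10, (32,)).cuda()    # placeholder labels

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # forward pass runs in FP16 where it is safe
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()               # backward pass on the scaled loss
    scaler.step(optimizer)                      # unscales gradients, then takes the optimizer step
    scaler.update()

With autocast, activations and most intermediate tensors are kept in 16-bit, which is where most of the memory savings mentioned above come from.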
Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If you want to stick GPUs into PCIe slots that are next to each other, you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues, and your GPUs will be slower (by about 30%) and die faster.
RAM
The main mistake with RAM is to buy RAM with a too high clock rate. The second mistake is to buy too little RAM to have a smooth prototyping experience.
Needed RAM Clock Rate
RAM clock rates are marketing stunts where RAM companies lure you into buying “faster” RAM that actually yields little to no performance gains. This is best explained by the “Does RAM speed REALLY matter?” video by Linus Tech Tips.
Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you use pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gain of fast versus slow RAM is about 0-3% — spend your money elsewhere!
RAM Size
RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to comfortably work with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory, you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.
The problem with this “match largest GPU memory in RAM” strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU, and if you feel that you do not have enough RAM, just buy some more.
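If you want a quick check of how your current machine measures up against this rule, here is a small sketch (it assumes psutil and PyTorch are installed and at least one CUDA GPU is visible):

import psutil
import torch

system_ram_gb = psutil.virtual_memory().total / 1024**3
largest_gpu_gb = max(
    torch.cuda.get_device_properties(i).total_memory / 1024**3
    for i in range(torch.cuda.device_count())
)
print(f"System RAM: {system_ram_gb:.0f} GB, largest GPU memory: {largest_gpu_gb:.0f} GB")
if system_ram_gb < largest_gpu_gb:
    print("Consider buying more RAM to at least match your largest GPU.")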
A different strategy is informed by psychology: psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration in more pressing matters if you have more RAM. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing, then additional RAM might be a good choice. With this strategy, you want to have more, cheap RAM now rather than later.
CPU
The main mistake people make is paying too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.
CPU and PCI-Express
People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. An ImageNet batch of 32 images (32x225x225x3) in 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe be about twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range, and thus latency can be ignored.
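The arithmetic behind these numbers is simple; here is a rough reconstruction (the exact milliseconds depend on which effective PCIe 3.0 bandwidth you assume, so the results land slightly above the figures quoted above):

batch_bytes = 32 * 225 * 225 * 3 * 4           # 32 images, 225x225x3, 32-bit floats: ~19.4 MB
pcie3_gb_per_s = {16: 15.8, 8: 7.9, 4: 3.9}    # theoretical peak PCIe 3.0 bandwidth per lane count
for lanes, bw in pcie3_gb_per_s.items():
    print(f"{lanes:>2} lanes: ~{batch_bytes / (bw * 1e9) * 1e3:.1f} ms per mini-batch")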
Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:
- Forward and backward pass: 216 milliseconds (ms)
- 16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
- 8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
- 4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)
Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!
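For reference, pinned memory is essentially a one-line change in PyTorch's data loader; here is a sketch with random placeholder data instead of a real dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 256 random "images" with labels.
dataset = TensorDataset(torch.randn(256, 3, 225, 225), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    pin_memory=True,    # page-locked host memory enables asynchronous copies
                    num_workers=2)

for inputs, targets in loader:
    # non_blocking=True overlaps the CPU->GPU transfer with GPU compute
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)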
When you select CPU PCIe lanes and motherboard PCIe lanes, make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.
PCIe Lanes and Multi-GPU Parallelism
Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR 2016, and I can tell you that if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get support for 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, as a rule of thumb: do not spend extra money to get more PCIe lanes per GPU — it does not matter!
Needed CPU Cores
To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.
By far the most useful application for your CPU is data preprocessing. There are two common data processing strategies which have different CPU needs.
The first strategy is preprocessing while you train:
Loop:
- Load mini-batch
- Preprocess mini-batch
- Train on mini-batch
The second strategy is preprocessing before any training:
- Preprocess data
- Loop:
- Load preprocessed mini-batch
- Train on mini-batch
For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.
For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.
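As a concrete sketch of the first strategy, the loader below decodes and augments images in background worker processes while the GPU trains. It assumes torchvision is installed; the data path is a hypothetical placeholder:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

num_gpus = max(torch.cuda.device_count(), 1)
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),    # CPU-side decoding and augmentation
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("/data/train", transform=preprocess)   # hypothetical path
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=True,
                    num_workers=2 * num_gpus)   # roughly two worker cores per GPU, per the rule above

For the second strategy, you would run the transforms once, save the resulting tensors to disk, and load them with one worker per GPU.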
Needed CPU Clock Rate (Frequency)
When people think about fast CPUs they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors with different architectures. Also, it is not always the best measure of performance.
In the case of deep learning there is very little computation to be done by the CPU: increment a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.
While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.
Note that these experiments were run on dated hardware; however, the results should still be the same for modern CPUs/GPUs.
Hard drive/SSD
The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: if you read your data from disk only when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example, with torchvision loaders), then you will have loaded the mini-batch in those 185 milliseconds, while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still being computed.
However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.
Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.
Power supply unit (PSU)
Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while, and a good PSU is therefore a good investment.
You can calculate the required watts by adding up the wattage of your CPU and GPUs plus an additional 10% for other components and as a buffer for power spikes. For example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
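The same rule of thumb as a small calculation (the TDP values are the example numbers from the paragraph above; the rounding differs slightly from the worked example but ends up at the same 1400 watt PSU):

gpu_tdps = [250, 250, 250, 250]              # four GPUs at 250 watts TDP each
cpu_tdp = 150                                # CPU TDP in watts
base = sum(gpu_tdps) + cpu_tdp               # 1150 watts
with_components = base * 1.10                # ~10% for other components and power spikes
with_margin = with_components * 1.10         # another ~10% just to be sure
print(f"Buy a PSU of at least ~{with_margin:.0f} watts")   # ~1392 watts -> get a 1400 watt PSU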
One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!
Another important thing is to buy a PSU with high power efficiency rating – especially if you run many GPUs and will run them for a longer time.
Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
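The electricity cost can be reproduced with a few lines (the price and efficiency are the example values used above):

draw_kw = 1.25               # 4 GPU system under load, midpoint of 1000-1500 watts
hours = 14 * 24              # two weeks of around-the-clock training
price_eur_per_kwh = 0.20     # the German electricity price used in the text

energy_kwh = draw_kw * hours                      # ~420 kWh
ideal_cost = energy_kwh * price_eur_per_kwh       # ~84 EUR at 100% PSU efficiency
cost_80 = energy_kwh / 0.8 * price_eur_per_kwh    # the wall draws 1/0.8 of the delivered power
print(f"{energy_kwh:.0f} kWh, {ideal_cost:.0f} EUR ideal, "
      f"+{cost_80 - ideal_cost:.0f} EUR extra with an 80% efficient PSU")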
Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplane) and other factors that contribute to your footprint. If you want to be responsible, please consider going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is easy to do, cheap, and should be standard for deep learning researchers.
CPU and GPU Cooling
Cooling is important and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.
Air Cooling GPUs
Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes can be made when you try to cool 3-4 GPUs, and you need to think carefully about your options in this case.
Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.
However, typical pre-programmed fan speed schedules are badly designed for deep learning programs, so this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant for multiple GPUs (10-25%) where the GPUs heat up each other.
Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.
The only option under Linux is to set a configuration for your Xorg server (Ubuntu) where you set the “coolbits” option. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.
The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The “blower” fan design pushes the air out the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air from the vicinity of the GPU and cool the GPU with it. However, if you have multiple GPUs next to each other, then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.
Water Cooling GPUs For Multiple GPUs
Another, more costly, and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.
A Big Case for Cooling?
I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that’s it!
Conclusion Cooling
So in the end it is simple: for 1 GPU, air cooling is best. For multiple GPUs, you should either get blower-style air cooling and accept a tiny performance penalty (10-15%), or pay extra for water cooling, which is more difficult to set up correctly but comes with no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.
Motherboard
Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find this information if you search for your motherboard of choice on Newegg and look at the PCIe section on the specification page.
Computer Case
When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.
If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into the case.
Monitors
I first thought it would be silly to write about monitors, too, but they make such a huge difference and are so important that I just have to write about them.
The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?
Some words on building a PC
Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which walk you through the process if you have no experience.
The great thing about building a computer is that you know everything there is to know about building a computer once you have done it, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!
Conclusion / TL;DR
GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU depending on how you preprocess data. > 2 GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.
RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least as much CPU RAM as the memory of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.
Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.
PSU:
– Add up watts of GPUs + CPU. Then multiply the total by 110% for required Wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6-pin and 8-pin).
Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set coolbits flag in your Xorg config to control fan speeds
Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)
Monitors:
– An additional monitor might make you more productive than an additional GPU.
Update 2018-12-14: Reworked entire blog post with up-to-date recommendations.
Update 2015-04-22: Removed recommendation for GTX 580
Jay says
Hey,
Thanks for that summary. You said that one should buy a GPU with at least 8GB RAM but that RTX GPU RAM was twice as effective as GTX RAM. That brings me to my question.
I have a choice between 2 laptops, identical except one has a GeForce RTX 3060 6GB and costs $1400, while the other has a GeForce RTX 3070 8GB and costs $2000.
I know the RTX 3060 will be slower but is 6GB acceptable? You implied it will be the equivalent of a GeForce GTX 12GB RAM video card for RAM utilization.
Please advise as I’d really like to save the extra $600 in cost between the 2 laptops.
Given that desktop add-in cards for the RTX 3000 series seem to start at $1000, it seems to me I should bide my time with a good entry-level laptop with an RTX GPU at a much fairer price until the video card price gouging is over.
Thanks!
Tim Dettmers says
6GB is indeed a bit small – I would go for the 8 GB GPU
zoey79 says
Wonderful article. However, I am about to buy a new laptop. So what do you feel about the idea of a gaming laptop for deep learning?
Tim Dettmers says
Gaming laptops are excellent for deep learning. Make sure to get a beefy GPU!
TK says
I had a gaming laptop for deep learning. However, I think a desktop is still a better choice. Using a laptop for deep learning tends to overheat the laptop, and the battery appears to degrade much faster.
Moreover, the largest GPU memory in a laptop is 8 GB, but note that not all 8 GB can be allocated for deep learning, which may not be sufficient if you are trying a very deep network or a dual network. A mobile GPU is also less efficient than a desktop GPU. Computing speed (CPU, etc.) can also be slower than a gaming desktop.
Chaitanya says
Thank you Tim for the post, it was very helpful to understand the importance of hardware components in deep learning.
I have been researching the hardware requirements to begin a deep learning project on my workstation for a couple of months, and finally read your article, which has answered a lot of my questions. I did realize the GPU on my machine will not be sufficient, so I wanted to get your thoughts on its replacement or adding a second one.
Please suggest if I can add any Nvidia 20xx series GPU to below configuration.
– Dual CPU – Xeon E5 2670 – V2 10 cores each, 64GB RAM
– Existing GPU – Nvidia Geforce 1050
– power unit – 800 watts
– two PCIe Gen 3 x16 slots (with 4 other Gen slots in between; currently one is in use for the 1050)
Kriskr3 says
Hello Tim,
I had read your great article on GPU recommendations for deep learning; it was informative and would help anyone who is interested and serious about this field. I found the article when I searched Google for ideas on a GPU upgrade, and after reading your responses to the posts I wanted to ask my question right here. I have an HP workstation that has an Nvidia GeForce GTX 1050 (4GB), so I am looking to either replace it or add another. The power unit is 800 watts, dual CPU, two PCIe Gen 3 x16, one PCIe Gen 3 x8, and three other Gen 2 slots. I believe at max I can add one GPU (maybe a low-wattage one) due to space and power limitations. I'm not sure if I can even add an Nvidia GeForce 20-series card to the existing one or whether I need to replace it. I would appreciate it if you could share your view based on your experience.
Imahn says
Dear Tim,
I would have five short questions, I am really sorry!
(i) I am generally wondering: I am not 100% sure yet whether I should opt for 2 GPUs or 4GPUs. I would of course first buy 1 GPU and then scale, but if I know a priori that I plan to have only two GPUs, I could opt for a cheaper MB, CPU, cooler, PSU, etc. Does one maybe need 2 GPUs to do some testing on hyperparameters of papers that one reads, and 4 GPUs if one wants to build own neural networks (and thus test even more ideas)? Do you have any brief thoughts on this, or a link we could read?
(ii) What do you think of this PSU from EVGA:
https://www.newegg.com/evga-supernova-750-g-120-gp-0750-x1-2000w/p/1HU-00J7-006T1?Description=2000%20watt%20power%20supply&cm_re=2000_watt%20power%20supply-_-9SIAHT8BHT5519-_-Product
I am asking because in the other post of yours, you were writing about the problem of
4 x RTX 3090, but wouldn’t this PSU solve the problem? But you didn’t mention this PSU, that’s why I am confused. (Apparently, this PSU only works under 220 V, so for me, I couldn’t buy it, but wouldn’t it be great for US Americans?)
(iii) For a possible 4-GPU setup, do you think that an Intel Core i7-9800X with 8 cores is enough for the 4 GPUs at full utilization + the normal things that one does (reading papers, having Zoom meetings, using LibreOffice, VirtualBox, etc.)? This CPU would cost me 480 $, but more cores would even cost more. I generally suspect that I will need a GPU rather than CPU for ML, I know you recommend 1-2 CPUs per GPU, but with 4 GPUs, that would be 4-8 just for the GPUs, so I am honestly unsure.
(iv) This question is strongly related to (iii): Is it possible to use the Deep Learning PC for normal home-office while the 4 GPUs are at full utilization?
(v) If I opted for 4 GPUs in blower-style fan, wouldn’t my neighbors be able to hear it? They have small babies and I am honestly worried that the noise at night would be too much… Any thoughts would be appreciated. 🙂
Dmytro says
Hi Tom!
I have a CPU: Intel® Pentium(R) CPU G4560 @ 3.50GHz × 4
and I get an error when trying to load a model with TF 2.2 and above:
Process finished with exit code 132 (interrupted by signal 4: SIGILL)
With TF 1.5 it works fine.
I read a lot and found that it is connected with the CPU — is that true?
I really need to understand what the trouble is.
Thank you for your attention!
Have a nice day.
marco says
1. The versions of Python and TF should match (please verify the specs in the TF software requirements).
2. Precompiled CPU binary versions of TF can include CPU-specific optimizations that may not be compatible.
Generally this happens when a binary built for a different CPU architecture is launched on an unsupported CPU.
TF 1.5 could probably run on your CPU, but the new version was not compiled for it or was not compatible with your Python version.
How to solve it:
If you can confirm the Python incompatibility issue, that can be solved quickly by installing the appropriate version of Python (I suppose 3.8, because I have the same TF 2.2 on my machine and it uses 3.8).
If the problem is at the binary/CPU compatibility level, you should:
A: change the CPU
or
B: compile TF from source ON YOUR CPU.
I have compiled TF on my CPU several times and you can do it, don’t worry.
Armand says
Hi Tim,
I’m building a DL rig for a student organization and I’m wondering how to share it with students. I want to be able to create VMs and erase/reconfigure them if students mess up. I want to use it kind of like a personal AWS Cloud.
Do you have any leads I should follow or keywords I should search for ?
Thanks!
Mira says
Hi Tim, all,
We are about to buy (when available) RTX 3090 for AI, PyTorch and TensorFlow.
The computer where I planned to put the GPU has an i7-3930K, which only runs PCIe 2.0. How much would PCIe 2.0 limit the performance of the computations?
I know the theoretical throughputs, but I have no idea about real performance. Could you please give me some example of the performance degradation?
Thanks, Mira
Mira says
Edit: I enabled PCIe 3.0 via force-enable-gen3.exe
Now the throughput and performance seem to be fine.
Audi says
Hi Tim,
Thanks for this article and late happy new year!
I am currently doing simple ML and some deep learning for images as a hobby with my laptop. I wanted to build a pc with a budget constraint of around 2k USD as I wanted to learn more about deep learning and AI as a beginner.
Here is my current PCPartsPicker list too:
PCPartPicker Part List: https://pcpartpicker.com/list/jGM33Z
CPU: AMD Ryzen 9 3900XT 3.8 GHz 12-Core Processor
CPU Cooler: Cooler Master Hyper 212 Black Edition 42 CFM CPU Cooler
Motherboard: Gigabyte X570 AORUS ELITE WIFI ATX AM4 Motherboard
Memory: Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-2666 CL16 Memory (x2)
Storage SSD: ADATA XPG SX8200 Pro 1 TB M.2-2280 NVME Solid State Drive
Storage HDD: Seagate Barracuda Compute 4 TB 3.5″ 5400RPM Internal Hard Drive
Video Card: MSI GeForce RTX 3070 8 GB GAMING X TRIO Video Card
Case: Phanteks Eclipse P300A Mesh ATX Mid Tower Case
Power Supply: Cooler Master MasterWatt 750 W 80+ Bronze Certified Semi-modular ATX
The question:
1. For the CPU I am conflicted between the Ryzen 9 3900XT (12 cores), where people claim that per-core performance is better, and the Ryzen Threadripper 2950X (16 cores). For deep learning and ML, which one is better?
2. For the GPU I am also conflicted between the MSI RTX 3070 (8 GB memory, 256-bit memory bus, 5888 cores) and the Zotac RTX 3080. Your post recommended the RTX 3080 (10 GB memory, 320-bit memory bus, 8704 cores); however, with my budget, I can only land one of these two, and Zotac is claimed to be a subpar brand for the RTX 3080. Or maybe should I wait for the upcoming RTX 3070 Ti (10 GB memory, 320-bit bus, ~7424 cores) or the non-Ti RTX 3060 (12 GB memory, 192-bit bus, ~3840 cores)?
3. As a long-time Windows user, should I dual-boot Windows and Linux or just install Linux on this PC? (I am not used to Linux and would like to use my PC for programs like Office and some Steam games.)
Thanks!
David says
Hey Audi,
I’m currently struggling with a similar problem: either the Ryzen 9 3900X or the 5800X. Do you know which one is better for deep learning? Following the explanation given by Tim, I suppose that the 12 cores outperform the 8 cores of the 5800X?
Frederick Carlson says
Scary. This is very similar to my build
Great article and site – BTW
CPU: Intel Core i9-10900X Cascade Lake 3.7GHz Ten-Core
CPU Cooler: Noctua NH-D15S CPU Cooler
Motherboard: Gigabyte X299X Aorus Master Intel LGA 2066 eATX Motherboard
Memory:
G.Skill Ripjaws V Series 32 GB (2 x 16 GB) DDR4-3200 CL16 Memory
G.Skill Trident Z RGB 32GB (2 x 16GB) DDR4-3200 PC4-25600 CL16 Memory
Storage: Two Samsung M2 SSD (2X500 GB) and 4 2TB WD Black HDDs
Video Card: NVIDIA GeForce 1080ti (x1)
Video Card: NVIDIA GeForce 1070ti (x3)
Case: Lian Li O11D XL-W ATX Full Tower Case
Power Supply: Thermaltake 1250 W
Imahn says
Hi Tim,
(i) I hope you are doing fine! I am currently searching a bit for an appropriate motherboard to support 2-3 GPU’s, and I honestly don’t know how to read the specifications to decide …
So the way I understood some of your comments, it is not necessary to have PCIe 4.0 lanes, but PCIe 3.0 lanes seem to do their job for Machine Learning. Now let’s say that I want to have three GPU’s, what does the specification need to say? Since I cannot find RTX 3080 in blower-style fans, I suspect that I need enough space between the GPU’s as to not run into cooling problems.
As an example, let’s consider the MSI Z490-A PRO ATX (https://www.cyberport.de/?DEEP=2303-9AR&APID=21&msclkid=19d8555294d71af589d60ab781345cc7), I’m sorry this website is in German …
It says that this Motherboard has the following specs:
2x PCIe 3.0 x16 (1x x16, 1x x4), 3x PCIe 3.0 x1
Is this good for 3 GPU’s? To me, on the image, the 3 x PCIe 3.0 x1 look really small,
so I guess only the 3.0 x 16 could be used for GPU’s.
(ii) Does it make sense to buy a 3-year warranty for a Motherboard?
(iii) The Motherboard would come with SATA-cables, would I need more cables to connect the Motherboard to the PSU or the GPU’s?
Thanks!
Tim Dettmers says
Hi Imahn,
What you want to look out for is a motherboard with specifications that say x16/x16/x16 (or the same with x8 instead). This indicates that the motherboard supports three GPUs and that each GPU gets 16 lanes. This is different from how many PCIe slots you have. The easiest way to check a motherboard is to go to Newegg.com, as it has the most information on hardware and the information is standardized. It seems your motherboard only supports one GPU.
Lauren says
Thanks so much for your post!! I’m trying to build my first personal setup for deep learning. I’m hoping to start with one RTX 3090, but then have space for up to three if I wanted to expand in the future. Do you have any advice on this setup:
CPU: Intel Core i9-10850K Comet Lake 10-Core 3.6 GHz LGA 1200 125W
GPU: RTX 3090
MB: ASUS WS X299 SAGE LGA 2066 Intel X299
RAM: 128GB: Corsair LPX 8*16GB 3200
PS: CORSAIR AX1600i 1600W
Case: Fractal Design Define 7 XL
SSD: HP EX920 M.2 1TB PCIe NVMe NAND SSD
HD: Western Digital 4TB
My plan was to start with 1 GPU, and allow room to expand. Looks like this setup should support up to three with room around each GPU for cooling? Do you think I’d need an additional cooler? Any other advice or suggestions on this setup?
Thanks so much, I really appreciate any thoughts that you have!
Lauren
Tim Dettmers says
Hi Lauren, the build looks fine to me. Make sure that you can fit all potential three RTX 3090 in the case that you chose.
Brandon Wolfson says
Hi Tim,
Thanks for this article, this is super helpful for a first time builder like me. I had a few questions:
1) Do you think intel i9-10900F will be enough cores (it has 10 cores) for a dual RTX 3090 build? I know you recommend min 2 cores/GPU, but I asked Lambda Labs and they recommended min 12 cores for dual RTX 3090 and so I got worried.
2) Also, I realize dual RTX 3090 build is probably impractical with this mobo. In that case, do you think a RTX 3090 and RTX 3080 Ti (hopefully it comes out) would work well with this setup?
3) Lastly, do you have any thoughts on vertical monitors or ultra-wide monitors?
Here is my PCPartsPicker list too:
PCPartPicker Part List: https://pcpartpicker.com/list/zvCZ2V
CPU: Intel Core i9-10900F 2.8 GHz 10-Core Processor
CPU Cooler: *Noctua NH-D15 82.5 CFM CPU Cooler
Motherboard: ASRock Z490 Taichi ATX LGA1200 Motherboard
Memory: G.Skill Ripjaws V Series 32 GB (2 x 16 GB) DDR4-3200 CL16 Memory
Storage: ADATA XPG SX8200 Pro 2 TB M.2-2280 NVME Solid State Drive
Video Card: NVIDIA GeForce RTX 3090 24 GB Founders Edition Video Card
Video Card: NVIDIA GeForce RTX 3090 24 GB Founders Edition Video Card (or maybe RTX 3080 Ti for this 2nd GPU)
Case: Lian Li O11D XL-W ATX Full Tower Case
Power Supply: Corsair HX Platinum 1200 W 80+ Platinum Certified Fully Modular ATX Power Supply
Thanks!
Brandon Wolfson says
BTW, I plan to get more system memory too once I get a 2nd GPU. I was thinking 32 GB to start with 1 GPU, then buying more once I get my 2nd GPU. Thanks!
Tim Dettmers says
Hi Brandon,
1) 10 cores is plenty, you will be fine!
2) I would try to avoid mixing GPU types to allow for better parallelization. If parallelization is not that important for you, this could be a good option.
3) I personally dislike vertical monitors. It is easier on the neck to move left-to-right rather than up-down. I think even large documents can be read quite well on an ultra-wide monitor.
The build looks solid to me!
John B says
Hello Brandon,
I am looking to build my first setup for DL and would like to know how your choice of parts is working out for you so far?
TOBIN says
Hello Tim,
Thanks for the blog. I read it fully, and I am building a deep learning machine; I would like to have your expertise in building a perfect machine for my purpose.
I am a start-up. I am building this machine for my start-up, which does people tracking using a multi-camera setup, and deep learning classification based on their actions.
I intend to run the multi-camera people tracking and deep learning classification on the GPU and use the output.
Once I get this project to the working phase, I will use this machine as a backend server for tracking and classification for a small group of around 20 to 30 people.
I am building this machine on a budget of around 3500 Dollars. This is my build:
GPU: 1x ZOTAC RTX 3090 (~ 2000 $)
RAM: 1x CORSAIR VENGEANCE LPX 32GB (16GBX2) DDR4 DRAM 3000MHZ C16 MEMORY KIT (~ 140$)
CPU: AMD RYZEN 5 5600X PROCESSOR (UPTO 4.6 GHZ) (~ 440$)
SSD: 1x Samsung 970 PRO NVME M.2 512GB SSD (~ 365$)
Hard Disk: 1x WD BLUE 2TB INTERNAL HDD (~ 65$)
PSU: ANTEC HCP1000 80 PLUS PLATINUM SERIES 1000 WATTS (~ 160$)
Case: DEEPCOOL GAMERSTORM MACUBE (~ 110$)
MotherBoard: GIGABYTE B550 AORUS ELITE AX WIFI (~ 215$)
Any suggestion on this build?
What type of cooling is required for this machine?
Should I go for cheaper products anywhere in this lineup?
Is the RTX 3090 enough, or should I use a Titan RTX or anything else?
How about the CPU — is it enough for this workload?
Anything else you want to comment on?
Tim Dettmers says
Hello Tobin,
this looks good. One thing though: if you want to use the server as a backend for 20 to 30 models, it might be more manageable to have multiple smaller GPUs to spread the load. I am not sure how much memory you will need and what your budget is, but either multiple RTX 2070, RTX 2080 Ti, RTX 3070, or RTX 3080 Ti cards might be a better choice than a single RTX 3090.
John B says
Hello Tobin,
I am watching this space to get an idea of how to build my first setup for DL and would like to know how your choice of parts is working out for you so far?
Dominik P says
Hi Tim,
Thanks for the great article! I’m a first-time PC builder, trying to build an ML Workstation on a 2000-3000€ budget. This is my current plan:
CPU: Ryzen 9 3900X
GPU: RTX 3090
MB: Asus ROG Strix X750-E Gaming Wifi
RAM: Corsair LPX 2*16GB 3200
PS: Corsair 750W
Case: NZXT H710
SSD: Samsung 970 Evo 1 TB M.2
Could you give me some advice on this setup?
Should I get a cheaper motherboard?
Air or liquid cooling? (an amd wraith prism air cooler is included with the cpu)
Is it a problem that the CPU has no integrated graphics? I will probably lose 1-2 GB of VRAM to Xorg, right?
Any other recommendations?
Thank you so much!,
Dominik
Tim Dettmers says
The question would be: why get a PCIe 4.0 motherboard in the first place? For gaming, you might get some advantages, but not really for anything deep learning related. So if you want to build a pure deep learning machine, I would maybe buy a cheaper motherboard. On the other hand, if you later get NVMe SSDs which support full PCIe 4.0 speeds and a second GPU, it might be worth it if you run some things which are very storage-intensive, such as deep learning with very large datasets. Otherwise, the build looks good!
Emmanuel says
Hi Tim,
Thanks a lot for your useful and complete article.
I'm a French PhD student in the first month of my thesis. I have a question about your paragraph on memory requirements for research in deep learning.
You say that for research that is hunting state-of-the-art scores we need GPU memory >= 11 GB.
I assume that with less than 11 GB of memory the computation still works fine, except that it takes longer. Is that true?
Bravo for your analysis 🙂
Kind regards
Emmanuel
Tim Dettmers says
Some deep learning models are so large that you cannot run them with an 11 GB memory (you might be able to do so with some complicated tricks). These models are usually some big transformer models. If you run only computer vision, you can come quite far with 8-10GB but your networks might be a bit slow because you need to run them with a very small batch size.
Maciek says
Hi,
Thanks for the post.
Did you try using Windows with WSL2 for DL? This could solve some of your problems (and create new ones).
Tim Dettmers says
As I understand it, GPUs and WSL2 do not have an easy time working together. However, both PyTorch and TensorFlow have pretty good native Windows support, as I understand it. So one does not need to use WSL2.
Ryan Adonde says
Hi Tim, thanks for the great blog.
I was wondering if you had any thoughts on an 8 GPU setup with dual root architecture (4 GPUs attached to each CPU). My main focus is distributed training across all 8 GPUs, but have concerns that the CPU/CPU interconnect may become a bottleneck for communication between GPUs as some other sources have suggested that dual root is a big no-no when trying to scale across 8 GPUs (for this exact reason). However, the cards I am looking to get do not support P2P (2080 ti) and will have to send data via the CPU anyway (for the cards attached to the same CPU) so was wondering if you had experience on how problematic that extra hop across CPUs will be for the cards that are not connected to the same CPU.
Many thanks
Tim Dettmers says
Usually, it is not that big of an issue and parallelization is still quite fast. If you use the right software (integrated into most libraries), then the GPU memory will be transferred to a pinned CPU buffer which can be directly transferred to the other CPU/GPU. Overall, the communication cost should only be about twice as expensive as normal. That sounds like a high cost, but communication is only a small part of the training costs. As such, your models will still be quite fast when parallelized. You can expect something like a 6.75x speedup compared to 7.25x for a system with P2P and a single root.
Ryan Adonde says
Thanks for the info! Much appreciated
Bob O says
Hey Tim,
I finally bit the bullet and built a machine (with the exception of the scarce 3070 that will go in later). You mentioned a software post, but I could not find it. Do you have favorites that you would suggest for my software setup? I was going to start with Ubuntu because it does not appear I can access the GPU with the VMs I have from Windows.
I am looking for specific things like openCV in python, caffe, keras, enabling and using my GPU… basically exactly what you have done for hardware, but the step following assembly!
Thanks, and please keep up the good work.
Tim Dettmers says
Hey Bob,
unfortunately, I do not have a software guide. I would recommend Ubuntu since using GPUs through a VM can be a pain (or not work at all, depending on the motherboard). In terms of software, I would look into Anaconda3 on Ubuntu which is a package manager for scientific computing. You can download it freely and can install all the software that you mention without the need for compiling anything. Compiling OpenCV for example can be a pain whereas in anaconda you just execute “conda install -c anaconda opencv” and you are done.
Good luck!
Frank Fletcher says
Hi Tim, thank you for sharing all your work with these hardware guides!
I don’t know if you have an update for this article somewhere else but there are now several ways to control the fan curves for NVidia GPUs. The easiest way I’ve found is to use GreenWithEnvy.
https://gitlab.com/leinardi/gwe
Tim Dettmers says
Thank you, Frank, I have not seen it before! This looks excellent, thank you for sharing! Another package I know about is coolgpus which is designed for servers where some of the NVIDIA options are not available because no monitors are connected to the GPUs. So coolgpus is pretty good for servers, but the gwe package looks a bit better than coolgpus for the desktop case.
Matt says
Hey Tim!
Thank for this post – really helpful! For my first PC build, I’m planning on using a Ryzen 7 3700x CPU and RTX 2080 Super (will replace in the future with the new GPUs). You talked about GPU cooling but what is your opinion on CPU cooling? Is the stock cooler for my CPU not good enough and should I consider AIO solutions?
Thanks!
Tim Dettmers says
Hey Matt! Often, a stock cooler is okay for the CPU, although it can be a bit loud. Many people now install AIO solutions on their CPU for better and more silent cooling. However, it was shown that a good air cooler is often just as good and even more silent than AIO water cooling solutions. The bottom line for deep learning though is mostly noise: if a bit of noise is okay, then go with stock; if you want a more silent setup, go with either an AIO or a good air cooler. In either case, if you train large models that saturate your GPU, your GPU will also be quite loud, so in that case a silent CPU cooler will not make the greatest difference. I personally prefer as silent a working environment as possible, and I always buy a dedicated CPU cooler.
Bob Nestor says
Hi Tim:
Awesome content! I’m a retired software engineer looking to learn more about AI & ML.
I have a few questions about H/W:
– Intel or AMD (I’m leaning towards AMD using an X570 MB)
– Best starter OS
and S/W:
– Best courseware
– Best learning samples.
Cheers…Bob
Thanks for sharing your knowledge.
Tim Dettmers says
Hi Bob!
– AMD CPUs are great; so an X570 MB is great.
– Use Ubuntu 20.04 + Anaconda + PyTorch. If you want to do deep learning that is the way to go. You will have the least issues overall if you use that.
– fast.ai is by far the best course for deep learning for software engineers
– just google around for pytorch samples for the models that you learn about in the fast.ai classes
Good luck!
Bob Nestor says
Hi Tim:
Glad I found your site. I truly appreciate your help in advancing my knowledge of AI & DL. Really appreciate your help.
Cheers…Bob
haykelvin says
Hi Tim, thank you so much for this awesome article. It is very informative and interesting to see those numbers (both theoretical and from actual testing) in the reasoning. I am new to the field and started playing with PyTorch recently; I am sure this will help me and lots of others make wise choices when selecting hardware in future ML builds.
I have read through the thread and saw your concern about AMD GPU software compatibility issues. I see a few good deals on the Vega FE in my local second-hand market; the 16 GB of RAM looks sweet on paper. Do you think those can give me some good bang for the buck if I don't mind experimenting with them a bit? I also see cost-efficient upgradability if I want to get more of them on the future second-hand market. Or would you recommend just sticking with CUDA?
Tim Dettmers says
Our community could definitely use more AMD enthusiasts. Currently, AMD GPUs work for deep learning, but their performance is not as good and there might be some hidden issues here and there. So if you want to just get things running, I recommend NVIDIA + CUDA. If you want to contribute actively to the community, AMD and ROCm are great — this option helps a lot to diffuse the NVIDIA monopoly over time, but you can expect a more frustrating experience.
joy says
Hi, I need one recommendation: I have 8 GPUs to build a deep learning machine.
Which motherboard (that supports AMD 7000 series CPUs) would you recommend to support 8 PCIe x16 slots … and multiple M.2 SSD slots too …
Tim Dettmers says
There is no regular motherboard from desktop vendors that does that, I think. You would need to go with specialized server motherboards like those from Supermicro. I have too little experience with servers to recommend a particular motherboard. Usually, you just go with what you need, what is cheap, and where support and warranty cover all issues, so you will get it working and keep it working without any problem.
Christophe Bessette says
Hi Tim,
I wanted to start by saying that I loved reading your GPU and Deep learning hardware guide, I learned alot!
It still left me with a couple of questions (I'm pretty new when it comes to computer building and specs in general). I'm mainly interested in Deep Reinforcement Learning, and I read that for DRL the CPU is much more important than it is in other fields of deep learning because of the need to handle the simulations. So I'm wondering if going with a Ryzen 5 2600 is enough or if I should go with something which has more cores, a higher clock, and/or more supported memory. Also, with DRL, can I get away with a cheaper GPU like the RTX 2060 or the GTX 1070? I'm not really on a tight budget, but I'm looking to make it as cost-effective as possible while not being restrained too much by my PC.
I don’t know if it matters but i’m mostly trying to do Reinforcement Learning for financial markets trading.
Thank you!
Tim Dettmers says
Hi Christophe,
I think for deep reinforcement learning you want a CPU with lots of cores. The Ryzen 5 2600 is a pretty solid counterpart for an RTX 2060. GTX 1070 could also work, but I would prefer an RTX 2060 for DRL. You could also wait a bit for the RTX 3060 and get a cheap threadripper to improve performance further. However, that setup might be a bit more expensive and you have to wait longer to get the parts.
Chris-Sij says
I’m looking to build a home-based machine learning setup that will utilize transfer learning and classification and apply findings comparatively to CT’s. I’ll have a plethora of data but my actual data input size is incredibly small in single instances. I’m looking to build a system that provides the most bang for my buck and have a desire to build a machine around a Titan XP, if possible (or advised). There’s a potential for getting a second Titan for future work if the single one is not enough or up to the task later on. However, I’m unfamiliar with Nvidia based setups when it comes to personal building so I’d love some advice on what kind of other parts I should be looking to pick-up. I’m most likely going to be pairing this single Titan with 32GB of RAM (2-16GB sticks), but am pretty much stuck after that point. I’d appreciate any direction you could provide as this is all new territory to me and am trying to avoid cloud-computing services like AWS for the time being.
Thank you in advance!
Tim Dettmers says
Have a look at my other blog post about GPUs. There I have some “barebone” setups for 2 GPUs which you can use as a guide for your build.
John Heffernan says
You really are something else! You have provided some exemplary resources for me. Seeing the number of responses to comments you have is incredible. Thank you very much for what you do!
Tim Dettmers says
Thank you 🙂
darklinux says
Hello, I am faced with a legitimate dilemma: I plan to create a cluster of computers for machine learning, as part of my startup, but I have a very limited budget, I hesitate between two options:
1: two RTX 2070s with a server running ryzen 5
2: four GTX 1660 super / ryzen 5
all with https://cnvrg.io/
Help !
Tim Dettmers says
I would definitely go with two RTX 2070 Super. The memory on GTX 1660 is just a bit small.
darklinux says
thanks for your reply, i was thinking 3070, not 2070
Tim Dettmers says
If it would be two RTX 3070s for the price of 4x GTX 1660 then definitely go with the 2x RTX 3070!
darklinux says
thank you , good weekend
Abhishek says
Hi Tim
Your article is nice and informative. You have got really great experience with server configurations. I had small doubt!
What kind of server configuration would be required to do video analytics on 30-40 4MP CCTV cameras simultaneously? It is basically a boundary surveillance project. In the video analytics, the task would be to identify humans, animals, or birds. I am inclined to use Intel processors in general.
What if the number of cameras is 12? What configuration would you suggest?
Thank you
Tim Dettmers says
At 4MP and a framerate of 30 fps you have about 36 MB per second, times 40 cameras that means about 1.4 GB/s. The main problem here is to store that data and pass it quickly to GPUs. An NVMe SSD RAID would be very helpful here. Otherwise, it depends on the network and the resolution that you want to process. 4MP is pretty large and you definitely need to downsize the images. Downsized images with YOLO can be processed at about 200 FPS, which means you need about 6 GPUs to process the data efficiently. These figures are for RTX 20 GPUs, so I imagine 4x RTX 30 GPUs could work. If you reduce the frame rate to 1/4, i.e. about 8 fps per CCTV camera, you could process everything on a single GPU.
Xuan says
Hi Tim,
Thank you for the detailed description of building a deep learning machine. I would appreciate your suggestion on whether the config I am building will work out well or not.
CPU – AMD Ryzen 3900x (12core)
CPU Cooler – corsair H100i RGB
Motherboard – Gigabyte X570 aorus ultra
Memory – Corsair Vengeance LPX 32 GB (2*16) DDR4-3200
Storage – Samsung 970 Evo plus 500GB M.2-2280 NVME SSD and HDD 1TB
Graphic Card – ZOTAC RTX2080 ti
Power Supply : EVGA SuperNOVA G2 1300W
Case fan: Cooler Master Blade master 40.79 CFM 80mm Fan
I was wondering if I can add one more graphic card (2080 ti) to the above config. Does the above motherboard support 2 graphic cards 2080ti?
Tim Dettmers says
The build looks good! The motherboard supports multiple GPUs, so that would be an option. If you only get a single GPU, you do not need a power supply of 1300W; 700-800W would probably be sufficient if you go for a single RTX 30 GPU. With 2 GPUs, it makes sense to go for 1300W just to have a bit of extra room (also for a third GPU).
Alex says
Hello Tim.
I’m now looking for a good laptop to start with deep learning.
Can you advise me, please, whether an HP Omen 15″ model with an Intel i7 processor, 16 GB RAM and
an Nvidia RTX 2070 GPU (8 GB) is a good choice?
If it is not, what is a good laptop in your opinion?
Thank You in advance.
Alex
Tim Dettmers says
I do not know much about laptops. There are many other things to consider because laptops can be quite personal (battery life, weight etc.). In terms of deep learning performance, i7, 16 GB RAM and RTX 2070 sounds very good for a laptop. With that, you would definitely be able to do some pretty good deep learning.
Keshav says
Hi, I was planning to build a PC with the RTX 2060 Super. Should I wait for the 30xx series in terms of price and performance, or shall I go ahead and order it? Since the RTX 20xx series is getting discontinued, I need to make a decision soon.
Andrew Hughes says
Hi Tim,
I’m a little confused by this:
“Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models which are twice as big with the same memory compared to GTX cards. As such RTX cards have a memory advantage and picking RTX cards and learning how to use 16-bit models effectively will carry you a long way.”
Does this mean that because of the lack of precision the memory requirement is halved, hence you can have a model which is twice as big for a cards with the same RAM.
I’ve also read in other places about “models not fitting into memory”, what does this actually mean?
What are we “Fitting into RAM”? Is it a combination of the model itself and the data? Or just the model? Or just the Data? I thought using things like TF we load things in batches anyway. So why does this matter?
Could you clear this confusion up for me?
Thanks
Tim Dettmers says
The data usually takes up almost no memory since we, as you rightly pointed out, only load one batch into GPU memory. Otherwise, it depends on the model that you are working with. Convolutional networks are very small models with very large activations, while transformers are somewhere in between (weights, gradients, and activations are all large). Activations here refer to the data representations which are passed through the network. These need to be stored to compute the gradients in the backward pass.
Mira says
Hello, I would like to build a computer with an AMD 3rd gen CPU.
My demands are focused on PCIe lanes. It must have a GPU at x16 together with 2 LAN cards at x4 + x1 and two M.2 SSDs (at least one at full x4 speed).
My question is, would this work on a mainstream AM4 MB with PCIe lanes running at x16/x4/x1 (+ x4 for the SSD)?
I don't want to downgrade it to x8/x8/x1.
Tim Dettmers says
I am not quite sure if that works. As I understand it, there are a lot of different combinations of Ryzen 3rd gen CPUs with PCIe 4.0 motherboards, but I think in any case one is only able to use 16 lanes for the GPU if you just use the M.2 SSDs. If you add 2 LAN cards, I think this will downgrade the GPU to 8 lanes. I could be wrong about this, but as I understand it, in many cases you still have “extra lanes”, but these are not distributed equally across all slots, and using some components will draw lanes away from the GPU.
Mira says
If you check the manual:
https://dlcdnets.asus.com/pub/ASUS/mb/SocketAM4/PRIME_X570-P/E15650_PRIME_X570-P_UM_WEB_V2.pdf
There are other, similar boards; you can see that for dual GPU there are x16 + x4 lanes.
CPU provides x16 lanes
PCH provides x4 + 3*x1
Therefore in this case it should work. Is that correct?
Odyssee says
Hi there, thanks for the study!
Do you have any benchmark recommendation to test all those facts? A benchmark that could highlight the impact of an overclocked CPU, PCI lanes, …
Common benchmarks focus only on GPU comparison.
Thanks,
Tim Dettmers says
Sorry, I think I no longer have the code; I never uploaded it to GitHub. I used a Linux tool that downclocks the CPU core clock rate and benchmarked performance that way. For PCIe lanes, you can just use NVIDIA's CUDA samples for benchmarking.
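If you prefer to stay in Python, a rough substitute for the CUDA bandwidthTest sample is to time pinned host-to-device copies yourself. A minimal sketch, assuming PyTorch; the buffer size and repeat count are arbitrary:

```python
import time
import torch

n_bytes = 256 * 1024**2  # 256 MB test buffer
host = torch.empty(n_bytes, dtype=torch.uint8).pin_memory()   # pinned CPU memory
dev = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

torch.cuda.synchronize()
start = time.time()
repeats = 20
for _ in range(repeats):
    dev.copy_(host, non_blocking=True)  # CPU -> GPU over PCIe
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"~{repeats * n_bytes / elapsed / 1024**3:.1f} GB/s host->device")
```

Running this with the GPU in an x16 slot versus an x8 or x4 slot gives you a quick feel for how much the lane count matters for transfers.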
TK says
What about a laptop that is equipped with an RTX 2070 Super Max-P? Would it be sufficient for deep learning? I understand that the mobile GPU is less efficient than a desktop GPU, but I'm working from two sites, so having a laptop is definitely a win for me.
Tim Dettmers says
You can do many things with that GPU, but not all models will fit into 8 GB of memory, and it will be about half as fast as a desktop GPU. If that is okay for you, it might be a good option.
Danilo Cominotti Marques says
Tim,
I understand that the 8GB VRAM situation shouldn't be a problem in terms of “not being able to fit some models at all” if Linux and CUDA are used and 'per_process_gpu_memory_fraction' >= 2 and 'allow_growth' = true are set in TensorFlow (because then it uses CUDA Unified Memory). Of course there could be significant performance impacts, but then you'd be able to fit bigger models. Any thoughts on this?
Tim Dettmers says
Unfortunately, this feature will be too slow to train neural networks. There are some similar techniques that you could use (memory swapping), but these need to be programmed or tuned for each neural network separately and do not work automatically. With such techniques, 8 GB of GPU memory can be enough even for the largest neural networks, but unified memory does not allow for this right now.
lazy_propogator says
Hello Tim, hope you are doing well. Can you help me choose between the AMD Ryzen 7-3750H Processor vs an Intel i5 processor for the CPU? They are being coupled with the RTX 2060 and the GTX 1660 Ti respectively. Would the AMD CPU be a bottleneck for the RTX? Are there any potential problems which can arise in the use of AMD CPU processors in deep learning?
Mayur says
Hi Tim,
Thank you for the detailed description of all the essentials in setting up a deep learning machine. I am currently building my DL machine and would appreciate your suggestion on whether the config I am building will work out well or not.
CPU – Intel Core i7-9700K 3.6 GHz 8-Core
CPU Cooler – Fractal Design Celsius s24 87.6 CFM Liquid CPU cooler
Motherboard – Asus ROG STRIX Z390-E Gaming ATX LGA 1151
Memory – Corsair Vengeance LPX 64 GB (4*16) DDR4-3200
Storage – Samsung 970 Evo 500GB M.2-2280 NVME SSD
Graphic Card – ZOTAC RTX2080 SUPER AMP 8GB GDDR6 ZT-T20820D-10P
Case: NZXT H500 ATX Mid Tower Case
Power Supply : Corsair RMx(2018) 850 W 80+ Gold Certified Fully Modular ATX
Case fan: Cooler Master Blade master 40.79 CFM 80mm Fan
I will add 1 TB HDD with above config.
Tim Dettmers says
The CPU is a bit overkill if you want to just do deep learning. If you want to also do other things with the computer it looks pretty good. This would be a well-balanced build for Kaggle competitions for example.
Russell says
Hi Tim,
Thanks for your article. It helped me to select parts for my rig.
I would appreciate some feedback on my selection.
https://au.pcpartpicker.com/list/JFhR8M
As the mobo has 1x x16 slot and 1x x8 slot and hence the 2nd GPU is only getting PCIe 3.0 x8, should I get a different mobo that supports both x16?
Cheers,
Russ
Tim Dettmers says
Looks solid to me!
Greg says
Hi Tim,
thank you for the all information that you put in here.
However, I have a problem with choosing a GPU for my motherboard, which is an ASRock Z270 Pro4. I am trying to upgrade my PC for software like ZBrush, Maya, Substance Painter etc.
I am considering of buying one of these:
GEFORCE RTX 2060 OC REV2 6144MB GDDR6 PCI-EXPRESS GRAPHICS CARD
GEFORCE RTX 2060 VENTUS XS OC 6144MB GDDR6 PCI-EXPRESS GRAPHICS CARD
The problem I have is that on one website I looked at, these GPUs perform average with my motherboard.
I don’t want to waste my money so I would like to ask you if you know any website where I could compare performance of GPU with a motherboard or if you have any suggestions what GPUs would be the best for my motherboard.
Thank you and looking forward for your reply
Tim Dettmers says
I think they should perform equally well on the motherboard. I am not sure why it would be otherwise.
Marc says
32GB vs 64GB of RAM. Given current RAM prices, is it worth just going for 64GB.
Doing work mostly in computer vision.
CPU: Ryzen 7 3800x
GPU: 2 x RTX 2070 Super
Tim Dettmers says
Yes, I agree. RAM prices are fluctuating but right now RAM is pretty affordable!
Wen says
Hi Tim,
Thanks for your post. I have an i7-3770 4-core desktop with 16 GB RAM. I was happy to see in the Q/A above that it's ok to buy an RTX 2080 Ti. But then I read on another website that the power supply won't be enough for a GTX card above the 1030 on an Optiplex 7010 motherboard, which is what I have. I haven't confirmed that the PSU is 250W (for which I suppose I need to open the cover and check physically). Do you know if that is true?
Thank you.
Tim Dettmers says
That is true: if the PSU is only 250 watts, you will not be able to run an RTX 2080 Ti on it. If you start upgrading the PSU, though, it might be worth thinking about whether to upgrade or to build a new desktop entirely. Both options can make sense depending on budget and other constraints.
Dimiter says
Hi Tim,
I have an old X99 MoBo and Intel CPU (Intel i7-5930K) from 2015. One (or possibly) both of them died; I cannot really troubleshoot without replacing them. I am thinking of buying both, but the question is move to AMD vs stay with Intel: AMD's 1920X or 2920X with an X399 MoBo vs an Intel CPU (not sure which one, the comparable Intel 7900X is crazy expensive) and an X299 MoBo. It seems much more economical to move to AMD, even if it would complicate processor water cooling etc. What do you think?
Thank you.
sourav says
Hi Tim,
I currently have a gaming PC with a 1050 Ti, 8GB DDR3 RAM, and an AMD FX6300 processor. I would like to upgrade to an RTX 2060 SUPER/RTX 2060, 16GB DDR4 RAM, and an Intel CPU. I want to save money for the GPU, thus I decided to use a budget CPU. I am thinking about the Intel Core i3-9100F (4 cores, 3.6 GHz, 65 W, locked, 80$). This does not come with integrated graphics (I will add the GPU). Is that a good CPU for a single GPU build? Or should I look for an old CPU & motherboard combo under $120 on eBay? If yes, which CPU and motherboard would be a good fit for my budget?
Thank You
Tim Dettmers says
I think the CPU should be more than fine for a single GPU. You should worry more about other applications (CPU-based ML for Kaggle competitions, for example) that might be bottlenecked by the CPU. You could also roll with an AMD CPU, which are now pretty cost-efficient and powerful, but it would only make a small difference.
Jochen van Osch says
Hello Tim,
thank you for the insightful article. Maybe you can also give me some guidance on the choice of GPU. I work in a hospital and want to start with deep learning projects on high resolution image datasets from MRI and CT.
Our budget for buying a PC is ~5000 euro.
The GPU choice I am now thinking about is: Titan RTX versus 1 (or 2) RTX 2080 Ti?
(in combination with a AMD Threadripper X399-a CPU. )
I think we will not run multiple projects at once, but I want to be on the safe side GPU-memory-wise, and the Titan has 24GB.
Kind regards,
Jochen van Osch
Tim Dettmers says
I would definitely go with the Titan RTX! The memory will be a life-saver if you work with medical images! Also make sure to invest in NVMe SSDs, as loading large unprocessed images can be a big bottleneck. I recommend getting a motherboard that supports 3x NVMe SSDs, buying three of them, and setting up a virtual SSD RAID 0.
Adam TS says
I was wondering what the lower limit on RAM speeds is? I am looking at repurposing old server hardware and have 64gb of 1333mhz DDR3 memory and was wondering if this would be a bottleneck? Also I have committed to offsetting my carbon footprint, and wanted to thank you for encouraging others to do the same!!
Tim Dettmers says
It can always be a bit tricky to re-purpose old hardware, but if the computer boots with the RAM sticks then they should not be the biggest bottleneck. Since you rarely use the RAM in deep learning training, and since the RAM is usually of similar speed to the PCIe bus, it should not be a big bottleneck. If you run DDR3 memory with 4 GPUs, the PCIe bus and the RAM should be of about equal speed and you should only lose about 5-10% performance.
Pranav Lal says
Hi Tim, I am stuck with an Nvidia GeForce GTX 1660. Do I stand a chance with this model of GPU, or do I need to buy something else? The problem is that the RAM is only 6 GB, but I cannot afford anything more.
Tim Dettmers says
It will be difficult, but you can look up techniques to conserve memory. You will probably also need to accept running smaller datasets and models.
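One such technique is gradient checkpointing, which recomputes activations during the backward pass instead of storing them, trading compute for memory. A minimal sketch, assuming PyTorch; the two blocks are just stand-ins for segments of a real network:

```python
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
block2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()

x = torch.randn(64, 1024, device="cuda", requires_grad=True)
h = checkpoint(block1, x)    # activations inside block1 are not stored...
out = checkpoint(block2, h)  # ...they are recomputed during the backward pass
out.sum().backward()
```

Combined with 16-bit training and smaller batch sizes, this can stretch 6 GB quite far.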
Aaric says
If I am to start out building supervised and unsupervised models, then do I really need a graphics heavy computer?
Tim Dettmers says
Yes, usually even if you just want to get started a GPU is required. A CPU can be quite slow even for small problems.
Satchel says
Hi Tim! I’m an undergraduate student primarily focused on Kaggle competitions and personal projects – I have an old PC ( i7 870, 16GB DDR3 ) that I plan on upgrading with a 1070/1070Ti GPU and some new age SSD’s.
The CPU fits your requirements (8 threads, 1 GPU) but is more than 10 years old and only supports PCIe 2.0×16. How significant would this bottleneck be in your opinion, and does it warrant an upgrade to a modern CPU?
Finally, I’m curious about your opinion on AMD Cards / ROCm stack, as I have access to a R9 290 and Vega 54.
Thanks for an insightful article,
Satchel
Tim Dettmers says
I believe PCIe 2.0 would be sufficient in your case, but there is not enough data to say that definitively. I would give it a try and upgrade your computer if it does not work out. Theoretically it should be fine.
I would avoid the AMD cards still due to compatibility with software.
Satchel says
Thanks for your advice! I ended up getting a 1070 and it worked pretty decently. (A little faster than colab). Since then I actually entered into a masters degree, and I’m using it quite a bit more often.
I've noticed when training that both my GPU and CPU usage are around 98%-100% (occasionally the GPU at 94%). Sometimes either one may be slightly higher than the other, but it tends to be more CPU bound (98% CPU, 94% GPU, sometimes 100% on GPU).
To me this is a good signal that an upgrade is necessary, but do you think it's a significant bottleneck? I'm curious if upgrading my CPU will dramatically change performance.
Tim Dettmers says
The high CPU percentages do not necessarily mean that your CPU is utilized. Some libraries use active waiting, which will keep the CPU busy with “empty” calculations. The GPU utilization is also not the true utilization; it just means that all cores on the GPU are used (but not by how much).
One test you can use to decide on a CPU upgrade is to limit the CPU frequency manually. This can be done with some CPUs on Linux (for Intel it is easy; for AMD I am not sure, but it should be possible). Then you can compare the performance with an underclocked CPU. If it is much lower, then the CPU is a bottleneck. If the performance is similar, the CPU is not a bottleneck.
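To make the comparison concrete, you can time a fixed number of training iterations once at normal clocks and once with the CPU frequency capped (for example with the Linux cpupower tool). A sketch of such a timer, assuming PyTorch; `model` and `loader` are placeholders for your own training setup:

```python
import time
import torch

def samples_per_second(model, loader, n_batches=50):
    """Measure raw training throughput over a fixed number of batches."""
    model.train()
    torch.cuda.synchronize()
    start, seen = time.time(), 0
    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        loss = model(x.cuda(non_blocking=True)).sum()  # placeholder loss
        loss.backward()
        seen += x.size(0)
    torch.cuda.synchronize()
    return seen / (time.time() - start)
```

If the number barely moves when the CPU is underclocked, the CPU is not your bottleneck.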
Michel Gartner says
Hi Tim,
I have doubts in order to choose a CPU. I know Ryzen 7 3700x or something like that seems pretty good but I’m worried about the Intel MKL library’s issues.
I will buy a RTX 2070 super.
Which CPU do you recommend? Ryzen 7 3700X? i9 9900K? i7 9700K? As far as I know, the i9 and i7 only have 16 PCIe lanes and that could be a problem.
I mostly do deep learning stuff, but I also want to use my PC for some Kaggle competitions (mostly tree-based models that run on the CPU in sklearn).
Regards,
Michel
Tim Dettmers says
MKL library issues are only for things like solvers, Fourier transform, eigendecomposition. I am not sure if that is really that common for Kaggle competitions and you would only be hit by a small penalty. I think Ryzen processors are fine even in your case.
Victor Gorgonho says
Hey Tim,
Could you please tell me which processor you think will fit best with this PC I'm planning to build:
NVIDIA GEFORCE RTX 2070 8GB
2x 8 GB 2666MHz Vulcan DDR4
B450 Asus Prime
HD 2TB 7200 + SSD 500GB M.2
650w PCYes Shocker 80 plus
My options right now would be either Ryzen 7 2700 or Ryzen 5 3600…
Also, is that good enough for a deep learning student? The lab i’m working on uses computer vision to recognize LIBRAS, which is a Brazilian Sign Language…
Appreciate, and loved your article!
Regards,
Tim Dettmers says
Both CPUs are more than fine for one GPU. I might just go with the cheaper one.
Eric Bohn says
For real world applications in a 2 GPU system running RTX 2080 Ti’s at most, is there much difference between x8/x8 and x16/x4? Does it effectively make both perform as a x4 to keep things in sync when running model/data parallelism?
Tim Dettmers says
Both perform at about 4x speed.
Steven says
Hi Tim,
I was wondering if you had any opinion on cube computer cases. I'm thinking of exchanging my full tower case for a cube case to save space, but can't find anything that lets me know what (if any) the cost for cooling my GPU (possibly expanding to 2 GPUs) is. Do you have any knowledge or opinions?…
Thanks
Eric Bohn says
Are 2x GPU for machine learning worth it? Should I buy a board now that allows for 2x x8/x8, or upgrade to Threadripper for multi-GPU later on?
Tim Dettmers says
More GPUs are always better :). If you plan to go for 4 GPUs in the future, it makes sense to get the Threadripper and the right motherboard right away. But then you should ask yourself, do you really need 4 GPUs / is spending that money justified?
Eric Bohn says
My wife isn’t going to like that answer 😛
On average, what kind of performance improvement can I expect when going from one to two GPUs?
Atharva says
Hey Tim,
Please tell me what do you think of this pc:
i7-9700K processor (3.6 GHz, 12 MB)
Hard Drive 2TB 7200+1TB SSD M.2
NVIDIA GEFORCE RTX 2070 8GB
will this be good for deep learning?
BTW loved your article!
Regards,
Tim Dettmers says
I think that is appropriate. I also like Ryzen CPUs if you want to save a bit of money. They would definitely also be more than enough for an RTX 2070 GPU.
Xiaopeng Fu says
Hi Tim,
Many thanks for the wonderful article and all the replies to our queries. Being new to this field and preparing to build my own system, I'm wondering if it's worth waiting for Intel's 10th gen desktop CPUs. Is it a good idea, for example, to buy a 9900K + Z390 board at this moment, knowing the board will not be compatible with future CPU upgrades? Or maybe the 10th gen improvement will not make much difference for DL…
Thanks!
Xiaopeng
Tim Dettmers says
The CPU does not matter that much for deep learning. If you have some workloads which require a better CPU (factorization, sklearn models, some big data stuff) then it might well be worth it to wait. However, if you just want to get started and do deep learning, it might be better to just go ahead now; you will lose almost no deep learning performance if you use a 9900K CPU.
Eric Bohn says
Hey Tim, If you have a moment I’d be curious to know what you think about my build and my reasoning behind it: https://pcpartpicker.com/b/3jw6Mp
Much appreciated.
Eric
Tim Dettmers says
I think it looks good. Two things though: the PSU with such a high wattage is only needed if you want to expand to two GPUs in the future, so think again whether that is really what you want. Otherwise, I would use two of the same NVMe SSD instead of a small and a larger one. I guess you want to store the OS on the smaller one and have the rest for data? The better thing is to use a small partition for the OS and then use a virtual RAID 0 to create a single high-speed device; this can make a huge difference if you work with very large datasets! Otherwise, quite a few spinning disks, but if you need the space, then you need the space.
Eric Bohn says
Thanks Tim.
Yes, the original thought behind the 850W PSU was for two GPUs. That’s also why I have that motherboard – for the x8/x8 configuration. Do you think it’s reasonable to run two GPUs in this setup, or should I plan on moving to a threadripper system when I want to go for the second GPU? My intent for this machine is personal projects and a Masters program, but I also want to be open to scaling for a small startup and/or consulting.
I’m not following on the virtual RAID 0 configuration. Do you mean run Windows and Linux on the same drive on two partitions, and RAID 0 with the second physical drive? Do you have a link?
Is there any reason to not go with a single 1 TB drive for OS and datasets vs a drive for OS and another drive for datasets?
Tim Dettmers says
I think startup stuff and consulting is fair with 2 GPUs. You want to get GPUs with big memory, though, if you want to do startup stuff, preferably a Titan RTX. On Linux a virtual RAID 0 is easy to set up: https://www.digitalocean.com/community/tutorials/how-to-create-raid-arrays-with-mdadm-on-ubuntu-16-04. Not sure about Windows though.
Juliana says
Hi Tim, Thank you so much for you work and for your helpful guides.
I was wondering if you would mind looking at my build project and helping me with a doubt I have (regarding the CPU/motherboard combination).
I’ll use it mostly as a machine-learning/ deep-learning/ computer-vision workstation. My mind is set on a RTX2080Ti (possibly adding a second RTX2080Ti in the future).
Here’s the [PCPartPicker Part List](https://es.pcpartpicker.com/list/mQ46rV).
Picking the motherboard, my reasoning was:
– the 2-way SLI capability allows me to eventually add a second RTX2080Ti later in a x8/x8/x4 PCIe configuration.
– with the X570 chipset, I can get the Zen2 microarchitecture of the Ryzen 3000 series without a bios flashing headache (even though the Graphic cards won’t take advantage of the PCIe Gen 4).
All this results in a somewhat high-end motherboard.
The biggest concern I have is the motherboard/CPU combination: it feels weird to spend less on a CPU than on its motherboard. Isn’t this a bit ‘Frankensteinish’?
There is a 110 euro gap in Spain between the 6-core Ryzen 5 3600 and the 8-core Ryzen 7 3700X.
Isn’t this a bit wasteful for a Ryzen 5 3600? Should I go for a Ryzen 7 instead?
Tim Dettmers says
The CPU/motherboard combination looks pretty good to me for 2 GPUs. It is totally fine if you spend more on a motherboard than a CPU. Over the past years, motherboards kept getting more expensive and some CPUs, especially AMD ones, got cheaper. However, the Ryzen 7 can make sense if you are working with datasets that involve loading and preprocessing lots of data (computer vision, for example ImageNet).
Michael says
Hi Tim, which GPUs would get if you had $10k, and wanted to use them to train large transformer-based models, at home. Note that at home you would have to pay for electricity yourself.
I’m trying to decide between 8x2080Ti vs 4xRTX Titan X vs 2xQuadro 8000. Also note that four RTX Titan X cards in the same chassis will overheat due to their fan type, and I’m not very comfortable to water cool them.
Michael says
No thoughts? I’m seriously thinking getting 4 RTX Titans and watercooling them, but it’s a bit scary.
Hugo says
Hi Tim,
Congratulations for your great work
What setup would you recommend for GPT-2 ( pre-trained language model ) latest release (1.5b parameters) ?
I am intending to train this AI for my research, but I am very unaware of the hardware needed. I have read that numerous users have issues even with quite powerful setups.
Any idea ?
Sorry for my english, not my mothertongue 🙂
Best Regards.
Tim Dettmers says
A minimum would be 4x RTX 2080 Ti. You would have to use very small batch sizes though, which is computationally inefficient, thus I would not recommend RTX 2080 Tis. I would instead recommend 4x Titan RTX, which should have enough memory so you can run GPT-2 and other transformers with a large enough batch size.
Hugo says
Thank you very much for your answer, Tim !
You opened my eyes on an important point regarding the transformers.
I am now studying the matter of batch sizes.
Better to start first a collab notebook and run tests before investing big money…
Filip says
Tim, your hardware guide was really useful in identifying a deep learning machine for me about 9 months ago. At that time the RTX2070s had started appearing in gaming machines. Based on your info about the great value of the RTX2070s and FP16 capability I saw that a gaming machine was a realistic cost-effective choice for a small deep learning machine (1 gpu).
I ended up buying a Windows gaming machine with an RTX 2070 for just a bit over $1000. I ended up modifying the cooling to get positive case pressure (took off the front bezel blocking the airflow) and making it dual boot Windows 10/Ubuntu 18. As a Linux newbie, one gotcha I found was that using a Windows file system results in a performance bottleneck in Linux. So I added an SSD with ext4 for data preprocessing and that made a big difference.
It has been working great for learning deep learning (with pytorch) and Kaggle competitions. I have found this local setup to be faster than Google Colab, Kaggle kernels, and Azure notebooks and long runs are more reliable. The colorful case lights are an added bonus!
Tim Dettmers says
Thanks for your feedback! I think cases like this are pretty common. Some setups will fall a bit short here and there, but with a bit of adjustment you can quickly get a great system that fulfills most of your needs.
Anu Chandra says
Hi Tim,
Thanks for the excellent material. I’ve been working with a 4x2080Ti workstation. Some of the new GAN training work really requires 8x2080Ti. I’ve been looking at server based reference designs – deeplearning11 and deeplearning12 from servethehome.com. I don’t know a lot about servers but it seems (from youtube videos) that they generate horrible fan noise when all GPUs are used. Have you given any thought to a 8xGPU machine that can live comfortably in a home environment? Any thoughts appreciated. Anu
Tim Dettmers says
I do not think you will find an 8 GPU machine which you can comfortably house at home. If you have a small room far from other rooms (bedroom/living room), you might be able to do it if you put some noise insulation into the room and put the server there. It might just be a better idea to rent some GPUs/TPUs in the cloud for whenever you need to run 8 GPU jobs. You can get a 4 GPU machine for most other things and only use the 8 GPUs when you need them. Or do batch aggregation to simulate 8 GPU training. Batch aggregation will just double the training time for you, which should be alright and is doable in a home environment.
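Batch aggregation (often called gradient accumulation) is straightforward in most frameworks. A minimal sketch, assuming PyTorch; the tiny model and random data are placeholders:

```python
import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
accumulation_steps = 2  # e.g. simulate 8 GPUs' worth of batch with 4 GPUs

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(32, 128, device="cuda")           # placeholder mini-batch
    y = torch.randint(0, 10, (32,), device="cuda")
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()                                    # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                               # one update per aggregated batch
        optimizer.zero_grad()
```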
Mike says
Dear Tim,
Hello, I am going to do an undergraduate project on deep learning for medical imaging, and I am looking for a laptop that would let me do the research. Do you have a laptop recommendation?
Tim Dettmers says
I would not recommend laptops for medical imaging deep learning projects. Usually, in medical imaging you will have images with very high resolution and you will need a GPU which has the most memory that you can afford (Titan RTX 24GB). You can buy a desktop and a small laptop with which you login to the desktop when you are on the go. This would be the best solution.
vithin says
Can I use multiple different GPU cards on a single CPU for deep learning?
Tim Dettmers says
Yes, but you will not be able to parallelize across those GPUs.
Alex says
Hi Tim,
Thank you for the post!
I am thinking of building PC and right now my build is:
CPU – AMD Threadripper 1900X 3.8 GHz 8-Core Processor
CPU Cooler – Corsair H100i PRO 75 CFM Liquid CPU Cooler
Motherboard – Gigabyte X399 AORUS PRO ATX TR4 Motherboard
Memory – Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3200 Memory
Storage – Samsung 970 Evo Plus 1 TB M.2-2280 NVME Solid State Drive
GPU – Asus GeForce RTX 2070 SUPER 8 GB Turbo EVO
Case – NZXT H500i ATX Mid Tower Case
PSU – Corsair RMx (2018) 850 W 80+ Gold Certified Fully Modular ATX Power Supply
I am going to use it for training CNNs(Kaggle, not large projects). I’m also planning to add the second GPU after some time. Is this build sufficient for my purposes? Would you recommend any different processor? I’m a bit worried about the cooling – is it enough for the current build and is something needed to be changed for 2 GPUs?
sakshi says
I love your blog. All the knowledge provided here is unique compared to other deep learning blogs.
Well explained, keep updating.
Michiel van de Steeg says
Hi Tim,
I’m looking to build a machine with one 2080 TI, with the ability to expand it to a second card. The difficulty I’m facing is that I want my machine to be quiet, but watercooling is quite complex, expensive, and seems like high maintenance. Blower fans aren’t particularly quiet.
Do you know if (on an x470 mobo) there are ways other than watercooling and two blower fans that would keep the case cool enough? For example, how much would hybrid cooling AIOs help? Say two of these: https://www.evga.com/products/product.aspx?pn=11G-P4-2384-KR (bonus points if you know whether any motherboard has headers for two aio pumps)? Or, if I get an air cooled GPU now, would adding a blower fan to that allow for sufficient cooling?
Hoping you can offer any advice on this issue. I’ve searched the internet but I think my demands may be a bit too high / specific.
Tim Dettmers says
If you just want to run two cards, you can get a motherboard with at least 3 PCIe slots and use non-blower fans. Because you then have a single empty PCIe slot between the cards, cooling is usually sufficient and you can run somewhat quieter non-blower fans. Otherwise, AIOs can help. People have mixed reviews about them, some reporting very low temperatures, others reporting similar temperatures to regular fans. I think you cannot do much wrong with AIO GPUs if you want silent performance; the 3x+ PCIe slot + non-blower fan option is cheaper though.
Michiel van de Steeg says
Thanks for the quick response!
What’s not quite clear to me, though: AM4 compatible motherboards / ATX towers don’t seem like they would support this in terms of physical dimensions. E.g. on the Prime Asus x470 Pro, there are 2 GPU slots with 3-slot-width space, and the third slot only has a 1-slot-width space. I’m not sure how I can manage to put a graphics card in the bottom slot. In many cases the PSU or bottom of the case would be in the way, and most don’t have the right number of expansion slots on the back. Am I overlooking something?
Do you think a hybrid cooled 2-width GPU on the first PCIEx16 and an air cooled 3-width GPU on the second PCIEx16 could work? That would mean there’s 1 slot of space in between the two, with part of the heat from the top one going to the radiator.
Thanks again!
Tim Dettmers says
If you have the right case you can install a GPU in the bottom slot. It only has a 1-slot width, but in some computer cases the GPU just extends beyond the motherboard. If you look for cases that are optimized for GPU airflow, you can probably find a usable case.
Not sure if the hybrid + regular fan would work out.
Ahmad says
Dear Tim Dettmers,
Thank you very much for this blog. This information is really useful for an upcoming deep learning project. In my workplace we are developing a server-like system to run three deep learning projects. To run all the models concurrently 24×7, I need nearly 100 GB of RTX 2080 Ti GPU memory. What type of additional resources do I need to support these GPUs? I need the GPUs only for inference, not for training.
My question may look a little broad sorry for that. If you need any other information please let me know.
Thank you in advance.
Tim Dettmers says
If you require a large amount of memory (to hold different kinds of models?) and only want to do inference, then working on a CPU might actually be an excellent option. For inference, in general, the software will be far more important than the hardware. I think in terms of memory/dollar one of the best options will be the RTX 2080 Ti or the RTX 2060, but I am not sure if memory is really your problem.
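As a small illustration of how much the software side matters for inference, a sketch assuming PyTorch and torchvision (the model is a placeholder): evaluation mode, disabled gradients, and a sensible thread count already go a long way on a CPU.

```python
import torch
import torchvision

torch.set_num_threads(8)                    # match your physical core count
model = torchvision.models.resnet18().eval()

x = torch.randn(16, 3, 224, 224)            # a batch of incoming requests
with torch.no_grad():                       # no gradient buffers: less memory, faster
    probs = model(x).softmax(dim=1)
```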
chanhyuk jung says
I bought an RTX 2060 Super. And I have a system with an i5 3470. I added RAM so it's 16 GB, and it has an SSD and an HDD. With the 3470, only two cores are at 100%. I can upgrade to an 8500, but would it make a lot of difference?
Tim Dettmers says
If you are using PyTorch, try the flag OMP_NUM_THREADS=1 python your_script.py. You can also try this if you are using TensorFlow, but I am not sure if it will help. The difference is mainly determined by your scripts. If they load a lot of data, you can benefit quite a bit from a good CPU. You can also tweak the data loader threads and see how big the difference is. Sometimes you can squeeze a bit more out of your CPU if you tune that parameter. Depending on your script, you can probably expect 0-30% increased speed with an i5 8500.
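A simple way to tune the data loader is to time a fixed number of batches for a few worker counts and keep the fastest. A sketch, assuming PyTorch; the random tensor dataset is a stand-in for your own data pipeline:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(10000, 3, 64, 64),
                     torch.zeros(10000, dtype=torch.long))

for num_workers in (0, 2, 4, 8):
    loader = DataLoader(data, batch_size=64, num_workers=num_workers, pin_memory=True)
    start = time.time()
    for i, (x, y) in enumerate(loader):
        if i == 100:
            break
    print(f"{num_workers} workers: {time.time() - start:.2f}s for 100 batches")
```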
Carl says
Hi, I am considering buying a GPU for deep learning. If I understand this article right there are different models of RTX2070 card. I am looking for 16-bit FP, but I don’t see any information about FP. Could you tell me which parameter in the specification should I pay attention to?
Tim Dettmers says
Sorry for the confusion, but all RTX 2070 have 16-bit capability. You can pick any card. If you have the money though, I would recommend picking the RTX 2070 Super over the regular one.
Fatih Bilgin says
Hi Tim,
I’m trying to set up following PC for your guide. (For especially Kaggle -beginner- competitions) Does it look good? Thank you.
*Asus Turbo GeForce RTX 2070 8GB 256Bit GDDR6 (DX12) PCI-E 3.0 GPU (TURBO-RTX2070-8G)
*AMD RYZEN 5 2600X 6-Core 3.6 GHz (4.2 GHz Max Boost) Socket AM4 95W CPU
*ASUS TUF X470-PLUS GAMING AMD X470 AM4 Ryzen DDR4 3200MHz (OC) M.2 USB 3.1 Motherboard
*Crucial 32GB (2x16GB) Ballistix Sport LT Gray DDR4 3000MHz CL15 1.35V PC Ram
*Corsair TX-M Series TX850M 80+ Gold PSU CP-9020130-EU (850W)
*Intel 660P 1TB 1800MB-1800MB/s NVMe M.2 QLC SSD
*Seagate Barracuda 2TB 2.5″ 5400RPM 128MB Cache Sata 3 HDD
*Thermaltake Level 20 MT ARGB CA-1M7-00M1WN-00 Black SPCC ATX Mid Tower Computer Case
BrN says
Hey Tim, really appreciate your post here. Has been a huge help. I’m currently doing some deep learning application on MRI images using mostly Tensorflow/Keras. I’d like to build a workstation with a 1 GPU set up for now with the plan to up it to 2 GPUs in the future. I don’t think I’ll be going to the 4 GPU set up.
Wanted to get your thoughts on the CPU/PCIe lane situation (here is my build: https://pcpartpicker.com/list/fvWZBb). I’ve had some people suggest I go the AMD 3900x route to take advantage of quad channel memory and more PCIe lanes, but wondering if the current set up would be good enough for a 2 GPU set up in the future (i.e. add another RTX 2080 Ti sometime later).
Thanks!
Tim Dettmers says
You might want to have a slightly bigger PSU if you want to run two GPUs. The extra PCIe lanes are not worth it.
Winston Fan says
Thank you for this great article~! It helps a lot! I also found this article (https://medium.com/the-mission/how-to-build-the-perfect-deep-learning-computer-and-save-thousands-of-dollars-9ec3b2eb4ce2) which recommends the AMD Threadripper 2920X, and here is the build (https://pcpartpicker.com/b/nrjypg). But I went to UserBenchmark and found the Ryzen 7 3700X actually is newer, cheaper and performs better in most aspects.
https://cpu.userbenchmark.com/Compare/AMD-Ryzen-TR-2920X-vs-AMD-Ryzen-7-3700X/m625966vs4043
My question is, should I keep everything same but just replace Threadripper 2920X with Ryzen 7 3700X? or should I stick to TR 2920X? and Why?
Thank you!
Tim Dettmers says
Make sure the Ryzen CPU supports the number of GPUs that you want to have. If it does it is a great and cheap option!
Winston says
thanks. I did my research and found that Ryzen 7 3700X supports up to 2 GPUs, which is fine for me.
So Ryzen 7 3700X is definitely a better choice for me 🙂
Sourav Banerjee says
Hey Tim,
First of all thank you very much for your wonderful article with a great insight about Deep Learning. It helped me to certain extent to understand the hardware requirements needed for any DL machine. But I have an existing system with the following Config:
CPU: Intel Core i5-2500K
HDD: Seagate ST3160812AS 41N3268 LEN 160GB
HDD: Seagate Barracuda 7200.12 500GB
RAM: Corsair XMS3 DDR3 1600 C9 2x2GB
MBD: Asus P8Z68-V
GPU: ? What should I go for as a best optimal upgrade for DL and NLP?
By the Way, Can this CPU perform good for ML or we need to rebuild the PC?
Tim Dettmers says
The CPU will be fine if you use 1-2 GPUs. If you have more you need something better.
Sourav Banerjee says
Which GPU do you suggest? Do I need to upgrade my RAM?
Claus says
Thank you! This is extremely helpful. I need to buy a multi GPU setup suited for deep learning analysis of 3D radiological data, eventually several TB. What kind of setup would be recommended if I have about 30k€ available? Several of your comments relate to smaller systems, so what are the key caveats for larger systems like the one I need to buy?
A more specific question relates to the GeForce 2080 Ti versus the Tesla V100 (10x the price!). Any killer argument for the Tesla? More VRAM than 11GB is needed in our case.
Tim Dettmers says
I would recommend an 8 GPU machine with 8x RTX Titan. Reach out to some hardware vendors that offer these systems. It might be that for such a machine the budget you need is slightly higher (32k euro). If this is the case, a 4 GPU machine with 4x RTX Titan is also great. The RTX 2080 Ti has too little memory for your application. The V100 is too pricey and not good!
Claus says
Thank you very much indeed for your advice. In the meantime my American colleague suggested to go for the V100, despite the price with the following argument:
GeForce cards are totally fine for 2D models, particularly if you want to leverage ImageNet transfer learning by cropping or resizing (the first is better) at 224×224. However, NVIDIA's advanced tools, as for example AMP (automatic mixed precision), might not be available on GeForce and work just on the V100. AMP allows you to train deeper models or larger training batches (faster training) with a limited memory footprint. If you are planning 3D data-driven models or multi-channel models (information from different sequences), I would definitely choose V100 32 GB cards.
If we follow this advice we could only start with V100 GPUs and buy more at a later point in time. Do you have a comment, and could you elaborate on what you meant when stating the V100 is “… and not good”? Thank you once again!
Michael says
Titan RTX is slightly slower than V100, and has 24GB of RAM (vs 32GB in the V100). It supports all the features of V100 (including AMP). Your colleague does not know what he's talking about. If you're absolutely sure that your model is not going to fit in 24GB of RAM even with a batch size of 1, then I recommend going with four Quadro RTX 8000 cards (48GB of RAM at $5,500).
Here’s a good system builder in US:
https://lambdalabs.com/products/blade
Tim Dettmers says
If you think memory is a problem I suggest going with Quadro 8000 cards with 48 GB memory instead.
Lina says
Hi Tim,
Thanks for your great article! I have questions, please can you answer them?!
What about using multiple Titan RTX cards instead of multiple Quadro 5000 cards? Which one will be faster? Also, I found that Lambda uses Quadro and Tesla instead of Titan RTX for their DL servers; what is the point? Is that just for double precision?
Thanks!
Tim Dettmers says
Both are about the same. Lambda uses quadro because they make more profit. It also might be that NVIDIA does not sell them RTX cards anymore. There is a clause in the CUDA license that forbids the use of RTX cards in data centers. So this could also be a reason.
RB says
Hi,
I am looking to do some entry level DL stuff and then build my way upto kaggle. I would appreciate any feedback on the following machine.
https://pcpartpicker.com/list/GxyQCb
Thank in advance!
Francisco Paiva says
Let's say I decide to go with an Intel i9-9900KF, which has only 16 PCIe lanes available. In this scenario, I also use two GeForce RTX 2080 Ti GPUs. If I also use an SSD, which requires 4 PCIe lanes, would I still be able to go for an x8/x8 setting with the GPUs? Wouldn't the system configuration be limited by the max number of CPU PCIe lanes, so that considering the SSD, the GPUs would be forced to an x8/x4 setting? In this scenario, would it be better to get just one GPU if I intend to use parallelism?
Tim Dettmers says
Yes, this is problematic. You can solve this by using a SATA SSD or another CPU.
Francisco Paiva says
Ok! Thank you for the reply =)
Eric Bohn says
Doesn’t this depend? The chipset has PCIe lanes in addition to the CPU right? Therefore, if the m.2 is on the chipset then it wouldn’t take away from the x8/x8 used by the GPUs?
Tim Dettmers says
Yes, you are right. I got it wrong the first time around. Most often your motherboard will provide the PCIe lanes for the PCIe storage and thus it does not take away from the GPU PCIe lanes.
Eric Bohn says
Thank you for clarifying.
lazy_propogator says
Hello Tim, this is a great article! Thanks for all the info. NVIDIA recently released the Super versions of the RTX cards; can you shed some insight on that? They are supposed to be more powerful than their predecessors. It's said that the RTX 2060 Super is almost as good as the RTX 2070, but on the other hand there are reports that the RTX 2080 Super is only slightly better than the RTX 2080. Can you shed some light on this?
Thanks
Tim Dettmers says
I have not analyzed the data of the GPUs yet. What you say seems accurate from my first impression though. So RTX 2070 and 2060 Super are good. RTX 2080 Super not so much.
Steven says
Hi Tim,
Thank you for sharing! I’m looking to build a desktop for prototyping. What are your opinions on Intel i5 9600k vs i7 9700k? Or do you recommend something else? Also can you recommend a good compatible motherboard? I will be using one RTX 2070 for now but would like to be future proof for up to 4 one day.
Thanks!
Tim Dettmers says
If you want to have 4 GPUs consider a CPU with at least 32 lanes and about 6-8 cores.
Nick Jonas says
Will this build provide enough airflow for a 9900K + 2080 Ti? It’s all air cooled, 2080 Ti model has an open-air design with two fans (can buy 3 fans if needed).
PCPartPicker Part List: https://pcpartpicker.com/list/w7rRfH
CPU: Intel Core i9-9900K 3.6 GHz 8-Core Processor (Purchased For $485.00)
CPU Cooler: Noctua NH-D15 82.5 CFM CPU Cooler (Purchased For $89.95)
Motherboard: Gigabyte Z390 AORUS PRO ATX LGA1151 Motherboard (Purchased For $144.99)
Memory: Corsair Vengeance LPX 16 GB (2 x 8 GB) DDR4-3200 Memory ($84.99 @ Amazon)
Storage: Samsung 970 Evo Plus 500 GB M.2-2280 NVME Solid State Drive (Purchased For $109.99)
Storage: Seagate Barracuda Compute 2 TB 3.5″ 7200RPM Internal Hard Drive ($54.99 @ Amazon)
Video Card: Zotac GeForce RTX 2080 Ti 11 GB AMP MAXX Video Card ($1099.99 @ Amazon)
Case: Cooler Master MasterCase H500 ATX Mid Tower Case ($99.99 @ B&H)
Power Supply: Corsair RMx (2018) 850 W 80+ Gold Certified Fully Modular ATX Power Supply (Purchased For $94.49)
Tim Dettmers says
One GPU builds usually have no cooling issue. Airflow is not that critical. It is more about what kind of cooling system you have on the GPU.
Bill says
Follow-up to my question, here is the config I'm looking at. PCPartPicker doesn't seem fully current; note that the Asus Rampage IV but not the V is shown.
Bill says
Any advantage of EEB over EATX, e.g. these two?
Does the ‘WS’ do 3 * x16 GPU’s, or only one? It seems to have a max of 3 GPUs, vs. 4 for the ROG. Partspicker lists the ROG IV (no V) without sellers, and the price of the IV is $700+ on Newegg.
ASUS WS C621E Sage EEB Server Motherboard Dual LGA 3647 Intel C621
3 x PCIe 3.0 x16 (x16 mode)
2 x PCIe 3.0 x16 (Single at x16, dual at x8/x8)
2 x PCIe 3.0 x16 (x8 mode)
ASUS ROG RAMPAGE V EDITION 10 LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.1 Extended ATX Motherboards
4 x PCIe 3.0/2.0 x16 (x16, x16/x16, x16/x8/x8, x16/x8/x8/x8 or x8/x8/x8/x8 mode with 40-LANE CPU; x16, x16/x8 or x8/x8/x8 mode with 28-LANE CPU) *
* The PCIEx8_4 slot shares bandwidth with M.2 and U.2.
Gowri says
Hello Tim,
Thanks so much for the blog and replies to comments. I am sorry if this is a reposting, but my comment seemed to have disappeared, so thought I would post again… It would be so helpful to have your insights.
I am attempting to put together a desktop with what I have available online and locally, that is both DL-now ready and future proof. These are the components with some questions:
1. NVIDIA RTX 2080 (8GB)
(not sure which one is right, there seem to be multiple versions online such as RTX 2080 Super and Twin x2 8GB – which would be most appropriate here?)
2. RAM – Corsair Vengeance DDR4 32GB
(2x16GB (instead of 4x8GB) seems to cost less – would this be alright?)
3. Intel i7 8700k processor
(Selected this inspired by TensorBook’s choice of processor for a laptop; not sure this is the best for the configuration selected, please do let me know if there’s a better option)
4. 1 TB SSD (Samsung 970 Evo?) NVMe
5. ASUS mother board (an appropriate one)
(Would the ROG-Strix-Gaming-Motherboard-802-11ac/dp/B07HCPLQ2H be good for DL too?)
6. Power supply – Corsair smps cx750
(we have occasional power cuts, so thought this is a worthy investment)
7. Hard disk for data (Seagate 2TB Fire Cuda)
8. Cabinet – Corsair Crystal 570x RGB 3 RGB fans
(Not sure if Mid Tower is sufficient for the config selected – is there a better option?)
Please would you share your inputs on these?
Thanks a ton for your time and help!
Tim Dettmers says
I think this looks reasonable. You could go with a cheaper AMD processor (Ryzen) to save some money. 2x 16 GB are great. Looks good otherwise!
Dmitry says
Hi Tim,
First of all thanks a lot for the post – it saved quite a bit of time for me.
I’ve got a bit oldish machine with i7-3770K (4 cores + hyperthreading) and 32Gb DDR3 RAM which I’d like to start using for Deep Learning (for NLP tasks).
Looking at your other post I am thinking about getting one RTX 2080 Ti though not sure if my CPU would become a bottleneck and I better go for cheaper GTX 1080 Ti instead.
What would you think about this?
Unfortunately most of posts on internet on this are from gaming perspective and do not look too relevant…
Many thanks,
Dmitry
Tim Dettmers says
You should be fine with an i7-3770K for most tasks. Some tasks that make heavy use of background data loaders, such as computer vision, can take a hit in performance, but it should not be too much, maybe 30-50%. If you compare this to getting a full new system, sticking with your i7-3770K looks like quite a cost-efficient solution. I would give it a go!
Dmitry says
Thanks Tim! Much appreciated
Daniel says
Hello,
my name is Daniel, I am a student and using for the first time the PyTorch library with Cuda. I was trying to train a network and came across some problems, and hope you could help me out.
The setup I am currently using might be a little unusual. I am using a NUC7i7BNH, that has a Intel Core i7-7567U Processor @ 3.50 GHz and 8 GB of RAM. Additionally I have an external GPU, a Nvidia Titan V built in a Asus Rog XG Station 2 and connected to my NUC through Thunderbolt.
I went through some PyTorch tutorials and had seemingly no problems with this setup. Nevertheless, when I try to train a bigger network with a big image dataset, the CPU runs constantly at 100% and the GPU only at 0-5%. I have been trying to find out what the problem is. I checked several times that my code is actually using Cuda, but the CPU is still running at 100% and making the training progress extremely slow.
From what I have read, I suppose that it should be a CPU bottleneck problem, but wanted to confirm. I also looked at the RAM usage and it seems to stay between 85-90% during the training. Maybe it has also something to do with the fact that I am using an eGPU?
Thanks in advance!
Tim Dettmers says
That sounds like an eGPU issue where the bottleneck is transferring the data iteratively to the GPU. One solution might be, if your dataset is not too large, to transfer the entire dataset to your GPU. This will take some time, but once this transfer is complete the CPU should no longer be a bottleneck since almost no operations are executed on the CPU. If this does not solve the problem, something else might be wrong. You can run PyTorch profilers to find out where the bottleneck comes from exactly.
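If the dataset fits, moving it to the GPU once and slicing batches there avoids repeated trips over the Thunderbolt/PCIe link. A minimal sketch, assuming PyTorch; the file names are hypothetical preprocessed tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

images = torch.load("images.pt").cuda()   # hypothetical preprocessed image tensor
labels = torch.load("labels.pt").cuda()   # hypothetical label tensor

dataset = TensorDataset(images, labels)   # batches are now sliced directly on the GPU
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # keep num_workers=0 here
```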
Daniyal Mujtaba says
Hi, a great blog tbh, and really helpful in deciding most of the system for DL, but I still need one piece of advice in terms of GPU.
So I'm going for an i7-4790K with 16 GB DDR3.
Now it's GPU time.
Actually, I'm new to DL and I don't have much idea which GPU should perform better:
an RTX 2060 6GB (I read somewhere I can use it in 16-bit mode to virtually get 12 GB for processing), or a
1070 (a 1080 Ti is out of budget right now).
That is option one, with the future possibility of upgrading to a better CPU with DDR4 RAM.
Or I can go for a better CPU like a 6700K with 16 GB DDR4 and settle for a 1060 6GB, and in the near future add another GPU or upgrade this one to a 2070 maybe,
but that upgrade might take more than a year.
So yeah, suggestions would be helpful.
Also, this machine won't only be used for DL, as I would game on it too.. so yeah.
Tim Dettmers says
I would go with the RTX 2060. If you learn to use it well you should be able to use most deep learning models.
Al says
I have to disagree with your assessment on 16-bit fp capable cards. Most DL frameworks are garbage as far as software goes, and support for 16-bit is still crippled. “learn to use it well” in practice means you need to spend much more than the card’s worth of man-hours over its lifetime to make 16-bit work in many cases, which completely defeats its purpose. I think your performance graph is misleading. Pricing will get outdated very quickly too, as Nvidia will screw us shortly with the new RTX “super” pricing drop and second hand Pascal cards keep going down in price.
I just got a second-hand Pascal card myself after trying a Turing RTX for a few months. More RAM for a lower price, a much better deal than a RTX 2060. I’m mostly stuck with 32-bit in most cases anyway. Maybe in a year or two 16-bit will actually be usable!
Tim Dettmers says
In PyTorch, the 16-bit recipe is quite easy to do and stable. I have no 16-bit experience with TensorFlow though.
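For reference, a minimal sketch of the 16-bit recipe using the native AMP API that later PyTorch versions ship (torch.cuda.amp); the model and random data are placeholders. The gradient scaler is what keeps small 16-bit gradients from underflowing:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with autocast():                       # forward pass runs largely in float16
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()          # scale the loss to avoid gradient underflow
    scaler.step(optimizer)                 # unscales gradients, skips step on inf/nan
    scaler.update()
```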
Al says
From what I’ve seen in Keras some layers -which you’re prone to end up needing to use like batch norm- don’t support 16-bit. Also, the results are often out of whack, with big losses that don’t improve and so on, even following the guidelines. When it works it’s great but when it doesn’t you just can’t justify spending the time to make it work.
Another annoying side effect of faster cards with the same amount of RAM is that you may often find that you’re under-utilising the compute capacity. So why use a “fast” RTX at 50-70% capacity when you can use a cheaper GTX at 80-100% and get the same results in the same time? This happened to me recently when using cudnn LSTM and GRU Tensorflow layers from C++ for inference but it can happen in many other cases.
And don’t get me started with the Python image manipulation garbage software… you often find that you can’t make it run fast enough for augmentation online and parallelization is not possible/sucks and so on… so there you have your powerful GPU waiting for data from crippled software that can’t even use your CPU cores properly. You may as well take charge, throw the Python garbage out the window and use opencv and cudnn in c++, which will be much more fun if you have the time (which you often won’t), because you’re solving actual problems and not banging your head against a wall. This has nothing to do with the GPU choice, I’m just ranting at this point…
Tim Dettmers says
I have not used Keras in years and I am not sure how to resolve 16-bit problems in Keras. In PyTorch, it is rather easy and works well. I think this is primarily because NVIDIA is supporting 16-bit compute with specialized libraries for PyTorch.
I totally agree with the underutilization issue. GTX cards or XX70 cards are more than sufficient for LSTM-like models and if people use these heavily I would recommend such cards.
Indeed, Python is a nightmare in terms of parallelization. I worked on this problem myself and was able to do much better than standard software like TensorFlow/PyTorch for toy problems like MNIST/CIFAR, where I improved training speed by 4x because I just had better data loaders. However, I did not have time to implement all the models/layers from scratch and so I gave up on the project. Developing such software is just difficult and a huge undertaking. I think we should be quite grateful for the great free deep learning software that we have, even though it is inefficient most of the time.
Michael says
Re 16-bit: in my experience, issues arise when you're doing experiments where you're not sure how the gradients are going to behave, or perhaps regularization might push weights/activations beyond the 16-bit range. Basically standard over/underflows. If you put safeguards in your code to handle these situations, 16-bit works fine in both PT and TF.
Re under-utilization of GPUs, try https://devblogs.nvidia.com/fast-ai-data-preprocessing-with-nvidia-dali
I personally haven’t tried it yet, but it might help.
Daniyal Mujtaba says
But I'm getting a second-hand 1070 for the same price as a 2060 in my country.. so the 1070 is better according to you?
Tim Dettmers says
They are about the same. Getting either one should be fine.
Al says
You can find a second-hand 1070 for about 200 euros on eBay now. I think it’s impossible to find an RTX 2060 for that price anywhere. For the prices I’ve seen, the 1070 is a bit better value even though it’s a bit slower. And it also has 8GB…
Zarglayoun Amira says
Hello Tim,
A friend of mine has bought the followings based on a build found on the internet:
CPU : Intel i7–8700K
RAM : 64GB : 16*4: G.Skill Ripjaws V DDR4 3200MHz
GPU : RTX 2080 Ti
Motherboared : MSI Z370 PC PRO
PSU : CoolerMaster Vanguard 1000W PSU
Cooler : ML240L Liquid Cooler with the Hyper 212 LED Turbo.
Storage : A 512GB 970 Pro Samsung M.2 SSD
The total cost was about 3500$
If I have only 1000 to 1500$ and my aim is to have a decent build for Kaggle competitions (I am not looking to be in the top 5, but let's say around the top 100-150), how should I change his build? Do I keep the GPU? The RAM? etc.
PS : I can buy used components to reduce the costs
PS2 : I already have the following (I don’t know if it can be useful)
Crucial CT2C8G3S160BMCEU 16Go Kit (8Gox2)
Samsung SSD 850 EVO, 500 Go – SSD Interne SATA III 2.5″
Tim Dettmers says
You do not necessarily need that much RAM, but then you need to write careful code that is memory efficient. This will already save you a lot. The PSU wattage is suitable for 2-3 GPUs and can be reduced to 600 watts, which will bring the price down further. Otherwise, one can go for a cheaper GPU. For one GPU, a cheap Ryzen CPU with a cheap motherboard is more than enough. All of this is for deep learning though; running models on your CPU would be slow on this setup, so you would need to make sure that you run boosting/tree models on your GPU.
PS: Yes the used components look good!
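For the boosting part, running the trees on the GPU is usually a one-line change. A sketch, assuming XGBoost with the gpu_hist tree method; the random data is a placeholder for a real Kaggle dataset:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100_000, 50)
y = (np.random.rand(100_000) > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "tree_method": "gpu_hist"}  # train on the GPU
booster = xgb.train(params, dtrain, num_boost_round=100)
```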
Ariel says
Hi Tim, I'm reading your post as I'm about to build a deep learning machine.
I'm planning to get an ASUS Turbo GeForce RTX 2080 8 GB graphics card (GDDR6, with high-performance blower-style cooling for small chassis and SLI setups, TURBO-RTX2080-8G, https://tinyurl.com/y4y68dub ) and add a Patriot Viper 4 Series 16GB Kit (2 x 8GB) 3733 MHz (PC4 29800) DDR4 DRAM Kit (PV416G373C7K)
(https://tinyurl.com/y5k77pva ), and I'm waiting for the new AMD CPU that has just been announced, the Ryzen 9 3900X with 12 cores and 24 threads. Would you recommend this?
thanks -Ariel
Tim Dettmers says
The AMD CPUs are quite good now, but the 3900X has only 16 PCIe lanes, so it is only good for 2x GPU setups. If you only want two GPUs, this is a great choice. Otherwise, a Threadripper is a cost-effective option for 4 GPU setups.
Ariel Ravinovich says
Can you recommend any other AMD chipset that will go with this configuration?
thanks
Tim Dettmers says
A cheap threadripper usually works well.
Ariel R. says
Hi Tim, is the following configuration will work?
https://pcpartpicker.com/user/ArielR/saved/#view=pmgrVn
I found a motherboard with 3 slots for video cards as you can see and will fit to the ne AMD CPU
Thanks
Tim Dettmers says
I would get lower clocked RAM to save a bit of money, as well as an air cooler for the CPU (there are some silent, high-performance ones which cost a bit less). Looks good otherwise!
Ariel Ravinovich says
Hi Tim what would you say about this configuration :
https://pcpartpicker.com/list/
I think the motherboard is not bad with 3 x16 slots.
thanks
Richard says
Could you discuss the hardware implications of the type and application of deep learning? Different hardware tradeoffs could be made for a box dedicated to training an image classifier on a large dataset versus transfer learning with an existing model and these hardware tradeoffs might be different if the application was sentiment analysis or NLP. I suppose one way to determine those tradeoffs, as alluded to in an earlier comment, would be to run the task in the cloud and get an understanding of the bottlenecks and requirements that way before buying dedicated hardware.
Tim Dettmers says
For image classifiers, it is useful to have a large SSD where you can put your full dataset (1TB+). Other than that, there are no task-specific requirements.
Karan Sharma says
Hey,
Is RTX 2070 good enough to start with if I want to train architectures like YOLO (object Detector) using Tensorflow.
Please help me with this.
Tim Dettmers says
Yes, the RTX 2070 will be good enough for this.
Arthur says
Hi Tim,
Since you have great experience building this kind of DL machine, I have a question regarding how to optimize the Ethernet bandwidth for different hardware configurations and applications. For example, how much Gigabit or 10 Gigabit Ethernet do I need if I have 16 or 12 NVIDIA Tesla GPUs with 2 Intel Xeon Scalable Processors for a graphical analysis or game processing application? Thanks.
Tim Dettmers says
If you have 4 GPUs per node and you want to train traditional convolutional networks with standard algorithms, you should get at least 20-40 GBit/s InfiniBand, preferably 100 GBit/s or faster. 10 GBit/s, and especially Ethernet, will be too slow for standard algorithms. With special algorithms, 10 GBit/s Ethernet can work, but no open source project for these algorithms exists and implementing them on your own will take months. So it is better to invest in a good networking solution and use standard libraries.
Chanhyuk jung says
Hi. I’m a high school student graduating this year. I completed the deep learning specialization on Coursera. Now that I’m confident enough to use pytorch for nlp and RNNs on speech, I need a gpu. I can ask my parents to buy me a computer or just use google colab. Would it be okay to just use colab even if I can afford a computer?
Ade says
Hello Tim
Please, I am a Ph.D. student and my research area is deep learning. My potential build has the following specification.
CPU: Intel® Xeon® Silver 4114 10-Core (2.2 GHz, 3.0GHz Turbo, 13.75M L3 Cache)
Motherboard: ASUS® WS C621E SAGE (DDR4 RDIMM, 6Gb/s, CrossFireX/SLI).
RAM: 64GB Kingston DDR4 2666MHz ECC Registered (2 x 32GB)
GPU: 11GB NVIDIA GEFORCE RTX 2080 Ti – HDMI, 3x DP GeForce – RTX VR Ready!
1st Storage: 6TB SEAGATE BARRACUDA PRO 3.5″, 7200 RPM 256MB CACHE
1st SSD Drive(OS installed) : 1TB SAMSUNG 970 EVO PLUS M.2, PCIe NVMe (up to 3500MB/R, 3300MB/W).
Please, is it enough for image processing and training? I have 3 million images to train on (a 2 TB image dataset). Any suggestions on areas to improve in my build? The motherboard supports 2 CPUs and up to 756GB RAM as well.
Thanks
Ade
Tim Dettmers says
The SSD is really important here for image processing. Try to get multiple SSDs and put them in RAID 0, or buy at least one SSD which you use solely for your dataset. Otherwise, it looks good.
Nick says
I feel this guide is becoming obsolete because it ignores alternatives to GPUs, like the Google TPU. There are conflicting claims, but it does seem clear that chips which were designed for accelerating deep neural networks are going to be better than chips that were designed for accelerating graphics. There are at least half a dozen companies with TPU-like products in the pipeline; it's not just Google.
Tim Dettmers says
Please have a look at my updated GPU recommendation blog post which also discusses TPUs.
Damian E says
Hi,
I need a certain level of mobility so I want to go with a 17″ laptop with eGPU via Thunderbolt 3.
I would like to know if it makes sense to purchase a laptop that already has an integrated GPU (mobile RTX 2070-2080)? Can they work as a pair? Or do I have to switch between them and thus make the integrated one useless?
Also, Thunderbolt 3 caps at 40 Gbit/s to PCIe, and that is most likely the theoretical maximum, not necessarily what you get. Does it make sense to go with the RTX Titan? Or am I burning money and should go with the 2070?
Tim Dettmers says
Integrated GPUs are great but also expensive. If you find a cheap laptop with integrated RTX 2070 I would go for that. If you want to have multiple GPUs (internal + external) it gets complicated. I am not sure how this setup is supported. I would look online for other people who tried. In general, a single eGPU should also be great. It is also cheaper to upgrade the GPU without upgrading the laptop!
silverstone says
Hi Tim,
Thanks for the guide. What do you think about an AMD vs. Intel CPU with an NVIDIA GPU? Are there any bottlenecks for DL frameworks with an AMD CPU?
Tim Dettmers says
It seems AMD CPUs are fine. I never had any problems with my AMD CPUs both at home and in the office. One issue might be if you want to use your CPU for some linear algebra (solvers and decomposition etc.), but other than that AMD CPUs are great.
Hamid says
I actually ended up purchasing an Intel CPU and the X299 Mark 2 motherboard, but after I assembled it I realized that it does not even have basic onboard graphics. So do I need to buy a cheap graphics card, or can the GPU itself be used without sacrificing processing power?
Vinicius Dallacqua says
Are you aware that this article got cloned? https://medium.com/@joaogabriellima/a-full-hardware-guide-to-deep-learning-cb78b15cc61a
Tim Dettmers says
Thank you for pointing that out! That was caught quite quickly and the user is now banned.
Kim Goodwin says
Reading the title, I seriously didn't think the article was going to go this deep into the topic. Great start!!!
I am tired of trying different coolants for my processor and heatsinks. Now I have decided to use thermal paste instead. Is that a good option?
Kartik says
Hi Tim,
Thanks a lot for all your effort and also keeping this blog up to date.
I was thinking of getting an AMD Threadripper 1900, which would be used for heavy preprocessing and for running XGBoost or other libraries that run on the CPU. Is it overkill?
And I've been following Kaggle for a long time while doing other things, and now I am fully on it. I've seen people doing prototyping and training separately.
Should I get an RTX 2080 Ti, or two RTX 2070s for training and prototyping? Or maybe make a cluster of GTX 1080 Tis?
Tim Dettmers says
An RTX 2070 is great for prototyping. Since most of the time on Kaggle is spent prototyping, it is not so efficient to dedicate resources to training. I would say use your RTX 2070 also for training, and if that is not sufficient (memory or training time too high), use the cloud for access to fast GPUs. This will be cheaper and more flexible.
Hamid says
Hey Tim,
Thanks for this great post,
I'm looking for a GPU to do my own research and I'm thinking of a price range of $2200-$2500. Given that a while has passed, may I ask what you would choose for 1. GPU (2080 Ti?), 2. Motherboard, 3. RAM, 4. Hard disk, 5. PSU, and 6. Case?
I mainly use this for deep learning.
Thanks
David Knowles says
Thanks a lot for the post, very useful.
Do you have advice on how bad an idea it is to have two different GPUs in the same box? I have an old GTX 980 that it seems a shame to waste, so I was thinking of running it alongside an RTX 2070 in a 2 GPU setup. I would potentially use the 980 for prototyping things while the 2070 is off training. Thanks!
Tim Dettmers says
That is usually just fine. Make sure that you use software that is precompiled for different compute architectures (different GPU series) and you should have no problem.
Farooq says
Hi Tim,
Extremely helpful article ! Keep it updated please !
I wanted to take your opinion on buying a single GPU e.g. GTX 1080-Ti today priced around $808 vs buying two GPUs e.g. RTX 2070 (single GPU priced $527, total = $1045).
Will 2 GPUs (RTX 2070) perform better compared to the single GPU (GTX 1080 Ti)?
Generally, do two slightly slower GPUs perform better for machine learning projects compared to a single high-speed GPU?
Tim Dettmers says
Usually, two GPUs that are slightly slower are better than one big GPU, because you can run multiple hyperparameter configurations of the same network, one per GPU. Parallelization is also an option and is usually slightly faster than one big GPU. So go for the RTX 2070s!
rocco says
Hello Tim,
I am going to use an Asus WS X299 SAGE motherboard with 2x RTX 2080 Ti. If I use dual-fan GPUs, is that better compared to blower-style fan GPUs? Actually, I am afraid of using blower-fan GPUs because of heating problems. With two GPUs, I think there is enough space between them.
Tim Dettmers says
Yes, if you have space between your GPUs a dual fan will be fine, but probably comparable to the blower fan.
Diego says
Hi Tim
Thank you for your guidance, for those interested in the development of AI.
I have a big question and it is the following:
I want to buy an MSI Vortex G65RV with the following characteristics:
Processor: Intel Core i7-6700K 4.0GHz 8M Cache, up to 4.20 GHz
Hard Disk: 1 TB (SATA) 7200 + 256GB SSD PCIe 3.0
RAM memory: 32GB DDR4 2133MHz expandable
Graphics Card: 8 GB Nvidia GeForce GTX 1070 DDR5 VR Ready
Connectivity: Killer ac Wi-Fi + Bluetooth v4.1, 2 ports Killer Gb LAN
2 x Thunderbolt 3, 2 x Mini Display Port, 2 x USB 3.1 Type-C.
Unfortunately, this desktop line only goes up to an i7-7700 CPU and a GTX 1080 GPU.
So I thought about acquiring an external GPU:
GIGABYTE AORUS Gaming Box RTX 2070
Which tells me the following: GeForce RTX 2070 with 8G memory and 448 GB / s memory bandwidth has 2304 CUDA Cores and hundreds of Tensor cores operating in parallel.
I know it will not perform the same as a card connected internally on the board, but I would like to know how the GPU communicates in those cases, since the connection is through the Thunderbolt port if I am not mistaken. I would also like to know what the deep learning performance will be like with an i7-6700K CPU and an RTX 2070 GPU connected over Thunderbolt.
Thank you.
Tim Dettmers says
Thunderbolt 3 is pretty good for communication and you should see only a small loss in performance (10%) for most applications. This number might be higher if you (1) have very large input data, or (2) a small neural network, but this is not very common. So an eGPU should be fine.
Michel says
Hello,
I am no expert in deep learning, but the gaming community tends to consider that going from an RTX 2060 to an RTX 2070 brings little benefit in terms of FPS or detail rendering, just a higher price.
I am wondering whether there is any reason why the 2060 is not mentioned in your really great review of GPUs.
Thanks!
Tim Dettmers says
This post is just not updated. Need to do this soon.
Alvaro says
Also, consider the new GTX 1660 Ti! The “tensor cores” have been removed but they’ve been replaced with FP16 units. I can only guess about the actual performance compared to RTX 2060… it would be great to find some actual tests. Could it be just as cost-effective after the retail price stabilizes?
Michael says
I would be interested in the comparison of the RTX 2060 and RTX 2070 for deep learning applications.
Do you think it is worth going for the RTX 2070?
Mario Galindo says
Thank you for the post.
I am buying a computer with two GTX 1080 Ti GPUs. On the motherboard there are no screen connectors nor space for other GPUs. Where should I connect the monitors? I want to install 3 monitors. Should I connect them to the GPUs? Or must I buy another motherboard?
Thank you.
Tim Dettmers says
Just connect them to your GPUs. It will barely impact performance of your GPUs.
Vishal Bajaj says
Hi Tim,
Would it be possible to setup a system with a 1080Ti and a 2080 Ti and use them to perform parallel training?
Tim Dettmers says
This does not work, unfortunately. You need the same chip architecture to do GPU-to-GPU parallelization. You can do GPU-CPU-GPU parallelization, but that often yields no speedups.
Vishal Bajaj says
Thanks for the input, Tim. I guess I will try to get the 2080 Ti, but I keep reading many reviews of them dying! So I am a little afraid to put down $1800 (CAD) 🙂
Ari H says
It's just a shame that AMD's latest GPUs would have the potential to demolish NVIDIA's overpriced cards for deep learning if they just fully supported PyTorch. Currently, they're about as good as doorstops unless you write everything with OpenCL yourself. Their priority should be to get PyTorch working ASAP.
Tim Dettmers says
Agreed. There are some efforts to do this, but it is a delicate issue because PyTorch was built on top of an older code base. I hope they can figure out the last issues soon, and then I would be happy to recommend AMD cards as well.
Joshua Marsh says
Hey Tim, thanks for creating this amazing repository of information!
I just wanted to hear your thoughts on the differences between the RTX 2080 Ti Founders Edition vs the other RTX 2080 Tis with third-party hardware from ASUS, Gigabyte, MSI, etc. (or just the advantages of FE vs non-FE cards in general). In my country, the FE is about $300 USD cheaper, so if the others do not have any real advantages for AI I would prefer to go with the FE. Also, as I am considering water cooling, the advantages gained from superior cooling may not be a concern.
Thanks Tim!
Tim Dettmers says
If the FE is $300 cheaper definitely go for the FE card!
Ando says
Hi Tim,
Thank you very much for the guide.
I am trying to build my first DL machine. Following your advice, I am looking at to start with an RTX 2070, I will add either another 2070 or an 2080 Ti later, and maybe even a third one. This is my build, if you have time, please have a brief look:
https://pcpartpicker.com/user/ando_khachatryan/saved/yQkNQ7
My concern and question is about the cards: while looking for a blower-style card on Amazon, I encountered lots of negative reviews for cards from different vendors, and the vast majority were describing the same problem: card worked out-of-box, then, after a week of gaming, artifacts started to appear and the games started to freeze/crash.
Any comments on this? Have you seen/heard about this?
My problem is that returning the cards would be a major issue for me, I don’t live in the U.S. and RMA/return would cost me a lot of money.
Thank you in advance.
Tim Dettmers says
Looks like a good build with some spare room for more GPUs.
I have heard about the problems. It is still unclear if all RTX cards have this problem or only the first batch of RTX cards at release. It is worth looking at the date of the reviews to see if it got better over time. I personally have no problems with my RTX cards, but maybe I have been lucky so far.
Ryan S says
Hello Tim, I have a few very fundamental questions. I plan on using an NVIDIA Tesla P4 GPU on a server (let’s say Intel Xeon 16core, 128GB RAM, 2x10GbE etc.,). From a popular manufacturer that has such a config, it states that the system can handle video analytics (like face detection) on 9 concurrent video streams @ 720p/15fps.
My question is:
- If I run video @ 720p/3fps, how many video streams might I be able to handle concurrently?
- If I run video @ 1080p/3fps, how many video streams might I be able to handle concurrently?
I know there are many factors involved, but just as a ballpark, any suggestion on whether lowering the frame rate would help increase the number of video streams? Does this scale linearly in any way?
Thanks!
Tim Dettmers says
I do not know what this is referring to exactly, but one assumption that could be reasonable is to say it scales linearly. That means 9*5 = 45 streams for 720p/3fps and 9*5/2.25 = 20 streams for 1080p/3fps, but I do not know if that works out. The best is to ask the manufacturer yourself.
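As a back-of-the-envelope version of that linear-scaling guess (the scaling assumption comes from the reply above, not from any vendor specification):

# Vendor baseline: 9 concurrent streams at 720p / 15 fps.
baseline_streams = 9
fps_factor = 15 / 3                            # dropping from 15 fps to 3 fps
pixel_factor = (1920 * 1080) / (1280 * 720)    # 1080p has 2.25x the pixels of 720p

print(baseline_streams * fps_factor)                  # ~45 streams at 720p / 3 fps
print(baseline_streams * fps_factor / pixel_factor)   # ~20 streams at 1080p / 3 fps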
Bruce says
Hello!
Maybe I am a bit confused, but can I have a config with more than one RTX 2070 at the same time? Because the 2070 doesn't support SLI (https://hothardware.com/news/nvidia-geforce-rtx-2070-gpu-will-not-support-nvlink-sli-but-why). Does it matter?
Thanks in advance!
Tim Dettmers says
CUDA code cannot use SLI for communication between GPUs anyway. Instead, GPUs communicate via the PCIe bus. Thus no SLI support is needed for parallelism.
Nasi says
Hi Tim,
Thank you for this article.
I already have a GPU, a GTX 1080, in my PC. I am going to install an RTX 2080 Ti alongside it.
1) Is it possible to have two different types of GPUs in one PC and use them for training a neural network, especially in TensorFlow? I do not know how to prepare the environment to use both GPUs. Is it possible for you to send me a good tutorial link for that?
2) There are two PCI Express slots on the motherboard, but they are too close to each other. If I install the new one in the empty slot, the fan of the old one will be blocked by the new one. So, I think I should either buy a new case (with a new motherboard) or buy a PCI express riser. I found multiple links to buy a PCI riser, but I do not know whether they are good or not. If I use a PCI riser, I will put the new GPU outside the case, and I will not close the case. Could you please give me your opinion about PCI express risers?
https://www.amazon.fr/Cablecc-Gen3-0-16-PCI-Express-x16-Extender-Up-Angled/dp/B07GBRQPQF/ref=sr_1_17?ie=UTF8&qid=1548337647&sr=8-17&keywords=pci+express+riser
https://www.gearbest.com/other-pc-parts/pp_672357.html?wid=1433363&currency=EUR&vip=4450235&gclid=Cj0KCQiA4aXiBRCRARIsAMBZGz_d_R54eWGNs1vpAKV0qBtUDNK9MGw7HzNrLH4d5MFlfCpBMGC9s2IaAm4tEALw_wcB#anchorGoodsReviews
https://azerty.nl/product/delock/670177/riser-card-pci-32-bit-with-flexible-cable-left-insertion-riser-kaart?gclid=Cj0KCQiA4aXiBRCRARIsAMBZGz9LzTqekoEhsr6sRVCwqBNfrWdTBVhDgzYHfu7dNwTBOLBLCfgUn5caAumqEALw_wcB
https://www.amazon.com/Ubit-Multi-interface-Function-Graphics-Extension/dp/B076KN7K5Q
Best regards,
Nasi
Tim Dettmers says
1) Yes, but you will not be able to parallelize a deep neural network across those two different GPUs.
2) If one of them has a blower fan you might be able to put the RTX 2080 Ti left of the other GPU. Otherwise, you can always buy a riser/extender if you have overheating issues.
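Regarding (1), the usual way to use two different cards side by side is to pin each training run to one GPU so the cards never have to cooperate. A minimal sketch (the device index 0 is an assumption; check nvidia-smi for your ordering):

import os

# Pin this process to one card BEFORE importing TensorFlow/Keras,
# so the framework only ever sees that GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
# ... build and train model A on this GPU ...

# A second, separate process started with CUDA_VISIBLE_DEVICES="1"
# can then train model B on the other card at the same time.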
Ehtesham says
Well, I tried something on an HP ProLiant ML350 Gen8 for mining (I wish I could put a picture here): a 6 GPU setup, with 3 mini 009S risers on one side of the motherboard and 3 on the other side. The six USB cables are routed through the back of the chassis, and the 6 GPUs are mounted on top of the server.
All 6 GPUs (GTX 1660 Super) are recognized by the Windows 10 system. Now the difficult part: the ProLiant ML350 Gen8 has two separate 400W power supplies (800W combined). Technically, a 1660 Super requires and draws 90W, so in this scenario the total consumption is 540W. I therefore decided to add a 750W power supply to meet my requirement.
I failed, because the power supply shuts down within 1 minute, Windows crashes and the system reboots. I tried the same exercise with HiveOS, with the same result.
Today I will try a 1500W power supply along with the 800W (already installed) supplies and shall see the results. Hope it works!
Tim Dettmers says
Thanks for sharing this! This shows how difficult it can be to get the power requirements right.
Gurunath says
While you have mentioned that PCIe lanes don't matter significantly for a <=4 GPU setup, I plan to use a setup with 8 GPUs (RTX 2080) for NLP and speech recognition tasks. Would the number of PCIe lanes significantly affect the performance in such applications? What would be your advice on the number of PCIe lanes for each GPU in this 8 GPU setup for NLP and speech recognition tasks?
Info: For example, our current NLP task on a sequence-to-sequence model, with a batch of 100 sentences each restricted to 128 tokens (each represented by a 64-bit tensor), takes around 120-150 ms per iteration in PyTorch on a single GPU (1080 Ti).
Thanks in advance.
Tim Dettmers says
If you want to parallelize across 8 GPUs, the PCIe lanes will matter quite a bit compared to 4 GPUs. The communication requirements scale linearly with the number of GPUs (if you use the right communication algorithm). However, if you run 8 GPUs on a regular 4 GPU motherboard you are also halving the PCIe speeds, and you will have 4 GPUs behind each PCIe root complex. Since only one GPU behind a PCIe root complex can communicate with another root complex at a time, you need 8x the time to send the same amount of messages between GPUs compared to 4 GPUs. So in total, the communication with 8 GPUs on a 4-GPU motherboard will be 32 times more expensive than with 4 GPUs on a 4-GPU motherboard. If you want to parallelize 8 GPUs efficiently, you will need 4 PCIe root complexes, and this often means 2 CPUs and server-grade hardware (EPYC systems might be an exception, but I am not sure if those motherboards support 4x root complex setups).
If you do not want to parallelize a network across all GPUs, you will be fine — just note that with this system you cannot really do parallel training.
Eric says
Hi Tim,
Thanks for this post. After reading it through, I am still a bit unsure about the PC specs I should get to run deep learning, mainly because I don't want to get hardware that doesn't work with my other software/hardware.
I wonder if you could recommend me a set of hardware with a budget around £1800-£2200.
Thanks a lot 🙂
Tim Dettmers says
I recommend using pcpartpicker.com; it should take care of hardware compatibility and will alert you if there are issues with your build.
Abdelrahman says
Thanks for this article
I want to buy a GTX 1060 6GB for Master's research on object detection. But I don't have enough budget, so I'm planning to buy a Xeon processor like the E5-1620 with 24 GB of RAM.
Is this CPU good enough for my purpose, or would an i7 be better?
Tim Dettmers says
The CPU will be fine for deep learning with a GTX 1060. However, if you want to preprocess data, it might take more time with such a CPU.
krzh says
Do you think this is a well balanced system? https://pcpartpicker.com/list/kv3N4q
Tim Dettmers says
Yes, that looks quite good. The system would also work well with more than 2 GPUs so if you have any plans to use more than 2 GPUs you could get a motherboard with more PCIe slots. Otherwise, all good!
Nitin Gupta says
Hello Everybody,
I have been trying to get multiple GPUs to work on an Ubuntu system.
I am using Ubuntu 16.04 LTS; here is the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:02:00.0 Off |                  N/A |
|  6%   57C    P0    26W / 120W |    321MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8     5W / 120W |      2MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 106...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   26C    P8     5W / 120W |      2MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1165      G   /usr/lib/xorg/Xorg                           198MiB |
|    0      2066      G   compiz                                       119MiB |
|    0      2746      G   /usr/lib/firefox/firefox                       1MiB |
+-----------------------------------------------------------------------------+
Although the system shows that it has all the cards, they don't get used even when I try Keras for multi-GPU learning.
Motherboard: ASUS Prime Z270-P
Processor: Intel i7 7700, LGA 1151
GPU: Zotac GTX 1060 (3 of them)
If any other information is required for solving this, I can provide it. I am using a PCIe bridge (riser) to raise the GPUs and use them.
Tim Dettmers says
This should usually work. I guess the problem might be the PCIe bridge. It is difficult to tell with this information and it is not straightforward to debug. If you can, use two GPUs without the PCIe bridge and try again.
Nitin says
I have used two GPUs without the PCIe bridge; these 2 GPUs are now mounted on the motherboard, but I am still not able to use both of them. TensorFlow starts to use memory from both but does not use the second one for processing.
Tim Dettmers says
Did you write code that utilizes both GPUs? You can try to run some code which tests parallelism. There are some multi-GPU samples from NVIDIA (CUDA samples) which test if parallelism between your GPUs is possible. If those samples work, it is a software issue.
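As a quick software-side check before the CUDA samples, a sketch like the following (assuming TensorFlow 1.x, matching the setup described above) confirms that all GPUs are visible and can each execute work:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; every GPU should appear as /gpu:0, /gpu:1, ...
print([d.name for d in device_lib.list_local_devices()])

# Pin a small matrix multiply to each GPU to confirm they all execute work.
ops = []
for i in range(3):   # 3 GPUs assumed here; adjust to your count
    with tf.device("/gpu:%d" % i):
        a = tf.random_normal([2000, 2000])
        ops.append(tf.reduce_sum(tf.matmul(a, a)))

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(ops))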
geek12138 says
Hi Tim,
A Titan RTX or two 2080 Tis, which is more suitable, considering the memory difference between 24 GB and 2x 11 GB?
Tim Dettmers says
The computing power of two RTX 2080 Tis is almost double that of the Titan RTX. Thus it is a question of compute vs memory. If you want faster compute, go with 2x RTX 2080 Ti; if you want more memory, go with the Titan RTX.
Neil M says
Hi Tim,
Happy New Year and thanks for a great blog!
I’m building my first deep learning work station and based on your guide I’ve just purchased an RTX 2070.
I’m re-purposing an existing older workstation as the basis for the build. The specification of the machine is:
CPU – Intel i7 3820
Motherboard – MSI X79A-GD65 (8D)
RAM – 16GB DDR3 (2400 MHz)
Boot Drive – 240GB Kingston HyperX Fury
Data Drive – 4TB Western Digital Red
OS – Ubuntu 18.04 (64-bit)
I want to use the existing GeForce 660 GPU to drive the monitors and keep the RTX 2070 solely for computation. Looking at the NVIDIA website, both GPUs use a common driver, so I expect this will work. Do you foresee any issues or limitations with this approach or my current spec? Thanks.
Tim Dettmers says
I think the system should work quite well with an RTX 2070. Some computer parts are older thus some parts of common code, like preprocessing, would be slower, but your deep learning performance should be close to what other people report with modern desktops.
Peixiang says
Can I use two different GPU at the same time? Say 1080Ti and 2070? What are the issues I may encounter?
Shayan says
Hi Tim,
Can you please comment on what type of setup is used in this video [https://www.youtube.com/watch?v=RFaFmkCEGEs&t=54s]? At 0:47 you can see he has 4 NVIDIA GPUs (via the nvidia-smi command); however, he is using macOS.
Also, would you recommend using macOS (with GPUs) for competitions?
Kind Regards
Shayan
Tim Dettmers says
These are GTX 1080 Ti GPUs. There are some compatibility issues with macOS that only certain NVIDIA GPUs are supported, but I do not know the details. For this, I usually do a google search on reddit: “site:reddit.com which NVIDIA GPUs work for macOS deep learning”
Fiz says
Hi Tim, questions on your comments “For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).”
do you mean
(1) on RTX cards, running 16-bit models indirectly doubles the available memory for deep learning compared to 32-bit models, is that correct?
(2) the facts in (1) are not valid for GTX cards, i.e. 32-bit models and 16-bit models make no difference?
(3) how do I explicitly run 32-bit vs 16-bit models in deep learning? Are there examples?
Tim Dettmers says
1. It is not a straight doubling, but the memory requirements are much lower.
2. You can have 16-bit models with GTX cards, but what happens under the hood is that all values will be cast to 32-bit before any computation. So the weights are 16-bit and the computation 32-bit for GTX cards. However, you should also see a good reduction in memory if you use 16-bit weights with GTX cards.
3. In PyTorch it can be as simple as "model = model.half()" and you will run in 16-bit mode. In practice, it can be a bit more complicated depending on the model that you are running. You can have a look at NVIDIA's 16-bit library Apex, which is built on PyTorch, for more sophisticated examples.
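For reference, a minimal sketch of the simple case in PyTorch (the small model here is just a placeholder; real mixed-precision training usually also needs loss scaling, which is what Apex provides):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = model.half().cuda()                # cast weights to 16-bit and move to the GPU

x = torch.randn(32, 512).half().cuda()     # inputs must be cast to 16-bit as well
logits = model(x)                          # the forward pass now runs in 16-bit
print(logits.dtype)                        # torch.float16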
Phil says
Hi Tim,
Can you please tell me if I am doing something wrong? The idea is to run LSTMs on many datasets that are rather small (<5 GB). I will have several GPUs, not to parallelise but to run different optimisations at the same time. I am not a hardware expert and I want to make sure that I don't waste GPU power because of a poor setup. If it goes well, I will replicate it to populate a rack.
-Gigabyte X399 Designare Ex
– AMD Threadripper 1920X
– 4x KFA2 RTX 2070 OC
– 4x HyperX Fury 16Go, DDR4-2400, DIMM 288
– Samsung 860 EVO Basic (500Go)
– WD 4To
– Corsair AX1600i (1600W)
I still need to find a 3U or 4U rackmount that fits.
I really appreciate your help
Phil
Tim Dettmers says
The RTX 2070 cards that you chose might be prone to overheating in that configuration. I would pick a blower-style RTX 2070 card instead. Otherwise, a good build. I am not sure, though, if you can easily find 3U or 4U racks that fit well with this configuration.
Phil says
Thank you very much!
Happy New Year to you.
Saurabh K says
Just want to know your thoughts on the combo I am looking at:
1. MSI X299 Gaming PRO Carbon AC: https://www.amazon.com/gp/product/B071G3JR9Y/ref=crt_ewc_title_dp_3?ie=UTF8&psc=1&smid=ATVPDKIKX0DER
2. Intel Core i7-7800X X-Series Processor (28 PCIe lanes for possible future expansion):
https://www.amazon.com/gp/product/B071H1B3Z1/ref=crt_ewc_title_dp_1?ie=UTF8&psc=1&smid=ATVPDKIKX0DER
The motherboard seems pretty expensive; any other suggestions for a motherboard compatible with the i7-7800X? The reason I am going for that one is that it supports wireless LAN. Otherwise, the MSI X299 RAIDER LGA is a pretty good option (https://www.newegg.com/Product/Product.aspx?item=N82E16813144059). Any thoughts?
Tim Dettmers says
Other motherboards that do not support WLAN are fine as long as you get a USB wifi adapter. This combination might save you a bit of money on the motherboard. The i7 is a very versatile CPU — a bit expensive but it will show strong performance in any case!
Ahmed Adly says
Hi Tim,
Can I add an RTX 2080 Ti to my existing 2x GTX 1080 Ti to improve the training time for a voice recognition application?
Matt says
I'm a beginner who had some expiring credit at an electronics vendor with a limited computer parts supply. I got my hands on an EVGA RTX 2070 XC 8GB and a full ATX case. I started the fast.ai course but the workflow with cloud resources got too annoying.
So I have the RTX 2070 already. I don't want to waste too much of its capability for most beginner/intermediate use cases, but I do not have that much to spend.
The impression I get from the latest guide update is that even an i3 7100 or G4560 would not hamper the GPU, or only slightly (and those CPUs are really cheap). Have I understood that correctly?
Tim Dettmers says
Yes, a single RTX 2070 should be easy to utilize with an i3 7100. However, if you preprocess a lot of data you might still run into some bottlenecks. If you make sure that you have good-quality preprocessing code you should be fine.
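If preprocessing does become the bottleneck on a small CPU, one workaround is to preprocess once in bulk and reuse the result for every epoch, so training only reads ready-to-use arrays. A rough sketch, assuming JPEG files in a hypothetical images/ folder:

import numpy as np
from PIL import Image
from pathlib import Path

# One-off preprocessing pass: resize and normalize once, reuse for every epoch.
arrays = []
for path in sorted(Path("images").glob("*.jpg")):            # hypothetical folder
    img = Image.open(path).convert("RGB").resize((224, 224))
    arrays.append(np.asarray(img, dtype=np.float32) / 255.0)
np.save("images_224.npy", np.stack(arrays))                   # training later just loads this file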
Nazim says
What are your thoughts on Titan RTX?
Tim Dettmers says
Too expensive for the given performance.
Nazim says
So two 2080 Tis are better than one Titan RTX?
Abir Das says
Hi,
Thanks for the detailed post and also the enlightening discussions. I have prepared two builds which I am sharing below. Can you please provide your suggestions. I have some specific questions regarding the builds which I am providing with the build configs. But, before that let me provide my requirements/type of ML works I intend to do.
1. I am a very new assistant professor at an Institute in India. I will eventually get the money to procure 2 workstations with 4 GPUs (I am at least hoping 1080Ti) in each. But, that will take time and I want a decent build to get the ball rolling. For that I have already bought 2 1080Ti GPUs and a Samsung 860 EVO 500 gb around 3 months back when I was in the US. So, they are sitting idle now. To avoid this, and to get started, I want to buy the other parts of a DL machine from my pocket. My budget is around Rs. 100,000 [Rs is the Indian currency].
2. The machine will be in the server room of the institute. So, the cheapest cooler [whatever noise level] and cabinet is what I would prefer.
3. My student [only one at this moment] will run RL codes [both training and inference] on images. Later, I might do some classification work on videos [but this is a distant possibility at this moment, and I might be able to procure the servers with 4 GPUs by then].
4. I don’t plan to expand this machine beyond 2 GPUs. My long term plan is to make this a student machine that will have even 1 GPU and the student can develop/prototype codes here while the stable code would run in the 4 GPU servers.
5. My builds provide some prices which does not have web links. This is from https://mdcomputers.in — a local but reputable vendor. I could not find how to link their product pages to pcpartpicker. Otherwise, I would have done that. So, you have to believe my words about the price.
My first build uses a Core i5 9600K and a compatible motherboard. As my budget allows me to spend some more money, should I go for an i7 8700K? This change, along with a change of motherboard, costs me Rs. 6,000 more [still remaining inside the budget]. The Core i7 processor supports hyperthreading, and my student says that from time to time he might use multiple threads for preprocessing in TensorFlow. Or, instead of upgrading the processor, should I just go for a RAM upgrade to, say, a total of 48 GB [and stick to the Core i5 9600K]? This will cost me around Rs. 12,000 more [so I am remaining within the budget]. Later, I can pull out one memory module when the system ultimately becomes a student's development machine with one GPU.
Build with core i5 9600K — https://in.pcpartpicker.com/user/dasabir/saved/fgQ299
Build with core i7 8700K — https://in.pcpartpicker.com/user/dasabir/saved/bD7J8d
While looking for the option of core i7 8700K, I came across core i7+ 8700 [https://mdcomputers.in/intel-core-i7-8700-bo80684i78700.html]. I see that this will cost me Rs. 11,000 more over my core i5 9600K build. I am not sure what is the difference between an i7 8700K and i7+ 8700 (other than the frequency/speed). Here is the comaprison link — https://ark.intel.com/compare/126684,140642 . Will i7+ 8700 require different motherboard? It says the box includes NVME 3.0 x 2, does it help me? Also the i7+ processor includes a 16 GB optane memory. Will it be of any help (e.g., keeping the OS there)? Also does optane memory occupy PCIe lanes? Any suggestion on this would be great to have.
My second build is with AMD processors. I tried with AMD Ryzen 7 2700X. The price is coming around the same as the core i5 9600K build. It does have 8 cores compared to 6 cores for the intel processors, but does AMD have hyperthreading? I am not sure. Also it does not have MKL, is intel MKL going to be crucial for deep learning?
Build with AMD Ryzen 7 2700X — https://in.pcpartpicker.com/user/dasabir/saved/3ddTBm
Though you say number of PCIe lanes are not that important especially with 2 GPUs, I just tried my luck with an AMD threadripper processor. As expected, it is overbudget. But, if you say, it is worth spending this much money, I might also go for it.
Build with AMD Threadripper 1900X — https://in.pcpartpicker.com/user/dasabir/saved/73mhyc
Abir
Tim Dettmers says
Either build is fine. You could buy a bit cheaper RAM with lower speeds; it will not make a big difference. If the AMD build is too expensive and you run only 2 GPUs, I would rather go with an i5 or i7 build.
Abir Das says
Hi Tim,
Thanks a lot. The AMD Ryzen 7 2700X build is the cheapest. So I will go with this. I tried to see low speed RAMs, but that is not saving me much. So, I will go with the AMD 7 2700X build i.e., https://in.pcpartpicker.com/user/dasabir/saved/3ddTBm
Alam Noor says
Thanks for sharing
Zhenlan says
Hi Tim, thanks for the great guide. I found your blog super helpful when I built my first box in 2016, and it is still worthwhile to come back for updates.
One question about GPU RAM. I am getting serious about CV competitions on Kaggle, and that entails using big pre-trained networks like Xception. I have a 1080 Ti and use a batch size of 16 for input images of size 256x256. On the software side, I am using tensorflow.keras (mostly for multiprocessing augmentation) and raw TensorFlow.
My computer freezes every now and then, so I have to force restart. And even before it freezes, it gets slower by epoch: the first epoch would be 90s, the next 120s, and it just gets worse. It also gets slower and more likely to freeze as I experiment with more network structures by defining more models (clear_session() or tf.reset_default_graph() does not help).
I use "top" to monitor CPU/RAM, neither of which seems to be the problem. I use something like "watch -n0.1 nvidia-smi" to monitor the GPU. GPU utilization stays above 90%. But it does not really tell me much about memory, as TensorFlow automatically allocates almost all of the GPU memory at the start. I tried tf.ConfigProto() to limit the GPU memory used by TensorFlow, without much luck.
Do you have any suggestion as to how to diagnose this issue? Thanks in advance and happy holidays!!
Best,
Zhenlan,
Tim Dettmers says
It sounds like you have a memory leak somewhere in your code. First, check if you are running out of CPU RAM and your computer is swapping RAM to disk. If that is not it, try to debug TensorFlow further. It could also help to install the newest NVIDIA drivers. If none of this helps, try PyTorch and see if that works for you (PyTorch is much easier to debug in these cases). Good luck!
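As a starting point for the TensorFlow memory settings mentioned in the question (a sketch only, for the TensorFlow 1.x API in use here):

import tensorflow as tf

# Let TensorFlow grow its GPU memory use instead of grabbing everything at startup,
# which makes it easier to see in nvidia-smi whether memory keeps creeping up.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap the fraction of GPU memory TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    # ... build the model/graph here, then:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop ...

With Keras on the TensorFlow backend, the same config can be passed via keras.backend.set_session(tf.Session(config=config)).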
david says
Hi, I am a Computer Scientist but I have not done any project on DL before. Maybe later I will buy RTX Titan but not in the next three months. Could you please let me know the following?
1. Given a model, if I want to see how it behaves under different initial parameters, will there be a problem if my desktop has two GPUs of different kinds (e.g. one GTX 1060 and one RTX 2080/2080 Ti or RTX Titan)?
2. Am I correct that only when I do parallel training of the same network with the same set of initial parameters will I need GPUs of the same model?
3. Those 20x0 cards have Tensor Cores in addition to CUDA Cores. Are Tensor Cores helpful in speeding up training in TensorFlow? What other reasons are there to buy an RTX card now rather than a GTX card?
Tim Dettmers says
1. You can use different GPUs for different networks. However, if you want to parallelize a single network across both GPUs they need to have the same chip, for example 2x RTX 2080 or 2x GTX 1060.
2. Yes, this is correct.
3. You will benefit from Tensor Cores in TensorFlow, but as far as I understand not all features of RTX cards are implemented yet and your Tensor Core code will be fast, but will not run at 100% speed. I think PyTorch has much better support for RTX cards, but it is just a matter of time until full features are implemented in TensorFlow.
George M says
Tim, thanks for updating this. Long term I am hoping to build a dual RTX 2070 system to allow for data parallelism. Would hooking up one monitor to each GPU be a viable option? Also, in that case would the “coolbits” option be able to control each GPU fan, or will fan control still be “hard and hacky” as you put it?
Brendan says
Hey Tim,
First I want to thank you for this blog, it teaches a man to fish rather than giving him a fish as the old aphorism goes.
I have a few questions related to hardware that I’m a little unclear on, and also that are pertinent as PCIe 4.0 slots are rumored soon. First a little background on my build, I’m going to be building a computer primarily for statistical computing before I begin a doctorate program in stats/applied math. This means it will first need to be good at serial processing which is why I’m entertaining CPUs that are overkill in terms of CNN needs (the type of neural networks I will be using). I do want it to be able to do CNN work as I am intrigued by and play around with that somewhat.
1) I’m looking between an i7 8700K and an i9 9900K. I was entertaining an AMD ryzen 2700X but I realized that I shouldn’t as that has a max number of lanes of 20, correct?
2) Should I wait for my build until PCIe 4.0 motherboards are released? All I see now are rumors but it is rumored that they will be released next year in 2019. Given that DMA is the main bottleneck here, wouldn’t it be beneficial to wait until PCIe 4.0 is available (31.508 GB/s more than doubles the performance of PCIe 3.0)?
3) If that is true, then would RAM memory speed actually be a factor? I don’t think so, as you stated your current setup gets over 50 GB/s for the RAM which would still be above the DMA bottleneck of PCIe 4.0
4) Even if PCIe 4.0 motherboards are released in 2019, they would still be compatible with the processors I mentioned above, correct? If so, then building my rig now wouldn’t hamper me as I could just upgrade the motherboard once PCIe 4.0 compatible motherboards are available. Is that right or are we unsure if there will be LGA1151 compatible PCIe 4.0 motherboards?
5) I’m looking at a GTX 1080 ti and a RTX 2080 ti for my GPU. I think the RTX 2080 ti is a little outside my price range, but I would be debating between a GTX 1080 ti with a water cooling block setup or an RTX 2080 ti without water cooling. Which do you think will likely perform better as the temperatures will likely hamper performance of the GPU with the stock fans?
Thank you again for this post and for your continued answering of questions in the comments. If you have time, I would greatly appreciate a response!
Cheers,
Brendan
Tim Dettmers says
Hi Brendan,
1) The Ryzen 2700X would be fine for up to two GPUs. If you want more GPUs I would look for a different CPU.
2) PCIe 4.0 will not help much with deep learning and I would not wait for it.
3) Memory speed is not much of a factor. I would just buy cheap RAM.
4) This is determined by the motherboard. For PCIe 2.0 -> PCIe 3.0 we saw that the new motherboards often supported only the most recent CPU sockets. I believe this could be the case for the new PCIe 4.0 boards too.
5) A single RTX 2080 Ti on air should be fine. You should see a slight performance decrease but it is still faster than the GTX 1080 Ti. If you do not have the money for an RTX 2080 Ti, both a water- or air-cooled GTX 1080 Ti should be a great option.
Haider Alwasiti says
Your statement from 2015, is it still true with current frameworks? I use PyTorch with fastai and all threads and cores are usually maxed out (image training with ResNet-34):
Needed number of CPU cores
When I train deep neural nets with three different libraries I always see that one CPU thread is at 100% (and sometimes another thread will fluctuate between 0 and 100% for some time). And this immediately tells you that most deep learning libraries – and in fact most software applications in general – just use a single thread. This means that multi-core CPUs are rather useless
Tim Dettmers says
I think it is partially true. Preprocessing still dominates CPU needs. If you do preprocessing while training you need more cores. However, you could also preprocess your data in bulk before training.
However, some frameworks also use quite a bit of CPU in the background, like TensorFlow. I do not have the deepest insight into this, but TensorFlow's graph pipeline is quite sophisticated and might need more CPU cores to process efficiently. The benefits for PyTorch would mainly lie with background loader threads.
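For PyTorch, those background loader threads mostly come down to the num_workers argument; a minimal sketch with random stand-in data:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Random tensors standing in for a real dataset with heavier per-item preprocessing.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randint(0, 10, (1000,)))

# num_workers > 0 spawns background worker processes that prepare the next batches
# while the GPU trains on the current one; this is where extra CPU cores pay off.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for images, labels in loader:
    pass  # the forward/backward pass would go here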
Eslam Haroun says
Hi Tim,
Sorry for missing this point.
If a chipset has dual x16 PCIe 3.0 slots but is only compatible with a CPU that has 16 PCIe lanes,
is this motherboard equivalent to a chipset with a single x16 PCIe 3.0 slot, or to dual x8/x8?
Is it equivalent to a chipset with dual x16 PCIe 2.0?
Thanks
Tim Dettmers says
x16 slots and lanes for the CPU are different things. If you have 2 PCIe x16 slots and a CPU with 16 lanes, that would be perfect for a 2 GPU setup!
Eslam Haroun says
Hi Tim,
Thank you for your great guide.
Are the following components sufficient for a 2 GPU system running at full power?
Motherboard: ASRock Z270 PRO4 LGA1151/ Intel Z270/ DDR4/ Quad CrossFireX/ SATA3&USB3.0/ M.2/ A&GbE/ ATX Motherboard.
Power Connectors 24-pin main power connector, 8-pin ATX12V connector
CPU: Intel BX80677G4620 7th Gen Pentium Desktop Processors
GPU: EVGA GeForce GTX 1060 GAMING, ACX 2.0 (Single Fan), 6GB GDDR5, DX12 OSD Support (PXOC) 06G-P4-6161-KR
RAM: Corsair CMZ8GX3M1A1600C10B Vengeance Blue 8 GB DDR3 1600MHz (PC3 12800) Desktop Memory 1.5V
HDD: WD Blue 1TB SATA 6 Gb/s 7200 RPM 64MB Cache 3.5 Inch Desktop Hard Drive (WD10EZEX)
PSU: should it have a 24-pin main power connector and an 8-pin ATX12V connector, or a 20+4-pin connector? For example:
Antec EarthWatts Gold Pro 550W Power Supply 550 Watt 80 Plus Gold PSU with 120mm Silent Cooling Fan, Semi Modular, 7 Years Warranty, 99% +12V and ATX12V 2.4 – EA550G PRO Black
Thank you.
Tim Dettmers says
Sorry I am not able to look at a build in full detail. If you can narrow down your problem to a single question I might have time to answer.
Parag Maha says
Thank you for sharing this article. You explain everything very well, in the article as well as in the comments. It is very helpful for me, as I am preparing for hardware courses, so I often search for these things, and I found that your blog is simply awesome. Thank you once again. Waiting for your new article. All the very best, keep writing!
Mohammed Sarraj says
Hello Tim,
I am adding these questions to the list of questions mentioned above (just a reminder):
+ I am going with a 2x 2080 Ti setup for now, and I am going to be expanding in the future to 4x 2080 Ti. However, would I benefit from using NVLink? If so, how is it going to impact things? For example, would I be able to double my memory? Would it affect other bottlenecks?
+ Since I am going to be using 4x 2080 Tis and probably a 9920X (depending on your answer to the questions from the previous post), I am expecting to use about 1800 Watts. However, I couldn't find any PSUs with that capacity. I am looking for a PSU with low voltage ripple (for overclocking purposes), but I wasn't able to find even a single 1800 Watt PSU on the market. What is your advice in this case? I've heard of the Add2PSU component ( https://www.youtube.com/watch?v=erHoq3DbwVA ). However, I am not sure how safe it is to use multiple PSUs in a single PC. What do you think?
Thanks
Tim Dettmers says
(1) NVLink on RTX cards is currently limited to two GPUs only and it will not help you too much in that scenario for data parallelism. For model parallelism, it could help, but currently, there are no models and code that profit from that. So it might be useful in the future, but not right now.
(2) 1800 Watts is a bit much. You need about 275 W per GPU and another 300 W for the CPU, which is about 1400 Watts. If you get a 1300 to 1600 Watt PSU you should be fine, I think, even if you overclock. There are some nice ones from EVGA; you can find them easily if you search Newegg.
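A quick sanity check of that estimate (the per-component wattages are the rough figures from this reply, not measured values):

# Rough PSU sizing estimate using the ballpark figures above.
gpus = 4
watts_per_gpu = 275      # rough RTX 2080 Ti figure with some headroom
cpu_watts = 300          # generous figure for a high-end desktop CPU
total = gpus * watts_per_gpu + cpu_watts
print(total)             # 1400 -> a 1300-1600 W PSU leaves a reasonable margin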
Mohammed Sarraj says
Thank you so much! One more thing, can you please help me with the questions from the previous post? I will post the questions here:
”
Just a quick clarification on your reply earlier, I am planning on expanding later to 4 x 2080 TI. So, in this case I would go with Rampage VI Extreme & (9920x or 9940x depending on price) and 2x2080ti. However, the issue is that I am concerned about the number of lanes going into GPUs. Both of these CPUs has 44 lanes, which is not enough to run 16 lanes on each GPU. Does it matter (16 vs 8 lanes)?
Another option is to go with threadripper which has 60 lanes (again not enough for 4×16), but at least I might be able to run 3×16 + 1×8.
Also, you addressed this before, but I just want to confirm, CPU clock is irrelevant. Basically, I am not losing anything by going down from 9900k @ 5.3 GHz to 9920x @ 4.7 GHz to even a threadripper 2950x @ 4.4GHz. right?
Also, is there a difference between using an intel cpu vs amd? (Sorry if this seems like a very broad question, but I’m not sure which way to go; intel has higher clocks and worst value-for-money, while amd has better value and more lanes)
Thank you again for being patient with me! Choosing the build components has been a steep learning curve for me. I am really glad that I found someone to point me to the right direction.”
Thanks Tim!
Mohammed says
Thanks for your prompt response!
Just a quick clarification on your reply earlier, I am planning on expanding later to 4 x 2080 TI. So, in this case I would go with Rampage VI Extreme & (9920x or 9940x depending on price) and 2x2080ti. However, the issue is that I am concerned about the number of lanes going into GPUs. Both of these CPUs has 44 lanes, which is not enough to run 16 lanes on each GPU. Does it matter (16 vs 8 lanes)?
Another option is to go with threadripper which has 60 lanes (again not enough for 4×16), but at least I might be able to run 3×16 + 1×8.
Also, you addressed this before, but I just want to confirm, CPU clock is irrelevant. Basically, I am not losing anything by going down from 9900k @ 5.3 GHz to 9920x @ 4.7 GHz to even a threadripper 2950x @ 4.4GHz. right?
Also, is there a difference between using an intel cpu vs amd? (Sorry if this seems like a very broad question, but I’m not sure which way to go; intel has higher clocks and worst value-for-money, while amd has better value and more lanes)
Thank you again for being patient with me! Choosing the build components has been a steep learning curve for me. I am really glad that I found someone to point me to the right direction.
Mohammed says
Thank you Tim for such a great guide! I have a question about the asynchronous mini-batch allocation code you mentioned. I am using Python, mainly through Keras and sometimes TensorFlow. How can I do the asynchronous allocation? Also, I am not familiar at all with CUDA code, but how hard is it to learn? And is there a way to integrate CUDA code into my normal use (Python and Keras)?
Tim Dettmers says
This blog post is a bit outdated. It seems that TensorFlow uses pinned host memory by default, which means that you are already able to do asynchronous GPU transfers. While I stressed it in the blog post, it is actually not that big of a bottleneck in most cases. For large video data, it could have a good impact.
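For reference, in PyTorch the same idea is made explicit; a minimal sketch with stand-in data:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
# pin_memory=True places batches in pinned (page-locked) host RAM.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device("cuda")
for x, y in loader:
    # non_blocking=True lets the host-to-GPU copy overlap with GPU compute;
    # this only works when the source tensor is in pinned memory.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # the forward/backward pass would go here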
Mohammed says
So here is the build I am considering. I should say that my main goal is to learn deep learning as quickly as possible. I am planning on doing as many kaggle competitions (open and closed) as possible Can you please help?
GPU: 1 x GTX 1080 Ti + 1 x RTX 2080 Ti (I might add a third card depending on your recommendation)
Cooling: Liquid (open loop)
PSU: RM1000x. I might go a bit higher in terms of OC quality (higher tier) and power delivery (targeting 60-80% at full load for best efficiency, where the expected maximum load is ~800W). I am considering getting an AT1200x instead.
As for CPU and RAM, there are 2 options:
Option A: (Higher CPU clock but Dual Channel Memory)
CPU: 9900k with 16MB smart cache (not sure but estimated @~5.2-5.5GHz for all 8 cores)
Motherboard: Asus Maximus XI Extreme
RAM: 64 GB (OC to 4000MHz+)
Option B : (Quad Memory Channel but Lower CPU Clock and I think overkill core count)
CPU: 7920x with 16.5 MB cache (not sure but estimated @~4.7GHz for all 12 cores)
Motherboard: Rampage VI Extreme
RAM: 64 or 128 GB (depending on your recommendation) again (OC to 4000MHz+)
I really like option A because of CPU clock speed. However, if I go with option A, then I will be using a cloud service for large datasets. The advantages of option B is the option to expand memory to 128 GB and the quad channel. Also, not sure about the difference between the two motherboards in terms of multi-gpu pcie lane allocation (I don’t understand the specs)
So, do you have any comments on the builds?
Which one would you pick?
Where is(are) the bottleneck here?
Thanks
Mohammed says
Also, how much RAM do I need for Kaggle competitions?
Tim Dettmers says
For more than 2 GPUs, choose the Rampage VI motherboard; if you settle on 2 GPUs, then the Maximus XI. I would also go for two GPUs with the same chipset, either GTX 1080 Ti or RTX 2080 Ti, so you will be able to parallelize across GPUs. I would get RAM with a lower clock; there is almost no gain from OC RAM and it is quite expensive.
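For reference, parallelizing across two same-chip GPUs can be as simple as wrapping the model; a minimal PyTorch sketch (the tiny model is a placeholder):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# DataParallel splits each mini-batch across the visible GPUs and gathers the outputs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(128, 512).cuda()   # this batch is split roughly 64/64 across two cards
out = model(x)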
Mohammed says
So your point from the article about RAM clock is not outdated? In other words, is RAM clock irrelevant because of asynchronous mini-batch allocation?
What about data cleaning and pre-processing? Does the same logic apply?
Also, for Kaggle competitions, how much RAM do you think I would need? I hear that 32-64 GB is recommended. However, for image competitions, 128 GB is the minimum. What do you think?
Thanks for being patient with me!
Tim Dettmers says
I think 32 GB is good, but sometimes you need to write cumbersome, memory-efficient code; with 64 GB you can avoid that most of the time. 128 GB is a bit overkill for most competitions, I think. If you have 64 GB and it is not enough, you can always spend some time optimizing your code, and if you still find your RAM lacking you can simply order more. I would start with 32 GB or 64 GB.
A good RAM clock will not help you pre-process much faster. This video puts it quite well: https://www.youtube.com/watch?v=D_Yt4vSZKVk
Angel G says
Hi Tim, I've re-read the comments and a few questions arose in my head.
1. How was it decided that 16-bit floating point numbers have enough precision for neural networks? Doesn't it reduce their recognition abilities?
2. In 2010, I trained a CNN coded in C++. I noticed that if I ran it in more than 4 parallel threads, its learning rate decreased (it required more epochs). The weights were updated concurrently by the threads using (mostly) non-blocking atomic cmpxchg64 instructions. I've skipped all development until now.
Now massively parallel architectures are used (GPUs), so I wonder how they update/combine the weights in parallel without destroying the learning rate?
3. Does CUDA vs AMD matter if I implement the neural networks in the old-school manner, without any SDK, using the OpenGL shading language with floating-point textures?
Tim Dettmers says
Hi Angel. Here are some answers and suggestions:
1. From my experience, 16-bit results are close to 32-bit results, but I did not do thorough testing. There is research that points in both directions (in my research I show that 8-bit gradients and activities work fine for AlexNet on ImageNet). I think if you want to learn, experiment, prototype and do practical work, 16-bit networks are alright. For academic work which requires state-of-the-art results 16-bit methods might have insufficient precision.
2. Again, you might want to read my paper for an overview of synchronous methods. The best methods compress the gradient and send them across the network to other GPUs where everything is aggregated and synchronized for a valid gradient. Such methods work quite well. There are other asynchronous methods championed by Google, but they also suffer from the problems that you describe.
3. Implementing in CUDA is much better because they have better tooling, a better community, and thus better support. However, we need more AMD code because it is not healthy if NVIDIA has more than 90% of the market share in deep learning hardware. If you expect that your software will be useful for many thousand others, AMD might be a good, ethical choice. If it is for a personal project which may be useful to dozens, go with NVIDIA CUDA.
P says
Hi Tim, does DL require I/O access to storage (such as an SSD)
during the learning process? If so, is it a lot, or only a bit occasionally?
I guess lots of access is required at the beginning to load the data and only a small amount of I/O is required to save the learned weights. Am I correct?
Tim Dettmers says
It is mostly loading data and an SSD is only required for performance if you have very large input sizes. If you do not have that, you will gain no performance over using a spinning hard disk. However, besides DL you use the SSD for many other tasks, so if you have the money I would definitely go for SSD drives to make I/O work more comfortable.
P says
Thanks Tim. Do DL-related programs require many syscalls or ioctls?
Regardless of cost, is there an advantage to using an i7-8700K rather than an i5-8400? Is the i5-8400 good enough to get things done?
Is it worth paying more to get the NVIDIA 1080 rather than the 1070 Ti?
Leorexij says
Hi Tim,
This is one place that collects all the excellent references. It really helps people configure their machines to perform efficient deep learning.
I'd like to ask a question and would be grateful if you can help me.
I am going to use 2,000 images at a time and I want to use the TensorFlow and Theano frameworks in Python. Can you advise me on a configuration to achieve this with good performance?
And my budget is less than 50,000 INR
Abdelrahman says
Hi Tim,
Thanks for the great article. I am planning to use a 6850K for my deep learning box, with 1070 Ti, 1080, or 1080 Ti GPUs, planning to extend to 4 GPUs later.
I just wonder if the following motherboard is a good option for a deep learning box (4 GPUs): MSI Extreme Gaming Intel X99 LGA 2011 DDR4 USB 3.1 Extended ATX Motherboard (X99A GODLIKE Gaming)
https://www.amazon.com/MSI-Extended-Motherboard-X99A-GODLIKE/dp/B014VITZPM/ref=cm_cr_arp_d_product_top?ie=UTF8
Nikolaos Tsarmpopoulos says
Hi Tim,
On a 4x GTX 1080ti system, with 1x SSD for Windows 10 Pro, and 3x HDDs in RAID 5 for mass storage (encrypted with Bitlocker), in a secure multiuser environment, I’m looking for an effective approach to separate storage from compute: I currently have to reconfigure access rights in the RAID volume for the users, every time the OS is reinstalled (see clean installation after breaking things).
I think it would make sense having a type 1 (bare metal) hypervisor to allow for Windows and Linux VMs to access the hardware as needed. I’m considering a VM for NAS, and two more VMs for Windows and Linux.
Do you know if this is possible and, if so, which hypervisor allows CUDA from VMs running Linux and Windows to access the GPUs? Is there a particular, tested software configuration that you can recommend?
Thanks in advance.
Levent says
Hi Tim,
Thanks for this excellent article!
We’re trying to configure our new machine and thinking of buying:
ASUS Z10PE-D16 WS as the motherboard. It’s obvious that we can’t fit more than 3 GPUs on this mobo, but what about using ribbon extender cables and hanging the GPUs, just like “mining rig” people do?
Do you think this will be a good idea, as this mobo has 4 x PCI-e x16, and 2 x E5 26xx will have 80 PCI-e lanes?
Julien says
Hey Tim,
Sorry if I missed this point, but suppose I only plan on having 2 GPUs max. Would a 16 PCIe lane CPU work given that each GPU utilizes 8 lanes?
Tim Dettmers says
Yes, for 2 GPUs, 16 lanes is plenty!
Shahab says
Hi Tim,
Thank you so much for your great article. It really helps people better configure their machines to perform efficient deep learning.
I’d like to ask a question and would be grateful if you can help me.
At this moment, I have a GTX 1080 Ti with an Intel Core i5 6500 and 8 GB of RAM.
My question is: is it worth upgrading the CPU and RAM to a Core i7 7700 (or 6700) and 16 GB, respectively? Would there be any boost at all if I do this upgrade?
Thank you in advance.
Clément says
Hello,
I plan to build a machine with Ubuntu for deep learning.
I already have a GTX 1080 Ti.
I'll get 32 GB of RAM.
My question concerns the CPU.
I'd like to buy an i5-7600K. Is it OK with the 1080 Ti? Will it work well if I decide to add another 1080 Ti in SLI?
new_dl_learner says
Hello Tim, what do you think of using the Threadripper compared with the i7 7700K, i9 7900X or 7940X? My two main concerns are: 1) there are user reviews saying that under Linux there are bugs with PCIe. Not sure if I will encounter such bugs if I install 1-4 Nvidia GPUs. 2) The lack of motherboards that support PCIe 3.0 x16/x16/x16/x16. Some mentioned that there is no noticeable difference between x8 and x16. I guess they talked about the frame rate for gaming. Not sure if this conclusion applies to deep learning. Any idea?
new_dl_learner says
Hello, I have spent too much time on hardware selection. It is driving me nuts. I need some help. My current laptop computer is almost 10 years old. I am building a desktop replacement. I also want to use it for DL/ML research. As far as I know, the latest CPUs such as the i7 7700K, i9 7900X and Threadripper do not support PCIe 3.0 x16/x16/x16/x16. Motherboards that support such quad PCIe 3.0 at x16 each only support older CPUs with the LGA2011-v3 socket or the Xeon E5. These CPUs are in a similar or higher price range than the ones mentioned above. Moreover, they run at lower clock speeds. I guess the dilemma is: spending more on older technology for quad PCIe 3.0 x16/x16/x16/x16 vs. spending less on the latest CPUs that only support two PCIe 3.0 slots running x16/x16. Any suggestion appreciated. Thanks.
Nikolaos Tsarmpopoulos says
Dear new_dl_learner,
The X99 motherboards with PLX chips (for quad PCIe 3.0 x16) are no longer produced and are out of stock in most places; hence, no dilemma here. As Tim has explained earlier, the CPU is not as important for deep neural networks as the GPU. Also, PCIe 3.0 x8 is not much worse than x16. If you start with anything less than a quad-GPU setup, by the time you need faster data throughput new CPUs and GPUs will be available, sporting PCIe 4.0, so you won't be limited by PCIe 3.0 x8.
new_dl_learner says
Dear Nikolaos,
Thank you for the useful information. My laptop computer has an i7 at 2.66 GHz (8GB 1067 MHz DDR3, NVIDIA GeForce GT 330M with 512 MB GDDR3). Will that be sufficient for me to learn about deep learning and do some work in this area before PCIe 4.0 comes out? If not, what hardware do you recommend during this transition period?
Nikolaos Tsarmpopoulos says
You can start experimenting with Deep Learning using the CPU of your existing laptop and a compatible library. For example, Tensorflow is available for CPU. Start with the basics.
When you reach the point where you need faster compute capability, depending on your budget, you can put together a PC. At that time, you will know what the requirements of the software and of the models that you are using will be, so choosing the right hardware will be much easier than it is today for you.
No need to rush it; it doesn't help to build a supercomputer today if you don't know how to use it for deep learning. By the time you need one, new technologies will be available and you'll be in a better position to choose wisely.
new_dl_learner says
Thanks for your suggestion. When can we build computers using PCI-E 4.0 parts?
Yes, Tim mentioned that the CPU is not that important, but that was in 2015. Not sure about now. Some people mentioned that DL/ML applications such as TensorFlow take advantage of multi-core, multi-threaded CPUs and recommended getting at least 8 cores. My laptop is showing signs of failing. Besides DL/ML, I also run engineering applications that would benefit from higher-clocked CPUs.
One concern I have is that in the GPU selection thread, somebody asked about the possibility of doing DL on a computer with an Nvidia 750M GPU. Tim mentioned that “So a GDDR5 750M will be sufficient for running most deep learning models. If you have the DDR3 version, then it might be too slow for deep learning (smaller models might take a day; larger models a week or so).” Mine is an older laptop with a GPU using 512 MB of GDDR3. So, it will take days to train even smaller models?
Nikolaos Tsarmpopoulos says
PCIe 4.0 is reportedly due to be rolled out in early 2018. That means new CPUs, new motherboards, new GPUs. NVIDIA is likely to lead with the announcement of its Volta GPUs, but that remains to be seen.
My understanding is that no matter how fast a GPU and CPU you purchase, there will always be a model that takes days to train and another that's too big to fit in your GPU's memory, which is why NVIDIA sells Quadro and Tesla cards.
Hence, I think the bottom line is: use whatever system you can afford and justify for your education right now. By the time you need faster hardware, you’ll know what you need, what you need it for and what options are available at that time.
new_dl_learner says
Thanks. Given the timeline, I had better save the money rather than build a top-of-the-line 4-GPU system now.
Some mentioned that, as deep learning is very computationally intensive, having a fast system really helps during the learning stage, as it would allow the learner to try out different parameters and see how the results are affected without waiting for days. I did research on neural networks when I was a PhD student. However, I did not learn about deep learning. As it is just neural networks with more hidden layers, I suppose it will take me less time to learn it. Given this background, can anybody recommend hardware to use during this educational transition stage? How much RAM should I get?
An alternative is to replace my laptop first. I suppose if I get a laptop, I should get a decent but not too expensive one. In deep learning, will the mobile versions of the GTX 970M, 1050, 1050 Ti, 1060, 1070, and 1080 perform much better than my GeForce GT 330M with 512 MB? I know that for a desktop it is better to buy the 1080 Ti. How are the mobile versions of these GPUs ranked relative to each other? I don't want to buy a high-end laptop when I could use the money to build a high-end workstation later on.
Reagan says
Tim,
Great blog; this is my second post. You mentioned previously that, of all the characteristics of GPUs, RAM size and bandwidth are the most important. I haven't seen anywhere on your blog where you mention the relationship between CUDA cores and DL speedups. The meat of my question is: what is the real difference between the 1070 and the 1080, given that they both have 8 GB of GPU RAM? I'm considering buying a 1070 for the cost savings over a 1080 for my toy DL rig.
Reagan says
I pretty much answered my own question. The 1080 is only $70 or so more expensive than the 1070 for 500 extra CUDA cores and more bandwidth. I'll go for it.
BUT, my next question is really important: is a quad-channel mobo a must?
new_dl_learner says
Hello Tim, is it better to get a CPU which supports AVX-512?
Tim Dettmers says
If you do not use GPUs this might be a sensible investment; otherwise it will not be that important, and I would not select a CPU based on this feature alone.
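As a side note, if you want to check whether a given Linux machine's CPU reports AVX-512 support at all, one quick way is to look at the CPU flags in /proc/cpuinfo. A minimal Python sketch (Linux only; just a convenience check, not something specific to deep learning):

    # Minimal sketch: list any AVX-512 related CPU flags (Linux only).
    with open("/proc/cpuinfo") as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                avx512 = sorted({flag for flag in flags if flag.startswith("avx512")})
                print("AVX-512 flags:", ", ".join(avx512) if avx512 else "none reported")
                break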
new_dl_learner says
Thank you for your expert advice.
James says
Hi Tim,
Great info, thanks. I was wondering if you knew about companies that could offer this service? I know NVIDIA used to build dev boxes but stopped. This would help me focus more on development and not worry too much about building the machine.
Thanks,
Jame
Tim Dettmers says
There were some deep learning desktops from other companies, but I cannot find them on Google. I think some of them might be buried in the comment section somewhere; try searching there. Other than that, you could also just buy a high-end gaming PC. Basically, there is no difference between a good deep learning machine and a high-end gaming machine. So buying a high-end gaming machine is a perfect choice if you want to avoid building your own machine. I would still recommend giving building your own machine a shot — it is much easier than it looks!
Jame says
Hi Tim! Thanks for that. I will look around the comments. If I don't find any other solution I will have to start learning… The problem is the size that we are looking at; it is maybe too big/challenging for a single person.
Thanks!
Reagan says
Tim,
What do you think about using the Xeon E5-1620 v4 instead of the i7-5930K for a quad-GPU machine? The Xeon is half the price of the i7, also has 40 PCIe lanes, higher memory bandwidth, and the same socket type.
Is there something about server chips I’m not seeing that would interfere with me using this chip on an X99 mobo?
Also, is there a difference between DDR3 and DDR4 RAM that is meaningful to deep learning?
Great blog!
Tim Dettmers says
The Xeon is definitely the better option here. It has less cache and fewer cores, but this should only have a minor influence. The chip should work normally on an X99 mobo. For deep learning there is a very minimal difference between DDR3 and DDR4. The performance difference would probably be a few percent, which should not be noticeable unless you run the GPUs 24/7. However, if you want performance for a 4-GPU setup, then the first thing you should look into is cooling, in particular liquid cooling. Other factors are insignificant.
Sumeet Singh says
Hello Tim,
Thanks for this wonderful resource for deep learning DIYers. Based on this and several other resources on the internet, I have built my first A.I. ‘rig’ on which I am training an image captioning/transcription/translation neural network – Im2Latex: converting LaTeX-generated images back into the original LaTeX markup. I have a convnet of about 14M parameters and my conditioned attentive LSTM has about 8M parameters. I had been running this on Google Cloud Platform before I built my own ‘rig’ and am happy to report that my rig with one GPU trains almost twice as fast as the one with one virtual GPU on Google (i.e., half a K80). I think I can make mine a little faster with some BIOS settings – but am happy with it so far. Am ordering another 1080 Ti soon. Oh yes, I haven't yet overclocked my GPU, but it naturally runs at over 1900 MHz under load with the help of the Nvidia X Server Settings app on Linux (the temperature is 50 degrees C with built-in liquid cooling, with the CPU at 30 degrees C).
In the spirit of giving back to the community, here's my parts list: https://pcpartpicker.com/user/Sumeet0/saved/#view=gFbvVn. Also, while I do have a copy of Windows 10, I decided to use Ubuntu GNOME 16.04 LTS – mostly because I'm very comfortable with Unix-like operating systems, since I've worked on/with them for over 20 years. One problem with Linux, though, is that most software utilities for overclocking and system monitoring run on Windows. As you – and other resources on the internet – say, the best way to overclock a GPU on Linux is to flash the BIOS. At the least, that's not convenient – especially for a newbie.
Thanks again for this wonderful resource.
Tim Dettmers says
Thanks for your feedback — this is very useful for everybody here!
Overclocking does not increase deep learning performance by much, but it helps to squeeze the last bit of performance out of the GPU. I think you can expect something between a 0-3% performance increase from it. That might not sound like much, but put into numbers, it means that if your GPU runs 24/7 this can be up to roughly 40 minutes of extra compute time per day.
Sumeet Singh says
Thanks. It is good to know that I'm not missing out on too much by not overclocking (and that my choosing Linux over Windows 10 didn't cause me a major performance disadvantage – since I could have very easily overclocked on Windows). Now I can focus on training my 23M parameters 🙂
new_dl_learner says
Thanks. Some sites suggested 8-16 GB of RAM, but I found recent posts suggesting 32 GB or 64 GB. It is also not uncommon to see posts from users using 128 or 256 GB of RAM. What is a reasonable amount of RAM for a home computer, above which it would be better to use online computing services from companies?
new_dl_learner says
I read that the CPU is not as important as the GPU for DL, and that one should just make sure the number of CPU cores is 2x the number of GPUs. However, I also read that CPU cores can be assigned to take part in ML/DL computation. So, does that mean it is good to have as many cores as I can get?
Tim Dettmers says
More cores are always better, but it is also a question of how much you want to pay. I think CPU cores = 2x GPUs might be a bit much at the high end. If you get 3 GPUs, a 4-core CPU is still sufficient. If you have 4 GPUs, a 6-core would also be sufficient. I would, however, not recommend a 2-core CPU for 3 GPUs. 4 cores for a 4-GPU system is borderline: it will be okay if you just run deep learning, but it might become a bottleneck if you run any other application in addition. So choose according to your budget and your needs.
new_dl_learner says
Both Intel and AMD announced an overwhelming number of CPUs in August. Which CPU choice would be the best for ML/DL? At first, I considered the Threadripper, but there is no related motherboard that supports running 4 GPUs at x16/x16/x16/x16 at the same time. I will probably get two 1080 Tis but I may need four later. Is there an advantage in getting a dual-CPU motherboard vs. getting 2 computers?
Sumeet Singh says
I recently built my system (https://pcpartpicker.com/user/Sumeet0/saved/#view=gFbvVn). Firstly, I would recommend that you run your models on a cloud platform first in order to get a sense of what type of hardware you want. For example, in my case, the importance of CPU speed and RAM size is very minimal. All the work should be done by your GPU, which in my experience was about 60x faster than running my workload on the CPUs. However, IMO you should test your model on a cloud platform first (and take measurements) to get a sense for yourself. Second, my first instinct was to go with AMD processors since those are cheaper – but within a day of research I found out that you can't find a reasonable AMD CPU that will control even 40 PCIe 3.0 lanes. The Intel Xeon E5s (and the new Silvers) will easily control 48 lanes per CPU. So that settled the debate for me. I also wanted to run my experiments continuously for several days – even weeks – therefore I preferred to go with server components. Thirdly, there were very few consumer (Intel X99 chipset) motherboards that would simultaneously run 4 PCIe 3.0 cards at x16 speed – and they all cost $600-$1000. Most X99 motherboards that have 4 or more PCIe x16 slots are only able to drive 2 of those at full x16 speed. If you add more cards, their speed will drop down to x8. I found a few server motherboards (Intel C612 chipset) that did drive 4 cards at x16 speed but at the same or a cheaper price point (I ultimately bought an ASRock Rack EP2612 WS for $400). These three points made me decide to go with a server motherboard with two sockets and an Intel Xeon E5 CPU (I'm only using one at this time). I have the option to add up to 4 GPU cards, all of which will be driven at x16 speed, with a total of two Xeon E5 CPUs. I can run two cards with one CPU alone, and that's what I intend to do in the near term. If things work out and if there is a need, I'll add one more CPU and two more GPU cards.
Since I have my model all set up and running on Google Cloud Platform as well as on my own system, I have a very good comparison of speed and price. I will recover the price of my rig in 5-10 months if I run with one GPU and 3-6 months if I run with two (and even sooner with 4 GPUs). This is based on running my computations at least 12 hours a day, every day (which is reasonable for my case). I did *not* factor in the fact that mine runs 1.5x to 3x faster than GCP. I did factor in the cost of electricity, though, which is 40 cents/kWh for tier-3 and 27 cents/kWh for tier-2 consumption in my area. The biggest advantage for me, though, is that psychologically I can now leave my jobs running all the time without worrying about accruing costs (not to mention that GCP rations the use of resources and you sometimes have to convince them when you want to increase your resource consumption – which I find very perplexing). IMO this psychological advantage is a major plus when you're experimenting/researching, because you can never experiment too much. As a bonus, my system runs much faster, which is priceless since I can iterate much faster. That said, you should do your own calculations – I may have made mistakes or overlooked something. Good luck with your project!
Tim Dettmers says
This is very good advice and a thorough analysis — thank you for giving back! This is very valuable and I should incorporate this advice into my blog post.
Sumeet Singh says
Hi Tim,
Glad you found my input useful. I found this entire forum more informative than anything else out there on the internet from a deep-learning DIY POV. Thanks again.
Here's some detail about my PSU calculations:
1. After much deliberation I went for a 1000W PSU for driving the 1 CPU + 2 GPU configuration instead of an 850W PSU. The nominal consumption of this config would be 800 watts (180W for the motherboard based on ASRock Rack's 'Prime95' performance-test results, which their wonderful support rep shared with me, 85W for the CPU, 250W per GPU, 5W per fan (6 fans) and 2.5W per liquid cooling pump (2 pumps)). The cheaper option was to go for an 850W power supply, which as you can see would be running at roughly 94% load. But since I want to run this config 24×7, I want to give the PSU more headroom and hence opted for the 1000W PSU – which is still more loaded than I'd like at 80% (I'd prefer a 70% sustained load). But I'm already quite over budget, so I'll stick with it.
2. When/if I add 1 more CPU and 2 more GPUs, they will consume an additional 585W, bringing the total nominal consumption to 1385W. Then I'll buy another ~850W PSU and run the two PSUs in parallel. This way I can scale up the system incrementally. I haven't tested running with two PSUs yet, but I think it should work (despite what many people will say on the internet) as long as you choose the second PSU such that it is okay with not having the 24-pin and EPS/ATX12V mobo connectors attached (some may not put any voltage on the line or may suffer poor voltage regulation in this scenario, but I hope that my Seasonic will do fine), you ensure that any given component is entirely powered by the same PSU, and all the PSUs are grounded to the same ground. So, for example, ensure that all the motherboard power sockets (one 24-pin and two 8-pin EPS/ATX12V sockets on my mobo) are powered by the same PSU. I'll update this forum when/if I do that. An additional complication is that PSUs normally don't start powering the output lines when you turn on their power switch. They do that only after they receive a control signal from the motherboard (which it sends when you hit the start button on the computer case), which they receive on two pins of the 20/24-pin ATX power connector. You can fake that signal to the second PSU by shorting the correct two pins, or you can buy a device that relays the mobo's signal to the second PSU. I haven't tried this out yet, so I don't know for sure if it will work, but people have done this successfully, so I'm hopeful that I'll be able to make it work.
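For readers who want to redo this kind of PSU sizing for their own parts list, here is a small Python sketch of the arithmetic above (the wattage figures are the nominal estimates quoted in the comment, not measured values):

    # Rough PSU sizing sketch using the nominal figures quoted above.
    # These are estimates; check your own components' spec sheets.
    parts_watts = {
        "motherboard": 180,   # from the vendor's Prime95 test figure
        "cpu": 85,
        "gpus": 2 * 250,
        "fans": 6 * 5,
        "pumps": 2 * 2.5,
    }
    total = sum(parts_watts.values())
    for psu in (850, 1000):
        print(f"{psu}W PSU: nominal draw {total:.0f}W -> {total / psu:.0%} sustained load")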
new_dl_learner says
About getting a dual-CPU motherboard and having each CPU control two GPUs at the same time… I have two questions:
1. If I only have one CPU installed, can the motherboard control 4 GPUs at the same time at x16/x16/x16/x16?
2. In case two CPUs are required to control 4 GPUs at the same time at x16/x16/x16/x16, will DL software such as TensorFlow take care of the parallelism and distribution of the workload across the GPUs automatically?
Sumeet Singh says
1. I didn't find any mobos that will do that – but I didn't seriously consider anything that was north of $670. I suspect you could find this feature in higher-end server motherboards (much more expensive). You will also need a CPU that will drive at least 64 (i.e. 4*16) PCIe lanes. I know of high-end Intel CPUs that will power 96 PCIe lanes, but those are very expensive – I think the newest Intel Xeon Gold and Platinum series (maybe some higher-end Silvers too).
2. Regarding whether TensorFlow will drive all 4 GPUs – that's my working theory. I'm training on TensorFlow, and everything I've read and discussed with colleagues who work on DL tells me that it should work out of the box. I know from experience that if you train on CPUs, then TensorFlow loads them all up (e.g. it would load all 32 cores on a 4-socket/32-core HP server machine). I have read that it will do the same with GPUs. That's my main reason for trying to build a 4-GPU system as opposed to two 2-GPU systems. I just ordered my second GPU and will know in 2 weeks whether this theory pans out. I hope to update this forum then, but I suspect that Tim will already know the answer right away – therefore do ask him.
new_dl_learner says
Thanks.
I guess one uncertainty is whether the two CPUs on the same motherboard coordinate with each other well and automatically. Suppose that CPU1 provides x16/x16 to GPU1 and GPU2 while CPU2 provides x16/x16 to GPU3 and GPU4. When the user launches a DL application on CPU1, will that application automatically distribute the tasks to GPU1, 2, 3 and 4 without any effort from the user?
Sumeet Singh says
I think Tensorflow should be able to do that. I’ll update the forum when/if I have any real experience in this regard.
Nikolaos Tsarmpopoulos says
The latest (Purley) high-end Xeons support 48 lanes per CPU; hence, 96 lanes would only be supported in a dual-CPU config. These CPUs feature up to 3 Ultra Path Interconnect links for CPU-to-CPU communication, at 9.6 or 10.4 GT/s.
Source:
https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview
That is expected to be similar in performance to the older interconnect, QPI. Source:
https://www.nextplatform.com/2015/05/26/intel-lets-slip-broadwell-skylake-xeon-chip-specs/
More information on QuickPath Interconnect and its throughput (about 25.6 GB/s) here:
https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
I haven’t been able to confirm whether, in a dual CPU config, Windows:
1) recognises that a particular process or thread utilises a particular GPU that is attached to the PCIe lanes of a particular CPU,
2) assigns the aforementioned thread to the CPU where the GPU is attached
If someone can help find information on this, please share it here.
If this is not feasible, then it's possible that dual-CPU configurations may be slower than a single-CPU motherboard that utilises a PCIe switch (PLX).
Michael says
***will DL software such as Tensorflow take care of the parallelism and distribution of workload of the GPUs automatically?***
Of course not. You have to manually design your code to run on different GPUs. Usually you would break up your data batches and assign them to individual GPUs in a “multi-tower” fashion. If you don’t understand these things, I recommend learning them before spending any money on hardware. It’s like wondering how big of an engine to get in a car before you learned how to drive. Just get started with whatever hardware you have at hand. By the time you know how to build large complex models, any specific hardware suggestions given here will likely become obsolete.
Sumeet Singh says
Very good point Michael. I (incorrectly) assumed that the questioner already knew about building towers (as in the CIFAR-10 tutorial https://www.tensorflow.org/tutorials/deep_cnn) and that his question was in the context of whether tensorflow would distribute the load after that. But you’re right, this is perhaps not what he meant.
new_dl_learner says
Hello Sumeet, any update?
Sumeet Singh says
Yeah, my dual-GPU configuration has been up and running for a couple of weeks now. It worked exactly as expected with TensorFlow. I've coded a synchronous towers architecture as described in the CIFAR-10 tutorial. I got about a 70-80% speed-up (the speed went from 1.25 s/batch to 0.7 s/batch). I suppose I could get closer to a 100% speed-up if I coded an asynchronous architecture – which I would do if I got more GPUs. Please note that you have to tell TensorFlow to place the operators on a specific GPU – and you can do so easily by building multiple copies of your graph and placing each on a separate GPU (see the CIFAR-10 tutorial). Therefore, from the beginning, try to code your graph such that it can be easily replicated, with all subsequent instances reusing variables created by the first instance. TensorFlow will automatically place the variables on the CPU (provided you specify soft placement in the session config) and take care of transferring their values to and from the multiple GPUs. All you need to do is specify device placement and TensorFlow will do the rest – i.e. it will deploy your graph code to the multiple GPUs and the CPU, transfer data back and forth between the GPUs and RAM, and coordinate the execution of the entire graph spread over CPUs, GPUs and RAM. It takes very little effort compared to how much work it does. Coding an asynchronous model, on the other hand, will take a bit more work. Oh, and one more thing – be sure to use queues and queue runners for reading data asynchronously from the disks so that the data is ready in RAM when the graph needs it. You'll also need to ensure that your BIOS is set up properly – for example, I had to turn on the 'Above 4G Decoding' option on my motherboard. I also noticed that if I turn off ECC, the speed actually slows down, contrary to what I had expected. I also noticed a 'warm-up' period of about 30-45 minutes after boot-up when the graph runs 3x slower – not sure why (maybe the time it takes for the OS to load inodes into the cache?) – but now I just suspend the machine instead of shutting it down.
Hope this helps.
Tim Dettmers says
Thanks for your comment, Sumeet, it is good to see some discussion in the comment section!
James says
Hi Tim,
After some digging I came across this company that could help me – Elysian ai. Do you know them?
http://www.elysian.ai
new_dl_learner says
I cannot find a motherboard that supports Threadripper and 4 x PCIe 3.0 at x16/x16/x16/x16. How come such a motherboard is not available?
Leo says
You never will. Threadripper has 64 PCIe lanes, but you have to leave 4 of them for the chipset, and most mobos now will feed 4, 8 or 12 lanes to NVMe/SSD and other disks.
Tim Dettmers says
I checked Newegg and indeed the X399 boards' specs show that they only support standard PCIe setups. However, if you look at the manufacturer's homepage you will see that they indeed support the full 64 PCIe lanes. I assume Newegg's system is not updated yet to show an x16/x16/x16/x16 configuration in the specs (it seems to be standardized). See for example https://www.gigabyte.com/Motherboard/X399-AORUS-Gaming-7-rev-10#kf
new_dl_learner says
Thanks Tim. I am a bit confused. From the specifications, there is no mention that it supports 4×16, only 2×16. The online manual states the same thing. Am I reading it wrong?
1. 2 x PCI Express x16 slots, running at x16 (PCIEX16_1, PCIEX16_2)
2. 2 x PCI Express x16 slots, running at x8 (PCIEX8_1, PCIEX8_2)
(The PCIEX16 and PCIEX8 slots conform to PCI Express 3.0 standard.)
3. 1 x PCI Express x16 slot, running at x4 (PCIEX4)
(The PCIEX4 slot conforms to PCI Express 2.0 standard.)
Tim Dettmers says
You are right, I was just searching for “lanes” and confused the 64 that I saw with the specs for the motherboard. This is strange indeed; why do they not support the full 64 lanes? There was another blog post saying that this particular board would support that, but the manufacturer's page clearly says it does not. I would get in touch with the manufacturer and just ask.
new_dl_learner says
As far as I know, even though Threadripper supports 64 lanes, none of the available motherboards allow 4 GPUs running at x16/x16/x16/x16 at the same time. So, if I want more than two GPUs running at x16, I will need to choose an Intel CPU?
Nikolaos Tsarmpopoulos says
The following article suggests that only 56 out of the 64 PCIe lanes can be used for GPUs:
http://www.guru3d.com/articles-pages/amd-ryzen-threadripper-1920x-review,4.html
I haven't found a good explanation, but I think it's likely that 4 lanes are used to connect the CPU to the X399 chipset.
PCIe switches could be used (in future workstation motherboards) to support more than 3 GPUs at x16 lanes.
Muhammad Abdullah says
Hi Tim, I'm new to deep learning and computer vision, and I need to build a workstation for that within a $1000 budget. I'm considering used and low-cost components available in Pakistan. So far I have found the following options.
Board + Processor + Casing + Power Supply
Dell T3610 with Xeon E5 Series CPU (12 Core, 35M or 30M Cache, 2.0 GHz) $570
Asus X99 Motherboard with Core i7-5820K (6 Core, 15M Cache, 3.3 GHz) $237 + $332 +$10 +$42= $621
MSi X99S SLi Plus with Xeon E5-2620 v4 (8 Core, 20M, 2.1 GHz) $801 + $20 + $42= $863
HP Tower z820 with Intel Xeon E5-2687W (8 Core, 20M, 3.1 GHz) $550
HP Tower z620 with Intel Xeon E5-2650 (8 Core, 20M, 2.0 GHZ) $431
GPUs
GTX 1050Ti – $185 – 2GB
GTX 1060 – $400
GTX 1070 – $512
GTX 1080 – $711
Other options include a Quadro 5000 with 2.5 GB and a 320-bit memory bus
RAM
16GB DDR4 $142
16 GB DDR3 $33
HDD
500 GB $19
1 TB $33
SSD
128 GB $28
Please guide me toward the most powerful and lowest-cost combination that will help me in the future. Also let me know if a better combination of motherboard and processor can be made from the available parts.
Tim Dettmers says
You can save money by using the DDR3 RAM option with a suitable motherboard. The cheap E5 options look quite good to me. I would go for a GTX 1070 given these prices. If you are short on money, a GTX 1060 with 6GB of RAM would also be okay. Hope that helps!
Fernando says
Hi Tim,
I followed your guide to better understand my needs for the computer I want to build for deep learning applications. However, I have a question regarding the PCIe lanes from the CPU.
Specifically, you mention that 40 PCIe lanes are good to go for 4 GPUs, and also that every GPU communicates over 16 PCIe lanes. In my mind, if I would like to use the full potential of the GPUs, I calculate that I would need 16×4 = 64 PCIe lanes on my CPU to make this communication efficient. I definitely misunderstood something, but I would love to know how you came to this conclusion. So the question basically is: how many PCIe lanes does a CPU need? Do I need more than the ones my GPUs demand? Is there any other component using these lanes that makes it necessary to have even more?
Thanks in advance for the information.
Best Regards,
Fernando
Tim Dettmers says
Generally, the more the better. While PCIe speed is not that important if you only do parallelism among 4 GPUs, it is still the easiest factor with which to improve (or, if you do not have the lanes, degrade) your performance. Generally, only devices that are attached to the PCIe bus draw lanes. For example, if you have a PCIe SSD, this will also affect the transfer speed to your GPUs. The setup in which your PCIe devices can run is specified by the motherboard. For example, you might have a 40-lane CPU, but your motherboard only supports an x8/x8/x8/x8 setup for your PCIe devices (in this case GPUs), so that no GPU can utilize the full x16 speed.
In my research on GPU parallelism, it is usually the case that networking performance is the greatest bottleneck. So if you have few lanes, your algorithms are limited in how much they can scale. There are algorithms which get around this, but currently only Microsoft's CNTK supports those algorithms (block momentum parallelism). So in general, having the full 40 lanes (or rather 36, because one GPU with x16 speed is useless for deep learning if the other GPUs do not have x16) is a good thing. On the other hand, CPUs and motherboards that support 40 lanes are more expensive. From a cost-efficiency perspective it might be better to go with fewer lanes — you might just get more bang for the buck.
The details are more complicated, but I hope this helps you get an overview of the issue.
Nikolaos Tsarmpopoulos says
Fernando,
Also check if a PCIe switch (PLX) makes sense for the types of workloads you will be creating.
Some motherboards feature PLX chips to allow for 4 GPUs operating at 16 lanes each. Typically, a PLX chip connects to the CPU using 16 lanes and handles 2 GPUs using 32 lanes. The paired cards can communicate with each other (DMA) via the PLX chip, i.e. without consuming any PCIe lanes on the CPU and without being affected by any communication of the other pair of GPUs.
Also, note that PCIe is full duplex: a device that supports 16 lanes effectively has x16 uplink and x16 downlink, which can be used concurrently at full speed. This is beneficial when the software library uses the following approach:
If the workload can be split into four processing stages that take about the same processing time and each stage can be handled by a separate GPU, here’s how the data would be transferred at full speed: GPU1, GPU2 are attached to PLX1 and GPU3, GPU4 are attached to PLX2. The CPU uses 16 (uplink) lanes to send data to GPU1 via PLX1. At the same time (in parallel), GPU1 transfers the data it has just processed to GPU2 using 16 lanes via PLX1, GPU2 transfers data it has processed to the CPU using 16 (downlink) lanes via PLX1, the CPU transfers this data to GPU3 using 16 (uplink) lanes via PLX2 and, similarly, GPU3 transfers data it has processed to GPU4 using 16 lanes via PLX2; GPU4 transfers the data it has processed back to the CPU using 16 (downlink) lanes via PLX2. You’ll notice that the GPUs make use of all 16 PCIe lanes available to each of them and the CPU also makes full use of 32 lanes (on both directions, up link and downlink).
In other words, your software can potentially make optimal use of 16-lane GPUs, via a CPU with 32 available PCIe lanes, if it only needs to send data from the CPU to the first GPU and receive data concurrently from the last GPU (in a sequence of GPUs, where each GPU does some processing and forwards the data to the next, for further processing) back to the CPU. The workload needs to be balanced, so that GPUs don’t wait too long.
I'm aware of two EATX motherboards with this feature; they are quite expensive and I'm not sure if the additional cost can be justified in terms of performance:
ASRock X99 WS-E/10G
ASUS X99-E WS
Regards
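To make the staged-pipeline idea above a bit more concrete, here is a minimal PyTorch sketch (my own illustration, not from this thread) that splits a model into stages placed on different GPUs and hands the activations from device to device; a real pipeline would additionally overlap the stages with micro-batches, which this sketch omits. It assumes at least two CUDA devices are visible:

    import torch
    import torch.nn as nn

    # Minimal model-parallel sketch: one stage per GPU, activations are
    # copied across PCIe (or a PLX switch) between stages.
    class TwoStageNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
            self.stage2 = nn.Linear(2048, 10).to("cuda:1")

        def forward(self, x):
            x = self.stage1(x.to("cuda:0"))
            return self.stage2(x.to("cuda:1"))   # device-to-device transfer happens here

    net = TwoStageNet()
    out = net(torch.randn(64, 1024))
    print(out.shape, out.device)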
new_dl_learner says
When looking for a motherboard, do I need to make sure it supports something like NVIDIA Quad SLI, 4-Way SLI, 3-Way SLI, or SLI Technology
if I plan to use more than one 1080 Ti in the same machine?
Nikolaos Tsarmpopoulos says
You will not be using SLI for deep learning.
I suggest that you ensure the motherboard has enough PCIe x16 slots for future expansion (as Tim has advised) and, if you are concerned about the number of lanes that will be available in a multi-GPU setup, that you download the manual (PDF) of the motherboard you're interested in buying and check the number of lanes according to the number of GPUs.
Without a PLX chip, depending on CPU lanes, the manual could say for example 2 GPUs at 16/16, 3 GPUs at 16/8/8, 4 GPUs at 8/8/8/8. A motherboard with PLX typically says 2 GPUs at 16/16, 3 GPUs at 16/16/16, 4 GPUs at 16/16/16/16.
new_dl_learner says
Thanks. I heard that although the Threadrippers have a lower CPU clock than the best Intel CPUs, they support 64 PCIe lanes and quad-channel DDR4 memory. Does that mean the TR4-socket motherboards would support running 4 Nvidia GPUs at the same time?
About using more than one GPU for DL: it seems that I need to write software to take advantage of parallelism. Isn't the use of multiple GPUs to solve problems automatic? I mean, when more than one GPU is installed, do the hardware and software (e.g. TensorFlow) automatically detect the existence of multiple GPUs and divide the task among all the installed GPUs?
Nikolaos Tsarmpopoulos says
Indeed, Ryzen Threadripper comes with 64 lanes, but my understanding is that some of these are reserved for other motherboard features, e.g. the chipset, M.2 slots, etc. Check the following as an example:
“4 x PCIe 3.0 x16 (single@x16, dual@x16/x16, triple@x16/x16/x8 mode)”
Source : https://www.asus.com/Motherboards/ROG-Strix-X399-E-Gaming/specifications/
It advertises 4 x PCIe 3.0 x16 but then explains that the third GPU will be operating in x8 mode. Check the detailed specs in the manual before you buy a motherboard.
More lanes on the CPU are a good thing, but I suspect a single Threadripper will support up to 3 GPUs at x16 and the server equivalent will support up to 6 GPUs at x16.
Parallelism is implemented differently in each library. Read through Tim’s articles, he has provided some very helpful info in his blog. Also, check each library for updates, as they are gradually improved by their respective authors.
new_dl_learner says
I asked Asus. Their reply is:
” I understand that want to the specifications of ROG Zenith Extreme and ROG Rampage VI Extreme. I know that as a computer user you want to customize your motherboard on your own preference to use it to its full potential. Let me continue assisting you with your concern.
For the two motherboard, it can support 4 x PCIe 3.0 x16 (x16, x16/x16, x16/x0/x16/x8, or x16/x8/x8/x8) at the same time since they didn’t share any bandwidth with any of the slots in the motherboard.”
Does that mean that for these two motherboards I can use four 1080 Ti GPUs running at top speed at the same time?
In another reply, support mentioned that even if I added an SSD or other PCIe cards, the four GPUs could still run at top speed.
Nikolaos Tsarmpopoulos says
My understanding is that “4 x PCIe 3.0 x16 (x16, x16/x16, x16/x0/x16/x8, or x16/x8/x8/x8)” means:
1 GPU at x16, i.e. at full speed
2 GPUs at x16/x16, i.e. both cards at full speed
3 GPUs at x16/x16/x8, i.e. two cards at full speed, the third at half speed
4 GPUs at x16/x8/x8/x8, i.e. one card at full speed, the three others at half speed.
Hence, a motherboard with the aforementioned capability does not support more than 2 GPUs at full speed.
I suggest you go back to the manufacturer and ask them to clarify their position. You can ask them, for example: if you attach 3 or 4 GPU cards and each card supports x16 bidirectional PCIe lanes, will the particular motherboard (of interest) allocate, to each card, x16 dedicated bidirectional PCIe lanes, for concurrent transfer of data between the cards and the CPU?
If the motherboard supports 4 GPUs at full speed, it’s typically reported as x16/x16/x16/x16.
new_dl_learner says
Thanks Nikolaos. I will ask support as you suggested.
I may be wrong, but I get the impression that he might be trying to hide something. For example, in a previous email, he wrote the following.
Hmm… What does “For the full speed, it actually depends on how you will use the 4 ROG-STRIX-GT1080TI-11GB at the same time” mean?
He also suggested getting an overclocked version of the 1080 Ti. Does the overclocked version perform noticeably better?
—————-
“The ROG Zenith Extreme again can work with 4 graphics card since it support multi-GPU and supports 4 way SLI Technology. For the full speed, it actually depends on how you will use the 4 ROG-STRIX-GT1080TI-11GB at the same time. I can still recommend the Zenith if you don’t want to overclock your GPU. Your GPU speed will not lower down even if you connect an SSD or another expansion card since there are no bandwidth between the PCIE slots. For Intel i9 processor I can suggest the ROG Rampage VI Extreme.”
Nikolaos Tsarmpopoulos says
Tim has an excellent article on GPUs, where he explains the pros and cons of different types of GPU coolers. I suggest that you patiently read Tim's articles and the responses he has given to other readers. There's a wealth of information here:
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/
new_dl_learner says
Thanks Tim. How does performance scale with the number of GPU cards? For example, if I have X 1080 Tis installed in the same computer, will it take 1/X of the time to complete the same task?
Tim Dettmers says
Scaling within one computer is usually quite good. It still depends on the task, but you can expect a scaling factor of 2.5-3.9 for 4 GPUs, depending on the software framework. The main drawback is that you have to add more special code which handles the parallelism. I recommend PyTorch for these kinds of tasks.
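As a rough illustration of the “special code” mentioned above, the simplest form of single-machine data parallelism in PyTorch looks something like the sketch below (the model and batch size are placeholders; this is an illustrative example, not a benchmark setup):

    import torch
    import torch.nn as nn

    # Minimal data-parallel sketch: replicate the model on all visible GPUs
    # and split each batch among them. Assumes at least one CUDA device.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)    # splits each batch across the GPUs
    model = model.cuda()

    x = torch.randn(256, 512).cuda()      # the batch gets divided among the GPUs
    print(model(x).shape)

For multi-machine setups, or for better scaling within one machine, torch.nn.parallel.DistributedDataParallel is the usual next step.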
new_dl_learner says
Hi Tim, I have a PhD in Computer Science but I have not worked on DL before. For CPU, do you recommend the AMD Threadripper, Xeon or Core i7-7700/7700K? I plan to buy a 1080 Ti first and if needed, add more later.
Tim Dettmers says
Any of the CPUs that you listed is fine for deep learning with multiple GTX 1080 Tis. Choose the CPU according to your additional needs (preprocessing, other data science applications, other uses for your computer etc).
new_dl_learner says
Thanks Tim. As far as I know, the software for my other needs does not take advantage of multiple cores, so a faster CPU is better than more cores. Does software related to deep learning take advantage of multi-core, multi-threaded CPUs? If so, about how many cores and threads would be advantageous? AMD and Intel have different system/memory bandwidths. Which would be better?
Tim Dettmers says
Most deep learning libraries make use of a single core or do not use the other cores fully. Thus a CPU with many cores does not have a great advantage over others.
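One place where extra CPU cores do pay off is the input pipeline: most frameworks can prepare batches on several worker processes while the GPU trains. A minimal PyTorch sketch with placeholder data (the worker count is an arbitrary example):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Sketch: CPU worker processes prepare the next batches while the GPU
    # is busy with the current one. num_workers is a placeholder value.
    if __name__ == "__main__":
        data = TensorDataset(torch.randn(10_000, 3, 32, 32),
                             torch.randint(0, 10, (10_000,)))
        loader = DataLoader(data, batch_size=128, shuffle=True, num_workers=4)
        for images, labels in loader:
            pass  # the forward/backward pass on the GPU would go here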
new_dl_learner says
Thanks Tim. Regarding the GTX 1080 Ti, there are several companies selling different variants of the card; which brand and variant do you recommend? I plan to buy one card first and, if needed, add more later.
Somewhere I read a recommendation to stay away from the reference edition, which I read is also called the Founders Edition. Do overclocked 1080 Tis perform significantly better? In the past, overclocked systems tended to fail sooner. Not sure if it is worth it.
Tim Dettmers says
I would recommend the cheapest card. The cards are almost the same. Overclocked cards have almost no benefit for deep learning (for gaming they do, though). I am not sure about the Founders Edition — I have not heard anything bad about it other than other cards being cheaper.
Thorsten Beier says
Hi Tim,
thanks for this great guide!
It helped us choose a deep learning server for TensorFlow. We now use this rack machine https://www.cadnetwork.de/de/produkte/deep-learning but with four Tesla P100s instead of GTX 1080 Tis. But I don't know if there is a huge difference between GTX and Tesla.
I can confirm that 1-3 GPUs are used fully, while the fourth GPU delivers only about 40% of its performance. It could be a limitation of the PCIe bus.
Thanks
Thorsten
Tim Dettmers says
Hi Thorsten,
that is interesting. I do not think that the 40% figure comes from PCIe issues alone; there might be something else amiss. It cannot be a cooling issue, since then you would see performance degradation with the other GPUs too. It would be interesting to know the reason for this. Let me know if you find out more!
I am happy that my guide helped you choose your server! Indeed, Tesla GPUs are only minimally better than GTX GPUs. The P100 is quite a bit better than the GTX 1080 Ti, but it also costs disproportionately more. I think GTX 1080 Tis would have been more cost-effective, but often these are not available for servers (NVIDIA has a policy of selling GTX cards only to consumers and Tesla cards to companies), so overall not a bad choice!
Johydeep says
Hi Michael/Tim,
I am looking for a deep learning PC and I found this “Intel Core i7-7800X Processor”
with
Socket LGA 2066
Compatible with Intel® X299 Chipset
6 Cores/12 Threads
Max Number of PCI Express Lanes 28
Intel® Optane™ memory ready and support for Intel® Optane™ SSDs
AND
MSI Performance Gaming Intel X299 LGA 2066 DDR4 USB 3.1 SLI ATX Motherboard (X299 GAMING PRO CARBON AC)
Is this a good choice for two 1080 Tis at full speed?
Thanks
Johydeep
Tim Dettmers says
It looks reasonable. With 28 lanes you will have a bit slower parallelism, but for 2 GPUs this bottleneck is not too large so you should still be fine; I guess you could expect a performance decrease of 10-15% for parallelism with 2 GPUs, which is okay. Otherwise, the specs are quite good for general computation, so if you want to use your CPU for other data science tasks this is a good choice. If you want to only do deep learning I might go for a slower CPU which has more lanes, but your current option is also not too bad.
Tom says
Hi Tim,
Thanks for your great blog.
Can I use a “GeForce GTX 1080 Max-Q” laptop for deep learning tasks?
Here is the full description. I need something really portable, but at the same time I need to be able to train RNN models.
HIDevolution Asus ROG Zephyrus GX501VI-XS74-HID1 Black 15.6″ w/ IC Diamond Thermal Compound on CPU+GPU – Optimal System Temperatures (FHD/i7-7700HQ/GTX1080 Max-Q/512G PCIe SSD/16GB RAM)
https://www.amazon.com/HIDevolution-Zephyrus-GX501VI-XS74-HID3-Diamond-Compound/dp/B0736C1PP5/ref=sr_1_7?ie=UTF8&qid=1498792741&sr=8-7&keywords=gx501vi&th=1
Tim Dettmers says
The GPU in that laptop is quite powerful so you will be able to train RNNs without any major problems. It also should be quite fast compared to, say, a GTX 1060 which will be quite a bit slower.
Tom says
The CUDA core count is fine, but apart from that everything else is about 30% lower compared to the full GTX 1080.
Will there be any performance issues when training large models?
Tim Dettmers says
You can expect the card to be about 30% slower, but that is still pretty fast compared to other cards. You might need to adapt your models slightly or use 16-bit precision for very large models, but you should be able to run everything that is out there.
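For what it is worth, switching a PyTorch model to 16-bit to save memory can be as simple as the sketch below (my illustration; for training, mixed precision with 32-bit master weights is usually the more robust route, since plain FP16 can run into numerical issues):

    import torch
    import torch.nn as nn

    # Minimal sketch: run a model in FP16 to roughly halve weight/activation memory.
    # Assumes a CUDA device is available.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda().half()
    x = torch.randn(32, 1024, device="cuda").half()
    with torch.no_grad():
        out = model(x)
    print(out.dtype)   # torch.float16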
Mirela says
Hi Tim,
Thanks for the thorough write-up, it is truly helpful.
I am in the process of picking parts for a deep learning machine myself, and I have a focus on graph computation and network analysis.
Do you have any insights on whether an i5 7500 (3.5 GHz) with a 1060 6GB and 32 GB of RAM would do (in a mini-ITX setup), or should I go for a Xeon E3-1225 (3 GHz) with the same GPU but possibly more RAM (64 GB)?
Thanks for your efforts to share knowledge so far!
Mirela
Mirela says
Just a small update on my part:
I assume working with node2vec would fit my context best.
(https://snap.stanford.edu/node2vec/)
Also, I’ve found
https://www.google.nl/url…gL8EYt_ZHetggs1UnH7HU14uA
(link to pdf, via CWI)
(perhaps interesting for you as well of course)
And upon long pondering, I assume the Xeon E5 1620 v4 is a wiser choice compared to an i5/i7 setup.
Xeon is mentioned here, as well as is graph processing for a similar setup:
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=243
+
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=445
I've already invested in a 'good' (what the budget could hold 🙂) GPU, namely the GTX 1060 6GB, and the RAM would be 16 or 32 GB as well (already useful for R).
But now it seems the Xeon would be the best option.
Regardless, many thanks!
Tim Dettmers says
That sounds reasonable. If I were you I would also pay good attention to the motherboard. If it has extra RAM slots (8 slots) then you can always increase the RAM size if you need more; in that way you can upgrade your setup depending on the problem that you are working on.
Tim Dettmers says
You might want to go with the 64GB setup depending on what kind of graphs you will work with. Graph structure can differ greatly, and some graphs will require you to have more than 100GB of RAM, while for others it is more manageable. The CPU is often less important (but this still depends on the graph and problem, so check this for the problems/graphs you work with). A GTX 1060 might be a bit slow at times, but often you do not work with the full graphs anyway because training would take too long. Thus you could also trim down your graph further, and then a GTX 1060 is a solid choice (no large memory required and a good speedup over the CPU).
Mirela says
Hi Tim,
Many thanks!
I have bought the components for the setup listed below, aiming to have as much RAM as possible ('affordable' :).
- Intel Xeon E5 1620 v4
- Supermicro X10 SRL-F
- Kingston DDR4 32GB RAM (LRDIMM), 2133 MHz
-> I plan to have 8 of these modules in total, for 256 GB
(it's even possible to go up to 375 GB based on the CPU, and 1 TB based on the motherboard)
- 256 GB SSD
- 1 TB HDD
- MSI GeForce GTX 1060 6GB
- Noctua cooling
- LEPA power supply, 800W
I have also seen https://event.cwi.nl/grades/2016/00-Leskovec-slides.pdf; it seems like RAM is very relevant indeed 🙂
Looking forward to some graph processing!
And thanks again!
Felix Dorrek says
Hi Tim,
many many thanks for your great blog articles, they are a great help!
I have a perhaps slightly off-topic question. Can you recommend any resources for learning about computer hardware on a conceptual level? I am not really interested in the underlying electrical engineering just yet, but in the different components and how they interact. For example, I'm interested in how data is transferred from memory to GPU memory in more detail.
Thank you again,
Felix
Tim Dettmers says
That is a good question, but unfortunately, I do not have a good answer for it! I also wanted to learn more about the conceptual side of hardware, but the resources that I found are often from universities and textbooks which also go into the details. What I found most promising was to just do Google searches for specific questions and try to get informed through multiple websites. For example, googling “cpu to gpu memory transfer” will yield blog posts, forum questions, presentations on the topic and so forth. With that, you can get informed about that question. From there you might have new questions which you can then google. If you do this for a few hours every week, you will become quite knowledgeable about the concepts quite quickly. Hope this helps!
Tim
Felix Dorrek says
Thanks, I’ll do that then.
I guess I was initially hoping that there was a nice resource to simplify the learning process 😉
Alderasus Ghaliamus says
First of all, thank you very much for the comprehensive and deep knowledge you have given us through the two blog posts: the full hardware guide and the GPU-focused one.
I hope that you can answer my question, with advance apologies if it is asking the obvious.
I was about to spend around £3800 on a PC (the new ALIENWARE AURORA) which has two GeForce GTX 1080 Tis, 64GB DDR4 at 2400MHz, and an Intel Core i7-7700K processor. I was very happy that I could finally decide which PC I should buy for my PhD research over the next two years. What made me happier was that I was following your appreciated GPU-focused blog – YES! I have multiple high-performance GPUs!
However, something took me to the other blog post – this one – and I read the CPU advice, ending with the fact that my CPU has only 16 PCIe lanes – not 40 as you advised. I went back to the first step of my search for a PC, as I was 4 months ago.
I did my best again, [focusing only on pre-built PCs by Dell or Lenovo], and I ended up with another ALIENWARE PC – the ALIENWARE AREA-51, which has the same* GPUs and memory as the first PC in my comment; however, it has a different CPU, the i7-6850K, with 40 PCIe 3.0 lanes. However, the cost went up by a further £700, to £4500. It is expensive; I could and will afford it for my PhD, but it is expensive.
When I reached such a cost, I remembered two laptops from which I had run away because of their costs. I said to myself, if I reached £4500 with the PC, why not go with the laptop for life for one or two thousand more. The laptops are:
(A) ThinkPad P71.
– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 1TB
-Cost: £6200
(B) Dell Precision 7720:
– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 2TB
-Cost: £5800
So, my choices are as follows:
Choice One: new ALIENWARE AURORA as PC + XPS 15 £1800 as Laptop = £5400
Choice Two: ALIENWARE AREA-51 + my very old crying coughing laptop = £4500
Choice Three: no PC + ThinkPad P71 = £6200
Choice Four: no PC + Dell Precision 7720 = £5800
If you could please help me select one choice, or rank them with your reasons – and I am really sorry to take up your appreciated time reading this long comment – you would make my next two years technically truly safe. I have to say that my research covers two different data spaces: genetic data and textual data.
Finally, thank you again for your contribution through this blog, and thank you in advance for getting to this point in reading my comment.
*To be fair regarding the cost of the second PC, it has 2 TB more HDD space [4TB total] than the first PC; however, it provides the same size of SSD: 512GB.
Tim Dettmers says
These are all solid options, albeit all quite expensive. Note that PCIe lanes are not that important if you have 2 GPUs, but they become more important if you have 4 GPUs. However, I do think the biggest issue here is just that these computers are too expensive. If I were you I would go for a used computer which I would upgrade to fit your needs.
For example, just last week I sold my used computer, which is similar to, or even better than, these options, for 800 pounds on Gumtree. So a smart choice might be to buy a used computer and upgrade it with some parts. For genetics research I would try to find a cheap computer which has 8 RAM slots, then buy 64 GB of RAM for the machine and upgrade to 128 GB of RAM if your research requires it. The speed of the RAM is overvalued; a plain DDR3 RAM setup is sufficient and cheap. For some deep learning algorithms, or algorithms in computational biology, a single GPU should be sufficient, but choose one that has a lot of RAM; 12GB is ideal, and I would go for a used GTX Titan X for 400-500 pounds on eBay (make sure your computer has a PSU which supports at least 600 watts).
This option would yield a very high-performance computer for roughly 2000 pounds. Of course it requires some manual assembly, but it really is not difficult and you really should try to do this.
If you cannot get a used option with parts due to university bureaucracy, I would go with an ordinary laptop + a hetzner.de GPU machine, which for a 3-year PhD will cost 4400 pounds but offers everything that you need and can be canceled/upgraded month by month. For most genetics research you should be fine without a GPU, which would cost 2150 pounds on hetzner.de. If your algorithms require double precision then you will need to make a careful choice about which GPU to get, but probably the most cost-efficient solution would involve renting a Tesla GPU in the cloud (AWS for example) to work with double precision when you need it.
So the main options that I see are (1) buying a used computer and upgrading its parts, or (2) buying an ordinary laptop plus a dedicated machine in the cloud. These options will give you the best performance per quid.
Alderasus Ghaliamus says
I don't know how to thank you. Your generosity, in your reading time and response, is profoundly appreciated.
Due to limitations that mean I must buy a 'new' and 'high-performance' PC or laptop, I will take your appreciated advice for the future. For now, I believe that I will go with the first choice on my list, hoping that two GPUs will be enough most of the time.
Again, thank you very much.
Raja says
So I've purchased a Ryzen 1700X and an MSI X370 SLI Plus.
I am wondering if it is going to bottleneck 2x 1080 Tis for deep learning?
Ryzen only has 24 PCIe lanes,
and the board only supports
x8/x8 in PCIe 3.0.
If I understand correctly, that is almost the same as x16 PCIe 2.0.
Is x16 PCIe 2.0 a bottleneck for a single GPU?
Is x8/x8 PCIe 3.0 going to bottleneck 2 GPUs?
A 1080 Ti has 11 Gbps memory,
but x8 PCIe 3.0 would only be around 8 GB/s while x16 3.0 is around 16 GB/s.
Can a modern GPU stream in new data while it is doing calculations?
In which case, some of the bandwidth is being used for compute and some of it is being used to stream new data in?
Bit of a deep learning rookie here.
Grateful for any advice.
Also, will the dual-channel RAM in Ryzen be a problem?
Should I build a Threadripper machine?
As of now I don't plan to use 4 GPUs.
So as long as my current build doesn't have a major problem I will use it with 2 GPUs,
and then next year do a 4-GPU build with Cannon Lake/10nm.
Tim Dettmers says
It depends on the algorithm, but in general PCIe lanes with 2 GPUs are not that important. It will decrease performance, but not by a lot — maybe 0-10% depending on the use case. You should be fine.
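Regarding the earlier question about streaming new data in while the GPU computes: with page-locked (pinned) host memory, host-to-device copies can be issued asynchronously and overlapped with compute. A minimal PyTorch sketch (sizes are placeholders; it assumes a CUDA device is available):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Sketch: pinned host memory plus non_blocking copies let the next batch
    # travel over PCIe while the GPU still works on the current one.
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    loader = DataLoader(data, batch_size=256, pin_memory=True)

    model = torch.nn.Linear(1024, 10).cuda()
    for x, y in loader:
        x = x.cuda(non_blocking=True)   # asynchronous copy from pinned memory
        y = y.cuda(non_blocking=True)
        out = model(x)                  # compute can overlap with the next copy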
Tom says
What is the best way to easily put 4 GPUs (non-Founders Edition) in a board?
Thanks
Michael says
i7-7700k costs over $300. That’s not cheap. For that kind of money you can get a CPU with 40 lanes (e.g. E5-1620v4), and put it into something like ASRock X99 Extreme4 board. Or you could pay more for Asus X99-WS board which has 2 PLX switches and supports quad PCIe x16.
Tom says
Thanks, Michael.
Sam says
That's true, although I'd like to have something newer/faster than Ivy Bridge as I do use this machine for more than just deep learning. If I really wanted to save money I could use the SuperCarrier board with a ~$40 Kaby Lake G3930.
Michael says
The E5-1620 v4 is Broadwell, which is the latest generation of the Xeon architecture. It has much better memory bandwidth than the 7700K.
Sam says
Ahh my mistake! And you’re correct about the memory bandwidth, although how relevant is that considering the DMA bandwidth bottleneck for CPU memory –> GPU memory transfer?
Nikolaos Tsarmpopoulos says
With a 40-lane CPU you can have the CPU concurrently exchanging data with 2 GPUs at full bandwidth.
With a 16-lane CPU you're limited to either x16 lanes to one GPU at a time, or x8 lanes for concurrent exchange of data with both GPUs.
As Tim mentioned earlier, for 2 GPUs you're fine with x8 per GPU. Having said that, 16 lanes on the CPU may not be sufficient, as some of these lanes may be reserved by the chipset or other PCIe devices, e.g. an integrated M.2 slot.
I would opt for a CPU with more PCIe lanes, as Tim and Michael have advised.
Sam says
Hi Tim,
Thanks so much for writing all of this up, it’s very informative. I’m currently picking out parts for a DL machine, and I’m trying to figure out where I may have bottlenecks.
Your piece on DMI for RAM to VRAM transfer is quite interesting. Most of what I'm reading emphasizes high PCIe bandwidth. I'm building a dual GPU system, and I'm wondering if I really need both GPUs running at PCIe 3.0 x16, or if x8 is fine for each? It sounds like the DMA bandwidth could be a problem. I couldn't find much info on DMA related to specific chipsets; however, you mentioned 12 GB/s. Is this bandwidth the same for different chipsets (I'm comparing Z270 to X99)? If I'm mostly running independent models on each GPU, would I see much if any benefit from 2x PCIe 3.0 x16, or would that only really show a big benefit when running the GPUs in parallel for a single model? Asynchronous mini-batch allocation is interesting; however, I'm not sure if it's integrated into all of the newer high-level DL frameworks…
Re: the DMA issue, Intel's new Optane drives are routed through PCIe, and they can be used as RAM in addition to long-term storage. Do you think that these can be used as a way around the DMA bottleneck?
Tim Dettmers says
If you use the right algorithms there will be almost no decrease in performance if you use x8 for each GPU. Even if you use the "wrong" algorithms, the performance reduction should be minimal for most models, since the aggregated transfer times for 2 GPUs are not that large. The transfer costs increase dramatically as you add more GPUs though; for a 4 GPU system it is important that you are on PCIe 3.0 with at least 32 PCIe lanes from your CPU/motherboard.
I would not care too much about DMA. I suppose for most chipset/CPU combos it is the same. It might differ a bit here and there, but the performance difference should be negligible. I recommend using PyTorch for parallelism if you have two GPUs, and if you have 4+ GPUs I recommend Microsoft's CNTK.
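Regarding the asynchronous mini-batch allocation mentioned above: in PyTorch this boils down to pinned host memory plus non-blocking copies, so the host-to-GPU transfer can overlap with compute. A minimal sketch (dataset contents and sizes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical in-memory dataset; in practice this would come from disk.
data = TensorDataset(torch.randn(1_000, 3, 64, 64), torch.randint(0, 10, (1_000,)))

# pin_memory=True stages each batch in page-locked RAM so the host-to-GPU copy
# can be issued asynchronously and overlapped with GPU compute.
loader = DataLoader(data, batch_size=128, shuffle=True,
                    pin_memory=True, num_workers=2)

device = torch.device("cuda")
for x, y in loader:
    x = x.to(device, non_blocking=True)   # asynchronous copy from pinned memory
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```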
Sam says
Thanks! After days and days of research I just ordered my hardware.
I thought I'd mention, there's a motherboard that I feel is perfect for dual GPU rigs: the ASRock Z270 SuperCarrier:
http://www.asrock.com/MB/Intel/Z270%20SuperCarrier/index.asp
This board, like some of the X99 workstation boards, has a PLX switch, allowing dual PCIe x16 or quad PCIe x8 on a Z270 board. For dual GPU rigs, you get the added benefit of being able to run 2 GPUs 4 slots apart (instead of the usual 3 slots on most non-workstation boards). This helps a lot with cooling since there's more space between the 2 GPUs, especially with non-reference coolers taking up 2.5-3 slots these days.
Tom says
Hi Sam,
What processor are you using with the ASRock Z270 SuperCarrier?
The Intel i7-6850K?
Does this motherboard support 40 lanes?
Thanks
Tom
Sam says
Hey Tom,
I'm using an i7 7700K. The ASRock Z270 SuperCarrier is a Z270 board, so it takes LGA1151 chips, which are limited to 16 lanes; therefore CPUs for this board only have 16 lanes. Just like how some of the Asus/ASRock X99 workstation boards have dual PLX chips that allow 4x PCIe x16 with a 40-lane X99 CPU, the Z270 SuperCarrier allows a CPU with 16 lanes to run 2 GPUs at PCIe x16. Of course it's not that simple: I don't think the performance is identical when running PLX chips to get "more" PCIe lanes than your CPU has, but my understanding is that the GPUs can communicate with each other at x16. For the same reason that the Nvidia dev boxes use X99 chips with 40 lanes yet operate 4 cards at x16 through a workstation board with PLX chips, the SuperCarrier lets us run 2 cards at x16 with a 16-lane Z270 CPU. It's much cheaper than an i7-6850K yet allows for similar GPU bandwidth. I think Z270 also has some nice things that X99 doesn't because it's much newer, although X99 does support some things that Z270 doesn't, like quad-channel memory. Lastly, having a board with 4 slots between GPUs is nicer for SLI when you're air cooling.
Michael says
Dual PCIe x16 should not need any PLX switches. The switch is only needed when you want to do Quad PCIe x16, which is more lanes than a single CPU can support.
Sam says
Dual PCIe x16 doesn't need switches when using a 40-lane CPU; however, having a switch on a Z270 board allows me to use a much cheaper, still very powerful 16-lane CPU with 2 GPUs at x16.
Tom says
A non-X99 motherboard and a normal processor which does not have 40 lanes cannot run 2 GPUs at full x16 speed, is that correct?
So in my understanding the ASRock Z270 SuperCarrier is perfect because it allows 2 GPUs at x16 and 3 M.2 drives.
99.9% of motherboards do not have space for 4 overclocked GPUs without water cooling installed. 4 Founders Edition GPUs can be installed in a motherboard like the "ASUS LGA2011-v3 Dual 10G LAN 4-Way GPU ATX/CEB Motherboard (X99-E-10G WS)", which provides full x16 utilization.
Tom says
Hi sam,
Can we use a Z270 board and an Intel boxed Core i7-6850K processor together?
Tom says
Thanks, Sam.
Martin says
Hi!
First of all a big thanks! You’ve basically created the best resource for deep learning enthusiasts looking to build their own machine.
My computer is going to be situated close to my bed, so one priority is noise. This brings me to the first issue of whether or not to get a reference (Founders Edition) 1080 Ti, as they generally seem to be louder than their OEM counterparts. There seems to be a debate around performance and quality between reference and OEM cards. From what I can gather, most people who are doing long-running computations rather than gaming, especially on multi-GPU setups, favor reference cards due to their fan design, which blows air out the back rather than just circulating air inside the case. I'm starting with one card and plan to add a second one later, and I doubt I'll get more than 2 cards for a while.
For the CPU, I was first looking at Intel i7 6850k, which was the cheapest i7 I could find that supports 40 lanes. However, Intel Xeon E5-1620 V4 is almost half the price and also supports 40 lanes. Not sure if the faster i7 is worth the money here?
Lastly, I was thinking about getting a water cooler for the CPU. I’ve read mixed opinions about water cooling, but I reckon moving air outside of the case should be a good thing as it allows the GPUs to run at lower temperatures?
Here’s the build: https://pcpartpicker.com/list/gzfgxY
Any suggestions highly appreciated!
Thomas says
Hello Tim.
Thank you very much for this blog. You gave me solid background for my understanding of dependency between hardware and deep learning.
I have a question about CPU bus speed. Should that be a concern? As you wrote, the true bottleneck is between CPU and GPU, and as I understand it, the "bus speed" listed on ark.intel.com spec sheets refers to that connection.
I have to choose between the E5-262x v4/v3 and the E5-16xx v4/v3. The 262x family has its bus speed set at 8 GT/s QPI while the 16xx has 5 GT/s (8 GT/s is what PCIe 3.0 offers). Besides scalability, clock frequency and memory bandwidth, that is the difference between them, and the only one that matters for deep learning, since all of them have clock speeds above 2 GHz and memory bandwidth over 68 GB/s, and I will not make use of scalability.
This link refers to the possible comparison of those models: http://ark.intel.com/compare/92980,92986,92994,92987
Thank you in advance.
Tim Dettmers says
It is correct that this is the main bottleneck between the CPU and GPU; however, only a very tiny amount of time is spent on CPU-GPU interactions at the memory level compared to the actual GPU computation. It becomes more relevant if you have multiple GPUs, but for multiple GPUs the main bottlenecks are elsewhere. Currently a good CPU (in terms of bus speed) will improve your deep learning performance by about 0-1.5% compared to a "standard" one, so I would not worry about it too much. I think all the CPUs that you linked are more than fine.
Adarsh says
Hey Tim, Nice article. I have a question:
Let's suppose I train a model for a particular task (classify an image into class 1 or 2) on a particular kind of input (800x400 resolution). I want to choose a GPU with the MINIMUM number of cores and power that would give me the result in under 100 milliseconds. How do I estimate this without running the model on a GPU? Is there a relation between the number of GPU cores and the performance of a deep learning model?
Thanks a lot in advance.
Tim Dettmers says
The speed of the computational units of different GPUs of the same series is about the same (NVIDIA Titan Xp modules are not much faster than, say, GTX 1060 modules); the reason why bigger cards are faster is that they just have more modules (called streaming multiprocessors, or SMs). If your model is not computationally intensive, then benchmarking some small GPUs and extrapolating by the number of SMs might be a valid option to find the optimal GPU in this case.
For operations which saturate the GPU, such as big matrix multiplications or convolution in general, this is very difficult to estimate. It sounds like you want to reduce costs; a good way to do this is also through power efficiency, which is a very transparent option that can be easily optimized. It also sounds like you want to reduce latency. This is very difficult to test because computational graphs differ too widely; the only option that I see is to find people that have these GPUs and let them run benchmarks on your model, or otherwise try to generalize existing benchmarks to your model.
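As a very rough illustration of the "extrapolate by SM count" idea (a sketch, not a calibrated model; real scaling is usually sub-linear because of memory bandwidth and occupancy effects):

```python
def estimate_latency_ms(measured_ms, sms_measured, sms_target):
    """Hypothetical helper: optimistic linear-in-SM scaling of a measured latency."""
    return measured_ms * sms_measured / sms_target

# Example: 40 ms measured on a card with 10 SMs, estimating a card with 28 SMs.
print(estimate_latency_ms(40.0, 10, 28))   # ~14.3 ms, a best-case lower bound
```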
Adarsh says
Can you give me some insight as to how I can extrapolate the benchmarks? Let us assume I have a GPU with 4 SMs and 4 GB global memory on the Pascal architecture, which gives me an average classification time of X milliseconds. Theoretically, can I expect a timing of X/2 milliseconds with a GPU having 8 SMs and 8 GB global memory?
In my experience performance does not vary in a linear fashion. How do I estimate timings while extrapolating and interpolating the GPU specs (number of SMs, global memory, memory bandwidth, etc.)?
Thanks
Michael says
Adarsh, it’s hard to give you any advice, because you didn’t tell us anything about what you’re trying to do exactly: what is your accuracy target (e.g. on ImageNet)? What is your power budget?
People have run VGG and Inception on an Iphone 6s, with 150-300ms latency:
http://machinethink.net/blog/convolutional-neural-networks-on-the-iphone-with-vggnet/
If you have some embedded application in mind, then your best option is the Jetson TX2 dev kit. It should definitely be able to run the latest ImageNet networks under 100 ms latency.
See this nice paper which tested the previous Jetson kit:
https://arxiv.org/abs/1605.07678
It also provides some insight into the relation between the amount of computation and accuracy.
Charles U says
Hi Tim,
Thanks for this great article; I also read your other one on GPU performance. I'm on a budget right now so I'm planning on buying a GTX 1060 6GB, with the intent of upgrading in the future.
In this post, you mention your computer should have at least as much RAM as your GPU. Does that mean it would make more sense for me to buy a computer with 6 GB of RAM to match my card size? I was originally planning to get a computer with 4 GB of RAM. And in the future, if I get a 1080 Ti, will I have to upgrade my computer's RAM to match?
Thanks
Charles
Tim Dettmers says
This requirement is not so strict; I should update my blog post on this. If you have 4 GB of RAM you will be able to work with most datasets if you stream your data, that is, load it in small batches bit by bit. If you do this, 4 GB will suffice even for the GTX 1080 Ti. You might run into some problems if you run very large RNNs, but this can be prevented with some code which initializes the weights directly on the GPU rather than the CPU. You might also run into problems when you preprocess data, but this too can be managed with some extra code. You should be fine with 4 GB.
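Here is a minimal PyTorch sketch of both ideas (the file name, shapes and sizes are hypothetical): samples are read from disk on demand via a memory map instead of loading the whole dataset into host RAM, and a large weight tensor is created directly on the GPU so it never occupies CPU memory.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class StreamingDataset(Dataset):
    """Reads one sample at a time from a (hypothetical) raw float32 file on disk."""
    def __init__(self, path, n_samples):
        # np.memmap only pages in the slices that are accessed, so a 4 GB machine
        # can work with datasets that are much larger than system RAM.
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(n_samples, 3, 224, 224))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return torch.from_numpy(np.array(self.data[i]))  # copy just this sample

loader = DataLoader(StreamingDataset("train.bin", 100_000), batch_size=64)

# Large weights can be allocated directly on the GPU so they never sit in host RAM.
w = torch.empty(4096, 4096, device="cuda")
torch.nn.init.xavier_uniform_(w)
```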
Nikos Tsarmpopoulos says
Hi Tim,
I’m reading through each and every single article of yours, they are to-the-point and very helpful for beginners in neural networks, like myself.
Regarding PCIe 3.0, I've noticed that most consumer-grade motherboards fall back to x8/x8/x16 for three GPUs and x8/x8/x8/x8 for four GPUs. Hence, in a multi-GPU setup, where a different CPU thread handles each GPU, it's not possible to make use of the GPUs' x16-lane capability. Notably, PCIe 3.0 x8 has the same theoretical throughput as PCIe 2.0 x16.
While searching for motherboards with more PCIe lanes, I noticed that some new consumer-targeted motherboards come with a PEX 8747 Broadcom PCIe bridge. That is a 48-lane bridge, which is still insufficient for non-synchronised, concurrent data transfers. Broadcom's top-of-the-line bridge supports 96 lanes (no idea how much this solution costs): 64 lanes could be used for 4 GPUs and 32 additional lanes for the CPU, which means GPUs can communicate with each other using the full PCIe 3.0 x16 bandwidth and up to two GPUs can concurrently transfer data from/to the CPU at full bandwidth.
Have you considered these solutions? Are you aware of motherboards that deliver sufficiently good value for money, e.g. achieving performance that would cost less than alternative solutions, to justify the cost?
Thanks in advance.
Tim Dettmers says
Hi Nikos,
thanks for your comment. I also stumbled upon these switches, but in the end they are probably not so suitable for deep learning. The details are a bit difficult to understand, but let me try to explain: the problem with these solutions is that they still use the underlying PCIe interface and thus are limited just like normal PCIe transfers. In most graphics applications you do not have parallel GPU-to-GPU transfers, but GPU-to-GPU transfers which are slightly offset in time and also small in size. Under such circumstances you can have clever protocols and extra lanes which feed into the usually attached lanes (which have a hard limit of 16 per GPU) in a safe and secure manner without blocking the channels. In other words, with these switches you can send multiple packets asynchronously and securely, but each GPU still receives one packet at a time; in a normal switch each packet must be scheduled after all other packets on that path have completed, or otherwise one has insecure transfers (which can corrupt the data).
The crux is that in deep learning, or generally in computing, you do many parallel transfers at the very same time and usually these packets are large. With this new fancy switch you can start the transfer of the packets asynchronously, but they will still block each other for access to the GPU (because it takes quite some time to send the full packet). This means that instead of blocking before the transfer you now have blocking during the transfer. I am not sure about the performance in this case, but I could imagine that the performance is the same or even worse than with normal switches.
The reasoning behind these switches is that they trade the synchronization of the full PCIe path with the synchronization of sub-paths on the PCIe circuit (the sub-path to the GPU) which increases performance for many applications, especially graphics applications, but probably not for deep learning.
Hope this helps!
Nikos Tsarmpopoulos says
Hi Tim,
Thanks for your response, as always very informative.
I wrote to Broadcom, to ask for additional information regarding their particular PCIE switches and they came back with very interesting feedback.
My understanding from their response is that Broadcom's aforementioned 80- and 96-lane switches should allow for more efficient GPU-to-GPU communication at full x16-lane speed (per pair), compared to the x8 lanes currently supported in triple and quadruple GPU configurations via a 40-lane CPU.
However, they also implied that these [current generation] switches communicate with the CPU via a 16-lane connection, i.e. the CPU cannot establish two 16-lane connections with corresponding GPUs in parallel via the switch. Multicasting data wouldn't be affected.
I'm looking into reconfirming this with Broadcom, as there might be a benefit in using a motherboard with these switches in multi-GPU configs.
Kind regards,
Nikos
Nikos Tsarmpopoulos says
Reading through your response again, I realise that concurrent access to four GPUs from the CPU, for transmission of large amounts of data, via a PCIe switch that features 16 lanes upstream, would result in half the bandwidth of a CPU's 8 lanes to each GPU.
Tim Dettmers says
Indeed, these switches can be very complicated and I am not sure about every detail. If you gain a bit more insight into these it would be great if you could share it here. Thanks!
Nikos Tsarmpopoulos says
Hi Tim,
Following up from my previous message, I have now confirmed with Broadcom that, due to a constraint imposed by the PCIe specification, its PCIe switches feature a total of 16 lanes on the upstream port, i.e. to the CPU.
The downstream ports are non blocking, i.e. when using a PCIe switch of 80 lanes (64 lanes for the GPUs + 16 lanes for the upstream connection to the CPU), pairs of GPUs can talk to each other directly, using 16 lanes per pair.
If we use 4 GPUs, a single GPU can broadcast data to the others at full x16 speed (versus x8 speed if they were attached directly to the CPU’s PCIe lanes).
Also, the CPU can broadcast data to all four GPUs using the full x16 throughput.
Two pairs of GPUs can exchange data at full x16 speed (again, versus x8 speed if they were attached directly to the CPU’s PCIe lanes).
The downside is lower concurrent (non-broadcast) data throughput from the CPU to the GPUs, where 16 lanes will be shared, delivering the equivalent of x4 throughput per GPU (versus x8 speed, if …, as above).
Thus, it really depends on how much unicast data needs to be transferred -concurrently- between the CPU and the GPUs versus between the GPUs.
Depending on what proportion of the time it takes to train a deep neural network is spent on (a) unicasting data concurrently from the CPU to the GPUs, (b) broadcasting data from the CPU to the GPUs, (c) unicasting data between pairs of GPUs and (d) broadcasting data from one GPU to the rest, a multi-GPU setup might benefit from a PCIe switch (also called a PLX).
My understanding is that a motherboard with such a switch costs £200-£300 more than normal motherboards.
What do you think?
Tim Dettmers says
That is a pretty good insight, thanks Nikos!
So one will get improved performance from such a system. However, for bigger systems it is common to pool the data on the CPU to perform more complicated, tree-like broadcasts through the network. I do not think such broadcasts are currently implemented for GPU memory (this feature might have been added since the last time I checked, which was more than a year ago). I think such a motherboard with the 64 GPU lanes would be optimal in a 4 GPU setup. For a multi-node setup it might still be helpful, but probably too expensive to justify the costs, and for big systems reliability is often more important than squeezing out the last bits of performance.
All of this also depends on the type of algorithm that one uses though, but it is good to know that these motherboards can improve performance! Thanks again!
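For readers trying to follow the lane arithmetic above, here is a small sketch that simply converts the discussed link widths into approximate bandwidth figures (PCIe 3.0, roughly 0.985 GB/s per lane; these are theoretical link widths from the discussion, not measurements):

```python
GB_PER_LANE = 0.985  # approximate usable PCIe 3.0 bandwidth per lane, in GB/s

scenarios = {
    "GPU pair, direct on a 40-lane CPU (x8 per GPU)": 8,
    "GPU pair, behind an 80-lane PLX switch (x16 per GPU)": 16,
    "CPU -> 4 GPUs concurrently, direct (x8 each)": 8,
    "CPU -> 4 GPUs concurrently, via switch (x16 shared, ~x4 each)": 4,
}
for name, lanes in scenarios.items():
    print(f"{name}: ~{lanes * GB_PER_LANE:.1f} GB/s")
```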
MacMinus says
Since we are now more than 2 years down the line, and Moore’s law has been doing its thing, I would be curious about an update to this great piece with the current HW (e.g. multi-GTX 1080 Ti’s).
Tim Dettmers says
The general hardware recommendations did not change very much and I think I would make the same recommendations that are listed here. If you are interested in GPU recommendations you can read my other blog post about GPUs.
Petra-Kathi says
Another great thank-you from my side as I (hope I) have gained a lot of insight into the hardware setup of deep learning workstations!
Presently I am in the position of defining a deep learning workstation for internal research purposes in a professional environment (i.e. a full-grown Windows IT environment :-/ with a dedicated server room, requirements to use only certified hardware and such). From the pure deep learning research perspective I would have opted for a system with 3 or 4 GTX 1080 Ti cards, well aware of the problem of parallelization, but at least providing ample computing power for independent parallel jobs to finish in sensible absolute time slices. But the only certified offer we got in the desired power range was a system with 2 P6000 cards.
As this card is quite new and mentioned here only once and without further discussion with respect to potential deep learning issues, let me please ask for hands-on experience regarding library support, half-precision deep learning computational power, and potential further quirks I may not have thought about. I am well aware that the P6000 is sub-optimal cost-wise, but anything not certified is a no-go in this environment. 🙁
Your hints are very appreciated!
Tim Dettmers says
The P6000 is based on the GP102 chip, which is very similar to the GTX 1080 Ti and Titan X Pascal. The features and performance will be similar to those cards, that is, you will have the usual support of all deep learning libraries and good computational power, but almost no half-precision performance. So with that card you will receive a powerful GPU which you can use in your certified environment. If the cost difference between the P6000 and P100 is slim, you might want to opt for the P100, with which you gain a bit of performance and half-precision computation. However, if the difference is larger then just go with the P6000.
Michael says
I’d definitely recommend going with Quadro GP100 or Tesla P100:
https://exxactcorp.com/index.php/product/prod_detail/2048
https://exxactcorp.com/index.php/product/prod_detail/1662
P100 should provide double the performance of P6000 for deep learning, with effectively the same or more of half precision memory (24GB or 32GB).
Talk to Exxact folks, I had a very positive experience with them.
Petra-Kathi says
Tim and Michael,
thanks for your comments! Sadly the cost difference seems to be anything but "slim", so I expect the final system to contain two P6000 cards, at least for starters. The Exxact product line will certainly be compared to local offerings.
samihaq says
Can anyone please look at my almost-final rig and suggest any improvements, or point out any blunder I am about to make, especially any useless spending of money, which I am already very short of. The only aim is to have a solid, reliable rig that can serve 24/7 for a long time for around $2500-2600. Thank you very much. Regards.
https://pcpartpicker.com/list/BqRgHN
Petra-Kathi says
Maybe you should consider one or two additional case fans? IIRC the power supply fan pushes the air out. If you add another out-blowing fan somewhere at the top and an aspirating one at the bottom this might improve heat dissipation in 24/7 operation.
trulia says
Does the CPU have 40 lanes?
What is the maximum number of GPUs it can support?
Tim Dettmers says
A CPU can have between 16 and 40 lanes. Read the specifications of a CPU to see how many lanes that CPU has. Usually you will need at least 8 lanes for a single GPU, but this is dependent on your motherboard. The CPU can provide support for lanes, but they must be there on the motherboard. A CPU can support a maximum of 4 GPUs.
Nikos Tsarmpopoulos says
Would it be possible for a single CPU (assuming enough dual-threaded cores) to handle 8 GPUs via a 96-lane PCIe switch?
Michael says
Sami, FYI: https://www.amazon.com/gp/customer-reviews/R19IMBI0BXETDP/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&ASIN=B00MY3SQ84
sami haq says
@Michael: Oh really? I am tired of searching for reviews and looking things up. Can you please look at my build and suggest any improvements, especially a motherboard for a maximum of $300, or give me some direction? Thank you.
Michael says
Sami, for my last 3 workstations, I didn’t bother building them myself. I sent the desired specs to several system builders, and then negotiated the price down. In the end, I only paid a few hundred bucks more than what it would cost me to do it myself.
For your budget, I would buy a used computer, 2-3 generations old, and get a couple of 1080 Ti cards.
samihaq says
Thank you for the info. Regards
samihaq says
Thank you for the info. EVGA informed me through email that the motherboard has been tested with the Intel Xeon E5-1620 V3 and not the V4. So thanks for informing me about it; I have changed the CPU from V4 to V3, both are almost the same. Here is the reply from EVGA:
“Hello,
Thank you for the email so, unfortunately, EVGA hasn’t tested the newer Xeon CPU like the V4. The only tested is the V3 these are only we have tested that has supported the X99 motherboard with the latest bios update. I apologize for the inconvenience.
Xeon® E5-1680 V3 3.20 GHz 40 1.14
Xeon® E5-1660 V3 3.00 GHz 40 1.14
Xeon® E5-2695 V3 2.30 GHz 40 1.14
Xeon® E5-2697 V3 2.60 GHz 40 1.14
Xeon® E5-2670 V3 2.30 GHz 40 1.14
Xeon® E5-2660 V3 2.60 GHz 40 1.14
Xeon® E5-2687W V3 3.10 GHz 40 1.14
Xeon® E5-2687W V3 3.10 GHz 40 1.14
Xeon® E5-2685W V3 2.60 GHz 40 1.14
Xeon® E5-1650 V3 3.50 GHz 40 1.14
Xeon® E5-2667 V3 3.20 GHz 40 1.14
Xeon® E5-2630L V3 1.80 GHz 40 1.14
Xeon® E5-2609 V3 1.90 GHz 40 1.14
Xeon® E5-2609 V3 1.90 GHz 40 1.14
Xeon® E5-1620 V3 3.50 GHz 40 1.14
Xeon® E5-2643 V3 3.40 GHz 40 1.14
Xeon® E5-1630 V3 3.70 GHz 40 1.14
Xeon® E5-2603 V3 1.60 GHz 40 1.14
Xeon® E5-2620 V3 2.40 GHz 40 1.14
Xeon® E5-2640 V3 2.60 GHz 40 1.14
Xeon® E5-2623 V3 3.00 GHz 40 1.14
Xeon® E5-2637 V3 3.50 GHz 40 1.14
Regards,
EVGA”
sami haq says
Hi, I am into deep learning and currently have a Quadro K5100 GPU with 8 GB of memory in a laptop, with compute capability 3.0. I want to build a solid DL rig which can serve me well for at least 4-5 years with a heavy workload. After reading Tim's fantastic blog, I have selected the following using PCPartPicker. As my budget limit is around $2400-2500 max, I have basically gone for a budget CPU but better GPUs. I would go for one 1080 Ti and a minimum of one or a maximum of two 1070s.
Can anyone please look at my build and suggest any improvements? Also, one thing I am confused about is whether I should go for the
Asus X99-DELUXE II ATX LGA2011-3 Motherboard 394$
or
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard $228.88
Is going for the Deluxe motherboard at almost $180 more justified?
Another thing I am confused about is whether the Founders Edition of the GPUs by EVGA or any other vendor is good enough, or should I go with a customized one with more fans, which of course will cost more?
My build is:
Intel Xeon E5-1620 V4 3.5GHz Quad-Core Processor (40 lanes) 286.99
Cooler Master Hyper 212 EVO 82.9 CFM Sleeve Bearing CPU Cooler $24.88
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard $228.88
Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory $219.99
Western Digital BLACK SERIES 2TB 3.5″ 7200RPM Internal Hard $122.88
EVGA GeForce GTX 1070 8GB SC GAMING ACX 3.0 $374.00
EVGA GeForce GTX 1080 Ti 11GB Founder Edition $700.00
Corsair Air 540 ATX Mid Tower Case $119.98
EVGA SuperNOVA G2 1300W 80+ Gold Certified Fully-Modular ATX Power Supply
$182.03
Asus DRW-24B1ST/BLK/B/AS DVD/CD Writer
Total $2278
Thanks
Tim Dettmers says
Hi Sami,
I do not see why the $180 would be justified; the board adds 1 PCIe slot but gives pretty much the same deep learning performance.
Please note that if you have two GPUs with different chipsets, for example a GTX 1070 and a GTX 1080, you will not be able to parallelize them.
Often the coolers on the GPUs are quite similar in performance, so it should not be a big deal. However, I am not so familiar with the current fan designs and there might be a fan which is superior to others. I probably would pay $20-30 if the fan performance is > 33% better, but not more. At a certain point it is not worth it; better to save that money to buy another GPU in the future.
Hope this helps
sami haq says
Thanks for the useful info. I didn't know about the parallelization issue with GPUs of different chipsets. As these GPUs are not cheap, I was thinking that I would use the 1080 Ti for big networks, while using one or two 1070s for small prototypes, for hyperparameters, or for checking different options in parallel on a relatively small scale. Do you agree with my logic, or should I prefer parallelization (by having the same GPUs, in which case I can afford at most two 1080 Tis) over my current view?
My second question is regarding Asus motherboards; they all have very bad reviews on newegg.com from verified owners due to dead boards, etc. As I am not based in the US and am getting these purchases through a US-based friend, I can't make use of the warranty. Do you have any personal experience with a motherboard you can recommend that can serve me 24/7 for a long time, or can you suggest any particular brand, please?
Once again thanks for you info.
Tim Dettmers says
Ah I understand, using a GTX 1080 Ti and a GTX 1070 makes sense if you use them in that way.
There are in general no reliable motherboard manufacturers, but certain specific motherboard versions are more stable than others. I think orienting yourself by Newegg reviews is a good idea. For example, I would not buy the motherboard that you linked due to its bad reviews. However, I am not up to date on the market situation for motherboards, so you have to find a good motherboard by yourself. Usually using PCPartPicker, selecting the X-SLI option where X is how many GPUs you want to have, then sorting by price and picking the first option with good reviews is a sound strategy to find a good motherboard.
sami haq says
Thank you, Tim. I have selected the
EVGA X99 Classified 151-HE-E999-KR LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.0 Extended ATX Intel Motherboard. It has 5 PCI Express 3.0 x16 slots, 4-way SLI, and great reviews on Amazon as well as Newegg, and it is around $300. I believe EVGA is a good brand and I hope it serves me well. For the GPUs, I will go with the same chipset as you suggested, because in the near future I guess all the libraries like Theano, TensorFlow, PyTorch, etc. will support parallelization.
Thanks
jennifer lewitz says
It is truly a nice and helpful piece of information. I'm glad that you shared this useful information with us.
Please keep us informed like this. Thanks for sharing.
Tim Dettmers says
Thank you, I am happy that you found the blog post useful 🙂
Umair says
Hey Tim
Some of the links here direct to an old WordPress blog. Is that content unavailable now?
Tim Dettmers says
All of my content has been moved to this blog, so you should find it here. I was not aware that there were some old dead links in this blog post. Thank you for making me aware of that. I will clean that up in the next few days.
Nader says
Please help
Do you recommend getting an Alienware laptop with a GTX 1060 for portability, together with an Alienware Graphics Amplifier holding a GTX 1080 Ti to use as a station?
Please help
Chris says
Hi,
someone else above had a similar but not exact the same question, hence I would like to ask for your opinion as well 🙂
I understand that it would be optimal to have a CPU with enough native PCIe lanes to connect to every GPU with 16 lanes. Given that I would like to build a system with no more than two GPUs, I would need 32 lanes for the GPUs to avoid PCIe bottlenecks. Currently that points to socket 2011-3 CPUs (Broadwell) with 40 lanes.
If I were, for reasons of cost, to use a socket 1151 (Kaby Lake) setup with a 16-lane CPU but with a mainboard offering a PLX switch that provides 2 PCIe x16 slots, one question arises: do the GPUs need the whole PCIe bandwidth permanently, forcing the PLX switch to permanently split the x16 bandwidth into x8/x8, or is it more likely that the GPUs transmit in an interleaved manner with the full x16 bandwidth available to whichever one is currently transmitting? My guess is that the truth is something in between, but I have no exact numbers or benchmarks. Do you have any experience regarding actual bandwidth loss, or suggestions here? Is it beneficial to use a PLX switch in 16-lane CPU, dual GPU configurations, or should I definitely go for a 40-lane CPU?
Cheers,
Chris
Tim Dettmers says
Hi Chris,
there are some motherboards which support x16 speed when no transfers to the other GPU are being executed, but this is rare. In general you will have x8/x8 speed. Check your motherboard specs for this.
I would not worry too much about PCIe lanes. If you want to parallelize GPUs there will be a performance hit, but you will still get good speedups. If you use good parallelization algorithms, like those provided by Microsoft's CNTK, then you will have no performance hit. If you use the GPUs separately you will see almost no performance hit. So I would just go ahead with that setup. It will probably give you the best bang for the buck.
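If you want to verify what link width each GPU actually negotiated on a given board (x8 vs x16), a quick check from Python could look like the sketch below; it assumes nvidia-smi is installed and simply shells out to its standard PCIe query fields:

```python
import subprocess

# Print the currently negotiated PCIe generation and link width for each GPU.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```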
Dhaval says
Should I get the Zotac Nvidia GT 730, since I don't have much money and can spend a maximum of 5000 INR? Any suggestions, sir?
Tim Dettmers says
The GT 730 variant with GDDR5 memory is a good choice in that price range. The DDR3 variant will be much slower, so pay attention to the memory type. The memory is just 1 GB in the GDDR5 variant, but if you use 16-bit networks you can do some experiments with this. If you need to train larger networks then the DDR3 variant with larger memory (up to 4 GB) will be a good choice too. You will have to wait a bit longer for experiments, but you will be able to run most models if you use 16-bit, and you will get a speedup over using the CPU.
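As an illustration of what running a network in 16-bit looks like, here is a minimal PyTorch sketch. Note this is an assumption-laden example: very old cards (the GT 730 included) may not support half-precision arithmetic properly, so treat it as showing the memory-saving idea rather than something guaranteed to run on that specific card.

```python
import torch
import torch.nn as nn

# Storing the model and activations in float16 roughly halves memory use.
# Plain .half() training can be numerically unstable; for small experiments it is fine.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).cuda().half()
x = torch.randn(64, 784, device="cuda", dtype=torch.float16)
logits = model(x)
print(logits.dtype, logits.shape)   # torch.float16, (64, 10)
```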
James says
Yes, it looks like the Titan X and the new GTX 1080 Ti have basically the same specs, but the 1080 Ti is almost half the price:
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series
I'd nearly ordered a Titan X only to find them now out of stock at most retailers.
Is there something fundamentally different about the 1080 Ti versus the Titan where deep learning is concerned? Otherwise it looks like you could build a devbox clone for a decent price.
Tim Dettmers says
Definitely go for the GTX 1080 Ti. The 1 GB memory difference is not significant for most use-cases.
ElXDi says
Hi Tim and all Deep Learning guys!
I have an i7-4790K CPU with 32 GB of RAM, which should be fine for the beginning.
I'm planning to buy a new GPU. I have a few options. The 1060 is a best seller and the best choice for price/performance, I guess. BTW, here is a list of non-reference design PCBs: http://thepcenthusiast.com/geforce-gtx-1060-compared-asus-evga-msi-gigabtye-zotac/
1. GTX 1060 3GB reference design 200e
+ really cheap
– poor overclocking results
– just 3GB of VRAM
2. GTX 1060 3GB non reference design (the PCB is based on GTX 1080 with better power feed and 8pin connector) 250e.
+ performance boost +5% 🙂
+ ability to overclock with a volt mod
– price
– just 3GB of VRAM
3. GTX 1060 6GB ref design 260e
+ more CUDA cores
+ more VRAM
4. GTX 1060 6GB non reference design (the PCB is based on GTX 1080 with better power feed and 8pin connector. 280e – 300e
+ more CUDA cores
+ more VRAM
+ really good overclocking ability (+15%)
– quite expensive
– price / performance index is not so good any more
5. GTX 980 4GB used with 1 year warranty 250e
+ more CUDA
+ 5% more performance than 1060 6GB version
– used
– less VRAM
– more power consumption
So what do you think about the above options? What is more important: more VRAM, the number of CUDA cores, GPU clock speed, or VRAM bandwidth?
Thank you for sharing your experience!
Tim Dettmers says
If you have missed it you might want to check out my other blog post about GPU selection: GPU advice. To reiterate the points:
– Bandwidth is the thing that you want to have the most of
– The best GPU in terms of cost/performance is the GTX 1070 (and soon also the GTX 1080 Ti)
– GPU memory size is important, but for many tasks 8GB is fine. If you want to do computer vision research get a 12GB GPU
To answer other questions: CUDA core number and clock speed are not that important. Overclocking will give you almost no performance increase for deep learning.
Hope that helps!
ElXDi says
Thank you very much for your answer. Your answer really helps me.
As far as I understand, the 3GB model is really not worth it. So the 1060 6GB is fine for a beginning, the 1070 8GB is the minimum for any real project, and the Titan X 12GB is required for something serious.
Cheers!
om says
The Titan X and GTX 1080 Ti have only a 1 GB difference in memory, but there is a big difference in price. Does anyone know why?
Does anyone know why?
http://www.eurogamer.net/articles/digitalfoundry-2017-gtx-1080-ti-finally-revealed
Ashley says
@Michael – bummer it’s done.
Ashley says
I found this: https://annarbor.craigslist.org/sys/6031393982.html but I am getting a much better rig at half the price.
Michael says
@Ashley: I’d probably just get this one (after getting the price down to $300, or $350, tops):
https://annarbor.craigslist.org/sys/6031436427.html
The advantage is it’s already got 1050 card in it, so you can start doing DL right away. Later, if you realize you need more power, you can buy 1080 Ti, and will still be within your $1k budget.
Ashley says
Thanks for the advice.
Ashley says
Hi,
Complete newbie build here, so I need advice on all things computer. I have been using a laptop until now and I would like to build a reasonably priced PC that can run CNNs. If needed I can tunnel in from anywhere to work with it, etc.
This is what I have; I would really appreciate any comments. Have I missed anything?
Intel Core i5-6500 3.2GHz Quad-Core Processor
Corsair H60 54.0 CFM Liquid CPU Cooler
MSI B150 PC Mate ATX LGA1151 Motherboard
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive
Asus GeForce GTX 1060 6GB 6GB Turbo Video Card
Phanteks ECLIPSE P400S TEMPERED GLASS ATX Mid Tower Case
Corsair CXM 550W 80+ Bronze Certified Semi-Modular ATX Power Supply
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network Adapter
Gigabyte GC-WB867D-I PCI-Express x1 802.11a/b/g/n/ac Wi-Fi Adapter
Thank you!
Tim Dettmers says
Looks like a solid build which offers some opportunities for upgrades in the future.
If I were doing more data science I would probably go with a cheap or used DDR3 CPU/RAM combo and buy more RAM (32-64 GB); possibly I would swap the GTX 1060 for a GTX 1070 if I had spare money left from switching from DDR4 to DDR3. If I were doing more deep learning I would also go for a DDR3 CPU/RAM combo, possibly buy used hardware, and then buy a GTX 1080 Ti.
This does not mean that your build is bad. Your build is more future-proof; my build would be more "I-want-to-do-things-now". I guess this depends on taste, but be aware of what you want when you buy hardware. Do you want to buy for data science, deep learning, machine learning, Kaggle competitions, or future-proofing? Your build buys a little of all of that and a lot of future-proofing, which can be a very sensible choice.
Ashley says
Hi,
You are awesome – thank you for the quick reply (because I have to get the laptop I am working with back asap)
– I want it for deep learning & machine learning primarily, either at the workstation or through a laptop that I can tunnel in with when needing a change of environment.
– I need it to last because I may not have another chance to buy anytime soon.
– In case this matters? I will be using Linux, probably Ubuntu flavour. It was challenging installing on ROG – had to use rpm for some reason.
If you don’t mind:
Where do I get used hardware from?
I can't find the GTX 1080 Ti on PCPartPicker; it looks like it's coming out this week? Would you know the best place to get it?
Ashley says
Oh, and should I look at an SSD for the base installation? If so, will I get away with one that is, say, 125GB?
Ashley says
Final build – for now:
Intel Core i5-6500 3.2GHz Quad-Core Processor 198.68 (w shipping)
Corsair H60 54.0 CFM Liquid CPU Cooler 59.99
MSI B150 PC Mate ATX LGA1151 Motherboard 84.78
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory 68.99 (I tend to pull out memory and use it for other builds so kept the newer version)
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive 49.99
Gigabyte GeForce GTX 1070 8GB Windforce OC Video Card 369.99
Rosewill GRAM ATX Mid Tower Case 49.99
Rosewill 600W 80+ Bronze Certified Semi-Modular ATX Power Supply (had to up for the new GPU) 59.99
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network Adapter 9.22
D-Link DWA-552 PCI 802.11g/n Wi-Fi Adapter 9.95
Logitech K120 Wired Standard Keyboard 9.00
I am going to use a 32″ TV – hope that doesn’t kill my eyes..
And have a little mouse and speaker.
Total cost $970.57 (using used where possible)
Michael says
Ashley, no, this is not how I’d spend a thousand bucks if I needed a cheap machine for DL. Instead of getting all these parts individually, I’d shop for a decent used desktop, then buy a good video card separately. For example, something like this: https://santabarbara.craigslist.org/sys/5992606383.html
Then you will have enough money left for a GTX 1080 and more. The truth is, CPU performance hasn't improved that much in the last 5 years, so for deep learning an old CPU + 1080 will be faster than a new CPU + 1070.
Also, you should get an SSD. Again, an old CPU + SSD will be faster than a new CPU + hard drive.
p.s. and you definitely don’t need a liquid cooler (nor any overclocking).
Nader says
Is the new Ryzen 1800X compatible with the GTX 1080 Ti?
Will using Theano as the backend work?
Thank you
Andrew says
Ryzen will work perfectly fine with a 1080 Ti. However, depending on your workload, Ryzen may not be the best option.
The pros and cons of the Ryzen 1800X are:
Pro: Ryzen has ECC RAM support, which is great for mission-critical situations where data cannot risk being corrupted at any cost. However, if you are mainly doing deep learning then ECC RAM is not really necessary at all, as most deep learning algorithms and AI training can be done in 16-bit or even 8-bit precision (which is something the Titan X Pascal actually excels at, which is why something like a Quadro or Tesla isn't necessary either in most cases).
Pro: Ryzen has 8 cores, which are beneficial if you plan to work with highly multi-threaded programs for video editing, 3D rendering, etc., although in many of these cases you are better off using GPU acceleration instead of relying on a CPU, since CUDA acceleration on an Nvidia GPU (especially a Titan X) will be far faster than any CPU. And again, if you are mostly doing deep learning, or maybe some PC gaming on the side, and aren't running programs that need all those extra cores (deep learning only needs 4 cores even for four-way SLI in most cases, as shown in this article), then the extra cores of Ryzen are frankly redundant.
Con: Ryzen is limited to dual-channel RAM. This pretty much cuts your memory bandwidth in half, which can affect intensive deep learning work somewhat. It also only supports RAM speeds up to 2666-2900 MHz in many cases, which isn't really a big deal for deep learning but will affect memory-intensive workstation/professional tasks. It also has a RAM capacity limit of 64 GB, compared to the Intel X99 chipset used with CPUs like the i7 6800K, which allows for 128 GB of quad-channel RAM clocked at up to 3600 MHz. It's up to your situation whether you consider that a problem or not.
Con: Ryzen has no overclocking headroom to speak of. Nobody has really been able to get any Ryzen chip over 4.1 GHz, with many even being stuck at 3.9 or 4.0 GHz (which in the case of the 1800X is literally no overclock at all, since the 1800X runs at 4 GHz out of the box). So if you are using programs that need clock speed, then a faster chip would be beneficial.
So overall, unless you really have a specific need for an 8-core chip, I would say for a deep learning PC, even if you do things like normal web browsing, heavy PC gaming, or video streaming/encoding, you might be better off getting something like an i7 6800K (which has 6 cores and 12 threads but can hit 4.4 GHz in some cases, so overall is a bit better), which is $100 cheaper than the R7 1800X; or perhaps the i7 7700K, which is only $329 ($170 cheaper than the R7 1800X) and can easily overclock to 5 GHz with proper cooling (many people have hit 5.2 GHz even with just high-quality Noctua air coolers or AIO water coolers). The only reason I would specifically get Ryzen is if you really need an 8-core chip for specific programs, as deep learning and most general use don't require more than 4 cores.
Michael says
Keep in mind that 6800K has only 28 PCIe lanes (Ryzen and 7700k are even worse), so if you’re planning to use multiple GPUs (now or in the future), go with E5-1650 v4 (or E5-1620 v4 if you’re on a budget). Also, Skylake Xeons are about to be released (this month), so if you can, wait for them (mainly for AVX512 support).
tom says
Hi Michael,
I would like to use 4 GTX 1080 Ti
Which is the best and cheapest processor and motherboard combination for that?
Andrew says
Ah yes. Good catch.
Have you really noticed a difference between running a GPU at PCIe 3.0 x8 and x16 for deep learning, though? In most other situations I've seen, x8 PCIe 3.0 isn't hindering much at all, if anything; you sometimes see a 0.5% or maybe 1% performance delta between the two, but that's typically it.
Michael says
I haven’t seen this tested anywhere, but I’m guessing it’s important for large networks running on fast GPUs, when it takes longer to move gradients from GPU to GPU than to calculate them.
Tim Dettmers says
I am always happy to answer comments, but it gives me even more joy to see that people answer each other's questions. Thanks Michael and Andrew!
Tim Dettmers says
Yes, the AMD Ryzen CPU series will be compatible with your NVIDIA cards. In general, all modern CPUs should support NVIDIA cards. This is so because the CPU and the NVIDIA GPU communicate over a protocol (PCIe) that is also used for printers, network interfaces and so forth, and no CPU manufacturer can afford not to support it. Thus all CPUs should have support for NVIDIA GPUs (at least those which come as PCIe cards, which are all GPUs except the ones with NVLink, that is, currently the NVIDIA P100).
s12 says
Hi Tim,
I have been looking into using NVLink to couple two Titan X Pascals. I was hoping to do this in an SLI-like fashion (as shown here: http://www.kitguru.net/components/graphic-cards/anton-shilov/nvidia-pascal-architectures-nvlink-to-enable-8-way-multi-gpu-capability/ ), rather than buying a purpose-built motherboard. Unless I'm mistaken, this isn't currently possible; do you know if NVIDIA has any plans to implement this in the future?
Thank you very much for this article and all of your helpful comments.
Tim Dettmers says
If you are interested in parallelism I recommend looking into Microsoft’s CNTK library. Their parallelisation algorithms, especially 1-bit quantization and block momentum, are so good that you get linear speedups without having NVLink. Granted, the software is a bit difficult to use but is maturing quickly and you could save a lot of money by going without NVLink. I am currently not aware of any affordable NVLink hardware which is used outside of supercomputing. You might get your hands on one of those machines, but it will be expensive. So in the end CNTK might be the only way to go which is practical. This may be disappointing, but I hope it helps!
Ervin says
Hello Tim, and thank you for your post. I currently have a desktop with a Core 2 Quad Q9300. I was wondering whether it would bottleneck a GTX 1060 6GB for beginner- to mid-level DL problems?
Tim Dettmers says
It is an old CPU, but you should be relatively fine. You can expect to run about 10-20% slower than with a high-end CPU. Processing non-deep-learning code, that is, preprocessing data, will probably take quite a bit longer, but running the deep learning model should not be much slower.
Ervin says
Thank you for your reply. I also have an old motherboard, a GA-P43-ES3G (http://www.gigabyte.com/Motherboard/GA-P43-ES3G-rev-10#sp), which only supports PCI Express 2.0. I believe that will be a major bottleneck, right?
om says
Hi Tim,
I have the latest Mac and I want to use a GPU (GTX Titan X) with https://www.akitio.com/expansion/node
the AKiTiO Node eGPU box (Thunderbolt 3).
My question is, can I use TensorFlow with this external GPU device without killing performance and efficiency? What could be the side effects?
I know using the Titan X with a desktop will be a lot better, but I need mobility.
Thanks and you rock 🙂
om
JP says
On the akitio specs it says that Mac is not supported.
om says
Just wanted to add AKiTiO's reply:
Re: tensorflow
Message: Hi Trulia, We currently do not have an eGPU solution for the Mac and the only eGPU solution we have is for select Thunderbolt 3 PCs, so the answer to your question is No. Having said that, it might be possible that you could make it work but this would be more of a DIY project that requires hardware and software modification. Also, it would void the warranty, so this is not something that I can recommend. Regards, Stefan
Usher says
Thanks Tim for your comment!
Usher says
Hi Tim,
Could you comment on below build?
– Chassis: Corsair Carbide Air 540
– Motherboard: Asus ROG STRIX X99 GAMING ATX LGA2011-3 Motherboard
– Cpu: Intel Core i7 6800k
– Ram: 32GB DDR4 G.Skill 2400Mhz
– Gpu: 1 ASUS GTX 1080
– HD1: 500GB SSD Samsung EVO
– HD2: 1TB WD Red in RAID 5
I am not sure if the board is a good choice if I might add a second GPU in the future. Or maybe the ASUS X99-Deluxe II is worth the extra cost?
Tim Dettmers says
Hi Usher,
I do not have time to check the details, but it seems that the motherboard is okay. The reviews on Newegg are not that good, but the cost/performance might still be fine. Adding a second GPU will definitely be no problem with the motherboard that you chose.
Otherwise the build looks okay. I recommend checking the build with pcpartpicker, which often finds compatibility issues if there are any.
Pavel says
Hi Tim,
Very good article! Thank you!
P.S. You have cool working place.
Nader says
So a single Titan X Pascal trumps dual GTX 1080s in SLI?
Correct?
Michael says
Tim, thanks for the great article! I have a couple of questions:
1. What is “4” in your mini-batch size calculation (4x128x244x244x3)?
2. I’m deciding on which SSD to buy for my machine with four Pascal Titan X cards, mostly to do training on Imagenet. Assuming your bandwidth estimate of 290MBps is for a single card, should I multiply it by four when running a model on all four cards? Do you know how fast Pascal Titan X processes a single 128 mini-batch? Also, if I use mini-batch of 256 , I would need double the bandwidth, right?
Given the above considerations, would you recommend going with a PCIe based SSD, such as Samsung 960 Pro, rather than SATA based one, such as Samsung 850 Evo?
Tim Dettmers says
1. The data used in deep learning is usually 32-bit or 4 bytes; this is the 4 in the calculation above (conversion into bytes).
2. This is a bit complicated. Parallelism does not scale linearly, so you should multiply the estimate by 3.5 or so (for TensorFlow this will be closer to 2.5-3). One thing to keep in mind is that in practice small data transfers are often slower (the overhead is large when the data size is small) and that GPUs operate more efficiently on larger batch sizes.
I am unfamiliar with the exact internals of the TensorFlow batching procedure. If they do it right for both data loading and data transfer, a PCIe SSD would give a bit of improved performance. However, from some benchmarks it seems that TensorFlow is sub-optimal in some parts (GPU transfers, I think). If this is really so for GPU transfers, then a PCIe SSD and a normal SSD would give the same performance. I personally would just go for a cheap normal SSD.
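For reference, here is the arithmetic behind the 4x128x244x244x3 figure from the question above, plus a hedged back-of-the-envelope estimate of the SSD bandwidth it implies (the batches-per-second figure is assumed, not measured):

```python
# Bytes per mini-batch: 4 bytes per 32-bit value x batch x height x width x channels.
batch_bytes = 4 * 128 * 244 * 244 * 3
print(batch_bytes / 1e6)              # ~91 MB per mini-batch

# If one GPU consumes, say, 3 such batches per second, it needs roughly
# 3 * 91 MB/s ~ 275 MB/s of input bandwidth; four GPUs would need ~4x that,
# which is where a PCIe/NVMe SSD starts to pay off over SATA (~550 MB/s cap).
batches_per_second = 3                # assumed, depends on model and GPU
print(batches_per_second * batch_bytes / 1e6, "MB/s per GPU")
```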
Michael says
Thanks Tim. A different question: which software framework would you use for experimenting with Imagenet?
So far I’ve been using Theano, but only on small datasets (MNIST and CIFAR). My main interest is to test different quantization methods for weights and activations, and see how it works for different network architectures. I’ve read your paper, by the way, very interesting, but I prefer not to code everything from scratch in C/CUDA if possible. Right now I’m looking into implementation of the asynchronous batch allocation, like you suggested, in Theano, and it’s not very straightforward.
Would you recommend switching to TensorFlow, or sticking with Theano? I’m less concerned with the ability to parallelize code across multiple GPU, because I can just run different experiments in parallel.
Tim Dettmers says
TensorFlow is a good call. If you want to work on vision only, Caffe is also an excellent option. However, overall PyTorch or Torch might be more suitable for you. PyTorch already implements asynchronous batching by default and Torch already has the 1-bit quantization method. I am currently not sure how that is integrated into PyTorch, but since Torch and PyTorch are only wrappers for Lua and Python, respectively, interfacing with 1-bit quantization should be relatively straightforward. If you want to implement other methods of quantization, then Torch and to some degree PyTorch offer good interfacing and easy extension. However, the algorithms would need to be written in C/CUDA. Extending TensorFlow in this way might not be as straightforward, so you might run into difficulties either way. TensorFlow is of course still more popular, and thus if you extend it, it will have more value for other people. So it is not an easy decision, but maybe these points make the decision a bit easier.
Nader says
Hi,
What do you think of the following build ?
https://pcpartpicker.com/list/8sv2jc
Thank you
Pawel says
Hi Tim,
I’ve just upgraded from GTX 960 to GTX 1070.
I used to run the TensorFlow cifar10_multi_gpu_train.py file to check the speed from one release to another. With the latest TensorFlow release it peaks at about 1500 images/sec with the GTX 960 (which is impressive progress, by the way; with the initial releases it was more like 750 images/sec).
I was surprised to see that my GTX 1070 peaks at ~1700 images/sec, a very small improvement. It looks like the CPU is now the bottleneck (I see it constantly at 300% usage, i.e. 3 full cores). I have an i5-3570K which should be decent.
I haven't analysed it further (yet), but could somebody share their experience with this? I wasn't expecting the CPU to be the bottleneck here.
Tim Dettmers says
Are you training on multiple GPUs (cifar10_multi_gpu_train.py)? If so, then this is your answer. TensorFlow has terrible performance for multiple GPUs and upgrading multiple GPUs will not yield much better performance for TensorFlow.
Piotr Czapla says
Paweł,
TensorFlow does not use GPU-to-GPU transfers when updating weights. It downloads the whole model to RAM and makes the updates on the CPU. At least this is the understanding I got from reading: https://arxiv.org/abs/1608.07249
Nader says
Should I buy a GTX 1080 now, or wait for the Ti which is supposedly coming out next month?
Tim Dettmers says
The GTX 1080 Ti will be better in every way. Make sure, however, to preorder it or something; otherwise all cards might be bought up quickly and you will have to go back to the GTX 1080. Another strategy might be to wait a bit longer for the GTX 1080 Ti to arrive and then buy a cheap GTX 1080 from eBay. I think these two choices make sense if you can wait for a month or two.
Nader says
Thank you for your reply.
I appreciate it.
Andrew says
I would personally recommend a couple of small changes. Here's what I would go with: https://pcpartpicker.com/list/bx2ssJ
First off, if you are going to spend $85 on a 256GB regular SATA-based SSD for storage, then you might as well get the top-of-the-line M.2 960 Evo for $120. It's over 3 times faster than the one you picked in transfer speeds, and is overall much better. (Alternatively, if you don't care about the extra speed, you can get a 500GB SATA drive for about the same price, getting double the storage.)
The second thing I would change is to get a Z270 motherboard rather than a Z170. It's been a month or so since you commented, so I'm not sure if you bought yours yet, but the new Z270 motherboards support more PCIe lanes, support 4K encoding on 7000-series CPUs, etc., so they're worth looking at, especially since they're basically the same price. My link swapped in a Gigabyte Z270 Gaming K3 for your Gigabyte Z170 Gaming M3. Very similar boards.
Lastly, you should also get the i5 7600K instead of the i5 6600K, since Kaby Lake 7000-series processors are about 5-10% faster than Skylake 6000-series processors, and the 7600K can be overclocked to over 5 GHz no problem, compared to the 6600K, which in some cases has trouble getting over ~4.7 GHz on air cooling. And since the 7600K is about the same price, you might as well get it. Personally, though, I would still recommend an i7 over an i5 in this situation, simply because simultaneous multi-threading is becoming fairly important as of late, and the extra 2MB of L3 cache is also nice to have. I figure if you are spending $1200 on a TITAN X Pascal, you should be able to fit in $100 more for an i7 7700K, which can also be overclocked to 5 GHz pretty easily in most cases (even on air!).
Table Salt says
Hi Tim, thanks for the excellent posts, and keep up the good work.
I am just beginning to experiment with deep learning and I’m interested in generative models like RNNs (probably models like LSTMs, I think). I can’t spend more than $2k (maybe up to $2.3k), so I think I will have to go with a 16-lane CPU. Then I have a choice of either a single Titan X Pascal or two 1080s. (Alternatively, I could buy a 40-lane CPU, preserving upgradability, but then I could only buy a single 1080). Do you have any advice specific to RNNs in this situation? Is model parallelism a viable option for RNNs in general and LSTMs in particular?
Thank you!
Tim Dettmers says
I think you can apply 75% of state-of-the-art LSTM models to different tasks with a GTX 1080; for the other 25% you can often create a “smarter” architecture which uses less memory and achieves comparable results. So I think you should go for 16 lanes and two GTX 1080s. Make sure your CPU supports two GPUs in an 8x/8x setting.
Om says
Hi Tim,
You are such an amazing person. So patient and knowledgeable.
I am also in the same boat of deep learning and willing to learn.
I bought this computer:
http://www.costco.com/CyberpowerPC-SLC2400C-Desktop—Intel-Core-i7—8GB-NVIDIA-GeForce-GTX-1080-Graphics—Windows-10-Professional.product.100296640.html
CyberpowerPC SLC2400C Desktop – Intel Core i7 – 8GB NVIDIA GeForce GTX 1080 Graphics – Windows 10 Professional
This is a gaming PC but I don’t play games.
My question is: can I use a “Titan X Pascal” from NVIDIA along with the GeForce GTX 1080 for more computation power?
I learned SLI is not the solution, and anyway both are different GPUs.
So, in order to achieve faster results, can I combine both GPUs in TensorFlow?
I am using TensorFlow –
I just found this – (Basic Multi GPU Computation in TensorFlow)
https://tensorhub.com/donnemartin/4_multi_gpu
I need to install a VM with Ubuntu 16 for all this setup.
Thanks
Tim Dettmers says
Hi Om,
I am really glad that you found the resources of my website useful — thank you for your kind words!
The thing with the NVIDIA Titan X (Pascal) and the GTX 1080 is that they use different chips, which cannot work together on a single model in parallel. So you would be unable to parallelize one model across these two GPUs. However, you would be able to run a different model on each GPU, or you could get another GTX 1080 and parallelize across those two GPUs.
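For illustration, here is a rough sketch in the TensorFlow 1.x style of placing independent computations on each card (the matrix ops are just placeholders for two separate models):

```python
import tensorflow as tf

# Place one independent computation per GPU (TensorFlow 1.x device placement).
with tf.device('/gpu:0'):   # e.g. the GTX 1080
    a = tf.random_normal([1000, 1000])
    prod_a = tf.matmul(a, a)
with tf.device('/gpu:1'):   # e.g. the Titan X (Pascal)
    b = tf.random_normal([1000, 1000])
    prod_b = tf.matmul(b, b)

# allow_soft_placement lets TF fall back gracefully if a device is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run([prod_a, prod_b])
```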
Note that using an Ubuntu VM can cause some problems with GPU support. The last time I checked, it was hardly possible to get GPU acceleration running through a VM, but things might have changed since then. So I urge you to check whether this is possible before you go down this route.
Best,
Tim
Gordon says
Thank you very much for writing this! – knowing something about how to evaluate the hardware is something I have been struggling to get my head around.
I have been playing with TensorFlow on the CPU on a pretty nice laptop (fast i7 with lots of RAM and an SSD but ultimately dual core so slow as hell).
I want to try something on the GPU to see if it is really 100s of times faster, but I am worried about investing too much too soon, as I have not had a desktop in ages. Having read this post and the comments, I have the following plan:
Use an existing FreeNAS server I have as a test bed and buy a relatively low-end GPU – a GTX 960 4096MB:
https://www.overclockers.co.uk/msi-geforce-gtx-960-4096mb-gddr5-pci-express-graphics-card-gtx-960-4gd5t-oc-gx-319-ms.html
The FreeNAS box has a crappy 3.2GHz dual-core Celeron and only 8GB of RAM:
http://ark.intel.com/products/53418/Intel-Celeron-Processor-G550-2M-Cache-2_60-GHz
I will buy the graphics card and an SSD to install an alternative OS on. I *may* upgrade the RAM and processor too, as all of these items will benefit the FreeNAS box anyway (I also run Plex on it).
If this goes well and I develop further, I will look at a whole new setup later with an appropriate motherboard, CPU, etc., but in the meantime I can learn how to identify where my specific bottlenecks are likely to be.
From what you have said here, I think there will be several slow parts to my system, but I am probably going to get 80-90% of the speed of the graphics card, the main restriction being that the CPU only supports PCIe 2.0 – as everything else, while not ideal or scalable for that GPU, can probably feed it fast enough.
I have 2 questions (if you have time – sorry for the long comment, but I wanted to make my situation clear):
1. Do you see anything drastically wrong with this approach? No guarantees obviously; I could spend more money now if I am just shooting myself in the foot, but I would rather save it for the next system once I am fully committed and have more experience.
2. I chose the GPU based on RAM, number of CUDA cores and NVIDIA's compute capability rating (which reminds me of the Windows performance rating 😀 – a bit vague but better than nothing). The other one I was considering was this, £13 more, so also a fine price IMHO:
https://www.overclockers.co.uk/palit-geforce-gtx-1050ti-stormx-4096mb-pci-express-gddr5-graphics-card-gx-03t-pl.html
which has fewer cores (768 vs 1024) but a smaller manufacturing process, a higher clock speed (1290MHz vs 1178MHz), and I *think* a higher rating, assuming that the Ti is just better (seems to mean unlocked): 6.1 vs 5.2:
https://developer.nvidia.com/cuda-gpus#collapse4
Basically, is the drop in cores really made up for to such an extent that this significantly higher rating from NVIDIA is accurate? Noting that I am probably going to be happy enough either way – feel free to just say “either is probably fine” 😀
Alternatively, is there something else in the sub-£150-ish range that you would suggest, given that the whole thing may be replaced by a Titan X or similar (hopefully cheaper after Christmas 😉 ) if this goes well? I did consider just getting something like this: much less RAM, but still more cores than 2, and it lets me figure out how to get code running on the GPU:
https://www.overclockers.co.uk/asus-geforce-gt-710-silent-1024mb-gddr3-pci-express-graphics-card-gx-396-as.html
Gordon says
Got the 1050 Ti (well, another variation of it); I figured they would be similar regardless, so I might as well trust NVIDIA's rating.
https://www.amazon.co.uk/gp/product/B01M66IJ55/
Also got 32 GB of RAM and a quad-core i5 that supports PCIe 3.0, as they were all cheap on eBay (SSD too, of course).
Looks like I can mount my ZFS pool in Ubuntu, so I will probably just take FreeNAS offline for a while and use this as a file and Plex server too (very few users anyway), and this way my RAID array will be local should I want to use it.
Tim Dettmers says
That sounds solid. With that you should easily get started with deep learning. The setup sounds good if you want to try out some deep learning on Kaggle.com for example.
Tim Dettmers says
Upgrading the system bit by bit may make sense. Note that CPU and RAM will make no difference to deep learning performance, but might be interesting for other applications. If you only use one GPU a PCIe 2.0 will be fine and will not hurt performance. The GTX 960 and GTX 1050Ti are on a par in terms of performance. So pick what is most convenient / cheaper for you.
Mor says
Hi Tim,
I am looking to buy full hardware for deep learning;
my budget is about $15,000.
I don’t have any experience in this, and when I tried to check things out it was too complicated for me to understand.
Can you help me? Maybe recommend companies, or anything else that suits my budget and is still good enough to work with?
Thanks a lot
Tim Dettmers says
If I were you I would put together a PC on pcpartpicker.com with 4 GPUs and then build it together by myself. This is the cheapest option. If that is too difficult, then I would look for companies that sell deep learning desktops. They basically sell the same hardware, but at a higher price.
JP Colomer says
Hi Tim,
Thank you for this excellent guide.
I was wondering, now that the new 1000 series and Titan X came out, what are your updated suggestions for GPUs (no money, best performance, etc)?
Tim Dettmers says
Please, see my GPU blog post for these updates.
JP Colomer says
Thank you, Tim. I ended up buying a GTX 1070.
Now, I have to purchase the MOBO. I’m deciding between a GIGABYTE GA-X99P-SLI and a Supermicro C7X99-OCE-F.
Both support 4 GPUs but it seems that there is not enough space for a 4th GPU on the Supermicro. Any experience with these MOBOs?
This is my draft https://pcpartpicker.com/list/6tq8bj
Tim Dettmers says
Indeed, the Supermicro motherboard will not be able to hold a 4th GPU. I also have a Gigabyte motherboard (although a different one) and it worked well with 4 GPUs (while I had problems with an ASUS one), but I think in general most motherboards will work just fine. So seems like a good choice.
Shahid says
I am confused between two options:
1) A 2nd Generation core i5, 8GB DDR3 RAM and a GTX 960 for $350.
2) A 6th Generation core i3, 16GB DDR3 RAM and a GTX 750Ti for $480.
Can you please comment? I expect to upgrade my GPU after a few months.
Tim Dettmers says
A difficult choice. If you upgrade your GPU in a few months, then it depends whether you use your desktop only for deep learning or also for other tasks. If you use your machine regularly, I would spend the extra money and go for option (2). If you want to do almost exclusively deep learning with the machine, (1) is a good, cheap choice. The choice also depends on whether you buy the 2GB or 4GB variant of each GPU. In terms of speed, (1) will be about 33-50% faster, but speed is not too important when you start out with deep learning, especially if you upgrade the GPU eventually.
Shahid says
Thank you Tim, you really inspire me! Actually I took the Udacity SDCND course, and here is the list of a few projects I want to accomplish on a local machine:
1. Road Lane-Finding Using Cameras (OpenCV)
2. Traffic Sign Classification (Deep Learning)
3. Behavioral Cloning
4. Advanced Lane-Finding (OpenCV)
5. Vehicle Tracking Project (Machine Learning and Vision)
So, my work is solely related to computer vision and deep learning. I also have the option of a GTX 1060 6GB with that Core i3 (2). Of course, I expect to code the GPU versions of the OpenCV tasks. Do you think this 3rd option would be sufficient to accomplish these projects in an average amount of time? Thank you again.
Gautam Sharma says
Hi Shahid. I'm in the same boat as you. I have also signed up for the SDCND. I have an old PC with a Core i3 and 2GB of RAM. I am adding an additional 8GB of RAM and buying a GTX 1060 6GB. This is a really powerful GPU which will perform great in the work associated with the SDCND.
meskie sprawy says
As a website possessor I believe the content material here is really excellent; appreciate it for your hard work. You should keep it up forever! Best of luck.
Tim Dettmers says
Thank you! I aim to keep it up forever 🙂
Hrishikesh Waikar says
Hi Tim ,
Wonderful article. However, I am about to buy a new laptop. What do you think about the idea of a gaming laptop for deep learning, with an NVIDIA GTX 980M or GTX 1060/1070?
Tim Dettmers says
Definitely go for the GTX 10-series GPUs for your laptop, since these are very similar to full desktop GPUs. They are probably more expensive, though. Another option would be to buy a cheap, light laptop with long battery life and a separate desktop to which you connect remotely to run your deep learning work. The last option is what I use, and I am quite fond of it.
Alisher says
I am very happy that I thought the same way you did. I bought a MacBook Air, which is very portable, and I am going to buy a desktop with better specifications to run my experiments on.
I had a question, but I have asked it in a previous comment.
Thank you again for the very useful information.
Regards,
Tim Dettmers says
You are welcome! I am glad I could help out!
Tim
panovr says
Great article, and thanks for sharing!
I want to configure my working layout like yours: “Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.”
Do I need extra configuration in addition to connect 3 monitors to the motherboard? Is there any additional hardware need for this 3 monitors configuration?
Thanks!
Tim Dettmers says
No extra configuration is required other than the normal monitor configuration for your operating system. Your GPU needs to have enough connectors and support 3 monitors (most modern GPUs do).
Poornachandra Sandur says
Hi Tim,
Thank you for sharing your knowledge; it was very beneficial for understanding the concepts in DL.
I have a question:
how do I feed custom images into a CNN for object recognition using Python? Please give some pointers on this.
Tim Dettmers says
You will need to rescale custom images to a specific size so that you can feed your data into a CNN. I recommend looking at the ImageNet examples of common libraries (Torch7, TensorFlow) to understand the data loading process. You will then need to write an extension which resizes your images to the proper dimensions, for example 1080×1920 -> 224×224.
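As a rough illustration of the resizing step (assuming Python with Pillow; the file name is a placeholder):

```python
from PIL import Image

# Load a custom image and rescale it to the input size the CNN expects,
# e.g. 1080x1920 -> 224x224 (aspect ratio is not preserved here).
img = Image.open("my_photo.jpg").convert("RGB")
img_small = img.resize((224, 224), Image.BILINEAR)
img_small.save("my_photo_224.jpg")
```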
Alisher says
Firstly, I am very thankful for your post. It is very nice and very helpful.
One thing I wanted to point out: you can feed the images into the network (in Caffe) as they are. I mean, if you have a 1080×1920 image, there is no need to reshape it to 224×224. But this does not mean that feeding the image as-is performs better; I think this could be a standalone research topic 🙂
Secondly, I am planning to buy a desktop PC, and since I am a deep learning researcher (beginner) I am going to do a lot of experiments on ImageNet and other large-scale datasets. Do you suggest buying a gaming PC directly, or would it be a wiser choice to build my own PC?
I was considering the Asus ROG G20CB P1070.
Thank you very much in advance!
Regards,
Tim Dettmers says
Building your own PC would be the better choice in the long term. It can be daunting at first, but it is often easier than assembling IKEA furniture, and unlike IKEA furniture there is a multitude of resources on how to do it step by step. After you have built your first desktop, building the next ones will be easy and rewarding, and you will save a lot of money to boot!
Shahid says
Thank you Tim!
Prasanna Dixit J says
This is a good overview of the HW that matters for DL. I would like your view on the OpenPOWER-NVIDIA combo, and the economics of setting up an ML/DL lab.
Tim Dettmers says
I think non-consumer hardware is not economically efficient for setting up an ML/DL lab. However, beyond a certain number of GPUs, traditional consumer hardware is no longer an option (NVIDIA will not sell you consumer-grade GPUs in bulk, and there might also be problems with reliability). I would recommend getting as much traditional, cheap, consumer hardware as possible and mixing it with some HPC components, like cheap Mellanox InfiniBand cards and switches from eBay.
Arthur says
Great hardware guide. Thank you for sharing your knowledge.
Shashwat Gupta says
Hey, I wanted to ask if the nvidia quadro k4000 will be a good choice for running convolutional nets?
Tim Dettmers says
A K4000 will work, but it will be slow and you cannot run big models on large datasets such as ImageNet.
Shashwat Gupta says
Shall I get a GTX1080 instead?
Ashiq says
Hi Tim
Thanks for the great article and your patience to answer all the questions. I just built a dev box with 4 Titan X Pascal and need some advice on air flow. For reference, here is the Part list: https://pcpartpicker.com/list/W2PzvV and the Picture: http://imgur.com/bGoGVXu
Loaded Windows first for stress testing the components and noticed the GPUs temps reached 84C while the fans are still at 50%. Then the GPUs started slowing down to lower/maintain the temp. Then with MSI Afterburner, I could specify a custom temp-vs-fanspeed profile and keep the GPU temps at 77C or below – pretty much what you wrote in the cooling section above.
There is no “Afterburner” for Linux, and apparently the BIOS of the Titan X Pascal is locked, so we can’t flash the cards with custom temperature settings. The only option left for me is to play with coolbits, and I prefer not to attach 4 monitors (I already have two 30-inch monitors attached to a Windows computer that I use for everything; 6 monitors on the desk would be too much).
I wonder if you have found any new way of emulating monitors for Xorg, as my preferred option would be to keep 3 of the GPUs headless?
Cheers
Ashiq
Tim Dettmers says
I did not succeed in emulating monitors myself. Some others claim that they got it working. I think the easiest way to increase the fan speed would be to flash the GPU with a custom BIOS. That way it will work in both Windows and Linux.
spuddler says
Not sure, but maybe there exist specific dummy plugs to help “emulate” monitors, if it is not possible purely in software. At least DVI and HDMI dummy plugs worked for cryptocurrency miners back in the day.
Ashiq says
So I got it (virtual screens with coolbits) working by following the clues from http://goo.gl/FvkGC7. Here (https://goo.gl/kE3Bcs) is my X server config file (/etc/X11/xorg.conf), and I can change all 4 fan speeds with nvidia-settings.
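In case anyone wants to script the fan speeds rather than set them by hand, here is a rough Python sketch that shells out to nvidia-settings (this assumes Coolbits is enabled as in the config above; the attribute names are the commonly documented ones for recent drivers, so verify them on your own system first):

```python
import subprocess

def set_fan_speed(gpu_index, fan_index, percent):
    # Enable manual fan control on the GPU, then set the target fan speed.
    subprocess.check_call([
        "nvidia-settings",
        "-a", "[gpu:%d]/GPUFanControlState=1" % gpu_index,
        "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (fan_index, percent),
    ])

# Pin all four fans to 70% (assumes one fan per GPU with matching indices).
for i in range(4):
    set_fan_speed(i, i, 70)
```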
Tim Dettmers says
Thanks Ashiq — that sounds great! Thank you for sharing the link!
Piotr Czapla says
Hi Ashiq,
Would you mind sharing how loud your setup is? It looks very similar to the one I'm planning to build, and I'm torn between liquid cooling and air cooling. Will I be able to hear it from 10 meters away?
Regards,
Piotr
anon says
Hi Tim,
Could you recommend any Mellanox ConnectX-2 cards for GPU RDMA?
Some are Ethernet-only (the MNPA19 XTR, for example), and I wonder if those can be flashed to support RDMA, or whether I should just buy a card which supports InfiniBand outright?
anon says
Hi Tim,
I just got 5 Dell Precision T7500s in an auction.
I haven't received them yet, but the description mentions an NVIDIA Quadro 5000 installed.
Would it be worth replacing them, or are they enough for starting out?
The machines themselves have 12GB of DDR3 (ECC, I presume) RAM and a Xeon 5606, as described.
Tim Dettmers says
The Quadro 5000 has only a compute capability of 2.0 and thus will not work with most deep learning libraries that use cuDNN. Thus it might be better to upgrade.
anon says
Thanks.
I am thinking of going with a GTX 1060.
Is there any difference, though, between the EVGA, ASUS, MSI, or NVIDIA versions?
These are the options I see when I search on eBay.
Gautam Sharma says
That shouldn't matter much. Don't go with the NVIDIA Founders Edition; it doesn't have a good cooling system. Just go with the cheapest one, which is the EVGA. It is one of the most promising brands. I just ordered the EVGA one.
Tim Dettmers says
Please note that the EVGA GTX 1080 currently has cooling problems which are only fixed by flashing the BIOS of the GPU. This card may begin to burn without this BIOS update.
Toqi Tahamid says
My current CPU is an Intel Core i3 2100 @ 3.1GHz and I have 4GB of RAM. My motherboard is a Gigabyte GA-H61M-S2P-B3 (rev. 1.0), which supports PCIe 2.0. Can I use a GTX 1060 with my current configuration, or do I need to change the board and the CPU? I want to keep the cost as low as possible.
Tim Dettmers says
You should be able to run a GTX 1060 just fine. The performance should be only 5-10% less than on an optimal system.
zac zhang says
Awesome! Thanks for sharing. Can you tell me how much it would cost to build such a cluster? Cheers!
Tim Dettmers says
Basically, it is two regular deep learning systems joined together with InfiniBand cards. You can get InfiniBand cards and a cable quite cheaply on eBay. The total cost for a 6-GPU, 2-node system would be about $3k for the systems and InfiniBand cards, plus an additional $6k for the GPUs (if you use the Pascal GTX Titan X), for a total of $9k.
Shravankumar says
I am using an Asus K55VJ: 3rd-gen i5, NVIDIA GeForce GT 635M 2GB, with a 750GB HDD and 8GB of RAM. Does my computer support deep learning?
Tim Dettmers says
Your GPU has compute capability of 2.1 and you need at least 3.0 for most libraries — so no, your computer does not support deep learning on GPUs. You could still run deep learning code on the CPU, but it would be quite slow.
Jacqueline says
Hi Tim
Is this one a good one for a Deep Learning Researchers?
https://www.bhphotovideo.com/c/product/1269213-REG/asus_g20cb_db71_gtx1070_republic_of_gamers_g20cb.html
thank you!
Tim Dettmers says
It is a bit pricey, and there are not many details about the motherboard. Also, the GPU might be a bit weak for researchers.
I would also encourage you to buy the components and put them together on your own. This may seem like a daunting task, but it is much easier than it seems. This way you get a high-quality machine that is cheap at the same time.
Gilberto says
Hi Tim,
first of all, thank you for sharing all this precious information.
I am new to neural networks and Python.
I want to test some ideas on financial time series.
I'm starting to learn Python, Theano, and Keras.
After reading your article, I decided to upgrade my old PC.
I know almost nothing about hardware, so I would like to ask your opinion about it.
Current configuration:
– Motherboard: Gigabyte GA-P55A-UD3 (specification at: http://www.gigabyte.com/products/product-page.aspx?pid=3439#sp)
– Intel i5 2.93 GHz
– 8 GB RAM
– GTX 980
– PSU power: 550 watts
I may add:
– SSD (I will install Ubuntu and use it only via the command line – no graphical interface)
Is the power supply powerful enough for the new card?
Does the motherboard support the new card?
Thank you very much,
Gilberto
Tim Dettmers says
The motherboard should work, but it will be a bit slower. The PSU is borderline; it might be a few watts too weak or just right, it's hard to tell.
Arman says
Hi Tim,
I had a question about the new pascal gpu’s. I am debating between Gtx 1080 and Titan X. The price of Titan X is almost double the 1080’s. Excluding the fact that Titan X has 4 more Gb memory, does it provide significant speed improvement over 1080 to justify the price difference?
Thanks,
Juan says
Hi,
I am not Tim (obviously), but as far as I understood from his other post on GPUs (http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/), he states that for research-level work it actually makes a difference, especially when you are working with video datasets. But, for example: “While 12GB of memory are essential for state-of-the-art results on ImageNet on a similar dataset with 112x112x3 dimensions we might get state-of-the-art results with just 4-6GB of memory.”
Hope this can help you.
DarkIdeals says
If you can afford it, the TITAN X is DEFINITELY worth it over the 1080 in most cases. Not only does it have that 12GB of VRAM to work with, but it also has features like INT8 (the way I understand it, you can store floats as 8-bit integers, which helps efficiency; potentially quite useful) and 44 TOP units (kind of like ROPs, but not for graphics rendering; they are beneficial to deep learning, though).
Basically the TITAN X is literally identical to the $7000 Tesla P100 just without the Double Precision FP64 capability and without HBM2 memory (The TITAN X uses GDDR5X instead, however it’s not much of a difference as the P100’s memory bandwidth even with the HBM is only 540 GB/second whereas the TITAN X is very close at 480 GB/second and hits 530 GB/second when you overclock the memory from 10,000mhz to 11,000mhz so it’s literally no difference really) Other than those things and the certified Tesla Drivers there’s literally no real difference between the P100 and the TITAN X Pascal; which is very important as the Tesla P100 is literally THE most powerful graphic card on the planet right now!
The important thing to mention is that Double Precision isn’t really important for Neural nets etc.. that you deal with in Deep Learning; so for $1,200 you are getting the power of the $7,000 monster supercomputer chip of the Tesla P100 just without all the unnecessary server features that Deep Learning doesn’t use.
Also, in comparison to the GTX 1080, the TITAN X has a significant advantage in memory capacity (12GB vs 8GB on the 1080), in memory bandwidth (530 GB/s when overclocked on the TITAN X vs 350 GB/s on the 1080 when overclocked: a fifty percent increase!), and in CUDA cores (40% more, which, combined with the larger memory capacity and 50% higher bandwidth, easily nets you ~60% more performance in some scenarios over the 1080).
Hope this helps; the TITAN X is a GREAT chip for deep learning, the best currently available in my opinion. Which is why I bought two of them.
DarkIdeals says
(sorry for the long post but it is important to your decision so try to read it all if you have time)
Hey, correcting an error in my earlier post. Like I said, I wasn't quite sure if I understood the INT8 functionality properly, and I was wrong about it. Apparently there was a typo on the spec pages of the Pascal TITAN X; it said “44 TOPs” and made me think it was an operation pipeline of sorts, similar to a “ROP”, which is responsible for displaying graphical images.
It was actually referring to INT8, which is basically just 8-bit integer support. The average GPU runs with 32-bit “full precision” accuracy, which is a measure of how much time and effort is put into each calculation made by the GPU. For example, with 32-bit it may only go out to a few decimal places when calculating the physics of water in a 3D render, which is plenty good for things like video games and your average video editing and rendering project. But for things like advanced physics calculations by big universities trying to determine the fully accurate behavior of each individual molecule of H2O within a body of water, to see EXACTLY how it moves when wind blows, you would need “double precision”: a 64-bit calculation with much more accuracy, going to more decimal places before deciding that the calculation is “close enough”.
Only special cards like Quadros and Teslas have high 64-bit performance; they usually have half the teraflops in 64-bit mode compared to 32-bit mode, so a Quadro P6000 (same GPU as the TITAN X Pascal but with full 64-bit support) has 12 teraflops of power in 32-bit mode and ~6 teraflops in 64-bit mode. But there is also a 16-bit “half precision” mode for things requiring even less accuracy. INT8, to my understanding, is basically an “8-bit quarter precision” mode, with even less focus on total mathematical accuracy; and this is useful for deep learning, as some of the work done doesn't require that much accuracy.
So, in other words, in 8-bit mode the TITAN X has “44 teraflops” (more precisely, 44 tera-operations per second) of performance.
Tim Dettmers says
Your analysis is pretty much correct. For some games there are already elements which make heavy use of 8-bit integers. Before, however, it was not possible to do 8-bit integer computation directly: you had to first convert both numbers to 32-bit, do the computation, and then convert the result back. This was done implicitly by the GPU, so that no extra programming was necessary. Now the GPU is able to operate on 8-bit integers directly. However, the support is still quite limited, so you will not see 8-bit deep learning just yet. In a year at the earliest would be my guess, but I am sure it will arrive at some point.
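To make the 8-bit idea concrete, here is a small NumPy sketch of simulated integer quantization (the values and scale are chosen arbitrarily for illustration):

```python
import numpy as np

def quantize(x, scale):
    """Map a float value to int8 using a simple linear scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

a, b = 0.42, -1.37
scale = 0.02
qa, qb = quantize(a, scale), quantize(b, scale)

# Multiply in a wider integer type (the hardware accumulates in int32),
# then rescale back to floating point.
q_prod = np.int32(qa) * np.int32(qb)
approx = q_prod * scale * scale

print(a * b, approx)  # exact vs. 8-bit approximation: close, but not identical
```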
Ionut Farcas says
First of all, really nice blog and well made articles.
Do you think that spending £240 more for a 1070 (2048 CUDA cores) instead of a 1060 (1280 CUDA cores) in a laptop is worth it? Does the complexity of the most commonly used deep learning algorithms require the extra 768 CUDA cores?
Thank you.
Tim Dettmers says
I am not sure how easy it is to upgrade the GPU in the laptop. If it is difficult, this might be one reason to go with the better GPU since you will probably also have it for many years. If it is easy to change, then there is not really a right/wrong choice. It all comes down to preference, what you want to do and how much money you have for your hardware and for your future hardware.
sk06 says
Hi,
I just bought two Supermicro 7048GR-TR server machines with 4 Titan X cards in each machine. I'm confused about how to configure the servers: how many partitions to make, and how to utilize the 256GB SSD and the two 4TB hard drives in each machine. The servers will only be used for deep learning applications. Which deep learning framework should I use (TensorFlow, Caffe, or Torch), considering the two servers? I work in the medical imaging domain and recently started getting into deep learning. Please help me with your valuable suggestions.
Link for server configuration:
https://www.supermicro.com.tw/products/system/4u/7048/SYS-7048GR-TR.cfm
Thanks and Regards
sk06
Tim Dettmers says
The servers have a slow interconnect; that is, they only have gigabit Ethernet, which is a bit too slow for parallelism across machines. So you can focus on setting up each server separately. It depends on your dataset size, but you might want to dedicate the SSD to your datasets, that is, install the OS on the hard drive. If your datasets are < 200GB, you could also install the OS on the SSD to have a smoother user experience. The frameworks all have their pros and cons. In general I would recommend TensorFlow, since it has the fastest-growing community.
sk06 says
Thanks for the suggestions. I tried training my application with 4 GPUs on the new server. To my shock, training AlexNet took 2 hours 30 minutes with 4 GPUs, while it took 35 minutes with a single GPU. I used Caffe for this. Please let me know where I am going wrong! The batch size and other parameter settings are the same as in the original paper.
Thanks and Regards
sk06
chanhyuk jung says
I just started learning about neural networks and I'm looking forward to studying them. I have a GT 620 with a dual-core Pentium G2020 clocked at 3.3GHz and 8GB of RAM. Would it be better to buy a 1060 and two 8GB RAM sticks for the future?
Tim Dettmers says
Yes. The GT 620 does not support cuDNN, which is important deep learning software and makes deep learning more convenient, because it gives you more freedom in choosing your deep learning framework. You will have less trouble if you buy a GTX 1060. 16GB of RAM will be more than enough; I think even 8GB could be okay. Your CPU will be sufficient, no upgrade required.
Vasanth says
Hi Tim,
Many thanks for this post and your patient responses. I had a question to ask: NVIDIA gave away the Tesla K40C (which is the workstation version of the K40, as I understand it) as part of its Hardware Grant Program (I think they are giving away the Titan X now, but they were giving Tesla K40Cs until recently). It's not clear to me which workstations from standard OEMs like Dell/HP are compatible with a K40C. I have spoken to a few vendors about compatibility issues, but I don't seem to get convincing, knowledgeable responses. I am concerned about buying a workstation which would later not be compatible with my GPU. Would it be possible for you to share any pointers you may have?
Thank you very much in advance.
Tim Dettmers says
The K40C should be compatible with any standard motherboard just fine. The compatibility that hardware vendors stress is often assumed for data centers, where the cards run hot and need to do so permanently for many months or years. The K40 has a standard PCIe connector, and that is all you need from your motherboard.
Wajahat says
Hi Tim
Thanks a lot for your useful blog.
I am training a CNN on both CPU and GPU.
Although the weights are randomly initialized, I set the random seed to zero at the beginning of training. Still, the weights learned on the CPU differ from those learned on the GPU. The difference is not huge (e.g. -0.0009 vs -0.0059, or 0.0016 vs 0.0017), but it is a difference I can notice. Do you have any idea how this could be happening? I know it is a very broad question, but what I want to ask is: is this expected or not?
I am using Matlab R2016a with MatConvNet 1.0 beta20 (an NVIDIA Quadro 410 GPU on Windows 7 and a GTX 1080 on Ubuntu 16.04; Core i7 4770 and Core i7 4790).
Exactly the same data with the same network architecture is used.
Best Regards
Wajahat
Tim Dettmers says
This can well be true and normal. The same seed can produce different random numbers on the CPU and GPU if different algorithms are used. Convolution on GPUs may also include some non-deterministic operations (cuDNN 4). When using unit tests to compare CPU and GPU computation, I also often see some difference in output given the same input, so I assume that there are also small differences in the floating point computation (although very small ones). All this might add up to your result.
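As a tiny illustration of how reordering floating point operations legitimately changes results (plain Python, nothing framework-specific; CPU and GPU kernels effectively pick different orderings):

```python
# Summation order changes the result slightly in floating point arithmetic.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)  # 0.6000000000000001 0.6 False
```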
Arman says
Thanks for the great guide.
I had a question. What is the minimum build that you recommend for hosting a Titan X pascal?
Tim Dettmers says
For a single Titan X Pascal, and if you do not want to add another card later, almost any build will do. The CPU does not matter; you can buy the cheapest RAM and should have at least 16 GB of it (24 GB will be more than enough). For the PSU, 600 watts will do; 500 watts might be sufficient. I would buy an SSD if you want to train on large datasets or raw images that are read from disk.
anon says
How are you so patient with everyone’s questions ?
Tim Dettmers says
There are several reasons:
– I led a team of 250 in an online community, and people often asked me for help and guidance. At first I sometimes lent support and sometimes I did not. However, over time I realized that not helping out can create problems: it demotivates people from something which they really want to do but do not know how to do, and it produces defects in the social environment (when I do not help out, others take example from my actions and do the same), among other things. Once I started always lending a hand, I found that I did not lose as much time as I thought I would. Due to my background knowledge in this online community, it was often faster to help than to think about whether some question or request was worthy of my help. I now always help without a second thought, or at least start helping until my patience grows tired.
– Helping people makes me feel good
– I was born with genes which make me smart and which make me understand some things more easily than others. I feel that I have a duty to give back to those who were less fortunate in the birth lottery.
– I believe everybody deserves respect. Answering questions which are easy for me to answer is a form of respect
I hope that answers your question 🙂
Andrew says
You are an amazingly good person Tim. The world needs more people like you. Your actions encourage others to behave in a similar way which in turn helps build better online and offline communities. Thank you!
Tim Dettmers says
Thank you for the kind words!
Michael Lanier says
How do the new NVIDIA 10xx compare? I followed through with this guide and ended up getting a GTX Titan. The bandwidth looks slightly higher for the Titan series. Does the architecture affect learning speeds?
Tim Dettmers says
The bandwidth is high for all Titans, but their performance differs from architecture to architecture; for example, Kepler (GTX Titan) is much slower than Maxwell (GTX Titan X) even though they have comparable bandwidth. So yes, the architecture does affect learning speed, quite significantly so!
drh1 says
Hi Tim,
thanks for some really useful comments. I have a hardware question. I've configured a Windows 10 machine for some GPU computing (not DL) at the moment. I think the hardware issues overlap with your blog, so here goes:
the system has a GTX 980 Ti card and a K40 card on an ASUS X99 Deluxe motherboard. When the system boots up, the 980 (which also runs the display) is fine, but the K40 gives me “This device cannot start. (Code 10). Insufficient system resources exist to complete the API”. I have the most up-to-date drivers (354.92 for the K40, 368.81 for the 980).
Has anyone configured a system like this, and did they have similar problems? Any ideas will be greatly appreciated.
Tim Dettmers says
It might well be that your GPU driver is the problem here. There are separate drivers for Tesla and GTX GPUs; you have the GTX variant installed, and thus the Tesla card might not work properly. I am not entirely sure how to get around this problem. You might want to configure the system as a headless (no monitor) server with the Tesla drivers and connect to it using a laptop (you can use Remote Desktop on Windows, but I would recommend installing Ubuntu).
bmahak says
I want to build my own deep learning machine using a Skylake motherboard and CPU. I am planning not to use more than 2 GPUs (GTX 1080), starting with one GPU first and upgrading to a second one if needed.
Here is my setup on PCPartPicker: http://pcpartpicker.com/user/bmahak2005/saved/Yn9qqs
Please tell me what you think about it.
Thanks again for a great article .
HB.
Tim Dettmers says
The motherboard and CPU combo that you chose only supports 8x/8x speed for the PCIe slots. This means you might see some slowdown in parallel performance if you use both of your GPUs at the same time. The decrease might vary between networks, with roughly 0-10% performance loss. Otherwise the build seems to be okay. Personally, I would go with a few more watts on the PSU just to have a safe buffer.
David Selinger says
Hey there Tim,
Thanks for all the info!
I was literally pushing send on an email that said “ORDER IT” to my local computer build shop when NVIDIA announced the new Titan X Pascal.
Do you have any initial thoughts on the new architecture? Especially as it pertains to cooling the VRAM, which usually requires some sort of custom hardware (cooling plate? my terminology is likely wrong here): will that add additional delay after purchasing the new hardware?
Thank you sir!
Tim Dettmers says
There should be no problems with cooling the GDDR5X memory with the normal card layout and fans. I know that for HBM2 NVIDIA actually designed the memory to be actively cooled, but HBM2 is stacked while GDDR5X is not. Generally, GDDR5X is very similar to GDDR5 memory. It consumes less power but also offers higher density, so on the bottom line GDDR5X should run at the same temperature level or only slightly hotter than GDDR5 memory — no extra cooling required. Extra cooling makes sense if you want to overclock the memory clock rate, but often you cannot get much more performance out of it relative to how much you need to invest in cooling solutions.
Overall the architecture of Pascal seems quite solid. However, most features of the series are a bit crippled due to manufacturing bottlenecks (16nm, GDDR5X, HBM2 all these need their own factories). You can expect that the next line of Pascal GPUs will step up the game by quite a bit. The GTX 11 series probably will feature GDDR5X/HBM2 for all cards and allow full half-float precision performance. So Pascal is good, but it will become much better next year.
David Selinger says
Cool thanks. That gave me something to chew on.
Last question (hopefully for at least a week : ) ): Do you think that a standard hybrid cooling closed-loop kit (like this one from Arctic: https://www.arctic.ac/us_en/accelero-hybrid-iii-140.html) will be sufficient for deep learning or is a custom loop the only way to go?
– VRM: heatsink + fan
– VRAM: Heatsink ONLY
– GPU: closed-loop water cooled
Obviously will have to confirm the physical fit once those specs become more available, but insofar as the approach, I was a little bit concerned about the VRAM.
The use case is convolutional networks for image and video recognition.
Thanks,
Selly
sk06 says
Hi Tim,
Thanks for the excellent post. The user comments are also pretty informative. Kudos to all.
I recently started shifting my focus from conventional machine learning to deep learning. I work in the medical imaging domain, and my application has a dataset of 50,000 color images (5,000 per class, 10 classes, size 512×512). I have a system with a Quadro K620 GPU. I want to train state-of-the-art CNN architectures like GoogLeNet Inception V3, VGGNet-16, and AlexNet from scratch. Will the Quadro K620 be sufficient for training these models? If I have to go for a higher-end GPU, can you please suggest which card I should go for (GTX 1080, Titan X, etc.)? I want to generate prototypes as fast as possible. Budget is not the primary concern.
Tim Dettmers says
A Quadro K620 will not be sufficient for these tasks. Even with very small batch sizes you will hit its limits pretty quickly. I recommend getting a Titan X on eBay. Medical imaging is a field with high-resolution images, where any additional amount of memory can make a good difference. Your dataset is fairly small, though, and probably represents a quite difficult task; it might be good to split up the images to get more samples and thus better results (quarter them, for example, if the label information is still valid for the resulting crops), which in turn would consume more memory. A GTX Titan X should be best for you.
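A rough sketch of the quartering idea (assuming Python with Pillow; the path is a placeholder):

```python
from PIL import Image

def quarter(path):
    """Split an image (e.g. 512x512) into four half-size crops to get more samples."""
    img = Image.open(path)
    w, h = img.size
    boxes = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
             (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
    return [img.crop(box) for box in boxes]

crops = quarter("case_0001.png")  # only valid if the label still holds per crop
```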
John says
Great article. What would you recommend for a laptop GPU setup rather than a desktop? I see a lot of laptop builds with a 980M or 970M GPU, but is it worth waiting for some variant of the 1080M/1070M/1060M?
Tim Dettmers says
A laptop with such a high-end graphics card is a huge investment, and you will probably use that laptop much longer than people use their desktops (it is much easier to sell your GPU and upgrade with a desktop). I would thus recommend waiting for the 1000M series. It seems it will arrive in a few months, and the first performance figures show that they are slightly faster than the GTX Titan X — that would be well worth the wait in my opinion!
Dante says
Tim,
Based on your guide, I gather that choosing a less expensive hexa-core Xeon CPU with either 28 or 40 lanes will not cause a great drop in performance. Is that correct (for 1-2 GPUs)? Can you share your thoughts?
Great guides. Very helpful for folks getting into deep learning and trying to figure out what works best for their budget.
Dante
Tim Dettmers says
Yes that is very true. There is basically no advantage from newer CPUs in terms of performance. The only reason really to buy a newer CPU is to have DDR4 support, which comes in handy sometimes for non-deep learning work.
Simon says
In general, I am looking for a cheaper way to assemble the system without decreasing performance.
Does NVIDIA coolbits make it possible to reduce how much the GPU heats up?
You wrote about “coolbits” on Ubuntu and the problem with headless setups.
Have you heard about DVI or VGA dummy plugs, e.g.:
http://www.ebay.com/itm/Headless-server-DVI-D-EDID-1920×1200-Plug-Linux-Windows-emulator-dummy-/201087766664
I think it would be a good solution for a video card with no monitor attached, with no problems controlling the fans via coolbits.
Simon says
Hi
The Asus X99-E WS spec shows that it has a PLX chip that provides an additional 48 PCIe lanes. Getting an i7-6850K with an X99-E WS theoretically gives you 88 PCIe lanes in total, and that is still plenty to run 4 GPUs, all at x16.
Does that hold true for deep learning?
Thanks for the reply.
Tim Dettmers says
I am not exactly sure how this feature maps to the CPU and to software compatibility. From what I have heard so far, you can quite reliably access GPUs from very non-standard hardware setups, but I am not so sure whether the software would support such a feature. If the GPUs are not aware of each other at the CUDA level due to the PLX chip, then this feature will do nothing good for deep learning (it would probably be even slower than a normal board, because you would likely need to go through the CPU to communicate between GPUs).
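As a quick sanity check on such a board, you can ask CUDA whether two devices report direct peer access to each other, for example through PyTorch (a sketch; the underlying CUDA call is cudaDeviceCanAccessPeer):

```python
import torch

# If peer access is not available, GPU-to-GPU communication has to be
# staged through CPU memory, which is what hurts parallel training.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```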
But the idea of a PLX chip is quite interesting, so if you are able to find out more information about software compatibility, then please leave a comment here — that would not only help you and me, but also all these other people that read this blog post!
Rikard Sandström says
Thank you for an excellent post, I keep coming back here for reference.
With regards to memory types, what role does GDDR5 vs GDDR5X play? Is this an important differentiator between offerings like 1080 and 1070, or is it not relevant for deep learning?
gameeducationjournal.info says
When I initially left a comment I seem to have clicked on the “Notify me when new comments are added” checkbox, and now every time a comment is added I get four emails with the exact same comment. Is there a means you can remove me from that service? Many thanks!
Tim Dettmers says
That sounds awful. I will check what is going wrong there. However, I am unable to remove a single user from the subscription. See if you can unsubscribe yourself; otherwise, please contact the Jetpack team. Apparently the data is stored by them, and the plugin that I use for this blog accesses that data, as you can read here. I hope that helps. Thanks for letting me know.
Arun Das says
Wonderful guide ! Thank you !
Milan Ender says
Hey,
first of all, thanks for the guide; it helped me immensely to get some clarity in this puzzle! 🙂
A couple of questions, as I'm a bit too impatient to wait for 1080/70 reviews on this topic:
As you stated, bandwidth, memory clock and memory size seem to be among the most important factors, so would it even make sense to put some more money into a solidly overclocked custom GPU? So far I'll just pick the cheapest solidly cooled one (EVGA ACX 3.0 probably).
Also, my initial analysis of the GTX 1070 vs the GTX 1080 was heavily in favor of the 1080, based on the benchmarks from http://www.phoronix.com/scan.php?page=article&item=nvidia-gtx-1070&num=4 . Though the theoretical single-precision TFLOPS MIXBENCH results were slightly in favor of the 1070 (76.6 €/TFLOP for the 1080 vs 73.9 €/TFLOP for the 1070), the SHOC-on-CUDA results in terms of price efficiency were slightly in favor of the 1080, but more or less the same. However, the GDDR5X on the 1080 seems to seal the deal for deep learning applications, I guess? Also, I found the 1080 around 6 Watt/TFLOP more efficient. Am I on the right track here? Maybe the numbers help some others here searching for opinions on that :).
Anyway, after reading through your articles and some others, I came up with this build:
http://pcpartpicker.com/list/LxJ6hq . Some comments would be very appreciated 🙂 . I feel like the CPU is a bit overkill, but it was the cheapest with DDR4 RAM and 40 lanes. Maybe it is not needed, though I'm a bit unsure about that.
Best regards
peter says
Hello Tim:
Thanks for the great post. I built the following PC based on it:
CPU: i5 6600
Motherboard: Z170-P
RAM: 16GB DDR4
GPU: NVIDIA GTX 1080 Founders Edition
PSU: 750W
However, after installing Ubuntu 14.04, I can't get CUDA 8.0 and the new driver installed (which they say GTX 1080 users have to update to).
Could the problem be caused by the other components of the PC, like the motherboard?
Thanks!
Tim Dettmers says
I have heard that people have problems with Skylake under Ubuntu 14.04, but I am not sure if that is really the problem. You can try upgrading to Ubuntu 16.04, because Skylake support is better under that version, but I am not sure if that will help.
Poornachandra Sandur says
Hi Tim Dettmers,
Your blog is awesome. I currently have a GeForce GTX 970 in my system; is that sufficient for getting started with convolutional neural networks?
Tim Dettmers says
A GTX 970 is an excellent option for exploring deep learning. You will not be able to train the very largest models, but that is also not something you want to do while you explore. It is mostly about learning how to train small networks on common and easy problems, such as AlexNet-like convolutional nets on MNIST, CIFAR-10 and other small datasets, until you get a “feel” for training convolutional nets, so that you can then go on to larger models and larger datasets (ResNet on ImageNet, for example). So everything is good.
Adrian Sarno says
I haven't been able to boot this MSI laptop with any of the flavors of 14.04 (Lubuntu, Xubuntu, Kubuntu, Ubuntu); could it be that the Skylake processor is not compatible with 14.04?
https://bugzilla.kernel.org/show_bug.cgi?id=109081
Looks like I will have to wait until a fix is created for the upstream Ubuntu versions, or until NVIDIA updates CUDA to support 16.04. Is there anything else I can try?
Thanks!
Tim Dettmers says
Laptops with an NVIDIA GPU in combination with Linux are always a pain to get running properly, as it often also depends very much on the other hardware in your laptop. I do not have any experience with this case, but you might be able to install 14.04 and then try to patch the kernel with what you need. Not easy to do, though.
David Laxer says
Any comments on MIT’s Eyeriss chip?
David Laxer says
http://www.rle.mit.edu/eems/wp-content/uploads/2016/02/eyeriss_isscc_2016_slides.pdf
Glenn says
Thanks for all the info. If I plan to use only one GPU for computation, then would I expect to need two GPUs in my system: one for computation and another for driving a couple of displays? Or can a single GPU be used for both jobs?
Tim Dettmers says
A single GPU is fine for both. A monitor will use about 100-300MB of your GPU memory and usually costs an insignificant amount (<2%) of performance. It is also the easier option, so I would recommend just using a single GPU.
Yasumi says
For deep learning on speech recognition, what do you think of the following specs?
It’s going to cost 2928USD. What are your thoughts on this?
– INTEL CORE I7-6800K UNLOCKED FOR OC(28lanes)(6 CORE/ 12 THREADS/3.8GHZ) NEW!
– XSPC RayStorm D5 Photon AX240 (240L)
– ASUS X99-E WS (ATX/4way SLI/8x Sata3/2xGigabit LAN/10xUSB3.0)
– 4 x GSKILL RipjawsV RED 2x8GB DDR4 2400mhz (CL15)
-ZOTAC GTX1080 8GB DDR5X 256BIT Founder’s Edition (1733/10000)-NEW
– SuperFlower Leadex Gold 650W(80+Gold/Full Modular)*5 Years Warranty
– CORSAIR AIR 540 BLACK WINDOW
– INTEL 540s 480GB 2.5″ Sata SSD (560/480)
Tim Dettmers says
This is a good build for a general computation machine. A bit expensive for deep learning, as the performance is mostly determined by the GPU. Using more GPUs and cheaper CPU/Motherboard/RAM would be better for deep learning, but I guess you want to use the PC also for something different than deep learning :). This would be a good PC for kaggle competitions. If you plan on running very big models (like doing research) then I would recommend a GTX Titan X for memory reasons.
Adrian Sarno says
thanks so much for your advice! I managed to install Xubuntu 16.04; now the next step is installing CUDA and TensorFlow, and I will need all the advice I can get with that one.
The problem I have with Ubuntu Desktop is known; it looks like they are going to address it in 16.04.1 (sorry for the slightly off-topic comment).
http://askubuntu.com/questions/760051/ubuntu-16-04-0-final-unity-desktop-kubuntu-gnome-can-not-boot-from-live-us/760124
Spuddler says
You should try to use 14.04; 16.04 can still give you lots of headaches right now.
This is how I do it: http://pastebin.com/E6uFu2Em
This will not work on 16.04, probably for hundreds of reasons.
Adrian Sarno says
I have a laptop with an NVIDIA Quadro M3000M (4.0GB GDDR5, PCI Express) that I would like to use for deep learning. I noticed that no one mentions Quadro cards in the context of deep learning; is there a design reason why these cards are not used for deep learning?
PS: I tried to install Ubuntu (all its flavors) and it fails to show the GNOME menu; it just shows the background desktop image.
Spuddler says
As far as I know, Quadro cards are usually optimized for CAD applications. You can use them for deep learning, but they will not be as cost-efficient as regular GeForce cards.
Your problem with Ubuntu not booting is a strange one; it does not really look like a graphics driver issue, since you get a screen. Before googling for more difficult troubleshooting procedures, I would try other Ubuntu 14.04 LTS flavours if I were you, like Xubuntu (Windows-like, lightweight), Kubuntu (Windows-like, fancy) or even Lubuntu (very lightweight). It may just be some arcane issue between Ubuntu's GNOME desktop and your hardware.
Nizam says
This is the most informative blog about building a deep learning machine!
Thanks for that.
Now that NVIDIA's 1080 and 1070 are launched, which is the better deal for us:
two 1070s or one 1080?
Everyone writes in the context of gamers 🙁
I badly need this community's voice here!
Epenko pentekeningen says
Question: for budgetary reasons I'm looking at an AMD CPU/board combination (4 cores), but that combination has no onboard video.
Can the GPU (a 4GB NVIDIA 960) which will be used for machine learning also be used at the same time as the video card (no 3D, of course)?
Does that work, or do I need an extra video card? Thanks!
Tim Dettmers says
Yes, that will work just fine! This setup would be a great setup to get started with deep learning and get a feel for it.
Adrian Sarno says
Tim,
I'm looking for information on which GPU cards have support for convolutional layers. In particular, I was considering a laptop with the GTX 970, but according to your blog above it does not support convolutional nets. Would you mind explaining what that means in terms of features and also time performance? Is there a way to know from the spec whether a card is good for conv nets?
thanks in advance
Tim Dettmers says
Maybe I have been a bit unclear in my post. The GTX 970 supports convolutional nets just fine, but if you use more than 3.5GB of memory you will be slowed down. If you use 16-bit networks, though, you can still train reasonably sized networks. So a GTX 970 is okay for most non-research, non-I-want-to-get-into-the-Kaggle-top-5 use cases.
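As a side note, switching a network to 16-bit is a one-liner in some frameworks; here is a minimal PyTorch sketch (no loss scaling or other precautions taken, and the tiny model is just a placeholder):

```python
import torch
import torch.nn as nn

# Convert weights and inputs to float16 to roughly halve memory usage.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 16, 3)).cuda().half()
x = torch.randn(8, 3, 64, 64).cuda().half()
out = model(x)
```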
Greg says
Hey Tim, quick question: do you have any opinion about the new GeForce GTX 1080 for deep learning?
Maybe you already gave your opinion and I missed it.
Thanks,
Greg
Thomas R says
Hi Tim, did you connect your 3 monitors to the mainboard/CPU or to your GPU? Does this have an influence on the deep learning computation?
Tim Dettmers says
I connected them to two GPUs. It does not really affect performance (maybe 1-3% at most), but it does take up some memory (200-500MB). Overall, this effect is negligible.
DD Sharma says
Hello Tim,
Comparing two cards for GPGPU (deep learning being an instance of GPGPU computing), what is more important: the number of cores or the memory? For learning purposes, and maybe some model development, I am considering a low-end card (512 cores, 2GB). Will this seriously cripple me? Other than giving up performance gains, will it be seriously constraining? I checked the research work of folks from 5+ years ago, and many in academia used processors with even weaker specs and still got something done. Once I discover that I am doing something really serious, I can go to the Amazon cloud, get an external GPU (connected via Thunderbolt 3), or build a machine.
Tim Dettmers says
Neither cores nor memory is important per se. Cores do not really matter; bandwidth is the most important factor and FLOPS the second most important. You do, however, need a certain amount of memory to train certain networks. For state-of-the-art models you should have more than 6GB of memory.
Bob Sperry says
Hi Tim,
I suppose this is echoing Jeremy’s question, but is there any reason to prefer a Titan X to a GTX 1080 or 1070? The only spec where the Titan X still seems to perform better is in memory (12 GB vs. 8 GB).
I got a Titan X on Amazon about 2.5 weeks ago, so have about 10 days to return it for a full refund and try for a GTX 1080 or 1070. Is there any reason not to do this?
Tim Dettmers says
No deep learning performance data is currently available for the GTX 1000s, but it is rather safe to say that these will yield much better performance. If you use 16-bit, and probably most libraries will change to that soon, you will see an increase of at least 2 times in performance. I think returning your Titan X is a good idea.
Spuddler says
Just wanted to add that Nvidia artificially crippled 16-bit operation on the GTX 1070/1080 to abysmal speeds, so we can only hope they don’t do the same with the Pascal Titan card.
Jerry says
Hi Tim. Thanks for an excellent guide! I was wondering what your opinion is on Nvidia’s new graphics card – Nvidia Geforce GTX 1080. The performance is said to beat the Titan X and is proposed to be half the price!
Gilbert says
Hi, does the number of CUDA cores matter? The GTX 1080 is about to be released and it has about 2500 CUDA cores whereas a GTX 980 Ti has about 2800 CUDA cores. Will this affect the speed of training? Or will the GTX 1080 in general be faster with its 8 teraflops of performance?
Tim Dettmers says
The number of cores does not really matter. It all depends on how these cores are integrated with the GPU. The GTX 1080 will be much faster than the GTX Titan X, but it is hard to say by how much.
Gilbert says
So you’d recommend that I invest myself in a GTX 1080 instead? 🙂
Daniel Rich says
So reading in this post that bandwidth is the key limiter makes me think the GTX 1080, with a bandwidth of 320 GB/s, will be slightly worse for deep learning than a 980 Ti. Does that sound right?
Tim Dettmers says
You cannot compare the bandwidth of a GTX 980 with the bandwidth of a GTX 1080 because the two cards use different chipsets. The GTX 1080 will definitely be faster.
DD Sharma says
Tim,
Any updates to your recommendations based on Skylake processors and especially Quadro GPUs?
Tim Dettmers says
Skylake is not needed and Quadro cards are too expensive — so no changes to any of my recommendations.
Lucian says
Hi Tim, great post!
Could you talk a bit about having different graphics cards in the same computer? As an extreme example, would having a Titan X, 980 Ti and a 960 be problematic?
Dorje says
Thank you very much, Tim.
I got a Titan X, hahaha~
Cheers,
Dorje
Eduardo says
Hi, I am a Brazilian student, so everything is way too expensive for me. I will buy a GTX 960, start off with a single GPU and expand later on. The problem is that Intel CPUs with 30+ lanes are WAY too expensive. So I HAVE to go with AMD, but the motherboards for AMD only have PCIe 2.0.
My question is: can I get good performance out of 2 x 960 GPUs on a PCIe 2.0 x16 mobo? By good I mean equal to a single 960 with x16 on PCIe 3.0, maybe even a single GTX 980.
Tim Dettmers says
Hi, both an Intel CPU with 16 lanes or less (as long as your motherboard supports 2 GPUs) and an AMD CPU with PCIe 2.0 will be fine. You will not see large decreases in performance. It should be about 0-10% depending on the task and deep learning software.
If you are short on money it might also be an option to use AWS GPU instances. If you do not train every day this might be cheaper in the end. However, for tinkering around with deep learning a GTX 960 will be a pretty solid option.
Raj says
Thanks for the great blog, I learned a lot.
For me getting a 40-lane or even 28-lane CPU and motherboard is out of budget. In my country these parts are rare.
I am planning to get a 16-lane CPU. With this I can get a motherboard which has 2x PCIe 3.0 x16. I plan to use a single GPU initially. If I want to use 2 GPUs it has to be an x8/x8 configuration. With this configuration is it practical to use 2 GPUs in the future?
My system will likely have i7 6700, Asus Z170-A and Titan X.
Cheers,
RK
Tim Dettmers says
Hi RK,
16 lanes should still work well with 2 GPUs (but make sure the CPU supports x8/x8 lanes — I think every CPU does, but I never used them myself). The transfer to the GPU will be slower, but the computation on the GPU should still be as fast. You will probably see a performance drop of 0-5% depending on the data that you have.
RK says
Thanks for the fast reply.
Tim Dettmers says
You are welcome 🙂
Yi says
Hi Tim,
Thanks for the great post. Sorry to bother you again. I just want to ask something about the coolbits option of the GPU cards. Right now, I set it to 12 and I can manually control the fan speed. It works nicely. But I won’t check the temperature all the time and change the fan speed accordingly. So during training, what fan speed should I use? 50%, 80%, or an aggressive 90% maybe? Thanks a lot.
And if I keep the fan always running at 80% speed, will it reduce the lifespan of the card? Thanks.
Tim Dettmers says
The life expectancy of the card will increase the cooler you keep it. So if you can, keep the fan at 100% at all times. However, this can of course cause problems with noise if the machine is near you or other people. For my desktop I keep the fan as low as possible while keeping the GPU below 80 degrees C, and if I leave the room I just set the fan speed to 100%.
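To make this concrete, here is a minimal sketch of the kind of fan-control loop you could run alongside training. It is only an illustration under several assumptions: Coolbits is already enabled in xorg.conf, an X session is running, and the driver exposes the GPUFanControlState/GPUTargetFanSpeed attributes (older drivers name the fan attribute differently), so check the available attributes with nvidia-settings on your own system first.

```python
# Sketch only: poll the GPU temperature and adjust the fan speed via nvidia-settings.
# Assumes Coolbits is enabled (Option "Coolbits" "12" in xorg.conf), a running X
# session, and a driver that exposes GPUFanControlState/GPUTargetFanSpeed.
import subprocess
import time

def gpu_temperature(gpu_id=0):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits", "-i", str(gpu_id)])
    return int(out.decode().strip())

def set_fan_speed(percent, fan_id=0):
    # Enables manual fan control on GPU 0 and sets the target speed in percent.
    subprocess.check_call(
        ["nvidia-settings",
         "-a", "[gpu:0]/GPUFanControlState=1",
         "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (fan_id, percent)])

while True:
    temp = gpu_temperature()
    # Keep the fan as quiet as possible, but ramp up before the 80 C throttle point.
    if temp >= 78:
        set_fan_speed(100)
    elif temp >= 70:
        set_fan_speed(80)
    else:
        set_fan_speed(50)
    time.sleep(30)
```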
Yi says
Thanks a lot for your reply, it helps a lot.
Spuddler says
Keep in mind that running your fans at 100% constantly will wear out the fans much faster – although that is better than a dead GPU chip. It can be difficult to find cheap replacement fans for some GPUs, so you should look for cheap ones on Alibaba etc. and have a few spares lying around in advance, since shipping from China takes weeks.
Also, when a fan stops running smoothly, you can usually just buy cheap “ball bearing oil” ($4 on eBay or so) and remove the sticker on the front side of the fan. There will be some tiny holes beneath into which you can simply squirt some of the oil, and most likely the fan will run as good as new. This has worked for me so far.
Dorje says
Hi Tim, THANKS for such a great post! and all these responses!
I got a question:
What if I buy a TX1 instead of buying a computer?
I will do video or CNN image classification sorts of things.
Cheers,
Dorje
Tim Dettmers says
Hi Dorje,
I also thought about buying a TX1 instead of a new laptop, but then I opted against it. The overall performance of the TX1 is great for a small, mobile, embedded device, but not so great compared to desktop GPUs or even laptop GPUs. There might also be issues if you want to install new hardware because it might not be supported by the Ubuntu for Tegra OS. I think in the end the money is better spent on a small, cheap laptop plus some credit for GPU instances on AWS. Soon there will also be high-performance instances (featuring the new Pascal P100), so this would also be a good choice for the future.
Chip Reuben says
My guess is that (if done right) the monitor functionality gets relegated to the integrated graphics capability of the motherboard. Just don’t try to stream high-res video while training an algorithm.
Steven says
Ooops – I should have mentioned that the motherboard I’m using is an ASRock Fatal1ty X99 Professional/3.1 EATX LGA2011-3. It doesn’t have an integrated graphics chip.
Steven says
Hi Tim,
This post was amazingly useful for me. I’ve never built a machine before and this feels very much like jumping in the deep end. There are two things I’m still wondering about:
1. If I’m using my GPU(s) for deep learning, can I still run my monitor off of them? If not should I get some (relatively) cheap graphics card to run the monitor, or do something else?
2. Do you have any opinion about Intel’s i7-4820K CPU vs. the i7-5820K CPU? There seems to be a speed vs. cache size & cores trade-off here. My impression is that whatever difference there is will be small, but the larger cache size should lead to fewer cache misses, which should be better. Is this accurate?
Thanks
Steven says
Was just reading through the Q/A’s here and saw your response to Rohit Mundra (2015-12-22) answered my first question.
Sorry for the repeat….
Tim Dettmers says
No problem, I am glad you made the effort to find the answer in the comment section. Thanks!
Matt says
Everyone seems to be using an Intel CPU, but they seem prohibitively expensive if actual clock speed or cache isn’t that important… Would an AMD CPU with 38-lane support work just as well paired with two GPUs?
Also, have you experimented with builds using two different GPUs?
Tim Dettmers says
Yes, an AMD CPU should work just as well with 2 GPUs as an Intel one. However, using two different GPUs will not work if they have different chipsets (GTX 980 + GTX 970 will not work); what will work is different vendors (EVGA GTX 980 + ASUS GTX 980 will work with no problems).
Matt says
I see – thanks! I’m considering just getting a cheaper GPU to at least get my build started and running and then picking up a Pascal GPU later. My plan was to use the cheaper GPU to drive a few monitors and use the Pascal card for deep learning. That kind of setup should be fine, right? In other words, there is only an issue with two different cards if I try to use them both in training, but I’m essentially using just a single GPU for it.
David Laxer says
Hi,
Thanks for this post. Are there any Cloud solutions yet?
I used Amazon g2.2xlarge as well as g2.8xlarge as Spot Instances,
however, the GPUs are old, don’t support the latest CUDA features and spot prices
have increased.
Tim Dettmers says
There are also some smaller providers for GPUs but their prices are usually a bit higher. Newer GPUs will also be available via Microsoft Azure N-series sometime soon, and these instances will provide access to high-end GPUs (M60 and K80). I will look into this issue next week when I update my GPU blog post.
David Laxer says
Can you recommend a good box which supports:
1. multiple GPUs for deep learning (say the new Nvidia GP100),
2. additional GPU for VR headset,
3. additional GPU for large monitor?
Thanks!
Xiao says
Hi Tim,
Thanks for the post! Very helpful. Was just wondering what editor (monitor in the center) did you use in the picture showing the three monitors?
Tim Dettmers says
That is an AOC E2795VH. Unfortunately they are not sold anymore. But I think any monitor with a good rating will do.
Razvan says
Hey Tim,
Awesome article. Was curious whether you have an opinion on the Tesla M40 as well.
Looks suspiciously similar to the Titan X.
Think the “best DL acceleration” claim might be a bit of a marketing gamble?
Cheers,
–Razvan
Tim Dettmers says
This post is getting slowly outdated and I did not review the M40 yet — I will update this post next week when Pascal is released.
To answer your question, the Titan X is still a bit faster with 336 GB/s while the M40 sports 288 GB/s. But the M40 has much more memory which is nice. But both cards will be quite slow compared to the upcoming Pascal.
Chip Reuben says
Wow, I am super glad I read this response. Based on your comment about the Pascal vs. the Titan X, I was able to place the development of my system on hold, just in time! I was going to get a Titan X. But now I will want to know if it will be much better to get the Pascal with 32 GB of dedicated RAM (VRAM?) vs. the 12 GB of the Titan X. http://www.pcworld.com/article/2898175/nvidias-next-gen-pascal-gpu-will-offer-10x-the-performance-of-titan-x-8-way-sli.html
Do you have specific information that suggests it will be one more week before the Pascals are available? How much do you think the 1080 will be (in USD, Euros, etc.)?
Chip Reuben says
The Pascal P100 won’t even be available to most of us until later this year at the soonest (http://wccftech.com/nvidia-pascal-gpu-gtc-2016/) and it isn’t even in the same league as the Titan X. They haven’t said anything about the 10xx’s, so I’m assuming they will be quite a while yet also?
Chip Reuben says
Thanks for the great answers. Do you think that one Titan with 12 GB of memory is better than, say, two GTX 980s, or two of the upcoming Pascals (xx80s)? I currently have a system designed with a motherboard that has the additional PCIe lanes, but (as I’ve been told by the Puget Systems people) adding a second GPU would slow things down by 2x. So I thought “just get the Titan w/ 12 GB of memory and be done with it.” Do you think that sounds ok? Or do I upgrade the motherboard? I’m thinking that the Titan may be more than I ever need, but unfortunately I do not know. Thank you for your great help and thorough work.
Yi Zhu says
Hi Tim,
Thanks for the great post. I am a graduate student and would like to put together a machine. But if I put up a system with an i7-5930K CPU, Asus X99 Deluxe MOBO and two Titan X GPUs for now, will the Pascal GPUs be compatible with this configuration? Can I just simply plug in a Pascal GPU when it is released? Thanks a lot.
Tim Dettmers says
As far as I understand there will be two different versions of the NVLink interface, one for regular cards and one for workstations. I think you should be alright with your hardware, although you might want to wait for a bit since Pascal will be announced soon and probably ship in May/June.
Hehe says
Why is the AWS g2.8x not enough?
It says 60 GiB (approx. 64 GB) of GPU memory.
Thanks
Tim Dettmers says
The 60GB refers to the CPU memory that the AWS g2.8x has. The GPU memory is 4GB per card.
Chip says
“CPU and PCI-Express. It’s a trap!”
I have no idea what that is supposed to mean. Does that mean I avoid PCI express? Or just certain Haswells? What is the point here?
Tim Dettmers says
Certain Haswells do not support the full 40 PCIe lanes. So if you buy a Haswell make sure it supports them if you want to run multiple GPUs.
Phong says
You say the GTX 680 is appropriate for convnets; however, I see the GTX 680 just has 2GB RAM, which is inadequate for most convnets such as AlexNet and of course the VGG variants.
Tim Dettmers says
There is also a 4GB GTX 680 variant which is quite okay. Of course a GTX 980 with 6GB would be better, but it is also way more expensive. However, I would recommend one GTX 980 over multiple GTX 680s. It is just not worth the trouble to parallelize on these rather slow cards.
Chip says
Hi Tim,
Thanks for this excellent primer. I am trying to put a parts list together and have this so far (http://pcpartpicker.com/p/JnC8WZ), but it has 2 incompatibility issues. Basically, I want to be working through this 2nd Data Science Bowl (https://www.kaggle.com/c/second-annual-data-science-bowl) as an exercise. I will likely work with a lot of medical image data. Also, I will use this system as an all-purpose computer too (for medical writing), so I’m wondering if I also need to add the USB, HDMI, and DVI connects (I currently also use an Eizo ColorEdge CG222W monitor). Also, I like the idea of 2 hard drives, one for Windows and one for Linux/Ubuntu (or I could partition?). Finally, I use a wireless connection, hence that choice. I would be most grateful if you could help with the 2 incompatibilities, any omissions, and seeing if this system would generally be ok. Thank you in advance for your time.
Tim Dettmers says
You can resolve the compatibility issue by choosing a larger mainboard. A larger mainboard should give you better RAM voltage and also fixes the PCIe issue. Although the GTX 680 might be a bit limiting for training state of the art models, it is still a good choice to learn on the Data Science Bowl dataset. Once Pascal hits the market you can easily upgrade and will be able to train all state-of-the-art networks easily and quickly.
Chip says
Thank you for this response. I had the GTX 980 selected (in the pcpartpicker permalink), but I may well just wait for the Pascal that you suggested. I read this article (http://techfrag.com/2016/03/18/nvidia-pascal-geforce-x80-x80ti-gp104-gpu-supports-only-gddr5-memory/), however, and I must admit I’m quite confused by the names, the relationship of “Pascal” to the GeForce X80, X80Ti & Titan specs, and also the concern with respect to GDDR5 vs. GDDR5X memory. Is it worth it to wait for one of the GeForce cards (which I assume is the same as Pascal?) rather than just moving forward with the GTX 980? Will one save money by way of sacrificing something with respect to memory? Please forgive my neophyte nature with respect to systems.
Tim Dettmers says
Pascal will be the new chip from NVIDIA which will be released in a few months. It should be designated as GTX 10xx. The xx80 refers to the most powerful consumer GPU model of a given series, e.g. the GTX 980 is the most powerful of the 900 series. The GTX Titan is usually the model for professionals (deep learning, computer graphics for industry and so forth).
And yes I would wait for Pascal rather than buy a GTX 980. You could buy a cheap small card and sell it once Pascal hits the market.
Wajahat says
Hi Tim
Thanks a lot for your article. It answered some of my questions. I am actually new to deep learning and know almost nothing about GPUs. But I have realized that I need one. Can you comment on the expected speedup if I run ConvNets on a Titan X rather than an Intel Core i7-4770 at 3.4 GHz?
Even a vague figure would do the job.
Best Regards
Wajahat
Tim Dettmers says
It depends highly on the kind of convnet you want to train, but a speedup of 5-15x is reasonable. However, if you can wait a bit I recommend waiting for the Pascal cards which should hit the market in two months or so.
viper65 says
Thank you. But considering the size of the memory and the brand, I am afraid the price of Pascal would be far beyond my budget.
viper65 says
Nice article!
What do you think about HBM? Apart from the size of the RAM, do you think that the Fury X has any advantage compared to the 980 Ti?
Tim Dettmers says
The Fury X definitely has the edge over the GTX 980 Ti in terms of hardware, though in terms of software AMD still lags behind. This will change quite dramatically once NVIDIA Pascal hits the market in a few months. HBM is definitely the way to go to get better performance. However, NVIDIA’s HBM offers double the memory bandwidth of the Fury X, and Pascal will also allow for 16-bit computations which effectively doubles the performance further. So I would not recommend getting a Fury X, but instead to wait for Pascal.
Bobby says
How soon do you think the flagship Pascal card, like the Titan X, will be on the market? I am not sure if I should wait. Thank you.
hroent says
Hi Tim — Thanks for this article, I’ve found it extremely useful (as have others, clearly).
You’re probably aware of this, but the new Titan X Pascal cards have very weak FP16 performance.
Tim Dettmers says
Yes the FP16 performance is disappointing. I was hoping for more, but I guess we have to wait until Volta is released next year.
Freddy says
Hey Tim,
first of all thank you very much for your great article. It helped me a lot to gain some insight into the hardware requirements for a DL machine. Over the past several years I only worked with laptops (in my free time) as I had some good machines at work. Now I am planning to set up a system at home to start experimenting on some stuff in my free time. After I read your post and many of the comments I started to create a build (http://de.pcpartpicker.com/p/gdNRQ7), and as you have looked over so many systems and given advice I hoped that you could maybe do it once again 😉
I chose the 970 as a starter, and will then wait for the Pascal cards coming out later this year. I am also not planning to work with more than 2 GPUs in the future at home. As for the monitor, I already have one 24″ at home, so this will just be the 2nd.
I dunno, maybe you can look over it and give me some advice or your opinion.
Tim Dettmers says
Looks like a solid build for a GTX 970, and after an upgrade to one or two Pascals this is looking very good as well.
Freddy says
Thanks for the time you are spending giving so many people advice. It is/was quite hard for me after so many years of laptop use to dive back into hardware specifics. You made it a lot easier with your post. Big thanks again!
Lawrence says
Hi Tim,
Great website! I am building a Devbox, https://developer.nvidia.com/devbox.
My machine has 4 Titan X cards, a Kingston Digital HyperX Predator 480 GB PCIe Gen2 x4 SSD, an Intel Core i7-5930K Haswell-E, and 64GB of G.SKILL RAM. I am using an ASUS RAMPAGE V Extreme motherboard. When I place the last Titan X card in the last slot, my SSD disappears from the BIOS. I am not sure if I have a PCIe conflict. Can the M.2 interfere with PCIE_X8_4? What should I do to fix this issue? Should I change the motherboard? Any advice?
Tim Dettmers says
Your motherboard only supports 40 PCIe lanes, which is standard, because CPUs only support a maximum of 40 PCIe lanes. Your 4 Titan X will run in 16x/8x/8x/8x lane mode. You might be able to switch the first GPU to 8x manually, but even then CPUs and motherboards usually do not support a 8x/8x/8x/8x/8x mode (usually two PCIe switches are supported for a single GPU, and a single PCIe switch supports two devices, so you can only run 4 PCIe devices in total). This means that there is probably no possibility to get your PCIe SSD working with 4 GPUs. I might be wrong. To check this it is best to contact your ASUS tech support and ask them if the configuration is possible or not.
Bobby says
Hi Tim,
Thank you for the wonderful guide.
Like Lawrence, I’m also building a GPU workstation using https://developer.nvidia.com/devbox as the guide. It mentions “512GB PCI-E M.2 SSD cache for RAID”. I wonder how to set up this SSD as the cache for RAID, since RAID 5 does not support this as far as I know. Have you done anything similar? Thank you very much.
Tim Dettmers says
Hi Bobby,
I have no experience with RAID 5, since usual datasets will not benefit from increased read speeds as long as you have an SSD. I think you will need to change some things in the BIOS and then set up a few things for your operating system with a RAID manager. I think you will be able to find a tutorial for your OS online so you can get it running.
Bobby says
Hi Tim,
It seems it’s not related to the RAID. I wonder how to set up an SSD as the cache for a normal HDD. Setting it as the cache for RAID should be similar. With this, I may not need to manually copy my dataset from HDD to SSD before an experiment. Thank you.
Alex Blake says
Hi Tim:
Thanks so much for sharing your knowledge!
I’ve seen you mention that Ubuntu is a good OS…
what is the best OS for deep learning?
What is a good alternative to Ubuntu?
I’d really appreciate your thoughts on this…
Tim Dettmers says
Linux-based systems are currently best for deep learning since all major deep learning software frameworks support Linux. Another advantage is that you will be able to compile almost anything without problems, while on other systems (Mac OS, Windows) there will always be some issues or it may be nearly impossible to configure the system well.
Ubuntu is good because it is widely used, easy to install and configure, and it has support for its LTS versions, which makes it attractive for software developers who target Linux systems. If you do not like Ubuntu you can use Kubuntu or other X-buntu variants; if you like a clean slate and want to configure everything the way you like, I recommend Arch Linux, but be aware that it will take a while until you have configured everything the way that suits you.
JB says
Tim,
First of all, thank you for writing this! This post has been extremely helpful to me.
I’m thinking about getting a GTX 970 now and upgrading to Pascal when it comes out. So, if I never use more than 3.5GB of VRAM at a time, then I won’t see performance hits, correct? I’m building my rig for deep reinforcement learning (mostly Atari right now), so my minibatches are small (<2MB), and so are my convnets (<2 million weights). Should I be fine until Pascal?
I’m trying to decide between these two budget builds: [Intel Xeon e5](http://pcpartpicker.com/p/dXbXjX) and [Intel i5](http://pcpartpicker.com/p/ktnHdC). I’m thinking about going with the Xeon, since it has all 40 PCIe lanes if I wanted to do more than two GPUs in the future, and it’s a beefier processor. However, I start grad school in the fall, so I’d have university hardware then, and I think I’d be more than fine with two GPUs for personal experiments in the future. (Or could 4 lanes be enough bandwidth for a GPU?) If I get the i5 I could upgrade the processor without having to upgrade the motherboard if I wanted. The processor just needs to be good enough to run (Atari) emulations and preprocess images right now. I can’t really imagine anything but the GPU being the bottleneck, right?
Thank you for the help. I'm trying to figure out something that will last me awhile, and I'm not very familiar with hardware yet.
Thanks again,
– JB
Tim Dettmers says
Hi JB,
the GTX 970 will perform normally if you stay below 3.5GB of memory. Since your mini-batches are small and you seem to have rather few weights, this should fit quite well into that memory. So in your case the GTX 970 should give you optimal cost/performance.
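If you want to verify that you actually stay below the 3.5GB threshold while training, a small monitoring script along these lines can help. It is only a sketch and assumes nvidia-smi is installed; the 3584 MiB cutoff is specific to the GTX 970.

```python
# Sketch: log GPU memory usage every few seconds while a model trains,
# so you can check that you stay under the GTX 970's fast 3.5 GB segment.
import subprocess
import time

def gpu_memory_used_mib(gpu_id=0):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(gpu_id)])
    return int(out.decode().strip())

if __name__ == "__main__":
    while True:
        used = gpu_memory_used_mib()
        flag = "  <-- above the fast 3.5 GB segment!" if used > 3584 else ""
        print("GPU memory used: %d MiB%s" % (used, flag))
        time.sleep(5)
```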
Rohit Mundra says
Hey Tim,
Thanks for the great article; I have a more specific question though – I’m building an entry-level Kaggle-worthy system using an i7-5820K processor. Since I want to keep my GTX 960’s 4GB memory solely for deep learning, would you recommend I buy an additional (cheaper) graphics card for display or not? I’m considering the GT 610 for this purpose since it’s cheap enough. Also, if I were to do this, where would I specify such a setting (e.g. use GT 610 for display)?
Thanks again!
Rohit
Tim Dettmers says
For most datasets on Kaggle your GPU memory should be okay, and using another small GPU for your monitors will not do much. However, if you are doing one of the deep learning competitions and you find yourself short on memory and you think you could improve your score by using a model that is a bit larger, then this might be worth it. So I would only consider this option if you really encounter problems where you are short on memory.
Also remember that the memory requirements of convolutional nets increase most quickly with the batch size, so going from a batch size of 128 to 96 or something similar might also solve memory problems (although this might also decrease your accuracy a bit; it’s all quite dependent on the data set and problem). Another option would be to use the Nervana Systems deep learning libraries which can run models in 16-bit, thus halving the memory footprint.
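As a rough illustration of how strongly the batch size drives memory use, here is a back-of-the-envelope sketch. The feature-map shapes are made-up placeholders, not measurements of any particular network, so plug in your own architecture's numbers.

```python
# Back-of-the-envelope sketch: how activation memory scales with batch size.
def activation_memory_mb(batch_size, feature_map_shapes, bytes_per_float=4):
    """feature_map_shapes: list of (channels, height, width) per conv layer output."""
    floats = sum(c * h * w for (c, h, w) in feature_map_shapes) * batch_size
    return floats * bytes_per_float / 1024.0 ** 2

# Hypothetical VGG-like stack of feature maps for 224x224 inputs (illustration only).
shapes = [(64, 224, 224), (128, 112, 112), (256, 56, 56), (512, 28, 28), (512, 14, 14)]

for bs in (128, 96, 64):
    print("batch size %3d -> ~%.0f MB of activations (more with gradients)"
          % (bs, activation_memory_mb(bs, shapes)))
# Activations (and their gradients) scale linearly with batch size, which is why
# dropping from 128 to 96 can be enough to fit a model into GPU memory.
```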
Fusiller says
Just a quick note to say thank you and congrats for this great article.
Very nice of you to share your experience on the matter.
Regards.
Alex
Tim Dettmers says
Thank you! I am happy that you found the article helpful!
Eystein says
Hello! First off, I just want to say this website is a great initiative!
I’m going to use Kaldi for speech recognition next spring in my master’s thesis. Not knowing exactly what type of DNNs I’ll be implementing, I’m planning for an all-round solid, budget GPU. Is the GTX 950 with 2 GB suitable (I haven’t seen this mentioned here)? It only requires a 350 W PSU, which is why I’m considering it. Also I have a Q6600 CPU and a motherboard that has 4 GB RAM as a max, so this is a bit constraining on the overall performance of this setup. And apologies if this is too general a question. I’m just now getting into the field 🙂
Tim Dettmers says
The GTX 950 2GB variant might be a bit short on RAM for speech recognition if you use more powerful models like LSTMs. The cheapest solution might be to prototype on your CPU and use AWS GPU instances to run the model if everything looks good. This way you need no new computer/PSU and will be able to run large LSTMs and other models. If this does not suit you, a GTX 950 with 4GB of memory might be a good choice.
Eric says
Tim,
Thank you for the many detailed posts. I am going with a one-GPU water-cooled Titan X solution based on the information here. Does it still hold true that adding a second GPU will allow me to run a second algorithm, but that it will not increase performance if only one algorithm is running? Best Regards – Eric
Tim Dettmers says
There are now many good libraries which provide good speedups for multiple GPUs. Torch7 is probably the best of them. Look for the Torch7 Facebook extensions and you should be set.
BK says
Hi Tim,
Great post; In general all of the content on your blog has been fantastic.
I’m a little curious about your thoughts on other types of hardware for use in deep learning. I’ve heard a number of people suggest FPGAs to be potentially useful for deep learning (and parallel processing in general) due to their memory efficiency vs. GPUs. This is often mentioned in the context of Xeon Phi… what are your thoughts on this? If true, where does the usefulness lie: in the ‘training’ or ‘scoring’ part of deep learning (my perhaps incorrect understanding was that GPUs’ advantage was their use for training as opposed to scoring)?
My apologies for what I’m certain are sophomoric questions; I’m trying to wrap my head around these matters as someone new to the subject!
Regards,
BK
Tim Dettmers says
Nonsense, these are great questions! Keep them coming!
FPGAs could be quite useful for embedded devices, but I do not believe they will replace GPUs. This is because (1) their individual performance is still worse than an individual GPU and (2) combining them into sets of multiple FPGAs yields poor performance while GPUs provide very efficient interfaces (especially with NVLink which will be available at the end of 2016). GPUs will make a very big jump in 2016 (3D memory) and I do not think FPGAs will ever catch up from there.
Xeon Phi is potentially more powerful than GPUs, because it is easier to optimize at the low level. However, it lacks the software for efficient deep learning (just like AMD cards) and as such it is unlikely that one will see Xeon Phis used for deep learning in the future (unless Intel creates a huge deep learning initiative that rivals the initiative of NVIDIA).
BK says
Thanks for the response! That’s very interesting.
I wanted to follow up a little bit regarding software development for NVIDIA vs. Intel or AMD. I know how much more developed the CUDA libraries are than OpenCL when it comes to deep learning. What frameworks can I actually run with an Intel or AMD architecture? Do Torch/Caffe/Theano only work on NVIDIA hardware? Once again, my apologies if I’m fundamentally misunderstanding something.
One last question, beyond the world of deep learning: what is the perception of Xeon Phi? It seems hard to find people who are talking with certainty as to what its strengths/applications will be. Is there any consensus on this? What do you think makes the most sense as an application for Xeon Phi?
Many thanks!
-BK
Greg says
Hey Tim…
Do you have any suggestions for a tutorial for DL using Torch7 and Theano and/or Keras?
Thanks
Greg
Nghia Tran says
Hi Tim,
Thank you very much for all the writing. I am an Objective-C developer but a complete newbie to the deep learning thing and very interested in this area right now.
I have a Mac 3.1 and I would like to upgrade the graphics card to get CUDA so I can run Torch7, Lua and nn and learn this kind of programming. I don’t mind whether this is a Mac card or a Windows card.
Which one would you recommend? GTX 780 Ti? GTX 960 2GB? GTX 980? Tesla M2090 (second hand)?
Look forward to your advice.
Tim Dettmers says
From the cards you listed the GTX 980 will be the best by far. Please also have a look at my GPU guide for more info how to choose your GPU.
Nghia Tran says
Thank you very much. I got a generous sponsor to build up a new Ubuntu machine with 2 GTX 780 Ti. Should I use the GTX 980 in the new machine to yield better performance than the SLI GTX 780 Ti, or let it stay in my Mac?
Tim Dettmers says
If you already have the two GTX 780 Ti I would stick with that and only change/add the GPU if you experience RAM shortage for one of your models.
Nghia Tran says
Thank you very much Tim. I am looking forward to your further writing.
By the way, have you had time to look at the neurosynaptic chip from IBM yet? I would be really interested in your “deep analysis” of this as well.
Brent Soto says
Hi Tim, The company that I buy my servers from (Thinkmate) recently sent me an e-mail advertising that they’ve been working with Supermicro to sell servers with support for Titan X. What do you think about this solution? I’ve had a lot of luck with Supermicro servers, and they offer 3 year warranty on the Titans and will match the price if found cheaper elsewhere. Here’s the link: http://www.thinkmate.com/systems/servers/gpx/gtx-titan-x
Tim Dettmers says
Hi Brent, I think in terms of the price, you could definitely do better on the 1U model with 4 GTX Titan X. A normal board with 1 CPU will not have any disadvantage compared to the 1U model for deep learning.
However, the 4U model is different because it can use 8 GTX Titan X with a fast CPU-to-CPU switch which makes parallelization across 8 GPUs easy and fast. There are only a few solutions available that are built like this and come with 8 GTX Titan X — so while the price is high, this is a rather unique and good solution.
Greg says
Yes, I did the BIOS flash in the beginning.
Lastly, I kept testing and found the culprit… when installing CUDA I can’t install the 502 driver that it comes with or the Ubuntu system locks me out with an unknown password, no matter how many different ways I try to install the CUDA driver. I scoured the internet for a solution and there wasn’t one, and it looks like no one has put 2 and 2 together about the CUDA driver. It could be a combo of things, both hardware and software, but it definitely involves this driver, the X99 motherboard, a Titan X and Ubuntu 14.04 and 15.04.
Thanks.
Greg says
Hi Tim..
Recently I have had a ton of trouble working with Ubuntu 14.04 … installing CUDA, Caffe etc. Ubuntu has password-locked me out of my system twice, and getting all the dependencies installed so that Caffe builds has been a real problem. It works sometimes … other times it doesn’t. Ubuntu 14.04 is clearly an unstable OS.
I would like your opinion, Tim, on moving from Linux to Windows for deep learning. What are your thoughts?
Thanks in advance…
-Greg
Tim Dettmers says
I can feel your pain — I have been there too! Ubuntu 14.04 is certainly not intuitive when you are switching from Windows, and a simple, seemingly harmless command can ruin your installation. However, I found that once you understand how everything is connected in Linux, things get easier, make sense, and you will no longer run into errors which break your installation or even your OS. After this point, programming in Linux will be much more comfortable than in Windows due to the ease of compiling and installing any library. So it may be painful but it is well worth it. You will gain a lot if you go through the painful mile — keep it up!
Greg says
After Ubuntu 14.04 locked me out 3 times on boot with a login screen that rejects my password… I thought I’d try Ubuntu 15.04. I think the CUDA driver slammed Unity, resetting the root password to something other than the password I gave it. I searched the web and this is a common problem and there seems to be no fix.
I’m running an X99 motherboard, an i7-5930, 64 GB RAM, and one Titan X. I’ll get a second Titan X when I’m ready for it. I want to create my own NN and nodes, but for now I have a ton of learning to do and I need to follow what’s been done so far.
Do you use standard libraries and algorithms like Caffe, Torch7 and Theano via Python? I feel I need to wade through everything to see how it works before using it. NVIDIA DIGITS looks pretty simple working from the GUI, but it also looks, from my limited experience, like it’s pretty limited.
Tim Dettmers says
Is this because of your X99 board? I never had any problems like that. As for the software, Torch7 and Theano (Keras and derivatives) work just fine for me. I have tried Caffe once and it worked, but I have also heard some nightmare stories about installing Caffe correctly. NVIDIA DIGITS will be just as you described: simple and fast, but if you want to do something more complex it will just be an expensive fast PC with 4 GTX Titan X.
mxia.mit@gmail.com says
Just to tag onto this, I have an X99-E board and had some problems on the initial install when trying to boot into Ubuntu’s live installer, nothing with the password though. After installing, everything worked fine at the OS level. In case this is relevant, reflashing to the latest BIOS helped a lot, but probably won’t help your password problem.
Cheers and best of luck!
Mike
Safi says
Hi Tim,
First, thanks a lot for these interesting and useful topics. I am a PhD student; I work on evolutionary ANNs.
I want to start using GPUs; my budget can reach $150 max.
I found in my town a new GTX 750 and a GTX 650 Ti. Which one is better, and are they supported by cuDNN?
Thanks
Tim Dettmers says
A GTX 750 should be better, and both support cuDNN. However, I would also suggest that you have a look at AWS GPU instances. The instance will be a bit faster and may suit your budget well.
ML says
Hello Tim, what about external graphic cards connected through Thunderbolt? Have you looked at those? Could that be a cheap solution without having to build/buy a new system?
Tim Dettmers says
I looked at some performance reviews and they state about 70-90% performance for gaming. For deep learning the only performance bottleneck will be transfers from host to GPU, and from what I read the bandwidth is good (20GB/s) but there is a latency problem. However, that latency problem should not be too significant for deep learning (unless it’s a HUGE increase in latency, which is unlikely). So if I put these pieces of information together, it looks as if an external graphics card via Thunderbolt should be a good option if you have an Apple computer and the money to spare for a suitable external adapter.
Tony says
Tim, thanks again for such a great article.
One concern that I have is that I also use triple monitors for my work setup. However, doesn’t the fact that you’re using triple monitors affect the performance of your GPU? Do you recommend buying a cheap $50 GPU for the triple monitor setup and then dedicating your Titan X or more expensive card primarily to deep learning? I run recurrent neural nets.
Thanks!
Tim Dettmers says
Three monitors will use up some additional memory (300-600MB) but should not affect your performance greatly (< 2% performance loss). I recommend getting a cheap GPU for your monitors only if you are short on memory.
Tony says
Thanks — that makes a lot of sense. I just thought it would affect your bandwidth (as that is usually the bottleneck). I’m currently running the 980 Ti — I know it has 336 GB/s. Good to know that it uses some memory though. Appreciate it.
Michael Holm says
Hello Tim,
Thank you for your article. The deep learning devbox (NVIDIA) has been touted as cutting edge for researchers in this area. Given your dual experience in both the hardware and algorithm sides, I would be grateful to hear your general thoughts on the devbox. I know it came out a few months after you wrote your article.
Thank you!
Colin McGrath says
I just want to thank you again Tim for the wonderful guide. I do have a couple of hardware utilization questions though. I am trying to figure out how to properly partition my space in ubuntu to handle my requirements. I dual boot Windows 10 (for work/school) and Ubuntu 14.04.3 (deep learning) with each having their own SSD boot drive and HDD storage drive. For starters here’s my setup:
– ASRock X99 WS-E
– 1x Gigabyte G1 980 ti
– 16GB Corsair Vengeance RAM 2133
– i7-5930k
– 2x Samsung 850 Pro 256GB SSDs (boot drives)
– 2x Seagate Barracuda 3TB HDDs (storage drives)
My Windows install is fine, but I want to be able to store currently unused data on the HDD, stage batches on the SSD, then send the batches from SSD to RAM to fully leverage the IOPS gain of an SSD.
I currently have Ubuntu partitioned this way; however, I’m not entirely sure this will fit my needs. I’m thinking I might want to allocate /home on the HDD due to how Ubuntu handles the /home directory in the UI, but I’m unsure if that will be a problem for deep learning:
SSD (boot):
– swap area – 16GB
– / – 20GB
– /home – 20GB
– /var – 10GB
– /boot – 512MB
– /tmp – 10GB
– /var/log – 10GB
HDD
– /store 1TB
vinay says
Does anyone know what the requirements would be for prediction clusters? Most articles focus on training aspects, but inference/prediction is also important and the compute demands for it are little discussed. Can anyone comment on compute demands for prediction? Also, what do you recommend: CPU only, CPU+GPU, or CPU+FPGA, etc. for such tasks?
Thanks,
Vinay
Tim Dettmers says
Which solution is suitable depends on many factors. If you build a web application, how long do you want your user to wait for a prediction (response time)? How many predictions are requested per second in total (throughput)?
Prediction is much faster than training, but still a forward pass of about 100 large images (or similar large input data) takes about 100 milliseconds on a GPU. A CPU could do that in a second or two.
If you predict one data point at a time a CPU will probably be faster than a GPU (convolution implementations relying on matrix multiplication are slow if the batch sizes are too small), so GPU processing is good if you need high throughput in busy environments, and a CPU is good for single predictions (1 image should take only about 100 milliseconds with a good CPU implementation). Multiple CPU servers might also be an option, and usually they are easier to maintain and cheaper (AWS spot instances for example, also useful for GPU work). Keep in mind that all these numbers are reasonable estimates only and will differ from the real results; results from a testing environment that simulates the real environment will make it clear whether CPU servers or GPU servers will be optimal.
I do not recommend FPGAs for such tasks since their interfaces are not easy to maintain over time and cloud solutions do not exist (as far as I know).
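To get numbers for your own setup rather than relying on these estimates, a small timing harness like the sketch below can help. The predict function is only a stand-in (a plain matrix multiplication) so the script runs on its own; you would replace it with your real forward pass in whatever framework you use (Torch7, Theano, Caffe, ...).

```python
# Sketch: compare per-image prediction cost at different batch sizes.
import time
import numpy as np

W = np.random.randn(4096, 4096).astype(np.float32)  # stand-in for a model

def predict(batch):
    return batch.dot(W)  # replace with your real forward pass

def per_image_latency_ms(batch_size, repeats=20):
    batch = np.random.randn(batch_size, 4096).astype(np.float32)
    predict(batch)  # warm-up
    start = time.time()
    for _ in range(repeats):
        predict(batch)
    total = (time.time() - start) / repeats
    return 1000.0 * total / batch_size

for bs in (1, 16, 100):
    print("batch size %3d: ~%.2f ms per image" % (bs, per_image_latency_ms(bs)))
# Large batches amortize overhead, which is why GPUs shine for high-throughput
# serving while a CPU can be competitive for one-off predictions.
```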
Sascha says
Hi,
thanks a lot for all this information. After stumbling across a paper from Andrew Ng et al. (“Deep learning with COTS HPC systems”) my original plan was to also build a cluster (to learn how it is done). I wanted to go for two machines with a bunch of GTX Titans, but after reading your blog I settled on only one PC with two GTX 980s for the time being. My first thought after reading your blog was to actually settle for two 960s, but then I thought about the energy consumption you mentioned. Looking at the specifications of the NVIDIA cards I figured the 980 was the most efficient choice currently (at least as long as you have to pay German energy prices).
As I am still relatively fresh to machine learning I guess this setup will keep me busy enough for the next couple of months, probably until the Pascal architecture you mentioned is available (I read somewhere 2nd half of 2016). If not, then I guess I will buy another PC and move one of the 980s into it so that I can learn how to set up a cluster (my current goal is learning as much as possible as fast as possible).
The configuration I went for is as follows:
CPU: Intel i7-5930k (I chose this one instead of the much cheaper 5920 as it has the 40 PCI lanes you mentioned, which gives the additional flexibility of handling 4 graphics cards)
Mainboard: ASRock Fatal1ty X99 Professional (supports up to 4 graphics cards and has a M.2 slot)
RAM: 4×8 GB DDR4-3000
Graphics Card: 2x Zotac GTX 980 AMP! Edition
Hard Disk: Samsung SSD SM951 with 256 GB (thanks to M.2 it offers 2 GB/s of sequential read performance)
Power Supply: be quiet! BN205 with 1200 Watts
I hope that installing Linux on the SSD works, as I read that the previous version of this SSD had some problems.
Thanks again
Sascha
Tim Dettmers says
Hi Sascha! Your reasoning is solid and it seems you have a good plan for the future. Your build is good, but as you say, the PCIe SSD could be a bit problematic to set up. Another fact to be aware of is that your GPUs will have a slower connection with that SSD, because the SSD takes away bandwidth from your GPUs (your GPUs will run at 16x/8x instead of 16x/16x). Overall the PCIe SSD will be much faster for common applications, but slower when you use parallelism on two GPUs, so it might be better to go for a SATA SSD (if you do not use parallelism that much, a PCIe SSD is a solid choice). A SATA SSD will be slower than the PCIe one, but it should still be fast enough for any deep learning task. However, preprocessing will be slower on this SSD, and that is probably the main advantage of the PCIe SSD.
Sascha says
That is an interesting point you make regarding the M.2. I did not realise that this is how the board will distribute the lanes. I figured that as the M.2 only uses 4 lanes the two cards could each run with 16 and if I actually decided to scale up to a quad setup each card eventually would only get 8 lanes.
My first idea after reading the comment was to just try the SSD in the additional M.2 PCIe 2.0 slot, which is basically a SATA 6 connection, but that will not work, as it will not fit because one has the Key B and the other the Key M layout.
Do you have an idea about what this actually means for real life performance in deep learning tasks (like x% slower)?
Greetings
Sascha
Tim Dettmers says
When I think about it again, I might be wrong about what I just said. How two GPUs and the PCIe SSD will work together depends highly on your motherboard, how the PCIe slots are wired, and how the PCIe switches are distributed. I think with a 40-PCIe-lane CPU and a mainboard that supports a 16x/16x/8x layout, it should be possible to use 16 lanes for each of your GPUs and 8 lanes for your SSD; to use that setup you only need to make sure to plug everything into the right slot (your mainboard manual should state how to do this). I have not looked at your hardware in detail, but I think your hardware supports this.
If your motherboard does not support 16x/16x/8x, then your GPU parallelism will suffer from that. Convolutional nets will have a penalty of 5-15% depending on the architecture, recurrent networks may have a low or no penalty (LSTMs) or a high penalty (20-50%) for recurrent nets with many parameters like vanilla RNNs.
Colin McGrath says
What are your opinions on RAID setups in a deep learning rig? Software-based RAID is pretty crappy in my experience and can cause a lot more problems than it solves. However, RAID controllers take a PCIe slot, which will, fortunately/unfortunately, all be taken by 4 x Gigabyte GTX 980 Ti cards. Is it worth running RAID with the software controller? Or is it better just to do full clone backups?
Tim Dettmers says
I do not think it is worth it. Usually, a common SATA SSD will be fast enough for most kinds of data; in some cases there will be a decrease in performance because the data takes too long to load, but compared to the effort and money spent on a (hardware) RAID system it is just not worth it.
mxia.mit@gmail.com says
Hey Tim,
Thank you so much for this great writeup, it’s been pivotal in helping me and my co-founder understand the hardware. We’re a duo from MIT currently working on a venture backed startup bringing deep learning to education, hoping to help at least improve, if not fix, the US education system.
Our first build is aiming to be cheap where it can (since both of us are beginners and we need to be frugal with our funding) but future proof enough for us to do harder things.
My current build consists of these parts:
Mobo: Asus X99-E WS SSI CEB LGA2011-3 Motherboard
CPU: Intel Core i7-5820K 3.3GHz 6-Core Processor
Video Card: EVGA GeForce GTX 960 4GB SuperSC ACX 2.0+ Video Card
PSU: EVGA 850W 80+ Gold Certified Fully-Modular ATX Power Supply
RAM: Corsair Vengeance LPX 16GB (2 x 8GB) DDR4-3000
Storage: Sandisk SSD PLUS 240GB 2.5″ Solid State Drive
Case: Corsair Air 540 ATX Mid Tower Case
Could you look over these and offer any critique? My logic was to have a mobo and CPU that could handle upgrading to better hardware later; things like the PSU, RAM, and the 960 I’m willing to replace later on.
Thank you in advance! Also is there a way we could exchange emails and chat more?
Would love any advice we can get from you while we build out our product.
Best,
Mike Xia
Tim Dettmers says
Looks good. The build is a bit more expensive due to the X99 board, but as you said, that way it will be upgradeable in the future which will be useful to ensure good speed of preprocessing the ever-growing datasets. You are welcome to send me an email. My email is firstname.lastname@gmail.com
Carles Gelada says
I have been looking for an affordable CPU with 40 lanes without luck. Could you give me a link?
I am also curious about the actual performance benefit of 16x vs 8x. If the bottleneck is the DMA writes, will the performance be cut in half?
Tori says
Thank you so much for such informative article!
How would GTX Titan Z compare to GTX Titan X for the purpose of training a large CNN? Do you think it’s worth the money to buy a GTX Titan Z or is a GTX Titan X good enough? Thanks!
Tim Dettmers says
A GTX Titan X will be much better in most cases. If you want more details have a look at my answer to this question on Quora.
Peter says
Hi Tim,
Firstly, thanks for this article; it’s extremely informative (in fact your entire blog makes fascinating reading for me, since I’m very new to neural networks in general).
I want to get a more powerful GPU to replace my old GTX 560 Ti (a great little card, but 1GB of memory is really limiting and I presume it’s pretty slow these days too). Sadly I cannot really afford the GTX Titan X (as much as I’d like to, 1300 CAD is too damn high). The 980 Ti is also a bit on the high end, so I’m looking at the 980, since it’s about 200 CAD cheaper. My question is: how much performance am I gaining going from my old 560 Ti to a 980/980 Ti/Titan X? Is the difference in speed even that large? If it’s worth saving for the bigger card then I’ll just have to be patient.
I’m currently running Torch7 and an LSTM-RNN with batches of text, not images, but if I want to do image learning I assume I’d want as much RAM as possible?
Cheers 🙂
Tim Dettmers says
The speedup should be about 4x when you go from a GTX 560 Ti to a GTX 980. The 4GB of RAM on the GTX 980 might be a bit restrictive for convolutional networks on large image datasets like ImageNet. A GTX Titan X or GTX 980 Ti will only be 50% faster than a GTX 980. If you wait about 14-18 months you can get a new Pascal card which should be at least 12x faster than your GTX 560 Ti. I personally would value getting additional experience now as more important than getting less experience now and faster training in the future — or in other words, I would go for the GTX 980.
Peter says
How exactly would I be restricted by the 4GB of ram? Would I simply not be able to create a network with as many parameters, or would there be other negative effects (compared to the 6GB of the 980 Ti)?
You’ve mentioned in the past that bandwidth is the most important aspect of the cards, and the 980 Ti has 50% higher bandwidth than the regular 980; would that mean it’s 50% faster too, or are there other factors involved?
Tim Dettmers says
Yes, that’s correct: if your convolutional network has too many parameters it will not fit into your RAM. Other factors besides memory bandwidth only play a minor role, so indeed, it should be about 50% better performance (not the 33% I quoted earlier; I edited this for correctness just now).
howtobeahacker says
Hi Tim,
I have a minor question related to the 6-pin and 8-pin power connectors. It is related to your sentence “One important part to be aware of is if the PCIe connectors of your PSU are able to support a 8pin+6pin connector with one cable”.
My workstation has one 8-pin cable that splits into TWO 6-pin connectors. Is it possible to plug in these two 6-pin connectors to power up a Titan X, which requires a 6-pin and an 8-pin power connector? I think I will try it, because I want to plug in 2 Titan X GPUs and only this way can my workstation support two GPUs.
Thank you so much.
@An
Tim Dettmers says
I think this will depend somewhat on how the PSU is designed, but I think you should be able to power two GTX Titan X with one double 6-pin cable, because the design makes it seem that it was intended for just that. Why would they put two 6-pin connectors on a cable if you cannot use them? I think you can find better information if you look up your PSU and see if there is a documentation, specification or something like that.
howtobeahacker says
Hi Tim,
Thanks for your responses. I read your posts and I remember an image of a piece of software in Ubuntu to visualize the state of the GPUs, something similar to the Task Manager for CPUs. If you have any information, please let me know.
Florijan Stamenković says
Hi Tim!
We’ve already asked you for some advice, and it was helpful… We put together a dev-box in the meantime, with 4 Titans inside; it works perfectly.
Now we are considering production servers for image tasks. One of them would be classification. Considering the differences between training and runtime (runtime handles a single image, forward prop only), we were wondering if it would be more cost effective to run multiple weaker GPUs, as opposed to fewer stronger ones… We are reasoning that a request queue consisting of single-image tasks could be processed faster on two separate cards, by two separate processes, than on a single card that is twice as fast. What are your thoughts on this?
We’ve run very crude experiments, comparing classification speed of a single image on a Titan machine vs. 960M-equipped laptops. The results were more or less as we expected: Titans are faster, but only about 2x, whereas Titans are 4x more expensive than a GTX 960 (which has significantly more GFLOPS than the 960M). In absolute terms, classification speed on a weaker card is acceptable; we’re wondering about behavior under heavy load.
F
Tim Dettmers says
Hi Florijan!
I think in the end this is a numbers game. Try to overflow a GTX 960M and a Titan with images, see how fast they go, and compare that with how fast you need to be. Additionally, it might make sense to run the runtime application on CPUs (it might be cheaper and more scalable to run them on AWS or something) and only run the training on GPUs. I think a smart choice will take this into account, along with how scalable and usable the solution is. Some AWS CPU spot instances might be a good solution until you see where your project is headed (that is, if a CPU is fast enough for your application).
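A crude version of this stress test could look like the sketch below. The classify function is only a placeholder (a plain matrix multiplication) standing in for your real single-image forward pass; the script simply measures sustained single-image requests per second over a fixed window, which you can then compare across cards.

```python
# Sketch: "overflow the card with images" stress test measuring sustained
# single-image throughput over a fixed time window.
import time
import numpy as np

W = np.random.randn(4096, 4096).astype(np.float32)  # stand-in for a model

def classify(image):
    return image.dot(W)  # replace with your real single-image classifier

def sustained_throughput(seconds=10.0):
    image = np.random.randn(1, 4096).astype(np.float32)
    done = 0
    start = time.time()
    while time.time() - start < seconds:
        classify(image)
        done += 1
    return done / (time.time() - start)

print("~%.1f single-image requests/second" % sustained_throughput())
# Run the same script on the Titan machine and the 960M laptop and compare
# requests/second per dollar to decide which card to deploy.
```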
Florijan Stamenković says
Tim,
Thanks for your reply. You’re right, it definitively is a numbers game, I guess we will simply need to stress-test.
We already tried to run our classifier on the CPU, but classification time was an order of magnitude slower than on the 960M, so that doesn’t seem a good option, especially considering the price of a GTX 960 card.
We’ll do a few more tests at some point. If we find out anything interesting, I’ll post back here…
F
Roelof says
Hi Tim,
Thanks a lot for your great hardware guide!
I’m planning to build a 3 x Titan X GPU setup, which will be running more or less on a constant basis: would you say that water cooling makes a big impact on performance (by keeping the temperatures always below 80 degrees)?
As the machine will be installed remotely, where I don’t have easy access to it, I’m a bit nervous about installing a water cooling system in such a setup, with the risk of coolant leakage, so the “risk” has to be worth the performance gain 🙂
Do you have any experience with water-cooled systems, and would you say that it would be a useful addition?
Also, would you advise a nice tightly fitting chassis, or a bigger one which allows better airflow?
Finally (so many questions :P), do you think 1500 watts with 92-94% efficiency at 100% load should suffice in case I use 4 Titan X GPUs, or would it be better to go for a 1600W PSU?
Tim Dettmers says
If you operate the computer remotely, another option is to flash the BIOS of the GPU and crank up the fan to maximum speed. This will produce a lot of noise and heat, but your GPUs should run slightly below 80 degrees, or at 80 degrees with little performance lost.
Water cooling is of course much superior, but if you have little experience with it, it might be better to just go with an air-cooled setup. I have heard that if installed correctly, water cooling is very reliable, so maybe this would be an option if somebody else who is familiar with water cooling helps you to set it up.
In my experience, the chassis does not make such a big difference. It is all about the GPU fans and getting the heat out quickly (which is mostly towards the back and not through the case). I installed extra fans for better airflow within the case, but this only makes a difference of 1-2 degrees. What might help more are extra backplates and small attachable cooling pads for your memory (both about 2-5 degrees).
I used a 1600W PSU with 4 GTX Titans, which need just as much power as a GTX Titan X, and it worked fine. I guess 1500W would also work well, and 92-94% efficiency is really good. I would try the 1500W one and if it does not work just send it back.
Roelof says
Thanks for the detailed response, I’ve decided to go for:
– Chassis: Corsair Carbide Air 540
– Motherboard: ASUS X99-E WS
– Cpu: Intel(Haswell-e) Core i7 5930K
– Ram: 64GB DDR4 Kingston 2133Mhz
– Gpu: 3 x NVIDIA GTX TITAN-X 12GB
– HD1: 2 X 500GB SSD Samsung EVO
– HD2: 3 X 3TB WD Red in RAID 5
– PSU: Corsair AX1500i (1500Watt)
With a custom-built water cooling system for both the CPU and the 3 Titan Xs, which I hope will let me crank up these babies while keeping the temperature below 80 degrees at all times.
The machine is partly (at least the chassis is) inspired by Nvidia’s recently released DevBox for Deep Learning (https://developer.nvidia.com/devbox), but for almost half the price. Will post some benchmarks with the newer cuDNN v3 once it’s built and all set up.
Alex says
How did your setup turn out ? I am also looking to either build a box or find something else ready made (if it is appropriate and fits the bill). I was thinking of scaling down the nvidia devbox as well. I also saw these http://exxactcorp.com/index.php/solution/solu_detail/233 which are similar. Very expensive.
Why is there no mention of Main Gear https://www.maingear.com/custom/desktops/force/index.php anywhere? Are they no good? The price seems too good to be true. I have heard that they break down, but I have also heard that the folks at Main Gear are very responsive and helpful.
Thanks for any insight and thanks Tim for the great blog posts!
Axel says
Hi Tim,
I’m a Caffe user, and since Caffe has recently added support for multiple GPUs, I have been wondering if I should go with a Titan X or with 2 GTX 980s. Which of these two configurations would you choose? I’m more inclined towards the 2 GTX 980s, but maybe there are some downsides to this configuration that I haven’t thought about.
Thanks!
Tim Dettmers says
This is relevant. I do not have experience with Caffe parallelism, so I cannot really say how good it is. So 2 GPUs might be a little bit better than I said in the quora answer linked above.
howtobeahacker says
Hi, I intend to plug 2 Titan X GPUs into my workstation. The spec of my workstation says that it is possible to have up to 2 NVIDIA K20 GPUs, and the K20 and Titan X are the same size. However, now that I have the first Titan X, I measured that if I plug the second one in there will only be a tiny space between the 2 GPUs. I wonder if that is safe for the cooling of the GPU system.
Hope to have your opinion.
Thanks
Tim Dettmers says
A very tiny space between GPUs is typical for non-Tesla cards and your cards should be safe. The only problem is that your GPUs might run slower because they reach their 80 degrees temperature limit earlier. If you run a Unix system, flashing a custom BIOS to your Titans will modify the fan regulation so that your GPUs should be cool (< 80 degrees C) at all times. However, this may increase the noise and heat inside the room where your system is located. Flashing a BIOS for better fan regulation will first and foremost only increase the lifetime of your GPUs, but overall everything should be fine and safe without any modifications, even if you operate your cards at maximum temperature for some days without pause (I personally used the standard settings for a few years and all my GPUs are still running well).
howtobeahacker says
Hi Tim,
Thanks for your responses. In one of your posts I remember an image of a piece of software in Ubuntu that visualizes the state of the GPUs, something similar to Task Manager for CPUs. If you have information, please let me know.
howtobeahacker says
Hi Tim,
I just found a way to increase GPU fan in Ubuntu using Nvidia X server settings. The details are in http://askubuntu.com/questions/42494/how-can-i-change-the-nvidia-gpu-fan-speed
Tim Dettmers says
Indeed, this will work very well if you have only one GPU. I did not know that there was an application which automatically prepares the xorg config to include the cooling settings — this is very helpful, thank you! I will include that in an update in the future.
An Tran says
I just found a way to increase fan speed of multiple GPUs without flashing. Here is my documentation.
http://antechcvml.blogspot.sg/2015/12/how-to-control-fan-speed-of-multiple.html
Tim Dettmers says
This looks great! Thank you!
pedropgusmao says
Hello Tim,
First of all thanks for always answering my questions and sorry for coming back with more 🙂
Do you think a 980 (4GB) is enough for training current neural nets (alexnet, overfeat, vgg), or would it be wise to go for a 980ti?
PS: I am a PhD student, time for me is cheaper than euros 🙂
Thanks again.
Tim Dettmers says
4 GB of memory can indeed be quite short sometimes. If time is cheaper than money, go for a GTX 980Ti, or even better a GTX Titan X.
gac says
Hi Tim,
First of all, excellent blog! I’m putting together a gpu workstation for my research activities and have learned a lot from the information you’ve provided so …. thanks!! 🙂
I have a pretty basic question. So basic I almost feel stupid asking it but here goes …
Given your deep learning setup which has 3x GeForce Titan X for computational tasks, what are your monitors plugged in to?
I would like a very similar setup to yours (except I’ll have two 29″ monitors) and I was wondering if it’s possible to plug these into the Titan cards and have them render the display AND run calculations.
Or is it better to just have another, much cheaper, graphics cards which is just for display purposes?
Tim Dettmers says
I have my monitors plugged into a single GTX Titan X and I experience no side effects from that other than the couple of hundred MB of memory that is needed for the monitors; the performance for CUDA compute should be almost the same (probably something like 99.5%). So no worries here, just plug them in where it works for you (on Windows, one monitor would also be an option, I think).
Vu Pham says
I’m so sorry the X3 version of Mellanox does not support RDMA but the X4 does
Vu Pham says
So, I did some research on deep learning hardware, and I assume the most appropriate parts list is:
Motherboard: X10DRG-Q – This is a dual-socket board which allows you to double the lanes of the CPU. It has 4 fully functional x16 PCIe 3.0 slots and an extra x4 PCIe 2.0 slot for a Mellanox card.
CPU: 2X E5-2623
Network card: Mellanox ConnectX-3 EN Network Adapter MCX313A-BCBT
Star of the show: 4x TitanX
Assuming the other parts are $1000, the total cost would be $7,585, half the price of the Nvidia Dev Box. My god, NVIDIA.
Tim Dettmers says
This sounds like a very good system. I was not aware of the X10DRG-Q motherboard; usually such mainboards are not available for private customers — this is a great board!
I do not know the exact topology of the system compared to the Nvidia Dev Box, but if you have two CPUs this means you will have an additional switch between the two PCIe networks, and this will be a bottleneck where you have to transfer GPU memory through CPU buffers. This makes algorithms complicated and prone to human error, because you need to be careful how you pass data around in your system, that is, you need to take into account the whole PCIe topology (on which network and switch the InfiniBand card sits, on which network each GPU sits, etc.). cuda-convnet2 has some 8-GPU code for a similar topology, but I do not think it will work out of the box.
If you can live with more complicated algorithms, then this will be a fine system for a GPU cluster.
Vu Pham says
I got it, so I will stick to the old plan then. Thank you anyway.
Vu Pham says
Hi Tim
Fortunately, Supermicro provided me with the X10DRG-Q mobo diagram, and it should also be a general diagram for other dual-socket 2011 mobos which have 4 or more PCIe x16 slots. The 2 CPUs are connected by 2 QPI (Intel QuickPath Interconnect) links. CPU 1 has 40 lanes: 32 lanes for 2 PCIe x16 slots, 4 for the 10-Gigabit LAN, and 4 for a x4 PCIe slot (x8 slot shape, which will be covered if you install a 3rd graphics card). The 2nd CPU also provides 32 lanes for PCI Express, of which 8 go to the x8 slot at the top (nearest the CPU socket). Pretty complicated.
The point of building a perfect 4 x16 PCIe 3.0 system was that I thought the performance would be halved if the bandwidth goes from x16 down to x8. Do you have any information on how much the performance differs for, say, a single Titan X on x16 3.0 versus x16 2.0?
Tim Dettmers says
Yes, that sounds complicated indeed! A x16 2.0 slot will be as fast as a x8 3.0 slot, so the bandwidth is also halved by stepping down to 2.0. I do not think there exists a single solution which is easy and at the same time cheap. In the end I think the training time will not be that much slower if you run 4 GPUs on x8 3.0, and with that setup you would not run into any programming problems for parallelism and you will be able to use standard software like Torch7 with integrated parallelism — so I would just go for a x8 3.0 setup.
If you want a less complicated system that is still faster, you can think about getting a cheap InfiniBand FDR card on eBay. That way you would buy 6 cheap GPUs and hook them all up via InfiniBand at x8 3.0. But probably this will be a bit slower than straight 4x GTX Titan X at x8 3.0 on a single board.
Mohamad Ivan Fanany says
Hi Tim, very nice sharing. I just would like to comment on the ‘silly’ parts (smile): the monitors. Since I only have one monitor, I just use NoMachine and put the screen in one of my virtual workspaces in ubuntu to switch between the current machine and our deep learning servers. Surprisingly this is more convenient and energy efficient both for the electricity and our neck movement. Just hope this would help especially those who only have single monitor. Cheers.
Tim Dettmers says
Thanks for sharing your working procedure with one monitor. Because I got a second monitor early, I kind of never optimized the workflow on a single monitor. I guess when you do it well, as you do, one monitor is not so bad overall — and it is also much cheaper!
Xardoz says
Very Useful information indeed, Tim.
I have a newbie question: If the motherboard has integrated graphics facility, and if the GPU is to be dedicated to just deep learning, should the display monitor be connected directly to the motherboard rather than the GPU?
I have just bought a machine with GeForce Titan X card and they just sent me a e-mail saying:
“You have ordered a graphics card with your computer and your motherboard comes supplied with integrated graphics. When connecting your monitor it is important that you connect your monitor cable to the output on the graphics card and NOT the output on the motherboard, because by doing so your monitor will not display anything on the screen.”
Intuitively, it seems that off-loading the display duties to the motherboard will free the GPU to do more important things. Is this correct? If so, do you think that this can be done simply? I would ask the supplier, but they sounded lost when I started talking about deep learning on graphics cards.
Regards
Xardoz
Tim Dettmers says
Hi Xardoz! You will be fine when you connect your monitor to your GPU, especially if you are using a GTX Titan X. The only significant downside of this is some additional memory consumption, which can be a couple of hundred MB. I have 3 monitors connected to my GPU(s) and it never bothered me doing deep learning. Only if you train very large convolutional nets that are on the edge of the 12GB limit would I think about using the integrated graphics.
Xardoz says
Thanks Tim.
It seems that my motherboard’s graphics capability (Asus Z97-P with an Intel i7-4790K) is not available if a graphics card is installed.
And yes, I do need more than 12GB for training a massive NN! So I decided to buy a small graphics card just to run the display, as suggested in one of your comments above. Seems to work fine.
Regards
Charles Foell III says
Hi Tim,
1) Great post.
2) Do you know how motherboards with dedicated PCI-E lane controllers shuffle data between GPUs with deep learning software? For example, the PLX PEX 8747 purports control of 48 PCI-E lanes beyond the 40 lanes a top-shelf CPU controls, e.g. allowing five x16 connections, but it’s not clear to me if deep learning software makes use of such dedicated PCI-E lane controllers.
I ask since going beyond three x16 connections with CPU control of PCI-E lanes requires a dual CPU, but such boards along with suitable CPUs can in sum be thousands of dollars more expensive than a single-CPU motherboard that has a PLX PEX 8747 chip. If the latter has as good performance for deep learning software, might as well save the money!
Thanks!
-Charles
Tim Dettmers says
That is very difficult to say. I think the PLX PEX 8747 chip will be handled by the operating system after you installed some driver, so that deep learning software would use it automatically in the background. However, it is unclear to me if you really can operate three GPUs in 16/16/16 when you use this chip, or if it will support peer-to-peer GPU memory transfers. I think you will need to get in touch with the manufacturer for that.
Charles Foell III says
Hi Tim, makes sense. Thanks for the reply.
I’ll need to dig more. I’ve seen various GPU-to-GPU benchmarks for server-grade motherboards (e.g. in HPC systems), including a raw ~ 7 GB/s using a PLX PEX chip (lower than host-to-GPU), but I’ve had difficulty finding benchmarks for single-CPU boards, let alone for more than three x16 GPU connections.
If you come across a success story of a consumer-grade single-CPU system with exceptional transfer speed (better than 40 PCI-E 3.0 lanes worth in sum) between GPUs when running common deep learning software/libraries, or even a system with such benchmarks for raw CUDA functions, please update.
In the meantime, I look forward to your other posts!
Best,
Charles
Nikos Tsarmpopoulos says
AMD’s Naples CPU is expected to provide 128 lanes: 64 lanes for 4 PCIe expansion cards at x16 and the remaining for the CPU-to-CPU interconnect (called Infinity Fabric).
Source:
https://arstechnica.co.uk/information-technology/2017/03/amd-naples-zen-server-chip-details/
Nikos Tsarmpopoulos says
In another article, it is implied that with 1xCPU systems, 128 lanes will be available for I/O, presumably allowing for full x16 lanes on up to 8 GPUs, or for use with NVLink bridges.
Source:
http://www.anandtech.com/show/11183/amd-prepares-32-core-naples-cpus-for-1p-and-2p-servers-coming-in-q2
Jon says
Will ECC RAM make convolutional NNs or deep learning more efficient or better? In other words, if the same money can buy me one PC with ECC RAM vs. two PCs without ECC RAM, which should I pick for deep learning?
Tim Dettmers says
I think ECC memory only applies to 64-bit operations and thus would not be relevant to deep learning, but I might be wrong.
ECC corrects a bit that is flipped the wrong way due to physical inconsistencies at the hardware level of the system. Deep learning has been shown to be quite robust to inaccuracies; for example, you can train a neural network with 8 bits (if you do it carefully and in the right way), and training a neural network with 16 bits works flawlessly. Note that training on 8 bits, for example, will decrease the accuracy for all data, while ECC is relevant only for some small parts of the data. However, a flipped bit might be quite severe, while a conversion from 32 to 8 bits might still be quite close to the real value. But overall I think an error in a single bit should not be so detrimental to performance, because the other values might counterbalance this error, or in the end the softmax will buffer it (an extremely large error value sent to half the connections might spread through the whole network, but in the end, for that sample, the softmax probability will just be 1/classes for each class).
Remember that there are always a lot of samples in a batch, and that the error gradients in this batch are averaged. Thus even large errors will dissipate quickly, not harming performance.
Tran Lam An says
Hi Tim,
Thank for your support on Deep Learning group.
I have a workstation DELL T7610 http://www.dell.com/sg/business/p/precision-t7610-workstation/pd.
I want to plug in 2 Titan X from NVIDIA and ASUS. Everything seems okay, I just wonder about PSU, cooling, and dimensions of GPU.
I will check the cooling and dimensions later. My main concern is about power.
I looked at the documents http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications and https://www.asus.com/Graphics-Cards/GTXTITANX12GD5/specifications/.
Both of them require power up to 300W.
However, in the specs of the workstation, they say the following about graphics cards:
Support for up to three PCI Express® x16 Gen 2 or Gen 3 cards up to 675W (total for graphics (some restrictions apply))
GPU: One or two NVIDIA Tesla® K20C GPGPU – Supports Nvidia Maximus™ technology.
So the total power seems okay, right?
More evidence:
The power of the workstation would be:
Power Supply: 1300W (externally accessible, toolless, 80 Plus® Gold Certified, 90% efficient)
CPU ( 230W ) + 2GPU( 300*2 ) + 300 = 1130W.
It seems okay for the power, right?
Hope to have your opinions.
Thank you for your sharing.
Tim Dettmers says
Everything looks fine. I ran 3 GTX Titan with a 1400 watt PSU and 4 GTX Titan with 1600 watt, so you should definitely be fine with 1300 watt and 2 GPUs. A GTX Titan also uses more power than a GTX Titan X. Your calculation looks good and there might even be space for a third GPU.
P.S. The comments are locked in place to await approval if someone new posts on this website. This is so to prevent spam.
Haider says
Tim,
I am new to deep NNs. I discovered their tremendous progress after seeing the excellent 2015 GTC NVIDIA talk. Deep NNs will be very useful for my PhD, which is about electrical brain signal classification (Brain Computer Interface).
What a joy to find your blog! I just wish you wrote more.
All your post are full of interesting ideas. I have checked the comments of the posts which are not less interesting than the posts themselves and full of important hints too.
I read a lot, but did not find most of your interesting hints on hardware elsewhere. Your posts were just brilliant. I believe your posts filled a gap in the web, especially on the performance and the hardware side of deep NN.
I think on the hardware side, after reading your posts I have enough knowledge to build a good system.
On the software side, I found a lot of resources. However, I am still a bit confused. Perhaps because they weren’t your posts 😉 . Why do you only write about hardware? You can write very well, and we would love to hear about your experience with software too.
From where should I begin?
I’m very fond of Matlab and haven’t programmed much in other languages. And I don’t know anything about Python, which seems very important to learn for machine learning. I don’t mind learning Python if you advise me to do so. But if it is not necessary, then maybe I can spare my time for other deep NN stuff, which is overwhelming already. My excitement has crippled me. I have opened ~600 tabs and want to see them all.
If you were in my shoes, which platform would you begin to learn with? Caffe, Torch or Theano? Why?
And please tell me about your personal preference too. I learned from your posts that you are writing your own programs. But if you were picking one of these for yourself, which would it be? And if you were like me, with no Python experience, which would you pick in that case?
I am very interested to hear your opinion. I am not in a hurry. When you feel like writing, please answer me with some details.
I thank you sincerely for all the posts and comment replies in your blog and eager to see more posts from you Tim!
Thank you!
Tim Dettmers says
Thank you for all this praise — this is encouraging! I wrote about hardware mainly because I myself focused on the acceleration of deep learning and understanding the hardware was key in this area to achieve good results. Because I could not find the knowledge that I acquired elsewhere on a single website, I decided to write a few blog posts about this. I plan to write more about other deep learning topics in the future.
In my next posts I will compare deep learning to the human brain: I think this topic is very important because the relationship between deep learning and the brain is in general poorly understood.
I also wanted to make a blog post about software, but I did not have the time yet to do that — I will do so probably in this month or the next.
Regarding your questions, I would really recommend Torch7, as it is the deep learning library which has the most features and which is steadily extended by Facebook and DeepMind with new deep learning models from their research labs. However, as you posted above, it is better for you to work on Windows, and Torch7 does not work well on Windows. Theano is the best option here I guess, but Minerva also seems to be okay.
Caffe is a good library when you do not want to fiddle around too much within a certain programming language and just want to train deep learning models; the downside is that it is difficult to make changes to the code and the training procedure/algorithm, and few models are supported.
For brain signals specifically, I think Python offers a lot of packages which might be helpful for your research.
However, if you just want to get started quickly with the language you know, Matlab, then you can also use the neural network bindings from the Oxford research group, with which you can use your GPU to train neural networks within Matlab.
Hope this helps, good luck!
Zizhao says
Do you think that if you have too many monitors, they will occupy too many of your GPU’s resources? If yes, how do you solve this issue?
Tim Dettmers says
I have three monitors with 1920×1080 resolution and the monitors draw about 400 MB. For me I never had any issues with this, but I also had 6GB cards and I did not train models that maxed out my GPU RAM. If you have a GPU with less memory (GTX 980 or GTX 970) then there might be some problems for convolutional nets. The best way to circumvent this problem is to buy a really cheap GPU for the monitors (a GT210 costs about $30 and can power two (three?) monitors), so that your main deep learning GPU is not attached to any monitor.
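If you want to check this on your own machine, you can query free and total GPU memory before and after attaching the monitors. A minimal sketch using the CUDA runtime API (the file name and output format are just illustrative):

```cpp
// check_gpu_memory.cu -- print free vs. total memory for every CUDA device,
// e.g. to see how much memory the attached monitors are consuming.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("GPU %d: %.0f MB free of %.0f MB total (%.0f MB in use)\n", d,
               free_bytes / (1024.0 * 1024.0),
               total_bytes / (1024.0 * 1024.0),
               (total_bytes - free_bytes) / (1024.0 * 1024.0));
    }
    return 0;
}
```

Compile with `nvcc check_gpu_memory.cu -o check_gpu_memory`; the difference in the "in use" number with and without monitors attached is roughly what your displays cost you.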
Sameh Sarhan says
Tim, you have a wonderful blog and I am very impressed with the knowledge as well as the effort that you are putting into it.
I run a Silicon Valley startup that works in the space of wearable bio-sensing; we developed very unique non-invasive sensors that can measure vitals as well as psychological and physiological effects. Most of our signals are multivariate time series; we typically process a (1×3000) signal per sensor per reading, and we can typically use up to 5 sensors.
We are currently expanding our ML algorithms to add CNN capabilities. I wonder what you would recommend in terms of GPU.
Also, I would highly appreciate it if you could email me to further discuss a potentially mutually beneficial collaboration.
Regards,
Sameh
Tim Dettmers says
Hi Sameh! If you have multivariate time series, a common CNN approach is to use a sliding window of X time steps over your data. Your convolutional net would then use temporal instead of spatio-temporal convolution, which uses much less memory. As such, 6GB of memory should probably be sufficient for such data and I would recommend a GTX 980 Ti, or a GTX Titan. If you need to run your algorithms on very large sliding windows (an important signal happened 120 time steps ago, to which the algorithm should be sensitive), a recurrent neural network would be best, for which 6GB of memory would also be sufficient. If you want to use CNNs with such large windows it might be better to get a GTX Titan X with 12GB of memory.
Regards,
Tim
Sergii says
Sorry that my question was confusing.
I wrote simple code which runs axpy cuBLAS kernels and memcpy. As you can see from the profiler ![](http://imgur.com/dmEOZTY,q2HhqlX#1), in the case of pinned memory the kernels that were launched after cudaMemcpyAsync run in parallel (with respect to the transfer process).
However, in the case of pageable memory, ![](http://imgur.com/dmEOZTY,q2HhqlX) cudaMemcpyAsync blocks the host, and I can’t launch the next kernel.
In the chapter `Direct memory access (DMA)` you say “…on the third step the reserved buffer is transferred to your GPU RAM without any help of the CPU…”, so why does cudaMemcpyAsync block the host until the end of the copy process? What is the reason for that?
Tim Dettmers says
The most low-level reason I can think of is, as I said above, that pageable memory is inherently insecure and may be swapped/pushed around at will. If you start a transfer and want to make sure that everything works, it is best to wait until the data is fully received. I do not know the low-level details of how the OS and its drivers and routines (like DMA) interact with the GPU. If you want to know these details, I think it would be best to consult with people from NVIDIA directly; I am sure they can give you a technically accurate answer. You might also want to try the developer forums.
Sergii says
Thank you for the reply.
Do you know what is the reason for the inability to have overlapping pageable host memory transfer and kernel execution?
Tim Dettmers says
It all has to do with having a valid pointer to the data. If your memory is not pinned, then the OS can push the memory around freely to make some optimizations, so you are not certain to have a stable pointer to CPU memory, and thus such transfers are not allowed by the NVIDIA software because they easily run into undefined behaviour. With pinned memory, the memory can no longer move, so a pointer to the memory will stay the same at all times and a reliable transfer can be ensured.
This is different in GPUs, because GPU pointers are designed to be reliable at all times as long as they stay on some GPU memory, so these problems do not exist for GPU -> GPU transfers.
Sergii says
Thanks for the wonderful explanation. But I still have a question. Your previous reply can explain why data transfer with pageable memory can’t be asynchronous with respect to a host thread, but I still do not understand why a device can’t execute kernel while copying data from a host. What is the reason for that?
Tim Dettmers says
Kernels can execute concurrently; a kernel just needs to work on data in a different stream. In general, each GPU can have one host-to-GPU and one GPU-to-host transfer active, and execute a kernel concurrently on unrelated data in another stream (by default all operations use the default stream and are thus not concurrent).
But you are right that you cannot execute a kernel and a data transfer in the same stream. I assume there are issues with the hardware not being able to resume a kernel once it reaches the end of the data that is being transferred at that very moment (the kernel would need to wait, then compute, then wait, then compute, then wait… — this will not deliver good performance!). So it will be because of this that you cannot run a kernel on partial data.
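To make the overlap concrete, here is a minimal sketch of the pinned-memory case: a host-to-device copy issued in one stream while a kernel runs on unrelated data in another stream. The array size, the dummy kernel, and the file name are just assumptions for the example, not code from any particular project:

```cpp
// overlap.cu -- copy from a *pinned* host buffer in one stream while a kernel
// works on unrelated device data in another stream.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];                                            // contents do not matter here
        for (int k = 0; k < 1000; ++k) v = v * 1.0000001f + 0.5f;  // just keep the GPU busy
        x[i] = v;
    }
}

int main() {
    const int n = 1 << 24;                           // ~16M floats, ~64 MB
    float *h_pinned, *d_a, *d_b;
    cudaMallocHost(&h_pinned, n * sizeof(float));    // pinned (page-locked) host memory
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Returns immediately because the source is pinned; the DMA engine does the work.
    cudaMemcpyAsync(d_a, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);

    // Kernel on unrelated data in a different stream, so it can overlap with the copy.
    busy_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_b, n);

    cudaDeviceSynchronize();                         // wait for both streams
    printf("copy and kernel were issued concurrently\n");

    cudaFreeHost(h_pinned);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

With pageable memory (plain malloc) instead of cudaMallocHost, the same cudaMemcpyAsync call blocks the host, which matches the profiler screenshots discussed above.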
Sergii says
Hi Tim!
Thanks for your helpful and detailed write-up.
It seems from this blog post (http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/) that for concurrent kernel execution and data transfer the memory must be pinned.
You wrote “…one might be able to prevent that when one uses pinned memory, but as you shall see later, it does not matter anyway…” and AFAIU you don’t use pinned memory in the async batch allocation process (`clusternet` project).
Also, pinned memory is mentioned in the documentation (http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#asynchronous-transfers-and-overlapping-transfers-with-computation), but at the same time this (http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79) document says “…the copy may overlap with operations in other streams…” and no mention about pinned memory.
These contradictory facts are a bit confusing to me. So my question is:
– Do you have the code that can confirm overlapping between transfer process of pageable host memory to the device memory and kernel execution?
– And what is actually going on with cudaMemcpyAsync?
Tim Dettmers says
What you write is all true, but you have to look at it in two different ways, (1) CPU -> GPU, and (2) GPU -> GPU.
For CPU -> GPU you will need pinned memory to do asynchronous copies; however, for GPU -> GPU the copy will be automatically asynchronous in most use cases — no pinned GPU memory needed (cudaMemcpy and cudaMemcpyAsync are almost always the same for GPU -> GPU transfers).
It turns out that I do use pinned memory in my clusterNet project, but it is a bit hidden in the source code: I use it only for batch buffers in my BatchAllocator class, which has an embarrassingly poor design. There I transfer the usual CPU memory to a pinned buffer (while the GPU is busy) and then, in another step, transfer it asynchronously to the GPU, so that the batch is ready when the GPU needs it.
You can also allocate the whole data set as pinned memory, but this might cause some problems, because once pinned, the OS cannot “optimize” the locked-in memory anymore, which may lead to performance problems if one allocates too large a chunk of memory.
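To make the staging idea concrete, here is a rough sketch of the pattern (not the actual clusterNet code; the batch size, names, and per-batch synchronization are simplified assumptions, and real code would double-buffer so the next batch is staged while the current one trains):

```cpp
// staged_batches.cu -- copy each mini-batch from pageable memory into a small
// pinned buffer, then ship it to the GPU asynchronously.
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const int n_batches = 100;
    const size_t batch_floats = 128 * 784;                 // e.g. an MNIST mini-batch
    const size_t batch_bytes  = batch_floats * sizeof(float);

    float *dataset = (float*)malloc(n_batches * batch_bytes); // pageable: the whole data set
    float *staging;                                            // small pinned staging buffer
    float *d_batch;
    cudaMallocHost(&staging, batch_bytes);
    cudaMalloc(&d_batch, batch_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int b = 0; b < n_batches; ++b) {
        // 1) plain CPU copy into the pinned buffer (can happen while the GPU is busy)
        memcpy(staging, dataset + b * batch_floats, batch_bytes);
        // 2) asynchronous transfer of the pinned buffer to the GPU
        cudaMemcpyAsync(d_batch, staging, batch_bytes,
                        cudaMemcpyHostToDevice, stream);
        // 3) ... enqueue the training kernels for this batch in `stream` here ...
        cudaStreamSynchronize(stream);  // simplification; double-buffering avoids this stall
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(staging);
    cudaFree(d_batch);
    free(dataset);
    return 0;
}
```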
Kai says
Hey Tim! Thanks for these posts, they’re highly, highly appreciated! I’m just starting to get my feet wet in deep learning – is there any way to hook up my Laptop to a GPU (maybe even an external one?) without having to build a PC from scratch so I could start GPGPU programming on small datasets with less of an investment? Does the answer depend on my motherboard?
Tim Dettmers says
In that case it will be best to use AWS GPU spot instances, which are cheap and fast. External GPUs are available, but they are not an option because the data transfer, CPU -> USB-like interface -> GPU, is too slow for deep learning. Once you have gained some experience with AWS, I would then buy a dedicated deep learning PC.
Kai says
Thanks, that sounds like a good idea then!
Frank Kaufmann (@FrankKaufmann76) says
What are your thoughts on the GTX 980 Ti vs. the Titan X? I guess with “980” in your article you referred to the 4 GB models. The 980 Ti has the same Memory Bandwidth as the Titan X, 2GB more memory than a 980 (which should make it better for big convnets), only a few CUDA cores less. And the price difference is 549 USD for a 980 Ti vs 999 USD for the Titan X.
Tim Dettmers says
The GTX 980 Ti is a great card and might be the most cost effective card for convolutional nets right now. The 6GB RAM on the card should be good enough for most convolutional architectures. If you will be working on video classification or want to use memory-expensive data sets I would still recommend a Titan X over a 980 Ti.
Sinan says
That sounds interesting. Would you mind sharing more details about your G3258-based system?
Tim Dettmers says
I do not have a Haswell G3258 and I would not recommend one, as it only runs 16 PCIe 3.0 lanes instead of the typical 40. So if you are looking for a CPU I would not pick Haswell — too new and thus too expensive, and many Haswells do not have full 40 PCIe lanes.
Sinan says
Sorry Tim, my comment was meant to be in response to the comment #128 by user “lU” from March 9, 2015 at 10:59 PM. I wonder why it didn’t appear under that one despite having double-checked before posting. I guess it’s the fault of my mobile browser.
First of all, thank you for a series of very informative posts, they are all much appreciated.
I was planning to go for a single GPU system (GTX 980 or the upcoming 980 Ti) to get started with deep learning, and I had the impression that at $72, this is the most affordable CPU out there.
Tim Dettmers says
You’re welcome! I was looking for other options, but to my surprise there were not any in that price range. If you are using only a single GPU and you are looking for the cheapest option, this is indeed the best choice.
Mark says
That second 10x speed-up claim with NVLink is a bit strange because it is not clear how it is being made.
Richard says
Hi Tim,
First can I say thanks very much for writing this article – it has been very informative.
I’m a first year PhD student. My research is concerned with video classification and I’m looking into using convolutional nets for this purpose.
My current system has a GT 620, which takes about 4 hours to run a LeNet-5-based network built using Theano on MNIST. So I’m looking to upgrade and I have about £1000 to do it with.
I’ve allocated about £500 for the GPU but I’m struggling to decide what to get. I’ve discounted the GTX 970 due to the memory problems. I was thinking either a GTX 780 (6GB Asus version), a GTX 980, or two GTX 960s. What is your opinion on this? I know I can’t use multiple GPUs with Theano, but I could run two different nets at the same time on the 960s; however, would it be quicker just to run each net consecutively on the 980 since it’s faster? Also there’s the 780, which although slower than the 980 has more RAM, which would be beneficial for convolutional nets. I looked into buying second hand as you suggested, however I’m buying through my university so that isn’t an option.
Thanks for your help and for the great article once again.
Cheers,
Richard
Tim Dettmers says
That is really a tricky issue, Richard. If you use convolution over the spatial dimensions of an image as well as the time dimension, you will have 5-dimensional tensors (batch size, rows, columns, maps, time) and such tensors use a lot of memory. So you really want a card with plenty of memory. If you use the Nervana Systems 16-bit kernels you would be able to reduce memory consumption by half; these kernels are also nearly twice as fast (for dense connections they are more than twice as fast). To use the Nervana Systems kernels, you will need a Maxwell GPU (GTX Titan X, GTX 960, GTX 970, GTX 980). So if you use this library a GTX 980 will have “virtually” 8GB of memory, while the GTX 780 has 6GB. The GTX 980 is also much faster than the GTX 780, which further adds to the GTX 980 option. However, the Nervana Systems kernels still lack some support for natural language processing, and overall you will have far more thorough software if you use Torch and a GTX 780. If you think about adding your own CUDA kernels, the Nervana Systems + GTX 980 option may not be so suitable, because you will probably need to handle the custom compiler and program 16-bit floating point kernels (I have not looked at this, but I believe there will be things which make it more complicated than regular CUDA programming).
I think both the GTX 780 and the GTX 980 are good options. The final choice is up to you!
Hope this helps!
Cheers,
Tim
Richard says
Thanks for the detailed response Tim,
Think I’ll go with the 780 for now due to the extra physical memory. Quick follow-up question: if I have the money for an additional card in the future, would I need to buy the same model? Could I, for example, have both a GTX 780 and a GTX 980 running in the same machine so that I can have two different models running on each card simultaneously? Would there be any issues with drivers etc.? Going to order the parts for my new system tomorrow; will post some benchmarks soon.
Cheers,
Richard
Tim Dettmers says
GPUs can only communicate directly if they are based on the same chip (but brands may differ). So for parallelism you would need to get another GTX 780, otherwise a GTX 980 is fine for everything else. Also remember, that new Pascal GPUs will hit around Q3 2016 and those will be significantly faster than any Maxwell GPU (3D memory) — so waiting might be an option as well.
Mark says
FYI on Pascal chip from NVIDIA. Speed up over Titan is “up to 5x.” Of this, a 2x speed up will come from the option of switching to using 16 bit floating point in Pascal.
The rest of the “up to 10x speed up” comes from the 2x speed up you get from NVLink. Here the comparison is two Pascal versus two Titans. I don’t know what the speed up would be if the Pascals used the same PCI interlink as the Titans or if they could even use the PCI interlink. Hopefully so then a new motherboard would not be necessary.
Thomas says
Hi Tim,
Thank you for all your advice on how to build a machine for DL!
You don’t talk about the possibility of using an embedded GPU on the motherboard (or a “small” second GPU) so as to dedicate the “big” GPU to computation. Could that affect the performance in any way?
Also, we want to build a computer to reproduce and improve (by making a more complex model) the work of DeepMind on their generalist AI.
We were thinking about getting one Titan X and 32GB of RAM.
Would you have any specific recommendation concerning the motherboard and CPU?
Thank you very much
Tim Dettmers says
There are some GPUs which are integrated (embedded) in regular CPUs and you can run your monitors on these processors. The effect of this is some saved memory (about a hundred MB for each monitor) but very little computational resources (less than 1 % for 3 monitors). So if you are really short on memory (say you have a GPU with 2 or 3GB and 3 monitors) then this might make good sense. Otherwise, it is not very important and buying a CPU with integrated graphics should not be a deciding factor when you buy a CPU.
As I said in the article, you have a wide variety of options for the CPU and motherboard, especially if you will stick with one GPU. In this case you can really go for very cheap components and it will not hurt your performance much. So I would go for the cheapest CPU and motherboard with a reasonably good rating on pcpartpicker.com if I were you.
Thomas says
Thank you very much!
Florijan Stamenković says
Tim,
Thanks for the excellent guide! It has helped us a lot. However, a few questions remain…
We plan to build a deep-learning machine (in a server rack) based on 4 Titan cards. We need to select other hardware. Ideally we would put all four cards on a single board with 4x PCIe 3.0 x16. The questions are:
1. If I understand correctly, GPU intercommunication is the bottleneck. Should we go for dual 40-lane CPUs (Xeons only, right?), or take a single i7 and connect the cards with SLI?
2. Will any 4x PCIe 3.0 x16 motherboard do? Is socket 2011 preferable?
We plan to use these nets for both convolutional and dense learning. Our budget (everything except the Titans) is around $3000, preferably less, or a bit more if justified. Please advise!
Florijan Stamenković says
I just read the above post as well and got some needed information, sorry for spamming. From what I understand, SLI is not beneficial.
Should we then go for two weaker Xeons (2620), each with 40 PCIe lanes? Will this be cost-optimal?
Thanks,
F
Tim Dettmers says
2 CPUs will typically yield no speedup because usually the PCIe networks of each CPU (2 GPUs for each CPU) are disconnected which means that the GPU pairs will communicate through CPU memory (max speed about 4 GB/s, because a GPU pair will share the same connection to the CPU on a PCIe-switch). While it is reasonable for 8 GPUs, I would not recommend 2 CPUs for a 4 GPU setup.
There are motherboards that work differently, but these are special solutions which often only come in a package of a whole 8 GPU server rack ($35k-$40k).
If you use a single CPU, then any motherboard with enough slots and which supports 4 GPUs will do; choose the CPU so that it supports 40 PCIe lanes and you will be ready to go. Socket 2011 has no advantage over other sockets which fulfill these requirements.
Regarding SLI: SLI can be used for gaming, but not for CUDA (it would be too slow anyways); so communication is really all done by PCI Express.
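If you want to check what a particular board and CPU combination actually allows, you can ask CUDA whether each pair of GPUs can reach the other directly. A minimal sketch using the CUDA runtime API (the output wording is just for illustration):

```cpp
// p2p_check.cu -- report, for every pair of GPUs, whether direct peer-to-peer
// access is possible or whether transfers have to be staged through CPU memory.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            printf("GPU %d -> GPU %d: %s\n", i, j,
                   can_access ? "direct peer-to-peer possible"
                              : "staged through CPU memory");
        }
    }
    return 0;
}
```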
Hope this helps!
Florijan says
It does, thanks! We are still deciding between a single CPU vs dual-CPU (for other computing purposes). Could you comment on the following two motherboards being suitable for our 4 titans:
http://www.asus.com/us/Commercial_Servers_Workstations/X99E_WS/overview/
http://www.asus.com/Commercial_Servers_Workstations/Z10PED8_WS/overview/
In particular the Z10PED8 states it supports “4 x PCIe 3.0/2.0 x16 (dual x16 or quad x8)”, from which I understand it does NOT support quad x16. Would the X99 be the best solution then?
Tim Dettmers says
It is quite difficult to say which one is better, because I do not know the PCIe switch layout of the dual-CPU motherboard. The most common PCIe switch layout is explained in this article, and if the dual-CPU motherboard that you linked behaves in a similar way, then for deep learning 2 CPUs will definitely be slower than 1 CPU if you want to use parallel algorithms across all 4 GPUs; in that case the 1-CPU board will be better. However, this might be quite different for computing purposes other than deep learning, and a 2-CPU board might be better for those tasks.
sacherus says
Hi Tim,
thank you for your great article. I think it covers everything that you need to know to start your journey with DL.
I’m also a grad student (but instead of image processing, I’m in speech processing) and want to buy a machine (I’m also thinking about Kaggle, but for a beginning I could take a 20-40 place 🙂 ). I want to buy (in Eastern Europe) a used workstation (without graphics) + a used graphics card. Probably I will end up with 2 cards in my computer… maybe 3….
Questions:
1) You wrote that you need a motherboard with 7 PCIe 3.0 slots for 3 GPUs. Isn’t it possible to have a
16x | 1x | 16x | 1x (etc.) setup? Like on http://www.msi.com/product/mb/Z87-G45-GAMING.html#hero-overview?
2) So there do not exist setups that support 16x/16x (or they are too expensive)?
3) I see that compute capability also matters. I can buy a GeForce 780 Ti at a similar price to a GTX 970. The 780 Ti has better bandwidth + more GFLOPS (you never mentioned FLOPS), but the 970 has a newer compute capability + more memory.
4) Maybe I should let go and just buy a… 960 or 680 (just to start)… However, the 970 is not much more expensive than those two. Or should I just buy a whole used PC?
Tim, what do you think?
Tim Dettmers says
1. You are right, a 16x | 1x | 16x | 1x setup will work just as well; I did not think about it that way, and I will update my blog with that soon — thanks!
2. I hope I understand you right: you have a total of 40 PCIe lanes supported by your CPU (not the physical slots, but sort of the communication wires that are laid from the PCIe slots to the CPU) and each of your GPUs will use up to 16 of them (standard mainboards); so 16x/16x is standard if you use 2 GPUs, for 3 GPUs this is 16x/8x/16x and for 4 GPUs 16x/8x/8x/8x. If you mean physical slots, then a 16x | Yx | 16x setup will do, where Y is any size; because most GPUs have a width of two PCIe slots, you most often cannot run 2 GPUs in adjacent 16x | 16x mainboard slots; sometimes this will work if you use water cooling though (it reduces the width to one slot).
3. GFLOPS do not matter in deep learning (it is virtually the same for all algorithms); your algorithms will always be limited by bandwidth. The 780 Ti has higher bandwidth but an inferior architecture, and the GTX 970 would be faster. However, the GTX 780 Ti has no glitches, and so I would go with the GTX 780 Ti.
4. The GTX 680 might be a bit more interesting than the GTX 780 TI if you really want to train a lot of convolutional nets; otherwise a GTX 780 TI is best; if you only use dense networks you might want to go with the GTX 970
Yu Wang says
Hi Tim,
Thanks for the insightful posts. I’m a grad student working in the image processing area. I just started to explore some deep learning techniques with my own data. My dataset contains 10 thousand 800*600 images with 50+ classes. I’m wondering whether a GTX 970 will be sufficient to try different networks and algorithms, including CNNs.
Tim Dettmers says
Although your data set is very small, so that you will only be able to train a small convolutional net before you overfit, the size of the images is huge. Unfortunately, the size of the images is the most significant memory factor in convolutional nets. I think a GTX 970 will not be sufficient for this.
However, keep in mind that you can always shrink the images to keep them manageable. For a GTX 970 you will need to shrink them to about 250*190 or so.
Yu Wang says
Thanks for the quick reply. Look forward to your new articles.
Dimiter says
Tim,
Thanks for a great write-up. Not sure what I’d have done without it.
A bit of a n00b question here,
Do you think it matters in practice if one has PCIe 2.0 or 3.0?
Thanks
Tim Dettmers says
If it is possible that you will have a second GPU at anytime in the future definitely get a PCIe 3.0 CPU and motherboard. If you use additional GPUs for parallelism, then in the case of PCIe 2.0 you will suffer a performance loss of about 15% for a second GPU, and much larger losses (+40%) for your third and fourth GPU. If you are sure that you will stay with one GPU in the future, then PCIe 2.0 will only give you a small or no performance decrease (0-5%) and you should be fine.
Mark says
This may not make much difference if you care about a new system now or about having a more current system in the future. However, if you want to keep it around for years and use it for other things besides ML then wait a few months.
Intel’s Skylake CPU will be released in a few months along with its new chipset, new socket, new motherboards, etc. All PCIe 3, DDR4, etc. It’s considered a big change compared to prior CPUs. Skylake prices are supposed to be similar to current offerings, but retailers say they expect the price of DDR4 to drop. Don’t really understand why, but gamers are also waiting for the release… maybe just because it is “new and improved”, since it doesn’t seem to translate into a big plus for the gaming experience.
Bjarke Felbo says
Thanks for a great guide! I’m wondering if you could give me a rough estimate of the performance boost I would get by upgrading my system? Would be awesome to have that before I spend my hard-earned money! I suppose it’s mainly based on my current GPU, but here’s a bit of info about the rest of the system as well.
Current setup:
ATI Radeon™ HD 5770 1gb
One of the last CPU’s from the 775-socket series.
4gb ram
SSD
Upgraded setup:
GTX 960 4gb
Modern dual-thread CPU with 2+ GHz
8gb ram
SSD
Two more questions:
1) I’ve sometimes experienced issues between different motherboard brands and certain GPUs. Do you have a recommendation for a specific motherboard brand (or specific product) that would work well with a GTX 960?
2) Any idea of what the performance reduction would be by doing deep learning in caffe using a Virtualbox environment of Ubuntu instead of doing a plain Ubuntu installation?
Tim Dettmers says
It is difficult to estimate the performance boost if your previous GPU is an ATI GPU; but for the other hardware pieces you should see about a 5-10% increase in performance.
1. I never had any problems with my motherboards, so I cannot give you any advice here on that topic.
2. I also had this idea once, but it is usually impossible to do this: CUDA and virtualized GPUs do not go together; you would need specialized GPUs (GRID GPUs, which are used on AWS), and even if they did go together there would be a stark performance decrease.
It is a great change to go from Windows to Ubuntu, but it is really worth doing if you are serious about deep learning. A few months in Ubuntu and you will never want to go back!
Bjarke Felbo says
Thanks for the quick response! I’ll try Ubuntu then (perhaps some dual-booting). Would it make sense to add water-cooling to a single GTX 960 or would that be overkill?
Shinji says
Hi Tim, this is a great post!
I’m interested in the actual PCIe bandwidth in the deep learning process. Are PCIe 16 lanes needed for deep learning? Of course x16 PCIe gen3 is ideal for the best performance, but I’m wondering if x8 or x4 PCIe gen3 is also enough performance.
Which do you think better solution if the system has 64 PCIe lanes?
* 4 GPGPUs connected with 16 PCIe lanes each
* 16 GPGPUs connected with 4 PCIe lanes each
Which is the more important factor, the number of GPGPUs (computational power) or PCIe bandwidth?
Tim Dettmers says
Each PCIe lane for PCIe 3.0 has a theoretical bandwidth of about 1 GB/s, so you can run GPUs also with 8 lanes or 4 lanes (8 lanes is standard for at least one GPU if you have more than 2 GPUs), but it will be slower. How much slower will depend on the application or network architecture and which kind of parallelism is used.
64 PCIe lanes are only supported by dual-CPU motherboards, and these boards often have a special PCIe switching architecture which connects the two separate PCIe systems (one for each CPU) with each other; I think you can only run up to 8 GPUs with such a system (the BIOS often cannot handle more GPUs even if you have more PCIe slots). But if you take this as a theoretical example it is best to just do some test calculations:
16 GPUs means 15 data transfers to synchronize information; 4 PCIe lanes / 15 transfers = 0.2666 GB/s for a full synchronization. If you now have a weight matrix with, say, 800×1200 floating point numbers, you have 800 x 1200 x 4 bytes / 1024^3 = 0.0036 GB. This means you could synchronize 0.2666/0.0036 = 74 gradients per second. A good implementation of MNIST with batch size 128 will run at about 350 batches per second. So the result is that 16 GPUs with 4 PCIe lanes each will be 5 times slower for MNIST. These numbers are better for convolutional nets, but not much better. Same for 4 GPUs/16 lanes:
16/3 = 5.33; 5.33/0.0036 = 647; so in this case there would be a speedup of about 2 times; this is better for convolutional nets (you can expect a speedup of 3.0-3.9 depending on the implementation). You can do similar calculations for model parallelism, in which case the 16 GPU setup would fare a bit better (but it is probably still slower than 1 GPU).
So the bottom line is that 16 GPUs with 4 PCIe lanes are quite useless for any sort of parallelism — PCIe transfer rates are very important for multiple GPUS.
Shinji says
Thank you for explanation.
From your description, it depends on the application, but the data transfer time among GPUs is dominant in a multi-GPU environment.
However, I have another question.
In your assumption, the GPU processing time is always shorter than the data transfer time. In the 16-GPU case, GPU processing must take less than 14 msec to process one batch. In the 4-GPU case, it must take less than about 2 msec.
If the GPU processing time is long enough compared to the data transfer time, the data transfer time for synchronization is negligible. In that case, it is more important to have many GPUs than high PCIe bandwidth.
Is my assumption unlikely in the usual case?
Tim Dettmers says
This is exactly the case for convolutional nets, where you have high computation with small gradients (weight sharing). However, even for convolutional nets there are limits to this; beyond eight GPUs it quickly becomes difficult to gain near-linear speedups, which is mostly due to slow interconnects between computers. An 8-GPU system will be reasonably fast with speedups of about 7-8 times for convolutional nets, but for more than 8 GPUs you have to use normal interconnects like InfiniBand. InfiniBand is similar to PCIe but its speed is fixed at about 8-25 GB/s (8 GB/s is the affordable standard; 16 GB/s is expensive; 25 GB/s is very, very expensive). So for 6 GPUs on a standard 8 GB/s connection this yields a synchronization bandwidth of 1.6 GB/s, which is much worse than the 4 GPU / 16 lanes example; for 12 GPUs this is 0.72 GB/s; 24 GPUs 0.35 GB/s; 48 GPUs 0.17 GB/s. So pretty quickly it will be pretty slow, even for convolutional nets.
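For reference, here is a small host-only snippet that reproduces these back-of-the-envelope numbers (same assumptions as in the text: roughly 1 GB/s per PCIe 3.0 lane, an 800x1200 32-bit gradient, and n-1 transfers to synchronize n GPUs; the rounding can differ slightly from the figures quoted above):

```cpp
// sync_bandwidth.cu -- effective synchronization bandwidth per gradient
// exchange for a few of the configurations discussed in the comments.
#include <cstdio>

int main() {
    // size of one 800x1200 float32 gradient in GB
    const double grad_gb = 800.0 * 1200.0 * 4.0 / (1024.0 * 1024.0 * 1024.0);

    // 16 GPUs, each on 4 PCIe 3.0 lanes (~1 GB/s per lane), 15 transfers to synchronize
    double pcie_bw = 4.0 / (16 - 1);
    printf("16 GPUs on x4 PCIe 3.0: %.4f GB/s -> about %.0f gradient syncs per second\n",
           pcie_bw, pcie_bw / grad_gb);

    // several GPUs sharing one 8 GB/s InfiniBand FDR link
    const int gpu_counts[] = {6, 12, 24, 48};
    for (int i = 0; i < 4; ++i) {
        int n = gpu_counts[i];
        printf("%2d GPUs on 8 GB/s InfiniBand: %.2f GB/s per synchronization\n",
               n, 8.0 / (n - 1));
    }
    return 0;
}
```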
Tim Dettmers says
I overlooked your comment, but it is actually a very good question. It turns out that you hit the mark exactly: the less communication is needed, the better more GPUs are compared to more bandwidth. However, in deep learning there are only a few cases where it makes sense to trade bandwidth for more GPUs. Very deep recurrent neural networks (time dimension) would be an example, and to some degree (very) deep neural networks (20+ layers) are of this type. However, even for 20+ layers you still want to maximize your bandwidth to maximize your overall performance.
For dense neural networks, anything above 4 GPUs is rather impractical. You can make it work to run faster, but this requires much effort and several compromises in model accuracy.
Mark says
Got a bit of a compromise I am thinking about to save cash when picking a CPU. The i7 5820K and i7 5930K are the same except for PCIe lanes (28 versus 40). According to this video:
https://youtu.be/rctaLgK5stA
it comes down to whether you use, say, a 4th 980 or Titan; otherwise, with three or fewer, there is no real performance difference. This means a saving on the CPU of about $200.
What are your thoughts, since you warned about the i7 5820 in your article?
Tim Dettmers says
Yes, the i7 5820K only has 28 PCIe lanes, and if you buy more than one GPU I would definitely choose a different CPU. The penalty will be observable when you use multiple GPUs, especially if you use 4x GTX 980 (personally, I would choose a cheap CPU < $250 with 40 lanes and instead buy 4x GTX Titan X — that will be sufficient). One note though: remember that in 2016 Q3/Q4 there will be Pascal GPUs, which are about 10 times better than a GTX Titan X (which is 50% better than a GTX 980), so it might be reasonable to go with a cheaper system and go all out once Pascal GPUs are released.
Mark says
Well, if I buy the CPU and motherboard now, then I would like to upgrade this system to Pascal in a couple of years. To keep this base system current over a few years, would you still recommend an X99 motherboard? If so, then I am stuck with only two choices, the 5930 or the 5960.
AMD has CPUs and associated motherboards, but I am not familiar with anything going in that direction. Do they have something here that is cheaper, about the same performance, and can handle up to 4 980/Titan/Pascal GPUs?
BTW, I thought I read somewhere that no current motherboard will handle Pascal; is that correct?
Tim Dettmers says
An X99 motherboard might be a bit overkill. You will not need most of its features, like DDR4 RAM. As you said, the Pascal GPUs will use their own interconnect which is much faster than PCIe — this would be another reason to spend less money on the current system. A system based on either LGA1150 or LGA2011 would be a good choice in terms of performance/cost.
I do not have experience with AMD either, but from the calculations in my blog post I am quite certain that it would also be a reasonable choice. I think in the end it just comes down how much money you have to spare.
Mark says
Great, thanks! Still, one thing remains unclear to a newbie builder like me. Is the X99 chipset wedded only to motherboards which will not work with Volta/Pascal? If not, then I can just swap out the motherboard but keep the X99-compatible CPU, memory, etc.
Also, since you are writing about convolutional nets: these are front-ends that feed neural nets. However, there is a new paper on using an SVM approach that needs less memory, is faster, and is just as accurate as any state-of-the-art convnet/neural-net combo. It keeps the convolution and pooling layers but replaces the neural net with a new fast-food (LOL) version of an SVM. They claim it works “better”:
“Deep Fried Convnets” by Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, Ziyu Wang.
The SVM versus neural-net battle continues.
Peyman says
Great guide Tim, thanks.
I am wondering if you get the display output from the same GPUs which you do the computation on?
I’m gonna buy a 40 lane i7 cpu, which is a LGA 2011 socket, along with a GTX 980. It seems that none of the CPUs with this socket have an internal GPU to drive display. And the other CPUs, LGA 1150 and LGA 1155, do not support more than 28 lanes.
So , the question is do I need a separate GPU to drive displays, or I can do the compute and run the displays on the same GPU?
Tim Dettmers says
You can use the same GPU for computation and for display, there will be no problem. The only disadvantage is, that you have a bit less memory. I use 3x 27 inch monitors at 1920×1080 and this config uses about 300-400 MB of memory which I hardly notice (well, I have 6GB of GPU memory). If you are worried about that memory you can get a cheap NVIDIA GT210 (which can hold 2 monitors) for $30 and run your display on that, so that your GTX 980 is completely free for CUDA applications.
Harry says
I realize this is an old post, but what motherboard did you pick? Most LGA2011 boards seem not to support dual x16, which I thought was the attraction of the 40 PCIe lanes.
Lucas Shen (@icrtiou) says
Hi Tim,
I’m interested in the GPU BIOS. Can you share which BIOS (the one with the new, more reasonable fan schedule) you are using right now? I have 2 Titan Xs waiting to be flashed.
Tim Dettmers says
I do not know if a GTX 970/GTX 980 BIOS is compatible with a GTX Titan X. Doing a quick Google search, I cannot find information about a GTX Titan X BIOS, which might be because the card is relatively new.
I think you will find the best information in folding@home and other crowd-computing forums (also cryptocurrency mining forums) to get this working.
Lucas Shen (@icrtiou) says
Thanks for the pointers. fah is very interesting XD, though I haven’t found a Titan X BIOS yet. Guess I have to live with it for a while.
I saw you have plans to release a deep learning library in the future. What framework will you be working with? Torch7, Theano, Caffe?
salem ameen says
Hi Tim,
I bought an MSI G80 laptop, which connects 2 GPUs using SLI, to learn and work on deep learning. Could you please tell me if I can run deep learning on this laptop, even on just one GPU?
Regards,
Tim Dettmers says
Yes, you will be able to use a single GPU for deep learning; SLI has nothing to do with CUDA. Even for dual-GPU cards (like the GTX 590) you can simply access both GPUs separately at the hardware level. This is also true for software libraries like Theano and Torch.
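To make this concrete, here is a minimal sketch (assuming PyCUDA is installed) that lists the GPUs as CUDA sees them; an SLI bridge does not merge them into one device, so both show up separately:

    import pycuda.driver as cuda

    cuda.init()
    for i in range(cuda.Device.count()):
        dev = cuda.Device(i)
        # Each physical GPU appears as its own CUDA device, regardless of SLI
        print(i, dev.name(), dev.total_memory() // (1024 ** 2), "MB")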
salemameen says
Thanks Tim,
Because I don’t have a background in coding, I want to use existing libraries. By the way, I bought this laptop not for gaming but for deep learning; I thought it would be more powerful with 2 GPUs, but even if only one works, that is fine for me. Regards,
Tim Dettmers says
You’re welcome! If you use Torch7 you will be able to use both GPUs quite easily. If you dread working with Lua (it is quite easy actually; most of the code will be in Torch7, not in Lua), I am also working on my own deep learning library which will be optimized for multiple GPUs, but it will take a few more weeks until it reaches a state which is usable for the public.
Mark says
Looking at two possible X99 boards, the ASUS X99-Deluxe (~$410 US) and the ASUS Rampage V Extreme (~$450 US). Unless you know something, I do not see that the extra $40 will make any difference for ML, but maybe it does for other stuff like multimedia or gaming.
Will start with 16 GB or 32 GB of DDR4 (haven’t decided yet, ~$500-$700 US).
I plan to use the 6-core i7-5930K (~$570 US). By your recommendation of 2 cores per GPU, that means a max of 3 GPUs.
GTX 980s are ~$500 US and GTX Titans ~$1000 US. Besides the loss of PCIe slots and extra liquid cooling, what speed difference would one expect between a system with two GTX 980s and an identical system with one GTX Titan?
Tim Dettmers says
I do not think the boards make a great difference; what matters is rather the chipset (X99) than anything else.
I think 6 cores should also be fine for 4 GPUs. On average, the second core per GPU is used only sparsely, so 3 threads can often feed 2 GPUs just fine.
One GTX Titan X will be about 150% as fast as a single GTX 980, so two GTX 980s are faster in total, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.
Mark says
“One GTX Titan X will be about 150% as fast as a single GTX 980, so two GTX 980s are faster in total, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.”
Thanks for the advice. Could you elaborate a bit more on the ease of use of one GPU versus two?
Also, I understand the Titan will be replaced this year with a faster GTX 980 Ti at the same price.
Tim Dettmers says
If you use Torch7 then it will be quite straightforward to use 2 GPUs on one problem (2 GPUs yield about 160% of the speed of a single GPU); other libraries do not support multiple GPUs well (Theano/Pylearn2, Caffe), and others are quite complicated to use (cuda-convnet2). So 160% is not much faster than a GTX Titan X, and if you also want to use different libraries, a single GTX Titan X would be faster overall (and have more memory, too!).
I am currently working on a library that combines the ease of use of Torch7 with very efficient parallelism (+190% speedup for 2 GPUs), but it will take a month or two until I have implemented all the needed features.
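To put rough numbers on the comparison above (a quick sketch, using one GTX 980 as the baseline and the estimates from this thread):

    gtx_980  = 1.0            # baseline: one GTX 980
    titan_x  = 1.5            # ~150% of a GTX 980
    two_980s = gtx_980 * 1.6  # ~160% scaling for 2 GPUs under Torch7

    print(two_980s / titan_x)  # ~1.07, i.e. only ~7% faster than a single Titan X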
benoit says
“Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)”
Just to be sure I get it: all GPUs are better on a PCIe 3.0 slot, and since each GPU seems to take 2 slots (due to its size), for 3 GPUs you’d need a motherboard with 6 PCIe 3.0 slots?
Tim Dettmers says
That’s right, modern GPUs will run faster on a PCIe 3.0 slot.
To install a card you only need a single PCIe 3.0 slot, but because each card is two slots wide, it will render the PCIe slot next to it unusable. For 3 GPUs you will need 5 PCIe slots, because the first two cards cover 4 slots and you need a single fifth slot for the last GPU.
So a motherboard with 5x PCIe 3.0 x16 is fine for 3 GPUs.
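The slot arithmetic above as a tiny sketch (a hypothetical helper, assuming every card is two slots wide and the last card can sit in the bottom slot):

    def pcie_slots_needed(n_gpus):
        # Each dual-width card blocks the slot below it, except the last one,
        # which only needs its own x16 slot.
        return 2 * (n_gpus - 1) + 1

    for n in range(1, 5):
        print(n, "GPU(s) ->", pcie_slots_needed(n), "PCIe slots")
    # 3 GPUs -> 5 slots, matching the example above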
benoit says
When I was mining bitcoins (unfortunately with Radeon cards 🙂 hence why I’m so interested in your article), I used PCIe risers like this one: http://www.amazon.fr/gp/product/B001CC3BNS?psc=1&redirect=true&ref_=oh_aui_search_detailpage
Do you think those can act as a bottleneck between the PCIe 3.0 slot of the motherboard and the GPU?
Using those could prove useful for finding cheaper motherboards with fewer PCIe 3.0 slots.
Tim Dettmers says
I also read a bit about risers when I was building my GPU cluster, and I often read that there was little to no degradation in performance. However, I do not know what PCIe lane configuration (e.g. 16x/8x/8x/8x, or 16x/16x/8x, which are standard for 4 and 3 GPUs, respectively) the motherboard will run under such a setup, and this might be a problem (the motherboard might not support it well). For cryptocurrency mining this is usually not an issue, because you do not have to transfer as much data over the PCIe interface as you do for deep learning; so probably nobody has ever tested this under deep learning conditions.
So I am not really sure how well it will work, but it might be worth trying this on one of your old mining motherboards and then buying a motherboard accordingly. If you decide to do so, please let me know. I would be really interested in what happens in that case and how well it works. Thanks!
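If you do test a riser, one simple check is to measure host-to-device bandwidth with and without it; a rough sketch (assuming PyCUDA and NumPy are installed) could look like this:

    import numpy as np
    import pycuda.autoinit          # creates a CUDA context on the default GPU
    import pycuda.driver as cuda

    size_mb = 256
    host = cuda.pagelocked_empty(size_mb * 1024 * 1024, dtype=np.uint8)  # pinned memory
    dev = cuda.mem_alloc(host.nbytes)

    start, end = cuda.Event(), cuda.Event()
    start.record()
    for _ in range(10):
        cuda.memcpy_htod(dev, host)   # repeated host->device copies over PCIe
    end.record()
    end.synchronize()
    ms = start.time_till(end)
    print("approx. %.1f GB/s host->device" % (10 * size_mb / 1024.0 / (ms / 1000.0)))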
Damien MENIGAUX says
I have tried and built many systems with passive risers. I always use similar ones (x16 to x16), but this one seems a bit cheap.
I never could make it work with Molex-powered risers, though. I would get packet loss and it would make the training fail.
Tim Dettmers says
I am currently using two x16-to-x16 risers and they work like a charm to prevent overheating of 4 RTX 2080 Tis in one case. Some other students at the University of Washington use the same setup with great success.
Stijn says
What is the largest dataset you could analyze (you can choose any specs you want), and how much time would it take?
Tim Dettmers says
The sky is the limit here. Google ran conv nets that took months to complete and were run on thousands of computers. Among practical data sets, ImageNet is one of the larger ones, and you can expect new data sets to grow exponentially from there. These data sets will grow as your GPUs get faster, so you can always expect that the state of the art on a large, popular data set will take about 2 weeks to train.
Mark says
What motherboards, by company and model number, do you recommend (ASUS, MSI, etc.) for a home PC that will also be used for multimedia (not concerned with gaming)? I am thinking of using a single GTX 980 but may think about adding more GPUs later (not a crucial concern). Also, which i7 CPU models do I need? Thanks for the help and the suggestion of the 960 as an alternative to the 580. I am learning Torch7 and can afford the 980.
Tim Dettmers says
I only have experience with the motherboards that I use myself, and one of them has a minor hardware defect, so I do not think my experience is representative of the overall product; the same goes for other hardware pieces. I think with the directions I gave in this guide you can find your parts on your own through lists with user ratings like http://pcpartpicker.com/parts/
Often it is quite practical to sort by rating and buy the first highly rated hardware piece that falls within your budget.
Felix Lau says
Thanks for this great post!
What are your thoughts on using a g2.xlarge instead of building the hardware? I believe the g2.xlarge is a lot slower than a GTX 980. However, it is possible to spawn many instances on AWS at the same time, which might be useful for tuning hyperparameters.
Tim Dettmers says
Indeed the g2.xlarge is much slower than the GTX 980, but also much cheaper. It is a cheap option if you want to train multiple independent neural nets, but it can be very messy. I only have experience with regular CPU instances, but with those it can take considerable time to manage one’s instances, especially if you are using AWS for large data sets together with spot instances; you will definitely be more productive with a local system. But in terms of affordability, GPU instances are just the best.
I just want to make you aware of other downsides of GPU instances, but the overall conclusion stays the same (less productivity, but very cheap): You cannot use multiple GPUs efficiently on AWS instances because the interconnect is just too slow and will be a major bottleneck (4 GPUs will run slower than 2). Also, the PCIe interconnect performance is crippled by the virtualization. This can be partly improved by a hacky patch, but overall the performance will still be bad (it might well be that 2 GPUs are worse than 1 GPU).
Also, like the GTX 580, the GPU instances do not support newer software, and this can be quite bad if you want to run modern variants of convolutional nets.
Mark Trovinger says
What IDE are you using in that pic? It looks like Eclipse but I can’t quite tell. Great article, a full breakdown is just what I needed!
Tim Dettmers says
Glad that you liked the article. I am using Eclipse (NVIDIA Nsight) for C++/C/CUDA in that pic; I also use Eclipse for Python (PyDev) and Lua (Koneki). While I am very satisfied with Eclipse for Python and CUDA, I am less satisfied with Eclipse for Lua (that is, Torch7) and I will probably switch to Vim for that.
sshidy2014 says
About (possibly) multiple GPUs: would NVIDIA SLI be of any significant help?
Tim Dettmers says
Thanks for your comment. NVIDIA SLI is an interface which lets each GPU render computer graphics frames and exchange them with the other GPU. The use of SLI is limited to this application, so doing computations and parallelizing them via SLI is not possible (one needs to use the PCIe interface for this). So CUDA cannot use SLI.
Sancho McCann says
Thoughts on the Tesla K40? It’s one of the GPUs available through NVIDIA’s academic hardware grant program: https://developer.nvidia.com/academic_hw_seeding
Tim Dettmers says
A K40 will be similar to a GTX Titan in terms of performance. The additional memory will be great if you train large conv nets and this is the main advantage of a K40. If you can choose the upcoming GTX Titan X in the academic grant program, this might be the better choice as it is much faster and will have the same amount of memory.
dh says
Why is the K40 so much more expensive when the GTX Titan X is cheaper but has more cores and higher bandwidth?
Tim Dettmers says
The K40 is a compute card which is used for scientific applications (often systems of partial differential equations) that require high precision. Tesla cards have additional double-precision units and memory error correction, which make them excel at high-precision tasks; these extra features, which are not needed in deep learning, are what make them so expensive.
zeecrux says
ImageNet training on a K40:
19.2 secs / 20 iterations (5,120 images) – with cuDNN
and on a GTX 770:
24.3 secs / 20 iterations (5,120 images) – with cuDNN
(source: http://caffe.berkeleyvision.org/performance_hardware.html)
I trained an ImageNet model on a GTX 960 and got this result:
around 26 secs / 20 iterations (5,120 images) – with cuDNN
So the GTX 960 is close to the GTX 770.
So for 450,000 iterations, it takes 120 hours (5 days) on a K40, and 162.5 hours (6.77 days) on a GTX 960.
Now a K40 costs > 3K USD, and a GTX 960 costs < 300 USD 🙂
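The extrapolation above is just a linear scaling of the seconds-per-20-iterations numbers; as a quick sketch:

    def total_hours(sec_per_20_iter, iterations=450000):
        return sec_per_20_iter * (iterations / 20.0) / 3600.0

    print(total_hours(19.2))  # K40:     ~120 hours  (~5 days)
    print(total_hours(26.0))  # GTX 960: ~162.5 hours (~6.8 days)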
Tim Dettmers says
Thanks, this is very useful information!
Hannes says
I find the recommendation of the GTX 580 for *any* kind of deep learning or budget a little dubious since it doesn’t support cuDNN. What good is a GPU that doesn’t support what’s arguably the most important library for deep learning at the moment?
Tim Dettmers says
This is a really good and important point. Let me explain why I think a GTX 580 is still good.
The problem with missing cuDNN support is really that you will need much more time to set everything up, and cutting-edge features that are implemented in libraries like Torch7 will often not be available. But it is not impossible to do deep learning on a GTX 580, and good, usable deep learning software exists. One will probably need to learn CUDA programming to add new features through one’s own CUDA kernels, but this just requires time, not money. For some people time and effort are relatively cheap, while money is expensive. If you think about students in developing countries this is very much true; if you earn $5500 a year (the average GDP per capita, PPP, of India; for the US this is $53k, so think about your GPU choice if you had 10 times less money) then you will be happy that there is a deep learning option that costs less than $120. Of course I could recommend cards like the GTX 750, which are also in that price range and which work with cuDNN, but I think a GTX 580 (much faster and more memory) is just better than a GTX 750 (cuDNN support) or other alternatives.
EDIT: I think it might be good to add another option which offers support for cuDNN but which is rather cheap, like the GTX 960 4GB (only a bit slower than the GTX 580), which will be available shortly for about $250-300. But as you can see, an additional $130-180 can be very painful if you are a student in a developing country.
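To give an idea of what “adding features through one’s own CUDA kernels” can look like in practice, here is an illustrative sketch (assuming PyCUDA) of a tiny hand-written ReLU kernel; this is not from any particular library, just an example of the kind of kernel you might write yourself on a card without cuDNN:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void relu(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i], 0.0f);
    }
    """)
    relu = mod.get_function("relu")

    x = np.random.randn(1024).astype(np.float32)
    x_gpu = cuda.mem_alloc(x.nbytes)
    cuda.memcpy_htod(x_gpu, x)
    relu(x_gpu, np.int32(x.size), block=(256, 1, 1), grid=(4, 1))
    cuda.memcpy_dtoh(x, x_gpu)
    print(x.min() >= 0)  # True: all negative activations were clamped to zero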
DarkIdeals says
A great 2016 update, if you happen to still frequent this blog (I don’t see any recent posts), is the new GTX 1060 Pascal graphics card, specifically the 3GB model. Now, 3GB is definitely cutting it a tad close on memory, but it’s a vastly superior choice to both a 580 AND a 960 4GB. The 1060 6GB model is equivalent to a GTX 980 in overall performance, and the 3GB 1060 model is only ever-so-slightly weaker, putting it at the level of a hugely overclocked GTX 970 (I’m talking ~1,650 MHz 970 levels, which is maybe ~5% below a 980).
And the 3GB 1060 can be had for a measly $199 brand new! It’s definitely something to consider at least. And if you still desperately need that extra VRAM, you can even get the 6GB version of the 1060 (which, as I mentioned, is about tied with an average GTX 980) for as little as $249 right now!
Tim Dettmers says
I updated my GPU recommendation post with the GTX 1060, but I did not mention the 3GB version, which did not exist at that time. Thanks for letting me know!
Khalid says
Hi,
I want to get a system with a GPU for speech processing and deep learning applications using the Python language.
Can you please send me reasonable system hardware requirements?
Tim Dettmers says
For these applications a “standard” deep learning system will be sufficient. You can find examples of such systems in the comments section (search for “pcpartpicker” and you will probably find some examples).
Rusty Scupper says
What the heck… Could you have skipped the blather and gotten to the point? There are only a few specific combinations that support what you were trying to explain, so maybe something like:
– GTX 580/980
– i5 / i7 CPU
- Lots of RAM (duh)
– Fast hard drive
Tim Dettmers says
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.
zeng says
授人以鱼不如授人以渔 (“it is better to teach a man to fish than to give him a fish”); the same proverb exists in Chinese.
Very helpful, thanks for sharing.
lU says
Covers everything I wanted to know and even more, thanks!
It also confirms my choice of a Pentium G3258 for a single-GPU config. Insanely cheap, and it even has ECC memory support, something that some folks might want to have.
cicero19 says
Hi Tim,
This is a great overview. Wondering if you could recommend any cost-effective CPUs with 40 PCIe lanes.
Thanks!
Tim Dettmers says
There are many CPUs in all different price ranges which are reasonable choices, and most of these CPUs support 40 PCIe lanes. The best practice is probably to look at a site like http://pcpartpicker.com/parts/cpu/ and select a CPU with a good rating and a good price; then check that it supports 40 lanes and you will be good to go.
Alexander Rezanov says
Good afternoon. Can you please help me? There is a used computer on offer in my neighbourhood for about $800:
i7 4790k
MSI 1080
DDR3 4Gx2 + DDR3 8Gx2
WD 1TB SSD
Is it a good choice for getting started with deep learning?
Tim Dettmers says
Yes, it would be okay for a beginner. You can run most models and explore deep learning problems. You will not be able to run some of the largest deep learning models, but that should not be your goal while you are learning and exploring deep learning anyway.
Alexander Rezanov says
Excuse me for the off-topic question, but are you familiar with TensorFlow?