Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. Without a GPU, this might look like months of waiting for an experiment to finish, or running an experiment for a day or more only to see that the chosen parameters were off. With a good, solid GPU, one can quickly iterate over deep learning networks and run experiments in days instead of months, hours instead of days, minutes instead of hours. So making the right choice when it comes to buying a GPU is critical. How do you select the GPU which is right for you? This blog post will delve into that question and lend you advice which will help you make a choice that is right for you.
TL;DR
Having a fast GPU is very important when one begins to learn deep learning, as it allows rapid gains in practical experience, which is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback, it just takes too much time to learn from one’s mistakes, and it can be discouraging and frustrating to go on with deep learning. With GPUs, I quickly learned how to apply deep learning on a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition using a deep learning approach, where the task was to predict weather ratings for a given tweet. In that competition I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged myself into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks efficiently across multiple GPUs, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks, like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition, received almost no speedup.
Later I ventured further down that road and developed a new 8-bit compression technique which enables you to parallelize dense or fully connected layers much more efficiently with model parallelism than 32-bit methods do.
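The details of that technique are beyond the scope of this post, but the core idea of shrinking what you send between GPUs can be illustrated with a minimal sketch (the simple linear scaling scheme below is an assumption for illustration only, not the actual encoding the technique uses):

```python
import numpy as np

def quantize_8bit(x):
    """Compress a float32 tensor to int8 plus a per-tensor scale.
    This linear scheme is only an illustration; the real method uses a
    different, more elaborate encoding."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_8bit(q, scale):
    """Recover an approximate float32 tensor from the int8 payload."""
    return q.astype(np.float32) * scale

# Sending the int8 payload instead of float32 cuts inter-GPU traffic to a quarter,
# which is what makes model parallelism over slow interconnects more viable.
activations = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_8bit(activations)
recovered = dequantize_8bit(q, scale)
print(q.nbytes / activations.nbytes)  # 0.25
```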
However, I also found that parallelization can be horribly frustrating. I naively optimized parallel algorithms for a range of problems, only to find that, even with optimized custom code, parallelism across multiple GPUs does not work well given the effort you have to put in. You need to be very aware of your hardware and how it interacts with deep learning algorithms to gauge whether you can benefit from parallelization in the first place.

Since then, parallelism support for GPUs has become more common, but it is still far from universally available and efficient. The only deep learning library which currently implements efficient algorithms across GPUs and across computers is CNTK, which uses Microsoft’s special parallelization algorithms of 1-bit quantization (efficient) and block momentum (very efficient). With CNTK and a cluster of 96 GPUs you can expect a near-linear speedup of about 90x-95x. PyTorch might be the next library to support efficient parallelism across machines, but the library is not there yet. If you want to parallelize on one machine, then your options are mainly CNTK, Torch, and PyTorch. These libraries yield good speedups (3.6x-3.8x) and have predefined algorithms for parallelism on one machine across up to 4 GPUs. There are other libraries which support parallelism, but these are either slow (like TensorFlow with 2x-3x) or difficult to use for multiple GPUs (Theano) or both.
If you put value on parallelism I recommend using either PyTorch or CNTK.
Using Multiple GPUs Without Parallelism
Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. You gain no speedups, but you get more information about performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
This is psychologically important if you want to learn deep learning. The shorter the interval between performing a task and receiving feedback on it, the better the brain is able to integrate relevant memory pieces for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets, you will more quickly get a feel for what it takes to perform well; you will more readily be able to detect patterns in the cross-validation error and interpret them correctly. You will be able to detect patterns which hint at what parameter or layer needs to be added, removed, or adjusted.
So overall, one can say that one GPU should be sufficient for almost any task, but that multiple GPUs are becoming more and more important for accelerating your deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly. I personally would rather have many small GPUs than one big one, even for my research experiments.
So what kind of accelerator should I get? NVIDIA GPU, AMD GPU, or Intel Xeon Phi?
NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries become available in the future, I would stick with NVIDIA: the GPU computing or GPGPU community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid programming advice are readily available.
Additionally, NVIDIA went all-in on deep learning while it was still in its infancy. This bet paid off. While other companies now put money and effort behind deep learning, they are still far behind due to their late start. Currently, using any software-hardware combination for deep learning other than NVIDIA-CUDA will lead to major frustrations.
In the case of Intel’s Xeon Phi, it is advertised that you will be able to use standard C code and easily transform that code into accelerated Xeon Phi code. This feature might sound quite interesting because you might think that you can rely on the vast resources of existing C code. However, in reality only very small portions of C code are supported, so this feature is not really useful, and most of the C code that you will be able to run will be slow.
I worked on a Xeon Phi cluster with over 500 Xeon Phis and the frustrations were endless. I could not run my unit tests because the Xeon Phi MKL is not compatible with NumPy; I had to refactor large portions of code because the Intel Xeon Phi compiler is unable to make proper reductions for templates — for example for switch statements; and I had to change my C interface because some C++11 features are just not supported by the Intel Xeon Phi compiler. All this led to frustrating refactorings which I had to perform without unit tests. It took ages. It was hell.
And then, when my code finally executed, everything ran very slowly. There are bugs, or perhaps just problems in the thread scheduler, which cripple performance if the tensor sizes on which you operate change in succession. For example, if you have differently sized fully connected layers or dropout layers, the Xeon Phi is slower than the CPU. I replicated this behavior in an isolated matrix-matrix multiplication example and sent it to Intel. I never heard back from them. So stay away from Xeon Phis if you want to do deep learning!
Fastest GPU for a given budget
TL;DR
Your first question might be: what is the most important feature for fast GPU performance in deep learning? Is it CUDA cores? Clock speed? RAM size?
It is none of these; the most important feature for deep learning performance is memory bandwidth.
In short: GPUs are optimized for memory bandwidth at the cost of memory access time (latency). CPUs are designed for the exact opposite: they can do quick computations on small amounts of memory, for example multiplying a few numbers (3*6*9), but for operations on large amounts of memory like matrix multiplication (A*B*C) they are slow. GPUs excel at problems that involve large amounts of memory due to their memory bandwidth. Of course there are more intricate differences between GPUs and CPUs, and if you are interested in why GPUs are such a good match for deep learning, you can read more about it in my Quora answer on this very question.
So if you want to buy a fast GPU, first and foremost look at the bandwidth of that GPU.
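To make the bandwidth argument concrete, here is a back-of-the-envelope sketch (my own simplification, not a benchmark; the 40 GB/s figure for a typical desktop CPU is a rough assumption, the 320 GB/s figure is the GTX 1080 spec mentioned below):

```python
def min_time_ms(bytes_moved, bandwidth_gb_s):
    """Lower bound on runtime: an operation cannot finish faster than it takes
    to stream its operands through memory at the given bandwidth."""
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

# One fully connected layer: (128 x 4096) @ (4096 x 4096). Every operand has to
# be read or written at least once, which is roughly 71 MB of memory traffic.
m, k, n = 128, 4096, 4096
bytes_moved = (m * k + k * n + m * n) * 4  # float32
print("GPU (~320 GB/s):", round(min_time_ms(bytes_moved, 320), 3), "ms at best")
print("CPU (~40 GB/s): ", round(min_time_ms(bytes_moved, 40), 3), "ms at best")
```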
Evaluating GPUs via Their Memory Bandwidth

Bandwidth can be compared directly within an architecture; for example, the performance of Pascal cards like the GTX 1080 and GTX 1070 can be compared by looking at their memory bandwidth alone: a GTX 1080 (320GB/s) is about 25% (320/256) faster than a GTX 1070 (256GB/s). However, across architectures, for example Pascal vs. Maxwell (GTX 1080 vs. GTX Titan X), a direct comparison is not possible, because different architectures with different fabrication processes (in nanometers) utilize the given memory bandwidth differently. This makes everything a bit tricky, but overall, bandwidth alone will give you a good overview of roughly how fast a GPU is. To determine the fastest GPU for a given budget, one can use this Wikipedia page and look at bandwidth in GB/s; the listed prices are quite accurate for newer cards (900 and 1000 series), but older cards are significantly cheaper than the listed prices – especially if you buy those cards via eBay. For example, a regular GTX Titan X goes for around $550 on eBay.
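In code, the within-architecture comparison above is nothing more than a ratio of spec-sheet bandwidth numbers (the figures below are the official Pascal specs):

```python
# Spec-sheet memory bandwidth of some Pascal cards, in GB/s
bandwidth = {"GTX 1080 Ti": 484, "GTX 1080": 320, "GTX 1070": 256, "GTX 1060 (6GB)": 192}

def relative_speed(card_a, card_b):
    """Rough speed ratio of two cards of the *same* architecture,
    estimated from memory bandwidth alone."""
    return bandwidth[card_a] / bandwidth[card_b]

print(relative_speed("GTX 1080", "GTX 1070"))          # ~1.25, i.e. about 25% faster
print(relative_speed("GTX 1080 Ti", "GTX 1060 (6GB)"))  # ~2.5
```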
Another important factor to consider, however, is that not all architectures are compatible with cuDNN. Since almost all deep learning libraries use cuDNN for convolutional operations, this restricts the choice of GPUs to Kepler or better, that is, the GTX 600 series or above. On top of that, Kepler GPUs are generally quite slow. So you should prefer GPUs of the 900 or 1000 series for good performance.
To give a rough estimate of how the cards perform with respect to each other on deep learning tasks, I constructed a simple chart of GPU equivalence. How do you read this? For example, one GTX 980 is as fast as 0.35 Titan X Pascal, or in other terms, a Titan X Pascal is almost three times faster than a GTX 980.
Please note that I do not own all of these cards and I did not run deep learning benchmarks on all of them. The comparisons are derived from the cards’ specs together with compute benchmarks (some cryptocurrency mining workloads are computationally comparable to deep learning). So these are rough estimates. The real numbers could differ a little, but generally the error should be minimal and the order of cards should be correct. Also note that small networks which under-utilize the GPU will make larger GPUs look bad. For example, a small LSTM (128 hidden units; batch size > 64) on a GTX 1080 Ti will not be that much faster than running it on a GTX 1070. To get the performance difference shown in the chart, one needs to run larger networks, say an LSTM with 1024 hidden units (and batch size > 64). This is also important to keep in mind when choosing the GPU which is right for you.
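A rough way to see the under-utilization point is to count the work per LSTM time step (a back-of-the-envelope sketch of my own that ignores the element-wise gate operations):

```python
def lstm_step_flops(hidden, batch, inputs=None):
    """Approximate FLOPs of one LSTM time step: the 4 gates each do a
    (batch x (inputs+hidden)) @ ((inputs+hidden) x hidden) matrix multiply,
    at 2 FLOPs per multiply-add."""
    inputs = hidden if inputs is None else inputs
    return 4 * 2 * batch * (inputs + hidden) * hidden

small = lstm_step_flops(hidden=128, batch=64)
large = lstm_step_flops(hidden=1024, batch=64)
# The 1024-unit LSTM does ~64x more work per step, enough to keep a big GPU busy;
# the 128-unit LSTM leaves most of a GTX 1080 Ti idle.
print(large / small)
```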

Cost Efficiency Analysis
If we now take the rough performance metrics from above and divide them by the cost of each card — that is, if we plot how much bang you get for your buck — we end up with a chart which to some degree reflects my recommendations.

Note, however, that this way of ranking GPUs is quite biased. First of all, it does not take the memory size of the GPU into account. You will often need more memory than a GTX 1050 Ti can provide, and thus some of the high-ranking cards, while cost efficient, are not practical solutions. Similarly, it is more difficult to utilize 4 small GPUs rather than 1 big GPU, and thus small GPUs are at a disadvantage. Furthermore, you cannot buy 16 GTX 1050 Ti to get the performance of 4 GTX 1080 Ti; you would also need to buy 3 additional computers, which is expensive. If we take this last point into account the chart looks like this.

So in this case — which practically represents the case where you want to buy many GPUs — the big GPUs unsurprisingly win, since it is cheaper to buy cost-efficient computer + GPU combinations (rather than merely cost-efficient GPUs). However, this is still a biased view of GPU selection. It does not matter how cost efficient 4 GTX 1080 Ti in a box are if you have a limited amount of money and cannot afford them in the first place. So you might not be interested in how cost efficient cards are, but rather, for the amount of money that you have, what is the best performing system that you can buy? You also have to deal with other questions such as: How long will I keep this GPU? Do I want to upgrade the GPU or the whole computer in a few years? Do I want to sell the current GPUs some time in the future and buy new, better ones?
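The computer-plus-GPU adjustment described above can be sketched explicitly; all figures below are illustrative placeholders, not measured prices or benchmarks:

```python
# Relative performance and prices are placeholders for illustration only.
perf = {"GTX 1050 Ti": 0.25, "GTX 1080 Ti": 1.00}   # relative performance
price = {"GTX 1050 Ti": 140, "GTX 1080 Ti": 700}     # rough card price in $
system_cost = 800                                    # assumed cost of one 4-GPU computer

def perf_per_dollar(card, n_gpus):
    """Cost efficiency once the computers that host the GPUs are included:
    every 4 GPUs need one additional machine."""
    n_boxes = -(-n_gpus // 4)  # ceiling division
    total_cost = n_gpus * price[card] + n_boxes * system_cost
    return n_gpus * perf[card] / total_cost

# Same nominal total performance, very different total cost:
print(perf_per_dollar("GTX 1050 Ti", 16))  # needs 4 computers
print(perf_per_dollar("GTX 1080 Ti", 4))   # needs 1 computer -> wins
```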
So you can see that it is not easy to make the right choice. However, if you take a balanced view on all of these issues, you would come to conclusions which are similar to the following recommendations.
General GPU Recommendations
Generally, I would recommend the GTX 1080 Ti, GTX 1080, or GTX 1070. They are all excellent cards, and if you have the money for a GTX 1080 Ti you should go ahead with that. The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X (Maxwell). The GTX 1080 was a bit less cost efficient than the GTX 1070, but since the GTX 1080 Ti was introduced its price fell significantly, and now the GTX 1080 can compete with the GTX 1070. All three of these cards should be preferred over the GTX 980 Ti due to their increased memory of 11GB and 8GB (instead of 6GB).
A memory of 8GB might seem a bit small, but for many tasks this is more than sufficient. For example, for Kaggle competitions, most image datasets, deep style, and natural language understanding tasks, you will encounter few problems.
The GTX 1060 is the best entry GPU if you want to try deep learning for the first time, or if you want to occasionally use it for Kaggle competitions. I would not recommend the GTX 1060 variant with 3GB of memory, since the other variant’s 6GB can already be quite limiting. However, for many applications 6GB is sufficient. The GTX 1060 is slower than a regular Titan X, but it is comparable to the GTX 980 in both performance and eBay price.
In terms of bang for buck, the 10 series is quite well designed. The GTX 1050 Ti, GTX 1060, GTX 1070, GTX 1080, and GTX 1080 Ti stand out. The GTX 1060 and GTX 1050 Ti are for beginners, the GTX 1070 and GTX 1080 are versatile options for startups and some parts of research and industry, and the GTX 1080 Ti stands solid as an all-around high-end option.
I generally would not recommend the NVIDIA Titan Xp as it is too pricey for its performance. Go instead with a GTX 1080 Ti. However, the NVIDIA Titan Xp still has its place among computer vision researchers who work on large datasets or video data. In these domains every GB of memory counts, and the NVIDIA Titan Xp has 1GB more than the GTX 1080 Ti, which is an advantage in this case. I would not recommend the NVIDIA Titan X (Pascal) anymore, since the NVIDIA Titan Xp is faster and almost the same price. Due to the scarcity of these GPUs on the market, however, if you cannot find a NVIDIA Titan Xp to buy, you could also go for a Titan X (Pascal). You might also be able to snatch a cheap Titan X (Pascal) from eBay.
If you already have GTX Titan X (Maxwell) GPUs an upgrade to NVIDIA Titan X (Pascal) or NVIDIA Titan Xp is not worth it. Save your money for the next generation of GPUs.
If you are short on money but you know that 12GB of memory is important for you, then the GTX Titan X (Maxwell) from eBay is also an excellent option.
However, most researchers do well with a GTX 1080 Ti. The one extra GB of memory is not needed for most research and most applications.
I personally would go with multiple GTX 1070 or GTX 1080 cards for research. I would rather run a few more experiments which are a bit slower than just one experiment which is faster. In NLP the memory constraints are not as tight as in computer vision, so a GTX 1070/GTX 1080 is just fine for me. The tasks I work on and how I run my experiments determine the best choice for me, which is either a GTX 1070 or GTX 1080.
You should reason in a similar fashion when you choose your GPU. Think about what tasks you work on and how you run your experiments and then try to find a GPU which suits these requirements.
The options are now more limited for people who have very little money for a GPU. GPU instances on Amazon Web Services are quite expensive and slow now and no longer pose a good option if you have less money. I do not recommend a GTX 970, as it is slow, still rather expensive even when bought used ($150 on eBay), and there are memory problems associated with the card to boot. Instead, try to get the additional money to buy a GTX 1060, which is faster, has more memory, and has no memory problems. If you just cannot afford a GTX 1060, I would go with a GTX 1050 Ti with 4GB of RAM. The 4GB can be limiting, but you will be able to play around with deep learning, and if you make some adjustments to models you can get good performance. A GTX 1050 Ti would be suitable for most Kaggle competitions, although it might limit your competitiveness in some competitions.
The GTX 1050 Ti in general is also a solid option if you just want to try deep learning for a bit without any serious commitments.
Amazon Web Services (AWS) GPU instances
In the previous version of this blog post I recommended AWS GPU spot instances, but I would no longer recommend this option. The GPUs on AWS are now rather slow (one GTX 1080 is four times faster than an AWS GPU) and prices have shot up dramatically in the last months. It now again seems much more sensible to buy your own GPU.
Conclusion
With all the information in this article you should be able to reason which GPU to choose by balancing the required memory size, bandwidth in GB/s for speed and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is to get a GTX 1080 Ti, GTX 1070, or GTX 1080, if you can afford them; a GTX 1060 if you just start out with deep learning or you are constrained by money; if you have very little money, try to afford a GTX 1050 Ti; and if you are a computer vision researcher you might want to get a Titan Xp.
TL;DR advice
Best GPU overall (by a small margin): Titan Xp
Cost efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
Cost efficient and cheap: GTX 1060 (6GB)
I work with data sets > 250GB: GTX Titan X (Maxwell), NVIDIA Titan X Pascal, or NVIDIA Titan Xp
I have little money: GTX 1060 (6GB)
I have almost no money: GTX 1050 Ti (4GB)
I do Kaggle: GTX 1060 (6GB) for any “normal” competition, or GTX 1080 Ti for “deep learning competitions”
I am a competitive computer vision researcher: NVIDIA Titan Xp; do not upgrade from existing Titan X (Pascal or Maxwell)
I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 or GTX 1080 might also be a solid choice — check the memory requirements of your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060 (6GB). Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)
Update 2017-04-09: Added cost efficiency analysis; updated recommendation with NVIDIA Titan Xp
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgements
I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.
[Image source: NVIDIA CUDA/C Programming Guide]
How much slower are mid-level GPUs? For example, I have a Mac with a GeForce 750M; is it suitable for training DNN models?
There is a GT 750M version with DDR3 memory and one with GDDR5 memory; the GDDR5 version will be about three times as fast as the DDR3 version. With a GDDR5 model you will probably run three to four times slower than typical desktop GPUs, but you should still see a good speedup of 5-8x over a desktop CPU. So a GDDR5 750M will be sufficient for running most deep learning models. If you have the DDR3 version, then it might be too slow for deep learning (smaller models might take a day; larger models a week or so).
Thanks a lot Mr.Tim D
You have a very lucid approach to answering complicated stuff. I hope you could point out what impact floating point 32 vs 16 makes on speedup, and how a 1080 Ti stacks up against the Quadro GP100?
A P100 chip, be it the P100 itself or the GP100, should be roughly 10-30% faster than a Titan Xp. I do not know of any hard, unbiased data on half-precision, but I think you could expect a speedup of about 75-100% on P100 cards compared to cards with no FP16 support, such as the Titan Xp.
You don’t need a really powerful GPU for Inference. Intel’s on-board graphics is more than enough for getting real-time performance for most of the applications (unless it is a high frame rate VR experience). For training, you obviously need an NVIDIA card, but it is a one-time thing.
Is it any good for processing non-mathematical or non-floating-point data via GPU? How about generating hashes and keypairs?
Sometimes it is good, but often it isn’t – it depends on the use-case. One application of GPUs for hash generation is bitcoin mining. However, the main measure of success in bitcoin mining (and cryptocurrency mining in general) is to generate as many hashes per watt of energy as possible; GPUs are in the mid-field here, beating CPUs but being beaten by FPGAs and other low-energy hardware.
In the case of keypair generation, e.g. in MapReduce, you often do little computation but lots of IO operations, so GPUs cannot be utilized efficiently. For many applications GPUs are significantly faster in one case but not in another similar case, e.g. for some but not all regular expressions, and this is the main reason why GPUs are not used in those other cases.
Hi, nice writeup! Are you using single or double precision floats? You said divide by 4 for the byte size, which sounds like 32 bit floats, but then you point out that the Fermi cards are better than Kepler, which is more true when talking about double precision than single, as the Fermi cards have FP64 at 1/8th of FP32 while Kepler is 1/24th. Trying to decide myself whether to go with the cheaper Geforce cards or to spring for a Titan.
Thanks for your comment, James. Yes, deep learning is generally done with single precision computation, as the gains in precision do not improve the results greatly.
It depends on what types of neural networks you want to train and how large they are. But I think a good decision would be to go for a 3GB GTX 580 from eBay, and then upgrade to a GTX 1000 series card next year. The GTX 1000 series cards will probably be quite good for deep learning, so waiting for them might be a wise choice.
Thank you for the great post. Could you say something about pairing a new card with an older CPU?
For example, I have a 4-core Intel Q6600 from 2007 with 8GB of RAM (without the possibility to upgrade). Could this be a bottleneck if I choose to buy a new GPU for CUDA and ML?
I’m also not sure which is the better choice: a GTX 780 with 2GB of RAM, or a GTX 970 with 4GB of RAM. The 780 has more cores, but they are a bit slower…
http://www.game-debate.com/gpu/index.php?gid=2438&gid2=880&compare=geforce-gtx-970-4gb-vs-geforce-gtx-780
A nice list of characteristics; still, I’m not sure which would be the better choice. I would use the GPU for all kinds of problems, perhaps some with smaller networks, but I wouldn’t be shy of trying something bigger when I feel comfortable enough.
What would you recommend?
Hi enedene, thanks for your question!
Your CPU should be sufficient and should slow you down only slightly (1-10%).
My post is now a bit outdated as the new Maxwell GPUs have been released. The Maxwell architecture is much better than the Kepler architecture and so the GTX 970 is faster than the GTX 780 even though it has lower bandwidth. So I would recommend getting a GTX 970 over a GTX 780 (of course, a GTX 980 would be better still, but a GTX 970 will be fine for most things, even for larger nets).
For low budgets I would still recommend a GTX 580 from eBay.
I will update my post next week to reflect the new information.
Thank you for the quick reply. I will most probably get a GTX 970. Looking forward to your updated post, and to competing against you on Kaggle. 🙂
Hi Tim. What open-source package would you recommend if the objective was to classify non-image data? Most packages are specifically designed for classifying images.
I have only superficial experience with most libraries, as I usually used my own implementations (which I adjusted from problem to problem). However, from what I know, Torch7 is really strong for non-image data, but you will need to learn some Lua to adjust some things here and there. I think pylearn2 is also a good candidate for non-image data, but if you are not used to Theano then you will need some time to learn how to use it in the first place. Libraries like deepnet – which is programmed on top of cudamat – are much easier to use for non-image data, but the available algorithms are partially outdated and some algorithms are not available at all.
I think you always have to change a few things to make it work for new data, so you might also want to check out libraries like Caffe and see if you like the API better than that of other libraries. A neater API might outweigh the cost of needing to change things to make them work in the first place. So the best advice might be just to look at documentation and examples, try a few libraries, and then settle for something you like and can work with.
Hi Tim. Do you have any references that explain why the convolutional kernels need more memory beyond that used by the network parameters? I am trying to figure out why Alex’s net needs just over 3.5GB when the parameters alone only take ~0.4GB… what’s hogging the rest?!?
Thanks for your comment Monica. This is indeed something I overlooked, which is actually a quite important issue when selecting a GPU. I hope to address this in an update I aim to write soon.
To answer your question: The increased memory usage stems from memory that is allocated during the computation of the convolutions to increase computational efficiency. Because image patches overlap, one saves a lot of computation by storing some of the image values so they can be reused for an overlapping image patch. Albeit at a cost of device memory, one can achieve tremendous increases in computational efficiency when one does this cleverly, as Alex does in his CUDA kernels. Other solutions that use fast Fourier transforms (FFTs) are said to be even faster than Alex’s implementation, but they need even more memory.
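To get a feel for the numbers, here is a rough sketch of the extra buffer such an im2col-style lowering allocates; the layer dimensions are made up for illustration and are not from Alex’s net:

```python
def im2col_buffer_mb(batch, channels, out_h, out_w, kernel=3, dtype_bytes=4):
    """Memory of the lowered matrix: every output position stores its own
    copy of the overlapping kernel x kernel x channels input patch."""
    elements = batch * out_h * out_w * channels * kernel * kernel
    return elements * dtype_bytes / 1024**2

# Hypothetical conv layer: batch 128, 96 channels, 55x55 output, 3x3 kernels
# -> well over a gigabyte just for the patch buffer, far more than the weights.
print(round(im2col_buffer_mb(128, 96, 55, 55), 1), "MB")
```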
If you are aiming to train large convolutional nets, then a good option might be to get a normal GTX Titan from eBay. If you use convolutional nets heavily, two, or even four GTX 980 (much faster than a Titan) also make sense if you plan to use the convnet2 library which supports dual GPU training. However, be aware that NVIDIA might soon release a Maxwell GTX Titan equivalent which would be much better than the GTX 980 for this application.
Hi Tim. Thanks for this very informative post.
Do you know how much of a boost Maxwell gives? I’m trying to decide between a GTX 850M with 4GB DDR3 and a Quadro K1100M with 2GB GDDR5. I understand that the K1100M is roughly equivalent to the 750M. Which gives the bigger boost: going from Kepler to Maxwell or from Geforce to Quadro (including from DDR3 to GDDR5)?
Thanks so much!
Going from DDR3 to GDDR5 is a larger boost than going from Kepler to Maxwell. However, the Quadro K1100M has only slightly faster bandwidth than the GTX 850M, which will probably cancel out the benefits, so both cards will perform at about the same level. If you want to use convolutional neural networks, the 4GB of memory on the GTX 850M might make the difference; otherwise I would go with the cheaper option.
Thanks!
Hi, I am planning to replicate the ImageNet object identification problem using CNNs as published in the recent paper by G. Hinton et al. (just as an exercise to learn about deep learning and CNNs).
1. What GPU would you recommend considering I am student. I heard the original paper used 2 GTX 580 and yet took a week to train the 7 layer deep network? Is this true? Could the same be done using a single GTX 580 or GTX 970? How much time will it take to train the same on a GTX 970 or a single GTX 580 ? ( A week of time is okay for me )
2. What kind of modifications in the original implementation could I do ( like 5 or 6 hidden layers instead of 7, or lesser number of objects to detect etc. ), to make this little project of mine easier to implement on a lower budget while at the same time helping me learn about the deep nets and CNNs ?
3. What kind of libraries would you recommend for the same? Torch7 or pylearn2 / theano ( I am fairly proficient in python but not so much in lua ).
4. Is there a small scale implementation of this anywhere in github etc?
Also thanks a lot for the wonderful post. 🙂
1. All GPUs with 4 GB should be able to run the network; you can run somewhat smaller networks on one GTX 580; these networks will always take more than 5 days to train, even on the fastest GPUs
2. Read about convolutional neural networks, then you will understand what the layers do and how you can use them. This is a good, thorough tutorial: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
3. I would try pylearn2, convnet2, and caffe and pick which suits you best
4. The implementations are generally written to be general, i.e. you run small and large networks with the same code; it is only a difference in the parameters passed to a function. If by “small” you mean a less complex API, I heard good things about the Lasagne library
Hi Tim, super interesting article. What case did you use for the build that had the GPUs vertical?
It looks like it is vertical, but it is not. I took that picture while my computer was lying on the ground. However, I use the Cooler Master HAF X for both of my computers. I bought this tower because it has a dedicated large fan for the GPU slot – in retrospect I am unsure if the fan is helping that much. There is another tower I saw that actually has vertical slots, but again I am unsure if that helps so much. I would probably opt for liquid cooling for my next system. It is more difficult to maintain, but has much better performance. With liquid cooling, almost any case that fits the mainboard and GPUs would do.
It looks like there is a bracket supporting the end of the cards, did that come with the case or did you put them in to support the cards?
(Duplicate paragraph: “I quickly found”.)
Thanks, fixed.
Great article!
You did not talk about the number of cores present in a graphics card (CUDA cores in the case of NVIDIA). My perception was that a card with more cores will always be better, because more cores lead to better parallelism and hence faster training, given that the memory is enough. Please correct me if my understanding is wrong.
Which card would you suggest for RNNs and a data size of 15-20 GB (Wikipedia/Freebase size)? Would a 960 be good enough? Or should I go with a 970? The 580 is not available in my country.
Thanks for your comment. CUDA cores relate more closely to FLOPS and not to bandwidth, but it is the bandwidth that you want for deep learning. So CUDA cores are a bad proxy for performance in deep learning. What you really want is a high memory bus width (e.g. 384 bits) and a high memory clock (e.g. 7000MHz) – anything other than that hardly matters for deep learning.
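These two numbers multiply directly into the bandwidth figure you see on spec sheets; as a quick sanity check (7000MHz being the effective GDDR5 data rate):

```python
def memory_bandwidth_gb_s(bus_width_bits, effective_clock_mhz):
    """Peak bandwidth = bytes per transfer * transfers per second."""
    return (bus_width_bits / 8) * effective_clock_mhz * 1e6 / 1e9

# A 384-bit bus at 7000 MHz effective gives the familiar 336 GB/s figure
# of cards like the GTX Titan X or GTX 780 Ti.
print(memory_bandwidth_gb_s(384, 7000))  # ~336 GB/s
```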
Mat Kelcey did some tests with theano for the GTX 970 and it seems that the GPU has no memory problems for compute – so the GTX 970 might be a good choice then.
Thanks a lot 🙂
Hi Tim,
Thanks for your excellent blog posts. I am a statistician and I want to go into the deep learning area. I have a budget of $1500-2000. Can you recommend a good desktop system for deep learning purposes? From your blog post I know that I will get a GTX 980, but what about CPU, RAM, and motherboard requirements?
Thanks
Hi Yakup,
I plan to write a blog post with detailed advice on this topic sometime in the next two weeks, and if you can wait for that you might get some insights into what hardware is right for you. But I also want to give you some general, less specific advice.
If you might be getting more GPUs in the future, it is better to buy a motherboard with PCIe 3.0 and 7 PCIe x16 slots (one GPU typically takes two slots). If you will use only 1-2 GPUs, then almost any motherboard will do (PCIe 2.0 would also be okay). Plan to get a power supply unit (PSU) which has enough watts to power all the GPUs you will get in the future (e.g. if you will get a maximum of 4, then buy a 1400+ watt PSU). The CPU does not need to be fast or have many cores. Twice as many threads as you have GPUs is almost always sufficient (for Intel CPUs we mostly have: 1 core = 2 threads); any CPU above 3GHz is okay; less than 3GHz might give you a tiny penalty in speed of about 1-3%. Fast memory caches are often more important for CPUs, but in the big picture they also contribute little to overall performance; a typical CPU with slow memory will decrease overall performance by a few percent.
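As a small helper for the PSU sizing rule above (the TDP numbers are placeholders; check the spec sheets of your actual parts):

```python
def psu_watts(gpu_tdps, cpu_tdp=100, headroom=0.2):
    """Add up GPU and CPU power draw and leave some headroom.
    All TDP values here are assumptions for illustration."""
    total = sum(gpu_tdps) + cpu_tdp
    return int(total * (1 + headroom))

# Four GPUs at ~250W each plus a CPU lands in the 1400W+ PSU region.
print(psu_watts([250, 250, 250, 250]))  # ~1320W
```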
One can work around a small amount of RAM by loading data sequentially from your hard drive into RAM, but it is often more convenient to have more; twice the RAM your GPU has gives you more freedom and flexibility (i.e. 8GB of RAM for a GTX 980). An SSD will make it more comfortable to work, but, similarly to the CPU, offers little performance gain (0-2%; it depends on the software implementation); an SSD is nice if you need to preprocess large amounts of data and save them into smaller batches, e.g. preprocessing 200GB of data and saving it into batches of 2GB is a situation in which an SSD can save a lot of time. If you decide to get an SSD, a good rule might be to buy one that is twice as large as your largest dataset. If you get an SSD, you should also get a large hard drive where you can move old datasets to.
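For the preprocessing scenario above, a minimal sketch of writing a large raw dataset out as smaller batch files might look like this; the paths, chunk size, and normalization step are placeholders:

```python
import numpy as np

def preprocess_in_chunks(raw_path, out_prefix, rows_per_chunk, n_features):
    """Stream a large raw float32 array from disk and write ~2GB batch files,
    so the full dataset never has to fit in RAM at once."""
    raw = np.memmap(raw_path, dtype=np.float32, mode="r").reshape(-1, n_features)
    for i, start in enumerate(range(0, raw.shape[0], rows_per_chunk)):
        chunk = np.asarray(raw[start:start + rows_per_chunk])
        # Example preprocessing step: per-chunk standardization
        chunk = (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)
        np.save(f"{out_prefix}_{i:04d}.npy", chunk)

# Placeholder call; 500,000 rows x 1000 float32 features is about 2GB per file.
# preprocess_in_chunks("raw_data.bin", "batches/train", rows_per_chunk=500_000, n_features=1000)
```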
So the bottom line is, a $1000 system should perform at least at 95% of a $2000 system; but a $2000 system offers more convenience and might save some time for preprocessing.
Hi Tim,
Nice and very informative post. I have a question regarding the processor. Would you suggest building a computer with an AMD processor (for example, the AMD FX-8350 4.0GHz 8-core) over an Intel-based processor for deep learning? I also do not know whether AMD processors have PCIe 3.0 support. Could you please give your thoughts on this?
And thanks a lot for the wonderful post.
Thanks for your comment, Dewan. An AMD CPU is just as good as an Intel CPU; in fact, I might favor AMD over Intel because Intel CPUs pack just too much unnecessary punch – one simply does not need so much processing power, as all the computation is done by the GPU. The CPU is only used to transfer data to the GPU and to start kernels (which is little more than a function call). Transferring data means that the CPU should have a high memory clock and a memory controller with many channels. This is often not advertised for CPUs as it is not so relevant for ordinary computation, but you want to choose the CPU with the larger memory bandwidth (memory clock times memory controller channels). The clock rate of the processor itself is less relevant here.
A 4GHz 8-core AMD CPU might be a bit of overkill. You could definitely settle for less without any degradation in performance. But what you say about PCIe 3.0 support is important (some new Haswell CPUs do not support 40 lanes, but only 32; I think most AMD CPUs support all 40 lanes). As I wrote above, I will write a more detailed analysis in a week or two.
Hey Tim,
Thanks for the excellent detailed post. I look forward to reading your other posts. Keep going 🙂
Hey! Thanks for the great post!
I have one question, however:
I’m in the “started DL, serious about it” group and have a decent PC already, although without an NVIDIA GPU. I’m also a 1st-year student, so a GTX 980 is out of the question 😉 The question is: what do you think about Amazon EC2? I could easily buy a GTX 580, but I’m not sure if it’s the best way to spend my money. And when I think about more expensive cards (like the 980 or the ones to be released in 2016) it seems like running a spot instance for 10 cents per hour is a much better choice.
What could be the main drawbacks of doing DL on EC2 instead of my own hardware?
I think an Amazon Web Services (AWS) EC2 instance might be a great choice for you. AWS is great if you want to use a single GPU or multiple separate GPUs (one GPU for one deep net). However, you cannot use them for multi-GPU computation (multiple GPUs for one deep net), as the virtualization cripples the PCIe bandwidth; there are rather complicated hacks that improve the bandwidth, but it is still bad. Anything beyond two GPUs will not work on AWS because their interconnect is way too slow for that.
Is the AWS single GPU limitation relevant to the new g2.8xlarge instance? (see https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/).
It seems to run the same GPUs as those in the g2.2xlarge, which would still impede parallelization for neural networks, but I do not know for sure without some hard numbers. I bet that with custom patches 4-GPU parallelism is viable, although still slow (probably one GTX Titan X will be faster than the 4 GPUs on the instance). More than 4 GPUs still will not work due to the poor interconnect.
Thanks for the explanation. Looking forward to read the other post.
Hi Tim,
I am a bit confused between buying your recommended GTX 580 and a new GTX 750 (Maxwell). The models which I am finding on eBay are around 120 USD, but they are 1.5GB models. One big problem with the 580 would be buying a new PSU (500 watt). As you stated, the Maxwell architecture is the best, so would the GTX 750 (512 CUDA cores, 1GB GDDR5) be a good choice? It will be about 95 USD and I can also do without an expensive PSU.
My research area is mainly in text mining and NLP, not so much images. Other than this I would do Kaggle competitions.
A GTX 750 will be a bit slower than a GTX 580, which should be fine and more cost effective in your case. However, you might want to opt for the 2GB version; with 1GB it will be difficult to run convolutional nets; 2GB will also be limiting of course, but you could use it for most Kaggle competitions, I think.
great posts, Tim!
which deep learning framework do you often use for your work, may I ask?
I programmed my own library for my work on the parallelization of deep learning; otherwise I use Torch7 with which I am much more productive than with Caffe or Pylearn2/Theano.
what about Tensorflow?
The comment is quite outdated now. TensorFlow is good. I personally favor PyTorch. I believe one can be much more productive with PyTorch — at least I am.
I know that these are not recommended, but 580 won’t work for me because of the lack of Torch 7 support: will the 660 or 660 Ti work with Torch 7? Is this possible to check before purchasing? Thank you!
The cuDNN component of Torch 7 needs a GPU with compute capability 3.5. A 660 or 660 Ti will not work; you can find out which GPUs have which compute capability here.
Any comments on this new Maxwell architecture Titan X? $1000 US
http://www.pcworld.com/article/2898093/nvidia-fully-reveals-1000-titan-x-the-most-advanced-gpu-ever.html
seemingly has a massive memory bandwidth bump – for example the gtx 980 specs claim 224 GB/sec with the Maxwell architecture, this has 336 GB/sec (and also comes stock with 12GB VRAM!)
Along that line, are the memory bandwith specs not apples to apples comparisons across different Nvidia architectures?
i.e. the 780 Ti also claims 336GB/sec with the Kepler architecture – but you claim the 980 with 224GB/sec bandwidth can out-benchmark it for basic neural net activities?
Appreciate this post
You can compare bandwidth within microarchitecture (Maxwell: GTX Titan X vs GTX 980, or Kepler: GTX 680 vs GTX 780), but across architectures you cannot do that (Maxwell card X vs Kepler card X). The very minute changes in the design of a microchip can make vast difference in bandwidth, FLOPS, or FLOPS/watt.
Kepler was about FLOPS/watt and double precision performance for scientific computing (engineering, simulation, etc.), but the complex design led to poor utilization of the bandwidth (memory bus times memory clock). With Maxwell, the NVIDIA engineers developed an architecture which has both energy efficiency and good bandwidth utilization, but double precision suffered in turn — you just cannot have everything. Thus Maxwell cards make great gaming and deep learning cards, but poor cards for scientific computing.
The GTX Titan X is so fast, because it has a very large memory bus width (384 bit), an efficient architecture (Maxwell) and a high memory clock rate (7 Ghz) — and all this in one piece of hardware.
“a 6GB GPU is plenty for now” — don’t you get severely limited in the batch size (like, 30 max) for 10^8+ parameter convnets (eg simonyan very deep, googlenet)?
although I think some DL toolkits are starting to come with functionality of updating weights after >1 batch load/unload onto gpu, which I guess would result in theoretically unlimited batch size, though not sure how this would impact speed?
This is a good point, Alex. I think you can also get very good results with conv nets that feature less memory intensive architectures, but the field of deep learning is moving so fast, that 6 GB might soon be insufficient. Right now, I think one has still quite a bit of freedom with 6 GB of memory.
A batch and activation unload/load procedure would be limited by the ~8GB/s bandwidth between GPU and CPU, so there will definitely be a decrease in performance if you unload/load a majority of the needed activation values. Because the bandwidth bottlenecks are very similar to those in parallelism, one can expect a decrease in performance of about 10-30% if you unload/load the whole net. So this would be an acceptable procedure for very large conv nets; however, smaller nets with fewer parameters would still be more practical, I think.
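To put rough numbers on that overhead (the activation size is a made-up example; 8GB/s is the PCIe figure from above):

```python
def offload_overhead_ms(activation_mb, pcie_gb_s=8.0):
    """Pure transfer time to push activations to CPU RAM and pull them back."""
    bytes_moved = 2 * activation_mb * 1024**2  # unload + reload
    return bytes_moved / (pcie_gb_s * 1e9) * 1e3

# Offloading ~1.5 GB of activations costs roughly 0.4s of transfer per iteration;
# whether that is a 10% or a 30% hit depends on how long the compute itself takes.
print(round(offload_overhead_ms(1500), 1), "ms")
```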
What is your opinion about the different brands (EVGA, ASUS, MSI, GIGABYTE) of the video card for the same model?
Thanks for this post Tim, is very illustrating.
EVGA cards often have many extra features (dual BIOS, extra fan design) and a slightly higher clock and/or memory, but their cards are more expensive too. However, with respect to price/performance it depends from card to card which is the best one, and one cannot make general conclusions from a brand. Overall, the fan design is often more important than the clock rates and extra features. The best way to determine the best brand is often to look for references on how hot one card runs compared to another, and then consider whether the price difference justifies the extra money.
Most often though, one brand will be just as good as the next and the performance gains will be negligible — so going for the cheapest brand is a good strategy in most cases.
hey Tim,
you have been a big help – I have included the results from the CUDA bandwidth test (which is included in the samples of the basic CUDA install).
This is for a GTX 980 running on 64-bit Linux with an i3770 CPU and PCIe 2.0 lanes on the motherboard.
This look reasonable?
Are they indicative of anything?
the device/host and host/device speeds are typically the bottleneck you speak of?
no reply necessary – just learning
thanks again
tim@ssd-tim ~/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release $ ./bandwidthTest
[CUDA Bandwidth Test] – Starting…
Running on…
Device 0: GeForce GTX 980
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12280.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12027.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 154402.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Looks quite reasonable; the bandwidth from host to device and device to host is limited by either the RAM or PCIe 2.0, and 12GB/s is faster than expected; 150GB/s is slower than the 224GB/s which the GTX 980 is capable of, but this is due to the small transfer size of 32MB — so this looks fine.
Hi Tim, great post! I feel lucky that I chose a 580 a couple of years ago when I started experimenting with neural nets. If there had been an article like this then I wouldn’t have been so nervous!
I’m wondering if you have any quick tips for fresh Ubuntu installs with current nvidia cards? When I got my used system running a couple of years ago it took quite a while and I fought with drivers, my 580 wasn’t recognized, etc.. On the table next to me is a brand new build that I just put together that I’m hoping will streamline my ML work. It’s an intel X99 system with a Titan X (I bought into the hype!). Windows went on fine(although I will rarely use it) and Ubuntu will go on shortly. I’m not looking forward to wrestling with drivers…so any tips would be greatly appreciated. If you have a cheat-sheet or want to do a post, I’m sure it will be warmly welcomed by many…especially me!
Yeah, I also had my troubles with installing the latest drivers on Ubuntu, but soon I got the hang of it. You want to do this:
0. Download the driver and remember the path where you saved the file
1. Purge the system of the nvidia and nouveau drivers
2. Blacklist the nouveau driver
3. Reboot
4. Ctrl + Alt + F1
5. sudo service lightdm stop
6. chmod +x driver_file
7. sudo ./driver_file
And you should be done. Sometimes I had trouble with stopping lightdm; you have two options:
1. try sudo /etc/init.d/lightdm stop
2. kill all lightdm processes (sudo killall lightdm, or (1) ps ux | grep lightdm, (2) find the process id, (3) sudo kill -9 id)
For me the second option worked.
You can find more details to the first steps here:
http://askubuntu.com/questions/451221/ubuntu-14-04-install-nvidia-driver
Thanks for the reply Tim. I was able to get it all up and running pretty painlessly. Hopefully your response helps somebody else too…it’s nice to have this sort of information in one spot if it’s GPU+DNN related.
On a performance note, my new system with the Titan X is more than 10 times faster on an MNIST training run than my other computer (i5-2500k + GTX 580 3GB). And for fun, I cranked up the mini-batch size on a Caffe example (Flickr fine-tuning) and got 41% validation accuracy in under an hour.
I believe you hit a nerve with a couple of your blog posts…I think the type of information that you’re giving is quite valuable, especially to folks who haven’t done much of this stuff.
One possible information portal could be a wiki where people can outline how they set up various environments (theano, caffe, torch, etc..) and the associated dependencies. Myself, I set up a few and I’m left with a few questions like for example…
-given all the dependencies, which should be built versus installed via apt-get versus pip? A holistic outlook would be a very educational thing. I found myself building the base libraries and using the setup method for many Python packages, but after a while there were so many that I started using apt-get and pip and adding things to my paths… blah blah… at the end everything works, but I admit I lost track of all the details.
I know that I’m not alone!! Having a wiki resource that I could contribute to during the process would be good for me and for others doing the same thing… instead of hunting down disparate sources and answering questions on stackoverflow.
I mention this because you probably already have a ton of traffic because of a couple key posts that you have. Put a wiki up and I promise I’ll contribute! I’ll consider doing it myself as well…time…need more time!
thanks again.
Thanks, johno, I am glad that you found my blog posts and comments useful. A wiki is a great idea and I am looking into that. Maybe when I move this site to a private host this will be easy to set up. Right now I do not have time for that, but I will probably migrate my blog in two months or so.
I am a bit of a novice but got it done in a few hours.
My thoughts
Try to start with a clean install of an NVIDIA-supported Linux distro (Ubuntu 14.04 LTS is on there).
I used the Linux distro’s proprietary driver packages instead of downloading them from NVIDIA. The xorg-edgers PPA has them and they keep them pretty current. This means you can install the actual NVIDIA driver via sudo apt-get, and also (more importantly) easily upgrade the driver in a few months when NVIDIA releases a new one. It also blacklists nouveau automatically. You can toggle between driver versions in the software manager, as it shows you all the drivers you have.
Once you have the driver working, you are most of the way there. I ran into a bit of trouble with the CUDA install, as sometimes your computer may have some libraries missing or conflicting. But I got CUDA 7.0 going pretty quickly.
these two links helped
http://bikulov.org/blog/2015/02/28/install-cuda-6-dot-5-on-clean-ubuntu-14-dot-04/
http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_Getting_Started_Linux.pdf
there is gonna be some trial and error, be ready to reinstall ubuntu and take another try at it.
good luck
Hi Tim-
Does the platform you plan on doing DL on matter? By this I mean X99, Z97, AM3+, etc. X99 is able to utilize more threads and cores than Z97, but I’m not sure if that helps at all, similar to cryptocurrency mining, where hardware besides the GPU doesn’t matter.
Hi Jack-
Please have a look at my full hardware guide for details, but in short, hardware besides the GPU does not matter much (although a bit more than in cryptocurrency mining).
Ok, sure, thanks.
Hi Tim,
I have benefited from this excellent post. I have a question regarding Amazon GPU instances. Can you give a rough estimate of the performance of an Amazon GPU? Like GTX TITAN X = ? Amazon GPUs.
Thanks,
Thanks, this was a good point; I added it to the blog post. The new AWS GPUs (g2.2xlarge and g2.8xlarge) are about as fast as a GTX 680 (they are based on the same chip, but are slightly different to support virtualization). However, there are still some performance decreases due to virtualization for the memory transfers from CPU to GPU and between GPUs; this is hard to measure and should have little impact if you use just one GPU. If you perform multi-GPU computing, the performance will degrade harshly.
Hi Tim,
Thanks for sharing all this info.
I don’t understand the difference between a GTX 980 from, say, ASUS and one from NVIDIA.
Obviously the same architecture, but are they much different at all?
Why does it seem hard to find NVIDIA products in Europe?
Thanks
So this is the way how a GPU is produced and comes into your hands:
1. NVIDIA designs a circuit for a GPU
2. It makes a contract with a semiconductor producer (currently TSMC in Taiwan)
3. The semiconductor producer produces the GPU and sends it to NVIDIA
4. NVIDIA sends the GPU to companies such as ASUS, EVGA etc.
5. ASUS, EVGA, etc. modify the GPU (clock speeds, fan — nothing fundamental, the chip stays the same)
6. You buy the GPU from either 5. or 4.
So while all GPUs are from NVIDIA you might buy a branded GPU from, say, ASUS. This GPU is the very same GPU as another GPU from, say, EVGA. Both GPUs run the very same chip. So essentially, all GPUs are the same (for a given chip).
Some GPUs are not available in other countries because of regulations (NVIDIA might have no license, but other brands have one?) and because it might not be profitable to sell them there in the first place (you need a different set of logistics for international trade; NVIDIA might not have the expertise and infrastructure for this, but regular hardware companies like ASUS and EVGA do).
Hi Tim,
Thank you for your advices I found them very very useful. I have many questions please and feel very to answer some of them. I have many choices to buy a powerful laptop or computer My budget is (£4000.00).
I would like to buy Mac Pro (cost nearly £3400.00) , so can I apply deep learning of this machine as it uses the OSX operating system and I want to use torch7 in my implementation. Second, I will buy Titan x then I have two choices, First, I will install TITAN X GPU in Mac Pro. Second, I will buy Alienware Amplifier (to use TITAN X) with Alienware 13 laptop. Could you please tell me if this possible and easy to make it because I am not a computer engineer, but I want to use deep learning in my research.
Best regards,
Salem
I googled the Alienware Amplifier and read that it only has 4GB/s of internal bandwidth, and there might be other problems. If you use a single GPU this is not too much of a concern, but be prepared to deal with performance decreases in the range of 5-25%. If there are technical details I overlooked, the performance decrease might be much higher — you will need to look into that yourself.
The GTX Titan X in a Mac Pro will do just fine I guess. While most deep learning libraries will work well with OSX there might be a few problems here and there, but I think torch7 will work fine.
However, consider also that you will pay a heavy price for the aesthetics of Apple products. You could buy a normal high-end computer with 2 GTX Titan X and it would still be cheaper than a Mac Pro. Ubuntu or any other Linux-based OS needs some time to get comfortable with, but they work just as well as OSX and often make programming easier than OSX does. So it is basically all down to aesthetics vs. performance — that’s your call!
Is it easy to install GTX Titan X in a Mac Pro? Does it need external hardware or power supply or just plug in?
Many thanks Tim
Hi,
Nice article! You recommended all high-end cards. What about mid-range cards for those on a really tight budget? For example, the GT 740 line has a model with 4GB GDDR5, a 5000 MT/s memory clock, a 128-bit bus width, and is rated at ~750 GFLOPS. Will such a card likely give a nice boost in neural net training (assuming the net fits in the card’s memory) over a mid-range quad-core CPU?
Thanks!
The GT 740 with 4GB GDDR5 is a very good choice for a low budget. Maybe I should even include that option in my post for a very low budget. A GT 740 will definitely be faster than a quad-core CPU (probably anything between 3 to 7 times faster, depending on the CPU and the problem).
Thanks for this great article. What do you think of the upcoming GTX 980 Ti? I have read it has 6GB and clock speed/cores closer to the Titan X. Rumoured to be $650-750. I was about to buy a new PC, but thought I might hold out as it’s coming in June.
The GTX 980 Ti seems to be great. 6GB of RAM is sufficient for most tasks (unless you use very large data sets, do video classification, or use expensive convolutional architectures) and the speed is about the same. If you use Nervana Systems 16-bit kernels (which will be integrated into torch7) then there should be no issues with memory even for these expensive tasks.
So the GTX 980 Ti seems to be the new best choice in terms of cost effectiveness.
Hi,
I am a novice at deep nets and would like to start with some very small convolutional nets. I was thinking of using a GTX 750 Ti (in my part of the world it is not really very cheap for a student). I would convince my advisor to get a more expensive card after I am able to show some results. Will it be sufficient to do a meaningful convolutional net using Theano?
Your best choice in this situation will be to use an Amazon web services GPU spot instance. These instances have small costs ($0.10 an hour or so) and you will be able to produce results quickly and cheaply, after which your advisor might be willing to buy an expensive GPU. To save more costs, it would be best to prototype your solution on a CPU (just test that the code is working correctly) and then start up an AWS GPU instance and let your code run for a few hours/days. This should be the best solution.
I think there are predefined AWS images which you can load, so that you do not have to install anything — google “AMI AWS + theano” or “AMI AWS + torch” to find more.
Thanks a lot for the suggestion. I will go ahead and try this.
Will the Pascal GPUs have any special requirements, such as X99 or DDR4? I am currently planning a Z97 build with DDR3, but don’t want to be stuck in a year's time! Thanks, J
According to the information that is available, Pascal will not need X99 or DDR4 (which would be quite limiting for sales), instead Pascal cards will just be like a normal card stuck in a PCIe slot with NVLink on top (just like SLI) and thus no new hardware is needed.
Sweet, thanks.
About this:
“GTX Titan X = 0.66 GTX 980 = 0.6 GTX 970 = 0.5 GTX Titan = 0.40 GTX 580
GTX Titan X = 0.35 GTX 680 = 0.35 AWS GPU instance (g2.2 and g2.8) = 0.33 GTX 960”
Have you actually measured the times/used these gpus or are you “guessing”?
Thank you for the article!
Very good question!
Because deep learning is bandwidth-bound, the performance of a GPU is determined by its bandwidth. However, this is only true for GPUs with the same architecture (Maxwell, Kepler, Fermi). So for example: The comparisons between GTX Titan X and GTX 980 should be quite accurate.
Comparisons across architectures are more difficult and I cannot assess them objectively (because I do not have all the GPUs listed). To provide a relatively accurate measure I sought out information where a direct comparison was made across architecture. Some of these are opinion or “feeling”-based, other sources of information are not relevant (game performance measures), but there are some sources of information which are relatively objective (performance measures for bandwidth-bound cryptocurrency mining); so I weighted each piece of information according to its relevance and then I rounded everything to neat numbers for comparisons between architectures.
So all in all, these measures are quite opinionated and do not rely on good evidence. But I think I can make more accurate estimates than people who do not know GPUs well. Therefore I think it is the right thing to include this somewhat inaccurate information here.
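To make the same-architecture part of the estimate concrete, here is a rough sketch; the bandwidth figures are published specs, the GTX 970 entry assumes the ~196 GB/s rate of its fast 3.5GB segment, and the ratios are only ballpark numbers:

```python
# Rough relative-speed estimate from memory bandwidth (same architecture only).
# Published spec bandwidths; the GTX 970 value assumes the ~196 GB/s rate of
# its fast 3.5GB segment. Treat the ratios as ballpark numbers.
bandwidth_gb_s = {
    "GTX Titan X (Maxwell)": 336,
    "GTX 980": 224,
    "GTX 970": 196,
}

reference = bandwidth_gb_s["GTX Titan X (Maxwell)"]
for card, bw in bandwidth_gb_s.items():
    print(f"{card}: ~{bw / reference:.2f}x the speed of a Titan X")
```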
Hi,
Thanks a lot for the updated comparison. I bought a 780 Ti a year ago and I am interested in how it compares to the newer cards. I use it for NLP tasks mainly, including RNNs, starting with LSTMs.
Also, do I get it right that ‘GTX Titan X = 0.66 GTX 980’ means that 980 is actually 2/3 as fast as Titan X or the other way round?
A GTX 780 Ti is pretty much the same as a GTX Titan Black in terms of performance (slower than a GTX 980). Exactly, the 980 is about 2/3 the speed of a Titan X.
Can you comment on this note on the cuda-convnet page
https://code.google.com/p/cuda-convnet/wiki/Compiling
?
“Note: A Fermi-generation GPU (GTX 4xx, GTX 5xx, or Tesla equivalent) is required to run this code. Older GPUs won’t work. Newer (Kepler) GPUs also will work, but as the GTX 680 is a terrible, terrible GPU for non-gaming purposes, I would not recommend that you use it. (It will be slow). ”
I am probably in the “started DL, serious about it”-group, and would have probably bought the GTX 680 after reading your (great) article.
This is very much true. The performance of the GTX 680 is just bad. But because the Fermi GPUs (4xx and 5xx) are not compatible with the NVIDIA cuDNN library which is used by many deep learning frameworks, I do not recommend the GTX 5xx. The GTX 7xx series is much faster, but also much more expensive than a GTX 680 (except the GTX 960, which is about as fast as the GTX 680), so the GTX 680, despite being so slow, is the only viable choice (besides the GTX 960) for a very low budget.
As you can see in the comment by zeecrux, the GTX 960 might actually be better than the GTX 680 by quite a margin. So it is probably better to get a GTX 960 if you can find a cheap one. If this is too expensive, settle for a GTX 580.
Ok, thank you! I can’t see any comment by zeecrux? How bad is the performance of the GTX 960? Is it sufficient if you mainly want to get started with DL, play around with it, and do the occasional Kaggle comp, or is it not even worth spending the money in this case? Buying a Titan X or GTX 980 is quite an investment for a beginner.
Ah I did not realize, the comment of zeecrux was on my other blog post, the full hardware guide. Here is the comment:
A K40 is about as fast as a GTX Titan. So the GTX 960 is definitely faster and better than a GTX 680. It should be sufficient for most Kaggle competitions and is a perfect card to get started with deep learning.
So it makes good sense to buy a GTX 960 and wait for Pascal to arrive in Q3/Q4 2016, instead of buying a GTX 980 Ti or GTX 980 now.
Hey Tim,
Can I know where to check this statement? “But because the Fermi GPUs (4xx and 5xx) are not compatible with the NVIDIA cuDNN library”. TIA.
Check this stackoverflow answer for a full answer and source to that question.
Hi Tim,
Do you think it is better to buy a Titan X now or to wait for the new Pascal if I want to invest in just one GPU within the coming 4 years?
The Pascal architecture should be a quite large upgrade when compared to Maxwell. However, you have to wait more than a year for them to arrive. If your current GPU is okay, I would wait. If you have no GPU at all, you can use AWS GPU instances, or buy a GTX 970 and sell it after one year to buy a Pascal card.
From what I read, GPU Direct RDMA is only available for workstation cards (Quadro/Tesla). But it seems like you are able to get your cluster to work with a few GTX Titan’s and IB cards here. Not sure what am I missing.
You will need a Mellanox InfiniBand card. For me a ConnectX-2 worked, but usually only ConnectX-3 and ConnectX-IB are supported. I never tested GPU Direct RDMA with Maxwell, so it might not work there.
To get it working on Kepler devices, you will need the patch you find under downloads here (nvidia_peer_memory-1.0-0.tar.gz):
http://www.mellanox.com/page/products_dyn?product_family=116
Even with that I needed quite some time to configure everything, so prepare yourself for a long read of documentation and many Google searches for error messages.
Hi Tim, thank you for posting and updating this, I’ve found it very helpful.
I do have a general question, though, about Quadro cards, which I’ve noticed neither you nor many others discuss using for deep learning. I’m configuring a new machine and, due to some administrative constraints, it is easiest to go with a Quadro K5000.
I had specced out a different machine with a GTX 980, but it’s looking like it will be harder to purchase. My questions are whether there is anything I should be aware of regarding using Quadro cards for deep learning and whether you might be able to ballpark the performance difference. We will probably be running moderately sized experiments and are comfortable losing some speed for the sake of convenience; however, if there would be a major difference between the 980 and K5000, then we might need to reconsider. I know it is difficult to make comparisons across architectures, but any wisdom that you might be able to share would be greatly appreciated.
Thanks!
The K5000 is based on a Kepler chip and has 173 GB/s memory bandwidth. Thus it should be a bit slower than a GTX 680.
Hi!
I am in a similar situation. No comparison of quadro and geforce available anywhere. Just curious, which one did you end up buying and how did it work out?
Hi Tim, 1st i want to say that I’m truly extremely impressed with your blog, its very helpful.
Talking about PCIe bandwidth, have you ever heard about PLX Tech and their PEX 8747 bridge chip? AnandTech has a good review on how it works and its effect on gaming: http://www.anandtech.com/show/6170/four-multigpu-z77-boards-from-280350-plx-pex-8747-featuring-gigabyte-asrock-ecs-and-evga. They even said that it can replicate four x16 lanes on a CPU which only has 28 lanes.
Someone mentioned it before in the comments, but that was another mainboard with 48x PCIe 3.0 lanes; now that you say you can operate with 16x on all four GPUs I got curious and looked at the details.
It turns out that this chip switches the data in a clever way, so that a GPU will have full bandwidth when it needs high speed. However, when all GPUs need high speed bandwidth, the chip is still limited by the 40 PCIe lanes that are available at the physical level. When we transfer data in deep learning we need to synchronize gradients (data parallelism) or output (model parallelism) across all GPUs to achieve meaningful parallelism, as such this chip will provide no speedups for deep learning, because all GPUs have to transfer at the same time.
Transferring the data one after the other is most often not feasible, because we need to complete a full iteration of stochastic gradient descent in order to work on the next iterations. Delaying updates would be an option, but one would suffer losses in accuracy and the updates would not be that efficient anymore (4 delayed updates = 2-3 real updates?). This would make this approach rather useless.
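For a feel for the numbers, here is a back-of-the-envelope sketch; the parameter count and usable PCIe bandwidth are assumptions for illustration, not measurements:

```python
# Back-of-the-envelope cost of synchronizing gradients over PCIe with data
# parallelism. The parameter count and usable bandwidth are assumptions for
# illustration, not measurements.
num_parameters = 60e6      # roughly an AlexNet-sized network
bytes_per_param = 4        # 32-bit gradients
gradient_bytes = num_parameters * bytes_per_param

usable_pcie_bandwidth = 8e9   # ~8 GB/s on PCIe 3.0 x8, a rough figure
num_gpus = 4

# When all GPUs must exchange gradients at the same time over shared lanes,
# the transfers compete for the same physical bandwidth.
sync_time = gradient_bytes * (num_gpus - 1) / usable_pcie_bandwidth
print(f"~{sync_time * 1000:.0f} ms per synchronization step")
```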
Thanks for your detailed explanation.
Is it possible to use the GTX 960M for Deep Learning? http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-960m/specifications. It has 2.5GB GDDR though. Maybe a pre-built specs with http://t.co/FTmEDrJDwb ?
A GTX 960M will be comparable in performance to a GTX 950. So you should see a good speedup using this GPU, but it will not be a huge speedup compared to other GPUs. However, compared to laptop CPUs the speedup will still be considerable. To do more serious deep learning work on a laptop you need more memory and preferably faster computation; a GTX 970M or GTX 980M should be very good for this.
Hi Tim
I’m planning to build a pc mainly for kaggle and getting started with deep learning.
This is my first time. For my budget I’m thinking of going with
i7-4790k
GTX 960 4GB
Gigabyte GA-Z97X-UD3H-BK or Asus Z97-A 32GB DDR3 Intel Motherboard
I’m hoping to replace the GTX 960 or add another card later on…
Is this a good build? Please offer your suggestions.
Thanks in advance:)
Looks like a solid cheap build with one GPU. The build will suffice for a Pascal card once it becomes available and thus should last about 4 years with a Pascal upgrade. The GTX 960 is a good choice to try things out and use deep learning on Kaggle. You will not be able to build the best models, but models that are competitive with the top 10% in deep learning Kaggle competitions. Once you get the hang of it, you can upgrade and you will be able to run the models that usually win those Kaggle competitions.
Hi Tim,
Right now I’m between two choices: two GTX 690s or a Titan X. Both come at the same price. Which one do you think is better for conv nets? Or multimodal recurrent neural nets?
I would definitely pick a GTX Titan X over two GTX 690, mainly because using two GTX 690 for parallelism is difficult and will be slower than a single Titan X. Running multiple algorithms (different algorithms on each GPU) on the two GTX 690 will be good, but a Titan X comes close to this due to its higher processing speed.
Are there any important differences between the GTX 980 and the GTX 980 TI? It seems that we can only get the latter. While it seems faster, I’m not skilled enough in the area to know whether it has any issues related to using it for deep learning.
The GTX 980 Ti is as fast as the GTX Titan X (50% faster than a GTX 980), but has 6GB of memory instead of 12GB. There are no issues with the card, it should work flawlessly.
What do you think of Titan X superclocked vs. regular Titan X? Are the up/down sides noticeable?
The upgrade should be unnoticeable (0-5% increased speed) and I would recommend a superclocked version only if you do not pay any additional money for that.
Possibly (probably) a dumb question but can you use a superclocked GPU with a non-superclocked GPU? Reason I ask is that a cheap used superclocked Titan Black is for sale on eBay as well as another cheap Titan Black (non-superclocked). Just want to make sure I wouldn’t be making some mistake by buying the second one if I decided to get two Titan Black GPUs.
p.s. thanks for the blog. Super helpful for all of us noobies.
Yes, this will work without any problem. I myself have been using 3 different kinds of GTX Titan for many months. In deep learning the difference in compute clock hardly makes a difference, so the GPUs will not diverge during parallel computation. So there should be no problems.
Hello Tim
Thank you very much for your in-depth hardware analysis (both this one and the other one you did). I basically ended up buying a new computer based only on your ideas 🙂
I chose the GTX 960 and I might upgrade next year if I feel this is something for me.
But in a lot of places I read about this ImageNet DB. The problem there seems to be that I need to be a researcher (or in education) to download the data. Do you know anything about this? Is there any way for me as a private person (who is doing this for fun) to download the data? The reason why I want this dataset is that it is huge and it also would be fun to be able to compare how my nets work compared to other people's.
If not, what other image databases except for CIFAR and MNIST do you recommend?
Thanks again.
Hello Mattias, I am afraid there is no way around the educational email address for downloading the dataset. It really is a shame, but if these images were exploited commercially then the whole system of free datasets would break down — so it is mainly due to legal reasons.
There are other good image datasets like the Google Street View House Numbers dataset; you can also work with Kaggle datasets that feature images, which has the advantage that you get immediate feedback on how well you do, and the forums are excellent for reading up on how the best competitors achieved their results.
Thanks for quick reply,
I will look into both Kaggle and the street view data set then 🙂
Hello Tim,
Thank you for your article. I understand that researchers need a good GPU for training a top performing (convolutional) neural network. Can you share any thought on what compute power is required (or what is typically desired) for transfer learning (i.e. fine tuning of an existing model) and for model deployment?
Thank you!
Tim, such a great article. I’m going back and forth between the Titan Z and the Titan X. I can probably buy the Titan Z for ~$500 from my friend. I’m very confused as to how much memory it actually has. I see that it has 6GB x 2.
I guess my question is: does the Titan Z have the same specs as the Titan X in terms of memory? How does this work from a deep learning perspective (currently using Theano)?
Many Thanks,
One thing I should add is that I’m building RNNs (specifically LSTMs) with this Titan Z or Titan X. I’m also considering the 980 Ti.
Please have a look at my answer on quora which deals exactly with this topic. Basically, I recommend you to go for the GTX Titan X. However, $500 for GTX Titan Z is also a good deal. Memory-wise, you can think of the GTX Titan Z, as two normal GTX Titan with a connection between the two GPUs — so two GPUs with 6GB of memory each.
That makes much more sense. Thanks again — checked out your response on quora. You’ve really changed my views on how to set up deep learning systems. Can’t even begin to express how thankful I am.
Hey Tim, not to bother too much. I bought a 980 Ti, and things have been great. However, I was just doing some searching and saw that the AMD Radeon R9 390X is ~$400 on Newegg and has 8GB of memory and ~500GB/s of bandwidth. These specs are roughly 30% better than the 980 Ti for $650.
I was wondering what your thoughts are on this? Is AMD compute architecture slower compared to Nvidia Kepler architecture for deep learning? In the next month or so, I’m considering purchasing another card.
Based upon numbers, it seems that the AMD cards are much cheaper compared to Nvidia. I was hoping you could comment on this!
Theoretically the AMD card should be faster, but the problem is the software: Since no good software exists for AMD cards you will have to write most of the code yourself with an AMD card. Even if you manage to implement good convolutions the AMD card will likely perform worse than the NVIDIA one because the NVIDIA convolutional kernels have been optimized by a few dozen researchers for more than 3 years.
NVIDIA Pascal cards will have up to 750-1000 GB/s memory bandwidth, so it is worth waiting for Pascal which probably will be released in about a year.
Yea — I can’t wait for Pascal. For now, will just rock out with the 980 TI’s. Thanks alot!
Hi Tim,
Coming across this blog while searching the internet for deep learning is great for a newbie like me.
I have 2 choices in hand now: one GTX 980 4GB or two GTX 780 Ti 3GB in SLI. Which one do you recommend should go into the hardware box for my deep learning research?
I am more in favour of the two 780 Tis, from what I learned from your writing on CUDA cores + memory bandwidth.
Thank you very much.
Nghia
I would favor the GTX 980, which will be much faster than 2 GTX 780 Ti even if you use the two cards in parallel. However, the 2 GTX 780 Ti will be much better if you run independent algorithms and thus enable you to learn how to train deep learning algorithms successfully more quickly. On the other hand, the 3GB on them is rather limiting and will prevent you from training current state-of-the-art convolutional networks. If you want to train convolutional networks I would suggest you choose the GTX 980 rather than the 2 GTX 780 Ti due to this.
Thank you very much for the advice.
Is it possible to put all three cards into one machine, and would that give me a good enough environment to learn parallel programming and study deep learning with neural networks (torch7 & Lua)?
Will a system with those 3 cards (780 Ti x2 + 980 x1) yield better performance overall, or will the hardware disparity and complexity drag it down?
Yes, you could run all three cards in one machine. However you can only select one type of GPU for your graphics; and for parallelism only the two 780 will work together. There might be problems with the driver though, and it might be that you need to select your Maxwell card (980) to be your graphics output.
In a three card system you could tinker with parallelism with the 780s and switch to the 980 if you are short on memory. If you run NervanaGPU you could also use 16-bit floating point models, thus doubling your memory, however, NervanaGPU will not work on your Kepler 780 cards.
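If you go with such a mixed setup, this minimal sketch shows how to pin a run to particular cards; the device indices are assumptions, so check yours with nvidia-smi:

```python
import os

# Pick which cards a run sees by setting CUDA_VISIBLE_DEVICES before any
# CUDA-using library is imported. The device indices below are assumptions;
# check the actual ordering on your machine with nvidia-smi.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # e.g. the two GTX 780s for parallelism
# os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # or only the GTX 980 when memory is tight

# import theano / torch / caffe bindings here, after the variable is set
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```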
Thank you very much, Tim.
For the sake of study,
From the specs:
+ The GTX 780 Ti with 2880 CUDA cores + 3GB (384-bit bus), and double that with SLI
+ The GTX 980 with 2048 CUDA cores + 4GB (256-bit bus).
Does the 1GB VRAM / CUDA core difference make a big deal in deep learning?
I will benchmark and post the results once I get my hands on the system and run the above two configurations.
I have access to a NVIDIA Grid K2 card on a virtual machine and I have some questions related to this:
1. How does this card rank compared to the other models?
2. More importantly, are there any issues I should be aware of when using this card or just doing deep learning on a virtual machine in general?
I do not have the option of using any other machine than the one provided.
And of course, thanks for some great articles! They are a big help.
You are welcome! I am glad that it helped!
1. The Grid K2 card will roughly perform as well as a GTX 680, although its PCIe connection might be crippled due to virtualization.
2. Depends highly on the hardware/software setup. Generally there should not be any issue other than problems with parallelism.
Do you know what versions of CUDA it is compatible with? Would it work with CUDA 7.5?
Hi! Fantastic article. Are there any on demand solution such as Amazon but with 980Ti on board? I can’t find any.
Amazon needs to use special GPUs which are virtualizable. Currently the best cards with this capability are Kepler cards which are similar to the GTX 680. However, other vendors might have GPU servers for rent with better GPUs (as they do not use virtualization), but these servers are often quite expensive.
Hello Tim,
First of all, I bounced on your blog when looking for Deep Learning configuration and I loved your posts that confirm my thoughts.
I have two questions if you have time to answer them:
(1) For specific problems, I will train my DNN on ImageNet with some additional classes; for this, I don’t mind waiting for a while (well, a long while). Once the DNN is ready, do you know whether a configuration of one to four Titan Xs (12GB each) would be fast enough to label scenes in images within seconds, like Clarifai does? I guess this depends on the number of hidden layers I could have in my DNN.
(2) Have you used your configuration long enough to provide feedback on MTBF for GPU cards? I guess, like disks, running a system on a 24/7 basis will impact the longevity of GPU cards…
Thanks in advance for your answers
mph
(1) Yes, this is highly dependent on the network architecture and it is difficult to say more about this. However, this benchmark page by Soumith Chintala might give you some hints about what you can expect from your architecture given a certain depth and size of the data. Regarding parallelization: you usually use LSTMs for labelling scenes and these can be easily parallelized. However, running image recognition and labelling in tandem is difficult to parallelize. You are highly dependent on the implementations of certain libraries here because it costs just too much time to implement it yourself. So I recommend making your choice for the number of GPUs dependent on the software package you want to use.
(2) I had no failures so far — but of course this is for a sample size of 1. I have heard from other people that use multiple GPUs that they had multiple failures in a year, but I think this is rather unusual. If you keep the temperatures below 80 degrees your GPUs should be just fine (theoretically).
Awesome work, this article really clears out the questions I had about available GPU options for deep learning.
What can you say about the Jetson series, namely the latest TX1?
Is it recommended as an alternative to a PC rig with desktop GPUs?
I was also thinking about the idea of getting a Jetson TX1 instead of a new laptop, but in the end it is more convenient and more efficient to have a small laptop and ssh into a desktop or an AWS GPU instance. An AWS GPU instance will be quite a bit faster than the Jetson TX1, so the Jetson only makes sense if you really want to do mobile deep learning, or if you want to prototype algorithms for future generations of smartphones that will use the Tegra X1 GPU.
Hi Tim!
Thank you for the excellent blog post.
I use various neural nets (i.e. sometimes large, sometimes small) and hesitate to choose between GTX 970 and GTX 960. What is better if we set price factor aside?
– 970 is ~2x faster than 960, but as you say it has troubles.
– on the other hand, Nvidia had shown that GTX 980 has the same memory troubles > 3.5GB
http://www.pcper.com/news/Graphics-Cards/NVIDIA-Responds-GTX-970-35GB-Memory-Issue
If we take their information for granted, I don’t understand your point re. memory troubles in GTX 970 at all, because you do recommend GTX 980
Simply put, is the GTX 970 still faster than the GTX 960 on large nets or not? What concrete troubles do we face using a 970 on large nets?
Thank you again, Alexander
Hi Alexander,
if you look at the screenshots again you see that the bandwidth for the GTX 980 does not drop when we increase the memory. So the GTX 980 does not have memory problems.
Regarding your question of 960 vs. 970: the 970 is much better if you can stay below 3.5GB of memory, but much worse otherwise. If you sometimes train large nets, but you are not insisting on very good results (rather you are satisfied with good results), I would go with the GTX 970. If you train something big and hit the 3.5GB barrier, just adjust your neural architecture to be a bit smaller and you should be alright (or you might try different things like 16-bit networks, or aggressive use of 1×1 convolutional kernels (inception) to keep the memory footprint small).
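A crude way to sanity-check whether an architecture stays below the 3.5GB segment is to add up activation and weight memory before training; the layer shapes and the safety factor in this sketch are hypothetical rules of thumb:

```python
# Crude memory estimate to check whether a convolutional net stays below the
# GTX 970's fast 3.5GB segment. Layer shapes and the factor of 3 are
# hypothetical rules of thumb, not exact numbers.
bytes_per_value = 4   # 32-bit floats; halve this for 16-bit storage
batch_size = 128

# (channels, height, width) of the activations after each conv layer
activations = [(64, 112, 112), (128, 56, 56), (256, 28, 28), (512, 14, 14)]
# (out_channels, in_channels, kernel_h, kernel_w) of each conv layer
weights = [(64, 3, 3, 3), (128, 64, 3, 3), (256, 128, 3, 3), (512, 256, 3, 3)]

act_bytes = sum(batch_size * c * h * w * bytes_per_value for c, h, w in activations)
w_bytes = sum(o * i * kh * kw * bytes_per_value for o, i, kh, kw in weights)

# Factor ~3: activations are kept for the backward pass and weights also need
# gradients plus momentum buffers.
total_gb = 3 * (act_bytes + w_bytes) / 1024**3
print(f"estimated footprint: {total_gb:.2f} GB (target: below 3.5 GB)")
```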
Thanks, Tim!
Indeed, I overlooked the first screenshot, it makes a difference.
I still don't understand Nvidia's statement; somehow they equated the GTX 980 and 970 above 3.5GB, but no matter.
Hey Tim!
I was thinking about GTX 970 issue again. According to the test, it loses bandwidth above 3.5GB. But what does it mean exactly?
-does it start affecting bandwidth for memory below 3.5GB as well? (I guess no)
-does it decrease GPU computing performance itself? (I guess no)
-what if input data is allocated in GPU memory below 3.5GB, and only the CNN weights are allocated above 3.5GB? In that case the upper 0.5GB shouldn't be used for data exchange and may not affect overall bandwidth. I understand we don't control this allocation by default, but what about in theory?
Great post ,
I bought a GTX 750; considering your article I’m doomed, right?
I have a question though. I haven’t tested this yet, but here it goes.
Do you think I can use VGGNet or Alex Krizhevsky's net for CIFAR10? The GTX 750 has 2GB of RAM and it is GDDR5. CIFAR10 is only 60K images of size 32*32*3! Maybe it fits?! I'm not sure.
What do you think about this? Might I be able to use PASCAL VOC 2007 as well?
Thanks again
The GTX 750 will be a bit slow, but you should still be able to do some deep learning with it. If you are using libraries that support 16bit convolutional nets then you should be able to train Alexnet even on ImageNet; so CIFAR10 should not be a problem. To use VGG on CIFAR10 should work out, but maybe it might be a bit tight especially if you use 32bit networks. I have no experience with the PASCAL VOC2007 dataset, but the image sizes seem to be similar to ImageNet, thus AlexNet should work out, but probably not VGG, even with 16bits.
Thanks you very much .
By the way I’m using Caffe and I guess it only supports 32-bit convnets. I’m already hitting my limit using a 4 conv layer network (1991MB or so) and overall only 2~3MB of GPU memory remains.
Your article and help were of great help to me, sir, and I thank you from the bottom of my heart.
God bless you
Hossein
Hi Tim,
thanks for great guidelines !
In case somebody’s interested in the numbers: I’ve just bought a GTX 960 (http://www.gigabyte.com/products/product-page.aspx?pid=5400#ov) and I’m getting ~50% better performance than an AWS g2.2 instance (Keras / TensorFlow backend).
Thank you, Pawel. Those are very useful statistics!
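For anyone who wants to reproduce this kind of samples-per-second comparison, here is a minimal sketch with Keras on random data (assuming a recent Keras installation; the layer sizes are arbitrary and the result only makes sense as a relative comparison between machines):

```python
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Tiny throughput test on random data; layer sizes are arbitrary.
x = np.random.rand(10000, 100).astype("float32")
y = np.random.randint(0, 10, size=(10000,))

model = Sequential([
    Dense(512, activation="relu", input_shape=(100,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

start = time.time()
model.fit(x, y, batch_size=128, epochs=5, verbose=0)
print("samples/sec:", int(5 * len(x) / (time.time() - start)))
```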
Hi Tim
Thanks a lot for this article. I was looking for something like this.
I have a quick question. What would be the expected speedup for ConvNets with a GTX Titan X vs Core i7 4770-3.4 Ghz?
A rough idea would do the job?
Best Regards
Wajahat
I wonder what exactly happens when we exceed the 3.5GB limit of the GTX 970?!
Will it crash? If not, how much slower does it get when it passes that limit?
I want to know, if it passes the limit and gets slower, would it still be faster than the GTX 960? If so, that would be great.
Has anyone ever observed or benchmarked this? Have you?
Thanks again
Thanks a lot. Actually I don't want to play games with this card, I need its bandwidth and its memory to run some applications (a deep learning framework called Caffe).
Currently I have a GTX 750 with 2GB GDDR5; I need 4GB at least. At the same time, I also need a higher-bandwidth card.
I can't buy the GTX 980, it's too expensive for me, so I was unsure whether to go for the GTX 960 4GB or the GTX 970 4GB (3.5GB).
Basically, the GTX 960 is 128-bit and gives me 112GB/s of bandwidth, while the GTX 970 is 256-bit and gives me 192+GB/s of bandwidth.
My current card's bandwidth is only 80GB/s!
So I just need to know: do I have access to the whole 4 gigabytes of VRAM, games aside?
Does it crash if it exceeds the 3.5GB limit or does it just get slower?
I mistakenly posted this here! (This was supposed to be on TechPowerUp!)
I am using a GTX 970 and two 750s (with 1GB and 2GB GDDR5),
but there is no big difference in speed.
Rather, it seems the 750 is slightly faster than the 970.
Could you tell me the reason?
Thanks.
Hmm this seems strange. It might be that the GTX 970 hit the memory limit and thus is running more slowly so that it gets overtaken by a GTX 750. On what kind of task have you tested this?
Hi Tim, do you know (or recommend) any good DL project for “graphics card testing” on GitHub? Recently I've been cooperating with a hardware retailer, so they lent me a bunch of NVIDIA graphics cards (Titan, Titan X, Titan Black, Titan Z, 980 Ti, 980, 970, 780 Ti, 780…).
What if I install a single gtx 960 to a PCIe 2.0 slot instead of a 3.0?
It will be a bit slower to transfer data to the GPU, but for deep learning this is negligible. So not really a problem.
Thank you for sharing this. Please update the list with new Tesla P100 and compare it with TitanX.
I will probably do this on the weekend.
Hi Tim
Thanks a lot for sharing such valuable information.
Do you know if it will be possible to use an external GPU enclosure for deep learning,
such as a Razer core?
http://www.anandtech.com/show/10137/razer-core-thunderbolt-3-egfx-chassis-499399-amd-nvidia-shipping-in-april
Would there be any compromise on the efficiency?
Best Regards
Wjahat
There will be a penalty to get the data from your CPU to your GPU, but the performance on the GPU will not be impacted. Depending on the software and the network you are training, you can expect a 0-15% decrease in performance. This should still be better than the performance you could get for a good laptop GPU.
What can I expect from a Quadro M2000M (see http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html) with 4GB RAM in a “I started deep learning and I am serious about it” situation?
It will be comparable to a GTX 960.
Hi
I keep coming back to this great article. I was about to buy a 980 Ti when I discovered that today NVIDIA announced the Pascal GTX 1080, to be released at the end of May 2016. Maybe you want to add an update to your article with the fantastic performance and price of the GTX 1080/1070.
I will update the blog post soon. I want to wait until some reliable performance statistics are available.
By the way, the price difference between ASUS, EVGA, etc. vs. the original NVIDIA seems pretty high. The Titan X on Amazon is priced around 1300 to 1400 USD vs. 999 USD in the NVIDIA online store. Do you advise against buying the original NVIDIA card? If yes, why? What is the difference? Which brand do you prefer?
Many thanks Tim. Your posts are unique. We badly need hardware posts for deep learning!
For deep learning the performance of the NVIDIA one will be almost the same as ASUS, EVGA etc. (probably about a 0-3% difference in performance). Brands like EVGA might also add something like a dual BIOS to the card, but otherwise it is the same chip. So definitely go for the NVIDIA one.
I read this interesting discussion about the difference in reliability, heat issues and future hardware failures of the reference design cards vs the OEM design cards:
https://hashcat.net/forum/thread-4386.html
The opinion was strongly against buying the OEM design cards, especially for compute and 24/7 operation of GPUs.
I read all 3 pages and it seems there is no citation or any scientific study backing up the opinion, but the author seems to have first-hand experience, having bought thousands of NVIDIA cards before.
So what is your comment about this? Should we avoid OEM design cards and stick with the original NVidia reference cards?
Answering my own question above:
I asked the same question of the author of this blog post (Matt Bach) of Puget Systems and he was kind enough to answer based on around 4000 NVIDIA cards that they have installed at his company:
https://www.pugetsystems.com/labs/articles/Most-Reliable-PC-Hardware-of-2016-872/
I will quote the discussion that happened in the comments of the above article, in case anybody is interested:
Matt Bach :
Interesting question and one that is a bit hard to answer since we don’t really track individual cards by usage. I will tell you, however, that we lean towards reference cards if the card is expected to be put under a heavy load or if multiple cards will be in a system. Many of the 3rd party designs like the EVGA ACX and ASUS STRIX series don’t have very good rear exhaust so the air tends to stay in the system and you have to vent it with the chassis fans. That is fine for a single card, but as soon as you stack multiple cards into a system it can produce a lot of heat that is hard to get rid of. The Linus video John posted in reply to your comment lines up pretty closely what we have seen in our testing.
I did go ahead and pull some failure numbers from the last two years. This is looking at all the reference cards we sold (EVGA, ASUS, and PNY mostly) versus the EVGA ACX and ASUS STRIX cards (which are the only non-reference cards we tend to sell):
Total Failures: Reference 1.8%, EVGA ACX 5.0%, ASUS STRIX 6.6%
DOA/Shop Failures: Reference 1.0%, EVGA ACX 3.9%, ASUS STRIX 1.5%
Field Failures: Reference .7%, EVGA ACX 1.1%, ASUS STRIX 3.4%
Again, we don’t know the specific usage for each card, but this is looking at about 4,000 cards in total so it should average out pretty well. If anything, since we prefer to use the reference cards in 24/7 compute situations this is making the reference cards look worse than they actually are. The most telling is probably the field failure rate since that is where the cards fail over time. In that case, the reference are only a bit better than the EVGA ACX, but quite a bit better than the ASUS STRIX cards.
Overall, I would definitely advise using the reference style cards for anything that is heavy load. We find them to work more reliably both out of the box and over time, and the fact that they exhaust out the rear really helps keep them cooler – especially when you have more than one card.
Hayder Hussein:
Recently Nvidia began selling their own cards by themselves (with a bit higher price). What will be your preference? The cards that Nvidia are manufacturing and selling by themselves or a third party reference design cards like EVGA or Asus ?
Matt Bach :
As far as I know, NVIDIA is only selling their own of the Titan X Pascal card. I think that was just because supply of the GPU core or memory is so tight that they couldn’t supply all the different manufacturers so they decided to sell it directly. I believe the goal is to get it to the different manufacturers eventually, but who knows when/if that will happen.
If they start doing that for the other models too, there really shouldn’t be much of a difference between an NVIDIA branded card and a reference Asus/EVGA/whatever. Really hard to know if NVIDIA would have a different reliability than other brands but my gut instinct is that the difference would be minimal.
That is really insightful, thank you for your comment!
Your blog posts have become a must-read for anyone starting on deep learning with GPUs. Very well written, especially for newbies.
I was wondering though if/when you will write about the new beast: the GTX 1080? I am thinking of putting together a multi-GPU workstation with these cards. If you could compare the 1080 with the Titan or 900 series cards, that would be super useful for me (and I am sure quite a few other folks).
Thank you for this great article. What is your opinion about the new Pascal GPUs? How would you rank the GTX1080 and GTX1070 compared to the GTX Titan X? Is it better to buy the newer GTX 1080 or to buy a Titan X which has more memory?
Both the GTX 1080 and GTX 1070 are better. I do not have any hard data on this yet, but it seems that the GTX 1080 is just better than the Titan X, especially if you use 16-bit data.
Hey,
Great Writeup.
I have a GTX 970M with i7 6700 (desktop CPU) on a Clevo laptop.
How good is GTX 970m for deep learning?
A GTX 970M is pretty okay; especially the 6GB variant will be enough to explore deep learning and fit some good models on data. However, you will not be able to fit state-of-the-art models, or train medium-sized models in a reasonable time.
Great article! I would love to see some benchmarks on actual deep learning tasks.
I was under the impression that single precision could potentially result in large errors. In large networks with small weights/gradients, won’t the limited precision propagate through the net causing a snowballing effect?
I admit I have not experimented with this, or tried calculating it, but this is what I think. I’ve been trying to get my hands on a Titan / Titan Black, but with what you suggest, it would be much better getting the new Pascal cards.
With that being said, how would ‘half precision’ do with deep learning then?
The problem with actual deep learning benchmarks is that you need the actual hardware, and I do not have all these GPUs.
Working with low precision is just fine. The error is not high enough to cause problems. It was even shown that this is true for using single bits instead of floats since stochastic gradient descent only needs to minimize the expectation of the log likelihood, not the log likelihood of mini-batches.
Yes, Pascal will be better than Titan or Titan Black. Half-precision will double performance on Pascal since half-floating computations are supported. This is not true for Kepler or Maxwell, where you can store 16-bit floats, but not compute with them (you need to cast them into 32-bits).
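To illustrate the storage/compute split on Kepler- and Maxwell-class cards, here is a toy NumPy sketch (the layer size is arbitrary):

```python
import numpy as np

# Toy illustration of the storage/compute split on cards that can store fp16
# but not compute with it: weights live in 16-bit, the math runs in 32-bit.
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)   # half the storage
inputs = np.random.randn(128, 4096).astype(np.float32)

print("weight storage:", weights_fp16.nbytes // 1024**2, "MB")  # 32 MB instead of 64 MB
outputs = inputs @ weights_fp16.astype(np.float32)              # cast up for the matmul
print("output shape:", outputs.shape)
```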
What about the new NVIDIA GPUs like the GTX 1080 and GTX 1070? Please review these from a deep learning perspective after they are released. NVIDIA claims that GTX 1080 performance beats the GTX Titan GPU; is this true for deep learning tasks?
I am about to buy a new GPU for deep learning tasks, so please suggest which GPU I should buy considering the budget vs. performance ratio.
I was able to use TensorFlow, the latest Google machine learning framework, with an NVIDIA GTX 960 on Ubuntu 16.04. It’s not officially supported but it can be used.
I’ve posted a tutorial about how to install it here:
http://stackoverflow.com/questions/37600808/how-to-install-tensorflow-from-source-with-unofficial-gpus-support-in-ubuntu-16
Hi,
Very nice post! I found it really useful and I felt the GeForce 980 suggestion for Kaggle competitions was really apt. However, I am wondering how good the mobile versions of the GeForce series, such as the 940M, 960M, 980M and so on, are for Kaggle. Any thoughts on this?
I think for Kaggle anything >=6GB of memory will do just fine. If you have a slower 6GB card then you have to wait longer, but it is still much faster than a laptop CPU, and although slower than a desktop you still get a nice speedup and a good deep learning experience. Getting one of the fast cards is however often a money issue, as laptops that have them are exceptionally expensive. So a laptop card is good for tinkering and getting some good results in Kaggle competitions. However, if you really want to win a deep learning Kaggle competition, computational power is often very important and then only the high-end desktop cards will do.
Tim, what is the InfiniBand 40Gbit/s interconnect card for? Do I absolutely need the card if I am going to do a multi-GPU solution? And are all three of your Titan X cards connected using SLI?
You only need InfiniBand if you want to connect multiple computers. For multiple GPUs you just need multiple PCIe slots and a CPU that supports enough lanes. SLI is only used for games, but not for CUDA compute.
Hi Tim, thanks for updating the article! Your blog helped me a lot in increasing my understanding of Machine Learning and the Technologies behind it.
Up to now I mostly used AWS or Azure for my computations, but I am planning a new PC build. Unfortunately I still have some unanswered questions where even the mighty Google could not help! 🙂
Anyway, I was wondering whether you use all your GPU(s) for monitor output as well? I read a lot about screen tearing / blank screens / X stopping for a few seconds while running algorithms.
A possible solution would be to get a dedicated GPU for display output, such as a GTX 950 running at x8 to connect 3 monitors, while having 2 GTX 1080s at x16 speed just for computation. What is your opinion / experience regarding this matter?
Furthermore, my current PC’s CPU only has 16 PCIe lanes but has an iGPU built in. Could I use the iGPU for graphics output while a 1080 is installed for computation? I found a thread on Quora but the only feedback given was to get a CPU with 40 PCIe lanes. Of course this is true, but cash does not grow on trees and AMD Zen and the new Skylake Extreme chipsets are on the horizon.
Your feedback is highly appreciated and thanks in advance!
Personally, I never had any problems with video output from GPUs on which I also do computation.
The integrated iGPU is independent of your dedicated GTX 1080 and does not eat up any lanes. So you can easily run graphics from your iGPU and compute with 16 lanes from your GTX 1080.
The only problem you might encounter are problems with the intel/NVIDIA driver combination. I would not care too much about the performance reduction (0-5%) and I have yet to see problems with using a GPU for both graphics and compute at the same time and I have worked with about 5-6 different setups.
So I would try the iGPU + GTX 1080 setup and if you run into errors just use the GTX 1080 for everything.
Hi, I am trying to find a Kepler card (CC >= 3.0/3.5) for my research. Could you please suggest one? I had a GeForce GTX Titan X earlier but I could not use it for dual purposes, i.e. for computation and as the display driver. The Titan X does not allow this. So I’m searching for a Kepler card which allows dual use. Kindly suggest one.
Try to recheck your configuration. I have been running deep learning and a display driver on a GTX Titan X for quite some time and it is running just fine.
Thanks for the quick reply!
One final question, which may sound completely stupid.
Reference Cards vs Custom GPUs?
Often the clock speed and sometimes the VRAM speed are overclocked by default on many non-reference cards, but if I look through builds and their included screenshots, it seems that mostly reference cards are used. The only reason I could think of is the “predictable” cooler height.
Thanks again!
Oops, I meant Founders Edition vs. reference cards… happens when the mind wanders elsewhere!
Thanks Tim. Great update to this article, as usual.
I was eager to see any info on the support of half precision (16-bit) processing in the GTX 1080. Some articles speculated a few days before the release that it might be deactivated by NVIDIA, reserving this feature for the future NVIDIA P100 Pascal cards. However, around one month after the release of the GTX 1000 series, nobody seems to mention anything related to this important feature. This alone, if it had been enabled in the GTX cards, would make them up to ~1.5-2x faster in TFLOPs of processing power in comparison to Maxwell GPUs, including the Titan X. And as you mentioned, it adds the bonus of up to half the memory requirements. However, it is still not clear whether the accuracy of the NN will be the same in comparison to single precision and whether we can use half precision for all the parameters, which of course is important for estimating how large the speedup will be and how much less memory is required for a given task.
Hey there!
Great article! I am currently trying to flicker-train GoogLeNet on 400 of my own images using my SLI 780 Tis, but I keep getting errors such as "cannot find file" -> dir to file location (one of the images I'm training it on), although the file is there and the correct dir is in the train file. Do you have any idea why this would be? Also, in the guide I followed, the guy had 4GB VRAM and used a batch size of 40 with 256×256 images; I did the same but with a batch size of 30 to account for my 3GB VRAM. Am I doing something wrong here? How can I optimise the training to work on my video card? I appreciate any help you can give! Thanks, Josh!
Hi Tim,
Thanks for a great article, it helped a lot.
I am a beginner in the field of deep learning and have built and used only a couple of architectures on my CPU (I am currently a student so I decided not to invest in GPUs right away).
I have a question regarding the amount of CUDA programming required if I decide to do some sort of research in this field. I have mostly implemented my vanilla models in Keras and am learning Lasagne so that I can come up with novel architectures.
I know quite a few researchers whose CUDA skills are not the best. You often need CUDA skills to write efficient implementations of novel procedures or to optimize the flow of operations in existing architectures, but if you want to come up with novel architectures and can live with a slight performance loss, then no or very little CUDA skill is required. Sometimes you will have cases where you cannot progress due to your lacking CUDA skills, but this is rarely an issue. So do not waste your time on CUDA!
Thanks for the reply 🙂
Is a GT 635 capable of cuDNN (conv_dnn) acceleration? In theory it is a GK208 Kepler chip with 2GB of memory. I know it's a crap card but it's the only NVIDIA card I had lying around. I have not been able to get GPU acceleration on Windows 8.1 to work, so I wanted to ask if it's my Theano/CUDA/Keras installation that's the issue, or if it's the card, before I throw any money at the problem and buy a better GPU (960+). Should I go to Windows 10?
Your card, although crappy, is a kepler card and should work just fine. Windows could be the issue here. Often it is not well supported by deep learning frameworks. You could try CNTK which has better windows support. If you try CNTK it is important that you follow this install tutorial step-by-step from top to bottom.
I would not recommend Windows for doing deep learning as you will often run into problems. I would encourage you to try to switch to Ubuntu. Although the experience is not as great when you make the switch, you will soon find that it is much superior for deep learning.
Thanks Tim, I did eventually get the GT 635 working under Windows 8.1 on my Dell: about a 2.7x improvement over my Mac Pro’s 6-core Xeons. Getting things going on OSX was much easier. I still don’t think the GT 635 is using cuDNN (cuDNN not available) but I’ll have to play; I get the sense I could get another 2x with it. The 2GB of VRAM sucks, I really have to limit the batch sizes I can work with.
What are your thoughts on the GTX 1060? An easy replacement for the 960 on the low end? Overclocking the 1060 looks like it can get close to an FE 1070, minus 2GB of memory. Thoughts?
Hello Tim,
I’ve been following your blog for 3 months now and since then I have been waiting to buy a GTX Titan X Pascal. However, there are rumors about NVIDIA releasing the Volta architecture next year with HBM2. What are your thoughts about investing in a Pascal-architecture GPU at the moment? Thank you.
If you already have a good Maxwell GPU, the wait for Volta might well be worth it. However, this of course depends on your applications, and then of course you can always sell your Pascal GPU once Volta hits the market. Both options have their pros and cons.
I’m curious if unified memory in CUDA 8 will work for dual 1080s.
Then theoretically a dual 1080 NVLink setup would crush the Titan X in memory and FLOPS?
Unified memory is more a theoretical than a practical concept right now. The CPU and GPU memory is still managed by the same mechanism as before; it is just that the transfers are hidden from you. Currently you will not see any benefit from this over Maxwell GPUs.
I have a used 1060 6GB on hand. I am planning to get into research-type deep learning. However, I am still getting started and don't understand all the nitty-gritty of parameter tuning, batch sizes, etc. You mention 6GB would be limiting in deep learning. If I understand right, using small batch sizes would not converge on large models like ResNet with a 1060 (I am shooting in the dark here with the terminology since I am still a beginner). I was thus looking to get a 1080 Ti when I came across an improved form of “unified memory” introduced with Pascal and CUDA 6+. I have a 6-core HT Xeon CPU + 32GB RAM. Could I use some system RAM to remove the 6GB limitation?
I understand that the P100 can do this without incurring heavy copy latency due to a new “page migration engine” feature, wherein if an access to data in GPU memory leads to a page fault, the program suspends the call and requests the relevant memory page (instead of the whole data) from the CPU.
http://parallelplusplus.blogspot.in/2016/09/nvidia-pascals-gpu-architecture-most.html
There is confusion about this feature in the 1080 Ti even though it uses the same GP100 module. The conclusion seems to be that it can handle a “page fault” but cannot do “prefetch”.
https://stackoverflow.com/a/43770299
and is seconded by
https://stackoverflow.com/a/40011988 for titan xp
Since the 1080 (and by inference the 1060 6GB, since both are Pascal chips) also has ConcurrentManagedAccess set to 1 according to
https://devtalk.nvidia.com/default/topic/1015688/on-demand-paging/?offset=5
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-device-properties
I am guessing I wouldn't benefit much from a new purchase of a 1080 Ti.
However, it looks like these CUDA 8 C++ features haven't yet been applied to higher-level language libraries.
https://github.com/inducer/pycuda/issues/116
Looks like AMD’s Vega might have this feature too through IOMMUv2 hardware passthrough, and considering that AMD’s MIOpen 1.0 supports TensorFlow and Torch7, it might be a good alternative.
http://parallelplusplus.blogspot.in/2017/01/fine-grained-memory-management-on-amd.html
http://www.amd.com/system/files/2017-06/TIRIAS-AMD-Epyc-GPU-Server.pdf
https://www.reddit.com/r/Amd/comments/61oolv/vega_you_are_no_longer_limited_by_the_amount_of/
I would be interested in hearing your thoughts, specifically whether a 1060 would be capable of addressing more than 6GB without a heavy penalty, maybe in the future (I can't extrapolate what the yes to ‘page fault’ and no to ‘prefetch’ would mean in this context), and whether it would be faster or slower than the AMD Vega solution using IOMMU.
I think the easiest and often overlooked option is just to switch to 16-bit models which doubles your memory. This is supported by most libraries (you are right that the “page migration engine” is not supported by any deep learning library). Other than that I think one could always adjust the network to make it work on 6GB — with this you will not be able to achieve state-of-the-art results, but it will be close enough and you save yourself from a lot of hassle. I think this also makes practically the most sense.
My brother recommended that I might like this blog.
He was totally right. This post actually made my day.
You cannot believe just how much time I had spent looking for this info!
Thanks!
I am glad to hear that you and your brother found my blog post helpful 🙂 Thank you!
Hi,
my confusions are:
1) Is the Quadro series (K2000 and higher) capable enough for getting started with deep learning?
2) Kepler, Maxwell, Pascal: how much difference does it make in performance for a beginner?
3) Is the GTX Titan X Pascal or Maxwell?
5) What parameters should be considered when comparing GPUs as far as deep learning is concerned?
4) Please suggest a GPU.
1) A k2000 will be okay and you can do some tests, but its memory and performance will not be sufficient for larger datasets
2) Get Maxwell or Pascal GPU if you have the money; Kepler is slow
3) There is a one Titan X for Pascal and one for Maxwell
5) Look at memory bandwidth mostly
4) GTX 1060
Exceptionally excellent blog
Thank you so much for your valuable reply.
Hi Tim,
First of all, thank you for your awesome articles about deep learning. They have been very useful for me. Since I use the Caffe and CNTK frameworks for deep learning and GPU computing speed is very important, encouraged by your last article update (GTX Titan X Pascal = 0.7 GTX 1080 = 0.5 GTX 980 Ti) and very positive reviews on the Internet, I decided to upgrade my GTX 980 Ti (Maxwell) to a brand new GTX 1080 (Pascal). In order to compare the performance of both architectures, new Pascal against old Maxwell (and of course because I just wanted to see how well my new GTX 1080 performs to justify the expense 🙂), I benchmarked both cards in Caffe (CNTK is not cuDNN 5 ready yet). To my big surprise the new GTX 1080 is about 20% slower in AlexNet training than the old GTX 980 Ti. I ran two benchmarks in order to compare performance on different operating systems, but with practically the same results. The reason why I chose different versions of CUDA and cuDNN is that the Pascal architecture is supported only in CUDA 8RC and cuDNN 5.0, and the Maxwell architecture performs better with CUDA 7.5 and cuDNN 4.0 (otherwise you get poor performance).
Maybe I have done something wrong in my benchmark (but I’m not aware of anything…). Could you give me some advice on how to improve training performance on the GTX 1080 with Caffe? Is there any other framework which supports the Pascal architecture at full speed?
First benchmark:
OS: Windows 7 64bit
Nvidia drivers: 368.81
Caffe build for GTX 1080 : Visual Studio 2013 64-bit, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 4.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.1
GTX 1080 performance : 4512 samples/sec.
GTX 980 Ti performance: 5407 samples/sec. (cuDNN 4.0) best performance
GTX 980 Ti performance: 4305 samples/sec. (cuDNN 5.0)
GTX 980 Ti performance: 4364 samples/sec. (cuDNN 5.1)
Second benchmark:
OS: Ubuntu 16.04.1
Nvidia drivers: 367.35
Caffe build for GTX 1080 : gcc 5.4.0, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: gcc 5.4.0, CUDA 7.5, cuDNN 4.0
GTX 1080 performance : 4563 samples/sec.
GTX 980 Ti performance: 5385 samples/sec.
Thank you very much,
Ondrej
Thank you Ondrej for sharing — these are some very insightful results!
I am not entirely sure how convolutional algorithm selection works in Caffe, but this might be the main reason for the performance discrepancy. The cards might have better performance for certain kernel sizes and for certain convolutional algorithms. But all in all these are quite hard numbers and there is little room for arguing. I think I need to update my blog post with some new numbers. Learning that the performance of Maxwell cards is so much better with cuDNN 4.0 is also very valuable. I will definitely add this in an update to the blog post.
Thanks again for sharing all this information!
The best article about choosing GPUs for deep learning I’ve ever read!
As a CNN learner with a small budget, I decided to buy a GTX 1060 to replace my old Quadro K620. Since the GTX 1060 does not support SLI and you wrote that GPUs use “the PCIe 3.0 interface for communication in multi-GPU applications”, I am a little worried about upgrading later. Should I buy a GTX 1070 instead of a GTX 1060? Thanks.
Maybe this was a bit confusing, but you do not need SLI for deep learning applications. The GPUs communicate via the PCIe channels printed on the motherboard. So you can use multiple GTX 1060 cards in parallel without any problem.
Thanks for sharing your knowledge about these topics.
Regards.
Dear Tim,
Extremely thankful for the info provided in this post.
We have a GPU server with CUDA 6.0 installed and two Tesla T10 graphics cards. Can I use this GPU system for deep learning, given that the Tesla T10 is quite old by now? I am facing some hardware issues with installing Caffe on this server. It runs Ubuntu 12.04 LTS.
Thanks in advance
Tharun
The Tesla T10 chip has too low a compute capability, so you will not be able to use cuDNN. Without that you can still run some deep learning libraries, but your options will be limited and training will be slow. You might want to just use your CPU or try to get a better GPU.
Hi Tim
Your site contains such a wealth a knowledge. Thank you.
I am interested in having your opinion on cooling the GPU. I contacted NVIDIA to ask what cooling solutions they would recommend on a GTX Titan X Pascal in regards to deep learning and they suggested that no additional cooling was required. Furthermore, they would discourage adding any cooling devices (such as EK WB) as it would void the warranty. What are your thoughts? Is the new Titan Pascal that cooling efficient? If not, is there a device you would recommend in particular?
Also, I am mostly interested in RNNs and I plan on starting with just one GPU. Would you recommend a second GPU in light of the new SLI bridge offered by NVIDIA? Do you think it could deliver increased performance on a single experiment?
I would also like to add that, looking at the DevBox components, no particular cooling is added except for sufficient GPU spacing and upgraded front fans.
http://developer.download.nvidia.com/assets/cuda/secure/DIGITS/DIGITS_DEVBOX_DESIGN_GUIDE.pdf?autho=1471007267_ccd7e14b5902fa555f7e26e1ff2fe1ee&file=DIGITS_DEVBOX_DESIGN_GUIDE.pdf
From my experience, additional case fans are negligible (less than a 5 degree difference; often as low as 1-2 degrees). Increasing the GPU fan speed by 1% often has a larger effect than additional case fans.
If you only run a single Titan X Pascal then you will indeed be fine without any other cooling solution. Sometimes it will be necessary to increase the fan speed to keep the GPU below 80 degrees, but the sound level for that is still bearable. If you use more GPUs air cooling is still fine, but when the workstation is in the same room then noise from the fans can become an issue as well as the heat (it is nice in winter, then you do not need any additional heating in your room, even if it is freezing outside). If you have multiple GPUs then moving the server to another room and just cranking up the GPU fans and accessing your server remotely is often a very practical option. If those options are not for you water cooling offers a very good solution.
Hi Tim,
Thanks for the great article and thanks for continuing to update it!
Am I correct that the Pascal Titan X doesn’t support FP16 computations? So if TensorFlow or Theano (or one’s library of choice) starts fully supporting FP16, would the GTX 1080 then be better than the new Titan X as it would have larger effective (FP16) memory? But perhaps I am missing something…
Is it clear yet whether FP16 will always be sufficient or might FP32 prove necessary in some cases?
Thanks!
Hey Chad, the GTX 1080 also does not support FP16 which is a shame. We will have to wait for Volta for this I guess. Probably FP16 will be sufficient for most things, since there are already many approaches which work well with lower precision, but we just have to wait.
ah, ok. got it. Thanks a lot!
Which GTX, if any, support int8? Does Tensorflow support int8? Thanks for the great blog.
All GPUs support int8, both signed and unsigned; in CUDA this is just a signed or unsigned char. I think you can do regular computation just fine. However, I do not know what the support in TensorFlow is like, but in general most of the deep learning frameworks do not have support for computations on 8-bit tensors. You might have to work closer to the CUDA code to implement a solution, but it is definitely possible. If you work with 8-bit data on the GPU, you can also input 32-bit floats and then cast them to 8-bits in the CUDA kernel; this is what torch does in its 1-bit quantization routines for example.
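To make the float-to-8-bit idea concrete, here is a minimal NumPy sketch of symmetric linear quantization; it illustrates the general recipe only and is not the routine any particular framework uses:

import numpy as np

x = np.random.randn(1024).astype(np.float32)    # 32-bit activations or gradients
scale = np.abs(x).max() / 127.0                 # one shared scale for the whole tensor
x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)   # 8-bit representation
x_restored = x_int8.astype(np.float32) * scale  # dequantize back to 32-bit for further math

The same cast-and-scale pattern is roughly what you would implement inside a CUDA kernel if your framework only hands you 32-bit tensors.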
Hi Tim,
Would multi lower tier gpu serve better than single high tier gpu given similar cost?
Eg.
3 x 1070
vs
1 x Titan X Pascal
Which would you recommend?
Here is one of my quora answers which deals exactly with this problem. The cards in that example are different, but the same is true for the new cards.
Thanks for the reply.
I am not sure if I understand the answer correctly. Is the PCIe bandwidth of around 8GB/s the bottleneck you are referring to when using multiple cards, compared to a single Titan X with its larger memory bandwidth of 336GB/s?
One more question: does slower DDR RAM bandwidth impact the performance of deep learning?
That is correct: with multiple cards the bottleneck is the connection between the cards, which in this case is the PCIe connection. Slower RAM bandwidth decreases performance by almost as much as the bandwidth is lower, so it is quite important. This comparison, however, is not valid between different GPU series, e.g. it is invalid for Maxwell vs. Pascal.
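To see why the PCIe link dominates in the multi-GPU case, here is a rough back-of-envelope calculation in Python; the network size and effective transfer rates are assumptions for illustration, not measurements:

params = 100e6                          # hypothetical dense network with 100M parameters
grad_bytes = params * 4                 # gradients stored as 32-bit floats
pcie_bandwidth = 8e9                    # ~8 GB/s effective rate over the PCIe link
gpu_mem_bandwidth = 336e9               # Titan X on-card memory bandwidth
print(grad_bytes / pcie_bandwidth)      # ~0.05 s to ship the gradients to another GPU
print(grad_bytes / gpu_mem_bandwidth)   # ~0.001 s to stream the same data on-card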
Hi Tim. Great article.
One question: I have been given a Quadro M6000 24GB. How do you think it compares to a Titan or Titan X for deep learning (specifically Tensorflow)? I’ve used a Titan before and I am hoping that at least it wouldn’t be slower.
Thank you.
The Quadro M6000 is an excellent card! I do not normally recommend it because it is not very cost efficient. However, the very large memory and high speed, which is equivalent to a regular GTX Titan X, are quite impressive. On normal cards you do not have more than 12GB of RAM, which means you can train very large models on your M6000. So I would definitely stick with it!
Awesome, thanks for the quick response.
Hey Tim, thank you so muuuch for your article!! I am in the “I started deep learning and I am serious about it” group and will buy a GTX 1060 for it. I am more specifically interested in autonomous vehicles and Simultaneous Localization and Mapping. Your article has helped me clarify my current needs and match them with a GPU and budget.
You have a new follower here!
Thanks!
Thank you for your kind words; I am glad that you found my article helpful!
Hey Tim,
Thank you for this fantastic article. I have learned a lot in these past couple of weeks on how to build a good computer for deep learning.
My question is rather simple, but I have not found an answer yet on the web: should I buy one Titan X Pascal or two GTX 1080s?
Thank you very much for your time,
Tim
Hey Tim,
In the past I would have recommended one faster bigger GPU over two smaller, more cost-efficient ones, but I am not so sure anymore. The parallelization in deep learning software gets better and better and if you do not parallelize your code you can just run two nets at a time. However, if you really want to work on large datasets or memory-intensive domains like video, then a Titan X Pascal might be the way to go. I think it highly depends on the application. If you do not necessarily need the extra memory — that means you work mostly on applications rather than research and you are using deep learning as a tool to get good results, rather than a tool to get the best results — then two GTX 1080 should be better. Otherwise go for the Titan X Pascal.
Tim
Hi Tim,
First of all thank you for your reply.
I am ready to finally buy my computer however I do have a quick question about the 1080 ti and the Titan xp. For a researcher that does some GAN, LSTM and more, would you recommend 2x 1080 ti or just one Titan xp. I understand that in your first post you said that the Titan X Pascal should be the one, however I would like to know if this is still the case on the newer versions of the same graphics cards.
Thank you so much for updating the article!
Tim
I think two GTX 1080 Ti would be a better fit for you. It does not sound like you would need to push the final performance on ImageNet where a Titan Xp really shines. If you want to build new algorithms on top of tried and tested algorithms like LSTM and GANs then two GPUs (which are still very fast) and 1 GB less memory will be far better than one big GPU.
Tim!
Thanks so much for your article. It was instrumental in me buying the Maxwell Titan X about a year ago. Now, I’ve upgraded to 4 Pascal Titan X cards, but I’m having some problems getting performance to scale using data parallelism.
I’m trying to chunk a batch of images into 4 chunks and classify them (using caffe) on the 4 cards in parallel using 4 separate processes.
I’ve confirmed the processes are running on the separate cards as expected, but performance degrades as I add new cards. For example, if it takes me 0.4 sec / image on 1 card alone, when I run 2 cards in parallel, they each take about 0.7 sec / image.
Have you had any experience using multiple Pascal Titan X’s in this manner? Am I just missing something about the setup/driver install?
Thanks!
Were you getting better performance on your Maxwell Titan X? It also depends heavily on your network architecture; what kind of architecture were you using? Data parallelism in convolutional layers should yield good speedups, as do deep recurrent layers in general. However, if you are using data parallelism on fully connected layers this might lead to the slowdown that you are seeing — in that case the bandwidth between GPUs is just not high enough.
Hi Tim,
Thank you very much for the fast answer.
I just have one more question that is related to the CPU. I understand that having more lanes is better when working with multiple GPUs as the CPU will have enough bandwidth to sustain them. However, in the case of having just one GPU is it necessary to have more than 16 or 28 lanes? I was looking at the *Intel Core i7-5930K 3.5GHz 6-Core Processor*, which has 40 lanes (and is the cheapest in that category) but also requires an LGA 2011 and DDR4 memory which are expensive. Is this going to be too much of an overkill for the Titan X Pascal?
Thank you for your time!
Tim
If you are having only 1 card, then 16 lanes will be all that you need. If you upgrade to two GPUs you want to have either 32+ lanes (16 lanes for each) or just stick with 16 lanes (8 lanes for each), since the slowest GPU will always drag down the other one (24 lanes means 16x + 8x lanes, and for parallelism this will be bottlenecked by the 8 lanes). Even if you are using 8 lanes, the drop in performance may be negligible for some architectures (recurrent nets with many time steps; convolutional layers) or some parallel algorithms (1-bit quantization, block momentum). So you should be more than fine with 16 or 28 lanes.
I compared the Quadro K2200 with the M4000.
Surprisingly, the K2200 beat the M4000 on a simple network.
I am looking for a single-slot GPU with higher performance than the K2200.
How about the K4200?
Quadro K4200 (single-slot, single precision = 2,072.4 GFLOPS)
Check whether your benchmarks are representative of usual deep learning performance. The K2200 should not be faster than an M4000. What kind of simple network were you testing on?
I tested the simple network on a chainer default example as below.
python examples/mnist/train_mnist.py --gpu 0
Result:
K2200: avg 14288.94418 images/sec
M4000: avg 13617.58361 images/sec
However, I confirmed that the M4000 is faster than the K2200 on a complex network like AlexNet.
[convnet-benchmarks] ./convnet-benchmarks/chainer/run.sh
Result:
K2200: alexnet 639ms, overfeat 2195ms
M4000: alexnet 315ms, overfeat 1142ms
I think that the GPU clock is what matters in the simple network. Is this correct?
GPU clock: K2200 1045 MHz, M4000 772 MHz
Shading units: K2200 640, M4000 1664
Hi! Great article, very informative. However, I want to point out that the NVIDIA Geforce GTX Titan X and the NVIDIA Titan X are two different graphics cards (yes the naming is a little bit confusing). The Geforce GTX Titan X has Maxwell microarchitecture, while the Titan X has the newer Pascal microarchitecture. Hence there is no “GTX Titan X Pascal”.
Ah this is actually true. I did not realize that! Thanks for pointing that out! Probably it is still best to add Maxwell/Pascal to not confuse people, but I should remove the GTX part.
I live in a place where 200kWh costs 19.92 dollars and 600kWh costs 194.7 dollars, so the electricity bill grows very steeply with usage. I usually train unsupervised learning algorithms on 8 terabytes of video. Which GPU or GPUs should I get? The Titan X Pascal has the most bandwidth per watt, but it is a lot more expensive for the small gain in performance per watt.
Take into consideration the potential cost of electricity when comparing the options of building your own machine versus renting one in a data centre.
Great article. I am just a noob at this and learning; not a researcher, but an application guy. I have an old Mac Pro 2008 with 32GB of RAM (FB-DIMM in 8 channels) on dual Xeon quad cores at 2.8GHz (8 cores, 8 threads). I have been using a GTX 750 Ti with 4GB on DeepMask/SharpMask in Torch. The COCO image set took 5 days to train through 300 epochs on DeepMask. I am wondering how much of a performance increase I would see going to a GTX 1070? Or I am wondering if I could instead add a second GTX 750 Ti that matches the one I have for 8GB of RAM (I have room for 2 GPUs).
thanks for everything
Adding a GTX 750 Ti will not increase your overall memory since you will need to make use of data parallelism, where the same model rests on all GPUs (the model is not distributed among GPUs, so you will see no memory savings). In terms of speed, an upgrade to a GTX 1070 should be better than two GTX 750 Ti and also significantly easier in terms of programming (no multi-GPU programming needed). So I would go with the GTX 1070.
Hi Tim, thanks for an insightful article!
I picked up a new 13″ MacBook with Thunderbolt 3 ports, and I am thinking of a setup using a GTX 1080 in an eGFX enclosure – http://www.anandtech.com/show/10783/powercolor-announces-devil-box-thunderbolt-3-external-gpu-enclosure . What do you think of this idea?
It should work okay. There might be some performance problems when you transfer data from CPU to GPU. For most cases this should not be a problem, but if your software does not buffer data on the GPU (sending the next mini-batch while the current mini-batch is being processed) then there might be quite a performance hit. However, this performance hit is due to software and not hardware, so you should be able to write some code to fix performance issues. In general the performance should be good in most cases with around 90% performance.
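As a rough illustration of the kind of buffering code meant here, below is a minimal Python sketch that loads the next mini-batch in a background thread while the current one is processed; load_batches and train_step are hypothetical placeholders for your own data loader and training function:

import queue
import threading

def prefetch(batches, depth=2):
    # Fill a small queue in a background thread so the GPU never waits for the next batch.
    q = queue.Queue(maxsize=depth)
    def worker():
        for b in batches:
            q.put(b)
        q.put(None)              # sentinel: the iterator is exhausted
    threading.Thread(target=worker, daemon=True).start()
    while True:
        b = q.get()
        if b is None:
            return
        yield b

# usage sketch: for batch in prefetch(load_batches()): train_step(batch)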
Hi
I want to test multiple neural networks against each other using Encog. For that I want to get an NVIDIA card. After reading your article I am thinking about getting the 1060, but since most calculations in Encog use double precision, would the 780 Ti be a better fit? The data file will not be large and I do not use images.
Thanks
The GTX 780 Ti would still be slow for double precision. Try to get a GTX Titan (regular) or GTX Titan Black; they have excellent double precision performance and generally work quite okay even in 32-bit mode.
Thanks for sharing this- good stuff! Keep up the great work, we look forward to reading more from you in the future!
Hi Tim,
Really useful post, thanks.
I wondered about the Titan Black: looking online, its memory bandwidth, 6GB of memory, and single and double precision performance are better than a 1060, and at current eBay prices it is about 10-15% cheaper than a 1060.
Other than the lower power of the 1060 and warranty, would there be any reason to choose the 1060 over a Titan Black?
Thank you
The architecture of the GTX 1060 is more efficient than the Titan Black architecture. Thus for speed, the GTX 1060 should still be faster, but probably not by much. So the GTX Titan Black is a solid choice, especially if you also want to use the double precision.
Hi Tim,
thanks for the article, it’s the most useful I found during my 14-hour google-marathon!
I’m very new to deep learning. Starting with YOLO I’ve found that my GTX 670 with 2GB is seriously limiting what I can explore. Inferring from your article, if I stack multiple GPUs for CNNs, the memory will in principle add up, right? I’m asking because I will make a decision between a used Maxwell Titan X and a 1070/1080, and my main concern is the memory, thus I would like to know if it is a reasonable memory upgrade option (for CNNs) to add a second card at some point when they are cheaper. Furthermore, if the 1080 and the used Maxwell Titan X are the same price, is this a good deal?
Also, I’m concerned with the FP16x2 feature of the 1070/1080, which adds only one FP16x2 core for every 128 FP32 cores: if I’m using FP16, the driver might report my card as FP16v2 capable, and thus a framework might use these few FP16v2 cores instead of emulating FP16 arithmetic by promoting to FP32. Is this a valid worst-case scenario for e.g. Caffe/Torch/… or am I confusing something here? Also, I’ve read that before Pascal there is effectively no storage benefit from FP16, as the numbers need to be promoted to FP32 anyway. I can only understand this if the data needs to be promoted before fetching it into the registers for computation; is this right?
Thank you
Hi Markus,
unfortunately the memory will not stack up, since you probably will use data parallelism to parallelize your models (the only way of parallelism which really works well and is fast).
If you can get a used Maxwell Titan X cheap, this is a solid choice. I personally would not mind the minor slowdown compared to the added flexibility, so I would go for the Titan X as well here.
Currently, you do not need to worry about FP16. Current code will make use of FP16 memory, but FP32 computations so that the slow FP16 compute units on the GTX 10 series will not come into play. All of this probably only becomes relevant with the next Pascal generation or even only with Volta.
I hope this answered all of your questions. Let me know if things are still unclear.
Hi Tim, thanks for a great article! I’m just wondering if you had experience with installing the GTX or Titan X on rackmount servers? Or if you have recommendations for articles or providers on the web? (I’m in UK). I am having a long running discussion with IT support about whether it is possible, as we couldn’t find any big providers that would put together such a system. The main issue seems to revolve around cooling as IT says that Teslas are passive cooled while Titan X are active cooled, and may interfere with the server’s cooling system.
I think the passively cooled Teslas still have a two-PCIe-slot width, so that should not be a problem. If the current Tesla cards are one PCIe slot wide, then it will be a problem and Titan Xs will not be an option.
Cooling might indeed also be an issue. If the passively cooled Teslas have intricate cooling fins, then their cooling combined with active server cooling might indeed be much superior to what Titan Xs can offer. Cooling systems for clusters can be quite complicated and this might lead to Titan Xs breaking the system.
Another issue might be just buying Titan Xs in bulk. NVIDIA does not sell them in bulk, so you will only be able to equip a small cluster with these cards (this is also the reason why you do not find any providers for such a system).
Hope this helps!
Hi Tim, I found an interesting thing recently.
I tried one Keras project (both Theano and TensorFlow were tested) on three different computing platforms:
A: SSD + i5-3470 (3.2GHz) + GTX 750 Ti (2GB)
B: SSD + E5-2620 v3 + Titan X (12GB)
C: HDD + i5-6300HQ (2.6GHz) + GTX 965M (4GB)
With the same setup of CUDA 8.0 and cuDNN 5.0, A and B got similar GPU performance. However, I cannot understand why C is about 5 times slower than A. Before the experiment I expected C to perform better than A.
As I understand it, Keras might not prefetch data. On certain problems this might introduce some latency when you load data, and loading data from a hard disk is slower than from an SSD. If the data is loaded into memory by your code, this is, however, unlikely to be the problem. What strikes me is that A and B should not be equally fast. C could also be slow due to the laptop motherboard, which may have a poor or reduced PCIe connection, but usually this should not be such a big problem. Of course this could still happen for certain datasets.
How about GTX 1070 SLI?
“However, training may also take longer, especially the last stages of training where it becomes more and more important to have accurate gradients.” Why are the last stages of training important? Any justification?
It is easy to improve from a pretty bad solution to an okay solution, but it is very difficult to improve from a good solution to a very good solution. Improving our 100 meter dash time by a second is probably not so difficult, while for an Olympic athlete it is nearly impossible because they already operate at a very high level. The same goes for neural nets and their solution accuracy.
Does an AMD card support Theano/Keras?
Amazon has introduced a new class of instances: Accelerated Computing Instances (P2), with 12GB K80 GPUs. These are much, much better than the older G2 instances, and go for $0.90/hr. Does this change anything in your analysis?
With such an instance you get one K80 for $0.9/h, which means $21.6/day and $648/month. If you use your GPU for more than about one GPU-month of runtime, then it quickly gets cheaper to buy your own GPU. I do not think it really makes sense for most people.
It is probably a good option for people doing Kaggle competitions, since most of the time will still be spent on feature engineering and ensembling. For researchers, startups, and people who are learning deep learning it is probably still more attractive to buy a GPU.
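As a quick sanity check on the rent-versus-buy math above, here is a small Python calculation; the purchase price of the card is an assumed figure for illustration:

hourly_rate = 0.90                 # K80 P2 instance, dollars per hour
gpu_price = 700.0                  # hypothetical price for buying your own card
hours = gpu_price / hourly_rate
print(hours)                       # ~778 hours of rented compute
print(hours / 24)                  # ~32 days of continuous training to break even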
Hello,
I decided to buy a GTX 1060 or GTX 1070 card to try deep learning, but I am curious whether the RAM size of the GPU or its bandwidth/speed will affect the ACCURACY of the final model or not, comparing these two specific GPU cards.
In other words, I want to know whether selecting the GTX 1060 will just cause a longer training time than the GTX 1070, or whether it will also affect the accuracy of the model.
Hi Hesam, the two cards will yield the same accuracy. There are some elements in the GPU which are non-deterministic for some operations and thus the results will not be identical, but they will always be of similar accuracy.
Hi, it’s been a pleasure to read this article! Thanks!
Have you done any comparison of 2 x Titan X against 4 x GTX 1080? Or maybe you have some thoughts regarding it?
The speed of 4x 1080 vs 2 Titan X is difficult to measure, because parallelism is still not well supported for most frameworks and the speedups are often poor. If you look however at all GPUs separately, then it depends on how much memory your tasks needs. If 8GB are okay, 4x 1080 are definitely better than 2x Titan X, if not, then 2x Titan X are better.
Hi Tim,
first of all thank you for this great article.
I understand that the memory clock speed is quite important and depending on which graphics card manufacturer/line I choose, there will be up to a 10% difference.
Here is a good overview[German]
http://www.pcgameshardware.de/Pascal-Codename-265448/Specials/Geforce-GTX-1080-Custom-Designs-Herstellerkarten-1198846/
I am going to buy a 1080 and I am wondering if it makes sense to get such an OC one.
Do you have any experience with / advice on this?
Thank you for an answer,
Erik
OC GPUs are good for gaming, but they hardly make a difference for deep learning. You are better off buying a GPU with other features such as better cooling. When I tested overclocking on my GPUs it was difficult to measure any improvement. Maybe you will get something in the range of 1-3% improved performance for OC GPUs, so it is not so much worth it if you need to pay extra for OC.
Could you add AWS’s new P2 instance into comparison? Thank you very much!
The last time I checked, the new GPU instances were not viable due to their pricing. Only in some limited scenarios, where you need deep learning hardware for a very short time, do AWS GPU instances make economic sense. Often it is better to buy a GPU, even if it is a cheaper, slower one. With that you will get many more GPU-accelerated hours for your money compared to AWS instances. If money is less of an issue, AWS instances also make sense for firing up some temporary compute power for a few experiments, or for startups training a new model.
Hi Tim,
With the release of the GTX 1080 Ti and the revamp+reprice of GTX 1060/70/80, would you change anything in your TL;DR section, especially vs Pascal Titan X ?
Links to key points:
– GTX 1080 Ti: http://wccftech.com/nvidia-geforce-gtx-1080-ti-unleash-699-usd/
– Revamp+reprice of GTX 1060/70/80 etc.: http://wccftech.com/nvidia-geforce-gtx-1080-1070-1060-official-price-cut-specs-upgrade/
Cheers,
E.
Thank you so much for the links. I will have to look at those details, make up my mind, and update the blog post. On a first look it seems that the GTX 1070 8GB would really be the way to go for most people. Just a lot of bang for the buck. The NVIDIA Titan X seems to become obsolete for 95% of people (only vision researchers that need to squeeze out every last bit of RAM should use it) and the GTX 1080 Ti will be the way to go if you want fast compute.
Hi Tim,
Thank you for the great article and answering our questions.
NVIDIA just announced their new GTX 1080 TI. I heard that it shall even outperform the Titan X Pascal in gaming. I did not read anything about the performance of the GTX 1080 TI in Machine Learning / Deep Learning yet.
I am building a PC at the moment and have some parts already. Since the Titan X was not available over the last few weeks, I could still get the GTX 1080 TI instead.
1.) What is better, the GTX 1080 Ti or the Titan X? If the difference is very small, I would choose the cheaper 1080 Ti and upgrade to Volta in a year or so. Is the only difference the 11 GB instead of 12 and a slightly faster clock, or are some features disabled that could cause problems with deep learning?
2.) Is half precision available on the GTX 1080 Ti and/or the Titan X? I thought that it is only available on the much more expensive Tesla cards, but after reading through the replies here I am not sure anymore. To be more precise, I only care about half precision (float16) if it brings a considerable speed improvement (on Tesla cards roughly twice as fast compared to float32). If it is available but with the same speed as float32, I obviously do not need it.
Looking forward to your reply.
Thomas
Hi Tim,
one more question: How much of an issue will the 11GB of the GTX 1080 TI be compared to the 12GB on the Titan X? Does that mean that I cannot run many of the published models that were created by people on a 12GB GPU? Do people usually fill up all of the memory available by creating deep nets that just fit in their GPU memory?
Some of the very state-of-the-art models might not run on some of the datasets. But in general, this is a non-issue. You will still be able to run the same models, but instead of 1000 layers you will only have something like 900 layers. If you are not someone who does cutting-edge computer vision research, then you should be fine with the GTX 1080 Ti.
Alternatively, you can always run these models in 16-bit on most frameworks just fine. Most models are defined with 32-bit memory, so this requires a bit of extra work to convert the existing models to 16-bit (usually a few lines of code), but most models should run. I think you will be able to run more than 99% of the state-of-the-art models in deep learning, and about 90% of the state-of-the-art models in computer vision. So definitely go for a GTX 1080 Ti if you can wait for so long.
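To give a feel for the “few lines of code” such a conversion usually amounts to, here is a minimal sketch in PyTorch; it assumes torchvision is installed and a CUDA device is available, and casting a whole model like this is a rough illustration rather than a production recipe:

import torch
import torchvision.models as models

model = models.resnet18().cuda().half()         # store the weights in 16-bit to halve memory use
x = torch.randn(8, 3, 224, 224).cuda().half()   # inputs must match the parameter dtype
out = model(x)                                  # forward pass runs with FP16 storage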
You are welcome. I am always happy if my blog posts are useful!
1.) I would definitely go with the GTX 1080 Ti due to price/performance. The extra memory on the Titan X is only useful in very few cases. However, beware that it might take some time between announcement, release, and when the GTX 1080 Ti is finally delivered to your doorstep, so make sure you have that spare time. Also make sure you preorder it; when new GPUs are released their supply is usually sold out within a week or less. You do not want to wait until the next batch is produced.
2.) Half precision is implemented on the software layer, but not on the hardware layer for these cards. This means you can use 16-bit computation, but software libraries will instead upcast it to 24-bit to do the computation (which is equivalent to 32-bit computational speed). This means that you can benefit from the reduced memory size, but not yet from the increased computation speed of 16-bit computation. You only see this on the P100, which nobody can afford, and probably you will only see it for consumer cards in Volta series cards which will be released next year.
Now that the GTX 1080 Ti is out, would you recommend that over the Titan X?
Considering the incoming refresh of the GeForce 10 lineup, should I purchase an entry-level 1060 6GB now, or will there be something more interesting in the near future?
The GTX 1060 will remain a solid choice. I would pick the GTX 1060 if I were you.
Thanks for the brilliant summary! I wish I had read this before purchasing the 1080; I would have bought a 1070 instead, as it seems a better value option for the kind of NLP tasks I have at hand.
Great blog my friend 🙂
Hi Tim,
There are a number of brands for the GTX 1080 Ti such as Asus, MSI, PALIT, ZOTAC etc. May I know does the brand matter ? I am planning to get a GTX 1080 Ti for my deep learning research, but not sure which brand to get.
Thank you.
In terms of deep learning performance the GPUs themselves are more or less the same (overclocking etc. does not really do anything); however, the cards sometimes come with different coolers (most often it is the reference cooler though) and some brands have better coolers than others. The choice of brand should be made first and foremost on the cooler, and if they are all the same the choice should be made on the price. So the GPUs are the same: focus on the cooler first, price second.
Hey Tim,
I’m in a confused state of mind right now. I currently have a GTX 960 4GB, which I am selling. After that I’ll have enough cash to buy either a GTX 980 4GB or a GTX 1060 6GB. I plan to get serious with DL. I know most are saying the 1060, but it doesn’t have SLI! What if I want to upgrade in 5-6 months (just in case I suddenly get extremely serious)? Please help me.
Thanks
SLI is used for gaming only; you do not need it for parallelization (for CUDA computing the direct connection via PCIe is used). So if you want to get two GTX 1060 cards you can still parallelize them, no SLI required.
I was waiting for this very update, i.e. your recommendation after GTX1080Ti launch. As always, a very well rounded analysis.
I am building a two-GPU system for the sole purpose of deep learning research and have put together the resources for two 1080 Tis (https://de.pcpartpicker.com/list/GMyvhq). First I want to thank you for your earlier posts, because I used your advice for selecting every single component in this setup.
Secondly, there is one aspect you haven’t touched and I was wondering if you had any pointers on it. It’s about cooling and its effect on higher FLOPS. Having settled on a dual 1080 Ti system, I now have to select among stock cooling from the FE, elaborate air cooling from AIBs, or custom liquid cooling. From what I understand, FLOPS are directly proportional to GPU frequency, so cooling the system to run the GPUs at a higher clock rate should theoretically give a linear increase in performance.
1) In your experience, is the linear increase in performance seen in practice too? On a related note, since you emphasized memory bandwidth so much, is it the case that all/most DL setups are memory bound? Because in that case increasing compute performance won’t be of any use.
2) In earlier posts you recommended going with the RAM with the slowest frequency as it will not be a bottleneck. From your argument in that post, I understand that 2133MHz RAM would be good enough.
Thank you again for your time and these posts.
1) Bad cooling can reduce performance significantly. However, 1. case cooling does hardly anything for the GPUs, 2. the GPUs will be sufficiently cooled if you use air cooling and crank up the fans. I never tried water cooling, but this should increase performance compared to air cooling under high loads when the GPUs overheat despite maxed-out fans. This should only occur if you run them for many hours in an unventilated room. If this is the case, then water cooling may make sense. I have no hard numbers on the performance gain, but in terms of hardware, cooling is by far the biggest gain of performance (I think you can expect 0-15% performance gains). So if you are willing to put in the extra work and money for water cooling, and you will run your GPUs a lot, then it might be a good fit for you.
2) 2133MHz will be fine. Theoretically, the performance loss should be almost unnoticeable and probably in the 0-0.5% range compared to the highest-end RAM. I personally run 1600MHz RAM, and compared to other systems that I have run on, I could not detect any degradation of performance.
Thank you for prompt reply. I think I will stick to air cooling for now and keep water cooling for a later upgrade.
Hi Tim, thanks for the informative post. I am currently looking at the 1080 Ti. Right now I am running an LSTM(24) with 24 time steps. I generally use Theano and TensorFlow. What kind of speed increase would you expect from buying one 1080 Ti as opposed to two 1080 Ti cards? Just trying to figure out if it is worth it. Should I buy an SLI bridge as well; does that factor in? Thanks, really enjoyed reading your blog.
LSTM scale quite well in terms of parallelism. The longer your timesteps the better the scaling. Theano and TensorFlow have in general quite poor parallelism support, but if you make it work you could expect a speedup of about 1.5x to 1.7x with two GTX 1080 Ti compared to one; if you use frameworks which have better parallelism capabilities like PyTorch, you can expect 1.6x to 1.8x; if you use CNTK then you can expect speedups of about 1.9x to 1.95x.
These numbers might be lower for 24 timesteps. I have no hard numbers of when good scaling begins in terms of parallelism, but it is already difficult to utilize a big GPU fully with 24 timesteps. If you have tasks with 100-200 timesteps I think the above numbers are quite correct.
Now that the GTX 1080TI is based on Pascal, what would be the difference in using that card verses the Titan X Pascal for DNN training, whether it be for vision, speech or most other complex networks?
The performance is pretty much equal, the only difference is that the GTX 1080 Ti has only 11GB which means some networks might not be trainable on it compared to a Titan X Pascal.
After the release of the 1080 Ti, you seem to have dropped your recommendation of the 1080. You only recommend the 1080 Ti or 1070, but why not the 1080, what is wrong with it? It seems it has significantly better performance than the 1070, so why not recommend the 1080 as a budget but performant GPU? Is it really a waste of money to buy it, and if so, why?
Thanks
Thank you, that is a valid point. I think the GTX 1080 does not have a good performance/cost ratio and has no real niche to fill. The GTX 1070 offers good performance, is cheap, and provides a good amount of memory for its price; the GTX 1080 provides a bit more performance, but not more memory, and is quite a step up in price; the GTX 1080 Ti, on the other hand, offers even better performance, 11GB of memory which is suitable for a card of that price and performance (enabling most state-of-the-art models), and all that at a better price than the GTX Titan X Pascal.
If you need the performance, you often also need the memory. If you do not need the memory, this often means you are not at the edge of model performance, and thus you can wait a bit longer for your models to train, as these models often do not need to train for that long anyway. Along the way, you also save a good chunk of money. I just do not see a very solid use-case for the GTX 1080 other than “I do not want to run state-of-the-art models, but I have some extra money, though not enough for a GTX 1080 Ti, and I want to run my models faster”. This is a valid use-case and I would recommend the GTX 1080 for such a situation. But note that this situation is rare. For most people either the GTX 1070 is already expensive, or the GTX 1080 Ti is cheap, and there are few people in-between.
Also note that you can get almost two GTX 1070 rather than one GTX 1080. I would recommend two GTX 1070 over one GTX 1080 any day.
Thanks for your detailed reply, but the GTX 1080 price dropped rapidly after the release of the 1080 Ti, and its price gap with the 1070 has narrowed significantly. To tell the truth, I purchased an MSI Armor 1080 at a cheaper price than an MSI Gaming X 1070, thanks to a weekend discount 🙂 but after your deliberate disregard of the 1080 I had started doubting my choice, which you have now cleared up 🙂
thanks again,
greetings from Turkey .
I did not know that the price dropped so sharply. I guess this means that the GTX 1080 might be a not so bad choice after all. Thanks for the info!
Tim Dettmers’s response is very logical as it is based on MSRP (Manufacturer Suggested Retail Price), which shows little benefit in paying 30-40% more for a current GTX 1080 instead of a GTX 1070, as both have 8GB of memory, which can often be a key factor in optimizing your ML/DL sessions. The memory size (4GB for a 1050 Ti, 6GB for a 1060, and 8GB for both the 1070 and 1080) is probably the most important factor, before bandwidth performance (check NVIDIA for a proper comparison).
Now the truth is that many retailers are indeed heavily discounting the current GTX 1080 cards in stock (anticipating the “new” 1080 version with better memory/bandwidth performance), so the price spread between the 1070 and 1080 is nowhere near the 30-40% official MSRP.
I’ve seen price differences in Europe between the current 1070 and 1080 limited to 10-15% max due to heavy discounts.
So if you can get a current 1080 for, say, USD 450 vs a 1070 for USD 400: for sure get the 1080 if you can afford it.
Hope this helps.
That makes very much sense. Thanks for your comment!
Thank you for sharing. This thread is very helpful.
The 1080 Ti is out of stock in the NVIDIA store now. Do you know when it will be in stock again?
This happened with some other cards too when they were freshly released. For some other cards, the waiting time was about 1-2 months I believe. I do not know if this is indicative of the GTX 1080 Ti, but since no further information is available, this is probably what one can expect.
Hey Amir,
The online offerings for the GTX 1080 Ti are getting wild now as the first batch, the Founder Edition (FE), from usual suspects (Asus, EVGA, Gigabyte, MSI, Zotac) came into retail on March 10.
Now the second batch, custom versions with dedicated cooling and sometimes overclocking from the same usual suspects, are coming into retail at a similar price range.
As a result, not only will you see plenty of inventory available in both FE and custom versions, but you will also see some nice “refurbished” bargains as early adopters of the FE send back their cards to upgrade to a custom version, (ab)using the right to return online purchases within 7-15 days.
For example, I just snatched a refurbished Asus GTX 1080 Ti FE from our local “NewEgg” in Sweden for SEK 7690 instead of SEK 8490 local official price.
A nice 10% discount, easy to grab; the card still had its plastic films on, so the previous owner was obviously planning his (ab)use of consumer rights.
Hope this helps.
Dear Eric,
Thank you. Currently, I can preorder custom versions from EVGA (http://www.evga.com/products/productlist.aspx?type=0&family=GeForce+10+Series+Family&chipset=GTX+1080+Ti):
Do you suggest these custom versions (for example: http://www.evga.com/products/product.aspx?pn=11G-P4-6393-KR) for deep learning research, or do you prefer the Founders Edition?
I am worried that overclocking might not be appropriate for deep learning research when you run a program for a long time.
Hi Tim,
If I have a system with one 1080 Ti GPU, will I get 2x performance if I add another one?
The CPU is a Core i7-3770K, its maximum possible RAM is only 32GB, and if I add another GPU the PCIe 3.0 lanes will drop from 16 to 8 for each.
I will use them for image recognition, and I am planning to only run other attempts with different configurations on the 2nd GPU while waiting for training on the 1st GPU. I am kind of new to DL and afraid that it is not so easy to run one network on 2 GPUs,
So probably training one network in one GPU, and training another in the 2nd will be my easiest way to use them.
My concern is about the RAM, will it be enough for 2 GPUs?
Or is the CPU fast enough to deal with 2 ConvNets on 2 GPUs?
Will my system be the bottleneck here in a two-GPU configuration, making it not worth the money to buy another 1080 Ti?
And what if I buy a lower performance GPU with the 1080ti, like the GTX 1080?
Any problem with that?
Thank you for this unique blog. There is a lot of software advice out there for DL, but on the hardware side I can barely find anything like yours.
Many Thanks
The performance depends on the software. For most libraries you can expect a speedup of about 1.6x, but along with that comes additional multi-GPU code that you need to write; with some practice this should quickly become second nature. Do not be afraid of multi-GPU code.
32 GB is more than fine for the GPUs. I am working quite comfortable at 24GB with two GPUs. 8 lanes per GPU can be a problem when you parallelize GPUs and you can expect a performance drop of roughly 10%. If the CPU and RAM is cheap then this is a good trade-off.
The CPU will be alright, you will not see any performance drop due to the CPU.
If you are just getting started I would recommend two GTX 1070s instead of the expensive, big GTX 1080 Ti. If you can find a cheap GTX 1080 this might also be worth it, but a GTX 1070 should be more than enough if you are just starting out in deep learning.
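For a sense of how little extra code data parallelism can require in practice, here is a minimal PyTorch sketch; the toy model and batch are hypothetical and it assumes two or more visible GPUs:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
net = nn.DataParallel(net).cuda()    # replicate the model and split each batch across visible GPUs

x = torch.randn(128, 512).cuda()     # hypothetical input batch
out = net(x)                         # each GPU processes a slice; outputs are gathered on GPU 0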
I am looking to get into deep learning more after taking the Udacity Machine Learning Nanodegree. This will likely not be a professional pursuit, at least for a while, but I am very interested in integrating multiple image analysis neural networks, speech/text processing and generation, and possibly using physical simulations for training. Would you recommend 2 GPUs, one to run the deep learning nets and one to run the simulation; is that even possible with things like OpenAI’s Universe? Also, do you see much reason to buy aftermarket overclocked or custom cooler designs with regard to their performance for deep learning? I greatly appreciate this blog and any insight you might have as I look to update my old rig for new pursuits.
What kind of physical simulations are you planning to run? If you want to run fluid or mechanical models then normal GPUs could be a bit problematic due to their bad double precision performance. If you need double precision for your simulation I would go with an old GTX Titan (Kepler, 6GB) from eBay for the double precision, and a current GPU for deep learning. If you run simulations that do not require double precision then a current GPU (or two if you prefer) are best.
Overclocked GPUs do not improve performance in deep learning. Custom cooler designs can improve the performance quite a bit and this is often a good investment. However, you should check benchmarks if the custom design is actually better than the standard fan and cooler combo.
The simulations, at least at first, would be focused on robot or human modeling to allow a neural network more efficient and cost effective practice before moving to an actual system, but I can broach that topic more deeply when I get a little more experience under my belt.
My most immediate interest is whether I should look at investing in a single aftermarket 1080 Ti (with the option to add another later on) or something closer to 2x 1070s when working with video/language processing (perception, segmentation, localization, text extraction, geometric modeling, language processing and generative response)?
Also, looking into the NVidia drive PX system, they mention 3 different networks running to accomplish various tasks for perception, can separate networks be run on a single GPU with the proper architecture?
Yes you can train and run multiple models at the same time on one GPU, but this might be slower if the networks are big (you do not lose performance if the networks are small) and remember that memory is limited. I think I would go with a GTX 1070 first and explore your tasks from there. You can always get a GTX 1080 Ti or another GTX 1070 later. If your simulations require double precision then you could still put your money into a regular GTX Titan. I think this is the more flexible and smarter choice. The things you are talking about are conceptually difficult, so I think you will be bound by programming work and thinking about the problems rather than by computation — at least at first. So this would be another reason to start with little steps, that is with one GTX 1070.
This is a very useful post. Is there an assumption in the above tests that the OS is Linux, i.e. that the deep learning package runs on Linux, or does it not matter?
I am considering a new machine, which means a sizeable investment.
It is easy to buy a Windows gaming machine with the GPU installed “off the shelf”, whereas there are few vendors for Linux desktops. And there is the side benefit of using the machine for gaming too.
GPU performance is OS independent since the OS barely interacts with the GPU. A gaming machine with preinstalled Windows is fine, but probably you want to install Linux alongside Windows so that you can work more easily with deep learning software. If you have just one disk this can be a bit of a hassle due to bootloader problems, and for that reason I would recommend getting two separate disks and installing an OS on each.
Great followup post to your 2015 article.
One thing I would add is that the cooling system of various cards makes a difference if you’re going to stack them together in adjacent PCI slots. I have two GTX 1070s from EVGA that have the twin mounted fans that vent back into the box (the “SC” model). If I had it to do over again, I would get either the ‘blower’ model, which vents out the back, or the water-cooled version.
For a moment I had 3 cards (two 1070s and one 980 Ti), and I found that the waste heat of one card pretty much fed into the intake of the cooling fans of the adjacent cards, leading to thermal overload problems. No reasonable amount of case fan cooling made a difference.
In my current setup with just the two 1070s, they’re spaced with one empty PCI slot between them so it doesn’t make much difference, but I suspect with four cards the “SC” models would have been extremely problematic.
Thanks again for both this post as well as your earlier 2015 post.
From my experience the ventilation within a case has very little effect on performance. I had a case specially designed for airflow and I once tested deactivating the four in-case fans which are supposed to pump out the warm air. The difference was equivalent to turning up the fan speeds of the GPU by 5%. So it may make a difference if your cards are over 80 °C and your fan speeds are at 100%, but otherwise it will not improve performance. In other words, the exhaust design of a fan is not that important; the important bit is how well it removes heat from the heatsink on the GPU (rather than removing hot air from the case). If you compare fan designs, try to find benchmarks which actually test this metric.
If there are cooling issues though, then the water cooling definitely makes a difference. However, for that to make a difference you need to have cooling problems in the first place and it involves a lot more effort and to some degree maintenance. With four cards cooling problems are more likely to occur.
Hi Tim,
Thank you for the great article! I am in the process of building a deep learning / data science / Kaggle box in the 2-2.5k range.
I was going for the GTX 1080 Ti, but your argument that two GPUs are better than one for learning purposes caught my eye.
I am planning on using the system mostly for nlp tasks (rnns, lstms etc) and I liked the idea of having two experiments with different hyper parameters running at the same time. So the idea would be to use the two gpus for separate model trainings and not for distributing the load. At least that’s the initial plan.
Also, since we are talking about text corpora, I guess the 6GB of VRAM would work.
On the other hand I’ve read that RNNs don’t work well with multiple GPUs, so I might experience problems using both of them at the same time.
Taking all that into account, would you eventually suggest two GTX 1070s, two GTX 1080s, or a single 1080 Ti? I am putting the 1080 Ti into the equation since there might be more to gain by having one.
Hi Paris,
I think two GTX 1070s, or maybe even a single GTX 1070 for a start, might be a good match for you. There might be some competitions on Kaggle that require a larger memory, but this should only be important to you if you are crazy about getting into the top 10 in a competition (rather than gaining experience and improving your skills); you will have to make the choice that is right for you here. Since competitions usually take a while, it might also be suitable to get a GTX 1070 and, if your memory holds you back on a competition, to get a GTX 1080 Ti before the competition ends (another option would be to rent a cloud based GPU for a few days). In terms of data science you will be pretty good with a GTX 1070. Most data science problems are difficult to tackle with deep learning, so often the models and the data are the problem and not necessarily the memory size. For general deep learning practice a GTX 1070 works well; especially for NLP you should have no memory problems in about 90% of the cases, and in the remaining cases you can just use a “smarter” model.
Hope this helps.
Thank you very much Tim for taking the time to reply back!
Thanks for keeping this article updated over such a long time! I hope you will continue to do so! It was really helpful for me in deciding for a GPU!
I will definitely keep it up to date for the foreseeable future. Glad that you found it useful 🙂
Hey Tim,
I’m about to buy my parts, but I’m facing one big issue. I’m getting a GTX 1060, but I can’t find the budget for a motherboard with 2 PCIe x16 slots. This means I can’t add a second 1060 via SLI in the future. Should I keep saving up, or is it better to just sell my old 1060 and buy a 1080 Ti/higher-end GPU when the time comes?
Thanks
That is a difficult problem. It is difficult to say what your needs will be in the future, but if you do not use parallelization with those two GPUs it is very similar to a single GTX 1080 Ti — so in that case buying the GTX 1060 with one PCIe slot will be good. If you really want to parallelize, maybe even two GTX 1080 Ti, it might be better to wait and save up for a motherboard with 2 PCIe slots. Alternatively, you could try to get a cheaper, used 2 PCIe slot motherboard from eBay.
I bought a GTX 1080 Ti for deep learning research. I want to install my new card in my old desktop which has an ASUS P6T motherboard (https://www.asus.com/us/Motherboards/P6T/specifications/). According to the specifications, this motherboard contains 3 x PCIe 2.0 x16 slots (at x16/x16/x4 mode), but the GTX 1080 Ti is a PCIe 3.0 card.
I just want to know if it is possible to install the 1080 Ti on my motherboard (ASUS P6T). If yes (it seems PCIe 3.0 is backwards compatible with PCIe 2.0), will I face a bottleneck from using PCIe 2.0 instead of PCIe 3.0?
Do you suggest upgrading the motherboard or using the old one?
If you use a single GTX 1080 Ti, the penalty in performance will be small (probably 0-5%). If you want to use two GPUs with parallelism you might face larger performance penalties between 15-25%. So if you just use one GPU you should be quite fine, no new motherboard needed. If you use two GPUs then it might make sense to consider a motherboard upgrade.
Thank you for your valuable comments. I do appreciate your help.
Dear Tim,
Would you please consider the following link?
http://stackoverflow.com/questions/43479372/why-tensorflow-utilize-less-than-20-of-geforce-1080-ti-11gb
Is it possible that using PCIe v2 leads to this issue (low GPU utilization)?
It is likely that your model is too small to utilize the GPU fully. What are the numbers if you try a bigger model?
You’re right. Running a bigger model leads to better utilization. Thank you.
Just beware, if you are on Ubuntu, that several owners of the GTX 1080 Ti are struggling here and there to get it detected by Ubuntu, some failing totally.
Those familiar with the history of Nvidia and Ubuntu drivers will not be surprised but nevertheless, be prepared for some headaches.
In my case, I had to keep an old 750 Ti as GPU #1 in my rig to get Ubuntu 16.04 to start (GTX 1080 Ti as GPU #0 would not start).
You mean “an old 750 Ti as GPU #1” -> “an old 750 Ti as GPU #0”?
Hi Tim Dettmers,
I am working on 21GB of input data which consists of video frames. I need to apply deep learning to perform a classification task. I will be using CNNs, LSTMs, and transfer learning. Among the Tesla K80, K40 and GeForce 780, which one do you recommend? Are there any other GPUs which you recommend? Going through your well written article, I could also think of the Titan X or GTX 1080 Ti. Do I need to use multiple GPUs or a single GPU?
Hi Bhanu,
The Tesla K80 should give you the most power for this task and these models. The GTX 780 might limit you in terms of memory, so probably the K40 and K80 are better for this job. The GTX 780 might be good for prototyping models. In terms of performance, there are no huge differences between these cards.
For your task, if you work in research, I would recommend a GTX Titan X or a GTX Titan Xp depending on how much money you have. If you work in industry, I would recommend a GTX 1080 Ti, as it is more cost efficient, and the 1GB difference is not such a huge deal in industry (you can always use a slightly smaller model and still get really good results; in academia this can break your neck).
I am building a computer right now with a 2,000 budget and I am going with the Asus GTX 1080. Should I go with something a little less powerful or should I go with this? I really care about graphics (games I want to get are XCOM 2, PlayerUnknown’s Battlegrounds, Civ 6, the new Mount & Blade, and games like that).
I do not know much about graphics, but it might be a good choice for you over the GTX 1070 if you want to maximize your graphics now rather than save some money to use later for an upgrade to another GPU. If you want to save some money go with a GTX 1070. I guess both could be good choices for you.
Hey Tim,
I already have a GTX 960 4GB graphics card. I’m faced with two options: buy a used 960 (same model) so I can have two 960s, or sell my 960 and buy a used 1060 6GB. Which one will be better? I’ve heard the 1060 will be better for gaming, but how will it affect DL?
Thanks
P.S – Please note that the price for both paths will be similar (with the 960 path being more expensive by around 25 dollars)
Helpful info. Fortunately I found your web site by chance, and I’m surprised this did not happen earlier! I bookmarked it.
Hey Tim,
Thanks for great post. Wondering if you will include 2017 version Titan XP in your comparisons soon too.
I’m planning to build my own external GPU box mainly for Kaggle NLP competitions.
Yesterday Nvidia introduced the new 2017 Titan Xp model.
I'm planning to buy an Nvidia GPU and use it as an external GPU for NLP deep learning tasks.
1) Should I go with one 2017 Titan Xp, or will 2x GeForce GTX 1080 Ti still be better?
2) What about using the GPU externally with my existing Mac or Windows laptop, connecting via Thunderbolt?
3) What are your thoughts about TPU which Google introduced recently ?
External GPU box :
MAC: AKiTiO 2 ,
Windows: Razer Core, Alienware Graphics Amplifier, or MSI Shadow
Bizon box: https://bizon-tech.com/
Blog Post: http://www.techrepublic.com/article/how-to-build-an-external-gpu-for-4k-video-editing-vr-and-gaming/
Thanks.
I might update my blog post this evening.
1) In your case I would not recommend the Titan Xp; two GTX 1080 Ti are definitely better.
2) That should work just fine but in some cases you might see a performance drop of 15-25%. In most cases you should only see a performance drop of 5-10% though.
3) The TPU is only for inference, that is, you cannot use it for training. It is actually quite similar to the NVIDIA GPUs which exist for the same purpose. Both the NVIDIA inference GPUs and the Google TPU are generally not really interesting for researchers and normal users, but they are for (large) startups and companies.
Hey Tim, it's a great article! I am new to ML. Currently I have a Mac mini. I found a few alternatives for adding an external graphics card through the Thunderbolt port. Can I run ML and deep learning algorithms on this?
I have never seen reviews on this, but theoretically it should work just fine. You will see a performance penalty though, which depending on the use case is anywhere between 5-25%. However, in terms of cost this might be a very efficient solution, since you do not have to buy a whole new computer if you use external GPUs via Thunderbolt.
You've talked a bit about it in various comments, but it would be great to get your thoughts on the real-world penalties of running PCIe 3.x cards in PCIe 2.x systems. I'm guessing that for single-GPU setups the reduced bandwidth would have minimal impact, but what about multi-GPU configurations?
Another thing I’d be curious to hear your thoughts on is the performance penalty of locating GPUs in x8 PCIe slots.
I do not think you can put GPUs in x8 slots since they need the whole x16 connector to operate. If you mean putting them in x16 slots but running them with 8x PCIe lanes, this will be okay for a single GPU, and for 3 or 4 GPUs this is the default speed. Only with 2 GPUs could you have 16x lanes, but the penalty of parallelism on 8x-lane GPUs is not too bad if you only have two GPUs. So in general 8x lanes per GPU are fine.
The impact will be quite large if you have multiple GPUs. It is difficult to say how big it will be because it varies greatly between models (CNN, RNN, a mix of both, the data formats, the input size) and can also differ a lot between architectures (ResNet vs VGG vs AlexNet). What I can say is that if you use multiple GPUs with parallelism, then an upgrade from PCIe 2.0 to PCIe 3.0 and an upgrade in PCIe lanes (32 lanes for 2 GPUs, 24 lanes for 3 GPUs, or 36 lanes for 4 GPUs) will be the most cost-efficient way to increase the performance of your system. Slower cards with these features will often outperform more expensive cards on a PCIe 2.0 system or on systems without enough lanes for 16x (2 GPUs) or 8x speed (3-4 GPUs). A rough calculation of the gradient-transfer cost is sketched below.
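As a rough illustration of why the interconnect matters for parallelism, here is a back-of-the-envelope sketch in plain Python; the parameter count and bandwidth figures are assumptions for illustration, not measurements.

```python
# Estimate how long it takes to push one set of FP32 gradients over PCIe
# during data parallelism. All numbers below are rough, illustrative values.
params = 60e6                 # assume a large CNN with ~60M parameters
bytes_per_param = 4           # FP32 gradients
gradient_bytes = params * bytes_per_param

bandwidths = {
    "PCIe 2.0 x8":  4e9,      # ~4 GB/s effective (rough figure)
    "PCIe 3.0 x16": 16e9,     # ~16 GB/s effective (rough figure)
}

for name, bw in bandwidths.items():
    # Transfer time per gradient exchange; with many exchanges per second,
    # this overhead is what eats into multi-GPU scaling.
    print(name, round(gradient_bytes / bw * 1000, 2), "ms per gradient exchange")
```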
Hi, I have a GTX 650 Ti 1GB GDDR5. How is it for starters?
It will be slow and many networks cannot be run on this GPU because its memory is too small. However, you will be able to run cuDNN which is a big plus and you should be able to run examples on MNIST and other small datasets without a problem and faster than on the CPU. So you can definitely use it to get your feet wet in deep learning!
Hi,
Thank you very much for providing useful information! I'm using AWS P2 (the cheapest one) but planning to switch to another GPU environment, for example a Dell desktop or laptop. What is the difference between a laptop GPU and a desktop GPU for training deep learning networks? For example, a GTX 1060 6GB in a laptop versus a desktop. GPU memory bandwidth?
Cheers,
Hashi
For newer GPUs, that is the 10 series, there is no longer any real difference between laptop and desktop GPUs (the desktop GTX 1060 is very similar to the laptop GTX 1060). For earlier generations, the laptop version mostly has lower bandwidth; sometimes the memory is smaller as well. Usually laptop GPUs consume less energy than desktop GPUs.
Thanks for the post. (There is a small typo: “my a small margin”)
Fixed — thanks for pointing out the typo!
Thanks for the article. I have a MacBook Pro and, considering the recent release of Mac drivers for the Pascal architecture, I am considering getting an external GPU rig that would run over Thunderbolt 3. Any concerns with this? Do you know how much of a penalty I would pay for having the GPU external to the machine? On the surface, PCIe and Thunderbolt 3 seem pretty similar in bandwidth.
Previously for deep learning research I have been using Amazon instances.
Thunderbolt 3 is 5 GB/s compared to PCIe 3.0 x16, which is 16 GB/s. If you use three or more GPUs the PCIe bandwidth per GPU will shrink to 8 GB/s. If you use just one GPU the penalty should be in the range of 0-15% depending on the task. Multiple GPUs should also be fine if you use them separately. However, do not try to parallelize across multiple GPUs via Thunderbolt, as this will hamper performance significantly.
Fantastic article!
I'm interested in starting a little Beowulf cluster with some VIA mini-ITX boards and I was wondering how I could add GPU compute to that on a basic level. They only have PCIe x4, but I could use a riser. I was thinking of the Zotac GT 710 PCIe x1 card, one on each board.
Unfortunately, the GT 710 would be quite slow, probably on a par with your CPU. I am not sure how well GPUs are supported if you just connect them via a riser. If risers work for PCIe x16 cards then this would be an option. If you just want a cheap GPU and the x16 riser works, then you can go with a GTX 1050 Ti.
Here is the board I am looking at. I’m planning on having like 40 of these rackmounted all in a cluster, and each with a gpu in it. I’m not looking for hyper power, just something fun to mess with. The riser idea sounds good! I’m just looking for budget stuff here, and I figured many low power devices is as good as one high power device.
http://www.viatech.com/en/boards/mini-itx/epia-m920/
Forgot the link
I have no idea if that will work or not. The best thing would be to try it for one card and once you get it running roll it out to all other motherboards.
Sounds good! Still in the planning phase, so I may revise it quite a bit. I really appreciate your help!
Hello Tim,
Unless I missed the information in the post, you should mention that the newest Nvidia cards limit the number of cards in a PC; the GTX 1080 Ti seems to be limited to two cards. This could be a major issue if we want to run several experiments in parallel or use multi-GPU with CNTK (and PyTorch). Knowing that, I am not sure about building a rig with GTX 1080 Ti cards if I want to scale the system up in the future.
Cheers,
mph
Hello Marc-Philippe,
I think you are confusing NVIDIA SLI HB limitations with PCIe slot limitations. You will only be able to run 2 GTX 1080 Ti if you use SLI HB, but for compute you are able to use up to 4 GPUs per CPU. SLI and SLI HB are not used for compute, only for gaming. Thus there should be no limitation on the number of GTX 1080 Ti cards you can run, besides the CPU and PCIe slot limitations.
I live in Korea and electricity is very expensive here, so I prioritize power efficiency over performance. But I couldn't decide from gamer reviews showing performance in frames per second. So what is the most power-efficient GPU? (It doesn't matter if it's from AMD.)
Bigger GPUs are usually a bit more power efficient if you can fully utilize them. Of course this depends on the kind of task you are working on. If you look at performance per watt, then all cards of the 10 series are about the same, so it really depends on how large your workloads are; optimize for that. That is, get a 10 series card of a size which fits your models. You should prefer 10 series cards over 900 series cards since they are a bit more energy efficient for the performance they offer.
It would be interesting to see what happens when using an eGPU housing over Thunderbolt.
Indeed, many people have asked about this and I would also be curious about the performance. However, I do not have the hardware to run such tests. If someone has performance results, it would be great if they could post them here.
Hi Tim,
May I ask you a question? I am in a low-budget situation and I am weighing the GTX 970M against the GTX 1050 Ti (mobile version). Could you give me advice on which one I should get?
The cards are very similar. I think the GTX 1050 Ti (notebook) might be slightly better in performance, but in the end I would make the decision based on the cost, since the GPUs are quite similar.
Thank you for the good article.
I have a machine with 2 Titan X Pascals and 64 GB RAM.
Do you recommend running two separate models simultaneously, or parallelizing a model across the two GPUs?
regards
Mahesh
With 2 GPUs parallelism is still good without any major drawback in scaling. So you could run them in parallel or not; that depends on your application and personal preference. If your models run for many days, I would go for parallelism; otherwise just use them simultaneously for separate models.
Hi,
First, thanks for this great article.
Do you think I could install a GTX 1060 (6GB) on the following configuration :
Processor : Intel Pentium G4400
Integrated GPU : Intel HD Graphics 510
RAM: 4GB DIMM DDR4
Motherboard : Asus H110M-K
Nvidia is telling me that the GTX 1060 requires at least a Core i3 to run, but I'm seeing on CPU benchmarks that the G4400 is not that bad compared to some versions of the Core i3, so I'm lost…
Thanks a lot
Phil
The CPU should be fine. I think NVIDIA is referring to gaming performance rather than CUDA performance. For gaming this requirement might matter, but for deep learning the CPU should have almost no effect. So it should be fine.
Hi Tim,
This is a very good and interesting article, indeed!
What I'm still confused about is the FP16 vs FP32 performance of the GTX 10 series cards. In particular, I've read that deep learning uses FP16 and that the GTX 10 series is too slow at FP16 (1:64), which means NVIDIA forces users of deep learning tools to buy a significantly more expensive Tesla or Quadro card.
I'm very new to deep learning, and I would expect that an algorithm that requires FP16 precision could also be run in FP32; is this not the case? If a card doesn't support the optimisations required for doubling performance with FP16, I would expect it to simply be limited to its FP32 performance. In that case, I don't get why NVIDIA decided to cap FP16 performance on these cards, i.e. why not let them perform in FP16 similarly to FP32.
Thanks
There are two different ways to use FP16 data:
(1) Store data in FP16 and cast it to FP32 during computation, using the FP32 compute units; in other words: store in FP16 but use FP32 for compute.
(2) Store data in FP16 and use FP16 units for compute
Almost all deep learning frameworks use (1) at the moment, because only one GPU, the P100, has fast FP16 units. All other cards have very, very poor FP16 performance. I do not know why NVIDIA decided to cap FP16 performance on most 10 series cards. It might be a marketing ploy to get people to buy the P100, or it is a hardware constraint: it may just be difficult to put both FP16 and FP32 compute units on a chip and keep it cheap at the same time (thus only FP32 on consumer cards to make them cheap to produce).
But in general, it is as you said, if an algorithm runs in FP16 you can expect to be able to run it in FP32 just fine — so there should be no issues with compatibility or such.
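Here is a minimal NumPy sketch of option (1), storing in FP16 but computing in FP32; the shapes are made up for illustration.

```python
import numpy as np

# Weights are stored in FP16 to halve the memory footprint.
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)
inputs = np.random.randn(128, 4096).astype(np.float32)

# Cast the stored FP16 weights up to FP32 right before the matrix multiply,
# so the arithmetic itself runs in FP32.
activations = inputs @ weights_fp16.astype(np.float32)

print(weights_fp16.nbytes / 1e6, "MB stored")   # half of the FP32 footprint
print(activations.dtype)                        # float32 compute result
```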
Thanks for your response. My understanding was that NVIDIA had implemented FP16 in a smart way that reused FP32 to effectively double performance. I presumed this feature is implemented in firmware and sold at a premium, within the TESLA and Quadro product lines, similarly to the firmware-based (as opposed to hardware-based) implementation of ECC memory.
If FP16 is natively implemented in separate (hardware) circuitry within the GPU, it would indeed make economic sense for NVIDIA to exclude that from the consumer-grade product. Even if this is the case, though, since it’s possible to cast FP16 to FP32 for compute, I can’t imagine why NVIDIA has not implemented FP16 by casting it to FP32, in the firmware. It’s not a question, just an observation that has left me puzzled.
Hi Tim, thanks for that! Very comprehensive article. I now have the GeForce GTX 1050 Ti installed and running. I chose it because the PC is from 2010 (a Dell Precision T1650) and, without changing the power supply, it can only supply 75W from the board. I also read somewhere that other cards won't fit, and noticed as I put it in that it was physically millimetres away from the spot where the hard drives plug in. So I imagine that even if you changed the power supply in this unit for something that could run the other cards, they are probably bigger, would hit into things and would not even push into the slot. I am not sure if the other cards even use the same kind of slot that was around in 2010.
I can see it is about 1/5 as fast as the fastest cards but, as per your cost analysis, the best bang for your buck. It took a whole day of painful trial and error to get CUDA 8.0 and cuDNN 5 properly installed, but it finally works now on Linux Mint; the last niggling issues were extra lines needed in .bashrc.
I just tested some style transfer and it took 9 minutes compared with the hours it took on the CPU; each iteration took about 0.25 seconds rather than 10 seconds or so.
Hi Felix, thanks for your story! It seems that everything worked out well for you; it always makes me happy to hear such stories! Indeed, the size of GPUs and the power requirements can be a problem, and I think the GTX 1050 Ti was the right choice here.
Hi Tim,
Very interesting article, thanks very much for your work!
I expect to participate in a few Kaggle competitions for fun and as a challenge, as well as experimenting on my own.
I need to replace my GTX 750 Ti (too slow) and I am hesitating between the GTX 1060 6GB and the GTX 1070.
The best deals for the 1060 are currently around 250€, but I just found a 1070 at 300€; would you say it's worth it?
Yes, the GTX 1070 seems like a good deal, I would go ahead with that!
Thanks!
I gave it a go as I could return it if needed.
The gap between 750ti and 1070 is so huge…
Let’s have fun now 😀
I am about to buy three GPUs for deep learning and occasional entertainment. These are
one Gigabyte Aorus 1080 Ti card and two EVGA 1080 Ti cards. All three have the same chip, i.e. the 1080 Ti, and the only difference is their cooling solution. My questions are:
a) Will I be able to use all three of them for parallel deep learning?
b) Will I be able to SLI the Aorus and EVGA cards?
c) Is there any other trouble, e.g. related to the BIOS, from mixing cards with the same chip but from multiple vendors?
Thank you very much
a) Yes that will work without any problem
b) I had an SLI setup of an EVGA and an ASUS GTX Titan for a while and I assume this still works, so yes!
c) There should not be any big trouble. Both for games and for deep learning the BIOS does not matter that much; mixing should cause no issues.
I think on the 1080 Ti you can only get 2-way SLI. A standard SLI bridge will do fine for up to 4K resolution, a high-bandwidth bridge is needed for 5K and 8K resolution.
You can use all three of them for parallel deep learning, or any other CUDA-based and OpenCL-based application.
Thank you.
I was looking to do some Kaggle competitions as well as video editing.
I guess I will get the GTX 1060 6GB.. off to Slick deals!
Sounds like a solid choice! I am glad the blog post was helpful!
I'm looking for a used card. Which is better between a 960 4GB GDDR5 and a 1050 Ti 4GB?
I'm asking because the 960 has more CUDA cores. Thanks!
The cards are very similar. The GTX 1050 Ti will be maybe 0-5% faster. I would go for the cheaper one.
Hello Tim,
You are providing great information in this blog with significant value to people into deep learning.
I have an iMac 5K with a Core i5 and 32 GB RAM, and I am thinking of adding one NVIDIA Titan Xp to it via eGPU as a first step into deep learning. Do you think this is a good choice, or should I sell it and go directly for a custom GPU rig?
Also, are there any ready-made options?
Also, can you comment on this setup?
https://pcpartpicker.com/list/yCzT9W
Looks like a pretty high-end setup. Such a setup is in general suitable for data science work, but for deep learning work I would get a cheaper setup and just focus on the GPUs; this is kind of a matter of personal taste, though. If you use your computer heavily, this setup will work well. If you want to upgrade to more GPUs later, you might want to buy a slightly bigger PSU. I think around 1000-1200 watts will keep you future-proof for upgrading to 4 GPUs; 4 GPUs on 850 watts can be problematic.
Thanks a lot Tim, I have changed the setup by adding a 1200 W PSU and a 4-way SLI motherboard with an i7 7700 Kaby Lake processor.
The remaining question is that this motherboard supports only 64 GB RAM; will this cause problems in the future? Do I need 128 GB if I will work on 250 GB+ data sets?
updated list: https://pcpartpicker.com/list/gHTLBP
Best,
Ahmed
Hi Ahmed
Thank you for sharing your setup. I was wondering, did you shortlist any alternative motherboards that support 128GB?
I have come to the same conclusion; I don't like the 64GB limit.
There are some ready-made options, but I would not recommend them as they are too pricey for their performance. I think building your own rig is a good option if you really want to get into deep learning. If you want to do deep learning on the side, an extension via eGPU might be the best option for you.
Hi Tim – many thanks for all the knowledge and time!
Application: multi-GPU setup for deep learning requiring parallelization across GPUs and at least 32GB of GPU RAM.
Say I choose to use 4 GTX 1080 Ti cards and am concerned with the loss due to inter-GPU communication, but also with the heat/cooling and noise.
Based on all your teaching above, I am thinking it would be better to use two smaller computer cases with two GPUs each and connect them with an InfiniBand FDR card, rather than trying to cram all 4 GPUs into a single box.
Also, having 2 smaller boxes gives :
1. more resiliency in terms of point of failure
2. dynamic scalability in terms of bringing up all 4 or just using 2
3. flexibility if I want to replace 2 with more powerful math GPUs for some applications needing higher precision.
Am I on the right path?
Do you have any data on how much bandwidth loss there would be in this setup as opposed to putting all 4 in the same box?
Saw NA255A-XGPU https://www.youtube.com/watch?v=r5y2VbMaDWA but it is very expensive.
Please provide a critical review and advice.
Many thanks
Nick
I would use as many GPUs per node as you can if you want to network them with InfiniBand. Parallelism will be faster the more GPUs you run per box, and if you only want to parallelize 4 GPUs, then 4 GPUs in a single box will be much faster and cheaper than 2 nodes networked with InfiniBand. Also, programming those 4 GPUs will be much, much easier if you use just one computer. However, if you want to scale out to more nodes it might differ. This also depends on your cooling solution. If you have fewer than 32 GPUs I would recommend 4 GPUs per node plus InfiniBand cards. For usual over-the-counter motherboards you can run 3 GPUs plus an InfiniBand card. The details of an ideal cost-effective solution for small clusters depend on the size (it can differ dramatically between 8, 16, 32, 64, 96, and 128 GPUs) and the cooling solution, but in general trying to pack as many GPUs into a node as possible is the most cost-efficient way to go and also ideal in terms of performance. In terms of maintenance costs this would not be ideal for larger clusters with 32+ GPUs, but it will be okay for smaller clusters.
many thanks!
“depends on your cooling solution” – please suggest a solution that would not kick me out of the house from heat, noise and electromagnetic radiation. 🙂
” If you have less than 32 GPUs I would recommend 4 GPUs per node + Infiniband cards.” – what do you mean? I thought InfiniBand is for between nodes. Did you mean “32 GB” instead of “32 GPUs”?
many thanks in advance, Tim!
“programming those 4 GPUs will be much, much easier if you use just one computer” – kindly provide some hints about where this would be visible in code/libraries we’d use.
You'd also want at least 128GB RAM with such a setup, so make sure the mobo can hold it (though any mobo with more than 4 PCIe slots most likely can anyway).
For example, in Lesson 2 of the Fast.ai MOOC, they use some code to concatenate batches into an array that takes 55GB of RAM.
No problem on the original AWS course setup, as basic EC2 can scale to 2TB.
But for newcomers with a personal PC with a GTX 1080 Ti and 32GB RAM, it generates a memory error: this requires investigating RAM issues, rewriting code, etc., instead of focusing on deep learning.
Also consider blower-style ("aero") cooling (like the Founders Edition 1080 Ti), as it expels heat outside the case via the card's back panel, instead of 4 monsters gladly blowing at each other inside 4 walls 😉
“Founder Edition 1080 Ti” with “aero cooling” – check!
“Fast.ai MOOC” – nice reference, check!
“at least 128gb Ram with such a setup so make sure the Mobo can” –
So I need 128 GB RAM on the motherboard to handle the 4 GPUs?
Or do you mean in total with the GPU RAM, with the GPUs in the same machine forming a continuous address space with the CPU/motherboard RAM?
If you meant the second, then I need 128 - 4 x 11 GB GPU RAM = 84 GB on the CPU side.
Kindly clarify.
Semanticbeeng
Another aspect to consider is that parallelising on multiple machines is not as easy as parallelising on the same machine.
I don’t think you want to write the code that does the distribution yourself – you want it to be transparently handled by the library you are using (Tensorflow/Torch etc).
Now, if we are talking about, say, hyperparameter tuning, this is rather easy to distribute: each execution is independent (unless perhaps you're running something that needs to adjust the params dynamically), and you can ship it out to a separate machine easily.
But within the same model, things are not that easy anymore. Some models lend themselves to being distributed.
I believe that current libraries are much better at distributing across multiple GPUs on the same machine, relatively effort-free (a config change), as opposed to across a network cluster (see the sketch after this comment).
I share your concern about the single point of failure (machine catching fire). So perhaps the solution is to have 5 x (4-GPU machine)? 🙂 Just partially joking. Good luck with the heating.
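To make the single-machine case concrete, here is a minimal sketch of the "relatively effort-free" route mentioned above, assuming PyTorch with two or more CUDA devices; the layer sizes and batch size are arbitrary, illustrative values.

```python
import torch
import torch.nn as nn

# A small dummy model; in practice this would be your real network.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across all visible GPUs and gathers
    # the outputs; no custom distribution code is needed.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(256, 1024).cuda()   # dummy batch
out = model(x)                      # forward pass runs on a chunk per GPU
print(out.shape)
```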
Hi!
I am currently starting my thesis on adversarial learning. The department in which I will be working provides me (limited) remote access to a workstation in order to train on large volumes of data. However, I was thinking of getting some cheap machine (my budget is very limited at the moment) in order to try small simulations, since I cannot run anything on my laptop. Furthermore, I am not sure how much I will use my computer in the future (it depends on whether I go for a PhD after all), so I just need a basic machine that can be upgraded in case it is needed.
Here is a configuration I found in an online shop in Germany:
Case: Fractal Design Core 1100
Processor: Intel Core i3-7100, 2x 3.90GHz
Cooler: Intel standard cooler
Motherboard: Gigabyte H110M-DS2, socket 1151
Graphics card: MSI GeForce GTX 1050 Ti Gaming 4G
Memory: 8GB Crucial DDR4-2133
Hard drive: 1TB Western Digital WD Blue SATA III
Optical drive: DVD+/- burner, double layer
Power supply: 300W be quiet! System Power 8 80+
Sound card: onboard HD audio
Price: 635€
Do you think it's a good deal? Would it be enough given the needs I described before? Would you change anything?
The setup will be okay for your thesis. Maybe a GTX 1060 would be more suitable due to the larger memory, but you should be okay to work around that 4GB GPU memory and still write an interesting thesis.
If you would like to pursue a PhD and need more GPUs, the board will limit you here. The whole setup will be more than enough for any new GPU which comes out in the next few years, but your motherboard holds just one GPU. If that is okay for you (you can always access cloud-based GPUs, or the ones provided by your department, for extra computational power), then this should be a good, suitable choice for you.
Thank you so much for your comment. I found a similar machine that is a bit cheaper:
https://www.amazon.de/DEViLO-1243-DDR4-2133-24xDVD-RW-Gigabit-LAN/dp/B003KO3HQM/ref=sr_1_10?s=computers&ie=UTF8&qid=1493910853&sr=1-10&keywords=DEViLO&th=1
Plus it has Windows 10 already installed (saves some time). I compared them many times and I cannot see any major difference.
Thank you once more.
Cheers,
Nacho
Hi Tim,
One last question. Would an AMD FX-6300 CPU, 6x 3.50GHz, do a similar job to an Intel i3-7100? I have seen that the price difference is pretty big and it would maybe allow me to get a better GPU (1060 6GB). Is it worth it?
Thank you once more
Nacho
Actually I saw this model:
https://www.amazon.de/8-Kern-DirectX-Gaming-PC-Computer-8×4-30/dp/B01IPDIF4Q/ref=sr_1_1?s=computers&ie=UTF8&qid=1494068667&sr=1-1&keywords=gtx%2B1060%2B6gb&th=1
It includes a 1060 6GB and 16GB RAM for 699 euros. The only thing is that it has an AMD octa-core FX 8370E CPU instead of an Intel.
What do you think?
Alternatives to AWS servers
1. https://www.hetzner.de/sk/hosting/produkte_rootserver/ex51ssd-gpu
dedicated server (not VM)
1 GeForce® GTX 1080 (only)
64 GB DDR4 RAM
€ 120 / month
2. https://www.ovh.ie/dedicated_servers/
4 x NVIDIA Geforce GTX 1070
64GB DDR4 RAM
€ 659.99 / month
I have heard good things about the Hetzner one; it can be a good option. Thanks for the info.
Hi Tim – many thanks for all the knowledge and time!
I'd like to buy a deep learning server; would you please give a performance comparison between the Tesla P100 and the Titan Xp?
The Tesla P100 is faster, but costs far too much for its performance. I would not recommend the P100.
Hello,
Thanks for the great post. I’m very new to this topic, and wanted to set up a desktop computer for deep learning project (for fun and kaggle). I had 2 questions, and really appreciate it if you could help me with them:
1) I just ordered a GTX 1080 Ti from the NVIDIA website. Then I noticed that there are other versions, like EVGA or MSI… I'm kind of confused about what they are, or whether I should have gotten one of those.
2) For the desktop, I just got a Lenovo ThinkCentre M900, with an i7 6700 processor and 32 GB RAM. Then I started doubting whether I can really put my GTX 1080 Ti inside that.. any thoughts?
thanks a lot
The case seems to be able to hold low-profile x16 GPU cards. It seems that the system would support a GTX 1080 Ti, but the card will probably be too big for the case. I would contact your vendor for more precise information.
Hi, I would add that the PSU (the website says 400W max power as an option) seems undersized for the 1080 Ti.
Also, I experienced an issue with a Dell system where the PSU could not be changed for a more powerful one because of a proprietary motherboard connector. I would not be surprised if you found the same on this kind of system.
As Tim suggested, contact your vendor for compatibility info.
Thanks a lot. I canceled it, and found pcpartpicker website which checks the compatibility between the parts.
thanks again 🙂
Looking forward to when you update this with the new V100!
Particularly curious if the increased bandwidth of nvlink v2 changes your opinion of multi gpu setups.
Hi Tim
Great blog and very informative for deep learning enthusiasts. I am building a rig myself for deep learning. Here are the components I am planning to get: https://pcpartpicker.com/list/WzGZd6
One question I have is regarding the processor. It seems like the Xeon 1620 v4 is pretty good in terms of its 40 lanes. It is outperformed by the 1650 v4, but that is also twice the price ($600). To add 4 GPUs in total, I'll also need a mobo with at least 4 PCIe x16 slots, so something like an MSI X99A Gaming Pro looks reasonable, although I am not sure they'll physically fit. I might just do 3 GPUs then. Any comments?
Looks reasonable, the motherboard supports 4 GPUs, but it seems only 3 GPUs fit in there. So if you want to go with 4 GPUs you should get a different motherboard.
Hi. Lovely explanation! I am a newbie and I have a GPU with 4GB dedicated video memory, 4GB shared memory and around 112GB/s bandwidth (GeForce GTX 960). I need to know the number of convolution layers that I can implement using such hardware. Besides, what would be the maximum size of the input image that I can feed to the network?
This depends on the input size (how large the image is) and can be dramatically altered by pooling, strides, kernel size, and dilation. In short: there is no concrete answer to your question, and even if you specify all the parameters, the only way to check this would be to implement the model and test it directly. So I would suggest you just try it and see how many layers you can get to.
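As a rough illustration of the arithmetic behind "it depends", here is a small plain-Python sketch that estimates the activation memory of a single convolution layer; the numbers are made-up, illustrative values.

```python
def conv_activation_mb(batch, channels, height, width, bytes_per_value=4):
    # Activation memory of one conv layer's output, roughly:
    # batch * channels * height * width * 4 bytes (FP32).
    return batch * channels * height * width * bytes_per_value / 1e6

# Example: 224x224 input, 64 feature maps, batch of 32 (illustrative values)
print(conv_activation_mb(32, 64, 224, 224))   # ~411 MB before pooling
print(conv_activation_mb(32, 64, 112, 112))   # ~103 MB after 2x2 pooling
```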
Hi Tim,
any thoughts on the new Radeon Vega Frontier Edition, from a hardware point of view and hoping that DL libraries will come soon?
The card is impressive in its raw hardware specs. However, I am not aware of any major changes for deep learning that will come along with the card. As such, AMD cards are still not viable at this time. It might change a few months down the road, but currently there is not enough evidence to make a bet on an AMD card. We just do not know if it will be useful later.
Hi Tim! Thank you for great post !
Do you think we will use FP64 for deep learning anytime soon?
And could you give me some examples of using FP64 for deep learning?
Thank you.
FP64 is usually not needed in deep learning since the activations and gradients are already precise enough for training. Note that the gradient updates are never correct with respect to the full dataset if you use stochastic gradient descent, and thus more precision just does not really help. You can do well with FP32 or even FP16, and the trend is further downwards rather than upwards. I do not think we will see FP64 in deep learning anytime soon, because frankly there is no reason to use it.
AWS documentation mentions that the P2 instances can be optimized further, specifically persistence mode, disabling autoboost and setting the clocks to max frequency. Are there any such optimizations that can be done for GTX 1080 Ti cards?
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html#optimize_gpu
The performance difference of doing that compared to autoboost is usually not that relevant (1-5%); it only matters if you really have large-scale, persistent operations (a 1-5% saving in money can be huge for big companies). If you want to improve performance, focus on cooling; that is where the real benefits lie. I do not know if P2 instances have good cooling though.
When I think of the several parameters associated with computational cost I get confused, and I wonder if we can introduce comprehensive, illustrative metrics that can be relied on when selecting a GPU for a specific task, e.g. deep learning. With all due respect, I believe your post is somewhat leading and somewhat confusing! For instance, one might choose a 1060 6GB over a 980 8GB due to the higher memory bandwidth. Is that really the correct decision? And what about other cards with close performance? I mean you might be overlooking some parameters regarding how the program performs on the GPU and how the GPU executes the code, mostly the shading units and RAM size, at least in your comparisons.
For instance, how could we compare a graphics card with a memory bandwidth of 160GB/s, 1500 shading units and 8GB memory against one with 200GB/s, 1250 units and 6GB RAM? Although I have searched everywhere on the net and read several papers, I cannot answer this scientifically, or at least I cannot prove it. One thing I claim is that every percent increase in the number of shading units is more effective at reducing computation time than the same percent increase in memory bandwidth; for example, 10 percent higher bandwidth against 10 percent more shading units. In some cases, if we only think of parallelism, would a card with 80GB/s and 1000 units equal one with 160GB/s and 500 units? I am not sure about that. I would be glad to hear your opinion.
Shading units are usually not used in CUDA computations. Memory size also does not affect performance (at least in most cases). It is difficult to give recommendations for RAM size since every field and direction of work has very different memory requirements. For example, computer vision research on ImageNet, research on GANs, and computer vision in industry all have different requirements for memory size. All I can do is give a reasonable direction and have people make their own choice. In terms of bandwidth you can compare cards within each chip class (Tesla, Volta, etc.), but across classes it is very difficult to give an estimate of 160GB/s vs 125GB/s and so on.
In short, my goal is not to say: “Buy this GPU if you have these and these and these requirements.”, but more like “Look, these things are important considerations if you want to buy a GPU and this makes a fast GPU. Make your choice.”
Is it possible to combine the computational power of 6 machines with 8 GPUs each? Is it possible for an algorithm to be able to use all 48 GPUs?
Thanks
Yes. It is a difficult problem; however, you can tackle it with the right hardware (a network with 40-100 Gbit/s InfiniBand interfaces) and the right algorithms.
I recommend block momentum or 1-bit stochastic gradient descent algorithms; the sketch below illustrates the core idea. Do not try asynchronous gradient methods; they are really bad!
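The sketch is a toy NumPy version of sign-based gradient quantization with error feedback; it is a simplification for illustration, not CNTK's actual implementation.

```python
import numpy as np

def one_bit_quantize(grad, error):
    corrected = grad + error                  # add the residual carried over from the last step
    scale = np.mean(np.abs(corrected))        # one scalar per tensor
    quantized = np.sign(corrected) * scale    # effectively 1 bit per entry plus one float
    new_error = corrected - quantized         # remember what was lost to quantization
    return quantized, new_error

error = np.zeros(8)
for step in range(3):
    grad = np.random.randn(8)                 # stand-in for a real gradient
    quantized, error = one_bit_quantize(grad, error)
    print(step, quantized)                    # this is what would be sent over the network
```

Because the quantization error is fed back into the next step, the information lost in any single exchange is recovered over time, which is why such aggressive compression can still converge well.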
Hi Tim,
Thanks for your sharing and I enjoy reading your posts.
Would you please comment on my 4x 1080 Ti build?
https://pcpartpicker.com/list/nWs7Fd
Note that PCPartPicker is not allowing me to add 4 1080 Ti cards, so I put 4 Titan X cards as dummy GPUs.
My spec
(“ref” refers to Tim’s related post,
“#” means my questions)
*1: main use: kaggle competition trained in Pytorch
*2: GPU x4 1080ti
#can this build support 4 Volta GPUs in the future?
*3:Motherboard: Asus x99-WS-E
<- confirmed Quad PCI 3.0 x16/x16/x16/x16, DDR4 ~128GB
*3: CPU Xeon 2609v4 40lane PCI 3.0
<-compare i7-5930k, 40% cheaper and 2x DDR4 size
2GHz, why?
#shall I choose the 3.7GHz i7-5930k for 2x clock speed?
@ref_cpu_clk shows underclocking i7-3820 to 1/3 causes performance drop of ~8% for MNIST and 4% for Imagenet)
<- CPU max DDR4 speed is 1866 MHz (RAM speed no difference @ref2 2017-03-23 at 18:35)
*4: RAM: 64GB now, maybe 128GB in the future
ref1: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/
ref2: http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/
ref_cpu_clk: http://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?zoom=1.5&resize=500%2C337
If you buy 4x GTX 1080 Ti and want to work on Kaggle competitions, I would not skimp on the CPU and RAM. You will do a lot of other things besides deep learning if you do Kaggle competitions.
Otherwise it is looking good. I think 4x GTX 1080 Ti is a bit overkill for Kaggle. You could also start with 2 GPUs, see how it goes, and add more later if you want.
Hi Tim, thanks for your comment. I chose 128GB RAM and an i7 3.xGHz CPU (5930K or 6850K) based on your [CPU conclusion](http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/): two threads per GPU; > 2GHz.
FYI, the 4 GPUs are for 2 Kaggle participants.
Hey. I am a creative coder and a designer. I am looking forward to experimenting with AI, interaction design and images.
Projects similar to https://aiexperiments.withgoogle.com/autodraw
1) I am a beginner and I am looking for a laptop as it's handy. I am a little low on budget so I need to decide.
The feasible option is
960 4gb http://amzn.to/2r5xujB
But if it doesn't work at all I might consider the following 2 options:
Gtx 1060 3gb — http://amzn.to/2qXCzwc
Gtx 1060-6gb — http://amzn.to/2rZlpMO
I am not able to decide which would be sufficient.
2) Do I need to learn machine learning and train neural networks myself, or can I just apply things from the open-source projects already available? Do you know good resources for this?
You can use pretrained models, which will limit you to certain themes, but then you do not need to learn to train these models yourself. A GTX 960 is also sufficient for this (even a CPU would be okay). This might be enough for some interaction design experimentation. If you want to go beyond that, I would recommend a GTX 1060 6GB, and you will need to learn how to train these models (which might not be that difficult; it is basically changing parameters in open source code).
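As a minimal example of "applying things from open source", the sketch below loads a pretrained image classifier and runs it on a single image, assuming Keras with the ImageNet weights available; the file name example.jpg is a placeholder for any local image.

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

# Pretrained on ImageNet; no training is needed to use it.
model = VGG16(weights="imagenet")

# "example.jpg" is a placeholder; substitute any local image file.
img = image.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # top-3 predicted ImageNet classes
```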
Regarding your question #2, I highly recommend “Practical Deep Learning for Coders” by Jeremy Howard (ex Kaggle Chief Scientist) and Rachel Thomas, in partnership with the Data Institute of the University of San Francisco.
https://www.usfca.edu/data-institute/certificates/deep-learning-part-one
It's a free MOOC with superb resources (videos, class notes, papers, assignments, forums).
Hi Tim,
This is a great article and comments!
So… if you had only $1,500.00 today for your budget, which components would you pick for a complete system, and why?
Thanks in advance.
I would probably pick a cheap motherboard that supports 2 GPUs, DDR3 memory, a cheap CPU, 16GB of RAM, an SSD, a 3TB hard drive, and then probably two GTX 1070s, or better if the budget allows.
@Spicer
Here's a post published today that may help you.
https://blog.slavv.com/the-1700-great-deep-learning-box-assembly-setup-and-benchmarks-148c5ebe6415
Eric
Hi Tim,
Well, I am going all out to build a deep learning NN training platform. I plan to spend the money on the following:
2 Titan (Pascal) cards, with two remaining Gen3 PCIe slots available for expansion.
ASUS ROG Rampage V Edition 10 with 64GB DDR4 3200 RAM, an M.2 SSD, an i7 processor, and two Ethernet ports: one port connected to my internal network for access and the second connected directly to a data file server for access to training data.
Can you tell me the weaknesses of such a rig and where I might be spending money in the wrong places? Where could the money be spent instead to get an even bigger bang for the buck?
Cheers,
Bill
Sounds like a reasonable setup if you want to use a data file server. If you want to get more bang for the buck you can of course use DDR3 RAM and a slower processor, but with this setup you will be quite future-proof for a while, so you can upgrade GPUs as they are released. You are also more flexible if you want to do other work that needs a fast CPU. So everything seems fine.
Which mini card would you recommend for an ITX platform?
Super Flower Golden Green HX 80 Plus Gold power supply, 350 Watt SF-350P14XE (HX)
G776 Cooltek Coolcube Aluminium Silver Mini-ITX Case
Thank you very much!:D
I am only aware of 3 current cards which are suitable for such systems: the GTX 1050, GTX 1050 Ti, and GTX 1060 mini. These should fit into most mini-ITX cases without any problem, but double-check that the dimensions are right. While these cards are made to fit ITX boards, they may not fit every mini-ITX case.
Hi. Thank you, Tim, for such a wonderful blog. I have questions regarding the Volta GPU. It is expected that the GeForce version of Volta will be launched sometime towards the end of 2017 or early 2018. Do you think there will be such a big difference in performance between a 1080 Ti based rig (which I am thinking of getting for DL) and Volta that I should wait for Volta?
My second question is: when Volta comes, will it need newer motherboards, or can it be used at its full strength on currently available motherboards too? E.g. I am thinking of getting an EVGA X99 Classified mbo, so if I wish to get Voltas in the future, could they be installed on this mbo, or will the Voltas need a newer series of boards? Thank you for your help.
Consumer Volta cards are what you want. These Volta cards will probably be released a bit later, maybe Q1/Q2 2018, and will fit into any consumer-grade motherboard, so an EVGA X99 will work just fine. They will be a good step up from Pascal, with a jump in performance similar to the one you see between Maxwell and Pascal.
But won't Volta consumer cards use the NVLink interface, which current motherboards like the EVGA X99 don't support?
If current motherboards will support Volta without forcing it into a backward-compatibility mode, i.e. without losing its full potential, then it would probably be better for me to get an EVGA 1080 Ti for now and get two Volta consumer cards when they are introduced. Any suggestions, please? Thank you very much.
Hi,
NVLink has just been introduced for a professional-grade desktop card, the Quadro GP100, and the link adapter alone costs £1000 or so. It is intended for HPC and deep learning applications, but it won't be necessary for the consumer cards, since x8 PCIe lanes are still considered sufficient and the cards can still use x16 lanes, even in a 2-way SLI configuration, with a 40-lane CPU. Hence, I think it's very unlikely NVLink will be included in consumer cards anytime soon.
I'm currently using my CPU (a Xeon E5-2620 with 64GB of memory) to train large convolutional networks on 3D medical images, and it is rather slow. The training (using Keras/TensorFlow) takes up 30-60 GB of memory, so I don't think the network could train on a single graphics card. Would buying/adding multiple graphics cards give my system enough GPU memory to train a 3D CNN?
I would try to adjust your network rather than your hardware. Medical images are usually large, so make sure you slice them up into smaller images (and pool the results for each image to get a classification). You can also utilize 1×1 convolutions and pooling to reduce your memory footprint further. I would not try to run on such data with CPUs as it will take too long to train. Multiple GPUs used in model parallelism for convolution are not too difficult to implement and would probably work if you use efficient parallelization frameworks such as CNTK or PyTorch, but it is still quite cumbersome.
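To make the "slice them up" suggestion concrete, here is a minimal NumPy sketch that cuts a large 3D volume into smaller patches that fit into GPU memory; the shapes are made-up, illustrative values.

```python
import numpy as np

volume = np.random.rand(256, 256, 256).astype(np.float32)   # one 3D scan (dummy data)
patch = 64                                                   # patch edge length

patches = []
for i in range(0, volume.shape[0], patch):
    for j in range(0, volume.shape[1], patch):
        for k in range(0, volume.shape[2], patch):
            # Each patch is small enough to train on individually.
            patches.append(volume[i:i+patch, j:j+patch, k:k+patch])

print(len(patches), patches[0].shape)   # 64 patches of 64x64x64 instead of one 256^3 block
# Per-patch predictions would then be pooled, e.g. by averaging class scores,
# to get a classification for the whole volume.
```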
I am trying to make an autonomous drone navigation system for a project. I want to have two cameras with two motors each (so they can turn like eyes), two accelerometers (because humans have two) and two gyroscopes (same reason as before) as inputs for the neural net, with the four motors of the drone as outputs. I'm trying to apply deep learning to make the drone autonomous. But I've only worked on GPUs with at least 100GB/s bandwidth. Since the computer needs GPIO to control the motors and receive input in real time, I went for a single-board computer. I couldn't find boards with great GPU performance except for the Jetson modules. What single-board computer or SoC would you recommend?
I think the Jetson is almost the only way to go here. Other modules are too big to work on drones. However, another option would be to send images to an external computer, process them there and send the results back. This is quite difficult and will probably have a latency of at least 300ms, and probably closer to 750ms. So I would go with a Jetson module (or the new standalone GPU modules, which are basically a Jetson without any other parts). You can also interface a Jetson with an Arduino, so you should have all that you need for motor control.
Movidius has some chips that might be useful for you, although they are quite specialized for visual deep learning.
Hi,
Thank you Tim for such helpful suggestions. I am interested in a 4-GPU setup (EVGA with the custom ICX cooling technology) for DL. Can anyone, based on their experience, recommend a reliable motherboard for holding 4 GPUs comfortably with ventilation space? It is very confusing, as there are many boards available, but some have little or no space for 4 GPUs and others have reliability issues as posted on Newegg. Secondly, my question is: will a 1300W PSU be enough to support 4 GPUs? Thank you very much.
Yes, the motherboard question is tricky. The spacing does not make a huge difference for cooling though; it is mostly about cooling on the GPUs, not around them. The environment where the machine is standing and the speed of the fans are much more important for cooling than the case. So I would opt for a reliable motherboard even if there is not too much ventilation space.
Oh, and I forgot: a 1300W PSU should be sufficient for 4 GPUs. If your CPU draws a lot (> 250W) you will want to up the wattage a bit.
Hi Tim. Thank you for your suggestion. Can you please look at these two Newegg links and, based on your extensive experience, make a suggestion for the mbo? My criteria are durability and being future-proof. I want to install two 1080 Ti cards for now and two Volta consumer GPUs when they come out next year (each card will take 2 slots).
The first is an Asus E-WS mbo with 7 PCIe sockets, price $514, https://www.newegg.com/Product/Product.aspx?Item=N82E16813182968&ignorebbr=1&cm_re=asus_e-ws_x99_motherboard-_-13-182-968-_-Product and the second is an EVGA with 5 PCIe sockets, price $300, https://www.newegg.com/Product/Product.aspx?Item=N82E16813188163&nm_mc=AFC-C8Junction&cm_mmc=AFC-C8Junction-PCPartPicker,%20LLC-_-na-_-na-_-na&cm_sp=&AID=10446076&PID=3938566&SID= . Thank you very much.
I have no experience with these motherboards. The Asus one seems better as it seems more reliable from the reviews. However, if you factor in the price I would go with the EVGA one. It will probably work out just fine and is much better from a cost efficiency perspective.
Hi,
I'm hoping someone could help recommend which GPU I should buy. My budget is limited at the moment, and I'm just starting out with deep learning. So I have narrowed down my options to either the GTX 1050 Ti or the 3GB version of the 1060. I'm mainly interested, at least to start with, in things like pix2pix and CycleGAN. So I'm unsure whether the extra 1GB of memory on the 1050 Ti would be better, or the extra compute power of the 1060. The 6GB version of the 1060 is a bit beyond my budget at the moment.
Thanks.
Tough question. For pix2pix and GANs in general, more memory is often better. However, this is also dependent on your data. I think you could do some interesting and fun stuff with 3GB already, and if networks do not fit into memory you can always use some memory tricks. If you use 16-bit floating point models (see the sketch below) then you should be quite fine with 3GB, so I would opt for the GTX 1060.
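Here is a minimal sketch of the 16-bit memory trick, assuming PyTorch with a CUDA GPU; the layer and batch sizes are arbitrary, illustrative values.

```python
import torch
import torch.nn as nn

# A tiny dummy convnet; converting it to half precision roughly halves the
# memory needed for its weights and activations.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3),
).cuda().half()                               # FP16 weights

x = torch.randn(16, 3, 128, 128).cuda().half()   # FP16 inputs (dummy batch)
out = model(x)
print(out.dtype)                              # torch.float16
```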
Hey Tim, perfect article! Thank you very much!
I have a question: how about the Tesla M2090 compared to the 1060?
A Tesla M2090 would be much slower than a GTX 1060 and has no real advantage. It costs more. So definitely choose a GTX 1060 over a Tesla M2090.
Hi,
after we choose a GPU like the 1080 Ti, how do we assemble a good box for it? There are a bunch of motherboards that are gaming-specific and not designed for days of computing, and choosing the best model is also not a good option because they cost much more than the GPU itself. A 1000W+ PSU and an open-loop cooler cost as much as the upgrade from a 1070 to a 1080 Ti.
Do these cards need XMP RAM at 3000MHz, or is 2133MHz enough?
I think the story only gets more complex after choosing the card, and I ask you to light the way on selecting the system parts, not for overclocking the GPU but just to turn it on. Such a guide would change the budget left over for the GPU.
Gaming motherboards are usually more than enough. I do not think compute specific motherboards have a real advantage. I would choose a gaming motherboard which has good reviews (which often means it is reliable). I would try to go with less fancy options to keep the price down. You do not really need DDR4, and a high clock for RAM is also not needed for deep learning. You might want those features if you work on more data science related stuff, but not for deep learning.
Hi Tim, thanks for such a thorough exploration. But I have some other questions in mind, related to the CPU side. Normally, with a single card (Titan X Maxwell, as well as 1080 Ti), one CPU core stays at the top while the other three fluctuate in the 10-50% range (P8Z77, i5-3570K, 32GB, under DIGITS). Now I am planning to make a change, with an X99-E WS board. The only thing I could not decide is: if DL apps can use only a single core, should we look at the fastest single-core CPUs? At the time of writing, the fastest single core is on the i7-7700K 4.2GHz CPU. I am planning to buy a Xeon 2683 v3 processor, which appears faster on CPU benchmarks, but slower when it comes to single-core performance.
Due to this fast-single-core question, I cannot decide. Since my aim is to go for 4 GPUs, should I go for the 4-core one, or the 14-core one? I have used a dual Titan X setup for a while and saw CPU usage rise to around 280%, compared to 170-180% with a single GPU. From this observation, CPU performance appears important to a degree, to my eye. Any opinions?
The CPU speed will make little difference; see my analysis in my other blog post about hardware. Often a high CPU percentage means the CPU is actively waiting for the GPU to do its job. Running 4 or more nets on 2 GPUs on a 4-core CPU never caused any problems for me; the CPU never saturated (and this is with background loader threads; see the sketch below). So no worries about the CPU size; almost any CPU will be sufficient.
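As a small illustration of why a modest CPU is usually enough, here is a sketch (assuming PyTorch) of background loader workers keeping batches ready while the GPU computes; the dataset is a dummy stand-in with arbitrary sizes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for real images and labels.
data = TensorDataset(torch.randn(10000, 3, 32, 32),
                     torch.randint(0, 10, (10000,)))

loader = DataLoader(
    data,
    batch_size=128,
    shuffle=True,
    num_workers=2,       # ~2 background loader workers per GPU is usually plenty
    pin_memory=True,     # speeds up host-to-GPU copies
)

for images, labels in loader:
    # The copy can overlap with GPU compute, so the CPU mostly just feeds data.
    images = images.cuda(non_blocking=True)
    break
```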
Where do you stand on NVIDIA Founders Edition vs custom GPUs? Is there a preference for one over the other in a DL setting?
They run the same chip, so they are essentially the same. It is a bit like overclocked GPUs vs normal ones; for deep learning it makes almost no difference.
Hi Tim
First off, kudos, like everybody else, for making our lives immensely easier. I apologise in advance for any inaccuracies; this is all rather new to me.
I am currently considering getting my first rig; I have settled on a 1080 Ti as a good compromise of RAM, budget, performance and future-proofness.
From what I can tell, the next logical upgrade would be to add another 1080 Ti whenever I max it out, and continue using 16 lanes per card.
Having 3 cards in the rig would offer less of an improvement in training times, because of the limit of 8x lanes per card, increased overhead in coordinating parallelisation, increased cooling requirements, and a larger PSU.
Hence my envisaged upgrade route would be, if and when required:
1) Add a second 1080Ti.
2) Add a second rig (this will help with HyperParam tuning, I suspect it won’t help much with parallel training on the SAME model)
Am I wrong in envisaging it this way? I.e. would it still be more cost efficient to try to push a third and fourth card into the same rig? In that case I have to pay more attention to motherboard selection. I want to have 64GB of RAM on the motherboard itself, so this is already pushing me into the "right" territory with regard to choosing a motherboard that supports 3 or 4 GPUs.
Inputs welcome. – and thanks again.
MR
Do not worry too much about the problems with lanes, cooling etc. if you add a third GPU. The slow-down will be noticeable (10-20%), but you still get the most bang for the buck. You are right that adding another server does not help with training a single model (it gets very complicated and inefficient), and it would be kind of wasted as a parameter machine. I would stuff one machine with as many GPUs as possible, and if you need more computing power after that, then buy a new machine.
Hey,
My question is (again 🙂 ) about PCIe 2.0 performance. Most questions are about new mainboards, while I am thinking about buying a used Xeon workstation with this http://ark.intel.com/products/36783 chipset in a 2x PCIe x16 configuration. There is already an old Nvidia Quadro 4000 card in there and I want to add a GTX 1080 8GB.
Is there anything that wouldn’t work in this setup?
It would work just fine. PCIe guarantees backwards compatibility, so the GTX 1080 should work just fine. The CPU is a bit weak, but it should run deep learning models just fine. For other CPU-intensive work it will be slow, but for deep learning work you will probably get the best performance for the lowest price. One thing to note is that you cannot parallelize across a Quadro 4000 and a GTX 1080, but you will be able to run separate models just fine.
Thank you for the excellent post. I have a few questions about PCIe slots.
1) I found that most pre-built workstations have only one PCIe x16 slot and some PCIe x4 and x1 slots. Can a GTX 1080 Ti work well in an x4 or x1 slot?
2) Do you have recommendations for pre-built systems with two PCIe x16 slots? (I prefer not to build one from scratch, but I am okay with simple installation like adding GPUs and RAM.)
I have seen some mining rigs which use x1, but I do not think they support full CUDA capabilities with that, or you need some special hardware to interface with these things. It probably also breaks standard deep learning software. The best thing is just to stick to x16 slots.
Hello Tim
I stumbled across this news item about a pair of cryptocurrency-specific GPU SKUs by NVIDIA. (http://wccftech.com/nvidia-pascal-gpu-cryptocurrency-mining-price-specs-performance-detailed/)
With a price tag of $350 for a GTX 1080-class card, do you think it's a good buy for deep learning? The only downside is no video ports, but that doesn't matter for DL anyway.
Thanks
Haris
They will be a bit slower, but more power efficient. If you use your GPUs a lot, then this will make a much more cost efficient card over time. If you plan to use your card for deep learning 24/7 it would be a very solid choice and probably the best choice when looking at overall cost efficiency.
Thank you for the quick reply!
Could you say why they would be slower? I want to weigh the speed against the power savings.
I’m always amazed about claims such as:
“Buy two of these cards (at 350$ each) and place it in this system and you are looking at making over $5,000 a year in extra income! via LegitReviews”
If that were true, the opportunity cost for Nvidia or AMD would be insane: why sell them in the first place, and not keep them to mine cryptocurrency themselves in giant farms for their shareholders' immediate benefit?
If I were a shareholder, I'd be furious at the management for such a lousy decision 😀
This happened in the past: suddenly the difficulty stagnated while hardware kept improving, and the GPUs became worthless. The last time, this happened from one week to the next; mining hardware worth $15,000 was worth $500 one week later. So from a long-term perspective, going into cryptocurrency mining would not be a good strategy for NVIDIA or AMD; it is just too risky and unstable.
Hi, thanks for the article. I want to buy a GTX 1060. Do you think there is any risk in buying a factory-overclocked GPU? Can I be sure that the factory-overclocked card will be precise in its computations? I already read in the comments that the speed difference isn't noticeable, but the prices in my country aren't very different, so for me it's basically the same to buy normal or overclocked… I would also like to ask if there is any difference between 8GHz and 9GHz GDDR5. The cards I'm considering are:
Overclocked:
MSI GeForce GTX 1060 GAMING X+ 6G
MSI GeForce GTX 1060 GAMING X 6G
GIGABYTE GeForce GTX 1060 G1 Gaming
ASUS GTX 1060 DUAL OC 6GB
Not overclocked:
ASUS GeForce GTX 1060 Turbo 6GB
ASUS GTX 1060 DUAL 6GB
Thanks for your opinion 🙂
The factory overclocks are usually within specific bounds which are still safe to use, so it should be fine. As you say, the core clock speed makes no big difference; however, the memory clock makes a big difference. So if the prices are similar I would go for a 9GHz GDDR5 model.
Hi Tim,
What do you think of the following config for computer vision deep learning and scientific computing (MATLAB, Mathematica)?
Intel – Core i7-6850K 3.6GHz 6-Core Processor
Asus – SABERTOOTH X99 ATX LGA2011-3 Motherboard
Corsair – Vengeance LPX 64GB (4 x 16GB) DDR4-2800 Memory
Samsung – 960 Pro 1.0TB M.2-2280 Solid State Drive
EVGA – GeForce GTX 1080 Ti 11GB SC2 HYBRID GAMING Video Card
SeaSonic – PRIME Titanium 850W 80+ Titanium Certified Fully-Modular ATX Power Supply
I’m oversizing the PSU because I might want to add another 1080 Ti in the future. Some questions I have:
1) What kind of cooling do you recommend for the CPU?
2) Do you hook your machine to an Uninterruptible Power Supply/Surge Protector?
There may be a problem with the Samsung disks under Linux. I have a Dell Alienware Aurora R6 and had difficulty installing Linux. I checked the forums and no one has been able to install it on the R6; apparently the Samsung drive shows up as 0 bytes or something. I am using TensorFlow on Windows 10 on the same hardware. It works fine and the GPU is getting utilized, but I don’t know about performance.
This is a good point, Ravi. A quick Google search shows that some people have problems with PCIe SSDs under Linux. However, some solutions are already popping up, so it seems that fixes and instructions on how to get it working are underway. So it should work in some way, but it may require more fiddling.
Looks good. Strong CPU and RAM components are useful for scientific computing in general and I think it is okay to invest a bit more into these for your case.
1) For CPU cooling a normal air cooler will be sufficient; often CPUs run rather cool (60-70 C) and do not need more than air cooling.
2) Not really needed; I use a power strip with a surge protector and that should protect your PC from most things which can cause a power surge.
Hello, nice guide, based on which I bought a 1050 Ti. But going through the installation of drivers/CUDA etc., I came across this page: https://developer.nvidia.com/cuda-gpus
Here, the 1050 Ti is not listed as a supported GPU. Am I stuck with a useless card? When I try to install the CUDA 8 Toolkit, it always gives me an “Nvidia installer failed” error and I cannot get past it. Also, do I need Visual Studio installed? I am trying to install as per the instructions here: http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#axzz4kPWtoq7o
I need TensorFlow to work with my GPU, that is all. Any advice? Thanks.
The GTX 1050 Ti supports CUDA fine; the problem is that you probably need the right compiler to make everything work. So yes, if you are missing the right Visual Studio version, that is the missing ingredient.
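As a quick sanity check once CUDA and the compiler are set up, something like the following should list the card; this assumes a TensorFlow 1.x GPU build and is only a sketch, not an official installation test:

    # Check whether TensorFlow can see the GTX 1050 Ti (TensorFlow 1.x API).
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # A working CUDA/cuDNN setup shows a GPU entry in this device list.
    print(device_lib.list_local_devices())

    # Run a tiny op with device placement logging to confirm it lands on the GPU.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        a = tf.constant([1.0, 2.0, 3.0])
        print(sess.run(tf.reduce_sum(a)))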
Thanks Tim.
Am I supposed to install the full Visual Studio compiler, or will the VS 15 Express edition do?
The Nvidia CUDA 8.0 toolkit will not install; it always exits with an “Nvidia installer failed” error, and VS 15 Express also exited with an error (“some part did not install as expected”). Looks like this has the potential to turn into a nightmare. Any ideas, anyone? Thanks. I am running Win7 Pro, SP1, and trying to get this EVGA GPU working for TF.
Hi Tim,
can you comment on this build: https://pcpartpicker.com/list/nThsQV
Thank You
This is a pretty high-cost rig. For deep learning performance it will not necessarily be better than a cheaper rig, but you will be able to do lots of other stuff with it, like data science, Kaggle competitions and so forth. Make sure the motherboard supports Broadwell-E out of the box.
Thank you very much Tim, I really appreciate your support.
I am in the “getting started in deep learning but serious about it” category; I have completed Andrej Karpathy’s and Andrew Ng’s courses. I am primarily interested in computer vision (Kaggle) and reinforcement learning (OpenAI Gym). I am looking to build a deep learning PC; here is my parts list:
https://pcpartpicker.com/list/n7KZm8
Should I keep the GTX 1070 or should I spend the extra $250 to get the GTX 1080 Ti? Will my current CPU be able to support the GTX 1080 Ti comfortably?
Looks like a very solid, high-quality low-price rig. I think the GTX 1070 will be enough for now. You will be able to run most models, and those that you cannot run you could run in 16-bit precision. If you have the extra money though, the GTX 1080 Ti will give you a very solid card which you will probably not need to upgrade even after Volta hits the market (although selling a GTX 1070 and getting a GTX 1170 would not cost much either). So both choices are good. The GTX 1080 Ti is not necessary but would add comfort (no precision fiddling) and you will be good with it for at least a year or two.
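If memory on the GTX 1070 ever becomes the limiting factor, 16-bit storage is only a couple of lines; here is a minimal sketch assuming PyTorch (the model and sizes are made up for illustration):

    # Minimal sketch: storing a model and its inputs in FP16 to halve memory use.
    import torch
    import torch.nn as nn

    # A made-up model just for illustration.
    model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
    model = model.cuda().half()              # weights now take half the memory

    x = torch.randn(64, 1024).cuda().half()  # inputs must match the weight dtype
    out = model(x)
    print(out.dtype)                         # torch.float16

In practice you may want to keep some parts (e.g. the loss computation) in 32-bit to avoid numerical issues, but this is the basic idea behind the precision fiddling mentioned above.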
Hi Tim,
The new AMD Vega Frontier Edition card comes with 25 TFLOPS of FP16 compute and 480 GB/s memory throughput. AMD pushes the card primarily for Graphics workstations and AI (???). Is there a framework that would scale in a multi-GPU environment and supports AMD’s technology (OpenCL I presume)?
Thank you
Currently, I am not aware of libraries with good AMD support, but that might change quickly. The specs and price are pretty good and there seems to be more and more effort put into deep learning on AMD cards. It might be competitive with NVIDIA by the end of the year.
Hi Tim,
I want to first thank you for all your awesome posts. You definitely rock!
I have what I think is a quick question for you.
In regards to deep learning and GPU processing: I have one Titan X. What’s your opinion on either keeping the Titan X and getting 3 more, or selling the Titan X and going with 4 of something more modern and affordable like the GTX 1070? What do you think, considering function and price?
Thanks in advance for your advice.
I think this depends on what you want to use your cards for. If your current models saturate your Titan X, then it might make sense to stick with more Titan Xs. However, if your memory consumption per model is usually much lower than that, it makes sense to get GTX 1070s. This should be a good indicator of what kind of card would be best for you. Also consider keeping the GTX Titan X and buying additional GTX 1070s; you will not be able to parallelize across all 4 GPUs, but this might be a bit cheaper.
I’m thinking about the future too, and the GTX 1070 may need to be upgraded to fill the shoes of the Titan X; however, I hear the GTX 1080 Ti might be a good alternative to the Titan X across the performance board.
Thoughts on this Tim?
Greg
If you can wait until Q1/Q2 2018, I would stick with any of the mentioned GPUs and then upgrade to a Volta GPU. The problem with the GTX 1080 Ti is that it will lose quite some value once the Volta GPUs hit the market. You could get a GTX 1080 Ti and sell it before Volta comes out, but I think upgrading to Volta directly might be smarter (if you can afford the wait).
Hi, it’s my first deep learning project and the only GPU I could find is
http://www.gpuzoo.com/GPU-AFOX/GeForce_GT630_-_AF630-1024D3L1.html
What do you think about it? Is it suitable for neural machine translation?
My data is about 4 GB and 1 million sentences.
It is a bit tight. You might be better off using a CPU and a library with good CPU support. I heard Facebook has quite good CPU libraries; they might be integrated into PyTorch, but I am not sure.
Actually, AMD has already successfully ported over 99% of the NVIDIA deep learning code to their Instinct GPUs, so compatibility will not be a problem.
As Facebook has recently open-sourced its Caffe2 deep learning code, that will also be available when using AMD GPUs.
Without competition, AI will not push ahead, nor will prices drop enough to spread AI use. As such, AMD’s massive investment in AI and deep learning is crucial to our societies.
“AMD took the Caffe framework with 55,000 lines of optimized CUDA code and applied their HIP tooling. 99.6% of the 55,000 lines of code was translated automatically. The remaining code took a week to complete by a single developer. Once ported, the HIP code performed as well as the original CUDA version.”
https://instinct.radeon.com/en-us/the-potential-disruptiveness-of-amds-open-source-deep-learning-strategy/
“Today Facebook open sourced Caffe2. The deep learning framework follows in the steps of the original Caffe, a project started at the University of California, Berkeley. Caffe2 offers developers greater flexibility for building high-performance products that deploy efficiently.”
https://techcrunch.com/2017/04/18/facebook-open-sources-caffe2-its-flexible-deep-learning-framework-of-choice/
Have you considered using the new breakthrough AMD technology for AI, from the Ryzen CPUs aimed at parallel processing to the new AMD Instinct series of GPUs?
These will offer power and features superior to NVIDIA at an at least 30% lower cost, and have been designed to be ready for future AI computing needs, being much more scalable than NVIDIA’s technology.
Launched June 2017: “AMD’s Radeon Instinct MI25 GPU Accelerator Crushes Deep Learning Tasks With 24.6 TFLOPS FP16 Compute”
“Considering that EPYC server processor have up to 128 PCIe lanes available, AMD is claiming that the platform will be able to link up with Radeon Instinct GPUs with full bandwidth without the need to resort to PCI Express switches (which is a big plus). As we reported in March, AMD opines that an EPYC server linked up with four Radeon Instinct MI25 GPU accelerators has roughly the same computing power as the human brain. ”
Read more at :
https://hothardware.com/news/amd-radeon-instinct-mi25-gpu-accelerator-deep-learning-246-tflops-fp16
https://instinct.radeon.com/en-us/product/mi/
https://instinct.radeon.com/en-us/the-potential-disruptiveness-of-amds-open-source-deep-learning-strategy/
Indeed, this is a big step forward. The hardware is there, but the software and community are not fully behind it yet. I think once PyTorch has full AMD support we will see a shift. AMD is getting more and more competitive and will soon be a great alternative, if not the better option, to NVIDIA GPUs.
I’m a fanboy of the AMD GPU + FreeSync monitor combo for gaming; insane value vs. Nvidia.
But when it comes to choosing a GPU for your Deep Learning personal station **TODAY**, there’s no possible hesitation: Nvidia with CUDA/CUdnn all the way, from the GTX 1060 6Gb to the 1080Ti/Titan Xp.
Building a stable deep learning rig is complex enough for beginners (dealing with Win vs Linux, Python 2.7 vs 3.6, Theano vs TensorFlow and so on); no need to add a layer of cutting-edge tuning with AMD’s “work in progress” 😀
The moment the AMD ecosystem is truly operational and crowd-tested, I’ll be the first to drop Intel/Nvidia and return to AMD with a Ryzen 1800X/Vega 10.
No need to wait until the end of the year: MIOpen is not just a future option for PyTorch, PyTorch is now available for AMD GPUs, as you can read below.
MIOpen offers the benefit of source-code availability, meaning users can fine-tune the code to best fit their needs and also improve the code base.
For those starting AI / deep learning projects, AMD offers a fully functional alternative to NVIDIA with a 30-60% cost saving. Not forgetting that AMD has many AI projects using its technology.
PyTorch :-
“For PyTorch, we’re seriously looking into AMD’s MIOpen/ROCm software stack to enable users who want to use AMD GPUs.
We have ports of PyTorch ready and we’re already running and testing full networks (with some kinks that’ll be resolved). I’ll give an update when things are in good shape.
Thanks to AMD for doing ports of cutorch and cunn to ROCm to make our work easier.
Reply from JustFinishedBSG:
I am very very interested. I’m pretty worried by nvidia utter unchecked domination in ML.
I’m eager to see your benchmarks, if it’s competitive in PyTorch I’ll definitely build an AMD workstation”
Source:-
https://www.reddit.com/r/MachineLearning/comments/6kv3rs/mopen_10_released_by_amd_deep_learning_software/
AMD AI/Deep learning products now available:-
https://instinct.radeon.com/en-us/category/products/
The software is getting there, but it has to be battle-tested first. It will take some time until the rough edges are smoothed out. A community is also very important, and currently the MIOpen community is far too small to have any impact. The same is true for PyTorch when compared to TensorFlow, but current trends indicate that this may change very soon. In the end, however, the overall picture counts, and as of now I cannot recommend AMD GPUs since everything is still in its infancy. I might update my blog post soon though, to indicate that AMD GPUs are now a competitive option if one is willing to handle the rough edges and live with less community support.
I don’t think you should be so quick to dismiss AMD’s solutions; Caffe is available and tested, and so are many applications using MIOpen.
AMD is also offering free hardware for testing for interesting projects, but regardless of that, researchers should start enquiring now, to see if what is on offer meets or even exceeds their needs.
Hi Tim.
I currently have the following configuration:
GPU: GTX 1080
CPU: i5-6400 CPU @ 2.70GHz (Max # of PCI Express Lanes = 16)
Motherboard: H110M PRO-D (1 PCIex16)
RAM: 32GB
PSU: max. power 750W
I want to upgrade this machine with at least one more GPU for the moment and gradually add more GPUs over the next couple of months. I want this machine to be used by 4-5 people. I like this solution better than the one where everyone gets a PC because one person can use more than one GPU if it’s available, and it’s cheaper (probably). Can you recommend a motherboard and a CPU for this purpose and maybe give your comment on this approach since I don’t have any experience?
Thanks!
There are no motherboards which are exceptional for 4 GPUs; they are all expensive and have their problems. Your best bet is to search on pcpartpicker.com for 4-way SLI motherboards and select one with a good rating and a good price. Optionally, you can look in this comment section (or the comment section of my hardware blog post) for motherboards that other people picked. A specific CPU is difficult to recommend. For deep learning you will not need a big CPU, but depending on what each user does (preprocessing etc.) one might need a CPU with more cores. I would go with a fast 4-core, or with 6+ cores.
Hi Tim,
Thanks for all the good info.
A lot of the specs for the new Volta GPU cards that will be coming out seem to focus on gameplay. Maybe I missed it, but have you seen any kind of breakdown comparison with other GPUs like the Titan Xp or the 1050 Ti? I am just wondering if it would be a good idea to wait for Volta or just dive in for a couple of Xps now. What are your thoughts?
Thanks,
Jeff
It is difficult to say. The rumor mills point to consumer Volta cards around Q3 2017, but Ti and Titan cards in Q1/Q2 2018. It is unclear if these cards also come with the TensorCores that the V100 has. If not, then it is not worth waiting; if they have them, then a wait might be worth it. In any case, you would wait for quite some time, so if you need GPU power now, I would go with a Titan Xp or GTX 1080 Ti (it is easier to sell those to gamers once Volta hits the market).
Hi Tim,
Thank you for this valuable information,
I am planning to build a 5x GTX 1080 rig using an ASRock Z87 Extreme4 motherboard.
I was wondering whether it is a good idea to use USB PCIe x1-to-x16 risers to plug in all the GPUs and save some money instead of buying a motherboard with multiple x16 slots.
Do you think this will have a big impact on my deep learning processes?
Thanks in advance
I am not sure if that will work. This is usually done for cryptomining but I have never seen a successful setup for deep learning. You could try this and see if it works or not and let us know. I am curious.
Hi Tim,
Nvidia announced that they will unlock the Titan Xp performance. Does this help with deep learning? It’s currently unclear to me what exactly they have unlocked: is it FP64 and/or FP16?
Thanks and regards
Say, so if I get two 8GB GTX 1080 GPUs for training models, do I need a third GPU to run two monitors? I was trying to figure out if I actually need to get three 1080s, or if I can get away with just two 1080s and then one somewhat smaller GPU just to run my monitors.
You just need the two GPUs. You can run deep learning models on them and drive your monitors with them at the same time. I do it all the time with no problems!
Awesome guide Tim!
As many others are probably wondering, what’s your take on the new RX Vega cards? And specifically their FP16 performance for deep learning as opposed to an Nvidia 1080ti for example?
For the price of a 1080ti you can almost buy two Vega 56. For someone planning to use them for mostly Reinforcement Learning on an AMD threadripper platform, would you recommend I get two 1080tis or four Vega 56’s?
Thanks!
The hardware is there, but not yet the software. Currently, the software and community behind AMD cards are too weak to be a reasonable option. It will probably become more mainstream in the next 6 months or so. So you can live with some trouble, be a pioneer in AMD land and thus help many others by contributing bug reports, or you can make it easy and go with NVIDIA; your choice.
Hi Tim,
I have a follow-up question on this. I’m building a new server with 4 1080 Ti’s. I plan to use threadripper 1950x and MSI X399 combo. Do you think this is a good option?
Yes, it is a good option. In fact, I believe it is the best option for standard hardware currently. I am using an X399 (ASRock) and a Threadripper too. I do not think there is a big difference between MSI and ASRock, and you will be happy with either. I personally use a 1900X; it has the full 64 PCIe lanes so that you can run NVMe SSDs, and 8 fast cores are sufficient even if multiple people use the computer for deep learning. If you use other CPU-hungry applications like databases a lot, then a 16-core Threadripper might be worth it. Switching from a 1950X to a 1900X saves a good amount of money at little performance difference, though. I personally would probably go with a 1900X and more RAM instead of a 1950X.
Awesome! I got the combo together with a 1600W PSU and 128 GB of 2666MHz RAM. I will update you later about my experience with this Threadripper build. Overall, this is our first AMD-based box; we have several good boxes based on the i7 and Xeon platforms. I hope this experimental AMD platform works well; the Threadripper’s specs are really good.
Looks great! I am excited to hear about your experience!
Hello Tim, any reason for choosing the AsRock motherboard over the others such as the Asus Rog Zenith Extreme? Which NVMe SSD and size do you recommend?
I wanted to tinker with a 10 Gbit/s network card; that is the main reason I chose the board. I would get one NVMe SSD for your OS (256 GB is sufficient), and if you want to put some datasets on an NVMe SSD, I would definitely get 1 TB, or 512 GB if that is too expensive.
Thanks Tim. The motherboard that you are using is not available in my city. I need to choose another one.
I plan to buy one GPU at the beginning and add 2-3 more later if needed. Do all GPUs have to be from the same brand and the same model? I think I read somewhere that if one GPU is a 1080 Ti, all four have to be 1080 Tis, and similarly, if one GPU is a 1070 Ti, all four have to be 1070 Tis. Is there a problem if I buy them in stages rather than buying all at once?
For such multi-GPU systems, what kind of cooling system do you recommend? 2 fans, 3 fans or water cooling? I think in the case of the 1070 (Ti) and 1080 (Ti), they take 2 slots.
The brand does not matter; you can mix here. The series should match (10-series or 900-series only). If you want good parallelization performance the chips should match (only GTX 1080 Tis OR only 1070 Tis); you can get away with minor differences, e.g. a Titan Xp and a GTX 1080 Ti, but the slowest card will determine the speed of the parallelism. There should be no problem buying them in stages; I often bought my GPUs that way. Currently, I have only one GPU (the dilemmas of being a grad student).
Water cooling for GPUs is very effective but can be a mess too. If you go for water cooling I would try to go for prebuilt models (not DIY water-cooling kits), and make sure you have a case that can fit the radiators; this can be quite problematic with 4 GPUs. I would not recommend water over air or vice versa; both have clear benefits and drawbacks, so it is more a matter of what you prefer. I usually go with air-cooled GPUs (especially if I only have 1 or 2).
Thanks Tim. When you mentioned “good parallelization performance” and using 2-4 GPUs, do you mean multiple GPUs simulating different pieces of the same model with the same parameters, or each GPU running the same model but with different parameters? In both cases, does TensorFlow do it automatically with one parameter change from the user, or does the user have to manually write code to do the parallelization?
As for the RAM, I see people using 2666, 3200 and 3600. Which do you recommend for Threadripper and Intel i9 systems?
Usually, parallelization means data parallelism, which means having the same model / same parameters on each GPU and splitting the input data across the GPUs. You usually have to write manual code for this, but it is often easy to do.
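A minimal sketch of what that manual code can look like, using PyTorch’s built-in nn.DataParallel (the model and batch size here are just placeholders):

    # Data parallelism sketch: same parameters on each GPU, batch split across GPUs.
    import torch
    import torch.nn as nn

    model = nn.Linear(4096, 1000)                 # placeholder model
    model = nn.DataParallel(model).cuda()         # replicate the model on all visible GPUs

    x = torch.randn(256, 4096).cuda()             # this batch gets split across the GPUs
    y = model(x)                                  # outputs/gradients are gathered on GPU 0
    print(y.shape)

TensorFlow has analogous patterns (placing replicas with tf.device and averaging gradients), but in both cases it is a few explicit lines rather than a single switch.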
Hello Tim, about the slowest card be the one determining the speed of parallelism…
If I use a slower card, say a 1060, to drive a 4K monitor and then get 1-3 1080 Tis to do the DL work, will I be able to tell TensorFlow and other frameworks to ignore the 1060 and run the simulations using those 1080 Tis only? If not, is it better to just use one of the multiple 1080 Tis to do both the simulation and drive the 4K monitor?
You can select the GPUs for parallelism in TensorFlow and other frameworks, so if a GTX 1060 drives your monitor, then just select the 3 GTX 1080 Ti and you will be fine. For parallelism, in this case, it would be definitely better to just use the 3 GTX 1080 Tis.
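A minimal sketch of how this selection typically looks, assuming the GTX 1060 shows up as device 0 and the 1080 Tis as devices 1-3 (the numbering on your system may differ):

    # Hide the display GPU from the framework before importing it.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"   # only the three 1080 Tis remain visible

    import tensorflow as tf
    # Inside TensorFlow the 1080 Tis are now renumbered as /gpu:0, /gpu:1, /gpu:2.
    with tf.device("/gpu:0"):
        a = tf.constant([1.0, 2.0])

The same CUDA_VISIBLE_DEVICES trick works for PyTorch and most other CUDA-based frameworks.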
Thanks Tim. So, to split the same simulation across all the GPUs in one computer, I need to have cards of the same type (i.e. all 1080 Ti or all 1070 Ti); otherwise the performance will be limited by the slower card. If I run the same model with different parameters in one computer, do I still have to have GPUs of the same type to avoid this issue? Will adding a slower GPU in the same computer become a bottleneck regardless of whether I use it just to drive the 4K monitor or to both drive the 4K monitor and do simulation?
In each case, how is the performance on using 2-4 more computers with one GPU each vs. putting 2-4 GPUs in one computer?
If you do not use parallelism, the GPUs will be independent, meaning that they do not affect each other’s performance. 4 computers with 1 GPU each is very slow for parallelism; 4 GPUs in 1 computer will be superior in all relevant aspects (cost/performance mostly). If you add slower GPUs for your monitors, you do not have to use them for training; you can always parallelize over your fast GPUs only, so that should be no issue.
Thanks Tim. I recall you mentioning that the GPU is more important than CPU. Do systems with CPU that supports quad channel memory perform “much better” than those with CPU that supports dual channel memory?
The difference is almost non-existent. However, DDR4 memory is much better than DDR3 for many other tasks; if you run your models on the CPU instead of the GPU, it can increase performance considerably. DDR4 is now quite common and I would in general recommend getting DDR4 over DDR3. It is not very useful for deep learning, but you will have a more balanced computer, which is useful for many other tasks.
Hello Tim, is it possible for you to share the list of components to build your Threadripper workstation?
This is my build: https://de.pcpartpicker.com/list/NRhtpG. The PSU is for 1 GPU; you want to upgrade the PSU if you want to have more GPUs.
Thanks Tim. If I connect one 4K monitor @ 60Hz or 1-2 QHD/WQHD monitors to the 1080 Ti, will the card’s DL performance be affected, say by more than 20%? Is it better to get a cheaper video card to drive the high-res display(s) and devote 100% of the 1080 Ti(s) to DL tasks? Which video cards do you recommend to do this reasonably well? I guess I should plug it into an x4 or x8 PCIe slot and save the x16 slots for the 1080 Ti(s).
Does TensorBoard looks good on 4K display?
It should not affect the performance much. It will take a few hundred MB of RAM (100-300MB), but will hardly affect deep learning performance (0-5% at most, e.g. if you are watching a video in the background while training).
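If you want to see the actual number on your machine, nvidia-smi reports per-GPU memory usage; here is a small sketch (the exact output format depends on your driver version):

    # Query how much GPU memory is currently used, e.g. by the desktop/monitors.
    import subprocess

    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv",
    ])
    print(out.decode())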
Thanks. Do you know any good computer cases that can support 4 1080Ti or 4 Titan GPUs and silent or almost silent?
Thanks Tim. Is there any point to get EPYC CPU rather than Threadripper?
Hello Tim, besides the ASRock Fatal1ty X399, do you know any other X399 motherboard that works well and is fully compatible with Ubuntu 17.10?
Here is a quick update. The system is running smoothly with Ubuntu 16.04.4 LTS. Installation took less than 20 minutes or so; I went out of the office to grab a cup of coffee and it was finished when I came back. I haven’t installed the toolchains yet, so I cannot say anything about them right now.
The following is my configuration for a reference in case other people are thinking about building a server like this.
CPU: Threadripper 1950x
Mobo: MSI X399 GAMING PRO CARBON AC. The nice thing about this mobo is that it comes with a 3D mounting kit so that you can mount an additional high-speed fan to cool the GPUs.
Memory: Ballistix Sport LT Series DDR4 2666 MHz UDIMM (8 x 16GB). This memory is officially validated for the MSI mobo; 16GB, 32GB, 64GB and 128GB configurations all work. Some threads mentioned that overclocking the memory may introduce extra issues, so I suggest staying at 2666MHz.
Storage: 2x Samsung 960 Evo NVMe (this NVMe drive is much faster than the 860 and I believe it reaches the maximum speed supported by the 1950X).
P