Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. Without a GPU, you might wait months for an experiment to finish, or run an experiment for a day or more only to see that the chosen parameters were off. With a good, solid GPU you can iterate quickly over deep learning networks and run experiments in days instead of months, hours instead of days, minutes instead of hours. Making the right choice when buying a GPU is therefore critical. So how do you select the GPU that is right for you? This blog post delves into that question and offers advice that will help you make a choice you will be happy with.
Having a fast GPU is very important when you begin to learn deep learning, because it allows you to rapidly gain the practical experience that is key to building the expertise with which you can apply deep learning to new problems. Without this rapid feedback, it simply takes too long to learn from one's mistakes, and it can be discouraging and frustrating to continue with deep learning. With GPUs, I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In that competition I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks efficiently across multiple GPUs, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks, like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition, received almost no speedup.
Later I ventured further down the road and I developed a new 8-bit compression technique which enables you to parallelize dense or fully connected layers much more efficiently with model parallelism compared to 32-bit methods.
However, I also found that parallelization can be horribly frustrating. I naively optimized parallel algorithms for a range of problems, only to find that even with optimized custom code, parallelism across multiple GPUs does not work well given the effort you have to put in. You need to be very aware of your hardware and how it interacts with deep learning algorithms to gauge whether you can benefit from parallelization in the first place.
Since then, parallelism support for GPUs has become more common, but it is still far from universally available and efficient. The only deep learning library that currently implements efficient algorithms across GPUs and across computers is CNTK, which uses Microsoft's special parallelization algorithms of 1-bit quantization (efficient) and block momentum (very efficient). With CNTK and a cluster of 96 GPUs you can expect a near-linear speedup of about 90x-95x. PyTorch might be the next library to support efficient parallelism across machines, but it is not there yet. If you want to parallelize on one machine, your options are mainly CNTK, Torch, and PyTorch. These libraries yield good speedups (3.6x-3.8x) and have predefined algorithms for parallelism across up to 4 GPUs on one machine. There are other libraries that support parallelism, but these are either slow (like TensorFlow, at 2x-3x), difficult to use for multiple GPUs (Theano), or both.
If you place value on parallelism, I recommend using either PyTorch or CNTK.
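To put the speedup numbers above in perspective, it helps to convert them into parallel efficiency, i.e. the fraction of ideal linear scaling actually achieved. A small sketch, using the figures quoted in the text:

```python
def parallel_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved: 1.0 means perfect scaling."""
    return speedup / n_gpus

# CNTK across 96 GPUs, ~90x-95x speedup (midpoint shown)
print(parallel_efficiency(92.0, 96))  # -> ~0.96, i.e. near-linear

# single machine, 4 GPUs, 3.6x-3.8x speedup (midpoint shown)
print(parallel_efficiency(3.7, 4))    # -> ~0.93

# TensorFlow on 4 GPUs at ~2.5x
print(parallel_efficiency(2.5, 4))    # -> ~0.63, a lot of silicon idling
```

The gap between 0.93 and 0.63 is the difference between multiple GPUs being worth the money and not.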
Using Multiple GPUs Without Parallelism
Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. You gain no speedup, but you get more information about performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
This is psychologically important if you want to learn deep learning. The shorter the interval between performing a task and receiving feedback on it, the better the brain is able to integrate the relevant memory pieces for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets, you will more quickly get a feel for what it takes to perform well; you will more readily be able to detect patterns in the cross-validation error and interpret them correctly. You will be able to detect patterns that hint at which parameter or layer needs to be added, removed, or adjusted.
So overall, one can say that a single GPU should be sufficient for almost any task, but that multiple GPUs are becoming more and more important for accelerating deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly. I personally would rather have many small GPUs than one big one, even for my research experiments.
So what kind of accelerator should I get? NVIDIA GPU, AMD GPU, or Intel Xeon Phi?
NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are simply no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries become available in the future, I would stick with NVIDIA: the GPU computing (GPGPU) community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid programming advice are readily available.
Additionally, NVIDIA went all-in on deep learning when it was still in its infancy. This bet paid off. While other companies now put money and effort behind deep learning, they are still far behind due to their late start. Currently, using any software-hardware combination for deep learning other than NVIDIA-CUDA will lead to major frustrations.
In the case of Intel’s Xeon Phi, it is advertised that you can take standard C code and easily transform it into accelerated Xeon Phi code. This feature might sound quite interesting, because you might think you can rely on the vast resources of existing C code. However, in reality only very small portions of C code are supported, so this feature is not really useful, and most of the C code you will be able to run will be slow.
I worked on a Xeon Phi cluster with over 500 Xeon Phis, and the frustrations with it were endless. I could not run my unit tests because the Xeon Phi MKL is not compatible with Python's NumPy; I had to refactor large portions of code because the Intel Xeon Phi compiler is unable to make proper reductions for templates — for example for switch statements; and I had to change my C interface because some C++11 features are simply not supported by the Intel Xeon Phi compiler. All this led to frustrating refactorings that I had to perform without unit tests. It took ages. It was hell.
And then, when my code finally executed, everything ran very slowly. There are bugs(?), or perhaps problems in the thread scheduler(?), that cripple performance if the tensor sizes you operate on change in succession. For example, if you have differently sized fully connected layers or dropout layers, the Xeon Phi is slower than the CPU. I replicated this behavior in an isolated matrix-matrix multiplication example and sent it to Intel. I never heard back from them. So stay away from Xeon Phis if you want to do deep learning!
Fastest GPU for a given budget
Your first question might be: what is the most important feature for fast GPU performance in deep learning? Is it CUDA cores? Clock speed? RAM size?
It is none of these; the most important feature for deep learning performance is memory bandwidth.
In short: GPUs are optimized for memory bandwidth at the cost of memory access time (latency). CPUs are designed for the exact opposite: they can do quick computations when only small amounts of memory are involved, for example multiplying a few numbers (3*6*9), but they are slow for operations on large amounts of memory, like matrix multiplication (A*B*C). Thanks to their memory bandwidth, GPUs excel at problems that involve large amounts of memory. Of course there are more intricate differences between GPUs and CPUs, and if you are interested in why GPUs are such a good match for deep learning, you can read more in my Quora answer on this very question.
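A back-of-the-envelope calculation makes the bandwidth argument concrete: for large tensors, the time just to stream the data through memory is a hard lower bound on compute time. The bandwidth figures below are illustrative assumptions (a typical desktop CPU and a GTX 1080), not measurements:

```python
def transfer_time_ms(n_bytes, bandwidth_gb_s):
    """Lower bound on the time to stream n_bytes once at a given bandwidth."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3

# a 4096x4096 float32 matrix occupies 64 MiB
n_bytes = 4096 * 4096 * 4

print(transfer_time_ms(n_bytes, 50))   # ~1.3 ms at an assumed 50 GB/s (CPU)
print(transfer_time_ms(n_bytes, 320))  # ~0.2 ms at 320 GB/s (GTX 1080)
```

For a memory-bound operation, that roughly 6x bandwidth gap translates directly into a 6x wall-clock gap, regardless of how many arithmetic units either chip has.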
So if you want to buy a fast GPU, first and foremost look at the bandwidth of that GPU.
Evaluating GPUs via Their Memory Bandwidth
Bandwidth can be compared directly within an architecture: the performance of Pascal cards like the GTX 1080 and GTX 1070, for example, can be compared by looking at their memory bandwidth alone. A GTX 1080 (320 GB/s) is about 25% (320/256) faster than a GTX 1070 (256 GB/s). Across architectures, however (for example Pascal vs. Maxwell, like GTX 1080 vs. GTX Titan X), cards cannot be compared directly, because different architectures with different fabrication processes (in nanometers) utilize the given memory bandwidth differently. This makes everything a bit tricky, but overall, bandwidth alone will give you a good overview of roughly how fast a GPU is. To determine the fastest GPU for a given budget, you can use this Wikipedia page and look at bandwidth in GB/s; the listed prices are quite accurate for newer cards (900 and 1000 series), but older cards are significantly cheaper than the listed prices, especially if you buy them via eBay. For example, a regular GTX Titan X goes for around $550 on eBay.
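The within-architecture comparison above boils down to a single division. A minimal sketch, with bandwidth values taken from the spec tables for three Pascal cards (the GTX 1060 figure is for the 6GB variant):

```python
# memory bandwidth in GB/s, same (Pascal) architecture only
bandwidth = {"GTX 1080": 320, "GTX 1070": 256, "GTX 1060": 192}

def relative_speed(card_a, card_b):
    """Rough speed ratio of card_a over card_b.
    Only valid for cards of the SAME architecture."""
    return bandwidth[card_a] / bandwidth[card_b]

print(relative_speed("GTX 1080", "GTX 1070"))  # 1.25 -> ~25% faster
print(relative_speed("GTX 1070", "GTX 1060"))  # ~1.33
```

Remember that this ratio is meaningless across architectures: a Pascal card extracts more useful work per GB/s than a Maxwell card does.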
Another important factor to consider, however, is that not all architectures are compatible with cuDNN. Since almost all deep learning libraries use cuDNN for convolutional operations, this restricts the choice to Kepler GPUs or better, that is, the GTX 600 series or above. On top of that, Kepler GPUs are generally quite slow. So you should prefer GPUs of the 900 or 1000 series for good performance.
To give a rough estimate of how the cards perform relative to each other on deep learning tasks, I constructed a simple chart of GPU equivalence. How should you read it? For example, one GTX 980 is as fast as 0.35 Titan X Pascal; in other terms, a Titan X Pascal is almost three times faster than a GTX 980.
Please note that I do not own all of these cards myself and did not run deep learning benchmarks on all of them. The comparisons are derived from the cards' specs together with compute benchmarks (certain cryptocurrency mining tasks are computationally comparable to deep learning). So these are rough estimates. The real numbers could differ a little, but generally the error should be minimal and the ordering of the cards should be correct. Also note that small networks which under-utilize the GPU will make larger GPUs look bad. For example, a small LSTM (128 hidden units; batch size > 64) on a GTX 1080 Ti will not be that much faster than on a GTX 1070. To see the performance differences shown in the chart, one needs to run larger networks, say an LSTM with 1024 hidden units (and batch size > 64). This is also important to keep in mind when choosing the GPU that is right for you.
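The under-utilization point is easy to quantify. Using the standard approximation for the work of one LSTM timestep (four gates, each a matrix multiply against the concatenated input and hidden state, and ignoring the cheap element-wise operations), the small and large LSTMs from the paragraph above differ in workload by well over an order of magnitude:

```python
def lstm_flops_per_step(batch, hidden, inputs):
    """Approximate multiply-add FLOPs for one LSTM timestep:
    4 gates, each computing a (hidden x (hidden + inputs)) matmul per sample.
    Element-wise gate operations are ignored as negligible."""
    return 2 * batch * 4 * hidden * (hidden + inputs)

small = lstm_flops_per_step(batch=64, hidden=128, inputs=128)
large = lstm_flops_per_step(batch=64, hidden=1024, inputs=1024)
print(large / small)  # 64.0
```

The small LSTM offers the GPU 64 times less work per step, so a big card spends most of its time idle and looks barely faster than a mid-range one.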
Generally, I would recommend the GTX 1080 Ti or GTX 1070. They are both excellent cards, and if you have the money for a GTX 1080 Ti, you should go ahead with that. The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X (Maxwell). Both cards should be preferred over the GTX 980 Ti due to their larger memories of 11GB and 8GB (instead of 6GB).
A memory of 8GB might seem a bit small, but for many tasks it is more than sufficient. For Kaggle competitions, most image datasets, deep style, and natural language understanding tasks, for example, you will encounter few problems.
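To get a feel for whether 8GB is enough for a given model, a back-of-the-envelope memory estimate helps. The sketch below counts weights plus gradients plus optimizer state, and the stored activations for the backward pass; the parameter count, batch size, and activations-per-sample figures are illustrative assumptions, not measurements of any particular network:

```python
def model_memory_gb(n_params, batch, act_floats_per_sample,
                    bytes_per_float=4, weight_copies=3):
    """Very rough training-memory estimate.
    weight_copies=3 accounts for weights + gradients + optimizer state."""
    weights = n_params * bytes_per_float * weight_copies
    activations = batch * act_floats_per_sample * bytes_per_float
    return (weights + activations) / 1e9

# assumed example: a 60M-parameter conv net, batch size 128,
# ~10M activation floats stored per image
print(model_memory_gb(60e6, 128, 10e6))  # ~5.8 GB, fits into 8GB
```

Note that activations, not weights, dominate the total here, which is why shrinking the batch size is usually the first fix when a model does not fit.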
The GTX 1060 is the best entry GPU if you want to try deep learning for the first time, or if you want to use it occasionally for Kaggle competitions. I would not recommend the GTX 1060 variant with 3GB of memory, since even the other variant's 6GB can be quite limiting. However, for many applications 6GB is sufficient. The GTX 1060 is slower than a regular Titan X, but it is comparable to the GTX 980 in both performance and eBay price.
In terms of bang for buck, the 10 series is quite well designed. The GTX 1060, GTX 1070, and GTX 1080 Ti stand out. The GTX 1060 is for beginners, the GTX 1070 is a versatile option for startups and some parts of research and industry, and the GTX 1080 Ti stands solid as an all-around high-end option.
I generally would not recommend the NVIDIA Titan X (Pascal), as it is too pricey for its performance; go instead with a GTX 1080 Ti. However, the NVIDIA Titan X (Pascal) still has its place among computer vision researchers who work on large datasets or video data. In these domains every GB of memory counts, and the NVIDIA Titan X has 1GB more than the GTX 1080 Ti, giving it an advantage here. However, a better option in terms of bang for buck is a GTX Titan X (Maxwell) from eBay: a bit slower, but it also sports a big 12GB of memory.
However, most researchers do well with a GTX 1080 Ti. The one extra GB of memory is not needed for most research and most applications.
I personally would go with multiple GTX 1070s for research. I would rather run a few more experiments that are a bit slower than just one experiment that is faster. In NLP the memory constraints are not as tight as in computer vision, so a GTX 1070 is just fine for me. The tasks I work on and how I run my experiments determine the best choice for me, which is a GTX 1070.
You should reason in a similar fashion when you choose your GPU. Think about what tasks you work on and how you run your experiments and then try to find a GPU which suits these requirements.
The options are now more limited for people who have very little money for a GPU. GPU instances on Amazon Web Services are now quite expensive and slow, and they no longer pose a good option if you have less money. I do not recommend a GTX 970: it is slow, still rather expensive even when bought used ($150 on eBay), and there are memory problems associated with the card to boot. Instead, try to get the additional money together to buy a GTX 1060, which is faster, has more memory, and has no memory problems. If you just cannot afford a GTX 1060, I would go with a GTX 1050 Ti with 4GB of RAM. The 4GB can be limiting, but you will be able to play around with deep learning, and with some adjustments to your models you can get good performance. A GTX 1050 Ti would be suitable for most Kaggle competitions, although it might limit your competitiveness in some of them.
Amazon Web Services (AWS) GPU instances
In the previous version of this blog post I recommended AWS GPU spot instances, but I would no longer recommend this option. The GPUs on AWS are now rather slow (one GTX 1080 is four times faster than an AWS GPU), and prices have shot up dramatically in recent months. It now again seems much more sensible to buy your own GPU.
With all the information in this article, you should be able to reason about which GPU to choose by balancing the required memory size, the bandwidth in GB/s for speed, and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is: get a GTX 1080 Ti or GTX 1070 if you can afford them; a GTX 1060 if you are just starting out with deep learning or are constrained by money; if you have very little money, try to afford a GTX 1050 Ti; and if you are a computer vision researcher, you might want to get a Titan X Pascal (or stick to existing GTX Titan Xs).
Best GPU overall: Titan X Pascal and GTX 1080 Ti
Cost efficient but expensive: GTX 1080 Ti, GTX 1070
Cost efficient and cheap: GTX 1060
I work with data sets > 250GB: Regular GTX Titan X or Titan X Pascal
I have little money: GTX 1060
I have almost no money: GTX 1050 Ti
I do Kaggle: GTX 1060 for any “normal” competition, or GTX 1080 Ti for “deep learning competitions”
I am a competitive computer vision researcher: Titan X Pascal or regular GTX Titan X
I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 might also be a solid choice — check the memory requirements of your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs
I want to thankfor helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.
[Image source: NVIDIA CUDA/C Programming Guide]