Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. Without a GPU, you might wait months for an experiment to finish, or run an experiment for a day or more only to see that the chosen parameters were off and the model diverged. With a good, solid GPU, you can quickly iterate over designs and parameters of deep networks, and run experiments in days instead of months, hours instead of days, and minutes instead of hours. Making the right choice when buying a GPU is therefore critical. So how do you select the GPU that is right for you? This blog post will delve into that question and offer advice to help you make a choice that is right for you.
Having a fast GPU is very important when one begins to learn deep learning, as it allows for the rapid gain of practical experience that is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback, it just takes too much time to learn from one’s mistakes, and it can be discouraging and frustrating to continue with deep learning. With GPUs, I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition using a deep learning approach, where the task was to predict weather ratings for a given tweet. In the competition, I used a rather large two-layer deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. The GTX Titan GPUs that powered me in the competition were a main factor in my reaching 2nd place.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks on multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks, like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition, received almost no speedup.
I analyzed parallelization in deep learning in depth, developed a technique to increase the speedup in GPU clusters from 23x to 50x for a system of 96 GPUs, and published my research at ICLR 2016. In my analysis, I also found that convolutional and recurrent networks are rather easy to parallelize, especially if you use only a single computer with up to 4 GPUs. So while modern tools are not highly optimized for parallelism, you can still attain good speedups.
The user experience of using parallelization techniques in the most popular frameworks is also pretty good now compared to three years ago. Their algorithms are rather naive and will not scale to GPU clusters, but they deliver good performance for up to 4 GPUs. For convolution, you can expect a speedup of 1.9x/2.8x/3.5x for 2/3/4 GPUs; for recurrent networks, the sequence length is the most important parameter, and for common NLP problems one can expect similar or slightly worse speedups than for convolutional networks. Fully connected networks usually have poor performance for data parallelism, and more advanced algorithms are necessary to accelerate these parts of the network.
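To put these numbers in perspective, a quick back-of-the-envelope calculation turns the quoted speedups into parallel efficiency (speedup divided by GPU count) and makes the diminishing returns visible:

```python
# Parallel efficiency = speedup / number of GPUs. The speedups below are the
# convolution numbers quoted in the text for 2, 3, and 4 GPUs.
speedups = {2: 1.9, 3: 2.8, 4: 3.5}

for n_gpus, speedup in speedups.items():
    efficiency = speedup / n_gpus
    print(f"{n_gpus} GPUs: {speedup}x speedup, {efficiency:.0%} efficiency")
```

Each additional GPU pays off a little less than the previous one, which is why scaling beyond 4 GPUs requires the more sophisticated algorithms mentioned above.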
So today, using multiple GPUs can make training much more convenient due to the increased speed, and if you have the money for it, multiple GPUs make a lot of sense.
Using Multiple GPUs Without Parallelism
Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. You gain no speedup, but you get more information about performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
This is psychologically important if you want to learn deep learning. The shorter the intervals for performing a task and receiving feedback for that task, the better the brain is able to integrate relevant memory pieces for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets, you will more quickly get a feel for what is important to perform well; you will more readily be able to detect patterns in the cross-validation error and interpret them correctly. You will be able to detect patterns which give you hints on what parameter or layer needs to be added, removed, or adjusted.
I personally think using multiple GPUs in this way is more useful as one can quickly search for a good configuration. Once one has found a good range of parameters or architectures one can then use parallelism across multiple GPUs to train the final network.
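In practice, running one experiment per GPU usually means pinning each training process to a single device via the standard CUDA_VISIBLE_DEVICES environment variable. The sketch below shows the idea; the experiment configs and the `echo` command are placeholders standing in for a real training script:

```python
# Launch one experiment per GPU by restricting each process to one device
# with CUDA_VISIBLE_DEVICES (the standard mechanism for CUDA programs).
# The configs and the echo command are placeholders for a real training run.
import os
import subprocess

configs = ["lr0.01.json", "lr0.001.json"]  # hypothetical experiment configs

procs = []
for gpu_id, cfg in enumerate(configs):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    # Substitute your actual training command here:
    procs.append(subprocess.Popen(["echo", f"GPU {gpu_id}: {cfg}"], env=env))

for p in procs:
    p.wait()
```

Each process then sees only its assigned card as device 0, so the training code itself needs no changes to support this workflow.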
So overall, one can say that one GPU should be sufficient for almost any task, but that multiple GPUs are becoming more and more important for accelerating deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly. I personally would rather have many small GPUs than one big one, even for my research experiments.
NVIDIA vs AMD vs Intel vs Google vs Amazon
NVIDIA: The Leader
NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. This early advantage, combined with strong community support from NVIDIA, increased the size of the CUDA community rapidly. This means that if you use NVIDIA GPUs, you will easily find support if something goes wrong, you will find advice if you program CUDA yourself, and you will find that most deep learning libraries have the best support for NVIDIA GPUs. This is a very strong point for NVIDIA GPUs.
On the other hand, NVIDIA now has a policy that the use of CUDA in data centers is only allowed for Tesla GPUs, not GTX or RTX cards. It is not clear what is meant by “data centers”, but it means that organizations and universities are often forced to buy the expensive and cost-inefficient Tesla GPUs for fear of legal issues. However, Tesla cards have no real advantage over GTX and RTX cards and cost up to 10 times as much.
That NVIDIA can just do this without any major hurdles shows the power of their monopoly: they can do as they please and we have to accept the terms. If you take the major advantages that NVIDIA GPUs have in terms of community and support, you will also need to accept that you can be pushed around at will.
AMD: Powerful But Lacking Support
HIP via ROCm unifies NVIDIA and AMD GPUs under a common programming language which is compiled into the respective GPU language before it is compiled to GPU assembly. If we had all our GPU code in HIP, this would be a major milestone, but this is rather difficult because the TensorFlow and PyTorch code bases are hard to port. TensorFlow has some support for AMD GPUs, and all major networks can be run on them, but if you want to develop new networks, some details might be missing, which could prevent you from implementing what you need. The ROCm community is also not very large, so it is not straightforward to fix issues quickly. On top of that, AMD does not seem to allocate much money for deep learning development and support, which slows the momentum.
Overall, I still cannot give a clear recommendation for AMD GPUs for ordinary users that just want their GPUs to work smoothly. More experienced users should have fewer problems, and by supporting AMD GPUs and ROCm/HIP developers they contribute to the fight against NVIDIA's monopoly position, which will greatly benefit everyone in the long term. If you are a GPU developer and want to make important contributions to GPU computing, then an AMD GPU might be the best way to have a good impact over the long term. For everyone else, NVIDIA GPUs are the safer choice.
Intel: Trying Hard
My personal experience with Intel's Xeon Phi has been very disappointing, and I do not see it as a real competitor to NVIDIA or AMD cards, so I will keep it short: if you decide to go with a Xeon Phi, take note that you might encounter poor support, computing issues that make code sections slower than on CPUs, difficulty writing optimized code, no full support for C++11 features, a compiler that does not support some important GPU design patterns, poor compatibility with other libraries that rely on BLAS routines (NumPy and SciPy), and probably many other frustrations that I have not run into.
I was really looking forward to the Intel Nervana neural network processor (NNP) because its specs would be extremely powerful in the hands of a GPU developer and would allow for novel algorithms that might redefine how neural networks are used, but it has been delayed endlessly and there are rumors that large portions of the development team jumped ship. The NNP is planned for Q3/Q4 2019. If you want to wait that long, keep in mind that good hardware is not everything, as we can see from AMD and from Intel's own Xeon Phi. It might well be into 2020 until the NNP is usable in a mature way.
Google: Cheaper On-Demand Processing?
The Google TPU has developed into a very mature cloud-based product that is extremely cost-efficient. The easiest way to make sense of the TPU is by seeing it as multiple GPUs packaged together. If we look at performance measures of the Tensor-Core-enabled V100 versus the TPUv2, we find that both systems have nearly the same performance for ResNet50. However, the Google TPU is more cost-efficient.
So is the TPU a cost-efficient cloud-based solution? Yes and no. On paper and for regular use it is more cost-efficient. However, if you use best practices and guidelines such as those used by the fast.ai team and fastai library, you can achieve faster convergence at a lower price — at least for convolutional networks for object recognition.
With the same software, the TPU could be even more cost-efficient, but herein also lies the problem: (1) TPUs cannot be used with the fastai library, that is, with PyTorch; (2) TPU algorithms rely mostly on the internal Google team; (3) no uniform high-level library exists which enforces good standards for TensorFlow.
All three points hurt the TPU, as it requires separate software to keep up with new additions to the deep learning algorithm family. I am sure the grunt work has already been done by the Google team, but it is unclear how good the support is for some models. The official repository, for example, has only a single model for NLP, with the rest being computer vision models; all models use convolutions and none of them recurrent neural networks. This fits with a now rather old report from February that the TPUv2 did not converge when LSTMs were used. I could not find a source confirming whether the problem has been fixed yet, but it is very likely that software support will improve quickly over time and that costs will come down further, making TPUs an attractive option. Currently, though, TPUs seem best used for computer vision and as a supplement to other compute resources rather than as a main deep learning resource.
Amazon: Reliable but Expensive.
A lot of new GPUs have been added to AWS since the last update of this blog post. However, the prices are still a bit high. AWS GPU instances can be a very useful solution if additional compute is needed suddenly, for example when all GPUs are in use as is common before research paper deadlines.
However, to be cost-efficient, one should make sure to run only a few networks and to know with good certainty that the parameters chosen for the training run are near-optimal. Otherwise, the cost will cut quite deep into your pocket and a dedicated GPU might be more useful. Even if a fast AWS GPU is tempting, a solid GTX 1070 or better will provide good compute performance for a year or two without costing too much.
So AWS GPU instances are very useful but they need to be used wisely and with caution to be cost-efficient. For more discussion on cloud computing see the section below.
What Makes One GPU Faster Than Another?
Your first question might be what is the most important feature for fast GPU performance for deep learning: Is it CUDA cores? Clock speed? RAM size?
While “pay attention to the memory bandwidth” would once have been good simplified advice, I would no longer recommend it. This is because GPU hardware and software have developed over the years in such a way that bandwidth is no longer a good proxy for a GPU's performance. The introduction of Tensor Cores in consumer-grade GPUs complicates the issue further. Now a combination of bandwidth, FLOPS, and Tensor Cores is the best indicator of a GPU's performance.
One way to deepen your understanding and make an informed choice is to learn a bit about which parts of the hardware make GPUs fast for the two most important tensor operations: matrix multiplication and convolution.
A simple and effective way to think about matrix multiplication is that it is bandwidth-bound. That is, memory bandwidth is the most important feature of a GPU if you want to use LSTMs and other recurrent networks that do lots of matrix multiplications.

Similarly, convolution is bound by computation speed. Thus, TFLOPS on a GPU is the best indicator for the performance of ResNets and other convolutional architectures.
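One way to make the bandwidth-bound versus compute-bound distinction concrete is a rough roofline estimate: compare an operation's arithmetic intensity (FLOPs per byte of memory traffic) to the GPU's compute-to-bandwidth ratio. The hardware figures below are illustrative round numbers, not the specs of any particular card:

```python
# Roofline-style estimate: an operation is bandwidth-bound when its
# arithmetic intensity (FLOPs per byte of memory traffic) is below the
# GPU's compute-to-bandwidth ratio, and compute-bound above it.

def matmul_intensity(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k                                # multiply-adds
    traffic = (m * k + k * n + m * n) * bytes_per_elem   # read A and B, write C
    return flops / traffic

# Illustrative GPU figures, not exact specs: ~13 TFLOPS and ~480 GB/s
ridge = 13e12 / 480e9  # FLOPs per byte needed to keep the compute units busy

lstm_like = matmul_intensity(32, 1024, 1024)   # small batch, as in RNN steps
big = matmul_intensity(4096, 4096, 4096)       # large square matmul

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"batch-32 matmul: {lstm_like:.0f} FLOPs/byte (bandwidth-bound)")
print(f"4096^3 matmul: {big:.0f} FLOPs/byte (compute-bound)")
```

A small-batch matrix multiplication, as in an LSTM time step, falls below the ridge point and is limited by bandwidth, while a large matrix multiplication sits far above it and is limited by raw compute, which is why TFLOPS matters most for convolutional architectures.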
Tensor Cores change the equation slightly. They are very straightforward specialized compute units which speed up computation, but not memory bandwidth, and thus the largest benefit is seen for convolutional nets, which are about 30% to 100% faster with Tensor Cores.
While Tensor Cores per se only make the computation faster, they also enable computation with 16-bit numbers. This is a big advantage for matrix multiplication as well, because with numbers being only 16 bits instead of 32 bits wide, one can transfer twice as many numbers with the same memory bandwidth. This size reduction is also particularly important for fitting more numbers into the L1 cache, which increases the speedup even further the larger the matrices in the matrix multiplication are. Generally, one can hope for speedups of about 20% to 60% for LSTMs using Tensor Cores.
Note that this speedup comes not from the Tensor Cores per se, but from their ability to do 16-bit computations. A 16-bit matrix multiplication algorithm on an AMD GPU will be as fast as one on an NVIDIA card with Tensor Cores.
A big problem with Tensor Cores is that they require 16-bit floating-point input data, and this might introduce some software support problems, as networks usually use 32-bit values. Without 16-bit inputs, Tensor Cores are useless. However, I think these issues will be resolved quickly, since Tensor Cores are too powerful to remain unused, and now that they are available in consumer-grade GPUs, more and more people will use them. Note that with the introduction of 16-bit deep learning, we also virtually double the memory of GPUs, since twice as many parameters fit into the same memory.
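As a back-of-the-envelope illustration of this virtual memory doubling, here is the weight footprint of a hypothetical fully connected network at 32-bit versus 16-bit precision; the layer widths are made up for the example:

```python
# Parameter memory of a hypothetical fully connected network at different
# precisions: halving the bits per weight halves the footprint, so the same
# GPU memory holds roughly twice the parameters in 16-bit mode.

layer_sizes = [4096, 4096, 2048, 1000]  # made-up layer widths

def param_count(sizes):
    # weights + biases for each consecutive pair of layers
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

params = param_count(layer_sizes)
fp32_gb = params * 4 / 1024**3  # 4 bytes per 32-bit weight
fp16_gb = params * 2 / 1024**3  # 2 bytes per 16-bit weight
print(f"{params:,} parameters: {fp32_gb:.3f} GB in fp32, {fp16_gb:.3f} GB in fp16")
```

The same reasoning extends to activations and gradients, which is why 16-bit training effectively doubles the usable memory of a card.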
So overall, the best rule of thumb is: look at bandwidth if you use RNNs; look at FLOPS if you use convolution; get Tensor Cores if you can afford them (but do not buy Tesla cards unless you have to).
Cost Efficiency Analysis
The cost-efficiency of a GPU is probably the most important criterion for selecting one. I did a new cost/performance analysis that incorporates memory bandwidth, TFLOPs, and Tensor Cores. I looked at prices on eBay and Amazon and weighted them 50:50, then looked at performance indicators for LSTMs and CNNs, with and without Tensor Cores. I took these performance numbers and combined them via a normalized geometric mean to obtain average performance ratings, with which I then calculated performance/cost numbers. This is the result:
Note that the numbers for the RTX 2080 and RTX 2080 Ti should be taken with a grain of salt, since no hard performance numbers existed. I estimated performance according to a roofline model of matrix multiplication and convolution for this hardware, together with Tensor Core benchmarks from the V100 and Titan V. The RTX 2070 is missing completely, since no hardware specs exist at this point. Note that the RTX 2070 might easily beat the other two RTX cards in cost-efficiency, but I have no data to back this up.
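The weighting scheme described above can be sketched roughly as follows; all performance scores and prices here are placeholders, not the numbers used in the actual analysis:

```python
# Sketch of the cost-efficiency ranking method: combine per-workload
# performance scores with a geometric mean, then divide by a 50:50 blend
# of two price sources. All numbers below are made-up placeholders.
from math import prod

# hypothetical scores per card: (LSTM perf, CNN perf, CNN with Tensor Cores)
cards = {
    "card_a": (1.0, 1.0, 1.0),
    "card_b": (1.4, 1.6, 2.2),
}
# hypothetical prices per card: (eBay listing, Amazon listing)
prices = {"card_a": (500, 550), "card_b": (800, 850)}

def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

for name, scores in cards.items():
    perf = geo_mean(scores)
    price = 0.5 * prices[name][0] + 0.5 * prices[name][1]
    print(f"{name}: performance/cost = {perf / price:.5f}")
```

A geometric mean is a sensible choice here because it rewards cards that are balanced across workloads rather than excellent at one and poor at another.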
I will update these performance and price numbers next month when the RTX 2080 and RTX 2080 Ti are released. By then there should be enough information about the deep learning performance of these cards and further information on the RTX 2070 should be available to make accurate estimates.
From the preliminary data, we see that the RTX 2080 is more cost-efficient than the RTX 2080 Ti. The RTX 2080 Ti has about 40% more Tensor Cores and bandwidth at a 50% higher price, but this does not increase performance by 40%. For LSTMs and other RNNs, the performance gain from the GTX 10 series to the RTX 20 series comes mostly from the ability to do 16-bit floating-point computation rather than from the Tensor Cores themselves. While convolution should theoretically scale linearly with Tensor Cores, we do not see this in the performance numbers. This indicates that other parts of the convolutional architecture that cannot be assisted by Tensor Cores make a significant contribution to the overall computational requirements. Thus the RTX 2080 is more cost-efficient because it has all the features needed to deliver better performance than the GTX 10 series (GDDR6 + Tensor Cores) while also being cheaper than the RTX 2080 Ti.
Additionally, note that there are some problems with this analysis and one needs to be careful when interpreting this data: (1) If you buy cost-efficient but slow cards, then at some point your computer might no longer have space for more GPUs and thus resources are wasted. Thus this chart is biased against expensive GPUs. To counter this bias, one should also evaluate the raw performance chart in Figure 2. (2) This performance/cost chart also assumes that you are using 16-bit computation and Tensor Cores whenever possible. This means that for 32-bit computation, RTX cards have a very poor performance/cost ratio. (3) There are rumors of large stockpiles of GTX 10 series cards which have been held back due to the nose-dive in cryptocurrencies. Thus popular crypto mining GPUs like the GTX 1080 and GTX 1070 might quickly fall in price, and their performance/cost ratio might improve quickly, rendering the RTX 20 series less favorable in terms of performance/cost. On the other hand, a large supply of RTX 20 series cards would keep their price steady and competitive. It is difficult to predict how this will turn out. (4) As mentioned before, no hard, unbiased performance numbers exist for RTX cards, so all these numbers have to be taken with a grain of salt.
So you can see that it is not easy to make the right choice. However, if you take a balanced view on all of these issues then the following recommendations are reasonable.
General GPU Recommendations
Currently, I would recommend two different main strategies: (1) buy an RTX card and keep it for 2+ years; (2) find a cheap GTX 1080/1070/1060 or GTX 1080 Ti/GTX 1070 Ti as people dump their cards on eBay, and hold such a GPU for a while until better cards are released. For example, one could wait for the RTX Titan release in 2019 Q1/Q2 and then sell and upgrade.
We have been waiting for a GPU upgrade for quite some time, and for many people the first strategy might be most suitable to get good performance now. While the RTX 2080 is more cost-efficient, the RTX 2080 Ti offers more memory, which could be a decisive factor for computer vision researchers and other memory-intensive applications. Both cards are sensible solutions. The main question is: do you need the extra memory of the RTX 2080 Ti? Remember that you would usually use this card in 16-bit mode, which virtually doubles the available memory. If you do not need that extra memory, go with the RTX 2080.
Some people want a bigger upgrade and are waiting for an RTX Titan. This could also be a good choice, since the GTX 10 series cards will likely fall in price. I would not recommend any specific GPU here since prices are too volatile — just grab whatever is cheap right now relative to what it cost in the past weeks. Note that a GTX 1060 might sometimes lack the memory and speed that you need for certain models, so if you find a cheap GTX 1060, first think about whether the speed and the 6GB of memory really fulfill your needs. Otherwise, a cheap GTX 1070, GTX 1070 Ti, GTX 1080, or GTX 1080 Ti are all excellent choices.
For startups, Kaggle competitors, and people that want to learn deep learning, I would definitely recommend cheap GTX 10 series cards. For all these application areas, a GTX 1060 can be a very cost-efficient entry solution that gets you started.
For people that want to learn deep learning quickly, multiple GTX 1060s might be perfect, and once your skills are good you can upgrade to an RTX Titan in 2019 and keep that GPU for a few years.
If you are short of money, I would recommend the GTX 1050 Ti with 4GB of memory or, if you can afford it, a GTX 1060. Note that the GTX 1050 Ti has the advantage that it does not need an additional PCIe power connector from the PSU, so you might be able to plug it into an existing computer and get started with deep learning without a PSU upgrade, saving additional money.
If you are short of money but you know that 12GB of memory is important for you, then a GTX Titan X (Pascal) from eBay is also an excellent option.
However, most researchers do well with a GTX 1080 Ti. The one extra gigabyte of memory is not needed for most research and most applications, and the GTX 1080 Ti is faster than the GTX Titan X (Pascal).
I personally will buy an RTX 2080 Ti, since an upgrade of my GTX Titan X (Maxwell) is long overdue. I need more memory for my research, so the RTX 2080 is not an option for me. I will also develop custom Tensor Core algorithms, which leaves only the RTX 2080 Ti. So the RTX 2080 Ti is the best choice for me, but that does not mean it is the best choice for you.
You should reason in a similar fashion when you choose your GPU. Think about what tasks you work on (memory requirements) and how you run your experiments (a few fast ones, or multiple slow ones, or prototype and expand to the cloud), mind the future (are future GPUs like the RTX 2070 or RTX Titan interesting to me? Are cheaper GTX 10 series cards interesting to me?), and then try to find a GPU which suits these requirements.
Deep Learning in the Cloud
Both GPU instances on AWS and TPUs in the Google Cloud are viable options for deep learning. While the TPU is a bit cheaper, it lacks the versatility and flexibility of AWS GPUs. TPUs might be the weapon of choice for training object recognition pipelines. For other workloads, AWS GPUs are a safer bet — the good thing about cloud instances is that you can switch between GPUs and TPUs at any time, or even use both at the same time.
However, mind the opportunity cost here: if you learn the skills for a smooth workflow with AWS instances, you lose time that could have been spent working on a personal GPU, and you will also not have acquired the skills to use TPUs. If you use a personal GPU, you will not have the skills to expand into more GPUs/TPUs via the cloud. If you use TPUs, you are stuck with TensorFlow, and it will not be straightforward to switch to AWS. Learning a smooth cloud workflow is expensive, and you should weigh this cost when choosing between TPUs and AWS GPUs.
Another question is when to use cloud services. If you are trying to learn deep learning or need to prototype, then a personal GPU might be the best option, since cloud instances can be pricey. However, once you have found a good deep network configuration and just want to train the final model, using data parallelism with cloud instances is a solid approach. This means that a small GPU is sufficient for prototyping, and one can rely on the power of cloud computing to scale up to larger experiments.
If you are short on money, cloud computing instances might also be a good solution, but the problem is that you pay for a lot of compute per hour when you only need a little for prototyping. In this case, one might want to prototype on a CPU and then roll out to GPU/TPU instances for a quick training run. This is not the best workflow, since prototyping on a CPU can be a big pain, but it is a cost-efficient solution.
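The personal-GPU-versus-cloud trade-off can also be framed as a simple break-even calculation; the prices below are hypothetical placeholders, not current AWS rates:

```python
# Break-even point between buying a GPU and renting cloud compute:
# the number of GPU-hours at which the cloud bill matches the purchase
# price. All prices are hypothetical placeholders.

gpu_price = 450.0   # e.g. a mid-range consumer card, in dollars
cloud_rate = 3.0    # dollars per GPU-hour for a cloud instance

break_even_hours = gpu_price / cloud_rate
print(f"Cloud becomes more expensive after {break_even_hours:.0f} GPU-hours")
print(f"That is about {break_even_hours / 24:.1f} days of continuous training")
```

If you expect to train for more hours than the break-even point over the card's useful life, a personal GPU wins; if your usage is bursty and rare, the cloud does.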
With the information in this blog post, you should be able to reason about which GPU is suitable for you. In general, I see two main strategies that make sense: first, go with an RTX 20 series GPU for a quick upgrade, or second, go with a cheap GTX 10 series GPU and upgrade once the RTX Titan becomes available. If you are less serious about performance, or just do not need it, for example for Kaggle, startups, prototyping, or learning deep learning, you can also benefit greatly from cheap GTX 10 series GPUs. If you go for a GTX 10 series GPU, make sure its memory size fulfills your requirements.
Best GPU overall: RTX 2080 Ti
Cost-efficient but expensive: RTX 2080, GTX 1080
Cost-efficient and cheap: GTX 1070, GTX 1070 Ti, GTX 1060
I work with datasets > 250GB: RTX 2080 Ti or RTX 2080
I have little money: GTX 1060 (6GB)
I have almost no money: GTX 1050 Ti (4GB) or CPU (prototyping) + AWS/TPU (training)
I do Kaggle: GTX 1060 (6GB) for prototyping, AWS for final training; use fastai library
I am a competitive computer vision researcher: RTX 2080 Ti; upgrade to RTX Titan in 2019
I am a researcher: RTX 2080 Ti or GTX 10XX -> RTX Titan — check the memory requirements of your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060 (6GB) or a cheap GTX 1070 or GTX 1070 Ti if you can find one. Depending on what area you choose next (startup, Kaggle, research, applied deep learning) sell your GPU and buy something more appropriate
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)
Update 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
Update 2017-04-09: Added cost efficiency analysis; updated recommendation with NVIDIA Titan Xp
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs
I want to thank for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.