Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores? How do you make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and offer advice that will help you make a choice that is right for you.
This blog post is designed to give you different levels of understanding of GPUs and the new Ampere series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast, and what is unique about the new NVIDIA RTX 30 Ampere series, you can skip right to the performance and performance per dollar charts and the recommendation section. These form the core of the blog post and the most valuable content.
(2) If you have specific questions, I have addressed the most common questions and misconceptions in the later part of the blog post.
(3) If you want to get an in-depth understanding of how GPUs and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.
I will head each major section with a small summary, which might help you to decide if you want to read the section or not.
Overview
This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. Then I will make theoretical estimates for GPU performance and align them with some marketing benchmarks from NVIDIA to get reliable, unbiased performance data. I discuss the unique features of the new NVIDIA RTX 30 Ampere GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for 1-2, 4, 8 GPU setups, and GPU clusters. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.
How do GPUs work?
If you use GPUs frequently, it is useful to understand how they work. This knowledge will come in handy in understanding why GPUs might be slow in some cases and fast in others. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for how GPUs work is my Quora answer on why GPUs are well-suited to deep learning.
That answer explains at a high level quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.
The Most Important GPU Specs for Deep Learning Processing Speed
This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself.
Tensor Cores
Summary:
- Tensor Cores reduce the cycles needed for calculating multiply and addition operations 16-fold; in my example, for a 32×32 matrix, from 128 cycles to 8 cycles.
- Tensor Cores reduce the reliance on repetitive shared memory access, thus saving additional cycles for memory access.
- Tensor Cores are so fast that computation is no longer a bottleneck. The only bottleneck is getting data to the Tensor Cores.
There are now enough cheap GPUs that almost everyone can afford a GPU with Tensor Cores. That is why I only recommend GPUs with Tensor Cores. It is useful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will use a simple example of A*B=C matrix multiplication, where all matrices have a size of 32×32, to show what the computational pattern looks like with and without Tensor Cores. This is a simplified example, and not exactly how a high-performance matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.
To understand this example fully, you have to understand the concept of cycles. If a processor runs at 1 GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. This creates a pipeline: before one operation can start, it needs to wait for as many cycles as it takes the previous operation to finish. This is also called the latency of the operation.
Here are some important cycle timings or latencies for operations:
- Global memory access (up to 48GB): ~200 cycles
- Shared memory access (up to 164 kB per Streaming Multiprocessor): ~20 cycles
- Fused multiplication and addition (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
Furthermore, you should know that the smallest unit of threads on a GPU is a pack of 32 threads — this is called a warp. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.
For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.
Matrix multiplication without Tensor Cores
If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about ten times lower (200 cycles vs 20 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading the two 32×32 matrices of floats into shared memory tiles can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.
To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA), and then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 20 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:
200 cycles (global memory) + 8*20 cycles (shared memory) + 8*4 cycles (FFMA) = 392 cycles
Let’s look at the cycle cost of using Tensor Cores.
Matrix multiplication with Tensor Cores
With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with 1 memory transfer (20 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Core matrix multiplication, in this case, is:
200 cycles (global memory) + 20 cycles (shared memory) + 1 cycle (Tensor Core) = 221 cycles.
Thus we reduce the matrix multiplication cost significantly from 392 cycles to 221 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.
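To make the arithmetic explicit, here is a tiny Python sketch of this simplified cycle-cost model. The latency constants are the rough figures from the list at the beginning of this section; treat it as an intuition aid, not a benchmark.

```python
# Cycle-cost model for the simplified 32x32 example above.
GLOBAL_MEMORY = 200  # cycles per global memory access
SHARED_MEMORY = 20   # cycles per shared memory access
FFMA = 4             # cycles per fused multiply-and-accumulate
TENSOR_CORE = 1      # cycles per Tensor Core 4x4 matrix multiply

without_tc = GLOBAL_MEMORY + 8 * SHARED_MEMORY + 8 * FFMA         # = 392 cycles
with_tc = GLOBAL_MEMORY + 1 * SHARED_MEMORY + 1 * TENSOR_CORE     # = 221 cycles

print(without_tc, with_tc)             # 392 221
print(round(without_tc / with_tc, 2))  # ~1.77x fewer cycles with Tensor Cores
```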
While this example roughly follows the sequence of computational steps for both with and without Tensor Cores, please note that this is a very simplified example. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.
However, I believe from this example, it is also clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory access is by far the largest portion of the cycle cost for matrix multiplication with Tensor Cores, we would have even faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).
Memory Bandwidth
From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during BERT Large training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 30%, meaning that 70% of the time, Tensor Cores are idle.
This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, the A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of the speedup of an A100 vs V100 is 1555/900 = 1.73x.
Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. Shared memory, L1 Cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is. As such, we need to separate the matrix into smaller matrices. We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores. A matrix memory tile in shared memory is ~10-50x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.
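To make the tiling idea concrete, here is a minimal NumPy sketch of blocked matrix multiplication. On a real GPU the inner tiles would live in shared memory and registers; here plain arrays stand in for those memory levels, so this illustrates the access pattern rather than actual GPU code.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked (tiled) matrix multiply: each pair of tiles is 'loaded' once and
    reused for tile*tile multiply-adds, which is exactly the reuse that shared
    memory and register tiles provide on a GPU."""
    n, k = A.shape
    assert k == B.shape[0]
    m = B.shape[1]
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                a_tile = A[i:i + tile, p:p + tile]  # tile of A (shared memory stand-in)
                b_tile = B[p:p + tile, j:j + tile]  # tile of B (shared memory stand-in)
                C[i:i + tile, j:j + tile] += a_tile @ b_tile
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```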
Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.
Each tile size is determined by how much memory we have per streaming multiprocessor (SM) — the equivalent of a “CPU core” on a GPU. We have the following shared memory sizes on the following architectures:
- Volta: 96 kB shared memory / 32 kB L1
- Turing: 64 kB shared memory / 32 kB L1
- Ampere: 164 kB shared memory / 32 kB L1
We see that Ampere has a much larger shared memory allowing for larger tile sizes, which reduces global memory access. Thus, Ampere can make better use of the overall memory bandwidth on the GPU memory. This improves performance by roughly 2-5%. The performance boost is particularly pronounced for huge matrices.
The Ampere Tensor Cores have another advantage in that they share more data between threads. This reduces register usage. Registers are limited to 64k per streaming multiprocessor (SM) or 255 per thread. Compared to the Volta Tensor Cores, the Ampere Tensor Cores use 3x fewer registers, allowing more Tensor Cores to be active for each shared memory tile. In other words, we can feed 3x as many Tensor Cores with the same amount of registers. However, since bandwidth is still the bottleneck, you will only see tiny increases in actual vs theoretical TFLOPS. The new Tensor Cores improve performance by roughly 1-3%.
Overall, you can see that the Ampere architecture is optimized to make the available memory bandwidth more effective by using an improved memory hierarchy: from global memory to shared memory tiles, to register tiles for Tensor Cores.
Estimating Ampere Deep Learning Performance
Summary:
- Theoretical estimates based on memory bandwidth and the improved memory hierarchy of Ampere GPUs predict a speedup of 1.78x to 1.87x.
- NVIDIA provides benchmark data for Tesla A100 and V100 GPUs. These data are biased for marketing purposes, but it is possible to build a debiased model of these data.
- Debiased benchmark data suggests that the Tesla A100 compared to the V100 is 1.70x faster for NLP and 1.45x faster for computer vision.
This section is for those who want to understand the more technical details of how I derive the performance estimates for Ampere GPUs. If you do not care about these technical aspects, it is safe to skip this section.
Theoretical Ampere Speed Estimates
Putting together the reasoning above, we would expect the difference between two Tensor-Core-equipped GPU architectures to be mostly about memory bandwidth. Additional benefits come from more shared memory / L1 cache and better register usage in Tensor Cores.
If we take the Tesla A100 GPU bandwidth vs Tesla V100 bandwidth, we get a speedup of 1555/900 = 1.73x. Additionally, I would expect a 2-5% speedup from the larger shared memory and 1-3% from the improved Tensor Cores. This puts the speedup range between 1.78x and 1.87x. With similar reasoning, you would be able to estimate the speedup of other Ampere series GPUs compared to a Tesla V100.
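In code, the same back-of-the-envelope estimate looks like this; it is just the reasoning above, with the 2-5% and 1-3% gains treated as multiplicative factors.

```python
# Theoretical A100 vs V100 speedup estimate from the reasoning above.
bandwidth_ratio = 1555 / 900        # memory bandwidth: ~1.73x
shared_memory_gain = (1.02, 1.05)   # larger shared memory / bigger tiles: +2-5%
tensor_core_gain = (1.01, 1.03)     # improved Tensor Core register usage: +1-3%

low = bandwidth_ratio * shared_memory_gain[0] * tensor_core_gain[0]
high = bandwidth_ratio * shared_memory_gain[1] * tensor_core_gain[1]
print(f"Estimated A100 over V100 speedup: {low:.2f}x to {high:.2f}x")  # ~1.78x to ~1.87x
```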
Practical Ampere Speed Estimates
Suppose we have an estimate for one GPU of a GPU architecture like Ampere, Turing, or Volta. It is then easy to extrapolate these results to other GPUs of the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and different numbers of GPUs whenever possible to favor the A100. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the A100 has more memory. Still, to compare GPU architectures, we should evaluate unbiased performance with the same batch size.
To get an unbiased estimate, we can scale the V100 and A100 results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.
As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
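A rough sketch of this debiasing in code, with the batch-size and multi-GPU correction factors from above; the raw 2.0x speedup in the example is made up for illustration, not one of NVIDIA's numbers.

```python
def debias_speedup(raw_speedup, batch_doubled=True, is_transformer=False):
    """Strip the batch-size and multi-GPU advantages out of a marketing speedup
    number. The ~13.5% and 5%/7% factors are the estimates derived above."""
    batch_gain = 1.135 if batch_doubled else 1.0       # larger batch size on the A100
    parallel_gain = 1.07 if is_transformer else 1.05   # better 8x scaling via NVLink 3.0
    return raw_speedup / (batch_gain * parallel_gain)

# Example with a hypothetical raw 2.0x transformer speedup:
print(round(debias_speedup(2.0, is_transformer=True), 2))  # -> ~1.65x
```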
Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:
- SE-ResNeXt101: 1.43x
- Mask R-CNN: 1.47x
- Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like im2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
Possible Biases in Estimates
The estimates above are for A100 vs V100. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 30 series compared to the full Ampere A100.
As of now, one of these degradations was found: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
I will update this blog post as information about further unannounced performance degradation becomes available.
Additional Considerations for Ampere / RTX 30 Series
Summary:
- Ampere allows for sparse network training, which accelerates training by a factor of up to 2x.
- Sparse network training is still rarely used but will make Ampere future-proof.
- Ampere has new low-precision data types, which make using low-precision computation much easier, but not necessarily faster than on previous GPUs.
- The new fan design is excellent if you have space between GPUs, but it is unclear if multiple GPUs with no space in-between them will be efficiently cooled.
- 3-Slot design of the RTX 3090 makes 4x GPU builds problematic. Possible solutions are 2-slot variants or the use of PCIe extenders.
- 4x RTX 3090 will need more power than any standard power supply unit on the market can provide right now.
The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as ease-of-use features since they provide the same performance boost as Turing does but without any extra programming required.
Sparse Network Training
Ampere allows for fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 elements of these 4 to be zero. Figure 1 shows what this could look like.

When you multiply this sparse weight matrix with some dense inputs, the sparse matrix Tensor Core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size, as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the Tensor Core, which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.
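To make the compression step concrete, here is a small NumPy sketch of 2:4 compression. It only illustrates the idea of storing half the values plus per-group position metadata; the actual data layout that the sparse Tensor Cores consume is different.

```python
import numpy as np

def compress_2of4(w):
    """Compress a 2:4-sparse matrix (at most 2 nonzeros per group of 4) into a
    dense half-width matrix of values plus per-group position metadata."""
    rows = w.shape[0]
    groups = w.reshape(rows, -1, 4)                      # (rows, n_groups, 4)
    nonzero = np.abs(groups) > 0
    assert np.all(nonzero.sum(axis=-1) <= 2), "matrix is not 2:4 sparse"
    # Stable sort puts nonzero positions first; keep the first two per group.
    idx = np.argsort(~nonzero, axis=-1, kind="stable")[..., :2]
    vals = np.take_along_axis(groups, idx, axis=-1)      # dense, half the width
    return vals.reshape(rows, -1), idx                   # values + 2-bit indices

# Build a random weight matrix with exactly 2 zeros in every group of 4.
w = np.random.randn(4, 16).astype(np.float32)
mask = np.zeros((w.size // 4, 4), dtype=bool)
for group in mask:
    group[np.random.choice(4, 2, replace=False)] = True
w *= mask.reshape(w.shape)

values, positions = compress_2of4(w)
print(w.shape, "->", values.shape)  # (4, 16) -> (4, 8): half the memory traffic
```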

I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.

While this feature is still experimental and training sparse networks is not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.
Low-precision Computation
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
![Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision.](https://timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png)
Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradients slip past this range, they explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling, where you multiply the loss by a scaling factor before backpropagating to keep the gradients within this range.
The Brain Float 16 format (BF16) uses more bits for the exponent, such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. So with BF16, you no longer need to do any loss scaling or worry about the gradients blowing up quickly. As such, we should see an increase in training stability with the BF16 format at the cost of a slight loss of precision.
What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TF32 precision, you get near-FP32 stability while getting speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!
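As a minimal PyTorch sketch of what this looks like in practice (assuming a reasonably recent PyTorch version and an Ampere GPU; the toy model and data are placeholders for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()                      # toy model, for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(64, 512, device="cuda")
target = torch.randn(64, 512, device="cuda")

# FP16 mixed precision: loss scaling via GradScaler guards against the limited FP16 range.
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = nn.functional.mse_loss(model(data), target)
scaler.scale(loss).backward()    # scale the loss before backpropagation
scaler.step(optimizer)
scaler.update()

# BF16 mixed precision: same exponent range as FP32, so no loss scaling needed.
optimizer.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(data), target)
loss.backward()
optimizer.step()

# TF32 is a global switch for matrix multiplies and convolutions; no other code changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```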
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
New Fan Design / Thermal Issues
The new fan design for the RTX 30 series features both a blower fan and a push/pull fan. The design is ingenious and will be very effective if you have space between GPUs. So if you have 2 GPUs and one slot space between them (+3 PCIe slots), you will be fine, and there will be no cooling issues. However, it is unclear how the GPUs will perform if you have them stacked next to each other in a setup with more than 2 GPUs. The blower fan will be able to exhaust through the bracket away from the other GPUs, but it is impossible to tell how well that works since the blower fan is of a different design than before. So my recommendation: If you want to buy 1 GPU or 2 GPUs in a 4 PCIe slot setup, then there should be no issues. However, if you’re going to use 3-4 RTX 30 GPUs next to each other, I would wait for thermal performance reports to know if you need different GPU coolers, PCIe extenders, or other solutions. I will update the blog post with this information as it becomes available.
To overcome thermal issues, water cooling will provide a solution in any case. Many vendors offer water cooling blocks for RTX 3080/RTX 3090 cards, which will keep them cool even in a 4x GPU setup. Beware of all-in-one water cooling solutions for GPUs if you want to run a 4x GPU setup, though, as it is difficult to spread out the radiators in most desktop cases.
Another solution to the cooling problem is to buy PCIe extenders and spread the GPUs within the case. This is very effective, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! It can also help if the PCIe slots on your motherboard are too close together to fit the GPUs directly. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 3090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and the cooling issue for a 4x RTX 3090 setup with a single simple solution.

3-slot Design and Power Issues
The RTX 3090 is a 3-slot GPU, so one will not be able to use it in a 4x setup with the default fan design from NVIDIA. This is kind of justified because it runs at 350W TDP, and it will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W system in the 4x RTX 3090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there is currently not a standard desktop PSU above 1600W on the market. If you get a server or cryptomining PSUs, beware of the form factor — make sure it fits into your computer case.
Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.
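A minimal sketch of how this can be done with nvidia-smi, here wrapped in Python. It requires root privileges, and the 300 W value and the four GPU indices are just the example numbers from above.

```python
import subprocess

# Cap each of the four GPUs at 300 W (nvidia-smi -pl sets the power limit in watts).
for gpu_index in range(4):
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", "300"], check=True)

# Verify the new limits.
subprocess.run(["nvidia-smi", "--query-gpu=index,power.limit", "--format=csv"], check=True)
```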

You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model, and the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
GPU Deep Learning Performance
The following benchmarks include not only the Tesla A100 vs Tesla V100 benchmarks; I also built a model that fits those data and four other benchmarks based on the Titan V, Titan RTX, RTX 2080 Ti, and RTX 2080.[1,2,3,4] In an update, I also factored in the recently discovered performance degradation in RTX 30 series GPUs. And since I wrote this blog post, we now also have the first solid benchmark for computer vision, which confirms my numbers.
Beyond this, I scaled intermediate cards like the RTX 2070, RTX 2060, or the Quadro RTX 6000 & 8000 cards by interpolating between those benchmark data points. Usually, within an architecture, GPUs scale quite linearly with respect to streaming multiprocessors and bandwidth, and my within-architecture model is based on that.
I collected only benchmark data for mixed-precision FP16 training since I believe there is no good reason why one should use FP32 training.
Compared to an RTX 2080 Ti, the RTX 3090 yields a speedup of 1.41x for convolutional networks and 1.35x for transformers while having a 15% higher release price. Thus the Ampere RTX 30 yields a substantial improvement over the Turing RTX 20 series in raw performance and is also cost-effective (if you do not have to upgrade your power supply and so forth).
GPU Deep Learning Performance per Dollar
What is the GPU that gives you the best bang for your buck? It depends on the cost of the overall system. If you have an expensive system, it makes sense to invest in more expensive GPUs.
Here I have three PCIe 3.0 builds, which I use as base costs for 2/4 GPU systems. I take these base costs and add the GPU costs on top of them. The GPU costs are the mean of the GPU’s Amazon and eBay costs. For the new Ampere GPUs, I use just the release price. Together with the performance values from above, this yields performance per dollar values for these systems of GPUs. For the 8-GPU system, I use a Supermicro barebone — the industry standard for RTX servers — as the baseline cost. Note that these bar charts do not account for memory requirements. You should think about your memory requirements first and then look for the best option in the chart. Here are some rough guidelines for memory (a rough back-of-the-envelope sketch for estimating memory needs follows the list):
- Using pretrained transformers; training a small transformer from scratch: >= 11 GB
- Training large transformers or convolutional nets in research / production: >= 24 GB
- Prototyping neural networks (either transformer or convolutional nets): >= 10 GB
- Kaggle competitions: >= 8 GB
- Applying computer vision: >= 10 GB
- Neural networks for video: 24 GB
- Reinforcement learning: >= 10 GB + a strong deep learning desktop with the largest Threadripper or EPYC CPU you can afford.
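The following is a very rough back-of-the-envelope sketch for estimating training memory from a parameter count. All of the multipliers and the activation-memory guess are my own assumptions for illustration; real requirements depend heavily on batch size, sequence length, and implementation.

```python
def training_memory_gb(params_millions, activations_gb=2.0, fp16=True, adam=True):
    """Rough rule of thumb: weights + gradients + optimizer state + a guess for
    activations. Illustrative only; not an exact requirement."""
    bytes_per_param = 2 if fp16 else 4
    weights = params_millions * 1e6 * bytes_per_param
    gradients = weights
    # Adam typically keeps two FP32 states plus an FP32 master copy of the weights.
    optimizer_state = params_millions * 1e6 * 4 * (3 if adam else 1)
    return (weights + gradients + optimizer_state) / 1e9 + activations_gb

# Example: BERT Large has roughly 340M parameters.
print(round(training_memory_gb(340), 1))  # ~7.4 GB before activations grow with batch size
```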
GPU Recommendations
The first thing I need to emphasize again: if you choose a GPU, you need to make sure that it has enough memory for what you want to do. The steps in selecting the best deep learning GPU for you should be:
- What do I want to do with the GPU(s): Kaggle competitions, machine learning, learning deep learning, hacking on small projects (GAN-fun or big language models?), doing research in computer vision / natural language processing / other domains, or something else?
- How much memory do I need for what I want to do?
- Use the cost/performance charts from above to figure out which GPU that fulfills your memory criteria is best for you.
- Are there additional caveats for the GPU that I chose? For example, if it is an RTX 3090, can I fit it into my computer? Does my power supply unit (PSU) have enough wattage to support my GPU(s)? Will heat dissipation be a problem, or can I somehow cool the GPU effectively?
Some of these details require you to self-reflect about what you want and maybe research a bit about how much memory the GPUs have that other people use for your area of interest. I can give you some guidance, but I cannot cover all areas here.
When do I need >= 11 GB of Memory?
I mentioned before that you should have at least 11 GB of memory if you work with transformers, and better yet, >= 24 GB of memory if you do research on transformers. This is so because most previous models that are pretrained have pretty steep memory requirements, and these models were trained with at least RTX 2080 Ti GPUs that have 11 GB of memory. Thus having less than 11 GB can create scenarios where it is difficult to run certain models.
Other areas that require large amounts of memory are anything in medical imaging, some state-of-the-art computer vision models, and anything with very large images (GAN, style transfer).
In general, if you seek to build models that give you the edge in competition, be it research, industry, or Kaggle competition, extra memory will provide you with a possible edge.
When is <11 GB of Memory Okay?
The RTX 3070 and RTX 3080 are mighty cards, but they lack a bit of memory. For many tasks, however, you do not need that amount of memory.
The RTX 3070 is perfect if you want to learn deep learning. This is so because the basic skills of training most architectures can be learned by just scaling them down a bit or using slightly smaller input images. If I were to learn deep learning again, I would probably roll with one RTX 3070, or even multiple if I had the money to spare.
The RTX 3080 is currently by far the most cost-efficient card and thus ideal for prototyping. For prototyping, you want the largest memory, which is still cheap. With prototyping, I mean here prototyping in any area: Research, competitive Kaggle, hacking ideas/models for a startup, experimenting with research code. For all these applications, the RTX 3080 is the best GPU.
Suppose I were to lead a research lab/startup. I would put 66-80% of my budget into RTX 3080 machines and 20-33% into “rollout” RTX 3090 machines with a robust water cooling setup. The idea is that the RTX 3080 is much more cost-effective and can be shared via a Slurm cluster setup as prototyping machines. Since prototyping should be done in an agile way, it should be done with smaller models and smaller datasets. The RTX 3080 is perfect for this. Once students/colleagues have a great prototype model, they can roll out the prototype on the RTX 3090 machines and scale to larger models.
How can I fit +24GB models into 10GB memory?
It is a bit contradictory that I just said that if you want to train big models, you need lots of memory, yet we have been struggling with big models ever since the onslaught of BERT, and solutions exist to train 24 GB models in 10 GB of memory. If you do not have the money or want to avoid the cooling/power issues of the RTX 3090, you can get an RTX 3080 and just accept that you will need to do some extra programming by adding memory-saving techniques. There are enough techniques to make it work, and they are becoming more and more commonplace.
Here is a list of common techniques:
- FP16/BF16 training (apex)
- Gradient checkpointing (only store some of the activations and recompute them in the backward pass)
- GPU-to-CPU Memory Swapping (swap layers not needed to the CPU; swap them back in just-in-time for backprop)
- Model Parallelism (each GPU holds a part of each layer; supported by fairseq)
- Pipeline parallelism (each GPU holds a couple of layers of the network)
- ZeRO parallelism (each GPU holds a slice of the parameters, gradients, and optimizer states)
- 3D parallelism (Model + pipeline + ZeRO)
- CPU Optimizer state (store and update Adam/Momentum on the CPU while the next GPU forward pass is happening)
If you are not afraid to tinker a bit and implement some of these techniques (which usually means integrating packages that support them into your code), you will be able to fit that 24 GB network onto a smaller GPU. With that hacking spirit, the RTX 3080, or any GPU with less than 11 GB memory, might be a great GPU for you.
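As one concrete example of the techniques above, gradient checkpointing is built into PyTorch and needs only a few lines. This is a minimal sketch with a toy sequential model; the layer sizes and the number of segments are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep toy network: 32 blocks of Linear + ReLU.
blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(32)]
model = nn.Sequential(*blocks).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

# Checkpoint the network in 4 segments: only activations at segment boundaries
# are stored, and the rest are recomputed during the backward pass. This trades
# roughly one extra forward pass for a large cut in activation memory.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```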
Is upgrading from RTX 20 to RTX 30 GPU worth it? Or Should I wait for the next GPU?
If I were you, I would think twice about upgrading from an RTX 20 GPU to an RTX 30 GPU. You might be eager to get that 30% faster training or so, but it can be a big headache to deal with all the other RTX 30 GPU problems: the power supply, the cooling, and the need to sell your old GPUs. Is it all worth it?
I could imagine if you need that extra memory, for example, to go from RTX 2080 Ti to RTX 3090, or if you want a huge boost in performance, say from RTX 2060 to RTX 3080, then it can be pretty worth it. But if you stay “in your league,” that is, going from Titan RTX to RTX 3090, or, RTX 2080 Ti to RTX 3080, it is hardly worth it. You gain a bit of performance, but you will have headaches about the power supply and cooling, and you are a good chunk of money lighter. I do not think it is worth it. I would wait until a better alternative to GDDR6X memory is released. This will make GPUs use less power and might even make them faster. Maybe wait a year and see how the landscape has changed since then.
It is worth mentioning that technological progress is slowing down anyway. So waiting for a year might net you a GPU that will stay current for more than 5 years. There will be a time when cheap HBM memory can be manufactured. If that time comes and you buy such a GPU, you will likely stay on it for more than 7 years. Such GPUs might be available in 3-4 years. As such, playing the waiting game can be a pretty smart choice.
General Recommendations
In general, the RTX 30 series is very powerful, and I recommend these GPUs. Be aware of memory, as discussed in the previous section, but also power requirements and cooling. If you have one PCIe slot between GPUs, cooling will be no problem at all. Otherwise, with RTX 30 cards, make sure you get water cooling, PCIe extenders, or effective blower cards (data in the next weeks will show the NVIDIA fan design is adequate).
In general, I would recommend the RTX 3090 for anyone that can afford it. It will equip you not only for now but will be a very effective card for the next 3-7 years. As such, it is a good investment that will stay strong. It is unlikely that HBM memory will become cheap within three years, so the next GPU would only be about 25% better than the RTX 3090. We will probably see cheap HBM memory in 3-5 years, so after that, you definitely want to upgrade.
For PhD students, those who want to become PhD students, or those who are getting started with a PhD, I recommend RTX 3080 GPUs for prototyping and RTX 3090 GPUs for doing rollouts. If your department has a GPU cluster, I would highly recommend a Slurm GPU cluster with 8-GPU machines. However, since the cooling of RTX 3080 GPUs in an 8x GPU server setup is questionable, it is unlikely that you will be able to run these. If the cooling works, I would recommend 66-80% RTX 3080 GPUs and the rest of the GPUs being either RTX 3090 or Tesla A100. If the cooling does not work, I would recommend 66-80% RTX 2080 and the rest being Tesla A100s. Again, it is crucial that you make sure that heating issues in your GPU servers are taken care of before you commit to specific GPUs for your servers. More on GPU clusters below.
If you have multiple RTX 3090s, make sure you choose solutions that guarantee sufficient cooling and power. I will update the blog post on what a proper setup looks like as more and more data rolls in.
For anyone without strictly competitive requirements (research, competitive Kaggle, competitive startups), I would recommend, in order: a used RTX 2080 Ti, a used RTX 2070, a new RTX 3080, a new RTX 3070. If you do not like used cards, buy the RTX 3080. If you cannot afford the RTX 3080, go with the RTX 3070. All of these cards are very cost-effective solutions and will ensure fast training of most networks. If you use the right memory tricks and are fine with some extra programming, there are now enough tricks to make a 24 GB neural network fit into a 10 GB GPU. As such, if you accept a bit of uncertainty and some extra programming, the RTX 3080 might also be a better choice compared to the RTX 3090 since performance is quite similar between these cards.
If your budget is limited and an RTX 3070 is too expensive, a used RTX 2070 is about $260 on eBay. It is not clear yet if there will be an RTX 3060, but if you are on a limited budget, it might also be worth waiting a bit more. If priced similarly to the RTX 2060 and GTX 1060, you can expect a price of $250 to $300 and a pretty strong performance.
If your budget is limited, but you still need large amounts of memory, then old, used Tesla or Quadro cards from eBay might be best for you. The Quadro M6000 has 24 GB of memory and goes for $400 on eBay. The Tesla K80 is a 2-in-1 GPU with 2x 12 GB of memory and goes for about $200. These cards are slow compared to more modern cards, but the extra memory can come in handy for specific projects where memory is paramount.
Recommendations for GPU Clusters
GPU cluster design depends highly on use. For a +1,024 GPU system, networking is paramount, but if users only use at most 32 GPUs at a time on such a system, investing in powerful networking infrastructure is a waste. Here, I would go with similar prototyping-rollout reasoning, as mentioned in the RTX 3080 vs RTX 3090 case.
In general, RTX cards are banned from data centers via the CUDA license agreement. However, often universities can get an exemption from this rule. It is worth getting in touch with someone from NVIDIA about this to ask for an exemption. If you are allowed to use RTX cards, I would recommend standard Supermicro 8 GPU systems with RTX 3080 or RTX 3090 GPUs (if sufficient cooling can be assured). A small set of 8x A100 nodes ensures effective “rollout” after prototyping, especially if there is no guarantee that the 8x RTX 3090 servers can be cooled sufficiently. In this case, I would recommend A100 over RTX 6000 / RTX 8000 because the A100 is pretty cost-effective and future proof.
If you want to train vast networks on a GPU cluster (+256 GPUs), I would recommend the NVIDIA DGX SuperPOD system with A100 GPUs. At a +256 GPU scale, networking becomes paramount. If you want to scale to more than 256 GPUs, you need a highly optimized system, and putting together standard solutions is no longer cutting it.
Especially at a scale of +1024 GPUs, the only competitive solutions on the market are the Google TPU Pod and the NVIDIA DGX SuperPod. At that scale, I would prefer the Google TPU Pod since its custom-made networking infrastructure seems to be superior to the NVIDIA DGX SuperPod system — although both systems come quite close to each other. The GPU system offers a bit more flexibility for deep learning models and applications than the TPU system, while the TPU system supports larger models and provides better scaling. So both systems have their advantages and disadvantages.
Do Not Buy These GPUs
I do not recommend buying multiple RTX Founders Editions (any) or RTX Titans unless you have PCIe extenders to solve their cooling problems. They will simply run too hot, and their performance will be way below what I report in the charts above. 4x RTX 2080 Ti Founders Editions GPUs will readily dash beyond 90C, will throttle down their core clock, and will run slower than properly cooled RTX 2070 GPUs.
I do not recommend buying Tesla V100 or A100 unless you are forced to buy them (banned RTX data center policy for companies) or unless you want to train very large networks on a huge GPU cluster — these GPUs are just not very cost-effective.
If you can afford better cards, do not buy GTX 16 series cards. These cards do not have tensor cores and, as such, provide relatively poor deep learning performance. I would choose a used RTX 2070 / RTX 2060 / RTX 2060 Super over a GTX 16 series card. If you are short on money, however, the GTX 16 series cards can be a good option.
When Is it Best Not to Buy New GPUs?
If you already have RTX 2080 Tis or better GPUs, an upgrade to RTX 3090 may not make sense. Your GPUs are already pretty good, and the performance gains are negligible compared to worrying about the PSU and cooling problems for the new power-hungry RTX 30 cards — just not worth it.
The only reason I would want to upgrade from 4x RTX 2080 Ti to 4x RTX 3090 would be if I did research on huge transformers or other highly compute-dependent network training. However, if memory is a problem, you may first consider some memory tricks to fit large models on your 4x RTX 2080 Tis before upgrading to RTX 3090s.
If you have one or multiple RTX 2070 GPUs, I would think twice about an upgrade. These are pretty good GPUs. Reselling those GPUs on eBay and getting RTX 3090s could make sense, though, if you find yourself often limited by the 8 GB memory. This reasoning is valid for many other GPUs: If memory is tight, an upgrade is right.
Question & Answers & Misconceptions
Summary:
- PCIe 4.0 and PCIe lanes do not matter in 2x GPU setups. For 4x GPU setups, they still do not matter much.
- RTX 3090 and RTX 3080 cooling will be problematic. Use water-cooled cards or PCIe extenders.
- NVLink is not useful. Only useful for GPU clusters.
- You can use different types of GPUs in one computer (e.g., GTX 1080 + RTX 2080 + RTX 3090), but you will not be able to parallelize across them efficiently.
- You will need Infiniband +50Gbit/s networking to parallelize training across more than two machines.
- AMD CPUs are cheaper than Intel CPUs; Intel CPUs have almost no advantage.
- Despite heroic software engineering efforts, AMD GPUs + ROCm will probably not be able to compete with NVIDIA for at least 1-2 years due to the lacking community and the lack of a Tensor Core equivalent.
- Cloud GPUs are useful if you use them for less than 1 year. After that, a desktop is the cheaper solution.
Do I need PCIe 4.0?
Generally, no. PCIe 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.
Do I need 8x/16x PCIe lanes?
Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.
How do I fit 4x RTX 3090 if they take up 3 PCIe slots each?
You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU. It seems the most manageable solution will be to get 4x RTX 3090 EVGA Hydro Copper with a custom water cooling loop. This will keep the cards very cool. EVGA produced hydro copper versions of GPUs for years, and I believe you can trust in their water-cooled GPUs’ quality. There might also be other variants which are cheaper though.
PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!
How do I cool 4x RTX 3090 or 4x RTX 3080?
See the previous section.
Can I use multiple GPUs of different GPU types?
Yes, you can! But you cannot parallelize efficiently across GPUs of different types. I could imagine that a 3x RTX 3070 + 1 RTX 3090 setup could make sense for a prototyping-rollout split. On the other hand, parallelizing across 4x RTX 3070 GPUs would be very fast if you can make the model fit onto those GPUs. The only other reason I can think of for why you would want to do this is to keep using your old GPUs. This works just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually the gradient update).
What is NVLink, and is it useful?
Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.
I do not have enough money, even for the cheapest GPUs you recommend. What can I do?
Definitely buy used GPUs. A used RTX 2070 ($400) or RTX 2060 ($300) is great. If you cannot afford that, the next best option is to try to get a used GTX 1070 ($220) or GTX 1070 Ti ($230). If that is too expensive, go for a used GTX 980 Ti (6 GB, $150) or a used GTX 1650 Super ($190). If that is also too expensive, it is best to roll with free GPU cloud services. These usually provide a GPU for a limited amount of time/credits, after which you need to pay. Rotate between services and accounts until you can afford your own GPU.
What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?
I built a carbon calculator for calculating your carbon footprint as an academic (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?
I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.
I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
What do I need to parallelize across two machines?
If you want to be on the safe side, you should get network cards with at least +50 Gbit/s bandwidth to gain speedups if you want to parallelize across machines. I recommend at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s bandwidth. Two EDR cards with cable are about $500 on eBay.
In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).
Is the sparse matrix multiplication feature suitable for sparse matrices in general?
It does not seem so. Since the sparse matrix needs to have 2 zero-valued elements for every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly by pooling 4 values into a compressed representation of 2 values, but this also means that exact, arbitrary sparse matrix multiplication is not possible with Ampere GPUs.
Do I need an Intel CPU to power a multi-GPU setup?
I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions, AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.
Does computer case design matter for cooling?
No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 °C better temperatures, while space between GPUs will provide you with 10-30 °C improvements. The bottom line: if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.
Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?
Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.
AMD GPUs are great in terms of pure silicon: great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or an equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors suggest that some data center card with a Tensor Core equivalent is planned for 2020, but no new data has emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.
Let’s say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software stack via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems, if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.
However, even if the software and the lack of Tensor Cores are solved, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner, while I know of far fewer AMD GPU experts.
In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.
Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.
When is it better to use the cloud vs a dedicated GPU desktop/server?
Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.
The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with an RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, electricity costs $0.12 per kWh. This compares to $2.14 per hour for an AWS on-demand instance.
At 15% utilization per year, the desktop uses:
(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year
So 591 kWh of electricity per year, that is an additional $71.
The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of the time during the day) would be about 300 days ($2,311 vs $2,270):
$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311
So if you expect to run deep learning models for more than 300 days, it is better to buy a desktop instead of using AWS on-demand instances.
AWS spot instances are a bit cheaper at about $0.90 per hour. However, many users on Twitter were telling me that on-demand instances are a nightmare, but that spot instances are hell. AWS itself lists the average frequency of interruptions of V100 GPU spot instances as above 20%. This means you need a pretty good spot instance management infrastructure to make it worth it to use spot instances. But if you have it, AWS spot instances and similar services are pretty competitive. You need to own and run a desktop for 20 months to break even compared to AWS spot instances. This means that if you expect to run deep learning workloads for longer than the next 20 months, a desktop machine will be cheaper (and easier to use).
You can do similar calculations for any cloud service to make the decision if you go for a cloud service or a desktop.
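If you want to plug in your own numbers, here is a minimal sketch of the calculation above in Python (all constants are the example assumptions from this section; substitute your own prices, wattages, and utilization):

# Break-even point for a desktop vs. an on-demand cloud GPU instance.
desktop_cost = 2200.0     # $ for a desktop with one RTX 3090
power_watts = 350 + 100   # GPU + CPU power draw in watts
electricity = 0.12        # $ per kWh
cloud_rate = 2.14         # $ per hour for an AWS V100 on-demand instance
utilization = 0.15        # fraction of the day the machine is used

hours_per_day = 24 * utilization
desktop_cost_per_day = power_watts / 1000 * hours_per_day * electricity
cloud_cost_per_day = cloud_rate * hours_per_day

break_even_days = desktop_cost / (cloud_cost_per_day - desktop_cost_per_day)
print(f"Break-even after ~{break_even_days:.0f} days")  # ~293 days, in line with the ~300-day estimate above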
Common utilization rates are the following:
- PhD student personal desktop: < 15%
- PhD student slurm GPU cluster: > 35%
- Company-wide slurm research cluster: > 60%
In general, utilization rates are lower for professions where thinking about cutting-edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5% and 10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
TL;DR advice
Best GPU overall: RTX 3080 and RTX 3090.
GPUs to avoid (as an individual): Any Tesla card; any Quadro card; any Founders Edition card; Titan RTX, Titan V, Titan XP.
Cost-efficient but expensive: RTX 3080.
Cost-efficient and cheaper: RTX 3070, RTX 2060 Super.
I have little money: Buy used cards. Hierarchy: RTX 2070 ($400), RTX 2060 ($300), GTX 1070 ($220), GTX 1070 Ti ($230), GTX 1650 Super ($190), GTX 980 Ti (6GB $150).
I have almost no money: There are a lot of startups that promote their clouds: use free cloud credits and switch between companies' accounts until you can afford a GPU.
I do Kaggle: RTX 3070.
I am a competitive computer vision, pretraining, or machine translation researcher: 4x RTX 3090. Wait until working builds with good cooling and enough power are confirmed (I will update this blog post).
I am an NLP researcher: If you do not work on machine translation, language modeling, or pretraining of any kind, an RTX 3080 will be sufficient and cost-effective.
I started deep learning, and I am serious about it: Start with an RTX 3070. If you are still serious after 6-9 months, sell your RTX 3070 and buy 4x RTX 3080. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GPUs, and buy something more appropriate after about three years (next-gen RTX 40s GPUs).
I want to try deep learning, but I am not serious about it: The RTX 2060 Super is excellent but may require a new power supply to be used. If your motherboard has a PCIe x16 slot and you have a power supply with around 300 W, a GTX 1050 Ti is a great option since it will not require any other computer components to work with your desktop computer.
GPU Cluster used for parallel models across less than 128 GPUs: If you are allowed to buy RTX GPUs for your cluster: 66% 8x RTX 3080 and 33% 8x RTX 3090 (only if sufficient cooling is guaranteed/confirmed). If cooling of the RTX 3090s is not sufficient, buy 33% RTX 6000 GPUs or 8x Tesla A100 nodes instead. If you are not allowed to buy RTX GPUs, I would probably go with 8x A100 Supermicro nodes or 8x RTX 6000 nodes.
GPU Cluster used for parallel models across 128 GPUs: Think about 8x Tesla A100 setups. If you use more than 512 GPUs, you should think about getting a DGX A100 SuperPOD system that fits your scale.
Version History
- 2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
- 2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
- 2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
- 2018-11-26: Added discussion of overheating issues of RTX cards.
- 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
- 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
- 2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
- 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
- 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
- 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
- 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
- 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
- 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
- 2015-02-23: Updated GPU recommendations and memory calculations
- 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgments
I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the current version of this blog post.
For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes.
Roger says
Thanks for this amazing article, Tim! It has been useful for multiple years now! We have budget for a multi-GPU machine. We work with large 3D images and so are considering the 3090 for its higher memory. Do you know if a machine with 3x 3090 blower-edition GPUs can have any power/heating issues? Would you recommend it?
Thank you
Vytautas says
Great article!
I am researching the best budget AI reinforcement learning hardware combination for a laptop. After some research and reading this article, I basically ended up with two choices: either an RTX 2060 (6GB) and an AMD Ryzen 9 4900H (8 cores), or an RTX 2070 (8GB) and an Intel Core i7-10750H (6 cores).
From the article I see that more GPU memory is definitely a good thing; however, for reinforcement learning, you mentioned that the more CPU cores the better. So I wonder if, in this case, more CPU cores and less GPU memory outweigh fewer CPU cores and more GPU memory?
Tim Dettmers says
I think there will not be a great difference between those CPUs; either one is fine, and you should see no large differences in performance for RL.
zine says
Great article. A question: if one's focus is on real-time time-series prediction and anomaly detection, and one is looking to create models on a laptop and deploy them on a server, will a GTX 1650 be enough to test, including Tensor Core code (though there are no Tensor Cores on the 1650-Q), or should one try to have an RTX 3000-Q or RTX 2070-Q as a minimum to test on a laptop? The server has 2080s. I am a noob, but I want something that I can carry with me, develop on easily, and seamlessly transfer to the server.
Tim Dettmers says
It highly depends on the nature of those datasets and the complexity of those models. A GTX 1650 will suffice for many kinds of models if the dataset has ~100 variables and 1M datapoints. Beyond that, you might need a larger GPU with more memory.
Bruno Kemmer says
Hi Tim,
First, thank you for your posts, they are very instructive!
Do you know if the limitations imposed by NVIDIA on the RTX 3060, limiting its hash rate, could affect its performance for CV applications?
I am planning to start researching GANs. I am waiting for the 3060 Ti with 12GB; hopefully, it will not be too overpriced.
Finally, could running a single GPU for a considerable time (I imagine a week) create heating problems (in a ventilated case and room)?
Do you have recommendations for water-coolers?
Thank you,
Bruno
Daniel Danaie says
Hello Tim,
Awesome articles! I am trying to build an autonomous drone, and the on-board computer will be a Xavier NX. I am pretty sure that can handle inference for the system, but I am trying to understand which GPU I should train my prototype on. That is considering I will start with the CV aspect (asking the drone to find free paths, recognize colors, etc.), go towards RL (asking the drone to maximize packages delivered, etc.), and, as my last priority, NLP (allowing the drone to talk with people). This progression might take 3-4 years. I would love to hear the difference between these fields as it relates to choosing hardware. Also, considering I will start training the drone in mid-to-late 2021, what would you suggest, considering my budget for the GPU & CPU is around 2000 USD and that I am creating this drone as a startup?
Thanks and sorry for my numerous questions,
Daniel
Joshua Brown Kramer says
This blog is the best source of information about how to build a machine learning box; however, it has what I see as a crucial problem. I've said as much before in these comments, but to put a fine point on it: the performance per dollar calculations are just wrong.
In particular, the supposed leader in that category is the 3080. The problem with that status is that it appears to depend largely on the MSRP of $800. But I have signed myself up for several services that alert me to the availability of this card at this price, and for months I have not been able to get it. The market price of this card is more like $1400. The MSRP is essentially meaningless. When compared to the 2080 Ti, which is available for around $1000, and using your own performance comparisons, the 2080 Ti beats the 3080 on performance per dollar.
Jason says
It’s currently a very bad time to build a deep learning machine. Prices are hugely inflated (as you’ve seen). The cost estimates seem fair in a normal market. Sadly, this is not a normal market.
Mohannad Barakat says
Hi Tim Dettmers,
I'm Mohannad, a computer engineering undergraduate student. Currently I'm working on my graduation project. My team and I have succeeded in securing $10K of funding on AWS. We plan to use it in order to train our models. We are working on a text-to-audio task (WaveNet, WaveGlow, Tacotron 2, and a voice verifier) and an audio-to-text task (wav2vec and wav2letter). We use LibriSpeech for both (about 1000 hours).
Also, in case we use up all of the funding, we will be allowed to use the university labs (2x 2080).
Currently we are stuck on deciding the type of EC2 instance. Our options are K80, V100, and A100 (p2, p3, p4 types). We are also not able to guess how many GPUs of the chosen type we need.
Thanks a lot
Nate Liu says
Hi Tim,
Thanks for all these details!
Currently, I'm working on computer vision, particularly on generative models.
I have been thinking about getting my own 4x GPU workstation, and the company called "Lambda" seems to have some prebuilt stations. Do you think it's better to build my own station than to buy one from the company? And what GPUs do you recommend for vision in general now?
Tuomo Kalliokoski says
Could you update the charts with current prices, as the distribution problems for the RTX 30 series are predicted to last until the end of Q2 this year? Also, could you give your opinion on the RTX 3060 Ti?
Dave says
Hi Tim,
Thanks for all the information, I really appreciate it!
I managed to get two 3080 cards, one MSI, one Gigabyte; the only difference seems to be that one is 1740 MHz and the other is 1800 MHz. I do my current learning on a dual-socket Intel server that only has one PCIe x16 slot and 3 x8 slots. I do Kaggle-level work.
My 2 questions are:
* Does it matter that my two 3080s are not identical?
* Do I need to buy a new motherboard and cpu or can I get an 8x to 16x PCIe extender and use that for one of the gpus without really sacrificing performance?
Happy to provide more info if needed to answer. I don’t want to waste your time with extraneous words.
Thanks!
Angel Genchev says
I tried running ResNet-50 (training) on a 6 GB 1660 Ti and it fails to allocate enough CUDA memory. So the problem with insufficient video memory is real. I began to think about what I could do and came up with the idea of using AMD ROCm on their APUs.
This could potentially allow me to allocate "video" memory as long as there is RAM in the system. For example, on a 32GB system it might be possible to allocate at least 16 GB for the GPU. Slower training is preferable to impossible training 🙂
What do you think ?
Goce says
Tim,
Please allow this machine learning inexperienced cheapo to ask your opinion about a GPU choice.
I want to try experimenting with language models such as BERT, GPT etc. The goal is to create some software that will provide suggestions for a certain type of textual work. It’s still a vague idea at this point and not my first priority, but from what I tried so far on google it just might work well.
My budget at this point is kinda limited, so I am considering: a K80 24GB ($300), a 1080 Ti ($400), or for a slightly higher budget possibly a 2080 Ti ($600) or an M6000 24GB ($800). Prices are local for my country.
I think I need 24GB so I can run the biggest models possible (BERT, GPT-2 1.5 bil), but I am not quite sure. I suppose Megatron-LM wouldn't fit in 24GB. So I am not sure if I even need 24GB or whether 11GB would be good enough for now. What do you think?
X says
Hi Tim,
Thank you for this great article! Would you recommend RTX 3060 based on its VRAM capacity (12GB VRAM) and price (329 USD MSRP)? I may be a little bit too eager since it hasn’t been out on the market yet…
Michel Rathé says
Hi Tim,
I'm on a waiting list for an Asus RTX 3090 for my next workstation.
I'm considering the new AMD 3975WX (Zen 3) with 256 GB of ECC RAM.
My problem is the shortage of the 3090. If I'm lucky enough to get one, what if I want a second one?
I'm developing in MATLAB and want to have expandability, power, and future-proofing.
What if I go for the A6000 instead? I can have one for $6,200 CAD, roughly 20% more than 2x 3090. I'd have 48 GB of VRAM in one card and no problem getting one.
More technically, are there downsides to having one bigger card for training?
Can I run multiple training tasks on that GPU?
Can deep learning benefit from larger batch sizes on such a card?
If the AMD 3975WX is too expensive (no prices yet) compared to the next Zen 3 (Genesis Peak) 3970X, would the 3970X (with an A6000) still be a beast and reliable? Is the 8-channel ECC memory of the AMD PRO a real deal-breaker, or are there no real advantages?
The A6000 is tempting because it takes only 16 lanes and 2 or 3 slots (compared to 2x 3090), and there is no need for NVLink to get 48 GB as with 2x 3090.
What are your thoughts?
Thanks again,
Michel
Angel Genchev says
It depends. If you train triplet loss with online hard-negative mining, then large batches are a must. Actually, I hit the wall with 6GB of VRAM; I need at least 16 or 24. So having little VRAM can make it impossible to train certain networks (I fell back to the CPU).
Your downside will be the higher MSRP cost vs GPU speed. On a 1660, a batch of 64 was enough to saturate the GPU (inference), and no further speed-up with bigger batches was observed. If your networks fit in a 3090, then 2x 3090 might be faster than one RTX 6000. "Multiple training tasks" – you do not want to do this. One training task usually maxes out the GPU and asks for even more.
Vatsal Desai says
Hi Tim,
I am building a deep learning machine for the first time for my use as a freelancer/consultant/startup in AI.
So I do not want to limit myself to DL and would explore reinforcement learning as well – I do not wish to buy multiple GPUs or machines due to budget constraints. Also, I may explore participating in Kaggle competitions in the future.
So I plan to go with a single RTX 3090 24 GB at present, considering any future upgrade (not in the next 1-2 years for sure) – what would be a proper combination of CPU, motherboard, RAM, hard disk, power supply, and cooling system?
I watched your recent interview on YouTube (Chai Time Data Science channel) as well – I am located in India, and here there is not a good market to buy/sell used cards at present.
Please advise.
Thank you
Jun Wang says
Hi, is a Threadripper 2950X enough, or should I go with 3rd gen?
Rene Munsch says
Hey @all,
just one question: I have this motherboard:
https://www.biostar.com.tw/app/en/mb/introduction.php?S_ID=886
As you can see, it offers 8 PCIe slots (1x16 and 7x1).
Is it possible to use it for a low-budget setup, and how large is the loss compared to "normal" x16 slots in real life?
Ben M says
Hi Tim, awesome article – thanks for sharing your wisdom with the community. I'm hoping to pair two RTX 3090 cards on a Gigabyte TRX40 Designare motherboard (https://www.anandtech.com/show/15121/the-amd-trx40-motherboard-overview-/11), and am considering two cases to house this beast:
(1) CORSAIR OBSIDIAN SERIES™ 750D FULL TOWER CASE (https://www.legitreviews.com/corsair-obsidian-750d-full-tower-case-review_126122)
(2) FRACTAL DESIGN DEFINE 7 XL (https://www.legitreviews.com/fractal-design-define-7-xl-case-review_217535)
Any advice on whether these cases would be good choices? Would I even be able to fit two of these cards in the cases? Thanks for any pointers.
Mehran says
Hi,
I know it's hard these days, with a new GPU launch every month, but would you please revisit this guide for the new 3060 with 12GB of VRAM? I think for ML it's far more valuable than the 3070.
keith says
Any thoughts on the new 3060 12GB release? It gets past that 11GB barrier mentioned for Transformer models. Also, at nominally $400, you could run 4x for less than a 3090; I am curious how that could affect the cost/performance metrics.
Sam says
Hi, thanks so much for your incredible work.
I am planning to build a studying/prototyping machine for NLP tasks, such as finetuning transformers like BERT-base. I'm on a pretty tight budget, and as far as I understand, VRAM is essential in my case. I am considering the 3060 Ti/3070 or the new 3060 with 12GB. Is it worth waiting for the 3060, or will the 3060 Ti plus memory-saving techniques be sufficient for now, or maybe I should consider something else?
IanK says
Hi, can you post your updates after CES 2021?
I want to buy a new NVIDIA GPU since I only have a 6GB 1060 and would like to have around 11 GB of memory – I have a speech model on a server that uses around 11GB (on a 2013 K40 now, but it is slow (no Tensor Cores?) compared to current Ampere; also, AMD ROCm has no easy support, and AMD is obliterated by NVIDIA in deep learning). Of course, I can squeeze the model size a little bit.
So I was thinking about getting a possible 3070 Ti 12 GB, or rather the RTX 3060 12 GB confirmed a day ago, but I don't want only a 192-bit memory bus; I would prefer 256-bit if it increases the performance of GPU training and inference significantly.
Can you update your article on how the memory bus affects GPU performance in deep learning (I can't find info anywhere on how important it is)? Is the memory bus important with a huge VRAM size in deep learning? I'm puzzled.
Do you think the 3060 12GB is a fair purchase vs the lower-VRAM but wider-memory-bus 3060 Ti?
What is your opinion?
Thank you for your response and be well.
Jun says
Hi, thanks for the detailed explanation of all aspects. Which Threadripper would you recommend, though?
Mary says
Hi Tim,
Thank you for all of the information you have provided. It was very informative for me. I was wondering if you can help me. I work in a small company, and for now we are just two data scientists, and at most we will grow to have 7-8 data scientists in 2 years. We are going to train some deep learning models for image processing. Can you tell me what kind of GPU we should get for a server for our company?
Drake says
Hi Tim,
I was using a machine with an Intel i5, an NVIDIA 1080, and 32GB of RAM for about a year to a year and a half, but it couldn't keep up with some of the models I was trying to use.
I recently built a new rig for hobbyist ML. The new rig uses an AMD 2950 Threadripper and 64GB of RAM, but I have a holdover graphics card until I can get my hands on a 30 series. My question is: I have about 2k reserved for the graphics card right now. Which do you think would make more sense: 4x 3070s, 2x 3080s, or 1x 3090?
I have academia research experience with NLP and Computer Vision, but I also have little exposure to many other subfields and would like to experiment in other fields.
I have a Lian Li PC-O11 Air ATX Full Tower Case so I’m not really worried about space, my motherboard has 6 PCIE slots.
Thank you!
Arun Balajiee says
Hi Tim
Really great effort in putting together all the salient features for researchers in Deep Learning, especially useful for students like me
I put together a few parts based on the comments on this blog post and also from what I understood from the article here – https://pcpartpicker.com/list/yTr9RT
Do you think this would suffice for small-scale AI and NLP tasks? It would also be helpful if you could recommend some cost-cutting measures that I could make to this list.
Marc Brian says
For researching/optimising hyper-params on an LSTM (i.e., configure, train, test, then repeat) for a single company stock, I've been getting on fine for a few months with a 2080 Super. But now I am also testing for a "best stock", which means iterating over all hyper-params across 200 companies, and it's now taking a week to get a clean run-through (I can pause and resume). Would it be better for me to switch to the cloud or get a 3090? I don't mind waiting 48 hours, but will this just burn out any card I get in less than a year anyway?
Tim Dettmers says
I think in your case you want to have many small GPUs, at least if your models are not too big. For that, using cloud computing can be very efficient. I would not worry about the lifetime of the GPU; it should be fine even if it runs all the time. You should only worry if the temperature is 85°C or higher for a very long time.
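If you want to keep an eye on temperatures during such long runs, here is a minimal monitoring sketch in Python using the standard nvidia-smi query flags (the 85°C threshold is just the rule of thumb from above):

import subprocess
import time

# Poll GPU temperatures every 10 seconds and warn above 85°C.
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"], text=True)
    for line in out.strip().splitlines():
        idx, temp = (int(v) for v in line.split(","))
        if temp >= 85:
            print(f"GPU {idx}: {temp} C - consider more airflow or power limiting")
    time.sleep(10)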
Gosuto says
You claim that it is better to avoid the RTX Titan models due to their temperature issues.
Do you expect the same for the to-be-released 3080 Ti? Because other than that, the VRAM upgrade from 10GB to 20GB and doubling of all types of cores seems worth the wait.
Or is the temperature issue a sign of limits being reached and Nvidia just trying to get away with it?
Tim Dettmers says
The main issue with the RTX Titan cooling was its fan design which was great for a single GPU but terrible for multiple. That has been fixed with the RTX 30 series and the 3080 Ti should be just fine.
Sovit Ranjan Rath says
Hello Tim. Great article. It is helping me a lot in deciding my next GPU and machine for doing deep learning. I have one question that I want to ask, as there is not much information out there for such situations. First of all, I am serious about deep learning. I am a deep learning blog author (https://debuggercafe.com/). I regularly take on personalized medium-scale deep learning projects, and I also want to increase the standard and quality of my deep learning blog posts. Also, traveling between jobs for the next two years is almost mandatory for me. Currently, I do everything on an MSI laptop (i7 + 6GB GTX 1060) and some occasional work on Colab. I want to upgrade to a good laptop, preferably an i7 + 2080 Super Max-Q. I know that Max-Q variants keep the temperature hovering around 75-80 degrees Celsius for hours on end. Still, I am a bit skeptical. I really want to get your opinion on this. Hoping to hear from you.
Tim Dettmers says
Hello Sovit, I think the 2080 Max-Q is a great GPU and the high temperature should be fine (unless you often use your laptop on your lap, which can become uncomfortable over time). I think it is a great option in your case. Another option might be a dedicated cloud GPU. However, in the case of a cloud GPU you will need stable internet access, which is not always readily available when traveling. On the other hand, a cloud GPU is more powerful and allows you to always use the newest GPUs.
Sovit Ranjan Rath says
Thank you so much Tim. As you are providing two options, I think I will go for the Max Q machine for the next two years. It will help me in a lot of ways and will also provide me with the flexibility to take on my projects on my own time. Thanks again for your feedback. Your articles are really awesome and help me a lot.
Brian Arbuckle says
Hi Tim,
Fantastic article. Is it correct to say that buying two GPUs, say two 3080s with 10GB of video RAM each, does not "double" the RAM when it comes to deep learning? I am pricing a system for BERT as well as audio/video training. I know in both examples I need a lot of GPU RAM. I assumed the 30 GB of GPU RAM when buying three 3080s is not effectively greater than the 24 GB of a single card, as each card would utilize its own RAM, though maybe I am wrong?
Happy new year!
Thanks,
Brian
Tim Dettmers says
Hi Brian,
this is exactly right for data parallelism. With model parallelism (or the even rarer pipeline parallelism) you can however spread the model across GPUs and save more memory. However, software support for this is not readily available yet but should become common in the next 6 months. So if you need lots of memory, a single 24 GB GPU will currently serve you better and will still serve you well in a couple of months.
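For readers wondering what model parallelism looks like in practice, here is a minimal hand-rolled sketch in PyTorch (a toy model split across two GPUs; real libraries automate this, but the principle of spreading layers across separate memory pools is the same):

import torch
import torch.nn as nn

# Naive model parallelism: half of the layers live on each GPU, so parameters
# and activations are split across two memory pools instead of duplicated.
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop between GPUs here

model = TwoGPUModel()
out = model(torch.randn(32, 4096))
print(out.device)  # cuda:1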
Roman Stehling says
Hi Tim,
Thanks for sharing all your detailed insights! Amazing learning experience for me.
I am working on Deep Reinforcement Learning for trading in the financial markets. My understanding is that the deep learning part in this context would be for finding out the optimal policy for the RL. Would that rather imply a smaller training dataset size like 10GB (3080) or a bigger one like 24GB (3090)?
Thank you!
Cheers,
Roman
Tim Dettmers says
Hi Roman, for RL the biggest problem is usually CPU performance rather than GPU memory. As such, I would go for the 3080 and invest the extra money in a CPU with more cores and more power.
Roman Stehling says
Awesome! Thank you, Tim!
Just returned the Ryzen 9 5900x and got the Threadripper 3970x with 32 cores and 64 threads! 🙂
And I am keeping the RTX 3080.
Albert says
Hi Tim, thanks for the always updated guide!
Deep Learning so far has been a hobby for me, and I am focusing more and more on CV, by doing MOOCs and reading books.
Knowing that's the field of my interest, should I go for something like an RTX 3060 Ti (MSRP in the EU should be around 400 euros, when and IF available) and hold it through 2021 as I become better and better, or should I aim from day 1 to get an RTX 3080 (Ti)?
Tim Dettmers says
Hi Albert, I think the RTX 3060 Ti would be a very prudent choice for a first GPU. I think it is better to start small and, depending on how you like it, upgrade later to something larger or, if needed, to something really big.
swiss ml dude says
Hello Tim,
Thanks a lot for all the information on your website. I have a question: I just started a master's program in computer science, and I would love to continue afterwards to grad school in deep learning, especially in RL. My question is hardware related: I have a desktop computer I built a few years ago, but the GPU (Zotac 1060 6GB) is a bottleneck now when I try to train big models. Hence I believe the 3080 GPU is a good fit. I wonder what to do with the old graphics card. Is it a good idea to use it for the system display (since the display eats up some memory), so that the new 3080 GPU would be fully available for models? Also, I am afraid the two GPUs would be stacked together too closely on the motherboard. Indeed, I have an MSI B150 Gaming M3 motherboard, and the two GPU slots are really close to each other. Would it be a good idea to invest in water cooling or another motherboard? The blower of the old card would sit just above the new one in the current layout.
Tim Dettmers says
Hello Justin,
I would not worry too much about cooling. You do not lose much if you run the display on the RTX 3080, but it can help to power the displays with the GTX 1060. I think cooling will be just fine, but what I would do is: test with the GTX 1060 driving the displays and monitor cooling; if cooling is not good, remove the GTX 1060, sell it online, and power the displays with the RTX 3080.
Eric says
This article makes it very clear how to select a GPU card. Thanks.
Di Lai says
Hi, Tim
I have a 2x RTX 3090 machine with air cooling (not the blower type, with just one PCIe slot between the GPUs). I observed that when fully loaded the temperature of the GPUs is about 81°C, and the fan noise is strong. Is this normal? Currently my training times are short, but when I train some very large model, could this temperature last hours or even days without causing any thermal issue or damage to the GPUs?
Thanks a lot!
Di
edi says
Hi Tim, I have an EVGA 3090 FTW3 Ultra Hybrid (https://www.evga.com/products/product.aspx?pn=24G-P5-3988-KR) and an EVGA 3090 XC3 Ultra Hybrid (https://www.evga.com/products/product.aspx?pn=24G-P5-3978-KR). I want to build a deep learning workstation using those two GPUs. However, they are a little different, although both of them are 3090s. Will there be any compatibility issue (or will one of them be a bottleneck) because of the differences between the two cards? If so, I will return one of them and get a new one.
Thank you!
Tim Dettmers says
Hi edi, both cards should work just fine together.
Rick Albright says
I have a Lenovo P1 Gen 2 with an Intel Core i7-9850H processor and 64 GB of RAM. I am considering purchasing a Razer Core X Thunderbolt 3 enclosure. What card would you recommend under $1000? Will I have any performance issues using Thunderbolt 3 such that I shouldn't bother with an RTX 3070 or 3080? I plan on doing some NLP deep learning models. I've done some toy projects on the internal Quadro P2000 card, but its 4GB of memory just doesn't cut it. Would I be OK with an 8GB card, or should I spend a little extra for a 10GB card?
Asbjørn Berge says
Hi Tim! Thanks for the excellent and thoroughly researched post. We're doing research on DL on point clouds, and memory is a huge issue. Currently we're running a single Titan RTX, as it was a 24GB card with abundant supply. We've tried – but failed, so far – to split the training across multiple GPUs (using NVLink), so now we're looking to buy a 40GB+ card.
I'm still a bit unsure whether it makes more sense to grab an RTX 8000 or a PCIe A100. The latter, I guess, will have some serious cooling issues in our typical GPU boxes (we're mostly building using the Corsair Carbide Cube 540). But the performance looks so much better (on paper...), and the RTX 8000 does not look like much of an "upgrade" over the Titan RTX.
Any brief thoughts that will nudge me in the right direction? Thanks in advance for any input!
Tim Dettmers says
Hi Asbjørn,
I would go for the A100 and use power limiting if you run into cooling issues. It is just the better card all around, and the experience of making it work in a build will pay off in the coming years. Also make sure that you exhaust all kinds of memory tricks to save memory, such as gradient checkpointing, 16-bit compute, reversible residual connections, gradient accumulation, and others. This can often help to quarter the memory footprint at minimal runtime performance loss.
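To illustrate two of those tricks, here is a minimal PyTorch sketch of 16-bit (mixed-precision) compute combined with gradient accumulation (the toy model and synthetic data are placeholders for your own; gradient checkpointing would additionally go through torch.utils.checkpoint):

import torch
import torch.nn as nn

# Toy setup; replace with your own model, optimizer, and data loader.
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(8)]

scaler = torch.cuda.amp.GradScaler()   # 16-bit (mixed-precision) training
accum_steps = 4                        # gradient accumulation: 4 micro-batches per step
optimizer.zero_grad()

for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast():    # forward pass in fp16 where it is safe
        loss = loss_fn(model(x.cuda()), y.cuda()) / accum_steps
    scaler.scale(loss).backward()      # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:  # one optimizer step per accum_steps micro-batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()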
dvk says
@Tim Dettmers,
As it's so hard to get a new GPU card of your choice: can pairing a 3060 Ti with an RTX 3080 or 3090 in the future, or pairing a 2080 Ti with any RTX card (3060 Ti, 3070, or 3080), give us the advantages of pooling the resources of a multi-GPU PC for deep learning?
Tim Dettmers says
It can work, but the details can be complicated. Some combinations of GPUs work better than others. In general, the smaller the gap between GPU performance, the better.
Protim says
Hi Tim,
Been following this blog post, including the full hardware guide, for some time now. I wanted to ask if 2x 1060s would do for an intermediate researcher/senior PhD student. I am thinking of a 3950X build.
Tim Dettmers says
Hi Protim,
for an intermediate researcher or senior PhD student, I would recommend a faster/larger GPU. I would recommend a minimum of an RTX 2060 (or two in your case). The reason for this is both memory and speed that is required to build on the most recent models (or to apply them to a different academic field).
Bill Smith says
Thanks for the post!
I was looking into the NVIDIA Jetson AGX Xavier; would that have better performance training deep learning algorithms, especially with 32GB of RAM, than the 3080? The power consumption of only 30 watts and the 8-core ARM chip are also attractive. Thanks!
Tim Dettmers says
The Jetson GPUs are rather slow. I would only recommend them for robotics applications or if you really need a very low power solution.
Gonzalo says
Hi Tim,
thanks for this excellent and up-to-date appraisal of GPUs in deep learning.
Is there any difference, advantage/disadvantage, of AMD vs Intel processors in terms of deep learning software? I saw that you seem to prefer AMD Threadripper in terms of hardware, but what about its impact on software libraries, etc., available for deep learning?
Thanks
Tim Dettmers says
In general, the most CPU-intensive deep learning task is data preprocessing. Both CPU brands usually do just as well for these tasks. If you also run some heavy CPU code, for example, if you do Kaggle competitions, an Intel CPU will yield better performance.
David Haase says
Hi, thank you for your article. Will you update the article to include the RTX 3060 Ti that will be released tomorrow (02.12.2020)? It has the same amount of VRAM and might be a good alternative to the RTX 3070. What do you think?
Greetings.
Tim Dettmers says
I will probably need some time until I update this blog post again. Maybe in a month or so.
Marc says
Thank you for this great post! I am curious if you have any thoughts about where the soon-to-be-released A6000 fits in? It seems to combine the memory heft of the RTX 6000/8000 with the Ampere architecture of the RTX 3090, without the power or design issues. Based on the naming convention, it seems aligned with the RTX 6000/8000, which I understand are more graphics-oriented than deep-learning-oriented. Where do you expect this to fall, in particular when compared to the RTX 3090?
Tim Dettmers says
I would recommend the A6000 for data centers where it is not allowed to use RTX 3090.
Animesh Roy says
Hi Tim,
I have been following your blog post for GPU recommendations since last year. I am not a CS student; this is just a hobby for me, so I want to spend as little as I can. I want a card with 24GB of memory; which one would you pick, a Tesla M40 or a K80? I want to learn image recognition.
Tim Dettmers says
Hi Animesh,
I would go for the M40 as it is a bit faster.
Satchel says
Up until now I've been using a GTX 1070 as my only card (previously focusing on small-scale vision applications). Now I've moved on to more memory-hungry tasks (GANs specifically) but also DRL. My question is, would you suggest selling my current GTX 1070 and buying an RTX 2070 / RTX 3070, or perhaps just adding a second GTX 1070 to my setup?
The way I see it: second 1070 -> $300, approx. double performance and memory
RTX 2070 -> $300 (after selling the 1070), substantially better performance, same memory
RTX 3070 -> $500 (best performance, same memory) (assuming I can get it at retail pricing)
Prices are approximate and in CAD. My PSU is adequate for two 1070 cards.
Tim Dettmers says
I would probably wait for RTX 3070 Ti cards which will have more memory. It should be cheap enough and give you a bit more memory (10GB).
Will T. says
Amazing blog post! I have referred back here many times.
I see in another comment you recommended the 3080 over the 6800 XT since Nvidia has tensor cores – but what do you think about the fact that the 6800 XT has quite a bit more memory than the 3080? It also draws slightly less power, and has a lower sticker price, which are both appealing on the cost front.
Do you think that we will see some ROCm benchmarks soon on stuff like ResNet/BERT? Do you really expect the performance discrepancy between the cards to be significant enough to offset the benefits of the extra memory?
Thanks again for the great post!
Will T. says
Also, do you know if AMD will be more compelling in terms of multi-GPU setups? I am thinking about 2x 3080 vs 2x 6800 XT. I see that the CPU-GPU memory transfer speeds are slightly lower for AMD cards, but I am curious if smart access memory will change anything on this front.
Tim Dettmers says
If you use PCIe as an interface (that is what you would use in 95% of cases), both should be similar. However, not all libraries support ROCm GPUs and have equivalents to the NVIDIA libraries for parallelism. NVIDIA GPU RDMA is, for example, a technology that only supports Mellanox cards and NVIDIA GPUs. NVIDIA has a dedicated library that uses it and has optimized GPU-to-GPU memory transfers for multi-node GPU clusters. I do not think AMD will catch up in cross-node communication for some time.
Tim Dettmers says
The problem with the RX 6800 XT might be that you are not able to use it in the first place. There was a thread on GitHub in the ROCm repository where developers said that non-workstation GPUs were never really considered for running ROCm. This might be changing in the future, but it seems it is not straightforward to use these new GPUs right out of the box.
Otherwise, the memory on the RX 6800 XT does help, but these cards do not have Tensor Cores and will be much slower than NVIDIA GPUs. So it is mostly a tradeoff between speed and memory (if ROCm works). For BERT/ResNet I can easily see the RX 6800 XT being half the speed of an RTX 3080.
Chanhyuk Jung says
Apple recently released a TensorFlow fork with hardware acceleration for Macs. An installation script is provided, and AMD GPUs, Intel's integrated GPUs, and eGPUs on Macs seem to work well.
Since Macs are popular among programmers, I think it's great for testing models on your laptop before training on a dedicated server or workstation. Apple's new M1 chip shares RAM with the GPU, so you can run large models, and hopefully with a faster GPU in the upcoming MacBook Pros maybe even prototyping could be possible.
Do you think this is apple trying to be competitive in deep learning or is it just adding support just because they can?
Tim Dettmers says
It is just adding support, it seems. The Apple M1 processor is not powerful enough to train neural networks but will be very useful to prototype neural networks that you will deploy on iPhones. As such, it is an excellent processor to work with deep learning on iPhones, but you would probably train neural networks on GPU servers and transfer the weights to your MacBook to do further prototyping on the already trained network.
John says
Hey guys,
great work Tim!
What are your views on ASICs?
Google's effort on the TPU seems solid. There are more startups/companies claiming big performance, and some of them have already begun selling their ASICs, but I don't see much adoption in the community.
Is this where we are headed?
Cheers!
Tim Dettmers says
ASICs are great! TPUs are solid, just as you said. The problem with ASICs is their enormous cost in R&D and the need for a good compiler/software pipeline. If startups shoulder that cost, there is still the software and community problem. The most successful approaches compile PyTorch/TensorFlow graphs to something that can be understood by the ASIC. If this is not available, it gets difficult. The main problem with ASICs is usability. The fastest accelerator is worthless if you cannot use it! That is why all Intel accelerators failed. Once you get a usable ASIC, it is about community. NVIDIA GPUs have such a large community that if you have a problem, you can find a solution easily by googling or by asking a random person on the internet. With ASICs, there is no community, and only experts from the company can help you. So a fast ASIC is the first step, but not the most important step to ASIC adoption.
In short, ASICs will find more use in the future and have huge potential, but their potential is limited by software usability and the community’s size around them.
Siddhant Kundu says
Thanks a lot for the informative article, it’s going to be invaluable while making my purchasing decisions!
I had a couple of follow-up questions regarding a build I want to do in the near future (early 2021 is the target, for now). I want to build an SFF PC that I will be able to carry around with ease. I’m mostly going to be working on game-dev based tasks, and I’m interested in working on RL-based networks for training my game’s decision making AI. I need a high amount of VRAM for blender renders and a good amount of hardware-accelerated compute capability for my ML models, and I’ll probably be running both at the same time. I’m torn between the 16GB frame buffer on the RX 6800XT and the tensor cores on the RTX 3080, and since the 3080 20GB model does not seem to be on the roadmap right now, I think my choices are limited to those two SKUs. (The 3090 is off the table since prices are already absurdly inflated in my country.) Also, is there any particular AIB model you would recommend? (I can go with up to 3-slot GPUs)
Tim Dettmers says
I am not sure about Blender and its support for GPUs, but what I have heard so far is that the new AMD GPUs do not support ROCm out of the box; support might be added later. With that, I would probably go with an RTX 3080. The VRAM on that one is a little small, though. It is a difficult choice. If training RL-based networks is more important, the RTX 3080; if rendering is more important, the RX 6800 XT.
Pablo Remolino says
Hi,
great article.
My question:
It seems clear that the RTX 30xx cards are far better than NVIDIA Tesla in terms of cost efficiency, but what about Mean Time Between Failures (MTBF)? For the Tesla V100 it is 1,005,907 hours (about 100 years). (source: https://images.nvidia.com/content/tesla/pdf/Tesla-V100-PCIe-Product-Brief.pdf)
I couldn't find any estimates for RTX gaming cards. I've read (I don't remember where) that it could be something like 3-5 years. Can you confirm it?
If it's true, it may be a real problem if you have not just a couple of cards but, for example, 30-50 GPUs. There is a fair chance that you would have to replace a GPU every few weeks. It is also possible that NVIDIA will not be eager to honor the warranty, as they have banned RTX cards for data centers.
Tim Dettmers says
Hi Pablo, I never had a personal gaming GPU fail. From the ~30 GPUs that I used at universities, I had one fail. From a small GPU cluster I was using, I also saw one GPU fail (1 out of 48 GPUs). Some GPUs are known to have much higher failure rates than others (RTX 2080 Ti and RTX 2080 Founders Edition in particular).
I think for 30-50 GPUs, you can expect to replace one GPU every six months or so. Some vendors have guarantees on RTX cards for data centers, but this is rare and might incur extra costs.
teemu says
Thanks for the great article!
I see you do not discuss the 2080 Ti much in your latest recommendations. Is this because it is not manufactured any more? I still see some 2080 Ti cards as B-stock at about the same price as a 3070 where I am. Do you think it is still a viable alternative to a 3070/3080 for DL?
I also considered the 3090, but it seems a bit overly expensive for anything other than the big memory. However, memory often seems to be a limiting factor. As far as I understand from your post, using multiple GPUs is possible for training a model. Does this also enable training bigger models that do not fit in the memory of a single GPU? If so, I could also consider getting one 3070/2080 Ti and maybe adding another later.
I mostly do some Kaggle and applied ML/DL work. But sometimes I like to finetune some transformers and train some of the bigger CV models etc.
Tim Dettmers says
The RTX 2080 Ti is still a top-notch GPU! If you can find it cheap, it is definitely worth picking up. It is a great alternative to an RTX 3070 and on a par with an RTX 3080 if you need a little extra memory.
If memory is a problem, you might also want to wait until Q1 2021 for the release of the new RTX 3070 Ti, etc., which are expected to have extended memory. I think waiting for the big-memory GPUs is a better choice than buying more 2080 Ti/3070 cards later.
Antoine says
Thanks a lot for the enlightening article, Tim !
Like teemu, I'm not sure whether using two GPUs in parallel allows training models that would not fit into a single GPU.
Specifically, if I buy two RTX 3080s (with 10GB of memory each), will I be able to train models larger than 10GB? Or should I buy an RTX 3090 if I plan to do so?
Tim Dettmers says
Hi Antoine,
that only works if you use model parallelism, which is getting more and more common, but which is not yet as widely supported as data parallelism. For now, a single RTX 3090 will be better for training large models.
Gokhan T says
Hi Tim. Great article. Saves tons of hours multiplied by thousands of people.
Picking the right motherboard is really tricky, though. I am trying to build a system with only one RTX 3080, but I want the system to be expandable. If I go with the AMD Ryzen 3/5/7/9 series, more than 2 GPUs doesn't seem meaningful since 24 lanes are not enough. Am I right so far? Or does it make sense?
And the second problem I have: let's say a 2-GPU system is my only option because of budget restrictions. Motherboard descriptions are not explicit. Most of the time the descriptions break down lane usage by CPU generation. For example:
1 x PCIe 4.0/3.0 x16 slot (PCIE_1)
– 3rd Gen AMD Ryzen support PCIe 4.0 x16 mode
– 2nd Gen AMD Ryzen support PCIe 3.0 x16 mode
– Ryzen with Radeon Vega Graphics and 2nd Gen AMD Ryzen with Radeon Graphics support PCIe 3.0 x8 mode
1 x PCIe 4.0/3.0 x16 slot (PCIE_3, supports x4 mode)
I cannot understand if it is x4/x4/x4 or x8/x8/x4 or something else. Do you know how I can figure it out?
Tim Dettmers says
Hi Gokhan, it depends. Usually running 3 GPUs at 4 lanes each is quite okay. Parallelism will not be that great, but it can still yield good speedups, and if you use your GPUs independently you should see almost no decrease in performance. What you should make sure of, though, is that your motherboard supports x4/x4/x4 setups for 3 GPUs (sometimes motherboards only support x8/x4). You can usually find this information in the Newegg specification section for the motherboard in question. It is also worth searching for the motherboard and seeing if others have done 3+ GPU builds with it.
Gokhan T says
Thank you, Tim. I found that the Gigabyte Aorus Ultra and Master motherboards provide x8/x8/x4 over PCIe 4.0. This is the cheapest board I could find, and I hope it will work. I contacted Gigabyte and asked if it is possible to use the board with 3 RTX 3080 cards in an x8/x8/x4 distribution, and they said yes, it is possible. However, they suggested using a 1500+W PSU. That matches the calculations you did in the article.
Harsh Rangwani says
Hi Gokhan T,
One more thing that you might have to consider is the spacing between PCIe slots. As RTX 3080 cards are usually more than 2.5 PCIe slots wide, I do have apprehensions about the Gigabyte Aorus Master being able to fit 3 cards at the same time without PCIe extenders. Also, currently the large majority of PCIe extenders are 3.0, so I don't know if x4 PCIe 3.0 would be good enough for an RTX 3080. Let's see if Tim has a workaround for this.
Thanks
Arnaud Maréchal says
Thanks for this great learning article.
The OpenCL package allows R to leverage the computing power of GPUs.
Also, I understand that scikit-learn does not support GPUs, but some alternatives such as scikit-cuda provide Python interfaces to many of the functions in the CUDA device/runtime, cuBLAS, cuFFT, and cuSOLVER libraries.
What are your views on using GPUs in R or Python with these packages, especially when not using large NNs?
Thanks,
Arnaud
Tim Dettmers says
Hi Arnaud, I see sklearn more as an exploration tool. I think you can always explore algorithms on a smaller scale and then use dedicated GPU implementations. For R, I believe GPUs should be used similarly. But I think in the end, it is always better to reserve R/sklearn for analysis and prototyping on a small part of the data and then roll out on GPUs.
Arnaud Maréchal says
Hi Tim.
I appreciate your answer and advice
We learn a lot from expert’s feedback, thanks for sharing your experience.
Arnaud
Bram Vandenbon says
Do you think it's better to use a single-rail or multi-rail power supply unit when hooking up multiple GPUs?
Tim Dettmers says
A single rail is usually better because it has a standard form factor, which allows using standard cases. If you do not need/want to use standard cases, a double rail might save you some headaches when using 4x RTX 3090 or other power-hungry GPUs.
Di Lai says
Hi,Tim
I have a quick question: if I configure a 2x RTX 3090 machine (only using a 2x RTX 30xx-ready combination of motherboard + power supply), do you suggest using air cooling or liquid cooling?
Thanks !
DL
Tim Dettmers says
If you have a PCIe slot in between the GPUs, air is just fine. Otherwise, still use air but buy the blower GPU variant.
Ryan DS says
Do AMD Radeon 5000/6000 GPUs support ROCm?
It doesn’t seem like it according to: https://github.com/RadeonOpenCompute/ROCm/issues/887.
Tim Dettmers says
Thanks for the link with the info. It indeed seems these GPUs are not supported immediately. They might be added later, but there does not seem to be a big official push since ROCm is designed for datacenter GPUs.
Shuhao Cao says
Hi Tim,
Thanks for this nice article. As I constantly got OOM errors in a current Kaggle competition (16GB of memory for training a graph transformer), I snagged a 3090 on Newegg and started building a single-card rig around it. Now I have a simple question: should I use the integrated graphics on the CPU to connect to the monitor for display purposes?
Does connecting the graphics card to dual QHD monitors affect its performance during training?
Thanks,
Shuhao
Tim Dettmers says
Usually, the displays do not need that much memory. I run 3 displays at 1080×1920 and that usually uses 300-500MB. If you want to use less I would suggest getting a small NVIDIA card. Otherwise, it can be a mess to get all the drivers working (you need an active NVIDIA driver to run CUDA).
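If you are curious how much the desktop environment is already occupying, here is a minimal check in Python (assumes a recent PyTorch version):

import torch

# Report how much memory is already in use on GPU 0 (display buffers,
# other processes, etc.) before training starts.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
used_mb = (total_bytes - free_bytes) / 1024**2
print(f"{used_mb:.0f} MB already in use out of {total_bytes / 1024**2:.0f} MB")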
Shuhao Cao says
Thanks Tim. I followed this guide (https://gist.github.com/alexlee-gk/76a409f62a53883971a18a11af93241b link for anyone who is interested) to configure Ubuntu so that the integrated graphics is for display and 3090 for CUDA. However, it does not seem possible to do the same thing on Windows.
Tim Dettmers says
This is very useful — thanks for sharing!
Brad Love says
Amazing blog; thank you.
Do you have a reference or other justification for the utilization rates you state? I copy them below for your convenience.
Best,
Brad
Common utilization rates are the following:
PhD student personal desktop: < 15%
PhD student slurm GPU cluster: > 35%
Company-wide slurm research cluster: > 60%
Commonly, most personal systems have a utilization rate between 5-10%
Tim Dettmers says
Hi Brad! The 5-10% figure comes from desktop machines of PhD students at UW. I have had access to ~10 machines with about 30 GPUs in total. At any one time, 2-4 GPUs were used. The 35% figure is actually not from a personal desktop but from a department-wide research cluster at UW (I made a mistake here). The 35% figure is the utilization over a couple of months of all GPUs that the cluster has. The >60% figure comes from GPU clusters I have worked on before (Switzerland, an old cluster at UCL, one at Facebook).
Brad Love says
Thanks!!!
Vinamra Singhai says
Thanks for such a great article. I have some questions –
1) Should I wait for a comparison between the Radeon RX 6900/6800 XT and the RTX 3080 for DL/AI testing results?
2) What is the difference between a desktop GPU and a workstation GPU? Which is the best workstation GPU in terms of the performance-to-cost ratio?
Thanks again.
Vinamra Singhai says
Which is worth more: 2x RTX 3080 or a single RTX 3090?
Tim Dettmers says
I would go for two RTX 3080.
Tim Dettmers says
1) The NVIDIA GPUs are currently better since they have TensorCores. I would just go with the RTX 3080 GPU
2) There is no difference. Workstation GPUs are more expensive though and sometimes they have more memory. I do not recommend workstation GPUs. If you must buy one I recommend the RTX 6000 or RTX 8000.
Letmos says
Hello, NVIDIA has a monopoly on ML on GPUs, but things are changing (unfortunately, very slowly!). New cards from AMD have impressive performance, a good price, and 16 GB of VRAM. They lack Tensor Cores, but overall they are a good choice for most games and pro software. In the case of ML, NVIDIA is number one, but I hope this will change soon. Some competition is always good.
On AMD’s website there are docker-based versions of last TensorFlow and Pytorch.
( https://www.amd.com/en/graphics/servers-solutions-rocm-ml ). Has anyone used them? What are your opinions about those Docker-based apps? Is the installation process straightforward?
Thanks!
Tim Dettmers says
Docker-based options should be pretty straightforward to install. Installing ROCm and PyTorch should also be relatively easy.
Tony Shi says
Thanks for the detailed information. Are these RTX 30 series cards really useful for machine learning in data science? NVIDIA's website just says they use AI for gaming.
Tim Dettmers says
The RTX 30 series GPUs are great for ML and data science. Do not be confused by what NVIDIA says — they want you to spend more money on a Quadro or Tesla GPU.
John says
Hey Tim,
Any comments on the difference for an eGPU attached via Thunderbolt 3 with 2 PCIe lanes instead of 4? You can assume an RTX 3080 if you need to hold something else constant.
I have seen a few articles saying the difference isn’t all that huge and others that say don’t bother unless you can have a full 4 lanes. I am wondering what the performance impact might actually be but haven’t come across anything googling yet that really breaks it down for deep learning.
Thanks!
Tim Dettmers says
Hey John, 2 PCIe lanes should be okay if you just use 1 GPU. In that case, you only transfer data to the GPU at 2 GB/s, which is still quite fast even in the case of large images. To give an example: if your mini-batch takes 100ms through the network and you transfer a batch of 32 ImageNet images (224x224x3), then the transfer on 4 PCIe lanes will take 2.2ms, while on 2 PCIe lanes it will take 4.5ms. So you will be about 2% slower on 2 PCIe lanes in this case. If your network takes 10ms per mini-batch, it will be about 20% slower; if it takes 1000ms per mini-batch, it will be about 0.2% slower, etc.
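To make the arithmetic behind that example explicit, here is a minimal sketch in Python (assumptions: fp16 inputs at 2 bytes per value, roughly 4 GB/s effective on 4 PCIe 3.0 lanes and 2 GB/s on 2 lanes; real effective bandwidth varies, so the exact milliseconds differ slightly from the figures above):

# Rough PCIe transfer-time estimate for one mini-batch.
batch_bytes = 32 * 224 * 224 * 3 * 2   # 32 ImageNet images, 2 bytes per value
t4 = batch_bytes / 4e9 * 1000          # ms on 4 lanes (~2.4 ms)
t2 = batch_bytes / 2e9 * 1000          # ms on 2 lanes (~4.8 ms)
print(f"4 lanes: {t4:.1f} ms, 2 lanes: {t2:.1f} ms per batch")

# The relative slowdown depends on how long one mini-batch takes to compute.
for net_ms in (100, 10, 1000):
    print(f"{net_ms} ms/batch network -> ~{100 * (t2 - t4) / net_ms:.1f}% slower on 2 lanes")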
Miguel says
Hi! Congrats on the amazing post. It is very useful for us beginners… 😉
What's your opinion about the Radeon RX 5700 models? In some benchmarks they outperform or are very close to the RTX 2060 and 2070 models, for just $400 brand-new.
I am a Radeon and Fedora user, somewhere between "I want to try deep learning, but I am not serious about it" and "I started deep learning, and I am serious about it". I happily got my old Radeon (Pro WX 4150) working with ROCm 3.8, and I am now hesitating a lot about whether I should upgrade to a more powerful Radeon and continue with the current setup, or jump to NVIDIA and take the risk of having some issues when installing CUDA on Fedora…
Thanks so much.
Tim Dettmers says
Hi Miguel! Installing CUDA in any UNIX system should be relatively easy, so do not worry about that. I think you would get better performance out of NVIDIA GPUs, especially at the high end, mainly due to Tensor Cores, but if you are happy with a little less performance, Radeon GPUs are just fine. You might have some usability issues here and there, but if you are already using ROCm 3.8, you should already have some good experience under your belt. In general, using AMD GPUs is also quite useful for the community as we get more data on user experience, which will help defuse the NVIDIA monopoly. If you want to support the community, buying an AMD GPU and writing an experience report about it would be very helpful and valuable!
Jeff says
What an outstanding resource. Thanks so much.
Karthik Rao says
Hi Tim and other readers of this great source.
I am planning to get a new rig mostly for Text and NLP applications, might use for Images and Video too.
I plan to put in one RTX 3080 for now, but would like to build it such that I can add up to 3 more cards. Considering all the potential cooling and power issues, I am open to a two-chassis build. One chassis could host my CPU, RAM, storage, power supply, etc.; basically a PC.
The second chassis could be a GPU enclosure of sorts that connects over a PCIe 4.0 x16 cable to the first chassis and holds up to 4 GPUs and power supplies. I am OK even if I need 4 such cables, one for each GPU. This way, my GPU enclosure can have lots of space between GPUs, and a 1200 W PSU should be more than sufficient for the GPU enclosure.
Can this kind of Frankenstein build work?
Tim Dettmers says
There is now more information about cooling and power. It seems power limiting works well and does not limit performance much. Also cooling can be done with ordinary blower-style GPUs. See this article for more information: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop—Does-it-work-1935/
Karthik Rao says
Thanks so much. This looks encouraging. Especially as I am considering 3080s, and at 230 V, so just a 10 A draw. And maybe a lower-TDP processor like the Ryzen 7 3700.
Will keep an eye out for updates on the blower editions.
Karthik Rao says
I read around a bit more and a couple of other things I realized:
1. It is a lot easier to plug in a few PCIe cables than it is to assemble a whole PC.
2. Having an external enclosure with its own power also means I can leave the GPUs off and use only the regular PC.
So if it is possible, I still want to try the frankenbuild option.
Will such a build work? Any big stumbling blocks?
Would deeply appreciate any pointers.
Thanks
Karthik
Harsh Rangwani says
Hi Tim,
Thanks for your wonderful post. https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop—Does-it-work-1935/
In the above blog post they use Xeon and Asus SAGE motherboard. Our lab is planning to build a 2x RTX 3090 setup but want to add GPUs later as well. Will your build described with Threadripper https://pcpartpicker.com/user/tim_dettmers/saved/#view=wNyxsY
work for us for the 4x RTX 3090 setup with blower cards?
Tim Dettmers says
Hi, Harsh! Yes, the Threadripper build should work just fine. I think you just need to make sure that you have enough space in the case.
Di Lai says
Hi, Tim
Very nice article! Thank you so much for the effort you put into this. I just would like to ask you a question: if I plan to buy a GPU workstation for deep learning, should I buy a brand name (like Dell, Lenovo, etc.), or a third-party build (like Lambda Labs), or should I consider DIY? What is the usual practice? My budget is around $15k; what is the best machine that I can buy?
Thanks a lot!
Tim Dettmers says
DIY is usually much cheaper and you have more control over the combinations of pieces that you buy. If you do not need a strong CPU you do not have to buy one if you do DIY. Dell, Lenovo are often enterprise machines that are well balanced — which means you will waste a lot of money on things that you do not need. LambdaLabs computers are deep learning optimized, but highly overpriced. In any case, DIY + youtube tutorials are your best option. If you do not want that I would probably go with a LambdaLabs computer.
For $15k you can pretty much buy a 4x GPU machine. You could use the 4-GPU barebone in my blog post and extend it with 4 GPUs of your choice.
Di Lai says
Thank you so much for your precious advice!
Josh Brown Kramer says
This post is amazing and is nearly prompting me to buy some RTX3080s. There are some things I want to clear up though.
The performance metrics indicate that the 3080 is 1.25x faster than a 2080 Ti at CNNs. From the rest of the discussion, every other component of a box supporting 2080 Tis ought to be cheaper than for the 3080 (cheaper cooling and power supply in particular). So if – theoretically – the 3080 cost the same as a 2080 Ti, then I would expect the normalized performance/$ for a 2080 Ti to be at least 1/1.25 = 0.8. But it's significantly less than that for every build configuration. OK, so that must mean that 3080s are cheaper than 2080 Tis in your model. In practice it seems like I can get a 3080 for $1400 or $1500 on eBay, and they are not available from retailers. The 2080 Ti, however, seems to be available for $800 – $1000 on eBay (and something like $1300 on Amazon). So I understand that this is probably a shortage issue: there is high demand for scarce 3080 cards. However, my question is whether we can ever expect this analysis to hold up. Given the relative performance advantage of the 3080, can't we expect it to ALWAYS cost more than a 2080 Ti? Your analysis seems to indicate that it will cost significantly LESS than a 2080 Ti. Granted, it has less memory, but not much less, and it takes more power, but its performance/watt is very close to the 2080 Ti.
I hope you can clear things up for me. Thanks again for the terrific article.
Tim Dettmers says
Hi Josh! So currently, the prices are normalized by the cost of a full desktop. For example, the 2 GPU chart is for a typical 2-GPU desktop. A more powerful power supply is about $50 per GPU. Cooling seems to be sufficient if you pick the right GPUs. It seems “Turbo” RTX 3090s do not need any water cooling to work in a 4x setup. This means that 4x RTX 3080 with a blower-style “Turbo” fan should be enough. I think these cards are not any more expensive than regular GPUs. This means the bottom line is that you do not pay so much more extra, and the RTX 3080 remains the most cost-efficient GPU despite the additional power requirements. Does this help?
Josh Brown Kramer says
My main concern is that in practice right now almost every 3080 I can buy costs more than a typical 2080Ti, whereas the analysis seems to indicate that the 3080 costs significantly less. Furthermore, given the relative strengths of the 3080 it’s hard to see why that would change.
Henry says
That's because everyone wants a 3080 right now and they're going for much more than MSRP. Your best bet (if you can) is to wait for NVIDIA to meet demand and the price will come back down to MSRP. Otherwise you can track the inventory at reputable retailers to get a 3080 at a reasonable price. As for the 2080 Ti pricing, my hunch is that it has gone up recently due to unscrupulous sellers hoping that people looking to get a 3080 make a mistake and buy the 2080 instead.
Ulf says
NVIDIA did pretty much a 'paper launch'. After November things should get more normal, especially since AMD has a competing product for gamers out soon. But you are right in a way: you will probably not get a good 3080 for 800 USD.
On the other hand: what's the alternative? Certainly not buying the last generation.
Adrian G says
Thanks for all your help via your blogs over the years.
I am now in a situation where I have 2 X99 workstations, one with 2x RTX 2080 Ti and one with 3x RTX 2080 Ti (I couldn't put 4 in this one because I bought cheap used 2.5-slot-wide GPUs, and one is already on a PCIe riser).
I want to connect the 2 machines using high speed network cards and fiber.
Is having 100mbit/s network speed an absolute must or could I get away with 40/50mbit/s?
I haven't found any 100 Mbit/s Mellanox InfiniBand cards for less than ~$400 USD each, which is too pricey for me. Once the network is set up, is SLURM the best way to distribute the load?
Tim Dettmers says
I am sure you meant Gbit/s and not Mbit/s. 40/50 Gbit/s is sufficient if you have only 5x GPUs in total, but I am not sure if it is worth the effort. I am not sure how difficult it is to set up InfiniBand with RTX GPUs as it is officially not supported. It might be that you cannot do this and instead the communication would be GPU->CPU->InfiniBand->CPU->GPU, which is still fast enough for good parallelization, but 3 GPUs might already come quite close to that performance if parallelized. I think it would be more effective to buy a new case and riser and try to fit 4x GPUs into one box.
Rory McCallion says
Hello! Thanks for compiling all of this information and staying on top of it 🙂 . Huge help to the community!
I have a request for clarity. In your Quora article, you wrote:
“However, if you now use a fleet of either Ferraris and big trucks (thread parallelism), and you have a big job with many packages (large chunks of memory such as matrices) then you will wait for the first truck a bit, but after that you will have no waiting time at all — unloading the packages takes so much time that all the trucks will queue in unloading location B so that you always have direct access to your packages (memory). This effectively hides latency so that GPUs offer high bandwidth while hiding their latency under thread parallelism — so for large chunks of memory GPUs provide the best memory bandwidth while having almost no drawback due to latency via thread parallelism.”
I do not understand this. I love the metaphor with fast cars vs trucks, but I’m not sure how they work together in this situation, or why having big trucks wait makes things fast (IDK if that’s what you meant to convey, but that’s how I’m reading it). It seems like the crux of the article, so I’m eager to understand this passage deeper. Thank you for all your help 🙂 .
Tim Dettmers says
Hi Rory! To go along with this metaphor, you can imagine you are working at a loading dock. The speed at which you can unload packages is 1 package per minute. If a Ferrari with 1 package comes every 30 minutes, you will be idle for 29 minutes. A truck might hold 100 packages, but it needs 60 minutes to make the trip to the loading dock. This means you will wait 60 minutes for the first truck to arrive, but subsequent trucks arrive before you can finish unloading the previous one. This means using a truck for package delivery will be faster once you need 3 packages (the Ferraris take 90 minutes, the truck takes 60 minutes). For CPUs, we often only need 1 package (1+2), and that is why a Ferrari is better there, while for GPUs we often need multiple packages (A*B) at once. Let me know if this is still unclear.
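The same numbers in a tiny sketch (values taken from the metaphor above, not from any real hardware):

```python
def ferrari_minutes(n):   # one package per Ferrari, a new Ferrari every 30 minutes
    return 30 * n

def truck_minutes(n):     # first truck arrives after 60 minutes carrying 100 packages
    assert n <= 100
    return 60

for n in (1, 2, 3, 10):
    print(n, ferrari_minutes(n), truck_minutes(n))
# n=1: 30 vs 60 minutes (the low-latency Ferrari/CPU wins);
# n>=3: 90+ vs 60 minutes (the high-bandwidth truck/GPU wins)
```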
eitamar saraf says
Hey,
First, Thanks for sharing.
Second, I saw a Tesla K80 on eBay for $300.
Besides the cooling issue, the power consumption, and the slowness (and it's a lot!),
Is there an alternative for 10+ GB at a normal price?
Tim Dettmers says
A K80 for $300 is pretty good! I think otherwise, you might be able to get one of the old Titan cards for less than $300, but it will not be much less than that.
chanhyuk jung says
Can I plug a GPU into a PCIe slot connected to the chipset? The GPU would be connected to the chipset via PCIe 4.0 x4, and the chipset is connected to the CPU via PCIe 4.0 x4. I want to use three 3080s for multi-GPU training and for running separate experiments on each GPU.
Nadir says
Hi Tim, thank you for the in-depth guide! Do you have any suggestions on GPU (core/memory) overclocking? How much of overclocking would you consider for the RTX 3090 FE? (I’m mostly working on RNN and transformer for time series forecasting.)
Tim Dettmers says
Overclocking often does not yield great improvements in performance, and it is difficult to do under Linux, especially if you have multiple GPUs. If you overclock, memory overclocking will give you much better performance than core overclocking. But make sure that these clocks are stable at the high temperatures and long durations that you run normal neural networks under.
Geoff Seyon says
Wow man! Given today’s nail-biting decision around the availability of RTX 3080 and RTX 3090 cards, this was an INVALUABLE article! Thanks Tim!
Geoff Seyon says
Tim,
Any thoughts on the speculated 3080 Ti versus the 3090 for deep learning workstations?
Thanks,
Geoff
Tim Dettmers says
It all depends on the details. It will probably be like with the previous series, that the RTX 3080 Ti will be much more cost-efficient, but we will have to see. It also depends on supply. If you cannot buy these cards at a reasonable price, the comparison does not matter much anyway.
Manish says
Leaks suggest 20 GB of VRAM on the 3080 Ti. Will there be any chance of the 3090 competing with the new card?
Tim Dettmers says
If the rumors are true, the RTX 3080 Ti will be way better than the RTX 3090 in terms of price performance. I would probably no longer recommend the RTX 3090 (except maybe in 8x GPU builds).
George Pongracz says
Thanks for this great article.
I am looking to self-study with a machine at home and was interested in your thoughts with regard to the recent update that an RTX 3070 16GB will be released in December 2020, and how a card like this would slot into your hierarchy.
Tim Dettmers says
This is definitely an interesting development! An RTX 3070 with 16 GB would be great for learning deep learning. However, it also seems that an RTX 3060 with 8 GB of memory will be released. Depending on the price, this might actually be the better card for learning deep learning until you are sure you want to commit to deep learning or a particular sub-area like RL, NLP, or CV, which need very different GPUs/CPUs. The money that you save on an RTX 3060 compared to an RTX 3070 might later buy you a much better GPU that is more appropriate for the specific area where you want to use deep learning.
Devjeet Roy says
Hi Tim! Thanks for the post, it is really informative and comprehensive.
I wanted to ask you real quick about potentially upgrading my rig. I'm a PhD student 5 hours away from you at Washington State University. To keep it brief, I'm looking to pretrain Transformers for source-code-oriented tasks. Currently, I have 2x 2080 Tis and I'm definitely running into problems with model size (after trying some of the tricks you mentioned earlier using PyTorch Lightning). In the past, I was able to get Google's TensorFlow Research Cloud access for a large model to deal with these issues.
Do you think it’s worthwhile upgrading to a 3090 (and possibly putting my 2080Tis in a second machine)?
The 24 GB of VRAM seems enticing, although from my reading it seems clear that even with that amount of memory, pretraining Transformers might be untenable. However, it might speed up prototyping for my research. Also, I don't really think I'll be able to get more than 1. For now, we're not an ML lab, although I personally am moving more towards applied ML for my thesis, so I'm not able to justify these expenses for funding.
Tim Dettmers says
Hi Devjeet! Two RTX 2080 Tis should be faster than a single RTX 3090 if you use parallelism. There are better and better implementations of model parallelism and other types of parallelism in NLP frameworks, so if you still have some patience for extra programming, you will fare better with the two RTX 2080 Tis. I know that fairseq will soon support model parallelism out of the box, and, with a bit of time, fairseq will also have DeepSpeed parallelism implemented. I would go with 2x RTX 2080 Ti and save that money for the next line of GPUs (probably 2023).
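For readers who want to try this, a minimal model-parallel sketch in PyTorch (layer sizes and device IDs are made up, and it assumes two visible CUDA devices; this is the basic pattern, not the fairseq/DeepSpeed implementation mentioned above):

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Splits a network across two GPUs so their memory is effectively pooled."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations hop from GPU 0 to GPU 1

model = TwoGPUNet()
loss = model(torch.randn(8, 1024)).sum()
loss.backward()                              # autograd routes gradients back across devices
```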
Joseph says
This is by far the most informative article about building machine learning rig.
My own machine is a 3-year-old PC with an i5-6600K (4 cores, 4 threads). I'm thinking of upgrading the video card from a 1060 to a 3070. Does this move make sense? Will my CPU be a huge bottleneck for the setup?
Tim Dettmers says
It should be perfectly fine if you use a single RTX 3070 for most cases. The only case where the CPU could become a bottleneck is if you do heavy preprocessing on the CPU, for example, multiple variable image processing techniques like cutout on each mini-batch. For straight CNNs or Transformers, you should see a decrease in performance of at most 10% compared to a top-notch CPU.
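If CPU preprocessing does become the bottleneck, the usual fix in PyTorch is to spread it over DataLoader workers; a minimal sketch with a dummy dataset standing in for real image data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for an image dataset whose __getitem__ does CPU-side augmentation.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4,     # preprocessing runs in parallel worker processes
                    pin_memory=True)   # page-locked buffers speed up host-to-GPU copies

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # async copy overlaps with GPU compute
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward pass goes here
```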
Mark Antony says
Hi Tim!
Thanks for sharing this info!
I am a newbie to building a pc. I want to train big models potentially for days on my pc but I am worried a power surge might ruin the pc.
If I were to use four 3080s and a 1600 W PSU (I think the AX1600i has surge protection built in), would you recommend using a surge protector or even a UPS?
Tim Dettmers says
Hi Mark! Currently, nobody has experience with this, so I cannot give you any solid recommendations. I think a UPS might be overkill, but a surge protector socket does not hurt — I usually have my computer behind one in any case. I believe if you power-limit the GPUs, it might be safe, but if you want to make sure, either go with 3x RTX 3080 or wait for user reports if 4x RTX 3080 rigs with 1600W PSU are stable.
Jay says
Hi Tim,
Thank you for this post, it was extremely helpful! I’m somewhat new to the field of ML/DL and am beginning my career in this field in the long term. I am quite familiar with Kaggle and will likely be working with that as well as training smaller/beginner-intermediate DL models for personal projects and beginning an MS program in the future, so I was wondering about the right setup and GPU to get.
Do you see any issues with my parts? https://pcpartpicker.com/list/DCbfMc . I'm aiming for the 2080S, which I was able to find used (for almost half of the $1000 list price). Is the VRAM enough for my current use case, or when should I think about upgrading to a 2x 3000-series setup (if necessary)?
Additionally, on the software end, do you have recommendations/preferences for ML environments, such as either Windows or Linux (or both)? It would be extremely helpful if you could include an ML workstation software setup guide someday! Thank you!
Tim Dettmers says
Hi Jay,
I think your build looks good. I would start with 32 GB RAM — you can always upgrade later! I would increase the wattage on the PSU a bit so that there is room for a 2nd GPU later. Otherwise, it looks solid to me!
For software, I would use Ubuntu 20.04 with Anaconda as a package manager along with PyTorch. I think that is the easiest setup to get started and to experiment with deep learning.
Good luck!
Andrew Webb says
Great info as usual, thanks!
“Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090… This compares to $2.14 per hour for the AWS spot instance.”
Can I ask which instance type this is? I believe the p3.2xlarge has a single V100 and a spot price of $0.918 in region us-east-1.
Tim Dettmers says
Thanks, Andrew! I did not realize that something was wrong here until your reply on Twitter — thanks for making me aware of that! I think I took the on-demand instance price and calculated with it, but later thought I had used the spot instance price. I will correct that by including two calculations for spot/on-demand instances sometime in the next few days. I will also update the rule of thumb and recommendations that stem from that calculation.
Giulia Savorgnan says
Simply: thank you for this!!
QwwqWq says
Do you think the 3090 will have good FP16 compute performance for its price after NVIDIA announced that it has been purposely nerfed for AI training workloads?
Source:
RTX 3090 has been purposely nerfed by Nvidia at driver level.
https://www.reddit.com/r/MachineLearning/comments/iz7lu2/d_rtx_3090_has_been_purposely_nerfed_by_nvidia_at/
Tim Dettmers says
Yes, we got the first solid benchmarks and my RTX 3090 prediction is on point. As such, the RTX 3090 is still the best choice in some cases.
andrea de luca says
Hi Tim. An important question about the 3090 (and other consumer Amperes).
As we already know, despite being a lot less powerful than the 3090 in raw numbers, the professional Turings (Titan, Quadro) perform a lot better in certain CAD/CAM domains.
In spite of this, I was convinced that such an issue would not affect our domain. Still, listen to this video starting from the position I'm linking: https://youtu.be/YjcxrfEVhc8?t=576
In particular, at 10:02 in that video, it shows NVIDIA's own reply: "For AI applications, Titan RTX is better than the 3090".
Could you please explain what kind of features the consumer Amperes do miss with respect to professional Turings?
Veedrac says
Right now the only advantage to the Titan RTX over the 3090 is higher FP16 w/ FP32 accumulate, for mixed precision, as the GeForce line is half-rate. My understanding is that most libraries support mixed precision while just using pure FP16 matrix multiplies, and therefore this isn’t very important most of the time, though it may add stability.
The upcoming, unannounced Ampere Titan may have more significant advantages, since it will not only have full-rate FP16 w/ FP32 accumulate, but also full rate BF16 and TF32, both of which (AFAIK) require FP32 accumulation. So if you expect to use either of those and are willing to pay double, waiting for the new Titan might be better. The Ampere Titan might also have more memory, perhaps as high as 48 GB.
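As a reference point for the mixed-precision discussion above, this is roughly what an automatic mixed-precision training step looks like in PyTorch 1.6+ (the model, optimizer, and data here are placeholders):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 512).cuda()                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

x = torch.randn(64, 512, device="cuda")
optimizer.zero_grad()
with autocast():                        # matrix multiplies run in FP16 on Tensor Cores
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()           # loss scaling avoids FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```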
Bryan says
Tensor Cores are being (intentionally) limited for consumer-level cards built on the Ampere architecture to drive sales for the Titan/Quadro/Tesla lines.
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
Tim Dettmers says
We just got a new benchmark which shows that the RTX 3090 has a solid lead of about 25% over the Titan RTX. So I guess NVIDIA wants to sell more Titan RTX cards 😛
G.B. says
Hello
probably not the most clever thing to ask :\ however, I’m wondering… is it possible to do both gaming & training neural nets at the same time?
Like, utilize both features of a 3080, say?
Thanks for listening. 🙂
Tim Dettmers says
Hi G.B.! It works in theory, but both your gaming experience and your deep learning experience are likely to be miserable. The way the GPU works is that it schedules blocks of computation, but the order of these blocks is not determined. This means that, on average, each application is slowed down by the number of blocks the other application processes. So I would expect a very large frame rate drop, which might shift dramatically from almost zero to almost maximum. So I cannot recommend this setup 🙂
G.B. says
Thank you very much. Appreciated. 🙂
Alexander says
The whitepaper on GA102 states that the RTX 3080 has massively cut-down TF32 performance (aka "bfloat19") (see page 14), around 25% of the Tesla A100.
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
Is this important in practice? If so, are there any options besides A100?
Tim Dettmers says
It may not be that important because the Turing RTX 20 series has more computational FLOPS than it can feed with data, meaning that most of them could not be used for performance gains. NVIDIA adjusted Ampere so that the needed computation and the available computation are better matched. You should not see any decrease from these statistics. NVIDIA did, however, integrate a performance degradation for Tensor Cores in the RTX 30 series which will decrease performance (this is independent of the value that you quote).
Fedor says
Hi, Tim!
I've read your advice about stacking multiple RTX 3090s, but I'm still afraid of custom water cooling. Do you think one can use the RTX 3090 Turbo without spacing?
https://videocardz.com/newz/gigbyte-geforce-rtx-3090-turbo-is-the-first-ampere-blower-type-design
Tim Dettmers says
Hi Fedor!
Unfortunately, I do not think this will work. I could even imagine that the blower is worse than the NVIDIA fan design in this case. If you do not want to go with water cooling, the best bet is to use extenders + NVIDIA coolers I think.
Sly Golovanov says
“additional $0.12 per kW/h for electricity”
It’d be “kW*h” units.
Tim Dettmers says
Great catch, thank you!
David says
Hello, thanks a lot for all of those valuable informations for novice in deep learning like I am.
My question is, if I'm running (not training) a model using Tensor Cores, will it take away from the performance of the rest of the card if it's also used for regular graphics? My hope is to have a model running via TensorRT at around 70% of what the card normally offers, while still having, let's say, 70% of the graphics performance of the card. Is this possible, or should I have a separate card for this? Thank you in advance.
Tim Dettmers says
Hello David, the card will still be limited even if you do not use Tensor Cores, because a scheduled block of computation "blocks" certain cores from being used. As such, you cannot schedule two blocks of computation (one TC, one non-TC) at the same time on the same core. It is better to get two GPUs in your case: one GPU each for inference and training.
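In PyTorch, pinning the two workloads to separate GPUs is just a matter of device placement; a toy sketch (the Linear layers stand in for a real training model and a TensorRT-style inference model, and two CUDA devices are assumed):

```python
import torch
import torch.nn as nn

train_model = nn.Linear(128, 128).to("cuda:0")          # GPU 0: training/graphics workload
infer_model = nn.Linear(128, 128).to("cuda:1").eval()   # GPU 1: dedicated inference

with torch.no_grad():
    out = infer_model(torch.randn(1, 128, device="cuda:1"))
```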
David says
I see, thanks a lot for your answer
dreahs says
Does the 3080 support SLI?
Or does just putting in multiple GPUs and spreading their tasks across them not need SLI/NVLink?
Tim Dettmers says
No need for SLI or NVLink, you can just parallelize and spread data through the PCIe interface and that will work just fine!
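For example, in PyTorch plain data parallelism over PCIe is a one-liner; a minimal sketch (nn.DataParallel shown for brevity; DistributedDataParallel scales better in practice):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
model = nn.DataParallel(model)        # replicates the model on all visible GPUs

x = torch.randn(256, 512).cuda()      # the batch is scattered across GPUs over PCIe
y = model(x)                          # outputs are gathered back on GPU 0
```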
chanhyuk jung says
You can get up to 4 GPUs without a Threadripper using MSI's MEG X570 Godlike motherboard. You can get up to 16 cores with the cheaper Ryzen 3000 series. Three GPUs get x8/x4/x4 lanes from the CPU, and the fourth GPU is connected to the chipset. It has PCIe 4.0 x4, but it shares the chipset with other peripherals. I'm not sure whether that is negligible or not.
chanhyuk jung says
It was too good to be true. I checked some reviews of this motherboard, and it turns out it is missing additional power connectors for the PCIe slots, so if you do run 4 GPUs on it, the 24-pin connector will melt.
Matt says
Great write up. Thanks for this.
Curious as to what you assume the A100 price to be when doing your performance-per-dollar examination, as there’s no “MSRP” on individual cards really.
Tim Dettmers says
I have seen some offers from 3rd-party vendors who put them in 8x GPU servers. That is what I base my estimate on.
Christian says
Tim,
This is fantastic information on GPUs, absolutely fantastic. I’m working on building a machine geared toward gradient boosted regression (xgboost/lightgbm). Do most of your Deep Learning tests in this article hold true for xgboost/lightgbm? Or would your GPU recommendations change drastically for machines designed for these purposes?
Tim Dettmers says
I am not entirely sure about the algorithmic structure of xgboost and lightgbm, but I imagine it uses matrix multiplications and frequent element-wise operations. Since all these operations are strongly bandwidth-bound (even more so than transformers), I would expect the transformer benchmarks to come quite close to the relative performance for xgboost and lightgbm. This means an RTX 3080 vs RTX 2080 Ti will compare similarly to the transformer case, but the overall speedup might be lower compared to an RTX 2080 Ti. In other words, the numbers should be accurate within the RTX 30 and RTX 20 series cards, but not between them.
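For reference, GPU training in XGBoost is enabled through a single parameter; a minimal sketch with random data (it assumes an XGBoost build with CUDA support; LightGBM similarly accepts a device='gpu' parameter):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 100)
y = np.random.rand(10000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "gpu_hist",    # GPU-accelerated histogram tree construction
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```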
Andy N. says
Hi Tim,
This is BY FAR the best thing I have ever read on GPUs in deep learning. Thank you for your work.
On top of that the comments are excellent and packed full of practical advice, kudos.
I was previously operating under the misconception that the V100/A100 was the only option for server-scale (6+ cards) DL GPUs; you've convinced me that actually 30XX (or 2080 Ti) cards are the way to go. I've decided to invest in a large (ideally 8-card) rig of these. However, I've only built one desktop before and feel lost in the wilderness of suppliers and design decisions. I'm worried I won't have the time to learn all the details needed to make a reliable configuration that doesn't suffer from power/cooling problems. Which leads me to the following questions (which I expect may be informative for other readers who don't have an electrical engineering background or hardware experience):
1. Is there any provider you know of with a plug and play 6x+ 3080 or 2080Ti solution? (meaning already assembled and ready to boot) I’ve seen lambdalabs.com but wasn’t able to see any 30XX or 2080Ti machines large enough.
2. For DIY, do you know of any good end-to-end recipes for 3080s or 2080 Tis which say exactly which parts to buy from where, and that have actually been run in production and found free of cooling and power problems? It looks like the Reddit thread mentioned by commenter Marcin concluded that the build was not viable: https://www.reddit.com/r/buildapc/comments/inqpo5/multigpu_seven_rtx_3090_workstation_possible/
Also curious if any of the other commenters know of options for 1 and 2. Other info- noise is not a concern and can have a dedicated 220V (e.g. washer/ dryer circuit) for power.
Tim Dettmers says
Hi Andy,
It's a pretty difficult problem. I see the Reddit thread has some good suggestions already. I think you will not see an 8x GPU server with RTX 3090s anytime soon; there are so many issues for companies to figure out that it is probably not worth it for them, so building a custom rig might be the only option. The motherboard that is suggested there is great. Otherwise, the case is really important. I do not know of any particular case which is suitable, but it would be worth it to look at mining cases and PCIe extenders. I think that would work out well. Otherwise, there might be some issues here and there with server hardware. There was somebody in the comments who had very valuable experience, so you might want to read a bit more of those comments, but in general the experience is gathered on a case-by-case basis, which means if you go ahead you sometimes need to figure things out for yourself.
I did a similar project when I built my little GPU cluster with InfiniBand, and some issues took a long time to figure out (such as how to get InfiniBand working on GeForce Titan GPUs). So if you go ahead you should expect some problems, but it can be a great opportunity to learn more details about servers and their software. So if you are not afraid of that, I would say give it a go. If you do so, I would love to hear back from you about your experience! Good luck!
Jason Hsu says
Hello Tim,
Thank you for a very detailed explanation!
I saw you recommend the RTX 3070 as the best choice for beginners. As an undergraduate college student who just started with machine learning (and everything related), I thought your recommendation fit me perfectly. However, I was really attracted to the price-to-performance value of the 3080, and I also like the fact that it is slightly more future-proof, as I am not planning to swap my computer parts anytime soon (hopefully for 5+ years). Would you recommend the 3070 or the 3080 in my case?
Tim Dettmers says
Hi Jason,
Yes, if you are looking for a GPU to last 5+ years, I would recommend the RTX 3080. I would even recommend waiting a little longer for the versions with increased memory. There are currently rumors of RTX 3080 versions with much larger memory. These GPUs would definitely last you 5+ years. I think that would be your best bet.
Devon says
Thanks, Tim, for the very informative and detailed write-up. One thing I didn't really see in your post and the Q&A below is the consideration between purchasing a card from NVIDIA or someone like EVGA (i.e., not NVIDIA). Aside from cost, are there situations where someone should go to NVIDIA rather than anyone else? Thanks again.
Tim Dettmers says
Often the third-party cards have some slight overclocking and different fans but are not very different from the original NVIDIA card. I would just buy the card that is cheapest or the card that has a particular fan-design which suits you best. In general, the fan-design of NVIDIA for the RTX 30 series seems to be pretty solid and I would probably buy the NVIDIA card over other cooling solutions (at least in a single or dual GPU setup).
Petr Prokop says
Hi Tim,
I am a bit of a newbie… I wonder how to think about a possible upgrade of my rig. I have an ASUS TURBO RTX 2070S 8G EVO + GTX 1050 Ti on X399 with a 1920X Threadripper. I am considering adding a Gigabyte RTX 2080 Ti TURBO 11G, but I am not sure whether it would be better to consider the coming RTX 3070s or RTX 3080s, or rather another 2070, for the most efficient dollar utilization.
I have read in your text that different GF lines 10/20/30 can work together but
– quicker GPUs have to wait for slower ones, which is no good
– parallelizing is not efficient
Q: Would you recommend (i) a second RTX 2070 SUPER, (ii) an RTX 2080 Ti, or (iii) an RTX 30x0 in terms of $ efficiency? Other considerations?
Tim Dettmers says
Hi Petr, This is a difficult question, and I think it depends on your use-case. If you feel your memory on your existing 2070S was sufficient, it might be a good idea to buy a second one. If you parallelize them, it will be faster than a single RTX 2080 Ti and close to a single RTX 3080. If you think you will upgrade more GPUs in the future, though, or feel memory-limited, I would go for an RTX 3070 or RTX 3080.
Raivo Koot says
Great post. Thank you!!
I am looking at motherboard compatibility for two 3090s. Many motherboards (X570) have 3 PCIe x16 slots. However, often only the upper two are "integrated in the CPU" and the third one is "integrated in the chipset". Can I still use the chipset slot for PyTorch etc.?
The reason I am asking is because the middle slot, which is integrated into the CPU, would leave no space between two cards (heating problems!), whereas the third lower slot would leave enough space between the two. If you can let me know whether I can still happily use one GPU in the third “chipset” slot while using another in the top “CPU” slot you would help me a lot. Thanks!!
Tim Dettmers says
This always depends on motherboard configurations. If the motherboard says something like 4-way-SLI ready or the equivalent, it should work. If it says it supports 8x/8x/8x/8x or 8x/8x/16x/8x or something similar it should also work. These are usually the best indicators if you can put a GPU into that very last slot.
David says
Hi Tim! Thank you for the guide.
I am currently considering a 3080 GPU, with a second GPU to be added some time later. I am new to PC building, and I tried to put together a setup as follows:
https://pcpartpicker.com/list/HHdNXv
I based some parts on your 2-GPU barebones PCPartPicker list. I'm not sure about the motherboard though:
1) Perhaps I should go for a newer generation with PCIe 4.0?
2) Your guide recommended some space between two 3080s. With 3 x16 PCIe lanes, would I put one GPU at the top one and the other at the bottom one? Would this be enough space between the GPUs, and would there be enough space (in the case) to put a GPU at the bottom-most slot?
If there is some issue with the motherboard, I would highly appreciate advice on what to choose
Lastly, would you recommend a FE 3080 or some aftermarket version?
Again, thank you so much for this guide!
Tim Dettmers says
Hi David! If you have only 1-2 GPUs, I would go with PCIe 3.0 because it’s cheaper, and you will have almost no advantage from PCIe 4.0 (in terms of deep learning). Yes, if you have 3 PCIe x16 slots, you can put one GPU in the top and one GPU in the bottom, and that should be more than enough to keep your GPUs cool!
Tom Lee says
Hello Tim Dettmers,
thank you very much for your effort over the years. Your posts are very helpful!
Regarding the power consumption of 4x RTX 3090:
Apparently it is possible to lower the power limit of the GPU quite significantly without much performance loss.
https://www.reddit.com/r/MachineLearning/comments/isq8x0/d_rtx_3090_rtx_3080_rtx_3070_deep_learning/g59xd8o/
I tested this on my own Titan RTX with 240 watts instead of 280 and lost about 0.5% speed at 85.7% power. Although the network was quite small per layer, I will test it again with the biggest one I can fit into memory with a batch size of 8, so the GPU is fully utilized.
I have also googled for some time but couldn’t find any tests in this direction.
But it would be interesting to see extensive tests, so you know if you can just reduce the power limit of the RTX 3090 to something more reasonable like 300 W.
Tim Dettmers says
Thanks, Tom! I should have included your data in the blog post. I think I overlooked it among all these comments. Thank you for sharing that information and sorry that I am only now replying!
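For anyone who wants to reproduce this, the power limit can be set with nvidia-smi (requires root); a minimal sketch wrapping it in Python, with the 240 W value from the comment above used purely as an example:

```python
import subprocess

def set_power_limit(gpu_index: int, watts: int) -> None:
    """Enable persistence mode and set a power limit on one GPU via nvidia-smi.

    Must be run as root; the limit resets on reboot.
    """
    subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

set_power_limit(0, 240)   # e.g. cap GPU 0 at 240 W instead of its stock limit
```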
Ivano says
I do not agree about avoiding Tesla cards.
I have a Tesla M40 with liquid cooling in my desktop and it works well.
It is not comparable with recent cards, but it is extremely cheap (you can find a K80 or M40 for about $200-230) and it has a lot of memory (24 GB).
Tim Dettmers says
That is a fair point and actually a pretty good use-case: Low budget but high memory requirements and if speed is not that important these cards are pretty solid. I might add that use-case to the recommendation. Thanks for the comment!
ivan says
I’m currently using a Tesla M40 and a Titan X in my machine. Although I do not suggest using the Titan X, the M40 is quite good thanks to its 24GB.
Unfortunately, both GPUs are quite old and the performance/efficiency ratio is not so good (around 6 TFLOPS at 250 W for both of them).
In other words, you can have cheap GPUs with a lot of memory if you need it, but the energy consumption is quite high. I think I'll replace the Titan X with a Titan Xp (it doubles the TFLOPS at the same energy consumption) or an RTX 2060 Super.
In the meanwhile, I’m waiting for the 3090 😉
Tim Dettmers says
That sounds like a good plan! I like your thinking about efficiency in both power and runtime performance. It is spot on!
Fairwell says
Very nice input. I only checked out used pricing on other cards like the Titan RTX, but compared to the new options (in that case the RTX 3090), the used options seem way too expensive, especially if you consider what you are getting.
At this price point it sounds like an excellent option for specific memory-hungry use cases: you can play around with them without having to invest up-front in the usually quite expensive high-memory options until you know for sure you need to do serious training on them.
Fairwell says
First of all, amazing write-up. That is indeed extremely helpful.
I hope that you can give me some suggestions. You made some good points regarding whether to get the NVIDIA RTX 3080 vs. RTX 3090 vs. more sophisticated multi-GPU setups (e.g., for start-ups), based on what you intend to do. I am working full time as an AI engineer and do some side projects and Kaggle stuff. So far, the only times I have needed really serious GPU computing power have been for work, for which I have the company's AWS cloud account. Other stuff is done on the company's GPU cluster, and smaller prototyping on regular NVIDIA RTX cards.
I am looking for a serious upgrade of my current home setup, which I use for a good part of my daily job (home office time), for some side projects (which will grow significantly in the future), and also for some other stuff for which a fast GPU comes in handy (e.g., some private video editing). But first and foremost I want to replace my older NVIDIA card with the new generation.
I was keen on getting the RTX 3090 since it was rumored to have 24 GB of VRAM, which comes in really handy for more sophisticated models (so far I have made do with 8 GB, which is fine for daily prototyping); it seemed the perfect deep learning card without having to invest in serious Quadro/Tesla cards. However, due to competition from the upcoming AMD Big Navi and the new consoles, NVIDIA was overly generous with the number of CUDA cores/Tensor Cores etc. on the RTX 3080. On paper, that beast offers even more performance for its price than its cheaper RTX 3070 sibling. Now, TensorFlow 2 as well as PyTorch have pretty good multi-GPU support (about a 92% gain for each additional GPU, up to 4 GPUs, in most situations), and I am leaning very hard toward getting 2x RTX 3080 Founders Edition instead of one RTX 3090. For now my setup will remain air-cooled, so I want to go with the Founders Edition, which comes with a pretty nice cooling solution.
Would you recommend 2x RTX 3080 or one RTX 3090 in my case? My case is pretty huge, has good ventilation, and power is no issue; there is room to install a second power supply, which I have left over anyway. I assume that 2x RTX 3080 would perform way better on most models, even though the batch size can be set much higher on a single RTX 3090.
Things I am very concerned about:
-> Huge models I might want to tackle in the future (without the cloud). I might need to manually reduce a model's complexity or split it up somehow if 10 GB of VRAM is not sufficient (memory pooling is not supported for 2x RTX 3080). What is your practical take on this?
-> Huge models combined with a large batch size might perform better on a single RTX 3090 (however, most days/models I use will be fine with 10 GB of VRAM on a day-to-day basis).
-> The cards would only be 2 slots apart, i.e., each RTX 3080 takes up 2 slots with its cooling solution, hence my setup would not provide space in between. Would those cards throttle too much, potentially making the whole setup pointless compared to one RTX 3090? I'd like to note that, if really necessary, I could mount the 2nd GPU in a different position in the case by buying an additional PCIe extender. I'd prefer the straightforward, simple setup, but that would be an option.
Any suggestions and reasoning is highly welcome.
Tim Dettmers says
I think 2x RTX 3080 makes the most sense in your situation. If you do personal projects, most time is spent on prototyping or hacks rather than on fully-fledged products. As such, training a somewhat smaller model, or accepting slow training for your final "production" model, is okay. There are enough memory tricks that you can train pretty sizable models (one such trick, gradient checkpointing, is sketched below). If you want to go one step further, you can use model parallelism, which basically pools your memory across GPUs. Model parallelism is getting better and better support in PyTorch, and I am sure in the future there will be even more software that supports memory-efficient training.
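A minimal gradient-checkpointing sketch in PyTorch (layer sizes are arbitrary; the point is that only a few segment boundaries keep activations, and the rest are recomputed during the backward pass):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(24)]
model = nn.Sequential(*blocks).cuda()

x = torch.randn(64, 1024, device="cuda", requires_grad=True)
out = checkpoint_sequential(model, 4, x)   # keep activations at 4 segment boundaries only
out.sum().backward()                       # intermediate activations are recomputed here
```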
Regarding cooling, I think the RTX 3080 FE probably has pretty strong cooling performance, especially if you have only 2 GPUs next to each other. You could wait for some reviews on cooling performance just to make sure, or you buy them now and get a PCIe extender if it does not work out. Generally, if cooling performance is poor, you lose about 20% performance. If you assume a 92% gain per GPU, an RTX 3090 baseline performance of 1.5, and an RTX 3080 performance of 1.35, then 2*1.35*0.92/1.2 = 2.07, which is still about 40% faster than a single RTX 3090 (2.07/1.5 ≈ 1.38). As such, you will have better performance with those RTX 3080s than with a single RTX 3090 in any case.
However, the extra heat might make those cards more prone to failure. From personal experience, I would estimate the failure rate per year per GPU is about 2-5% and I could imagine that the failure rate would easily double to 4-10% if the setup is experiencing a lot of heat for an extended period of time.
I think considering all factors, it is still the best choice to go for 2x RTX 3080.
Fairwell says
Thanks a lot for taking the time to give me such a detailed breakdown and recommendation.
It seems like if some models are too big to fit into one GPU, frameworks like Eisen (together with PyTorch) can handle that. Not having real memory pooling like NVLink provides, plus the overhead introduced, might make a single RTX 3090 faster in these scenarios (based on the 1-4 GPU model-parallelization benchmarks out there from the last generation).
However, overall, for setups that do not strictly require that much VRAM most of the time, it is likely much more cost-efficient to go for up to 4x RTX 3080 before considering more expensive cards, strictly due to their extremely competitive pricing.
I will go for 2x RTX 3080 FE to get up and running soon, and sell these cards later down the road once the memory requirements for my usage really get too high.
Oscar says
Hi Tim, Fairwell:
I am an NLP engineer, and I also intend to use it for smaller NLP model training. Considering the 24 GB of memory, I thought 1x 3090 would be better than 2x 3080; this way I can also avoid the complications of parallelizing across two cards. Any ideas? Thanks.
Tim Dettmers says
Hi Oscar! If you use smaller models, I would definitely prefer 2x RTX 3080 over the single RTX 3090. An RTX 3090 will not be faster than an RTX 3080 for small models (because you cannot saturate all cores). Besides, two RTX 3080 will be much faster if used with straight data parallelism, and with current software, it is pretty easy to use.
Ervin says
Hi Tim! You have mentioned multiple times some memory tricks in order to train larger networks on GPUs. Can you point to some of them? Thanks!
Michael Conrad says
How do the RTX 2060 and RTX 2070 compare to the GTX 1060?
I'm trying to get a multi-lingual/multi-voice Tacotron 2 operational for a low-resource language, and I don't have the bandwidth needed to use cloud-hosted solutions.
Tim Dettmers says
The RTX cards are much more powerful, but often also a bit more expensive. If you just want to use Tacotron 2 for inference, you do not need a powerful GPU. If you also want to train models, then you should favor the RTX cards.
Alex Dubinsky says
“Currently, there does not seem to be a PSU with more than 1600W on the desktop computer market.”
That’s because standard North American outlets are only 15A x 125V. They can’t deliver more than 1875W, and you also have to subtract PSU losses. Best to get a dual-PSU case.
Recipe for a killer 7-GPU machine:
Hydra VII case (1)
7x 50cm LinkUp Extreme Shielded PCIe riser cables (2)
Asrock Rack ROMED8-2T (3)
2nd gen Epyc (4)
4x or 8x 32GB RDIMMs (5)
an Intel NVME drive (6)
(1) supports 2 PSUs and 8 dual-slot or 6 triple-slot GPUs. Might not be available at the moment.
(2) the cheaper non-Extreme cables may need to be wrapped in aluminum foil to work. The Ultra are even better with PCIe 4 compatibility, but currently not available in 50cm
(3) 7 full 16x PCIe 4.0 slots! However, might need to be run at 3.0 speed for riser compatibility. The EPYCD8-2T is also a great motherboard, but with 8x PCIe 3.0 slots.
(4) The cheapest CPUs use only 1 or 2 chiplets, which affects L3 cache size and mem bandwidth, although it probably will have no noticeable effect. The cheapest full-fat CPUs are the 7262 and 7302P.
(5) Don’t get 8x 16GB. The difference in bandwidth will be hardly noticeable, but you’ll miss being able to cheaply upgrade the RAM.
(6) Intel SSDs are the best with smooth performance. Even the cheap 660p works well.
Tim Dettmers says
I did not know about the North American outlets; this is a very good point! Thank you for sharing! I should add this to the blog post as this is critical information for North Americans.
I like your part list and analysis! Very useful, and hopefully that will give people ideas about what their build could look like. I would love to update my blog post with some more data if you can give me (and others) feedback on your builds. If we can figure out what works and what does not, we can all have cheap, powerful machines.
Alex Dubinsky says
My 4 builds so far have used:
– The wondrous Hydra VII case, which turns out is not currently in production. I actually contacted the manufacturer yesterday and they said they can do a custom run with a MOQ of 200 units. I’m considering sponsoring a run and putting them back on sale.
– 7x non-Extreme LinkUp 50cm cables. I had to wrap them in aluminum foil and packing tape to work consistently. It’s a bit funny, but that’s what shielding is. They’ve released improved cables and I expect those will work fine out of the box.
– 3x AsRock EPYCD8-2T motherboard, which is quite good, with a very useful web-based IPMI interface. It has some odd quirks, like not letting you control fans through the OS. You have to use IPMI–not the web UI but actually ipmitool (the command is `ipmitool -H <BMC address> -U <user> raw 0x3a 0x01 0x64 0x64 0x64 0x64 0x64 0x64 0x64 0x64`, with the host and user filled in). Someone's mentioned it doesn't suspend either, but that's not something I use.
– 1x ASUS X99-E WS motherboard. Legendary board. Works like a rock. I wish ASUS made something similar for Threadripper or EPYC. Supports 7x GPUs as long as you enable above-4G decoding. I’ve tried a Supermicro board a while back (just before I bought the ASUS), and it kept rebooting. I guessed it was some sort of watchdog timer, but the user manual was useless to figure anything out. I’ve avoided Supermicro ever since.
– I have both 7x 2080 Ti and 7x 1080 Ti machines.
– 2x PSUs in each PC. I've used cheap 2000W (220V) miner PSUs and expensive name-brand 1600W PSUs. The cheap-o ones work just as well as the expensive ones. All PSUs die (I've had 2 expensive ones die and 1 cheap one). Just keep extras on hand. I connect one PSU to the motherboard, and the other to the GPUs. Both draw similar amounts of power (I expected big GPUs to avoid using mobo power, but apparently they do). The motherboard PSU is connected to a UPS. The GPU PSU doesn't need to be. If the lights go out, the machine stays up but the GPUs become unavailable (on Linux… can't say if Windows is so forgiving).
– Adjusting GPU fans on headless Linux PCs is possible, see my post https://unix.stackexchange.com/questions/367584/how-to-adjust-nvidia-gpu-fan-speed-on-a-headless-node/367585#367585
I’m looking forward to ROME8D-2T and the new LinkUp risers, but I haven’t actually tried them.
I’ve dreamed about 7x water-cooled GPUs ever since I started working with GPGPU back in 2004 (yes, before CUDA). Last year, that dream finally became a reality. Or rather, it became my nightmare. DO. NOT. WATERCOOL. It’s great in theory, but consumer watercooling equipment is just the worst, most unreliable shit. I can explain in more detail, but TL;DR air-cooled machines are just so much better. They even run cooler. All thanks to long, high-quality x16 riser cables which became available only very recently.
My next dream: Inventing bifurcated riser cables which split x16 PCIe buses on EPYC into 4 x4 buses. This is already an option in the EPYCD8-2T BIOS, but the difficulty is that some circuitry is required in the cable for clock distribution. Then we could have 28-GPU machines.
Tim Dettmers says
Hi Alex! Thank you for sharing your experience — this is extremely valuable information!
Great to see you have had a good experience with EPYC CPUs and motherboards. I would recommend them more, but there is just too little information out there about which CPU/motherboard combinations are reliable!
The Hydra VII case looks absolutely great! Do you know of any comparable cases? Some people in the comments wondered if dust would be a problem in such open-air designs. Do you have any experience with this that you could share?
It is good to see that you can trust miner PSUs. I was always a bit skeptical about PSU quality, and it felt to me like most PSUs have no difference in quality. The first PSUs that I felt had top-notch quality were EVGA PSUs. But it might be that my feeling is off here. I never had a PSU fail, but I might just have been lucky. So I might be biased here.
Regarding headless cooling: Andy Jones worked on some python solution for this where you do not need to meddle with configs yourself. For me it worked great: https://github.com/andyljones/coolgpus
I never had a water-cooled setup myself and I am curious about more details of your water-cooling experience. I read up on water cooling and often read that parts were not reliable in the past but that it has come a long way since. It seems you would not agree with that statement. Otherwise, I agree that PCIe extenders/risers can often solve problems with cooling quite efficiently without any of the risks or hassles from water cooling. I guess the only problem can be space and as such, it is more important to pick the right case.
Please keep your comments coming — your insights are very valuable!
Marcin Bogdanski says
Hi Alex
I spent the last few days researching a similar build. In the end it came down to the power supply. Apparently it is not advised to bridge PSUs. I consulted a few people with a background in electrical engineering, one specifically with expertise in power supply design, and in general the more experience a person had, the more they advised me against bridging PSUs (which are not specifically designed with that in mind). Apparently server PSUs are specifically designed for it, but they are awfully loud.
I have more details here if someone wants to go down a similar path:
https://www.reddit.com/r/buildapc/comments/inqpo5/multigpu_seven_rtx_3090_workstation_possible/
Alex Dubinsky says
I haven't had any problems with dual PSUs in my 4 machines. Now, you definitely don't want to short-circuit the 12V lines from different PSUs, because one PSU may want to output 12.1V and another 11.9V, and they'll fight each other. AFAIK, that's not what happens when you use dual PSUs. GPUs route current through independent wires and voltage converters, and it's OK if the GPU gets 12.1V from the PCIe slot and 11.9V from the power connectors.
andrea de luca says
Hi. I don't know if it could be an issue for you, but consider that I have had both the EPYCD8 and the ROMED8, and both of them refuse to go into suspend mode (both sleep and hibernation) in Linux and Windows.
However they are awesome boards, and the Epyc 7282 I used did draw even less power than declared.
Chris says
How much of a performance drop would you expect for the 3080 or 3090 cards if used in an eGPU setup (such as a Razer enclosure)?
(Also, does anyone know if eGPU makers will be supporting the new 3080/3090?)
Since laptop hardware components remain frustratingly hard (if not impossible) to upgrade, is it worth getting a laptop (for deep learning) without a GPU and simply connecting one of the 3070/3080/3090 cards to it via USB-C Thunderbolt and an eGPU chassis?
For example the Dell XPS 17 comes in 3 options:
a) no GPU,
b) a GTX 1650 or
c) an RTX 2060.
But the card is soldered to the motherboard, making upgrading not recommended.
I'm curious (and skeptical) whether the crazy high TDP values of the 3080 and 3090 can be adapted to a laptop, or if the heat and power requirements would make this untenable.
What kind of performance drop would we expect from, say, a laptop version of the 3080 compared to the desktop variety?
Thanks Tim, love your review.
Tim Dettmers says
Hi Chris, I think the RTX 3080 and RTX 3090 should fit without any problem into eGPU setups (be aware of power requirements, though). They should be compatible since the transfer translates PCIe to Thunderbolt 3.0 and back to PCIe. Since PCIe has a unified protocol, the transfers should be guaranteed to be compatible.
Yes, I think a cheap laptop in addition to an eGPU is a very smart solution, especially if you are a heavy user and want to avoid cloud costs over the long term. A local GPU, though, can be useful for prototyping, and some like it if they can run everything via a local IDE. But since your eGPU is close to you, it should have low latency, and it is easy to set up IDEs to work on remote computers. So with a bit more effort, a laptop with no GPU should be just fine.
I could imagine that NVIDIA adapts the RTX 3080 for a laptop version. This could mean a smaller GPU (not really an RTX 3080 anymore) or lower clock rates (very likely).
These were good questions. Thanks for your comment 🙂
Andy says
Hi Tim,
Great article. Have you got an article, or do you know of a resource, that explains in detail the memory requirements of different models, i.e., what is actually taking up the memory, such as the size of the batch, the size of the input, the depth and breadth of the model, and all associated weights? What exactly happens during backpropagation in terms of memory, and what is stored?
I’m trying to understand how much memory I might need but I feel I need more information than the general guide you post here.
What is meant by 'the model doesn't fit in memory'? Surely we just reduce the batch size a bit? Or are we talking about situations where we are already at a batch size of 1? How effective is training if we have a batch size of 1?
Are you saying there is a difference (in terms of memory requirements) between using something like a state-of-the-art feature extractor, where we already have the model and just train the tail, and creating a huge CNN from scratch?
Could you give some idea of the size of a CNN that might fit in 10 GB or 24 GB for a given input size?
Thanks
Tim Dettmers says
Hi Andy, the best resource I know of that discusses this is: https://arxiv.org/abs/1904.10631. The technical report does not cover all memory saving techniques but some of the most common ones.
“The model doesn’t fit into memory” often means that batch size 1 does not even fit, but also it is common to use that expression if the batch size is so small that training is abysmally slow.
There is definitely a big difference between using a feature extractor + smaller network and training a large network. Since the feature extractor is not trained, you do not need to store gradients or activations for it. This allows you to reuse all the "dead" memory of the previous layers. Thus a feature extractor + small network will require very little memory.
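A small sketch of that setup in PyTorch (torchvision's ResNet-50 as the frozen extractor with a toy linear head; using the 1000-dimensional logits as features is a simplification purely for illustration):

```python
import torch
import torchvision.models as models

backbone = models.resnet50(pretrained=True).cuda().eval()
for p in backbone.parameters():
    p.requires_grad = False               # no gradients or optimizer state for the backbone

head = torch.nn.Linear(1000, 10).cuda()   # only this small head is trained
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(32, 3, 224, 224, device="cuda")
with torch.no_grad():                     # backbone activations are not kept for backward
    features = backbone(x)
loss = head(features).sum()
loss.backward()
optimizer.step()
```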
The most significant factor for CNN memory requirements is the input size. Very large images, as used in medical imaging, need lots and lots of memory. Video-level CNNs also need a lot of memory. Beyond that, it is highly dependent on the network architecture. Early layers use much more memory than later layers. VGG uses quite a bit of memory for its depth when compared to a ResNet because ResNet has smaller early layers. More advanced networks that have branches within each layer are more difficult to analyze, and it is difficult to get a sense of how much memory a CNN needs just by looking at the architecture. With those networks, you need to do benchmarking with the actual network to understand where the memory is used.
Marcin Bogdanski says
Hi Tim
First, thanks for putting all the effort into this great post; it is probably the best single resource on the internet. What do you think about an EPYC 7402P on a workstation motherboard as an alternative to a Threadripper 3960x for a 4x GPU build? The cost is very similar, and you can fit 4x GPUs and still have spare PCIe slots left for SSDs or fast networking.
– What do you think about potential BIOS/linux/driver compatibility issues?
- Do you think it could run 5 or 6 250W GPUs on risers (assuming cooling/mounting/power is resolved)?
ASRock ROMED8-2T – 7x PCIE x16 – $650
https://www.asrockrack.com/general/productdetail.asp?Model=ROMED8-2T
Gigabyte MZ32-AR0 – 6x PCIE x16 – $750
https://www.gigabyte.com/uk/Server-Motherboard/MZ32-AR0-rev-10
Thanks again!
Tim Dettmers says
EPYC CPUs are great! I think going the server-components route makes a lot of sense, especially with the RTX 3090, which needs more space, power, and cooling. The only issue is that it is often difficult to tell whether you are buying good-quality components, because gaming components are reviewed and judged by a community while server components are not. BIOS/Linux/driver compatibility should be no issue, especially if you only have a single CPU on the motherboard.
Marcin Bogdanski says
Thanks for your reply!
After doing more research, I found that Supermicro also carries similar boards for EPYC processors and seems to be a more reputable brand in the server space (if anyone goes that route: Supermicro website -> building block -> server boards -> EPYC; it never showed up in Google, I had to browse the website manually).
After much consideration I decided to purchase 2x 2nd Gen Threadripper 2950x systems for a total of 8 GPU slots. The price is slightly higher than a single Threadripper 3960x, but not by much. PCIe 4.0 and fast inter-card communication are not a factor for me. Hope this may be a useful alternative for others in a similar situation.
The main reason I decided against the server route is that a desktop build would still be limited by a 2000W PSU (so 4x GPUs anyway). For more power one would have to use server PSUs, which seem too loud for an office/home environment.
Hopefully by next time I will have a server room and will go full swing with a 10x GPU build!
Ryan Mink says
Hi Marcin!
I am also interested in doing a 4x GPU build.
I am having a hard time deciding what case to use if I go with air cooling.
If you are not doing liquid cooling, what case do you plan to use?
Marcin Bogdanski says
Hi Ryan
Yeah, I’m almost certainly going with air. Easier to swap, doesn’t leak, much cheaper. Hybrid cards should fit into a standard case, but at a significant price premium.
I haven’t decided yet, but I am considering a 6U server crypto-mining case. Google “6u mining case”; there are not many suppliers in the West with stock, but they pop up on eBay (they are on Alibaba as well).
Here is one in stock in UK
https://www.xcase.co.uk/collections/mining-chassis-and-cases
A thing to note is that GPUs in mining rigs may not be properly mounted; they just kind of lay there. I’m planning to attach each GPU to a vertical mount and then drill/screw the mount to the chassis. This way it will be properly mounted.
Example vertical GPU mount:
https://www.coolermaster.com/catalog/cases/accessories/universal-vertical-gpu-holder-kit-ver2/
Hope it helps!
Ryan Mink says
Thanks for the info.
I am considering an open air mining case like https://www.amazon.com/Veddha-Deluxe-Model-Stackable-Mining/dp/B0784LSPKV/ref=sr_1_2?dchild=1&keywords=veddha+gpu&qid=1599679247&sr=8-2.
What are your thoughts about this vs a closed air case like the one you posted above?
My only concern with an open air case is the possibility of more dust.
Marcin Bogdanski says
Hi Ryan
I don’t have much experience with open cases, sorry; it might be worth googling around and checking mining forums. For me the main reason to go with a closed one is that technicians in the lab frown upon exposed electronics, and I want the ability to move it around easily.
Gustavo says
Why not this
https://www.supermicro.com/en/products/system/4U/7049/SYS-7049GP-TRT.cfm
Marcin Bogdanski says
Xeons are more expensive and have fewer cores than EPYC/Threadripper.
Ryan Mink says
Hi Tim!
Thanks a lot for sharing this. It is very informative!
Btw I am thinking of buying four FE 3090s for my research workstation.
Since there is no conventional case that can accommodate four of them, I am considering an open air case commonly used in a mining rig.
What are your thoughts about that?
Should I be concerned about dust if I am going to clean the system with an air compressor once or twice a month?
Tim Dettmers says
I think an open case makes a lot of sense, but I am not sure how it will perform over time. It might make sense to read a bit through cryptocurrency mining forums to see what people’s experiences are. However, mining rigs are often at 100% load 24/7 while GPUs are usually used only a small fraction of the overall time — so the experience might not be representative. I think it is difficult to say what will work best because nobody has used GPUs in such a way (open-air case + low utilization).
Ryan Mink says
I am also considering custom water cooling, but I am not comfortable having the system run nonstop for days training transformers due to potential leakage that could totally ruin the system.
Is it common to run a water-cooled system for days nonstop?
oarph says
Confused about the bar charts showing the RTX 30-series performance. Did you actually get a pre-release RTX 3090 etc. to test, or are these estimates based on the published specs? Furthermore, if you used a specific CNN and Transformer in the benchmarks, could you cite the models and/or publish them on GitHub? Thanks for this guide! It’s just confusing to me what the benchmarks represent.
Tim Dettmers says
I added a bit more detail to the benchmarking section now. Let me know if it clarifies the benchmarking a little bit.
Additionally, here is a comment I made on the Hacker News thread which gives you a little bit more information, in particular regarding transformers:
Frank says
The benchmarking section still doesn’t make it very clear that you didn’t actually test with all these cards. You should bold it or something, and list out which cards you DID test and which are interpolated or extrapolated (in the case of the 30XX, we all know the dangers of extrapolation…).
Tim Dettmers says
That is fair. I will have another look and see if I can make it clearer.
andrea de luca says
Hi Tim. Here are a couple of PSUs for 4×3090.
Both are 2000W Platinum, reliable brands. ATX format (but they are long, check your case clearance).
https://www.fsplifestyle.com/PROP182003192/ (~300Eur)
https://www.super-flower.com.tw/product-data.php?productID=67&lang=en (~370Eur)
Now, I was interested in sticking in a couple of GPUs as you did with yours. May I ask for some practical advice about how to secure them in order to avoid damage? Like you, I’d like to attach one of them to the front vents, and another to the bottom vents.
Another question: How can I make my case dust-proof without compromising the airflow? I’m quite worried about the amount of dust I find upon the fans and the heatsinks of my GPUs.
Tim Dettmers says
Thanks, Andrea! These PSUs look excellent. I will have another look and might update the blog post accordingly.
I tried regular zip ties that come with the desktop case, but these are too short. I could imagine that using two long zip ties would work quite well to attach the GPUs to someplace in the case. The one GPU in the picture is just lying on the bottom of the desktop (over a vent) and it is unstable, but I also do not move my desktop that much. If I were to move it, I would probably detach the GPU first.
In general, though, the location of vents near GPUs is not that important. I find that GPUs just need some space from each other to run cool enough. The air inside the desktop will heat up to 50C or so, but that is still very cool compared to the GPUs. In my setup in the picture, the Founders Edition cards run at 75C under full load while the blower GPUs throttle slightly at 80-82C, which is still pretty good.
I do not have any good solution for dust-proofing a case. The only solution that works for me is to just clean the desktop every 6 months or so.
andrea de luca says
Thanks, Tim.
I was worried about dust since it is the main factor in wearing out the GPU’s fan bearings. If one invests in a couple of 3090s (which, as you highlighted in your article, can last a few years at the very least), I think it’s better to prevent them from being broken by dust: replacement fans/heatsinks would probably prove impossible to find.
There actually is something on the market, like this:
https://www.silverstonetek.com/product.php?area=en&pid=525
But I don’t like the inverted layout, and the HEPA filters are not washable.
Tim Dettmers says
That is a very good point. I heard this happens a lot in cryptocurrency mining. A friend of mine has a pack of replacement fans for his cryptocurrency mining rigs. Often miners will choose GPUs where you can easily replace the fans.
I think right now it is difficult to say what will work well. I think time will tell what are the most robust cases for RTX 3090s.
andrea de luca says
The thing is, I want it badly, and I already had the frame of mind to order the Founders Edition as soon as Sep 24.
But you are right, alas! It would be better to wait and see what the manufacturers come out with.
Thanks!
Alex Dubinsky says
FSP is a bad brand. I bought their 2000W PSU. It was broken from the start: it turned off when putting out even 1000W, possibly because of bad capacitors. The main problem is that FSP’s support is very bad. It’s confusing and difficult to file an RMA. It’s too bad I was past Amazon’s return period. A waste of $400. Better to use 2 PSUs.
Michael Balaban says
Just a heads up:
– The first PSU is only rated for 1500W @ 115-200Vac. 2000W requires 200-240Vac.
– The second PSU is only rated for 2000W at 230V.
As far as I know, there aren’t PSUs from quality companies rated beyond 1600W with 120V input.
Josh Scholar says
Is it possible to have a link to the old version of this article? It had helpful information for people who only have access to older hardware than Ampere.
Tim Dettmers says
That is a good point. I do not want to have two different blog posts online at the same time, but I could add some of the old content back into this blog post. Is there anything missing in particular? I could add some of the older GPUs to the charts. Would that be enough?
Josh Scholar says
I want to see the old charts comparing the cards I can get hold of right now: the 20 series, the 1080 Ti, the Titan X Pascal, etc.
I would like to see them compared with TPUs.
I like that you used to compare them on 3 or 4 different kinds of tasks.
I was disappointed to see the old article gone that compared all of those. Why tell people to buy cards that no one has used yet and can not buy?
andrea de luca says
I think that he removed the old version since the new cards are so superior to the older ones, and so cheap (in relative terms), that it doesn’t make sense to buy Turing and Pascal cards anymore. Consider that you’ll probably find a 3070 at less than 500 USD.
For what is worth, here is a summary:
1. Pascal cards can do FP16, although with a very modest speedup. However, you will still see your VRAM almost doubled (something like 18-20GB FP32-equivalent). So, the 1080 Ti is still a good card if you are on a budget, but be sure not to pay more than ~250-300 USD/EUR for it.
2. Forget about the other Pascal cards, unless you can have them as a gift.
3. Ampere has completely ruined the used GPU market, so you will probably find former top-notch Turing cards at very modest prices. My guess is that a 2080 Ti, used but in good condition, will be priced somewhere in between the 3070 and the 3080. Its performance would be a bit worse than that of the 3070 (allegedly), but then it has 11GB. It’s still a capable card. My suggestion is to buy it only under 550-600 USD/EUR.
4. The true champion now is the 3090. With respect to the former 24GB-class card, it almost doubles the number of CUDA cores and performance increases by at least 50%, yet it costs 1000 USD/EUR LESS than the Titan RTX (in fact, in the EU it costs 1200 EUR less than the Titan). So, if you can afford it, buy it and forget about Pascal and Turing.
Letmos says
It would be nice to have an update of the “GPU for Deep Learning” article that focuses on the brand-new Nvidia Ampere graphics cards. We have three models right now (3070, 3080, 3090), but there are rumors that soon we will also see a 3070 Ti (with 16 GB VRAM) and a 3080 Ti (20 GB VRAM). That sounds interesting, and it would change a lot in deep learning.
Taako says
Are you going to benchmark the 30xx series cards?
What about the new Big Navi (2x) AMD cards with ROCm?
Also, can you sort your comments reverse-chronologically? There is a lot of scrolling to get to this comment box and the most recent comments.
Tim Dettmers says
Thanks for the suggestion to reverse the comments! I did not know that this feature was added to WordPress — very useful!
I discuss AMD cards a bit in the new blog post update, but not sure if there is anything reliable yet on Big Navi.
Jake says
The new RTX 3000s specs are officially out: https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/?nvid=nv-int-gfhm-10484#cid=_nv-int-gfhm_en-us
I wonder what your thoughts are on them. Judging only from the specs, could the 3080 be your new tl;dr best GPU overall recommendation? Would a 3090 even be worth considering at the price of picking up two 3080s?
Kamil says
Hi Tim. I am a beginner in deep learning, but I am thinking about it seriously. I do some data science work and am considering buying a new PC. Unfortunately, Nvidia will launch the 3000 series in two weeks, and I am confused about which GPU I should buy. I started with the GeForce 1660, but after some reading about it I am now considering adding some money to get a GeForce 2070 Windforce 2X. It is the cheapest 2070 I have found and… that could be a problem. I would like to avoid buying a new card after a year, and that is why I have a huge problem with this decision. Maybe it is better to wait 2 weeks and see what Nvidia introduces (though the 3070 and 3060 will appear in Q1 or Q2 of 2021, as I heard), or is this GeForce 2070 Windforce (maybe you have a recommendation about a version of this GPU?) good enough to start with DL? Maybe a 2060 or 2060 Super is good enough and there is no need to pay more for the cheapest version of the 2070? I am not an expert on the GPU market, so I wonder whether prices of the 2060 or 2070 will decrease or increase now.
Samuel Rodriguez says
This is a very informative and nice article.
Tamir says
Hi Tim, great work! very vital information!
1) Which type of RTX 2060 do you refer to? 6GB/8GB? The Super version? If you can specify the exact model, that would be great.
2) Do you have experience working with two graphics cards, one basic card for regular screen requirements and another one devoted only to deep learning work?
I tried it today and saw that both graphics devices allocated memory for screen processing although only one of them was supposed to do so.
3) What do you think is the minimum required memory for someone who buys a GPU today and would like to be able to run the common models?
Thanks a lot,
Tamir
Tim Dettmers says
I would always go for the Super. The Super cards in general are very good in cost/performance. However, if you need more memory you should just go for the maximum-memory version. A fast GPU is of no use if you do not have enough memory to run the models that you want to run.
Vinay says
Hi Tim,
Thanks for your insightful blog post. I am a beginner in deep learning and am planning to buy a new server tower powered with an Nvidia GPU. Could you please suggest the best possible and available system specification with a maximum budget of $4500 – $5000, one which won’t require any change in hardware for a decent period of time? I would be obliged if you can help me here. Thanks in advance.
Tim Dettmers says
I would get the cheapest Threadripper v2 and 3-4x 2080 Ti with blower style fans depending on budget.
Vinay says
Thank you for your suggestion Tim.
I almost ended up with an Intel Xeon Silver 4214 (12 cores, 16.5 MB cache, 2.2 GHz),
1x Nvidia GeForce RTX 2080 Ti,
64 GB DDR4 ECC memory with a hybrid disk.
I guess I will consider your opinion and look for an AMD Ryzen Threadripper 2950x with 2x Nvidia RTX 2080 Ti. I am still not sure whether Windows Server 2016 (or a higher version) / Windows 10 will be compatible with Threadripper, and whether it affects processing speed.
Ivan says
Could you tell me why the relative performance between word RNN and char RNN (based on your pictures) varies so much from card to card?
What parameters of my LSTM should I consider? I have about 20 features and a lookback of around 400.
Tim Dettmers says
I measured LSTM performance for a couple of cards and interpolated for the rest. In your case, 20 features and a sequence dimension of 400 is pretty small, and in terms of cost/performance a small GPU would be pretty optimal for that. Larger GPUs would not be fully utilized but should not be much slower. RTX cards have a different memory hierarchy, which makes LSTMs a bit slower than on GTX cards.
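As a rough point of reference (a minimal PyTorch sketch with made-up hidden sizes, not the benchmark code), a model for 20 features and a lookback of 400 is tiny by GPU standards:

import torch
import torch.nn as nn

# Hypothetical sizes for illustration: 20 input features, sequence length 400.
lstm = nn.LSTM(input_size=20, hidden_size=128, num_layers=2, batch_first=True)
head = nn.Linear(128, 1)

x = torch.randn(32, 400, 20)            # (batch, sequence, features)
output, (h_n, c_n) = lstm(x)
prediction = head(output[:, -1, :])     # use the last time step
print(prediction.shape)                 # torch.Size([32, 1])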
Nicolas says
Hi Tim,
Great blog, quick question: have you tried more specialised GPU services from companies like Genesis Cloud (genesiscloud.com) or other start-ups in the area? It seems that they offer different types of GPUs and are very focused on cloud infrastructure for AI, with relatively good prices.
Any recommendation in the area would be useful.
Cheers
Tim Dettmers says
I have not tried this. There are many companies that offer these services and it takes a bit too much time to explore these services on my own. If you tried some of the services and have some feedback I would be happy to hear more about it (either email or as a comment here).
Lucas says
I am looking at the mining GPU P106-100, which seems to be the same as a GTX 1060 without video ports.
I don’t know if it is CUDA compatible.
I want to run yolo over darknet. Will it work?
Thanks
Tim Dettmers says
It is CUDA compatible and you should be able to run yolo on it. You might need to downsample the images slightly but it should work smoothly.
Andrea says
Hi Tim,
Thank you very much for your article, it’s just great!
I have one question which I could not find explicitly mentioned in the posts or the article. Is it possible to prototype code that leverages Tensor Core operations using a GTX card? Do you have any reference for this? (I imagine there is some form of emulation of the Tensor Cores on a GTX?)
I am asking because I am undecided between two laptops: one with GTX1650Ti and 16:10 aspect ratio (coming XPS 15), and the other RTX2060 but 16:9 (Razer 15).
Thank you again for your time!
Tim Dettmers says
Unfortunately, you cannot prototype Tensor Core code on a GTX card. What you can do is rent an AWS spot instance when you do test runs and otherwise prototype on the GTX card. If you do it like this, the costs should be minimal.
Paweł says
Hi,
I was going to buy an NVIDIA RTX 2070 GPU, but NVIDIA also offers the RTX 2070 Super. Does it make sense to buy the RTX 2070 Super version for machine learning?
Tim Dettmers says
Yes, buy the RTX 2070 Super instead of the RTX 2070. This blog post is a bit outdated.
Youcef says
Hi Tim,
Thank you for this excellent review; I am using it as a reference!
I am looking to buy a GPU to use for deep learning (computer vision, object detection & NLP using neural networks – R-CNN, FCN, YOLO, SSD, CTPN, EAST, etc.).
I am a bit tight on budget. I was looking at the 1060 (6GB) as a minimum; how does it compare to the 1660 for the same purpose? I read in one of your comments above that the 1660 does not have Tensor Cores and may be good for gaming but not for deep learning…
which one do you suggest?
Does the 1060 have Tensor Cores, or is it only the 20xx series that is equipped with them?
Thanx
Tim Dettmers says
I would go with the 1660 if those are your only options. If you have a bit of extra money the RTX 2060 would be much better all-around.
Eslam Haroun says
Hi Tim,
I have GTX 1070.
Is it valid for these applications?
Thanks
Tim Dettmers says
A GTX 1070 is pretty good for these applications. For some models you might need to use some “memory tricks”, but overall you should have few problems; it’s a good GPU.
Eslam Haroun says
Some people said it would be valid for prediction but not training.
Is this right?
Thanks
Tim Dettmers says
A GTX 1070 is pretty good for both, prediction and training.
Fordjour K. says
Tim,
Is it a good idea to combine RTX 2080 Ti founders with other GPUs like RTX 2070 or RTX 2080 super, if you want to build a 3 GPU machine?
Will the computing power for ML and NLP models decline?
Thanks.
Tim Dettmers says
That works without any problem. However, you will not be able to parallelize models across different types of GPUs.
andrea de luca says
Mhhh… I think that he won’t even be able to parallelize *data* across different models of GPUs.
andrea de luca says
Tim, I need to connect my GPU (I’ll buy a Titan RTX soon) to different hosts.
What do you think about an external box like the Razer Core X? Will it work over Thunderbolt? The speed corresponds to PCIe x4 gen3, so I don’t expect dramatic performance losses. Am I right? Thanks
Tim Dettmers says
The performance should be okay in most circumstances. Performance will be the same if you can transfer the dataset to the GPU before training. It will also be good for most NLP models. For computer vision you might see a drop of about 20-40% in performance depending on image size (the larger the image, the worse the performance drop). However, despite the performance drop you still get excellent cost/performance, since laptop GPUs are very expensive and weak while desktop GPUs require a whole new desktop. As such, I think this is a very reasonable setup, and while things are a bit slower you should be able to run any model, which is very handy.
andrea de luca says
Thank you. I think I’ll go that way then…!
Matthew says
Another “would you rather” question:
Would you rather have a 1070 Ti or a 1660 Super? The 1660 Super’s GDDR6 memory greatly increases bandwidth, but it only comes with 6GB of memory vs 8GB for the 1070 Ti.
I find the prices for the two cards to be quite similar (new vs used), so that isn’t a driving issue at the moment.
Tim Dettmers says
I would definitely go for the 1660 Super in terms of performance. For the memory, it highly depends on what you are planning to do with it. If you just want to play around and test things, you can get most networks to work if you use gradient accumulation and full 16-bit computation (16-bit weights), but this requires more coding. If that is okay, the 1660 Super is right for you.
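To give an idea of the kind of extra coding meant here, this is a minimal gradient accumulation sketch (my own illustration, with made-up sizes): 4 micro-batches of 8 behave like one batch of 32, at the memory cost of a batch of 8. Full 16-bit weights would additionally mean calling model.half() and feeding half-precision inputs.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4                          # placeholder value

optimizer.zero_grad()
for step in range(accumulation_steps):
    x = torch.randn(8, 512)                     # micro-batch
    y = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()      # gradients accumulate in .grad
optimizer.step()                                # one update for the "large" batch
optimizer.zero_grad()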
Shafeez says
Hi Tim,
Thanks for the advice earlier. I might need a rig before the RTX 3000 series is released it seems like.
I came across an article on lambda labs on choosing a GPU for deep learning.
I am not sure if the person who wrote the article was using mixed precision for the RTX cards.
It seems like the 2070 can do a lot! After scouring through some of the comments here, I think I might settle for a dual RTX 2070 Super instead of a 2080 Ti. I think getting accustomed to working with data parallelism would increase my job prospects. I feel like that is one of the big reasons dual 2070 sounds like a better choice.
You mentioned that the 2070S can offer the equivalent of up to 12GB of memory if I use mixed precision.
How much VRAM can I really squeeze out of a 2070 8GB if I were to use a smaller batch size and gradient accumulation, on top of using FP16?
Thank you.
Tim Dettmers says
It is reasonable that you can squeeze out an equivalent of about 24 GB. If you use gradient accumulation, then it is just a question of performance. If a model fits into the GPUs at batch size 1 you can train it, but it will be awfully slow. For good speed you want a batch size of at least 16-32 in most cases. I would not want to train transformers on two 2070 Supers, but for computer vision models it might be fitting.
Sergej says
I recently started with NLP and BERT-based models (PyTorch). Unfortunately my GTX 1070 8GB runs out of memory… I am thinking of getting a K80 with 24GB or a M6000 with 24GB. Which one would you recommend? Or should I get another card? I will definitely need more than 16GB of memory.
Thanks for this great article, it helped me a lot.
Tim Dettmers says
The K80 and M6000 will be quite slow. I would recommend getting a Titan RTX with 24 GB of memory. If that is too expensive, I would definitely go for the M6000. You can also think about multiple RTX 2080 Ti cards and parallel training. That reduces the memory footprint per GPU, especially if you use FP16 training. Mixed FP16 training reduces the memory footprint by about 25%; pure FP16 training via Apex reduces it by about 50%. Using 2 GPUs should decrease the footprint by a further 20-30%. So 2x RTX 2080 Ti with pure FP16 training is roughly equivalent to 11/0.75/0.5 ≈ 29 GB on a K80 or M6000, but you train much, much faster.
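For reference, a minimal sketch (my own illustration, assuming a CUDA GPU and made-up layer sizes) of mixed-precision training with PyTorch's native torch.cuda.amp; with Apex, opt_level="O1" is roughly this mixed mode and opt_level="O3" is the pure FP16 mode mentioned above:

import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

with torch.cuda.amp.autocast():        # matmuls/activations run in FP16
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Pure FP16 (biggest memory saving, numerically more fragile) would instead be:
# model.half(); loss = nn.functional.cross_entropy(model(x.half()), y)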
Sergej says
Thank you!
Shafeez says
Hi Tim,
So the rumor is that the 3080 Ti could have VRAM anywhere between 12GB and 16GB. 12GB sounds very plausible to me, but what is your opinion on it being 16GB? Would Nvidia really take a leap and add literally 5 more gigs over its predecessor, the 2080 Ti with 11GB?
I do ML for research. I will finish my undergraduate in Computer Science in 2 months. I might do grad school in a year or so, until then I want to carry on independent research – reading and implementing research papers.
I have read some research papers and have implemented some, and I realized memory is very important, especially after playing around with implementations of Mask R-CNN, and other object detection APIs.
I am trying to weigh the option of waiting for the 3080 Ti against just buying the 2080 Ti now.
What is your opinion to the following question:
Is it better to wait for the 3080TI, especially if 3080TI will have 16gb of memory and 7nm architecture for the same price as 2080TI?
Thanks.
Tim Dettmers says
I think the best strategy for NVIDIA is to keep the RAM low so that deep learning researchers are forced to buy the more expensive GPUs. I would be surprised if the new GPU would have 16 GB of RAM but it might be possible.
Regarding the RTX 2080 Ti now vs. waiting for the RTX 3080 Ti: you are very right that 16 GB vs 12 GB would make a huge difference, and I would wait a bit longer until it is confirmed that it is 16 GB. The performance gain is probably not that great. Tensor Core performance is better, and that is where the much-improved max FLOPS comes from, but in practice matrix multiplication is bandwidth-bound and will see little increase in speed. Convolutions might be a bit faster, but they will also likely be bandwidth-bound now. I would guess we can see around a 10-20% performance gain. Together with 16 GB of memory this would be a great and long-lasting GPU. If it is 12 GB, it might be worth it to buy a used RTX 2080 Ti instead.
Aleksandr says
Hi Tim,
thanks for your posts. They, together with the comment sections, helped me quite a lot to make up my mind about my new PC configuration. I decided that the best setup for me would be dual RTX 2070S + Ryzen 3700x. The problem is that the cards with the best air cooling are almost 3 slots wide, so I’d need a 4-slot distance between them, and the only motherboard I could find that has such a distance and allows running both GPUs at x8 PCIe lanes is the MSI MEG X570 GODLIKE, which costs like HELL. There are a handful of cheaper motherboards with 4-slot spacing that can run in dual-GPU mode at PCIe 4.0 x16/x4 (like the ASRock X570 Steel Legend). I know that you recommend having at least 8 lanes per GPU, but that recommendation was for PCIe 3.0. Could we say that 4 lanes of PCIe 4.0 are roughly equivalent to 8 lanes of PCIe 3.0 in terms of RTX 2070S performance and won’t cause any bottlenecks? If we just compare the bandwidths it should be fine, but since the RTX 2070S doesn’t support PCIe 4.0 I’m not sure how it will respond to 4 lanes and whether it would be able to utilize their full potential.
Tim Dettmers says
4x PCIe 4.0 lanes are equivalent to 8x PCIe 3.0 lanes, but I think you are creating artificial problems here. If you only have two GPUs, you can easily get away with 2-slot-wide GPUs for excellent cooling (as long as they are not directly next to each other). Otherwise, going with a different CPU-motherboard combo might be cheaper and will not reduce performance by much. If you want to be ready for PCIe 4.0 GPUs and want to keep the computer for many years, though, it makes sense to go with the CPU-motherboard combo that you have currently selected.
Tania Farzana Keya says
Hi Tim,
Thanks for your hard work. This article saved a lot of time for me. I had some doubts.
You suggested,
“Start with an RTX 3070. If you are still serious after 6-9 months, sell your RTX 3070 and buy 4x RTX 3080. ”
According to Nvidia, the RTX 3080 doesn’t support NVLink. Is it possible to use 4x RTX 3080 for training (officially)?
Tim Dettmers says
Yes, using 4x RTX 3080 in parallel will be no problem since they can still communicate through PCIe with pretty respectable speeds — especially on a PCIe 4.0 board.
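On the software side, nothing special is needed for PCIe-only parallelism either. A minimal data-parallel sketch (my own illustration, not from the post; nn.DataParallel is just the shortest way to show it, while DistributedDataParallel with one process per GPU scales better in practice):

import torch
import torch.nn as nn

# Splits each batch across all visible GPUs over PCIe; no NVLink required.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = nn.DataParallel(model).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(256, 1024).cuda()
y = torch.randint(0, 10, (256,)).cuda()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()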
Tania Farzana Keya says
Thanks for replying, Tim. If possible, can you please write a detailed article on a multi-GPU setup including the training environment setup? It would be very helpful for students from underdeveloped and developing countries.
Ricardo Cruz says
Hi Tim,
This comparison benefits a lot of people, thank you for that!
A small request: Would it be possible to share the raw numbers behind Figure 2? It’s a little hard looking at the chart and it would allow us to more easily do our own cost comparisons. It could also potentially allow us to fortify your analysis; for example, by testing for correlations between performance and cuda cores and things like that.
Thank you!
Tim Dettmers says
This is a great idea! The next generation of GPUs will soon be released and for the next update of this blog post I will also publish the raw data.
Mark Hanslip says
Hi Tim,
Thanks for the post, super helpful. Apologies if this has been answered elsewhere, but can you tell me why the GTX cards perform so much better than RTX on WordRNN?
Best wishes,
Mark
Tim Dettmers says
I either made a benchmarking mistake, or it has to do with the shared memory architecture of GTX GPUs being different. Since they have more shared memory, some algorithms that depend on very memory-intensive access patterns distributed in small pieces can have benefits on a GTX GPU.
Dong says
Hi Tim,
Thanks so much for this helpful blog post (and the other one about CPUs etc.). I’m thinking about buying RTX 8000s (I’ll start with two; alternatively I’m considering RTX Titans). From a company I received the following suggested configuration to go with the two RTX 8000s:
CPU: 2 x intel Xeon Silver 4110
RAM: 12 x 32 GB DDR4-2666MHz 2Rx4 ECC reg.
SSD: 2 x 480 GB 2,5” SATA 6Gb/s S4510 TLC
HDD: 2 x 1 TB Seagate ST1000NX0303
In particular: (1) the RAM strikes me as unnecessarily high (?); (2) I’m also curious to hear your thoughts regarding the SSD/HDD specs and what you would recommend; (3) does the CPU seem ok?
The goal is NLP research, and hopefully this should be useable for the next few years.
Tim Dettmers says
It seems a little bit overkill on the hardware compared to the GPUs. However, if you add more RTX 8000s over time this can be a pretty good build, and the memory will help you a lot if you are training big transformers. RTX 8000s will get cheaper once the next generation of GPUs is released in a couple of months. The CPUs are great, and the RAM amount is rather standard for servers (server RAM is usually much cheaper than consumer RAM). You can ask them to lower the RAM amount, but you would probably not save much money. If you want to save money, I recommend a desktop with Threadripper 2 and 4x RTX Titans with extenders (otherwise they run too hot). The 24 GB is often enough even for very big transformers.
andrea de luca says
I would not be so sure about high-end cards getting cheaper. Look, for example, at the GP100 (the Pascal top dog). Even as of now, it is not any cheaper…
Pascal says
Hi Tim,
Thanks for this great article. It is just what I was looking for and would like to follow through with a few questions.
Our research group is looking to buy a GPU to speed up data analysis. We mostly deal with genetic data, so we are talking about large data in TBs. The main challenge has been the time it takes to run large models (including ML) and imputations, which can run into weeks on the CPU-based local server, hence the need to try out a GPU workstation. I looked up NVIDIA’s product line for data science and saw they promote the Quadro RTXs and have CPU recommendations to go with either a single or dual GPU. See below.
“Single GPU – six-core Intel Xeon W-2135 CPU with a base clock speed of 3.7GHz and turbo frequency of 4.5GHz. At least 128GB of RAM.
Dual GPUs- dual Intel Xeon Silver 4110 CPUs, each with eight cores, a base clock speed of 2.1GHz, and a turbo frequency of 3GHz. At least 192GB of RAM.”
Going through your blog I see that you recommend the RTX 2070. I also came across another article which said that the GV100 provides hardware support for 64-bit FP while the RTX cards go up to 32-bit. I was considering either the RTX 8000 or 6000 or the GV100, and I am wondering whether for our work having dual GPUs would have any added benefit. Also wondering if the RTX 2070 would suffice. What would you recommend?
Thanks.
Tim Dettmers says
Hi Pascal,
thank you, this is an interesting request. For genetic data the important bit is how you load the dataset and how much of the data/model is held on the GPU at any time. What you can do without any problem is to stream data from disk (preferably an NVMe SSD RAID 0) to the GPU. Depending on how much memory you need, I would recommend the RTX 8000 or 6000 if you need 64-bit FP. Otherwise, the RTX Titan works well, but it can have cooling problems. If you can run your program in 11 GB of memory and you only need 32-bit FP, then the RTX 2080 Ti is the GPU to go for! I would not go down to the RTX 2070 because it offers too little memory. You can definitely use 2 GPUs. I would use 2x RTX Titan (because 4x is difficult to cool) or 4x RTX 2080 Ti with blower fans. If you buy RTX 6000s or 8000s they usually come in servers with good cooling, so you can get 4 or 8 GPUs in a server. Let me know if you have more questions!
andrea de luca says
Tim, are you sure about the FP64 capabilities? AFAIK all the Turing cards are FP64-castrated, and you need Volta for high-precision FP arithmetic (at minimum a Titan V or GV100).
Shane says
Hi Tim, thanks for such in-depth reviews, very helpful.
I am looking to upgrade a machine I have for ML/DL purposes, with the end goal of being able to host one or two trained networks to provide inferences as part of a computer vision service while still having computing resources available for further training/testing. The machine is a dual-Xeon Dell R720, so I can fit two full-size GPUs, including the passively cooled Tesla series…
Now, I know you recommend steering clear of Tesla cards due to price and the likely consumer complications of external cooling solutions, but 24GB K80s are currently ~$300 and I already have the server infrastructure. I am considering buying two of them. This way, as each K80 has two GPUs with 12GB each onboard, I can host two nets on one of the cards, each net using up to 12GB dedicated, and still have two GPUs (one K80) for further testing/training. When operating in conjunction, it seems using both GPUs on a K80 gives the performance of one M60 or one 1080 Ti… but the benefit over either of those is that I can host 2x the nets/models for inference as a service.
Does this make sense? Can you poke holes in my logic of going this route over 2 x 1080 Ti or 2 x 2080?
Thanks!
Tim Dettmers says
A K80 for $300 is a pretty good price and it might well be worth it. They will be a bit slow, though, but if you do startup-type inference they will be more than sufficient, and with the extra memory you can do a lot of tricks / host multiple models, which is great for startups. Otherwise I would probably recommend a GTX 1080 Ti for inference rather than an RTX 2080 due to the additional memory.
Andrey says
Hi Tim, thanks for this job!
Can you tell me what bandwidth the K80 has between the chips? And can I host one net using up to 24GB on a K80?
Tim Dettmers says
I am not sure if there are good numbers for the bandwidth between the K80 chips. I remember that with old dual-GPU cards the bandwidth was better than PCIe 3.0, but I do not know the exact numbers. To combine the memories, I think you need to use model-parallel code. I do not think you can combine them in a non-programmable way.
Vicky Patel says
Hi Tim,
Thanks for this wonderful post.
I am a deep learning beginner on a tight budget.
Can you please suggest which I should choose between
the GTX 1060 6GB and the GTX 1650 Super 4GB, as both are available at the same price to me?
Tim Dettmers says
I would go for the GTX 1060.
Aster says
Tim,
First things first… as many others have stated, thanks for taking the time to write/blog about your experiences, advice, etc.
Can I ask why you suggested the GTX 1060 6GB over the GTX 1650 Super 4GB? From NVIDIA’s website, looking at the specs…
GTX 1060 6GB
Compute Capability: 6.1
NVIDIA CUDA Cores: 1280
Base Clock (MHz): 1506
Memory Speed: 8 Gbps
Standard Memory Config: 6 GB GDDR5/X
Memory Interface Width: 192-bit
Memory Bandwidth (GB/sec): 192
Bus Support: PCIe 3.0
Graphics Card Power (W): 120 W
Recommended System Power (W): 400
Supplementary Power Connectors: 6-Pin
GTX 1650 Super 4GB
Compute Capability: 7.5
NVIDIA CUDA Cores: 1280
Base Clock (MHz): 1530
Memory Speed: 12 Gbps
Standard Memory Config: 4GB GDDR6
Memory Interface Width: 128-bit
Memory Bandwidth (GB/sec): 192
Bus Support: PCIe 3.0
Graphics Card Power (W): 100
Recommended System Power (W): 350
Supplementary Power Connectors: 6-Pin
… as far as what I have been able to gather, neither of these have Tensor cores.
Did you suggest the GTX 1060 because it has 6GB vs 4GB on the GTX 1650?
So in general, is it better to have more RAM than higher compute capability?
For the sake of argument, let's say the GTX 1650 had 6GB; would you then suggest the GTX 1650 over the GTX 1060?
Thanks.
Aster
Tim Dettmers says
Hi Aster,
yes, it is mostly because of the memory. 6 GB is already pretty small. I guess for some applications, for example to get started with deep learning or to use a GPU for a class, the GTX 1650 would make a lot of sense. If it had more memory, I would recommend it instead of the GTX 1060.
Another reason is availability. In some countries, such as India, GPU supply can be quite erratic and older GPUs are often easier to find.
Hannes Zietsman says
There are more affordable alternatives to AWS and Google now. For example, https://vast.ai, which is a GPU sharing platform, offering anything from multiple RTX 2080 Tis to one or two GTX 1070s, and even some Teslas. You will need some knowledge of running your task in a Docker container.
Tim Dettmers says
I have to look into this. Thanks for sharing.
Filipp says
vast.ai is a fraud. Don't use their service. I made a one-time purchase at https://vast.ai/ but they continue to charge money to my card and refuse to remove my card from their system.
Behnam says
Thank you for your great explanation.
Nvidia, Gigabyte, MSI, and PNY all produce the GeForce RTX 2080 Ti, and the chip itself comes from Nvidia in each case.
Is there a big difference between these? Which of them do you suggest?
Tim Dettmers says
No, there is no real difference. Take the one with the best cooler or the cheapest one.
GK says
Hi Tim,
Thank you so much for your post. You mentioned in the main text that it's better to have the same GPU types. What about different manufacturers? I am considering buying one GPU now and plan to buy another later. Does it matter if I have one Gigabyte RTX 3090 and one MSI RTX 3090?
Also, what about the same manufacturer and model but different specs? E.g., one Gigabyte RTX 3090 Xtreme and one Gigabyte RTX 3090 Master.
Regards,
GK
Tim Dettmers says
Hi GK,
different manufacturers are fine and should not give you any disadvantage. Overclocked GPUs are not much faster than normal GPUs anyway, and if you parallelize across GPUs, they will all perform as fast as the slowest GPU. But usually that does not matter, and I would just go for the cheapest GPUs that you can find.
Jörg says
Hi Tim,
I really appreciate your posts about ML hardware. I'm currently building an ML “workstation” based on the AMD X570 chipset and AM4 socket. I read that I should not think that much about PCIe lanes, but in fact I do, because I am thinking about buying
(a) 2 * GeForce RTX 2080 SUPER
or
(b) 1 GeForce RTX 2080 Ti
Both options currently cost the same. The Ryzen CPUs have 24 PCIe lanes – 4 are connected to the X570 chipset. Of the remaining 20 lanes, 4 are connected to the NVMe slot. Hence the boards have the following options:
– run 1 GPU with full x16 speed
– run 2 GPUs x8/x8
– run 3 GPU s x8/x8/x4
My ML interests are mixed – but I think most of the stuff that runs on the GPU will be computer vision and not that much NLP. (Other ML stuff I experiment with uses scikit-learn, e.g. SVMs, which does not run on the GPU anyway; because of that I include 64 GB of RAM on the board – the maximum of 128 GB is much too expensive currently.)
My instinct currently tells me to start with 1x RTX 2080 Super and use 16-bit. If that does not work, buy a second RTX 2080 Super.
By the way, regarding memory: both cards do support unified memory using NVLink. I wonder whether by using NVLink it would be possible to run the network on one card but use the combined memory of the second (and the host's memory)? (https://devblogs.nvidia.com/how-nvlink-will-enable-faster-easier-multi-gpu-computing/)
Regards
Jörg
Tim Dettmers says
If you want to combine the memory, you need to use NVLink and model parallelism, which is not usually done. The x8/x8 configuration is great for your use case. x8/x8/x4 is also fine, but make sure your motherboard supports it. The 8 GB of memory on the RTX 2080 Super is sufficient if you use some memory tricks like gradient accumulation.
andrea de luca says
Would NVLink allow memory pooling even on consumer cards?
Jörg says
Thank you for the reply. I decided to use two cards too. My motherboard supports all these combinations. And doing 16-bit calculations will, I think, help to overcome the “small” memory size.
In addition, PyTorch is now able to do model parallelism.
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
Looking forward to test that as well.
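For anyone curious, the model-parallel idea from that tutorial boils down to placing different layers on different devices and moving activations between them. A minimal sketch (my own rough illustration with made-up layer sizes, assuming two CUDA devices "cuda:0" and "cuda:1"):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # The first half of the network lives on cuda:0, the second half on cuda:1,
    # and the activations are moved between the devices in forward().
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(2048, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
y = torch.randint(0, 10, (32,), device="cuda:1")
loss = nn.functional.cross_entropy(model(torch.randn(32, 1024)), y)
loss.backward()                      # autograd handles the cross-device graph
optimizer.step()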
Waldemar says
Great article. Thank you so much. I went through the article and comments but couldn't find the answer to this question:
why do you advise avoiding Founders Edition cards? What are the downsides?
I am new to ML and am planning to replace my GTX 1060 3GB anyway. Used GTX 1080/1080 Ti FE cards are easily accessible at reasonable prices.
It is easy to find water-cooling kits for FE cards, and they are cheaper.
Thank you so much for your time.
Tim Dettmers says
The RTX FE cards had major cooling problems, and usually FE cards are a bit more expensive with no real performance gain. If you can get a cheap GTX 1000-series FE card, that is a pretty good deal though.
Lucas says
I believe that does not apply to the RTX 30 series anymore, as they totally redesigned the cooling of those cards and the FE are actually cheaper than the others (at least the MSRP). It might be a good update for this article.
Sandeep says
Hi Tim,
I got this advice from a vendor of GPU systems. I was arguing that for text models FP16 and the RTX 2080 Ti are good enough and comparable to a V100. Also, in their benchmarking they did not test the RTX with NVLink, but the V100 was tested with FP16. I got the response below. I just wanted to check whether NVLink is of no use when using RTX 2080 Tis. Please advise; your input is much appreciated here as I would use it for my next purchase.
“Quadro series cards like the RTX 8000 and RTX 6000 have much more vRAM (respectively 48 GB and 24 GB) than does the RTX 2080 Ti (11 GB). Hence you can train much bigger networks on the RTX 6000, RTX 8000, and Titan RTX (24 GB vRAM) that you can on the RTX 2080 Ti. In terms of the number of GPU CUDA cores though, they are all very similar.
Quadro series GPUs scale much better in the sense that the advantage of the 8x RTX 6000 over 8x RTX 2080 Ti is disproportionately larger than the advantage of 2x RTX 6000 over 2x RTX 2080 Ti for multi-GPU training.
There are two main reasons for this,
First is peering.
GeForce cards, like the RTX 2080 Ti and Titan RTX, cannot peer. This means that if they need to communicate, they have to go through the CPU. With many GPUs, the CPU can become a bottleneck for communication with all of the GPUs trying to communicate with it at the same time. This is especially true when PLX switching is necessary to allocate the existing PCIe lane bandwidth amongst the GPUs. Hence, for multi-GPU training, GeForce cards do not scale very well because of this.
Quadro series GPUs, like the RTX 8000 and the RTX 6000 can peer. This means that they can communicate directly over PCIe, and not have to go through the CPU.
The second reason is NVLink.
For more than 2x GPUs, GeForce cards are not NVLinked. This is because for GeForce GPUs (which includes the Titan RTX and 2080 Ti), the physical NVLink bridge is either 3 or 4 slots wide. If it was 2 slots wide, at least the GPUs could be connected in pairs, but unfortunately that is not the case.
NVLink for Quadro series GPUs (like the RTX 6000 and the RTX 8000) is 2 slots wide. This means that the GPUs can be connected with it in pairs, which further enhances the available communication bandwidth between the cards in each pair.
That being said, in terms of performance/dollar, we very much recommend the GeForce cards, as the Quadro cards are much more expensive.”
Tim Dettmers says
I would say that analysis is very much on point. Quadro cards are more expensive, but also yield better parallel performance and if you train large models like transformers, the extra memory will also give you huge performance gains. So they can make sense in some cases, but their cost/performance is not ideal for many applications.
Yufeng says
Hi Tim,
This is very helpful — thank you for spending the time to help people like us.
I’m curious about whether you have any experience in double-precision computation. Say I have a logistic regression that I would like to estimate by maximum likelihood, but I’m interested in estimating the parameters precisely (rather than training a neural network for prediction). Could I still stick to FP32 or do I need to move over to FP64? This has a big impact on which hardware I choose.
Thanks in advance.
Yufeng
Tim Dettmers says
The question here is how precise is precisely? How many significant digits do you need? 32-bit float is accurate to about 7 digits, and 64-bit floats to about 16 digits.
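As a quick illustration of those digit counts (a tiny PyTorch sketch of my own, not tied to any particular estimator):

import torch

# float32 keeps roughly 7 significant decimal digits, float64 roughly 16.
x64 = torch.tensor(1.0, dtype=torch.float64) + 1e-12
x32 = torch.tensor(1.0, dtype=torch.float32) + 1e-12
print(x64.item())   # ~1.000000000001 -- the tiny increment survives in FP64
print(x32.item())   # 1.0             -- lost below FP32 precision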
Ken says
Can VRAM be pooled on Linux with NVLinked 2080 Supers or 2080 Tis?
Do AVX-512 and the MKL libraries still matter? I'm looking at building a Threadripper 3960x Linux box.
andrea de luca says
I don’t know about VRAM pooling on consumer cards, but I’d wager you won’t be able to do it. NVLink on consumer cards is very different from the one you find on the Titans, Teslas, and Quadros (RTX).
Yes, MKL is still very important for the preprocessing phases (data augmentation, mostly), but Zen 2 is good at it, in contrast to Zen and Zen+.
lihan says
Thank you for your valuable insights, Tim! I bought a 2080 Ti recently, and I’d like to ask 2 questions about the Apex amp framework (a lot of folks are encountering obstacle 1, as reflected in the repo issues, without satisfactory answers).
1. I am trying to do 16-bit training on a transformer with mostly nn.LayerNorm layers. Is that worthwhile? When I load a partly trained model, my loaded model returns a NaN loss (this does not occur in normal FP32 training). This leads amp.scale_loss to complain about gradient overflow until the scaler approaches 0. Have you ever encountered this?
2. Can an FP32-trained model be converted to FP16 training without penalty? What are the drawbacks?
Thank you and continue the good work, please!
Tim Dettmers says
1. I have encountered it before. What often helps is to start with a smaller warmup learning rate.
2. There should be little to no penalty in accuracy/perplexity/loss when you convert a model from FP32 to FP16. The problems usually stem from training and not prediction, so once a model is trained you probably will not lose much predictive performance.
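In case it helps, here is a minimal sketch of the kind of linear warmup meant in point 1, using PyTorch's LambdaLR (the base learning rate of 1e-4 and the 1000 warmup steps are just placeholder values, not a recommendation):

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder base LR

warmup_steps = 1000                                          # placeholder value
def warmup(step):
    # Linearly scale the LR from ~0 up to the base LR over the first warmup_steps.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup)

for step in range(5):                        # per training step:
    loss = model(torch.randn(8, 512)).sum()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()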
Aditya says
Hi Tim,
Thanks for the wonderful post. It was really helpful in picking out a GPU. I have decided to go with RTX 2070 Super.
I would appreciate your input on picking a motherboard and a CPU.
I was initially looking at the latest Ryzen 5 CPU, which costs around $200, and a motherboard which would support SLI, also around $200.
However, I noticed that there is a sale on 1st-generation Threadripper CPUs on Amazon. I can get one for almost $140. The biggest plus is that it has 64 PCIe lanes, but the biggest con is that the motherboard for it costs around $320. I know PCIe lanes aren't the most important thing, so I was wondering what you think I should go with.
Tim Dettmers says
I think either option is a good choice. The Threadripper is a bit more powerful, but it is also $60 more expensive. I would take another look at the motherboards. I would base the decision on expansion slots for NVMe SSDs and the number of supported GPUs. A Threadripper board would allow for further expansion in the future and can usually support 3x NVMe SSDs without a problem. If you are eyeing future expansion, the Threadripper might be a good choice. Otherwise, the Ryzen 5 setup is just a bit cheaper, and why buy something more expensive that you do not need?
Eric Bohn says
For someone just starting out: how long does it typically take before they outgrow a single GPU and want to move toward a multi-GPU setup? Assume working at it 10-20 hours per week in a variety of use cases.
Juan says
Hello, thanks, great post.
Would it be better to have 4x RTX 2080 Ti or 2x RTX Titan to work with Faster R-CNN/SSD/RetinaNet on images of 5472×3648 px? Some objects are only 50 px.
Thanks
Tim Dettmers says
5472×3648 px is very large! Even if you break the images down into 50×50 patches, you probably gain a lot of ease by just having a larger GPU memory to work with this kind of data. I would recommend 2x RTX Titan for this kind of work.
Erick says
I’m trying to decide on an ML environment for my work. One question that pops up is: you seem to discourage Tesla cards, but according to https://cloud.google.com/compute/all Google only offers Tesla cards; why would that be?
Tim Dettmers says
NVIDIA has a policy of only selling Tesla cards, and not consumer GPUs, to companies. That is the main reason, but there are also other reasons that involve a lot of details.
Joren vanGoethem says
My AI teacher (I just started uni) prefers the 2070 Super due to the not-much-higher cost but significant performance increase. Could you possibly add the Super cards to the charts?
Tim Dettmers says
I might do so over the Christmas period. I do not have time for that right now.
andrea de luca says
We can guess that the 2070S will perform some 5% below the 2080 non-S, at the cost of a regular 2070 (~$500).
Go for it if 8GB is not an issue. Go for a blower model.
Recently I stacked three 2060 Supers (blower) for a bit more than $1000 and they are not bad. You can parallelize almost any task (including transformers), and you have 24GB of VRAM in total, which is quite something.
Shahid Siddiqui says
Dear Tim! I was benchmarking two laptops for deep learning: 1) Core i7-8750H, 16 GB 2666MHz, GTX 1070, NVMe SSD, Ubuntu 16.04; 2) Core i5-8300H, 8GB 2666MHz, GTX 1050, NVMe SSD, Windows 10. Running a simple two-layer network on CIFAR using only the CPUs, 1) had an average epoch time of 80s and 2) had an average epoch time of 180s. But when training on the GPU, 1) gave an average of 88s while 2) took only 70s. I made sure both PyTorch environments are exactly the same. What could be other possible reasons? Thank you
Tim Dettmers says
These are strange results. Can you try “OMP_NUM_THREADS=1 python …” for your scripts and see if the performance changes?
Shahid Siddiqui says
Sorry Tim! I freaked out too soon. It turned out that PyTorch was unable to use the 1070 GPU on Linux and was giving me CPU numbers. Anyway, I trained a VGG on both: the i5-8300H CPU takes 737s per epoch while the i7-8750H takes 263s. When using the GPUs, the 1050 took 31s while the 1070 took just 13s. Thank you for the prompt response.
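For anyone hitting the same thing, a quick sanity check in plain PyTorch (nothing specific to these laptops) saves a lot of confusion:

import torch

# Check that training is actually running on the GPU.
print(torch.cuda.is_available())        # False means you are benchmarking the CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 10).to(device)
print(next(model.parameters()).device)  # should say cuda:0, not cpu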
Jo says
If we really do want to use AMD GPUs, which ones are better than others for deep learning? I tried Googling this, but there is just too little useful information to be found.
Tim Dettmers says
It is difficult to say because, as you already said, there are too few reliable benchmarks. Going with the most recent model that fits your budget is probably the right call. In terms of GPU memory, the requirements are the same for AMD and NVIDIA GPUs. So for state-of-the-art models you want at least 11 GB. If you want to train big transformers you want more. If you want to do Kaggle competitions, less is okay.
andrea de luca says
The Radeon VII seems to be the only viable option. It is the only one with 16GB of VRAM. All the others are slow and limited in terms of memory.
Note also that Navi GPUs (5700) are not supported by ROCm.
Miguel says
Greetings. By “Navi GPUs (5700)” do you mean the RX 5700 XT, for instance? So those are not compatible with ROCm?
Sorry, I am not that much of a connoisseur…
Will Stewart says
I am seeking to purchase a home computer for both general use and deep learning.
Criteria;
1. Mobility is highly preferred, e.g., laptop
2. Price is important, e.g., nothing like “the sky is the limit”
3. Energy efficiency is a goal. If I’m not using it specifically for deep learning, I’d greatly prefer to have a low consumption computer (for a low carbon footprint).
4. I’ll be happy to start with Win 10, though I will undoubtedly dual boot it to Ubuntu like I have to 4 other machines in past, and use whichever OS provides the best overall results (noting some configuration and fan speed quirks with Linux and eGPUs).
I had considered a purpose built deep learning desktop, though it has no mobility and draws too much power in ‘normal’ operation.
I am considering the following;
Alienware 17 Gaming Laptop
https://www.costco.com/.product.1340132.html
Intel Core i7-8750H
16G DDR4 RAM
GeForce GTX 1070 Max-Q 8G
1 TB hybrid drive
1 Thunderbolt port, 3 USB 3.0 “Superspeed” ports
This gives me the flexibility to;
- Use the embedded GPU as at least a starting point for model experimentation, where I can then shift to a more powerful eGPU with more memory, an AWS p3 VM, or Kaggle. If I choose an eGPU, then I would knowingly accept the 15-20% hit in training duration.
– Upgrade eGPUs as they improve in power, memory, and price reduction
As deep learning can run 24/7 putting a significant thermal demand on laptop components, I’m paying *especially* careful attention to various cooling approaches, and will monitor CPU/GPU temperatures closely.
I may also add an SSD drive to store/stage the learning data on during training, to avoid long waits for batch pulls from the HDD. I don’t know how to tell if the motherboard (R5?) contains the Thunderbolt circuitry, or if it is on a daughter board.
All thoughts/critiques welcomed!
Tim Dettmers says
It sounds like you have thought about this pretty well! It seems like the perfect choice for you. I am not sure, though, what you mean about the Thunderbolt circuitry being on the motherboard or a daughter board — what is this referring to?
andrea de luca says
If you are going to purchase a laptop, I think you should be aware of some issues:
1. Laptop hardware scarcely tolerates high-demand workloads 24/7. You could run into overheating issues in a gaming laptop with a discrete GPU.
2. Laptop GPUs are somewhat limited. As far as I know, no laptop GPU goes above 8GB of VRAM. In other words, forget about training big transformers.
3. Pascal GPUs (like the 1070 you mentioned) do possess FP16 capability, but you will sadly observe that they have convergence problems when training in 16-bit. This will exacerbate the memory scarcity. I strongly urge you to purchase at least a Turing laptop.
Stas says
Hi,
Can you say something about, say, 3x 2060 Super vs a 2080 Ti? Is it worth trying 3 lower-end cards vs one top-end card for DL?
Tim Dettmers says
It depends on your problem. If it does not require that much memory, 3x 2060 Super can make sense. However, most modern models require a fair amount of GPU memory and run slowly otherwise. So carefully check whether you think the memory on the 2060 Super is sufficient.
wayne says
I am starting out in ML and wanted to get a laptop to serve multiple roles, including graphical work. You mentioned in your awesome blog that Quadros are a no-no. If I want to get something like a Lenovo P53 with a Quadro T1000 to fit within my budget and means, is this still OK?
Tim Dettmers says
Quadros usually have very low cost/performance, but if you find a good deal that is fine.
Sreevanth says
Hi Tim,
Thanks for sharing; it has resolved the doubts I had. I am planning to buy an MSI GE63 laptop with an RTX 2070. I'm a beginner, but I would like to invest in a good laptop for the long term in deep learning. Is this a good laptop, or do you suggest any other laptop?
Tim Dettmers says
An RTX 2070 in a laptop is pretty powerful as laptop GPUs go. I think it will last you quite a while.