Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How do you make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and offer advice that will help you make a choice that is right for you.
This blog post is designed to give you different levels of understanding of GPUs and the new Ampere (RTX 30) and Ada (RTX 40) series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ada series, you can skip right to the performance and performance per dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post and the content surrounding them explains the details of what makes up GPU performance.
(2) If you worry about specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.
(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.
Overview
This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I discuss the unique features of the new NVIDIA RTX 40 Ada GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.
How do GPUs work?
If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you understand in which cases GPUs are fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:
This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.
The Most Important GPU Specs for Deep Learning Processing Speed
This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then the FLOPS of a GPU.
Tensor Cores
Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication, Tensor Cores are very useful. In fact, they are so powerful that I do not recommend any GPUs that do not have Tensor Cores.
It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show you, for a simple example of an A*B=C matrix multiplication where all matrices have a size of 32×32, what the computational pattern looks like with and without Tensor Cores. This is a simplified example, and not exactly how a high-performance matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.
To understand this example fully, you have to understand the concept of cycles. If a processor runs at 1GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operation needs to wait for the previous operation to finish. This is also called the latency of the operation.
Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. These numbers are for Ampere GPUs, which have relatively slow caches.
- Global memory access (up to 80GB): ~380 cycles
- L2 cache: ~200 cycles
- L1 cache or Shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
- Fused multiplication and addition, a*b+c (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.
For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.
To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.
Matrix multiplication without Tensor Cores
If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about six times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 float matrices into shared memory tiles can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.
To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:
200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles
Let’s look at the cycle cost of using Tensor Cores.
Matrix multiplication with Tensor Cores
With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with a single memory transfer (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Core matrix multiplication, in this case, is:
200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.
This example is simplified. For example, usually each thread needs to calculate which memory to read and write as data is transferred from global memory to shared memory. With the new Hopper (H100) architecture, we additionally have the Tensor Memory Accelerator (TMA) compute these indices in hardware, which helps each thread focus on computation rather than on computing indices.
Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)
The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support to perform asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. The TMA unit combines asynchronous copies with index calculation for reads and writes — so each thread no longer needs to calculate which element to read next and can focus on doing more matrix multiplication calculations. This looks as follows.
The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats.
As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile. This means, the second read takes only 200 – 34 – 1 = 165 cycles.
Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.
165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.
This accelerates the matrix multiplication by another 15%.
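To make this arithmetic easy to reproduce, here is a minimal Python sketch that plugs the approximate latencies listed above into the three scenarios. These are rough modeling numbers for illustration, not measurements of real kernels.

```python
# Rough cycle-cost model for the simplified 32x32 matrix multiplication example.
GLOBAL_MEM = 200   # cycles for the parallelized global -> shared memory load
SHARED_MEM = 34    # cycles per shared memory access
FFMA = 4           # cycles per fused multiply-add
TENSOR_CORE = 1    # cycles per Tensor Core matrix multiply

# Without Tensor Cores: 8 shared memory accesses + 8 FFMA steps
no_tc = GLOBAL_MEM + 8 * SHARED_MEM + 8 * FFMA            # 504 cycles

# With Tensor Cores: one shared memory transfer + 64 parallel Tensor Core ops
with_tc = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE           # 235 cycles

# With async copies / TMA: the next global load overlaps with compute,
# so only 200 - 34 - 1 = 165 cycles of the load remain exposed
with_async = (GLOBAL_MEM - SHARED_MEM - TENSOR_CORE) + SHARED_MEM + TENSOR_CORE  # 200 cycles

print(no_tc, with_tc, with_async)                          # 504 235 200
print(f"Tensor Core speedup: {no_tc / with_tc:.2f}x")      # ~2.14x
print(f"Async copy speedup:  {with_tc / with_async:.2f}x") # ~1.18x
```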
From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is by far the largest cycle cost for matrix multiplication with Tensor Cores, we would have even faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).
Memory Bandwidth
From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for these large neural networks, the Tensor Cores are idle about 50% of the time.
This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, the A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of the speedup of an A100 vs V100 is 1555/900 = 1.73x.
Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is.
While logically, L2 and L1 memory are the same, the L2 cache is larger and thus the average physical distance that needs to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.
For matrix multiplication, we can use this hierarchical separation into smaller and smaller, and thus faster and faster, chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often for short just tiles.
We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.
Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.
Each tile size is determined by how much shared memory we have per streaming multiprocessor (SM) and how much L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:
- Volta (Titan V): 128 KB shared memory / 6 MB L2
- Turing (RTX 20s series): 96 KB shared memory / 5.5 MB L2
- Ampere (RTX 30s series): 128 KB shared memory / 6 MB L2
- Ada (RTX 40s series): 128 KB shared memory / 72 MB L2
We see that Ada has a much larger L2 cache allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not of other GPUs). As such, data needs to be loaded from global memory only once and is then available through the L2 cache, making matrix multiplication about 1.5-2.0x faster on Ada. For larger models, the speedups during training are lower, but certain sweet spots exist which may make certain models much faster. Inference with a batch size larger than 8 can also benefit immensely from the larger L2 cache.
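As a rough back-of-envelope check of this claim, the Python sketch below estimates the size of one BERT Large FFN matrix multiplication. The batch size (32) and sequence length (512) are assumptions I picked for illustration, and the calculation ignores the output matrix and other buffers.

```python
# Back-of-envelope: do a BERT Large matmul's input and weight fit into L2 cache?
bytes_fp16 = 2
tokens = 32 * 512                    # assumed batch size * sequence length
hidden, ffn = 1024, 4096             # BERT Large hidden and FFN dimensions

activations = tokens * hidden * bytes_fp16   # input matrix of the FFN matmul
weights = hidden * ffn * bytes_fp16          # weight matrix of the FFN matmul

total_mb = (activations + weights) / 1024**2
print(f"{total_mb:.0f} MB")          # ~40 MB: fits Ada's 72 MB L2, not Ampere's 6 MB
```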
Estimating Ada / Hopper Deep Learning Performance
This section is for those who want to understand the more technical details of how I derive the performance estimates for Ada and Hopper GPUs. If you do not care about these technical aspects, it is safe to skip this section.
Practical Ada / Hopper Speed Estimates
Suppose we have an estimate for one GPU of a GPU architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 vs H100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and different numbers of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased performance with the same batch size.
To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.
As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
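The sketch below shows how one might combine these two correction factors to de-bias a vendor benchmark number. The raw 8-GPU speedup value is a made-up placeholder, and the combination shown is my simplified reading of the procedure described above, not NVIDIA's methodology.

```python
# Hypothetical de-biasing of a reported multi-GPU benchmark number (CNN case).
raw_speedup_8x = 2.00        # reported 8x A100 vs 8x V100 speedup (placeholder value)
batch_size_factor = 1.135    # ~13.5% throughput gain from doubling the batch size
networking_factor = 1.05     # ~5% lower parallelization overhead on the A100 system

# Remove the advantages that come from larger batches and better networking
unbiased_speedup = raw_speedup_8x / (batch_size_factor * networking_factor)
print(f"{unbiased_speedup:.2f}x")    # estimated per-GPU architecture speedup
```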
Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:
- SE-ResNeXt101: 1.43x
- Mask R-CNN: 1.47x
- Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like im2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
Possible Biases in Estimates
The estimates above are for H100, A100, and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.
As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
Currently, no degradations for Ada GPUs are known, but I will update this post with news on this and let my followers on Twitter know.
Advantages and Problems for RTX 40 and RTX 30 Series
The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use feature as they provide the same performance boost as Turing but without any extra programming required.
The Ada RTX 40 series has even further advances like 8-bit Float (FP8) tensor cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables in the RTX 40 can be easily prevented by connecting the power cable correctly.
Sparse Network Training
Ampere allows for fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 elements of these 4 to be zero. Figure 1 shows what this could look like.
When you multiply this sparse weight matrix with some dense inputs, the sparse matrix tensor core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the tensor core which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.
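To make the 2:4 compression idea concrete, here is a small NumPy sketch. It only illustrates the concept of keeping two values (plus their positions) out of every group of four; it is not the actual storage format used by the sparse Tensor Cores.

```python
import numpy as np

# Conceptual 2:4 structured sparsity: in every group of 4 weights, keep the 2
# largest-magnitude values and remember their positions within the group.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)   # toy weight matrix

groups = w.reshape(-1, 4)                             # slice into groups of 4 elements
keep = np.sort(np.argsort(-np.abs(groups), axis=1)[:, :2], axis=1)  # positions of kept values
values = np.take_along_axis(groups, keep, axis=1)     # compressed values: half the size

print(w.size, values.size)                            # 128 -> 64 stored values
# 'values' plus the small per-group indices is what gets multiplied against a
# dense matrix at effectively twice the dense throughput.
```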
I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.
While this feature is still experimental and training sparse networks is not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.
Low-precision Computation
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, your gradients explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling, where you multiply the loss by a small number before backpropagating, to avoid this gradient explosion.
The BrainFloat 16 format (BF16) uses more bits for the exponent such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. So with BF16, you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability when using the BF16 format, at a slight loss of precision.
What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TensorFloat-32 (TF32) precision, you get near-FP32 stability while getting speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!
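As a minimal PyTorch sketch of what this looks like in practice, assuming a placeholder model and optimizer: FP16 training uses autocast plus a gradient scaler for loss scaling, while BF16 uses the same autocast call without a scaler.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 1024, device="cuda")

# FP16: autocast + GradScaler for loss scaling
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# BF16: same autocast call, no loss scaling needed
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).sum()
loss.backward()
optimizer.step()
```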
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
Fan Designs and GPU Temperature Issues
While the new fan design of the RTX 30 series performs very well to cool the GPU, different fan designs of non-founders edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution to this is to use PCIe extenders to create space between GPUs.
Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This has been running with no problems at all for 4 years now. It can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and the cooling issue for a 4x RTX 4090 setup with a single simple solution.
3-slot Design and Power Issues
The RTX 3090 and RTX 4090 are 3-slot GPUs, so you will not be able to use them in a 4x setup with the default fan design from NVIDIA. This is kind of justified because they run at over 350W TDP, and they will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high-wattage PSUs (1700W+). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there are currently few standard desktop PSUs above 1600W on the market. If you get a server or cryptomining PSU, beware of the form factor — make sure it fits into your computer case.
Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.
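A minimal sketch of how one could set such a limit programmatically with nvidia-smi, wrapped in Python here (the same commands can be run directly in a shell). Setting power limits typically requires root privileges, the supported range depends on the card, and the setting resets on reboot.

```python
import subprocess

# Enable persistence mode (commonly recommended before changing power settings),
# then cap each of the 4 GPUs at 300 W. Typically requires root privileges.
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
for gpu_id in range(4):
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-pl", "300"], check=True)
```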
You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model. The slowdowns reported here are therefore probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
RTX 4090s and Melting Power Connectors: How to Prevent Problems
There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem and that the problem occurred due to user error. Here is a video that shows that the main problem is that cables were not inserted correctly.
So using RTX 4090 cards is perfectly safe if you follow the following install instructions:
- If you use an old cable or old GPU make sure the contacts are free of debris/dust.
- Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
- Test for good fit by wiggling the power cable left to right. The cable should not move.
- Check the contact with the socket visually; there should be no gap between cable and socket.
8-bit Float Support in H100 and RTX 40 series GPUs
The support of 8-bit Float (FP8) is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs, it allows you to load the data for matrix multiplication twice as fast, and you can store twice as many matrix elements in your caches, which in the Ada and Hopper architectures are very large. Now with FP8 tensor cores you get 0.66 PFLOPS of compute for an RTX 4090 — this is more FLOPS than the entirety of the world's fastest supercomputer in 2007. 4x RTX 4090s with FP8 compute rival the fastest supercomputer in the world in 2010 (deep learning started to work just in 2009).
The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate nonsense during inference. I have written a paper about the emergence of instabilities in large language models, and I have also written a more accessible blog post.
The main take-away is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision, everything works just fine.
But Int8 was already supported by the RTX 30 / A100 / Ampere generation GPUs, so why is FP8 in the RTX 40 another big upgrade? The FP8 data type is much more stable than the Int8 data type, and it is easy to use in functions like layer norm or non-linear functions, which are difficult to do with integer data types. This will make it very straightforward to use in training and inference. I think this will make FP8 training and inference relatively common in a couple of months.
If you want to read more about the advantages of Float vs Integer data types, you can read my recent paper about k-bit inference scaling laws. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that, bit-for-bit, the FP4 data type preserves more information than the Int4 data type and thus improves the mean LLM zero-shot accuracy across 4 tasks.
Raw Performance Ranking of GPUs
Below we see a chart of raw relative performance across all GPUs. We see that there is a gigantic gap in 8-bit performance between H100 GPUs and older cards that are optimized for 16-bit performance.
For this data, I did not model 8-bit compute for older GPUs. I did so because 8-bit inference and training are much more effective on Ada/Hopper GPUs due to the 8-bit Float (FP8) data type and the Tensor Memory Accelerator (TMA), which saves the overhead of computing read/write indices — something that is particularly helpful for 8-bit matrix multiplication. This makes 8-bit training in particular much more effective on Ada/Hopper.
I did not model numbers for 8-bit training because to model that I need to know the latency of L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance can well be 3-4x of 16-bit training performance if the caches are as fast as rumored.
But even with the new FP8 tensor cores there are some additional issues which are difficult to take into account when modeling GPU performance. For example, FP8 tensor cores do not support transposed matrix multiplication, which means backpropagation needs either a separate transpose before multiplication or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weights when I experimented with Int8 training in my LLM.int8() project, and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like a transposed matrix multiplication for tensor cores can affect performance.
For old GPUs, Int8 inference performance is close to the 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant if you have relatively large models with 175B parameters or more. If you are interested in 8-bit performance of older GPUs, you can read the Appendix D of my LLM.int8() paper where I benchmark Int8 performance.
GPU Deep Learning Performance per Dollar
Below we see the chart for the performance per US dollar for all GPUs sorted by 8-bit inference performance. How to use the chart to find a suitable GPU for you is as follows:
- Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers)
- While 8-bit inference and training are experimental, they will become standard within 6 months. You might need to do some extra difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.
- Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need.
We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended as they do not have sufficient memory for many use-cases. However, these might be the ideal cards to get started on your deep learning journey. Some of these GPUs are excellent for Kaggle competitions, where one can often rely on smaller models. Since doing well in Kaggle competitions depends more on how you work than on model size, many of these smaller GPUs are a great fit for that purpose.
The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with A6000 Turing). The H100 SXM GPU is also very cost-effective and has high memory and very strong performance. If I were to build a small cluster for a company/academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I get a good deal on L40 GPUs, I would also pick them instead of A6000s, so you can always ask for a quote on these.
GPU Recommendations
I have created a recommendation flow-chart that you can see below (click here for an interactive app from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost-effective GPU that still has enough GPU memory for your use-case. You can estimate the GPU memory needed by running your problem on vast.ai or Lambda Cloud for a while so you know what you need. Vast.ai or Lambda Cloud might also work well if you only need a GPU very sporadically (every couple of days for a few hours) and you do not need to download and process large datasets to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours each day). You can use the example in the “When is it better to use the cloud vs a dedicated GPU desktop/server?” section below to determine if cloud GPUs are good for you.
Is it better to wait for future GPUs for an upgrade? The future of GPUs.
To understand if it makes sense to skip this generation and buy the next generation of GPUs, it makes sense to talk a bit about what improvements in the future will look like.
In the past, it was possible to shrink the size of transistors to improve the speed of a processor. This is coming to an end now. For example, while shrinking SRAM increased its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores gets smaller, this does not necessarily make GPUs faster, since the main problem for matrix multiplication is getting memory to the Tensor Cores, which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve the raw speed of GPUs is to use more power and more cooling, as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.
Chiplets, such as those used in AMD CPUs, are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high-speed on-chip network. You can think about them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD’s next generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.
However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units on the Ampere generation (RTX 30 / A100 / RTX 40) or the extension, the Tensor Memory Accelerator (TMA), both reduce the overhead of copying memory from the slow global memory to fast shared memory (caches) through specialized hardware and so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices which is particularly important for 8-bit computation where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.
Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exist, but research looks promising and I expect the first high-performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years' time.
Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.
We will probably be able to still improve the combination of algorithms + hardware until the year 2032, but after that we will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at this point. It might be that you need an RTX 9090 to run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost for an RTX 9090 and you want to use a laptop + API for image generation and other tasks.
Overall, I think investing in an 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small, and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.
Question & Answers & Misconceptions
Do I need PCIe 4.0 or PCIe 5.0?
Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.
Do I need 8x/16x PCIe lanes?
Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.
How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?
You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.
PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!
How do I cool 4x RTX 3090 or 4x RTX 3080?
See the previous section.
Can I use multiple GPUs of different GPU types?
Yes, you can! But you cannot parallelize efficiently across GPUs of different types since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually gradient update).
What is NVLink, and is it useful?
Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.
I do not have enough money, even for the cheapest GPUs you recommend. What can I do?
Definitely buy used GPUs. You can buy a small cheap GPU for prototyping and testing and then roll out full experiments to the cloud like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/inference on large models only every now and then and spend more time prototyping on smaller models.
What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?
I built a carbon calculator for calculating your carbon footprint for academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?
I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.
I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
What do I need to parallelize across two machines?
If you want to be on the safe side, you should get network cards with at least 50 Gbit/s bandwidth to gain speedups if you want to parallelize across machines. I recommend having at least an EDR InfiniBand setup, meaning a network card with at least 50 Gbit/s bandwidth. Two EDR cards with cable are about $500 on eBay.
In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).
Is the sparse matrix multiplication feature suitable for sparse matrices in general?
It does not seem so. Since the granularity of the sparse matrix needs to be 2 zero-valued elements per 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, which involves pooling 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.
Do I need an Intel CPU to power a multi-GPU setup?
I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.
Does computer case design matter for cooling?
No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3°C better temperatures, while space between GPUs will provide you with 10-30°C improvements. The bottom line: if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.
Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?
Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.
AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors show that some data center card with Tensor Core equivalent is planned for 2020, but no new data emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.
Let’s say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems, if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.
However, if you solve software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., Apex). You can find experts on NVIDIA GPUs and programming around every other corner, while AMD GPU experts are much harder to come by.
In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.
Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.
When is it better to use the cloud vs a dedicated GPU desktop/server?
Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.
Numbers in the following paragraphs are going to change, but they serve as a scenario that helps you understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.
The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS on-demand instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, there is an additional $0.12 per kWh for electricity. This compares to $2.14 per hour for the AWS on-demand instance.
At 15% utilization per year, the desktop uses:
(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year
So 591 kWh of electricity per year, that is an additional $71.
The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of time during the day), would be about 300 days ($2,311 vs $2,270):
$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311
So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.
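Here is the same break-even calculation as a small Python sketch so you can plug in your own prices and utilization; the numbers below are the ones from the example above.

```python
# Break-even point: desktop (RTX 3090) vs AWS V100 on-demand, at 15% utilization.
desktop_cost = 2200                  # USD, 2-GPU barebone + RTX 3090
gpu_watts, cpu_watts = 350, 100
utilization = 0.15
electricity = 0.12                   # USD per kWh
cloud_rate = 2.14                    # USD per hour on-demand

hours_per_day = 24 * utilization
desktop_per_day = (gpu_watts + cpu_watts) / 1000 * hours_per_day * electricity
cloud_per_day = cloud_rate * hours_per_day

break_even_days = desktop_cost / (cloud_per_day - desktop_per_day)
print(f"~{break_even_days:.0f} days")   # roughly 300 days
```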
You can do similar calculations for any cloud service to make the decision if you go for a cloud service or a desktop.
Common utilization rates are the following:
- PhD student personal desktop: < 15%
- PhD student slurm GPU cluster: > 35%
- Company-wide slurm research cluster: > 60%
In general, utilization rates are lower for professions where thinking about cutting edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
Version History
- 2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.
- 2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.
- 2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
- 2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
- 2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
- 2018-11-26: Added discussion of overheating issues of RTX cards.
- 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
- 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
- 2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
- 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
- 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
- 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
- 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
- 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
- 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
- 2015-02-23: Updated GPU recommendations and memory calculations
- 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgments
I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.
For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.
Zoran says
Where can i find the code for the 8-bit inference?
Andrea de Luca says
Hi Tim. I think there is a bit of confusion in the article regarding the RTX A6000. You wrote:
“The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with A6000 Turing). ”
The RTX A6000 is a 48gb Ampere card, not Turing. Its performance (in any domain) is slightly better than a 3090, while in the charts it performs equal to the old Turing workstation cards (which is frankly impossible). Other than that, the Ada 48gb workstation card is officially called “RTX 6000 Ada” (without the “A”). Thanks.
Zoran says
Hello,
Have you had any chance to test high end CPU only interface, for example on Intel 13900K CPU? It seemed like could be pretty fast and a dedicated server with it can be around 100 EUR per month. Much cheaper compared to GPU dedicated server or cloud.
David Laxer says
Hi Tim,
Thanks for your posts.
Do you have any comments on Apple’s M1/M2 chips for Deep Learning research?
Apple’ Metal Performance Shader API only supports float32 as well as 32bit complex numbers. Do you see this as a
‘show stopper’ for Deep Learning Research for M1/M2 chips.
The M1/M2 processors do use considerably less power then NVIDIA GPUs
does this significantly change the trade-off calculus?
Thanks in advance.
Paulo Ricardo says
What a wonderful text.
Can I translate to Portuguese?
Tim Dettmers says
Yes, I would love to see a Portuguese translation! Go for it. The only thing that I would ask you to do if you translate it section by section is to include the source somewhere in the introduction to the translation.
Mike says
Hi Tim,
Great info here. I'm wondering about your thoughts on the new Apple M1 Max chip with 64GB RAM: will it be practical for training?
Thanks
Simon Demeule says
Hi Tim!
First off I want to thank you for taking the time to write all these well researched articles. Your website has been incredibly useful.
I am a master’s student in machine learning and a digital artist who is looking to build a workstation. My main use case is training and running inference of very large (16GB+) image generation and natural language models simultaneously for interactive art installations, and using it for my research (which doesn’t have a precise direction yet). Because of the specifics of how I would use the machine in interactive pieces, a remote cloud machine is not an option due to latency and bandwidth requirements; I absolutely need a physical machine.
I believe I have mostly settled on the right GPUs (2x RTX A5000, chosen for their performance in-between the RTX 3080 and RTX 3090, high VRAM, relatively close cost to the RTX 3090, availability, blower fan design, and possibility of a discount through the NVIDIA inception program). I have not found a ton of information on these cards online, but the Lambda Labs benchmarks and most of what I’ve read seem to indicate these cards are good.
I am not set on the CPU however.
The maximalist in me wants to go for a 3975WX, but that might be overkill. It would make this rig future-proof and enable me to add extra GPUs and RAM later down the line if needed. The prospect of having a ridiculously powerful machine is very exciting too, given that I’ve somehow gotten by owning only laptops so far. I can think of ways of truly using that hardware to its limits, so I believe its power would not go to waste. It is very, very expensive however; I’d need to work on getting a grant for this.
On the more reasonable side of the spectrum a 5950X would likely give nice performance (better on single core tasks, worse on multi-core tasks that scale up massively), but there are no possibilities for expandability. It cannot accommodate 4 GPUs, and RAM would be maxed from the start. It is much more affordable however — this is more realistically achievable.
I’d love to hear your thoughts on this. Thanks!
Tim Dettmers says
Hi Simon! I think the choice of A5000 GPUs is great in your use-case. I think both CPU options can have their advantages. For inference, a CPU can sometimes be critical to achieving good latency, but that can depend on many factors, mainly how much pre and post-processing needs to be done on CPU. Depending on this, it could be that a good CPU has large benefits or not. On the other hand, such performance will not be very noticeable for general use, but might be critical for user-facing applications. So if you aim to build projects that are user-facing and require low latency it might be worth it to go with the 3975WX. However, since CPUs have quirks and many threads are difficult to use, in many cases you will get the same performance for the much cheaper 5950X. As such, it might also be a waste of money. I think I would probably go with the 5950X, and if the user needs to wait for an additional 50ms then the user just needs to wait for that!
Simon Demeule says
Hi Tim!
Thank you for your insight. After some more research, I have found that in synthetic benchmarks the 5950X actually outperforms the 3975WX in both single and multi core performance (which is very surprising given it has half the core count) — it seems the architecture version (Gen 3 vs Gen 2) plays a large role here. It really doesn’t make sense to choose Threadripper Pro over Ryzen for my use case, or at least right now. Gen 3 Threadripper Pro is likely coming out soon, as some benchmark scores have been leaked (and they are definitely scoring higher than Ryzen there). I will likely order the GPUs soon, and maybe wait a bit until the next generation Threadripper Pro CPUs are revealed.
Ricardo says
Hi Tim
I am new to the exciting world of artificial intelligence and I’m learning on a 4GB Jetson Nano, but my models’ training is a little slow.
I would love to buy a faster graphics card to speed up the training of my models but graphics card prices have increased dramatically in 2021.
I found a Lenovo IdeaPad 700-15ISK with a GTX 650M 4GB DDR5 GPU at a reasonable price, and I would like to know if this GPU is a good choice to start training models.
Tim Dettmers says
Hi Ricardo,
it is difficult with the increasing prices, but I think a GTX 650M would still be a bit too slow. Try to wait a bit longer and look for used laptops with a better GPU.
Carl Miller says
Hi,
Awesome article.
I am looking to rent a dedicated GPU server for inference only, but I'm finding it really hard to get decent advice on how I should serve the model.
I have a Flask app listening for base64 requests; my Python app pushes them through TensorFlow and sends the output image back to the client.
Currently it's using the CPU, and a single image (which is usually 1080p) takes 20 seconds per request. I need this to be as fast as possible, so I'm renting servers with K40s and K80s to test. So far it's not making a massive difference in time.
The problem I have is that the server can get a request at any time during the day, so hourly rental is just not cost-effective at the moment. One day there might be 1 request, and the next day there might be 1000.
Would you recommend I try a server with RTX cards instead of a K40/K80?
Tim Dettmers says
This is a difficult use case. I would probably build a small desktop with a GPU to reduce costs, but upload speed is important for this and is usually limited for residential internet. I definitely would recommend a small RTX card over a K40/K80; you should see latencies that are about 4 times better.
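To make the serving pattern concrete, here is a minimal sketch of a GPU-backed base64-image endpoint. PyTorch/torchvision, the ResNet-50, the `/infer` route, and the preprocessing are purely illustrative assumptions, not the commenter's actual TensorFlow code; the point is only that moving the model to the GPU is where the latency improvement would come from.

```python
# Hypothetical sketch of a GPU-backed base64-image inference endpoint.
# Model, route name, and preprocessing are illustrative placeholders.
import base64, io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import models, transforms

app = Flask(__name__)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights="IMAGENET1K_V1").eval().to(device)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

@app.route("/infer", methods=["POST"])
def infer():
    # Decode the base64 payload into an RGB image.
    img_bytes = base64.b64decode(request.json["image"])
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    x = preprocess(img).unsqueeze(0).to(device)
    with torch.inference_mode():
        logits = model(x)
    return jsonify({"class_id": int(logits.argmax())})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```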
Veer says
Hi Tim,
I am trying to build a new PC for DL. This would be a pure work PC and not for games. For now I will go with one 3090 or A6000 and expand later to 2 or 3. Since you have mentioned that you have experience with Threadripper systems, I wanted to ask whether going for a Threadripper 3960X or 3970X is better, or going for a Threadripper Pro 3975WX. My concern is that Threadrippers are limited to 256GB of memory while the Pro supports up to 2TB. I am looking for this PC to last at least the next 8-10 years, so I want to do an incremental build: going for 128GB now and increasing it as my computing needs grow. Your thoughts on this would be golden advice. Thanks in advance.
Tim Dettmers says
The question is: what do you need that much RAM for? It can be useful to offload memory from the GPU, but generally, with PCIe 4.0, that is too slow to be very useful in many cases. A large memory can be useful if you use information retrieval algorithms/frameworks like FAISS, but other than that I think you do not need very large RAM. 256 GB is a lot for 2-3 GPUs. I would go with the Threadripper with the best cost/performance in terms of raw compute, or even just the cheapest one.
james says
Will you check whether the new “LHR” RTX cards differ from non-LHR models?
Jacky says
It is an amazing post. Thanks for your sharing.
You use two models (CNN, transformer) to compare the performance of multiple GPUs, but I am still confused about what CNN means here. ResNet-50? Mask R-CNN? YOLO? Or some other CNN model?
Thanks!
Tim Dettmers says
This is mostly ResNet-50 since it is the most common benchmark.
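For readers who want a rough version of this kind of benchmark themselves, a minimal timing loop looks something like the sketch below. PyTorch, synthetic data, FP16 autocast, and the batch size/iteration counts are my own illustrative choices, not the exact settings behind the charts.

```python
# Rough ResNet-50 training-step throughput sketch with synthetic data.
# Batch size and iteration counts are illustrative, not the chart settings.
import time

import torch
from torchvision import models

device = torch.device("cuda")
model = models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 3, 224, 224, device=device)
y = torch.randint(0, 1000, (64,), device=device)

def step():
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():        # 16-bit compute on Tensor Cores
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

for _ in range(10):    # warmup so startup overhead is not measured
    step()
torch.cuda.synchronize()

t0 = time.time()
for _ in range(100):
    step()
torch.cuda.synchronize()   # wait for the GPU before stopping the clock
print(f"{64 * 100 / (time.time() - t0):.0f} images/sec")
```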
Seungchan Lee says
Hi Tim,
Thanks for the article – this is so great.
One quick question: have you looked into Graphcore IPUs at all? It looks like their exchange memory has a much higher bandwidth (180TB/s) than the A100 (2TB/s), which means it could meaningfully outperform the A100, based on your comment regarding main memory bandwidth being the most important bottleneck?
Any thoughts on this?
Thanks!
Tim Dettmers says
The problem is that the exchange memory is very small (MBs) compared to the GPU memory (GBs). The new A100 GPUs also have fast memory that works at roughly 200 TB/s but only a couple dozen MB of it. This memory alone does not make for a fast processor and one would need a special kind of neural network to take full advantage of IPUs. Normal networks like transformers and CNNs are likely not any faster since these were designed for GPUs/TPUs.
Jeff says
Which barebones Supermicro system should I pick up? I’m OK with the RTX 3090 approach. It’s for a prototype / research, but I want to start off on the lighter end and work up as needed. I don’t see needing any more than 4 GPUs at max, and I assume I can pick up a 2U 4x GPU barebones but only put in one RTX GPU, 1 Threadripper, 1 SSD, and some minimum amount of RAM to start. If Supermicro makes this, it’s not clear which one would work best.
Tim Dettmers says
It actually sounds like you would benefit a lot from a desktop computer that you stock up as needed. If I were to get a Supermicro system, I would invest in an 8-GPU system and grow it slowly. One disadvantage is that the warranty and support will often expire in about 3 years, so this advantage of a server system would be nullified. It is also a waste if you only need 4 GPUs. The markup on 4-GPU Supermicro systems is pretty high and unnecessary since desktops work excellently for 4 GPUs.
JJ says
Hi
Really nice in-depth article, thanks.
But I don’t understand how you can say that Tensor Cores are waiting 70% of the time and then say that NVLink only matters for many, many GPUs.
If a daughterboard with 8x A100 has NVLink, I guess the Tensor Cores would be busy a much larger percentage of the time, right?
regards
Tim Dettmers says
There is a difference between memory transfer between GPUs and within a GPU. Tensor cores are limited mainly by the memory access time within a GPU. The main memory is too slow to fully utilize the tensor cores even under perfect conditions (other than trivial in-place computation).
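A rough back-of-the-envelope calculation illustrates the point. The peak figures below are the commonly quoted, approximate A100 specifications, not measurements:

```python
# Back-of-the-envelope: FLOPs per byte needed to keep A100 tensor cores busy.
# Peak numbers are approximate, commonly quoted A100 figures.
peak_fp16_tensor_flops = 312e12   # ~312 TFLOPS FP16 tensor-core peak
hbm_bandwidth_bytes = 2.0e12      # ~2 TB/s HBM2e bandwidth (80GB A100)

arithmetic_intensity_needed = peak_fp16_tensor_flops / hbm_bandwidth_bytes
print(f"~{arithmetic_intensity_needed:.0f} FLOPs per byte loaded from main memory")
# Any operation that performs fewer FLOPs per byte than this is memory-bound,
# so the tensor cores sit idle waiting on HBM rather than computing.
```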
Leigh says
Great article. I learned a lot.
You might want to be careful about liability with your advice.
You might edit this:
“Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.”
The “unless” doesn’t really pose an exception to the clause it is attached to, but rather to the sentence before it. Kind of confusing.
George P says
Hi Tim,
I am looking for a GPU that I can use to learn about deep learning in a single-GPU system.
I am weighing up an RTX 3060 against an RTX 3070, which you recommended in September last year.
The 3060 has 12GB of memory vs 8GB for the 3070.
The 3060 has a 192-bit bus with 112 Tensor Cores vs a 256-bit bus with 184 Tensor Cores.
Given that the 3060 has been released since you last wrote about the 3070, could you please be kind enough to offer an opinion on how this new card might fit into your suggested “pecking order”?
Regards,
George
Tim Dettmers says
The RTX 3060 is a bit slower but it’s easier to work with because it has a larger memory. If you just want to run things without any trouble I would recommend the RTX 3060. If you are fine with writing extra code to fit models into 8 GB then go with the RTX 3070.
Sadra says
Hi, Tim.
Thanks for taking the time to share this with us. I am going to build a system with 4 GPUs. It is hard to choose between the A100 and the A6000. What do you think works better for computer vision? We probably won’t be able to update the system for maybe 3 years.
Tim Dettmers says
The A100 is clearly better, but also more expensive. If money is not a problem, I would go with the A100. If you look at cost/performance, these two actually come pretty close, so either GPU is fine.
Pan inkzs says
HI Tim,
First, sorry for my bad English. Your essay has benefited me a lot, but I have some questions.
I am preparing to study for a master’s degree, and I want to research computer vision and image recognition/processing with CNNs and GANs. Is studying GAN papers a good way to get started, or should I do image competitions on Kaggle?
Also, I want to buy a new PC, but the 3070 is out of stock. Should I wait for it or just buy a 3060?
Do GANs need 12GB of VRAM, or is the 3070’s 8GB enough (but faster)?
Thanks for your answer!
Tim Dettmers says
Hi Pan, for GANs I would try to find a GPU with more memory. So the RTX 3060 is great. If you want to do research, I would recommend that you try to find another researcher to work with from whom you can learn how to do research.
Charley says
This is awesome! Thank you so much for this. Does it have to be a card made by NVIDIA? Or could it be like an Asus RTX 3080, or any of the other manufacturers?
Tim Dettmers says
You can buy GPUs from any manufacturer. They are the same as the ones from NVIDIA; they just have a different fan/branding.
KV says
Hi there
Is it possible to use two cards concurrently, one for each purpose? Sort of, one for gaming and the other for DL? Just thinking out loud, basically.
Cheers
Roger says
Thanks for this amazing article Tim! It has been useful for multiple years now! We have budget for a multi-GPU machine. We work with large 3D images and so are considering the 3090 for its higher memory. Do you know if a machine with 3x 3090 blower-edition GPUs can have any power/heating issues? Would you recommend it?
Thank you
Vytautas says
Great article!
I am researching the best budget AI reinforcement learning hardware combination for a laptop. After some research and reading this article, I basically ended up with two choices: either an RTX 2060 (6GB) with an AMD Ryzen 9 4900H (8 cores), or an RTX 2070 (8GB) with an Intel Core i7-10750H (6 cores).
From the article I see that more GPU memory is definitely a good thing; however, for reinforcement learning you mentioned that the more CPU cores the better. So I wonder whether, in this case, more CPU cores and less GPU memory outweigh fewer CPU cores and more GPU memory?
Tim Dettmers says
I think there will not be a great difference between those CPUs, either one is fine and you should see no large differences in performance for RL.
zine says
Great article. Question: if one’s focus is on real-time time-series prediction and anomaly detection, and one is looking to create models on a laptop and deploy them on a server, will a GTX 1650 be enough to test with (including Tensor Core code, though there are no Tensor Cores on the 1650-Q), or should one aim for an RTX 3000-Q or RTX 2070-Q as a minimum to test on a laptop? The server has 2080s. I am a noob, but I want something that I can carry with me, develop on easily, and seamlessly transfer to the server.
Tim Dettmers says
It depends highly on the nature of those datasets and the complexity of those models. A GTX 1650 will suffice for many kinds of models if the data has ~100 variables and 1M datapoints. Beyond that, you might need a larger GPU with more memory.
Bruno Kemmer says
Hi Tim,
First, thank you for your posts, they are very instructive!
Do you know if the limitations NVIDIA imposed on the RTX 3060, limiting its hash rate, could affect its performance for CV applications?
I am planning to start researching GANs. I am waiting for the 3060 Ti with 12GB; hopefully it will not be too overpriced.
Finally, could running a single GPU for a considerable time (I imagine a week) create heating problems (in a ventilated case and room)?
Do you have recommendations for water coolers?
Thank you,
Bruno
Daniel Danaie says
Hello Tim,
Awesome articles! I am trying to build an autonomous drone, and the onboard computer will be an Xavier NX. I am pretty sure that can handle inference for the system, but I am trying to figure out which GPU I should train my prototype on. I will start with the CV aspect (asking the drone to find free paths, recognize colors, etc.), move towards RL (asking the drone to maximize packages delivered, etc.), and, as my last priority, NLP (allowing the drone to talk with people). This progression might take 3-4 years. I would love to hear how these fields differ as it relates to choosing hardware. Also, given that I will start training the drone in mid-to-late 2021, what would you suggest, considering my budget for the GPU & CPU is around 2000 USD and that I am creating this drone as a startup?
Thanks and sorry for my numerous questions,
Daniel
Joshua Brown Kramer says
This blog is the best source of information about how to build a machine learning box; however, it has what I see as a crucial problem. I’ve said as much before in these comments, but to put a fine point on it: the performance-per-dollar calculations are just wrong.
In particular, the supposed leader in that category is the 3080. The problem with that status is that it appears to depend largely on the MSRP of $800. But I have signed up for several services that alert me to the availability of this card at this price, and for months I have not been able to get it. The market price of this card is more like $1400. The MSRP is essentially meaningless. When compared to the 2080 Ti, which is available for around $1000, and using your own performance comparisons, the 2080 Ti beats the 3080 on performance per dollar.
Jason says
It’s currently a very bad time to build a deep learning machine. Prices are hugely inflated (as you’ve seen). The cost estimates seem fair in a normal market. Sadly, this is not a normal market.
Mohannad Barakat says
Hi Tim Dettmers,
I’m Mohannad, a computer engineering undergraduate student. Currently I’m working on my graduation project. My team and I have succeeded in securing $10K of AWS funding, which we plan to use to train our models. We are working on a text-to-audio task (WaveNet, WaveGlow, Tacotron 2, and a voice verifier) and an audio-to-text task (wav2vec and wav2letter). We use LibriSpeech for both (about 1000 hours).
Also, in case we use up all of the funding, we will be allowed to use the university labs (2x 2080).
Currently we are stuck deciding on the type of EC2 instance. Our options are K80, V100, and A100 (p2, p3, and p4 instance types). We are also not able to guess how many GPUs of the chosen type we need.
Thanks a lot
Nate Liu says
Hi Tim,
Thanks for all these details!
Currently, I’m working on computer vision, particularly generative models.
I have been thinking about getting my own 4x GPU workstation, and the company called “Lambda” seems to have some prebuilt stations. Do you think it's better to build my own station than to buy one from the company? And what GPUs do you recommend for vision in general now?
Tuomo Kalliokoski says
Could you update the charts with current prices, as the distribution problems for the RTX 30 series are predicted to last until the end of Q2 this year? Also, could you give your opinion on the RTX 3060 Ti?
Dave says
Hi Tim,
Thanks for all the information, I really appreciate it!
I managed to get two 3080 cards, one MSI, one Gigabyte; the only difference seems to be that one is 1740 MHz and the other is 1800 MHz. I do my current learning on a dual-socket Intel server that only has one PCIe x16 slot and 3 x8 slots. I do Kaggle-level work.
My 2 questions are:
* Does it matter that my two 3080s are not identical?
* Do I need to buy a new motherboard and CPU, or can I get an x8-to-x16 PCIe extender and use that for one of the GPUs without really sacrificing performance?
Happy to provide more info if needed to answer. I don’t want to waste your time with extraneous words.
Thanks!
Angel Genchev says
I tried running ResNet-50 (training) on a 6GB 1660 Ti and it fails to allocate enough CUDA memory. So the problem with insufficient video memory is real. I began to think about what I could do and came to the idea of using AMD ROCm on their APUs.
This could potentially allow me to allocate “video” memory as long as there is RAM on the system. For example, on a 32GB system it might be possible to allocate at least 16GB for the GPU. Slower training is preferable to impossible training 🙂
What do you think ?
Goce says
Tim,
Please allow this machine learning inexperienced cheapo to ask your opinion about a GPU choice.
I want to try experimenting with language models such as BERT, GPT etc. The goal is to create some software that will provide suggestions for a certain type of textual work. It’s still a vague idea at this point and not my first priority, but from what I tried so far on google it just might work well.
My budget at this point is kinda limited, so I am considering: a K80 24GB ($300), a 1080 Ti ($400), or for a slightly higher budget possibly a 2080 Ti ($600) or an M6000 24GB ($800). Prices are local for my country.
I think I need 24GB so I can run the biggest models possible (BERT, GPT-2 1.5B), but I am not quite sure. I suppose Megatron-LM wouldn’t fit in 24GB. So I am not sure if I even need 24GB or whether 11GB would be good enough for now. What do you think?
X says
Hi Tim,
Thank you for this great article! Would you recommend the RTX 3060 based on its VRAM capacity (12GB) and price ($329 MSRP)? I may be a little bit too eager since it hasn’t hit the market yet…
Konrad Gnoinski says
Don’t buy the 3060; it has a 192-bit bus. You do want 256-bit; more in the article above 😉
Michel Rathé says
Hi Tim,
I’m on a waiting list for an Asus RTX 3090 for my next workstation.
I’m considering the new AMD 3975WX (Zen 3) with 256GB of ECC RAM.
My problem is the shortage of the 3090. If I’m lucky enough to get one, what if I want a second one?
I’m developing in MATLAB and want expandability, power, and future-proofing.
What if I go for the A6000 instead? I can get one for $6,200 CAD, roughly 20% more than 2x 3090. I’d have 48GB of VRAM in one card and no problem getting one.
More technically, are there downsides to having one bigger card for training?
Can I run multiple training tasks on that GPU?
Can deep learning benefit from larger batch sizes on such a card?
If the AMD 3975WX is too expensive (no prices yet) compared to the next Zen 3 (Genesis Peak) 3970X, would the 3970X (with an A6000) still be a beast and reliable? Is the 8-channel ECC memory of the AMD Pro a real deal breaker, or are there no real advantages?
The A6000 is tempting because it takes only 16 lanes and 2 or 3 slots (compared to 2x 3090), and no NVLink is needed to get 48GB as with 2x 3090.
What are your thoughts?
Thanks again,
Michel
Angel Genchev says
It depends. If you train a triplet loss with online hard-negative mining, then large batches are a must. Actually, I hit the wall with 6GB of VRAM; I need at least 16 or 24. So having a small VRAM can make it impossible to train certain networks (I fell back to the CPU).
Your downside will be the higher MSRP cost vs GPU speed. On a 1660, a batch of 64 was enough to saturate the GPU (inference), and no further speedup with bigger batches was observed. If your networks fit in a 3090, then 2x 3090 might be faster than one RTX 6000. “Multiple training tasks”: you do not want to do this. One training task usually maxes out the GPU and asks for even more.
Vatsal Desai says
Hi Tim,
I am building a deep learning machine for the first time, for my use as a freelancer/consultant/startup in AI.
I do not want to limit myself to DL and would explore reinforcement learning as well; I do not wish to buy multiple GPUs or machines due to budget constraints. Also, I may explore participating in Kaggle competitions in the future.
So I plan to go with a single RTX 3090 24GB at present, with any upgrade not coming in the next 1-2 years for sure. What would be a proper combination of CPU, motherboard, RAM, hard disk, power supply, and cooling system?
I watched your recent interview on YouTube (Chai Time Data Science channel) as well. I am located in India, and here there is not a good market to buy/sell used cards at present.
Please advise.
Thank you
Jun Wang says
Hi, is the Threadripper 2950X enough, or should I go with 3rd gen?
Rene Munsch says
Hey @all,
just one question: I have this motherboard:
https://www.biostar.com.tw/app/en/mb/introduction.php?S_ID=886
As you can see, it offers 8 PCIe slots (1x16 and 7x1).
Is it possible to use it for a low-budget setup, and how big is the loss compared to “normal” x16 slots in real life?
Ben M says
Hi Tim, awesome article – thanks for sharing your wisdom with the community. I’m hoping to pair two RTX 3090 cards on a Gigabyte TRX40 Designware motherboard (https://www.anandtech.com/show/15121/the-amd-trx40-motherboard-overview-/11), and am considering two cases to house this beast:
(1) CORSAIR OBSIDIAN SERIES™ 750D FULL TOWER CASE (https://www.legitreviews.com/corsair-obsidian-750d-full-tower-case-review_126122)
(2) FRACTAL DESIGN DEFINE 7 XL (https://www.legitreviews.com/fractal-design-define-7-xl-case-review_217535)
Any advice on whether these cases would be good choices? Would I even be able to fit two of these cards in the cases? Thanks for any pointers.
Mehran says
Hi,
I know it’s hard these days, with a new GPU launch every month, but would you please revisit this guide for the new 3060 with 12GB of VRAM? I think for ML it’s far more valuable than the 3070.
keith says
Any thoughts on the new 3060 12GB release? It gets past that 11GB barrier mentioned for transformer models. Also, at nominally $400 you could run 4x for less than a 3090; I'm curious how that could affect the cost/performance metrics.
Sam says
Hi, thanks so much for your incredible work.
I am planning to build a studying/prototyping machine for NLP tasks, such as fine-tuning transformers like BERT-base. I’m on a pretty tight budget, and as far as I understand, VRAM is essential in my case. I am considering a 3060 Ti/3070 or the new 3060 with 12GB. Is it worth waiting for the 3060, or will a 3060 Ti plus memory-saving techniques be sufficient for now, or maybe I should consider something else?
IanK says
Hi, can you post your updates after CES 2021?
I want to buy a new NVIDIA GPU since I only have a 6GB 1060 and would like around 11GB of memory. I have a speech model on a server that uses around 11GB (on a 2013 K40 now, which is slow compared to current Ampere (no Tensor Cores?); also, AMD ROCm has no easy support, and AMD is obliterated by NVIDIA in deep learning). Of course, I can squeeze the model size a little bit.
So I was thinking about getting a 3070 Ti 12GB (?), or rather the RTX 3060 12GB that was confirmed a day ago, but I don’t want only a 192-bit memory bus; I would prefer 256-bit if it increases the performance of GPU training and inference significantly.
Can you update your article on how the memory bus affects GPU performance in deep learning (I can’t find info anywhere on how important it is)? Is the memory bus important with a huge VRAM size in deep learning? I’m puzzled.
Do you think the 3060 12GB is a fair purchase vs the lower-VRAM but wider-memory-bus 3060 Ti?
What is your opinion?
Thank you for your response and be well.
Jun says
Hi, thanks for the detailed explanation on all aspects. Which threadripper would you recommend tho?
Mary says
Hi Tim,
Thank you for all of the information you have provided; it was very informative for me. I was wondering if you could help me. I work at a small company, and for now we are just two data scientists; at most we will grow to 7-8 data scientists in 2 years. We are going to train some deep learning models for image processing. Can you tell me what kind of GPU we should get for a server for our company?
Drake says
Hi Tim,
I was using a machine with an Intel i5, an NVIDIA 1080, and 32GB of RAM for about a year to a year and a half, but it couldn’t keep up with some of the models I was trying to use.
I recently built a new rig for hobbyist ML. The new rig uses an AMD 2950 Threadripper and 64GB of RAM, but I have a holdover graphics card until I can get my hands on a 30 series. My question is: I have about $2k reserved for the graphics card right now. Which do you think would make more sense: 4x 3070s, 2x 3080s, or 1x 3090?
I have academia research experience with NLP and Computer Vision, but I also have little exposure to many other subfields and would like to experiment in other fields.
I have a Lian Li PC-O11 Air ATX Full Tower Case so I’m not really worried about space, my motherboard has 6 PCIE slots.
Thank you!
Arun Balajiee says
Hi Tim
Really great effort in putting together all the salient features for researchers in Deep Learning, especially useful for students like me
I put together a few parts based on the comments on this blog post and also from what I understood from the article here – https://pcpartpicker.com/list/yTr9RT
Do you think this would suffice for small-scale AI and NLP tasks? It would also be helpful if you could recommend some cost-cutting measures that I could apply to this list.
Marc Brian says
For researching/optimising hyperparameters on an LSTM (i.e., configure, train, test, then repeat) for a single company stock, I’ve been getting on fine for a few months with a 2080 Super. But now I am also testing for a “best stock”, which means iterating over all hyperparameters across 200 companies, and it's now taking a week to get a clean run through (I can pause and resume). Would it be better for me to switch to the cloud or get a 3090? I don’t mind waiting 48 hours, but will this just burn out any card I get in less than a year anyway?
Tim Dettmers says
I think in your case you want to have many small GPUs, at least if your models are not too big. For that, using cloud computing can be very efficient. I would not worry about the lifetime of the GPU; it should be fine even if it runs all the time. You should only worry if the temperature is at 85°C or higher for a very long time.
Gosuto says
You claim that it is better to avoid the RTX Titan models due to their temperature issues.
Do you expect the same for the to-be-released 3080 Ti? Because other than that, the VRAM upgrade from 10GB to 20GB and doubling of all types of cores seems worth the wait.
Or is the temperature issue a sign of limits being reached and Nvidia just trying to get away with it?
Tim Dettmers says
The main issue with the RTX Titan cooling was its fan design which was great for a single GPU but terrible for multiple. That has been fixed with the RTX 30 series and the 3080 Ti should be just fine.
Sovit Ranjan Rath says
Hello Tim. Great article. It is helping me a lot in deciding my next GPU and machine for doing deep learning. I have one question that I want to ask, as there is not much information out there for such situations. First of all, I am serious about deep learning. I am a deep learning blog author (https://debuggercafe.com/). I regularly take on personalized medium-scale deep learning projects, and I also want to increase the standard and quality of my deep learning blog posts. Also, traveling between jobs for the next two years is almost mandatory for me. Currently, I do everything on an MSI laptop (i7 + 6GB GTX 1060) and some occasional work on Colab. I want to upgrade to a good laptop, preferably i7 + 2080 Super Max-Q. I know that Max-Q variants keep the temperature hovering around 75-80 degrees Celsius for hours on end. Still, I am a bit skeptical. I really want to get your opinion on this. Hoping to hear from you.
Tim Dettmers says
Hello Sovit, I think the 2080 Max-Q is a great GPU and the high temperature should be fine (unless you often use your laptop on your lap, which can become uncomfortable over time). I think it is a great option in your case. Another option might be a dedicated cloud GPU. However, in the case of a cloud GPU you will need stable internet access, which is not always readily available when traveling. On the other hand, a cloud GPU is more powerful and allows you to always use the newest GPUs.
Sovit Ranjan Rath says
Thank you so much Tim. As you are providing two options, I think I will go for the Max Q machine for the next two years. It will help me in a lot of ways and will also provide me with the flexibility to take on my projects on my own time. Thanks again for your feedback. Your articles are really awesome and help me a lot.
Brian Arbuckle says
Hi Tim,
Fantastic article. Is it correct to say that buying two GPUs, say 3080s with 10GB of video RAM each, does not “double” the RAM when it comes to deep learning? I am pricing a system for BERT as well as audio/video training. I know in both examples I need a lot of GPU RAM. I assumed the 30GB of GPU RAM from buying three 3080s is not effectively greater than 24GB of GPU RAM, as each card would utilize its own RAM, though maybe I am wrong?
Happy new year!
Thanks,
Brian
Tim Dettmers says
Hi Brian,
this is exactly right for data parallelism. With model parallelism (or the even rarer pipeline parallelism) you can however spread the model across GPUs and save more memory. However, software support for this is not readily available yet but should become common in the next 6 months. So if you need lots of memory, a single 24 GB GPU will currently serve you better and will still serve you well in a couple of months.
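For readers wondering what "spreading the model across GPUs" looks like in practice, here is a minimal, naive model-parallel sketch in PyTorch. The two-way split and the layer sizes are illustrative assumptions; this is a manual hack, not the more efficient pipeline-parallel software referred to above, and it needs two CUDA devices to run.

```python
# Naive model parallelism: half the model lives on each GPU and the
# activations (not the parameters) are moved between devices.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(4096, 8192), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(8192, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))  # move activations across GPUs
        return x

model = TwoGPUModel()
out = model(torch.randn(32, 4096))
loss = out.sum()
loss.backward()  # autograd handles the cross-device backward pass
```

The memory saving comes from each GPU holding only its own parameters and optimizer state; the cost is that one GPU idles while the other computes, which is exactly what pipeline parallelism tries to fix.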
Roman Stehling says
Hi Tim,
Thanks for sharing all your detailed insights! Amazing learning experience for me.
I am working on Deep Reinforcement Learning for trading in the financial markets. My understanding is that the deep learning part in this context would be for finding out the optimal policy for the RL. Would that rather imply a smaller training dataset size like 10GB (3080) or a bigger one like 24GB (3090)?
Thank you!
Cheers,
Roman
Tim Dettmers says
Hi Roman, for RL the biggest problem is usually CPU performance rather than GPU memory. As such, I would go for the 3080 and invest the extra money in a CPU with more cores and more power.
Roman Stehling says
Awesome! Thank you, Tim!
Just returned the Ryzen 9 5900x and got the Threadripper 3970x with 32 cores and 64 threads! 🙂
And I am keeping the RTX 3080.
Albert says
Hi Tim, thanks for the always updated guide!
Deep Learning so far has been a hobby for me, and I am focusing more and more on CV, by doing MOOCs and reading books.
Knowing that's my field of interest, should I go for something like an RTX 3060 Ti (MSRP in the EU should be around 400 euros, when and IF available) and keep it through 2021 as I get better and better, or should I aim from day 1 to get an RTX 3080 (Ti)?
Tim Dettmers says
Hi Albert, I think the RTX 3060 Ti would be a very prudent choice for the first GPU. I think it is better to start small and, depending on how you like it, upgrade later to something larger or, if needed, to something really big.
swiss ml dude says
Hello Tim,
Thanks a lot for all the information on your website. I have a question: I just started a master's program in computer science and I would love to continue afterwards to grad school in deep learning, especially in RL. My question is hardware related: I have a desktop computer I built a few years ago, but the GPU (a Zotac 1060 6GB) is a bottleneck now when I try to train big models. Hence, I believe the 3080 is a good fit. I wonder what to do with the old graphics card. Is it a good idea to use it for the system display (since it eats up some memory), so that the new 3080 would be fully available for models? Also, I am afraid the two GPUs would be stacked together too closely on the motherboard. Indeed, I have an MSI B150 Gaming M3 motherboard and the two GPU slots are really close to each other. Would it be a good idea to invest in water cooling or another motherboard? The blower of the old card would be just above the new one in the current layout.
Tim Dettmers says
Hello Justin,
I would not worry too much about cooling. You do not lose much if you run the display on the RTX 3080, but it can help to power the displays with the GTX 1060. I think cooling will be just fine, but what I would do is: test with the GTX 1060 driving the displays and monitor cooling; if cooling is not good, remove the GTX 1060, sell it online, and power the displays with the RTX 3080.
Eric says
The article makes it very clear how to select a GPU card. Thanks.
Di Lai says
Hi, Tim
I have a 2x RTX 3090 machine with air cooling (not the blower type; there is just one PCIe slot between the GPUs). I observed that when fully loaded, the temperature of the GPUs is about 81C and the fan noise is strong. Is this normal? Currently my training runs are short, but I am not sure whether, when I train some very large model, this temperature could last hours or even days without causing any thermal issues or damage to the GPUs.
Thanks a lot!
Di
edi says
Hi Tim, I have a EVGA 3090 FTW3 Ultra Hybrid (https://www.evga.com/products/product.aspx?pn=24G-P5-3988-KR) and a EVGA 3090 XC3 Ultra Hybrid (https://www.evga.com/products/product.aspx?pn=24G-P5-3978-KR). I want to build a Deep Learning workstation using those two GPUs. However, they are a little different although both of them are 3090. Will there be any compatibility issue (or will one of them be bottleneck) because of the differences between two cards? If so, I will refund one of them and get a new one.
Thank you!
Tim Dettmers says
Hi edi, both cards should work just fine together.
Rick Albright says
I have a Lenovo P1 Gen 2 with an Intel Core i7-9850H Kaby Lake processor and 64GB of RAM. I am considering purchasing a Razer Core X Thunderbolt 3 enclosure. What card would you recommend under $1000? Will I have any performance issues using Thunderbolt 3 such that I shouldn't bother with the RTX 3070 or 3080? I plan on doing some NLP deep learning models. I’ve done some toy projects on the internal Quadro P2000 card, but its 4GB of memory just doesn’t cut it. Would I be OK with an 8GB card, or should I spend a little extra for a 10GB card?
Asbjørn Berge says
Hi Tim! Thanks for the excellent and thoroughly researched post. We’re doing research on DL on point clouds and memory is a huge issue. Currently we’re running a single Titan RTX as it was a 24GB card with abundant supply. We’ve tried – but failed, so far – to split the training on multiple GPUs (using NVLink) – so now we’re looking to buy a 40GB+ card.
I’m still a bit unsure whether it makes more sense to grab an RTX 8000 or a PCIe A100. The latter, I guess, will have some serious cooling issues in our typical GPU boxes (we’re mostly building with the Corsair Carbide Cube 540). But the performance looks so much better (on paper...), and the RTX 8000 does not look like much of an “upgrade” over the Titan RTX.
Any brief thoughts that will nudge me in the right direction? Thanks in advance for any input!
Tim Dettmers says
Hi Asbjørn,
I would go for the A100 and use power limiting if you run into cooling issues. It is just the better card all around, and the experience of making it work in a build will pay off in the coming years. Also make sure that you exhaust all kinds of memory tricks to save memory, such as gradient checkpointing, 16-bit compute, reversible residual connections, gradient accumulation, and others. This can often help to quarter the memory footprint at minimal runtime performance loss.
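As a concrete illustration of two of these tricks, here is a minimal PyTorch sketch combining 16-bit compute (autocast) with gradient accumulation; gradient checkpointing could be layered on top with torch.utils.checkpoint. The model, data, and hyperparameters are placeholders, not a recommendation for any particular workload.

```python
# Minimal sketch: mixed-precision training with gradient accumulation.
# Model, data, and hyperparameters are placeholders.
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                            torch.nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

for step in range(100):
    x = torch.randn(8, 1024, device="cuda")        # small micro-batch
    y = torch.randint(0, 10, (8,), device="cuda")
    with torch.cuda.amp.autocast():                # 16-bit compute
        loss = criterion(model(x), y) / accum_steps
    scaler.scale(loss).backward()                  # gradients accumulate
    if (step + 1) % accum_steps == 0:              # optimizer step every 4 micro-batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```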
dvk says
@Tim Dettmers,
Since it's so hard to get a new GPU card of your choice: can pairing a 3060 Ti with an RTX 3080 or 3090 in the future, or pairing a 2080 Ti with any RTX card (3060 Ti, 3070, or 3080), give us the advantages of pooling resources in a multi-GPU PC for deep learning?
Tim Dettmers says
It can work, but the details can be complicated. Some combinations of GPUs work better than others. In general, the smaller the gap between GPU performance, the better.
Protim says
Hi Tim,
I have been following this blog and post, including the full hardware guide, for some time now. I wanted to ask if 2x 1060s would do for an intermediate researcher/senior PhD student. I am thinking of a 3950X build.
Tim Dettmers says
Hi Protim,
for an intermediate researcher or senior PhD student, I would recommend a faster/larger GPU. I would recommend a minimum of an RTX 2060 (or two in your case). The reason for this is both memory and speed that is required to build on the most recent models (or to apply them to a different academic field).
Bill Smith says
Thanks for the post!
I was looking into the NVIDIA Jetson AGX Xavier. Would that have better performance for training deep learning algorithms than the 3080, especially with its 32GB of RAM? The power consumption of only 30 watts and the 8-core ARM chip are also attractive. Thanks!
Tim Dettmers says
The Jetson GPUs are rather slow. I would only recommend them for robotics applications or if you really need a very low power solution.
Alex says
I use a Jetson AGX Xavier for training. I bought it since I didn’t have money for a PC+GPU. I compared it to a GTX 1080 (a simple conv net on MNIST). At equal batch sizes the Jetson is a bit slower, but I fed it a larger batch than the GTX 1080 and trained an epoch on MNIST faster.
In general, I wouldn’t recommend the Jetson for training. It’s based on the ARM aarch64 architecture, and it can be a pain in the ass to build the needed libraries on ARM. Moreover, PyTorch only supports the Intel MKL library for math, and sometimes you can’t even run code on the Jetson.
https://github.com/pytorch/pytorch/issues/31598
Tim Dettmers says
Thanks for leaving a comment about your experience, Alex! I agree, from my experience, the main problem with Jetson is the ARM CPU which makes it a pain to install the libraries that you need to run stuff.
Gonzalo says
Hi Tim,
thanks for this excellent and up-to-date appraisal of GPUs in deep learning.
Is there any difference, advantage or disadvantage, of AMD vs Intel processors in terms of deep learning software? I saw that you seem to prefer AMD Threadripper in terms of hardware, but what about its impact on the software libraries, etc., available for deep learning?
Thanks
Tim Dettmers says
In general, the most CPU intensive deep learning tasks are data preprocessing. Both CPUs usually do just as well for these tasks. If you run also some heavy CPU code, for example, if you do Kaggle competitions, an Intel CPU will yield better performance.
David Haase says
Hi, thank you for your article. Will you update the article to include the RTX 3060 Ti that will be released tomorrow (02.12.2020)? It has the same amount of VRAM and might be a good alternative to the RTX 3070. What do you think?
Greetings.
Tim Dettmers says
I will probably need some time until I update this blog post again. Maybe in a month or so.
Marc says
Thank you for this great post! I am curious if you have any thoughts about where the soon-to-be-released A6000 fits in? It seems to combine the memory heft of the RTX 6000/8000 with the Ampere architecture of the RTX 3090, without the power or design issues. Based on the naming convention, it seems aligned with the RTX 6000/8000, which I understand are more graphics-oriented than deep learning-oriented. Where do you expect this to fall, in particular when compared to the RTX 3090?
Tim Dettmers says
I would recommend the A6000 for data centers where it is not allowed to use RTX 3090.
Animesh Roy says
Hi Tim,
I have been following your blog post for GPU recommendations since last year. I am not a CS student; this is just a hobby for me, so I want to spend as little as I can. I want a card with 24GB of memory. Which one would you pick, a Tesla M40 or a K80? I want to learn image recognition.
Tim Dettmers says
Hi Animesh,
I would go for the M40 as it is a bit faster.
Satchel says
Up until now I’ve been using a GTX 1070 as my only card (previously focusing on small-scale vision applications). Now I’ve begun using more memory-hungry tasks (GANs specifically) but also DRL. My question is: would you suggest selling my current GTX 1070 and buying an RTX 2070 / RTX 3070, or perhaps just adding a second GTX 1070 to my setup?
The way I see it: a second 1070 -> $300, approx. double the performance and memory
RTX 2070 -> $300 (after selling the 1070), substantially better performance, same memory
RTX 3070 -> $500 (best performance, same memory) (assuming I can get it at retail pricing)
Prices are approximate and in CAD. My PSU is adequate for two 1070 cards.
Tim Dettmers says
I would probably wait for RTX 3070 Ti cards which will have more memory. It should be cheap enough and give you a bit more memory (10GB).
Will T. says
Amazing blog post! I have referred back here many times.
I see in another comment you recommended the 3080 over the 6800 XT since Nvidia has tensor cores – but what do you think about the fact that the 6800 XT has quite a bit more memory than the 3080? It also draws slightly less power, and has a lower sticker price, which are both appealing on the cost front.
Do you think that we will see some ROCm benchmarks soon on stuff like ResNet/BERT? Do you really expect the performance discrepancy between the cards to be significant enough to offset the benefits of the extra memory?
Thanks again for the great post!
Will T. says
Also, do you know if AMD will be more compelling in terms of multi-GPU setups? I am thinking about 2x 3080 vs 2x 6800 XT. I see that the CPU-GPU memory transfer speeds are slightly lower for AMD cards, but I am curious if Smart Access Memory will change anything on this front.
Tim Dettmers says
If you use PCIe as an interface (that is what you would use in 95% of cases), both should be similar. However, not all libraries support ROCm GPUs and have equivalents to the NVIDIA libraries for parallelism. NVIDIA GPU RDMA is, for example, a technology that only supports Mellanox cards and NVIDIA GPUs. NVIDIA has a dedicated library that uses it and has optimized GPU-to-GPU memory transfers for multi-node GPU clusters. I do not think AMD will catch up in cross-node communication for some time.
Tim Dettmers says
The problem with the RX 6800 XT might be that you are not able to use it in the first place. There was a thread on GitHub in the ROCm repository where developers said that non-workstation GPUs were never really considered for running ROCm. This might change in the future, but it seems it is not straightforward to use these new GPUs right out of the box.
Otherwise, the memory on the RX 6800 XT does help, but it does not have Tensor Cores and will be much slower than NVIDIA GPUs. So it is mostly a tradeoff between speed and memory (if ROCm works). For BERT/ResNet I can easily see the RX 6800 XT being half the speed of an RTX 3080.
Chanhyuk Jung says
Apple recently released a TensorFlow fork with hardware acceleration for Macs. An installation script is provided, and AMD GPUs, Intel's integrated GPUs, and eGPUs on Macs seem to work well.
Since Macs are popular among programmers, I think it's great for testing models on your laptop before training on a dedicated server or workstation. Apple's new M1 chip shares RAM with the GPU, so you can run large models, and hopefully with a faster GPU on the upcoming MacBook Pros maybe even prototyping could be possible.
Do you think this is Apple trying to be competitive in deep learning, or is it just adding support because it can?
Tim Dettmers says
It is just adding support, it seems. The Apple M1 processor is not powerful enough to train neural networks but will be very useful to prototype neural networks that you will deploy on iPhones. As such, it is an excellent processor to work with deep learning on iPhones, but you would probably train neural networks on GPU servers and transfer the weights to your MacBook to do further prototyping on the already trained network.
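The workflow described here (train on GPU servers, then move the weights to a laptop for prototyping) boils down to a state-dict round trip. A minimal sketch, assuming both machines share the same model definition (the ResNet-18 and file name below are placeholders):

```python
# Minimal sketch of the train-on-server, prototype-on-laptop weight hand-off.
# The ResNet-18 stands in for whatever model definition both machines share.
import torch
from torchvision import models

def save_on_server(path="weights.pt"):
    model = models.resnet18().cuda()
    # ... training loop would go here ...
    torch.save(model.state_dict(), path)  # save weights only; the code is shared

def load_on_laptop(path="weights.pt"):
    model = models.resnet18()
    # map_location pulls the CUDA tensors onto the laptop's CPU (or 'mps') device
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return model.eval()  # prototype inference locally on the trained network
```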
John says
Hey guys,
great work Tim!
What are your views on ASICs?
Google's effort on the TPU seems solid. There are more startups/companies claiming big performance, and some of them have already begun selling their ASICs, but I don't see much adoption in the community.
Is this where we are headed?
Cheers!
Tim Dettmers says
ASICs are great! TPUs are solid, just as you said. The problem with ASICs is their enormous cost in R&D and the need for a good compiler/software pipeline. If startups shoulder that cost, there is still the software and community problem. The most successful approaches compile PyTorch/TensorFlow graphs to something that can be understood by the ASIC. If this is not available, it gets difficult. The main problem with ASICs is usability. The fastest accelerator is worthless if you cannot use it! That is why all Intel accelerators failed. Once you get a usable ASIC, it is about community. NVIDIA GPUs have such a large community that if you have a problem, you can find a solution easily by googling or by asking a random person on the internet. With ASICs, there is no community, and only experts from the company can help you. So a fast ASIC is the first step, but not the most important step to ASIC adoption.
In short, ASICs will find more use in the future and have huge potential, but their potential is limited by software usability and the community’s size around them.
Siddhant Kundu says
Thanks a lot for the informative article, it’s going to be invaluable while making my purchasing decisions!
I had a couple of follow-up questions regarding a build I want to do in the near future (early 2021 is the target, for now). I want to build an SFF PC that I will be able to carry around with ease. I’m mostly going to be working on game-dev based tasks, and I’m interested in working on RL-based networks for training my game’s decision making AI. I need a high amount of VRAM for blender renders and a good amount of hardware-accelerated compute capability for my ML models, and I’ll probably be running both at the same time. I’m torn between the 16GB frame buffer on the RX 6800XT and the tensor cores on the RTX 3080, and since the 3080 20GB model does not seem to be on the roadmap right now, I think my choices are limited to those two SKUs. (The 3090 is off the table since prices are already absurdly inflated in my country.) Also, is there any particular AIB model you would recommend? (I can go with up to 3-slot GPUs)
Tim Dettmers says
I am not sure about Blender and its support for GPUs, but what I have heard so far is that the new AMD GPUs do not support ROCm out of the box, and it might be added later. With that, I would probably go with an RTX 3080. The VRAM on that one is a little small, though. It is a difficult choice: if training RL-based networks is more important, the RTX 3080; if rendering is more important, the RX 6800 XT.
Pablo Remolino says
Hi,
great article.
My question:
It seems clear that the RTX 30xx series is far better than NVIDIA Tesla cards in terms of cost efficiency, but what about Mean Time Between Failures (MTBF)? For the Tesla V100 it is 1,005,907 hours (about 100 years). (source: https://images.nvidia.com/content/tesla/pdf/Tesla-V100-PCIe-Product-Brief.pdf)
I couldn’t find any estimates for RTX gaming cards. I’ve read (I don’t remember where) that it could be something like 3-5 years. Can you confirm that?
If it’s true, it may be a real problem if you have not just a couple of cards but, for example, 30-50 GPUs. There is a fair chance that you would have to replace a GPU every few weeks. It is also possible that NVIDIA will not be eager to honor the warranty, as they have banned RTX cards from data centers.
Tim Dettmers says
Hi Pablo, I never had a personal gaming GPU fail. From the ~30 GPUs that I used at universities, I had one fail. From a small GPU cluster I was using, I also saw one GPU fail (1 out of 48 GPUs). Some GPUs are known to have much higher failure rates than others (RTX 2080 Ti and RTX 2080 Founders Edition in particular).
I think for 30-50 GPUs, you can expect to replace one GPU every six months or so. Some vendors have guarantees on RTX cards for data centers, but this is rare and might incur extra costs.
teemu says
Thanks for the great article!
I see you do not discuss the 2080 Ti much in your latest recommendations. Is this because it is not manufactured anymore? I still see some 2080 Ti cards as B-stock at about the same price as a 3070 where I am. Do you think it is still a viable alternative to the 3070/3080 for DL?
I also considered the 3090, but it seems a bit overly expensive for anything other than the big memory. However, memory often seems to be the limiting factor. As far as I understand from your post, using multiple GPUs is possible for training a model. Does this also enable training bigger models that do not fit in the memory of a single GPU? If so, I could also consider getting one 3070/2080 Ti and maybe adding another later.
I mostly do some Kaggle and applied ML/DL work. But sometimes I like to finetune some transformers and train some of the bigger CV models etc.
Tim Dettmers says
The RTX 2080 Ti is still a top-notch GPU! If you can find it cheap, it is definitely worth picking up. It is a great alternative to an RTX 3070 and on a par with an RTX 3080 if you need a little extra memory.
If memory is a problem, you might also want to wait until 2021 Q1 for the release of the new RTX 3070 Ti etc., which will have extended memory. I think waiting for the big-memory GPUs is a better choice than buying more 2080 Ti/3070 cards later.
Antoine says
Thanks a lot for the enlightening article, Tim !
Like teemu, I’m not sure whether using two GPUs in parallel allows training models that would not fit into a single GPU.
Specifically, if I buy two RTX 3080s (with 10GB of memory each), will I be able to train models larger than 10GB? Or should I buy an RTX 3090 if I plan to do so?
Tim Dettmers says
Hi Antoine,
that only works if you use model parallelism, which is getting more and more common, but which is not yet as widely supported as data parallelism. For now, a single RTX 3090 will be better for training large models.
Gokhan T says
Hi Tim. Great article. Saves tons of hours multiplied by thousands of people.
Picking the right motherboard is really tricky though. I am trying to build a system with only 1 RTX 3080, but I want the system to be expandable. If I go with the AMD Ryzen 3/5/7/9 series, more than 2 GPUs doesn't seem meaningful since 24 lanes are not enough. Am I right so far?
And the 2nd problem I have: let's say a 2-GPU system is my only option because of budget restrictions. Motherboard descriptions are not explicit. Most of the time the descriptions break down lane usage by CPU generation. For example:
1 x PCIe 4.0/3.0 x16 slot (PCIE_1)
– 3rd Gen AMD Ryzen support PCIe 4.0 x16 mode
– 2nd Gen AMD Ryzen support PCIe 3.0 x16 mode
– Ryzen with Radeon Vega Graphics and 2nd Gen AMD Ryzen with Radeon Graphics support PCIe 3.0 x8 mode
1 x PCIe 4.0/3.0 x16 slot (PCIE_3, supports x4 mode)
I cannot tell whether it is x4/x4/x4 or x8/x8/x4 or something else. Do you know how I can figure this out?
Tim Dettmers says
Hi Gokhan, it depends. Usually running 3 GPUs at 4 lanes each is quite okay. Parallelism will not be that great, but it can still yield good speedups, and if you use your GPUs independently you should see almost no decrease in performance. What you should make sure of, though, is that your motherboard supports x4/x4/x4 setups for 3 GPUs (sometimes motherboards only support x8/x4). You can usually find this information in the Newegg specification section for the motherboard in question. It is also worth searching for the motherboard to see if others have done 3+ GPU builds with it.
Gokhan T says
Thank you Tim. I found that the Gigabyte Aorus Ultra and Master motherboards provide x8/x8/x4 over PCIe 4.0. This is the cheapest board I could find and I hope it will work. I contacted Gigabyte and asked if it is possible to use the board with 3 RTX 3080 cards in an x8/x8/x4 distribution, and they said yes, it is possible. However, they suggested using a 1500+W PSU. That matches the calculations you did in the article.
Harsh Rangwani says
Hi Gokhan T,
One more thing that you might have to consider is the spacing between PCIe slots, as RTX 3080 cards are usually more than 2.5 slots wide. So I have apprehensions about the Gigabyte Aorus Master being able to fit 3 cards at the same time without PCIe extenders. Also, currently a large majority of PCIe extenders are 3.0, so I don't know if x4 PCIe 3.0 would be good enough for an RTX 3080. Let's see if Tim has a workaround for this.
Thanks
Arnaud Maréchal says
Thanks for this great learning article.
The OpenCL package allows R to leverage the computing power of GPUs.
Also, I understand that scikit-learn does not support GPUs, but some alternatives such as scikit-cuda provide Python interfaces to many of the functions in the CUDA device/runtime, cuBLAS, cuFFT, and cuSOLVER libraries.
What are your views on using GPUs in R or Python with these packages, especially when not using large NNs?
Thanks,
Arnaud
Tim Dettmers says
Hi Arnaud, I see sklearn more as an exploration tool. I think you can always explore algorithms on a smaller scale and then use dedicated GPU implementations. For R, I believe it is similar: GPUs should be used. But I think in the end it is always better to reserve R/sklearn for analysis and prototyping on a small part of the data and then roll out on GPUs.
Arnaud Maréchal says
Hi Tim.
I appreciate your answer and advice
We learn a lot from expert’s feedback, thanks for sharing your experience.
Arnaud
Bram Vandenbon says
Do you think it’s better to use a single-rail or multi-rail power supply unit when hooking up multiple GPUs ?
Tim Dettmers says
A single rail is usually better because it has a standard form factor, which allows using standard cases. If you do not need/want to use standard cases, a double rail might save you some headaches when using 4x RTX 3090 or other power-hungry GPUs.
Di Lai says
Hi,Tim
I have a quick question: if I configure a 2x RTX 3090 machine (using only a 2x RTX 30xx-ready motherboard + power supply configuration), do you suggest air cooling or liquid cooling?
Thanks !
DL
Tim Dettmers says
If you have a PCIe slot in-between GPUs, air is just fine. Otherwise, still use air but buy the blower GPU variant.
Ryan DS says
Do AMD Radeon 5000/6000 GPUs support ROCm?
It doesn’t seem like it according to: https://github.com/RadeonOpenCompute/ROCm/issues/887.
Tim Dettmers says
Thanks for the link with the info. It indeed seems these GPUs are not supported immediately. They might be added later, but there does not seem to be a big official push since ROCm is designed for datacenter GPUs.
Shuhao Cao says
Hi Tim,
Thanks for this nice article. As I constantly got OOM errors in a current Kaggle competition (16 GB of memory for training a graph transformer), I snagged a 3090 on Newegg and started building a single-card rig around it. Now I have a simple question: should I use the integrated graphics on the CPU to connect the monitor for display purposes?
Does connecting the graphics card with a dual-monitor in QHD affect the performance of the graphics card during training?
Thanks,
Shuhao
Tim Dettmers says
Usually, the displays do not need that much memory. I run 3 displays at 1920×1080 and that usually uses 300-500 MB. If you want to use less, I would suggest getting a small NVIDIA card. Otherwise, it can be a mess to get all the drivers working (you need an active NVIDIA driver to run CUDA).
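For readers who want to check this on their own machine, here is a minimal sketch using the NVML Python bindings (assuming the pynvml / nvidia-ml-py package is installed); with no training running, the "used" number is roughly what your displays consume.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used: {info.used / 1024**2:.0f} MiB of {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()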
Shuhao Cao says
Thanks Tim. I followed this guide (https://gist.github.com/alexlee-gk/76a409f62a53883971a18a11af93241b, link for anyone who is interested) to configure Ubuntu so that the integrated graphics is used for display and the 3090 for CUDA. However, it does not seem possible to do the same thing on Windows.
Tim Dettmers says
This is very useful — thanks for sharing!
Brad Love says
Amazing blog; thank you.
Do you have a reference or other justification for the utilization rates you state? I copy them below for your convenience.
Best,
Brad
Common utilization rates are the following:
PhD student personal desktop: 35%
Company-wide slurm research cluster: > 60%
Commonly, most personal systems have a utilization rate between 5-10%
Tim Dettmers says
Hi Brad! The 5-10% figure comes from desktop machines of PhD students at UW. I have had access to ~10 machines with about 30 GPUs in total. At any one time, 2-4 GPUs were used. The 35% figure is actually not from a personal desktop, but a department-wide research cluster at UW (I made a mistake here). The 35% figure is utilization over a couple of months of all GPUs that the cluster has. The >60% figure comes from GPU clusters I have worked on before (Switzerland, an old cluster at UCL, one at Facebook).
Brad Love says
Thanks!!!
Vinamra Singhai says
Thanks for such a great article. I have some questions –
1) Should I wait for a comparison between Radeon RX 6900/6800 XT and RTX 3080 for DL/AI testing results?
2) What is the difference between desktop GPU vs workstation GPU? Which is the best workstation GPU in terms of performance vs cost ratio?
Thanks again.
Vinamra Singhai says
Which is worth more: 2x RTX 3080 or a single RTX 3090?
Tim Dettmers says
I would go for two RTX 3080.
Tim Dettmers says
1) The NVIDIA GPUs are currently better since they have Tensor Cores. I would just go with the RTX 3080.
2) There is no difference. Workstation GPUs are more expensive, though, and sometimes they have more memory. I do not recommend workstation GPUs. If you must buy one, I recommend the RTX 6000 or RTX 8000.
Letmos says
Hello, NVIDIA has a monopoly on ML on GPUs, but things are changing (unfortunately, very slowly!). The new cards from AMD have impressive performance, a good price, and 16 GB of VRAM. They lack Tensor Cores, but overall they are a good choice for most games and pro software. For ML, NVIDIA is number one, but I hope this will change soon. Some competition is always good.
On AMD's website there are Docker-based versions of the latest TensorFlow and PyTorch.
(https://www.amd.com/en/graphics/servers-solutions-rocm-ml). Has anyone used them? What are your opinions about those Docker-based apps? Is the installation process straightforward?
Thanks!
Tim Dettmers says
Docker-based options should be pretty straightforward to install. Installing ROCm and PyTorch should also be relatively easy.
Tony Shi says
Thanks for the detailed information. Are these RTX 30 series cards really useful for machine learning in data science? NVIDIA's website just says they use AI for gaming.
Tim Dettmers says
The RTX 30 series GPUs are great for ML and data science. Do not be confused by what NVIDIA says — they want you to spend more money on a Quadro or Tesla GPU.
John says
Hey Tim,
Any comments on the difference for an eGPU attached via Thunderbolt 3 with 2 PCIe lanes instead of 4? — You can assume an RTX 3080 if you need to hold something else constant.
I have seen a few articles saying the difference isn't all that huge and others that say don't bother unless you can have a full 4 lanes. I am wondering what the performance impact might actually be, but I haven't come across anything while googling that really breaks it down for deep learning.
Thanks!
Tim Dettmers says
Hey John, 2 PCIe lanes should be okay if you just use 1 GPU. In that case, you only transfer data to the GPU at 2 GB/s, which is still quite fast even in the case of large images. To give an example: If your mini-batch takes 100ms through the network and you transfer a batch of 32 ImageNet images (224x224x3), then the transfer on 4 PCIe lanes will take 2.2ms while on 2 PCIe lanes it will take 4.5ms. So you will be about 2% slower on 2 PCIe lanes in this case. If your network takes 10ms per mini-batch it will be 20% slower; if it takes 1000ms, it will be 0.2% slower, and so on.
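The arithmetic behind these numbers can be sketched as follows (assuming fp16 inputs and roughly 0.985 GB/s per PCIe 3.0 lane; real-world overheads shift the results a bit):

batch, height, width, channels, bytes_per_value = 32, 224, 224, 3, 2
megabytes = batch * height * width * channels * bytes_per_value / 1e6   # ~9.6 MB per mini-batch
for lanes in (4, 2):
    bandwidth_mb_s = lanes * 985                                         # approx. PCIe 3.0 throughput
    transfer_ms = megabytes / bandwidth_mb_s * 1e3
    print(f"{lanes} lanes: {transfer_ms:.1f} ms per mini-batch transfer")
# The ~2.4 ms difference on a 100 ms mini-batch is the ~2% slowdown mentioned above.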
Miguel says
Hi! Congrats on the amazing post. It is very useful for us beginners… 😉
What's your opinion about the Radeon RX 5700 models? In some benchmarks they outperform or come very close to the RTX 2060 and 2070 models, for just $400 brand-new.
I am a Radeon and Fedora user, in the middle between "I want to try deep learning, but I am not serious about it" and "I started deep learning, and I am serious about it". I happily got my old Radeon (Pro WX 4150) working with ROCm 3.8 and I am now hesitating a lot whether I should upgrade to a more powerful Radeon and carry on with my current setup, or jump to NVIDIA and take the risk of having some issues when installing CUDA on Fedora…
Thanks so much.
Tim Dettmers says
Hi Miguel! Installing CUDA in any UNIX system should be relatively easy, so do not worry about that. I think you would get better performance out of NVIDIA GPUs, especially at the high end, mainly due to Tensor Cores, but if you are happy with a little less performance, Radeon GPUs are just fine. You might have some usability issues here and there, but if you are already using ROCm 3.8, you should already have some good experience under your belt. In general, using AMD GPUs is also quite useful for the community as we get more data on user experience, which will help defuse the NVIDIA monopoly. If you want to support the community, buying an AMD GPU and writing an experience report about it would be very helpful and valuable!
Jeff says
What an outstanding resource. Thanks so much.
Karthik Rao says
Hi Tim and other readers of this great source.
I am planning to get a new rig mostly for Text and NLP applications, might use for Images and Video too.
I plan to put in one RTX 3080 for now, but would like to build it such that I can add up to 3 more cards. Considering all the potential cooling and power issues, I am open to a two-chassis build. One chassis could host my CPU, RAM, storage, power supply, etc. Basically a PC.
The second chassis could be a GPU enclosure of sorts that connects over PCIe 4.0 x16 cables to the first chassis and holds up to 4 GPUs and power supplies. I am OK even if I need 4 such cables, one for each GPU. This way, my GPU enclosure can have lots of space between GPUs. And a 1200 W PSU should be more than sufficient for the GPU enclosure.
Can this kind of Frankenstein build work?
Tim Dettmers says
There is now more information about cooling and power. It seems power limiting works well and does not limit performance much. Also cooling can be done with ordinary blower-style GPUs. See this article for more information: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop—Does-it-work-1935/
Karthik Rao says
Thanks so much. This looks encouraging, especially as I am considering 3080s, and at 230V, so just a 10 amp draw. And maybe a lower-TDP processor like the Ryzen 7 3700.
Will keep an eye out for updates on the blower editions.
Karthik Rao says
I read around a bit more and realized a couple of other things:
1. It is a lot easier to plug in a few PCIe cables than it is to assemble a whole PC.
2. Having an external enclosure with its own power also means I can leave the GPUs off and use only the regular PC.
So if it is possible, I still want to try the frankenbuild option.
Will such a build work? Any big stumbling blocks?
Would deeply appreciate any pointers.
Thanks
Karthik
Harsh Rangwani says
Hi Tim,
Thanks for your wonderful post. https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-3090-in-a-desktop—Does-it-work-1935/
In the above blog post they use Xeon and Asus SAGE motherboard. Our lab is planning to build a 2x RTX 3090 setup but want to add GPUs later as well. Will your build described with Threadripper https://pcpartpicker.com/user/tim_dettmers/saved/#view=wNyxsY
work for us for the 4x RTX 3090 setup with blower cards?
Tim Dettmers says
Hi, Harsh! Yes, the Threadripper build should work just fine. I think you just need to make sure that you have enough space in the case.
Di Lai says
Hi, Tim
Very nice article! Thank you so much for the effort to put this together. I just would like to ask you a question: if I plan to buy a GPU workstation for deep learning, should I buy a brand name (like Dell, Lenovo, etc.), a third-party build (like Lambda Labs), or should I consider DIY? What is the usual practice? My budget is around 15K; what is the best machine that I can buy?
Thanks a lot!
Tim Dettmers says
DIY is usually much cheaper and you have more control over the combination of parts that you buy. If you do not need a strong CPU, you do not have to buy one if you do DIY. Dell and Lenovo machines are often enterprise machines that are well balanced, which means you will waste a lot of money on things that you do not need. Lambda Labs computers are deep learning optimized, but highly overpriced. In any case, DIY + YouTube tutorials are your best option. If you do not want that, I would probably go with a Lambda Labs computer.
For 15k you can pretty much buy a 4x GPU machine. You could use the 4-GPU barebone in my blog post and extend it with 4 GPUs of your choice.
Di Lai says
Thank you so much for your precious advice!
Josh Brown Kramer says
This post is amazing and is nearly prompting me to buy some RTX 3080s. There are some things I want to clear up though.
The performance metrics indicate that the 3080 is 1.25x faster than a 2080 Ti at CNNs. From the rest of the discussion, every other component of a box supporting 2080 Tis ought to be cheaper than for the 3080 (cheaper cooling and power supply in particular). So if – theoretically – the 3080 cost the same as a 2080 Ti, then I would expect the normalized performance/$ for a 2080 Ti to be at least 1/1.25 = 0.8. But it's significantly less than that for every build configuration. OK, so that must mean that 3080s are cheaper than 2080 Tis in your model. In practice it seems like I can get a 3080 for $1400 or $1500 on eBay and that they are not available from retailers. The 2080 Ti, however, seems to be available for $800-$1000 on eBay (and something like $1300 on Amazon). So I understand that this is probably a shortage issue – there is high demand for scarce 3080 cards. However, my question is whether we can ever expect this analysis to hold up – given the relative performance advantage of the 3080, can't we expect it to ALWAYS cost more than a 2080 Ti? Your analysis seems to indicate that it will cost significantly LESS than a 2080 Ti. Granted, it has less memory, but not much less, and it takes more power, but its performance/watt is very close to the 2080 Ti.
I hope you can clear things up for me. Thanks again for the terrific article.
Tim Dettmers says
Hi Josh! So currently, the prices are normalized by the cost of a full desktop. For example, the 2-GPU chart is for a typical 2-GPU desktop. A more powerful power supply is about $50 per GPU. Cooling seems to be sufficient if you pick the right GPUs. It seems "Turbo" RTX 3090s do not need any water cooling to work in a 4x setup. This means that 4x RTX 3080 with a blower-style "Turbo" fan should be enough. I think these cards are not any more expensive than regular GPUs. The bottom line is that you do not pay that much extra, and the RTX 3080 remains the most cost-efficient GPU despite the additional power requirements. Does this help?
Josh Brown Kramer says
My main concern is that in practice right now almost every 3080 I can buy costs more than a typical 2080Ti, whereas the analysis seems to indicate that the 3080 costs significantly less. Furthermore, given the relative strengths of the 3080 it’s hard to see why that would change.
Henry says
That's because everyone wants a 3080 right now and they're going for much more than MSRP. Your best bet (if you can) is to wait for NVIDIA to meet demand and the price will come back down to MSRP. Otherwise, you can track the inventory at reputable retailers to get a 3080 at a reasonable price. As for the 2080 Ti pricing, my hunch is it has gone up recently due to unscrupulous sellers hoping that people looking to get a 3080 make a mistake and buy the 2080 instead.
Ulf says
NVIDIA did pretty much a 'paper launch'. After November things should get more normal, especially since AMD has a competing product for gamers coming out soon. But you are right in a way: you will probably not get a good 3080 for $800.
On the other hand: what's the alternative? Certainly not buying the last generation.
Adrian G says
Thanks for all your help via your blogs over the years.
I am now in a situation where I have 2 X99 workstations, one with 2x RTX 2080 Ti and one with 3x RTX 2080 Ti (couldn't put 4 in that one due to buying cheap used 2.5-slot-wide GPUs, and one is already on a PCIe riser).
I want to connect the 2 machines using high-speed network cards and fiber.
Is having 100 Mbit/s network speed an absolute must or could I get away with 40/50 Mbit/s?
I haven't found any 100 Mbit/s Mellanox InfiniBand cards for less than ~$400 USD each, which is too pricey for me. Once the network is set up, is SLURM the best way to distribute the load?
Tim Dettmers says
I am sure you meant Gbit/s and not Mbit/s. 40/50 Gbit/s is sufficient if you have only 5x GPUs in total, but I am not sure if it is worth the effort. I am not sure how difficult it is to set up InfiniBand with RTX GPUs as it is officially not supported. It might be that you cannot do this and instead the communication would be GPU->CPU->InfiniBand->CPU->GPU, which is still fast enough for good parallelization, but 3 GPUs might come quite close to that performance if parallelized already. I think it would be more effective to buy a new case and riser and try to fit 4x GPUs into one box.
Rory McCallion says
Hello! Thanks for compiling all of this information and staying on top of it 🙂 . Huge help to the community!
I have a request for clarity. In your Quora article, you wrote:
“However, if you now use a fleet of either Ferraris and big trucks (thread parallelism), and you have a big job with many packages (large chunks of memory such as matrices) then you will wait for the first truck a bit, but after that you will have no waiting time at all — unloading the packages takes so much time that all the trucks will queue in unloading location B so that you always have direct access to your packages (memory). This effectively hides latency so that GPUs offer high bandwidth while hiding their latency under thread parallelism — so for large chunks of memory GPUs provide the best memory bandwidth while having almost no drawback due to latency via thread parallelism.”
I do not understand this. I love the metaphor with fast cars vs trucks, but I’m not sure how they work together in this situation, or why having big trucks wait makes things fast (IDK if that’s what you meant to convey, but that’s how I’m reading it). It seems like the crux of the article, so I’m eager to understand this passage deeper. Thank you for all your help 🙂 .
Tim Dettmers says
Hi Rory! To go along with this metaphor, you can imagine you are working at a loading dock. The speed at which you can unload packages is 1 package per minute. If a Ferrari with 1 package comes every 30 minutes, you will be idle for 29 minutes. A truck might hold 100 packages, but it needs 60 minutes to make the trip to the loading dock. This means you will wait 60 minutes for the first truck to arrive, but subsequent trucks arrive before you can finish unloading the previous truck. This means using a truck for package delivery will be faster once you need 3 packages (the Ferraris take 90 minutes, the truck takes 60 minutes). For CPUs, we often only need 1 package (1+2), and that is why a Ferrari is better there, while on GPUs we often need many packages (A*B) at once. Let me know if this is still unclear.
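The arithmetic in the metaphor can be written out as a tiny sketch (the trip times, capacities, and the 1-minute unload rate are just the example values from above):

def ferrari_minutes(n_packages):
    # one package per 30-minute trip, so packages arrive 30 minutes apart
    return 30 * n_packages

def truck_minutes(n_packages):
    # wait once for the 60-minute trip, then unload at 1 package per minute
    return 60 + n_packages

for n in (1, 3, 100):
    print(n, ferrari_minutes(n), truck_minutes(n))
# 1 package:    30 vs 61 minutes   -> low latency wins (CPU-style access, e.g. 1+2)
# 3 packages:   90 vs 63 minutes   -> bandwidth already wins
# 100 packages: 3000 vs 160 minutes -> GPU-style access to large matrices (A*B)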
eitamar saraf says
Hey,
First, thanks for sharing.
Second, I saw a Tesla K80 on eBay for $300.
Besides the cooling issue, the power consumption, and the slowness (and it's a lot!),
is there an alternative for 10+ GB at a normal price?
Tim Dettmers says
A K80 for $300 is pretty good! Otherwise, I think you might be able to get one of the old Titan cards for less than $300, but it will not be much less than that.
chanhyuk jung says
Can I plug a GPU into a PCIe slot connected to the chipset? The GPU would be connected to the chipset via PCIe 4.0 x4, and the chipset is connected to the CPU via PCIe 4.0 x4. I want to use three 3080s for multi-GPU training and for running separate experiments on each GPU.
Nadir says
Hi Tim, thank you for the in-depth guide! Do you have any suggestions on GPU (core/memory) overclocking? How much overclocking would you consider for the RTX 3090 FE? (I'm mostly working on RNNs and transformers for time series forecasting.)
Tim Dettmers says
Overclocking often does not yield great performance improvements, and it is difficult to do under Linux, especially if you have multiple GPUs. If you overclock, memory overclocking will give you much better performance than core overclocking. But make sure that these clocks are stable at the high temperatures and long durations that you run normal neural networks under.
Geoff Seyon says
Wow man! Given today’s nail-biting decision around the availability of RTX 3080 and RTX 3090 cards, this was an INVALUABLE article! Thanks Tim!
Geoff Seyon says
Tim,
Any thoughts on the speculated 3080 Ti versus the 3090 for deep learning workstations?
Thanks,
Geoff
Tim Dettmers says
It all depends on the details. It will probably be like the previous series, where the RTX 3080 Ti is much more cost-efficient, but we will have to see. It also depends on supply. If you cannot buy these cards at a reasonable price, the comparison is moot anyway.
Manish says
Leaks suggest 20 GB of VRAM on the 3080 Ti. Will the 3090 have any chance of competing with the new card?
Tim Dettmers says
If the rumors are true, the RTX 3080 Ti will be way better than the RTX 3090 in terms of price performance. I would probably no longer recommend the RTX 3090 (except maybe in 8x GPU builds).
George Pongracz says
Thanks for this great article.
I am looking to self-study with a machine at home and was interested in your thoughts regarding the recent news that an RTX 3070 16GB will be released in December 2020, and how a card like this would slot into your hierarchy.
Tim Dettmers says
This is definitely an interesting development! An RTX 3070 with 16 GB would be great for learning deep learning. However, it also seems that an RTX 3060 with 8 GB of memory will be released. Depending on the price, this might actually be the better card for learning deep learning until you are sure you want to commit to deep learning or a particular sub-area like RL, NLP, or CV, which need very different GPUs/CPUs. The money that you save on an RTX 3060 compared to an RTX 3070 might buy you a much better GPU later that is more appropriate for the specific area in which you want to use deep learning.
Devjeet Roy says
Hi Tim! Thanks for the post, it is really informative and comprehensive.
I wanted to ask you real quick about potentially upgrading my rig. I'm a PhD student 5 hours away from you at Washington State University. To keep it brief, I'm looking to pretrain Transformers for source-code-oriented tasks. Currently, I have 2x 2080 Tis and I'm definitely running into problems with model size (after trying some of the tricks you mentioned earlier using PyTorch Lightning). In the past, I was able to get Google's TensorFlow Research Cloud access for a large model to deal with these issues.
Do you think it’s worthwhile upgrading to a 3090 (and possibly putting my 2080Tis in a second machine)?
The 24 GB of VRAM seems enticing, although from my reading it seems clear that even with that amount of memory, pretraining Transformers might be untenable. However, it might speed up prototyping for my research. Also, I don't really think I'll be able to get more than 1. For now, we're not an ML lab, although I personally am moving more towards applied ML for my thesis, so I'm not able to justify these expenses for funding.
Tim Dettmers says
Hi Devjeet! Two RTX 2080 Tis should be faster than a single RTX 3090 if you use parallelism. There are better and better implementations of model parallelism and other types of parallelism in NLP frameworks, so if you still have some patience for extra programming, you will fare better with the two RTX 2080 Tis. I know that fairseq will soon support model parallelism out of the box, and with a bit of time, fairseq will also have DeepSpeed parallelism implemented. I would go with 2x RTX 2080 Ti and save the money for the next line of GPUs (probably 2023).
Joseph says
This is by far the most informative article about building a machine learning rig.
My own machine is a 3-year-old PC with an i5 6600K (4 cores, 4 threads). I'm thinking of upgrading the video card from a 1060 to a 3070. Does this move make sense? Will my CPU be a huge bottleneck for the setup?
Tim Dettmers says
It should be perfectly fine if you use a single RTX 3070 in most cases. The only case where the CPU could become a bottleneck is if you do heavy preprocessing on the CPU, for example, multiple variable image processing techniques like cutout on each mini-batch. For straight CNNs or Transformers, you should see a decrease in performance of at most 10% compared to a top-notch CPU.
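If CPU preprocessing does become a bottleneck, the usual first step is to spread it over several DataLoader worker processes so the GPU is not left waiting for mini-batches. A minimal sketch (using torchvision's FakeData as a stand-in dataset; your dataset and transforms will differ):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # CPU-side augmentation
    transforms.ToTensor(),
])
dataset = datasets.FakeData(size=10_000, transform=transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # overlap the transfer with compute
    # forward/backward pass goes here
    break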
Mark Antony says
Hi Tim!
Thanks for sharing this info!
I am a newbie to building a PC. I want to train big models, potentially for days, on my PC, but I am worried a power surge might ruin it.
If I were to use four 3080s and a 1600W PSU (I think the AX1600i has surge protection built in), would you recommend using a surge protector or even a UPS?
Tim Dettmers says
Hi Mark! Currently, nobody has experience with this, so I cannot give you any solid recommendations. I think a UPS might be overkill, but a surge protector socket does not hurt — I usually have my computer behind one in any case. I believe if you power-limit the GPUs, it might be safe, but if you want to make sure, either go with 3x RTX 3080 or wait for user reports if 4x RTX 3080 rigs with 1600W PSU are stable.
Jay says
Hi Tim,
Thank you for this post, it was extremely helpful! I'm somewhat new to the field of ML/DL and am beginning a long-term career in this field. I am quite familiar with Kaggle and will likely be working with that, as well as training smaller/beginner-to-intermediate DL models for personal projects, and beginning an MS program in the future, so I was wondering about the right setup and GPU to get.
Do you see any issues with my parts? https://pcpartpicker.com/list/DCbfMc . I'm aiming for the 2080S, which I was able to find used (for almost half of the $1000 list price). Is the VRAM enough for my current use case, or when should I think about upgrading to a 2x 3000-series setup (if necessary)?
Additionally, on the software end, do you have recommendations/preferences for ML environments, such as Windows or Linux (or both)? It would be extremely helpful if you could include an ML workstation software setup guide someday! Thank you!
Tim Dettmers says
Hi Jay,
I think your build looks good. I would start with 32 GB RAM — you can always upgrade later! I would increase the wattage on the PSU a bit so that there is room for a 2nd GPU later. Otherwise, it looks solid to me!
For software, I would use Ubuntu 20.04 with Anaconda as a package manager along with PyTorch. I think that is the easiest setup to get started and to experiment with deep learning.
Good luck!
Andrew Webb says
Great info as usual, thanks!
“Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090… This compares to $2.14 per hour for the AWS spot instance.”
Can I ask which instance type this is? I believe the p3.2xlarge has a single V100 and a spot price of $0.918 in region us-east-1.
Tim Dettmers says
Thanks, Andrew! I did not realize that something was wrong here until your reply on Twitter — thanks for making me aware of that! I think I took the on-demand instance price and calculated with it but later thought I used the spot instance price. I will correct that by including two calculations for spot/on-demand instances sometime in the next days. I will also update the rule-of-thumb and recommendations that stem from that calculation.
Giulia Savorgnan says
Simply: thank you for this!!
QwwqWq says
Do you think the 3090 will have good FP16 compute performance for its price after NVIDIA announced that it has been purposely nerfed for AI training workloads?
Source:
The RTX 3090 has been purposely nerfed by NVIDIA at the driver level.
https://www.reddit.com/r/MachineLearning/comments/iz7lu2/d_rtx_3090_has_been_purposely_nerfed_by_nvidia_at/
Tim Dettmers says
Yes, we got the first solid benchmarks and my RTX 3090 prediction is on point. As such, the RTX 3090 is still the best choice in some cases.
andrea de luca says
Hi Tim. An important question about the 3090 (and other consumer Amperes).
As we already know, despite being a lot less powerful than the 3090 in raw numbers, the professional Turings (Titan, Quadro) perform a lot better in certain CAD/CAM domains.
In spite of this, I was convinced that such an issue would not affect our domain. Still, listen to this video starting from the position I'm linking: https://youtu.be/YjcxrfEVhc8?t=576
In particular, at 10:02 in that video, it shows NVIDIA's own reply: "For AI applications, Titan RTX is better than the 3090".
Could you please explain what kind of features the consumer Amperes are missing with respect to the professional Turings?
Veedrac says
Right now the only advantage to the Titan RTX over the 3090 is higher FP16 w/ FP32 accumulate, for mixed precision, as the GeForce line is half-rate. My understanding is that most libraries support mixed precision while just using pure FP16 matrix multiplies, and therefore this isn’t very important most of the time, though it may add stability.
The upcoming, unannounced Ampere Titan may have more significant advantages, since it will not only have full-rate FP16 w/ FP32 accumulate, but also full rate BF16 and TF32, both of which (AFAIK) require FP32 accumulation. So if you expect to use either of those and are willing to pay double, waiting for the new Titan might be better. The Ampere Titan might also have more memory, perhaps as high as 48 GB.
Bryan says
Tensor Cores are being (intentionally) limited for consumer-level cards built on the Ampere architecture to drive sales for the Titan/Quadro/Tesla lines.
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
Tim Dettmers says
We just got a new benchmark which shows that the RTX 3090 has a solid lead of about 25% over the Titan RTX. So I guess NVIDIA wants to sell more Titan RTX cards 😛
G.B. says
Hello
Probably not the most clever thing to ask :\ but I'm wondering… is it possible to do both gaming and training neural nets at the same time?
Like, utilize both features of a 3080, say?
Thanks for listening. 🙂
Tim Dettmers says
Hi G.B.! It works in theory, but both your gaming experience and your deep learning experience are likely to be miserable. The way the GPU works is that it schedules blocks of computation, but the order of these blocks is not determined. This means that, on average, each application is slowed down by the number of blocks the other application processes. So I would expect a very large frame rate drop, which might shift dramatically from almost 0 to almost maximum. So I cannot recommend this setup 🙂
G.B. says
Thank you very much. Appreciated. 🙂
Alexander says
The whitepaper on GA102 states that the RTX 3080 has massively cut-down TF32 performance (aka bfloat19) (see page 14), around 25% of the Tesla A100.
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
Is this important in practice? If so, are there any options besides A100?
Tim Dettmers says
It may not be that important, because the Turing RTX 20 series had too many computational FLOPS, meaning that most of them could not be used for performance gains and were wasted. NVIDIA adjusted Ampere so that the needed computation and the available computation are more closely matched. You should not see any decrease from these statistics. NVIDIA did, however, integrate a performance degradation for Tensor Cores in the RTX 30 series which will decrease performance (this is independent of the value that you quote).
Fedor says
Hi, Tim!
I've read your advice about stacking multiple RTX 3090s, but I'm still afraid of custom water cooling. Do you think one can use the RTX 3090 Turbo without spacing?
https://videocardz.com/newz/gigbyte-geforce-rtx-3090-turbo-is-the-first-ampere-blower-type-design
Tim Dettmers says
Hi Fedor!
Unfortunately, I do not think this will work. I could even imagine that the blower is worse than the NVIDIA fan design in this case. If you do not want to go with water cooling, the best bet is to use extenders + NVIDIA coolers I think.
Sly Golovanov says
“additional $0.12 per kW/h for electricity”
It’d be “kW*h” units.
Tim Dettmers says
Great catch, thank you!
David says
Hello, thanks a lot for all of this valuable information for a novice in deep learning like me.
My question is: if I'm running (not training) a model using Tensor Cores, will it take away from the performance of the rest of the card if it's also used for regular graphics? My hope is to have a model running via TensorRT at around 70% of what the card normally offers, while still having, let's say, 70% of the graphics performance of the card. Is this possible or should I have a separate card for this? Thank you in advance.
Tim Dettmers says
Hello David, the card will still be limited even if you do not use Tensor Cores, because the block of computation that is scheduled "blocks" certain cores from being used. As such, you cannot schedule two blocks of computation (1 TC, 1 non-TC) at the same time on the same core. It is better to get two GPUs in your case: one GPU each for inference and training.
David says
I see, thanks a lot for your answer
dreahs says
Does the 3080 support SLI?
Or does just putting in multiple GPUs and spreading tasks across them not require SLI/NVLink?
Tim Dettmers says
No need for SLI or NVLink, you can just parallelize and spread data through the PCIe interface and that will work just fine!
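A minimal sketch of what this looks like in PyTorch (plain data parallelism over PCIe; the model and sizes here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model).cuda()   # splits each mini-batch across all visible GPUs

x = torch.randn(256, 1024).cuda()
out = model(x)                          # inputs, outputs, and gradients travel over PCIe; no NVLink bridge needed

For multi-GPU training at scale, torch.nn.parallel.DistributedDataParallel is usually faster than DataParallel, but neither requires SLI or NVLink.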
chanhyuk jung says
You can get up to 4 GPUs without a Threadripper using MSI's MEG X570 Godlike motherboard. You can get up to 16 cores with the cheaper Ryzen 3000 series. Three GPUs get x8/x4/x4 lanes from the CPU and the fourth GPU is connected to the chipset. It gets PCIe 4.0 x4, but it shares the chipset with other peripherals. I'm not sure whether that is negligible or not.
chanhyuk jung says
It was too good to be true. I checked some reviews of this motherboard and it turns out it's missing additional power connectors for the PCIe slots, so if you do run 4 GPUs on it, the 24-pin connector will melt.
Matt says
Great write up. Thanks for this.
Curious as to what you assume the A100 price to be when doing your performance-per-dollar examination, as there’s no “MSRP” on individual cards really.
Tim Dettmers says
I have seen some offers from 3rd-party vendors who put them in 8x GPU servers. That is what I base my estimate on.
Christian says
Tim,
This is fantastic information on GPUs, absolutely fantastic. I'm working on building a machine geared toward gradient boosted regression (XGBoost/LightGBM). Do most of the deep learning findings in this article hold true for XGBoost/LightGBM? Or would your GPU recommendations change drastically for machines designed for these purposes?
Tim Dettmers says
I am not entirely sure about the algorithmic structure of XGBoost and LightGBM, but I imagine they use matrix multiplications and frequent element-wise operations. Since all these operations are heavily bandwidth-bound (even more so than transformers), I would expect the transformer benchmarks to come quite close to the relative performance for XGBoost and LightGBM. This means an RTX 3080 vs RTX 2080 Ti will compare similarly to the transformer case, but the overall speedup might be lower than for an RTX 2080 Ti. In other words, the numbers should be accurate within the RTX 30s and RTX 20s cards, but not between them.
Andy N. says
Hi Tim,
This is BY FAR the best thing I have ever read on GPUs in deep learning. Thank you for your work.
On top of that the comments are excellent and packed full of practical advice, kudos.
I was previously operating under the misconception that the V100/A100 was the only option for server-scale (6+ card) DL GPUs; you've convinced me that 30XX (or 2080 Ti) cards are actually the way to go. I've decided to invest in a large (ideally 8-card) rig of these. However, I've only built one desktop before and feel lost in the wilderness of suppliers and design decisions. I'm worried I won't have the time to learn all the details needed to make a reliable configuration that doesn't suffer from power/cooling problems. Which leads me to the following questions (which I expect may be informative for other readers who don't have an electrical engineering background or hardware experience):
1. Is there any provider you know of with a plug-and-play 6x+ 3080 or 2080 Ti solution (meaning already assembled and ready to boot)? I've seen lambdalabs.com but wasn't able to find any 30XX or 2080 Ti machines large enough.
2. For DIY, do you know of any good end-to-end recipes for 3080s or 2080 Tis which say exactly which parts to buy and from where, and that have actually been run in production and found free of cooling and power problems? It looks like the Reddit thread mentioned by commenter Marcin concluded that the build was not viable: https://www.reddit.com/r/buildapc/comments/inqpo5/multigpu_seven_rtx_3090_workstation_possible/
Also curious if any of the other commenters know of options for 1 and 2. Other info- noise is not a concern and can have a dedicated 220V (e.g. washer/ dryer circuit) for power.
Tim Dettmers says
Hi Andy,
It's a pretty difficult problem. I see the Reddit thread has some good suggestions already. I think you will not see an 8x GPU server with RTX 3090s anytime soon, as there are too many issues for companies to figure out and it is probably not worth it for them, so building a custom rig might be the only option. The motherboard that is suggested there is great. Otherwise, the case is really important. I do not know of any particular case which is suitable, but it would be worth it to look at mining cases and PCIe extenders. I think that would work out well. Otherwise, there might be some issues here and there with server hardware. There was somebody in the comments who had very valuable experience, so you might want to read a bit more of those, but in general, the experience is gathered on a case-by-case basis, which means if you go ahead you sometimes need to figure things out for yourself.
I did a similar project when I built my little GPU cluster with InfiniBand, and some issues took a long time to figure out (how to get InfiniBand working on GeForce Titan GPUs). So if you go ahead you should expect some problems, but it can be a great opportunity to learn more details about servers and their software. So if you are not afraid of that, I would say give it a go. If you do, I would love to hear back from you about your experience! Good luck!
Jason Hsu says
Hello Tim,
Thank you for a very detailed explanation!
I saw you recommend the RTX 3070 as the best choice for beginners. As an undergraduate college student who just started with machine learning (and everything related), I thought your recommendation fit me perfectly; however, I was really attracted to the price-to-performance value of the 3080, and I also like that it is slightly more future-proof, as I am not planning to swap my computer parts anytime soon (hopefully for 5+ years). Would you recommend the 3070 or the 3080 in my case?
Tim Dettmers says
Hi Jason,
Yes, if you look for a GPU for 5+ years, I would recommend the RTX 3080. I would even recommend waiting a little longer for the versions with increased memory. There are currently rumors of RTX 3080 versions with much larger memory. These GPUs would definitely last you 5+ years. I think that would be your best bet.
Devon says
Thanks, Tim, for the very informative and detailed write-up. One thing I didn’t really see in your post and the Q&A below is the consideration between purchasing a card from Nvidia or someone like EVGA (e.g., not Nvidia). Aside from cost, are there situations where someone should go to Nvidia rather than anyone else? Thanks again.
Tim Dettmers says
Often the third-party cards have some slight overclocking and different fans but are not very different from the original NVIDIA card. I would just buy the card that is cheapest or the card that has a particular fan-design which suits you best. In general, the fan-design of NVIDIA for the RTX 30 series seems to be pretty solid and I would probably buy the NVIDIA card over other cooling solutions (at least in a single or dual GPU setup).
Petr Prokop says
Hi Tim,
I am a bit of a newbie… I wonder how to think about a possible upgrade of my rig. I have an ASUS TURBO RTX 2070S 8G EVO + a GTX 1050 Ti on an X399 board with a 1920X Threadripper. I am considering adding a Gigabyte RTX 2080 Ti TURBO 11G, but I am not sure whether it would be better to consider the coming RTX 3070s or RTX 3080s, or another 2070, for the most efficient dollar utilization.
I have read in your text that the different GeForce lines 10/20/30 can work together, but:
– quicker GPUs have to wait for slower ones, which is no good
– parallelizing is not efficient
Q> Would you recommend i) a second RTX 2070 SUPER, ii) an RTX 2080 Ti, or iii) an RTX 30X0 in terms of $ efficiency? Other considerations?
Tim Dettmers says
Hi Petr, this is a difficult question, and I think it depends on your use case. If you feel the memory on your existing 2070S was sufficient, it might be a good idea to buy a second one. If you parallelize them, they will be faster than a single RTX 2080 Ti and close to a single RTX 3080. If you think you will upgrade to more GPUs in the future, though, or feel memory-limited, I would go for an RTX 3070 or RTX 3080.
Raivo Koot says
Great post. Thank you!!
I am looking at motherboard compatibility for two 3090s. Many motherboards (X570) have 3 PCIe x16 slots. However, often only the upper two are "integrated in the CPU" and the third one is "integrated in the chipset". Can I still use the chipset slot for PyTorch etc.?
The reason I am asking is that the middle slot, which is integrated into the CPU, would leave no space between the two cards (heating problems!), whereas the third, lower slot would leave enough space between the two. If you can let me know whether I can still happily use one GPU in the third "chipset" slot while using another in the top "CPU" slot, you would help me a lot. Thanks!!
Tim Dettmers says
This always depends on motherboard configurations. If the motherboard says something like 4-way-SLI ready or the equivalent, it should work. If it says it supports 8x/8x/8x/8x or 8x/8x/16x/8x or something similar it should also work. These are usually the best indicators if you can put a GPU into that very last slot.
David says
Hi Tim! Thank you for the guide.
I am currently considering a 3080 GPU, with a second GPU to be added some time later. I am new to PC building, and I tried to put together a setup as follows:
https://pcpartpicker.com/list/HHdNXv
I based some parts from your 2-GPU barebones pcpartpicker. I’m not sure about the motherboard though:
1) Perhaps I should go for a newer generation with PCIe 4.0?
2) Your guide recommended some space between two 3080s. With 3 PCIe x16 slots, would I put one GPU in the top slot and the other in the bottom one? Would this be enough space between the GPUs, and would there be enough room in the case to put a GPU in the bottom-most slot?
If there is some issue with the motherboard, I would highly appreciate advice on what to choose.
Lastly, would you recommend a FE 3080 or some aftermarket version?
Again, thank you so much for this guide!
Tim Dettmers says
Hi David! If you have only 1-2 GPUs, I would go with PCIe 3.0 because it’s cheaper, and you will have almost no advantage from PCIe 4.0 (in terms of deep learning). Yes, if you have 3 PCIe x16 slots, you can put one GPU in the top and one GPU in the bottom, and that should be more than enough to keep your GPUs cool!
Tom Lee says
Hello Tim Dettmers,
thank you very much for your effort over the years. Your posts are very helpful!
Regarding the power consumption of 4x RTX 3090:
Apparently it is possible to lower the power limit of the GPU quite significantly without much performance loss.
https://www.reddit.com/r/MachineLearning/comments/isq8x0/d_rtx_3090_rtx_3080_rtx_3070_deep_learning/g59xd8o/
I tested this on my own Titan RTX with 240 watts instead of 280 and lost about 0.5% speed at 85.7% power. Although the network was quite small per layer, I will test it again with the biggest one I can fit into memory with a batch size of 8 so the GPU is fully utilized.
I have also googled for some time but couldn't find any tests in this direction.
But it would be interesting to see extensive tests, so you know whether you can just reduce the power limit of the RTX 3090 to something more reasonable like 300W.
Tim Dettmers says
Thanks, Tom! I should have included your data in the blog post. I think I overlooked it among all these comments. Thank you for sharing that information and sorry that I am only now replying!
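For anyone who wants to reproduce Tom's experiment, here is a minimal sketch using the NVML Python bindings (pynvml / nvidia-ml-py); setting the limit requires root, and 240 W is simply the example value from Tom's Titan RTX test. The equivalent one-liner is `sudo nvidia-smi -pl 240`.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"current limit: {current_mw/1000:.0f} W (allowed {min_mw/1000:.0f}-{max_mw/1000:.0f} W)")

pynvml.nvmlDeviceSetPowerManagementLimit(handle, 240_000)   # 240 W, given in milliwatts
pynvml.nvmlShutdown()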
Ivano says
I do not agree about avoiding Tesla cards.
I have a Tesla M40 with liquid cooling in my desktop and it works well.
It is not comparable with recent cards, but it is extremely cheap (you can find a K80 or M40 for about $200-230) and it has a lot of memory (24 GB).
Tim Dettmers says
That is a fair point and actually a pretty good use-case: low budget but high memory requirements, and if speed is not that important, these cards are pretty solid. I might add that use-case to the recommendations. Thanks for the comment!
ivan says
I'm currently using a Tesla M40 and a Titan X in my machine. Although I do not suggest using the Titan X, the M40 is quite good thanks to its 24 GB.
Unfortunately, both GPUs are quite old and the performance/efficiency ratio is not so good (6.x TFLOPS at 250W for both of them).
In other words, you can have cheap GPUs with a lot of memory if you need it, but the energy consumption is quite high. I think I'll replace the Titan X with a Titan Xp (it doubles the TFLOPS at the same energy consumption) or an RTX 2060 Super.
In the meanwhile, I’m waiting for the 3090 😉
Tim Dettmers says
That sounds like a good plan! I like your thinking about efficiency in both power and runtime performance. It is spot on!
Fairwell says
Very nice input. I only checked used pricing on other cards like the Titan RTX, but compared to the new options (in this case the RTX 3090), the used options mostly seem way too expensive considering what you are getting.
At this price point it sounds like an excellent option for specific memory-hungry use cases: you can play around with them without having to invest up front in the usually quite expensive high-memory options until you know for sure you need to do serious training on them.
Fairwell says
First of all, amazing write-up. That is indeed extremely helpful.
I hope that you can give me a suggestion. You made some good recommendations regarding whether to get the NVIDIA RTX 3080 vs RTX 3090 vs more sophisticated multi-GPU setups (e.g. for start-ups) based on what you intend to do. I am working full time as an AI engineer and do some side projects and Kaggle stuff. So far, the only times I needed really serious GPU computing power were for work, for which I have the company's AWS cloud account. Other stuff is done on the company's GPU cluster, and smaller prototyping on regular NVIDIA RTX cards.
I am looking for a serious upgrade of my current home setup, which I use for a good part of my daily job (home office time), some side projects (which will grow significantly in the future), and some other things for which a fast GPU comes in handy (e.g. some private video editing), but first and foremost I want to replace my older NVIDIA card with the new generation.
I was keen on getting the RTX 3090 since it was rumored to have 24 GB of VRAM, which comes in really handy for more sophisticated models (so far I had to make do with 8 GB, which is fine for daily prototyping); it seemed like the perfect deep learning card without having to invest in serious Quadro/Tesla cards. However, due to the competition from the upcoming AMD Big Navi and the new consoles, NVIDIA was overly generous with the number of CUDA cores/tensor units etc. on the RTX 3080. On paper that beast offers even more performance for its price than its cheaper RTX 3070 sibling. Now, TensorFlow 2 as well as PyTorch have pretty good multi-GPU support (about a 92% gain for each additional GPU, up to 4 GPUs, in most situations), and I am leaning very hard towards getting 2x RTX 3080 Founders Edition instead of one RTX 3090. Right now my setup will remain air cooled, so I want to go with the Founders Edition cards, which come with a pretty nice cooling solution.
Would you recommend 2x RTX 3080 or one RTX 3090 in my case? My case is pretty huge, has good ventilation, and power is no issue; there is room for a second power supply, which I have left over anyway. I assume that 2x RTX 3080 would perform way better on most models, even though the batch size can be set way higher on a single RTX 3090.
Things I am very concerned about:
-> Huge models I might want to tackle in the future (without the cloud). I might need to manually reduce a model's complexity or split it up somehow if 10 GB of VRAM is not sufficient (memory pooling is not supported for 2x RTX 3080). What is your practical take on this?
-> Huge models combined with large batch sizes might perform better on a single RTX 3090 (however, most days/models I use will be fine with 10 GB of VRAM on a day-to-day basis).
-> The cards would only be 2 slots apart, i.e. each RTX 3080 takes up 2 slots with its cooling solution, so my setup would not provide space in between. Would those cards throttle so much that the whole setup becomes pointless compared to one RTX 3090? I'd like to note that, if really necessary, I could mount the 2nd GPU in a different position in the case by buying an additional PCIe extender. I'd prefer the straightforward simple setup, but that would be an option.
Any suggestions and reasoning are highly welcome.
Tim Dettmers says
I think 2x RTX 3080 makes the most sense in your situation. If you do personal projects, most time is spent on prototyping or hacks rather than on fully fledged products. As such, training a slightly smaller model, or accepting slow training for your final "production" model, is okay. There are enough memory tricks that you can train pretty sizable models. If you want to go one step further, you can use model parallelism, which basically pools your memory across GPUs. Model parallelism is better and better supported in PyTorch, and I am sure that in the future there will be even more software that supports memory-efficient training.
Regarding cooling, I think the RTX 3080 FE probably has pretty strong cooling performance, especially if you have only 2 GPUs next to each other. You could wait for some reviews on cooling performance just to make sure, or you buy them now and get a PCIe extender if it does not work out. Generally, if cooling performance is poor, you lose about 20% performance. If you assume a 92% gain per GPU and RTX 3090 baseline performance is 1.5 while RTX 3080 performance is 1.35, then 2*1.35*0.92/1.2 = 2.07, which is still about 40% faster than a single RTX 3090. As such, you will have better performance with the two RTX 3080s than with a single RTX 3090 in any case.
However, the extra heat might make those cards more prone to failure. From personal experience, I would estimate the failure rate per year per GPU is about 2-5% and I could imagine that the failure rate would easily double to 4-10% if the setup is experiencing a lot of heat for an extended period of time.
I think considering all factors, it is still the best choice to go for 2x RTX 3080.
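The back-of-the-envelope estimate above, written out so the assumptions are explicit (the relative performance numbers, 92% scaling, and 20% cooling penalty are rough estimates, not measurements):

rtx3090, rtx3080 = 1.5, 1.35        # assumed normalized single-GPU performance
scaling_per_gpu = 0.92              # assumed parallel efficiency per GPU
cooling_penalty = 1.2               # ~20% loss if the cards run hot

dual_3080 = 2 * rtx3080 * scaling_per_gpu / cooling_penalty
print(dual_3080, dual_3080 / rtx3090)   # ~2.07, i.e. roughly 40% faster than one RTX 3090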
Fairwell says
Thanks a lot for taking the time to give me such a detailed breakdown and recommendation.
It seems that if some models are too big to fit on one GPU, frameworks like Eisen (together with PyTorch) can handle that. Due to not having real memory pooling like NVLink provides, the overhead introduced might make a single RTX 3090 faster in these scenarios (based on 1-4 GPU model-parallelization benchmarks from the last generation).
However, overall, setups that do not strictly require that much VRAM most of the time are likely to be far more cost-efficient with up to 4x RTX 3080 before considering more expensive cards, strictly due to their extremely competitive pricing.
I will go for 2x RTX 3080 FE to get up and running soon and sell these cards later down the road once the memory requirements for my usage really get too high.
Oscar says
Hi Tim, Fairwell:
I am an NLP engineer, and I also intend to use it for training smaller NLP models. Considering the 24 GB of memory, I thought 1x 3090 would be better than 2x 3080; this way I can also avoid the complication of parallelizing across two cards. Any ideas? Thanks.
Tim Dettmers says
Hi Oscar! If you use smaller models, I would definitely prefer 2x RTX 3080 over the single RTX 3090. An RTX 3090 will not be faster than an RTX 3080 for small models (because you cannot saturate all cores). Besides, two RTX 3080 will be much faster if used with straight data parallelism, and with current software, it is pretty easy to use.
Ervin says
Hi Tim! You have mentioned multiple times some memory tricks in order to train larger networks on GPUs. Can you point to some of them? Thanks!
Michael Conrad says
How do the RTX 2060 and RTX 2070 compare to the GTX 1060?
I'm trying to get a multi-lingual/multi-voice Tacotron 2 operational for a low-resource language, and I don't have the bandwidth needed to use cloud-hosted solutions.
Tim Dettmers says
The RTX cards are much more powerful, but often also a bit more expensive. If you just want to use Tacotron 2 for inference, you do not need a powerful GPU. If you also want to train models, then you should favor the RTX cards.
Alex Dubinsky says
“Currently, there does not seem to be a PSU with more than 1600W on the desktop computer market.”
That’s because standard North American outlets are only 15A x 125V. They can’t deliver more than 1875W, and you also have to subtract PSU losses. Best to get a dual-PSU case.
Recipe for a killer 7-GPU machine:
Hydra VII case (1)
7x 50cm LinkUp Extreme Shielded PCIe riser cables (2)
Asrock Rack ROMED8-2T (3)
2nd gen Epyc (4)
4x or 8x 32GB RDIMMs (5)
an Intel NVME drive (6)
(1) supports 2 PSUs and 8 dual-slot or 6 triple-slot GPUs. Might not be available at the moment.
(2) the cheaper non-Extreme cables may need to be wrapped in aluminum foil to work. The Ultra are even better with PCIe 4 compatibility, but currently not available in 50cm
(3) 7 full 16x PCIe 4.0 slots! However, might need to be run at 3.0 speed for riser compatibility. The EPYCD8-2T is also a great motherboard, but with 8x PCIe 3.0 slots.
(4) The cheapest CPUs use only 1 or 2 chiplets, which affects L3 cache size and mem bandwidth, although it probably will have no noticeable effect. The cheapest full-fat CPUs are the 7262 and 7302P.
(5) Don’t get 8x 16GB. The difference in bandwidth will be hardly noticeable, but you’ll miss being able to cheaply upgrade the RAM.
(6) Intel SSDs are the best with smooth performance. Even the cheap 660p works well.
Tim Dettmers says
I did not know about the North American outlets, this is a very good point! Thank you for sharing! I should add this to the blog post as this is critical information for North Americans.
I like your part list and analysis! Very useful and hopefully, that will give people ideas about what their build could look like. Would love to update my blog post with some more data if you can give me (and others) feedback on your builds. If we can figure out what works and what does not we can all have cheap powerful machines.
Alex Dubinsky says
My 4 builds so far have used:
– The wondrous Hydra VII case, which turns out is not currently in production. I actually contacted the manufacturer yesterday and they said they can do a custom run with a MOQ of 200 units. I’m considering sponsoring a run and putting them back on sale.
– 7x non-Extreme LinkUp 50cm cables. I had to wrap them in aluminum foil and packing tape to work consistently. It’s a bit funny, but that’s what shielding is. They’ve released improved cables and I expect those will work fine out of the box.
– 3x AsRock EPYCD8-2T motherboard which is quite good with a very useful web-based IPMI interface. It has some odd quirks, like not letting you control fans through the OS. You have to use IPMI–not the web UI but actually ipmitool. (command is `ipmitool -H -U raw 0x3a 0x01 0x64 0x64 0x64 0x64 0x64 0x64 0x64 0x64`). Someone’s mentioned it doesn’t suspend either, but that’s not something I use.
– 1x ASUS X99-E WS motherboard. Legendary board. Works like a rock. I wish ASUS made something similar for Threadripper or EPYC. Supports 7x GPUs as long as you enable above-4G decoding. I’ve tried a Supermicro board a while back (just before I bought the ASUS), and it kept rebooting. I guessed it was some sort of watchdog timer, but the user manual was useless to figure anything out. I’ve avoided Supermicro ever since.
– I have both 7x 2080 Ti and 7x 1080 Ti machines.
– 2x PSUs in each PC. I’ve used cheap 2000W (220V) miner PSUs and expensive name-brand 1600W PSUs. The cheap-o ones work just as well as the expensive ones. All PSUs die (I’ve had 2 expensive ones die and 1 cheap one). Just keep extras onhand. I connect one PSU to the motherboard, and the other to the GPUs. Both draw similar amounts of power (I expected big GPUs to avoid using mobo power, but apparently they do). The motherboard PSU is connected to a UPS. The GPU PSU doesn’t need to be. If lights go out, the machine stays up but the GPUs become unavailable (on Linux… can’t say if Windows is so forgiving).
– Adjusting GPU fans on headless Linux PCs is possible, see my post https://unix.stackexchange.com/questions/367584/how-to-adjust-nvidia-gpu-fan-speed-on-a-headless-node/367585#367585
I’m looking forward to ROME8D-2T and the new LinkUp risers, but I haven’t actually tried them.
I’ve dreamed about 7x water-cooled GPUs ever since I started working with GPGPU back in 2004 (yes, before CUDA). Last year, that dream finally became a reality. Or rather, it became my nightmare. DO. NOT. WATERCOOL. It’s great in theory, but consumer watercooling equipment is just the worst, most unreliable shit. I can explain in more detail, but TL;DR air-cooled machines are just so much better. They even run cooler. All thanks to long, high-quality x16 riser cables which became available only very recently.
My next dream: Inventing bifurcated riser cables which split x16 PCIe buses on EPYC into 4 x4 buses. This is already an option in the EPYCD8-2T BIOS, but the difficulty is that some circuitry is required in the cable for clock distribution. Then we could have 28-GPU machines.
Tim Dettmers says
Hi Alex! Thank you for sharing your experience — this is extremely valuable information!
Great to see you have had a good experience with EPYC CPUs and motherboards. I would recommend them more, but there is just too little information out there about which CPU/motherboard combinations are reliable!
The Hydra VII case looks absolutely great! Do you know of any comparable cases? Some people in the comments wondered if dust would be a problem in such open-air designs. Do you have any experience with this that you could share?
It is good to see that you can trust miner PSUs. I was always a bit skeptical about PSU quality, and to me it felt like most PSUs have no real difference in quality. The first PSUs that felt top-notch to me were EVGA PSUs. But it might be that my feeling is off here. I never had a PSU fail, but I might just have been lucky, so I might be biased here.
Regarding headless cooling: Andy Jones worked on some python solution for this where you do not need to meddle with configs yourself. For me it worked great: https://github.com/andyljones/coolgpus
I never had a water-cooled setup myself and I am curious about more details of your water-cooling experience. I read up on water cooling and often read that parts were not reliable in the past but that it has come a long way since. It seems you would not agree with that statement. Otherwise, I agree that PCIe extenders/risers can often solve problems with cooling quite efficiently without any of the risks or hassles from water cooling. I guess the only problem can be space and as such, it is more important to pick the right case.
Please keep your comments coming — your insights are very valuable!
Marcin Bogdanski says
Hi Alex
I spent the last few days researching a similar build. In the end it came down to the power supply. Apparently it is not advised to bridge PSUs. I consulted a few people with a background in electrical engineering, one specifically with expertise in power supply design, and in general the more experienced the person was, the more they advised me against bridging PSUs (which were not specifically designed with that in mind). Apparently server PSUs are specifically designed for it, but they are awfully loud.
I have more details here if someone wants to carry on down a similar path:
https://www.reddit.com/r/buildapc/comments/inqpo5/multigpu_seven_rtx_3090_workstation_possible/
Alex Dubinsky says
I haven’t had any problems with dual PSUs in my 4 machines. Now, you definitely don’t want to short-circuit the 12V lines from different PSUs, because one PSU may want to output 12.1V and another 11.9V, and they’ll fight each other. AFAIK, that’s not what happens when you use dual PSUs. GPUs route current through independent wires and voltage converters, and it’s OK if the GPU gets 12.1V from the PCIe slot and 11.9V from the power connectors.
andrea de luca says
Hi. I don’t know if it could be an issue for you, but consider that I have had both the EPYCD8 and the ROMED8, and both of them refuse to go into suspend mode (both sleep and hibernation) in Linux and Windows.
However they are awesome boards, and the Epyc 7282 I used did draw even less power than declared.
Chris says
How much performance drop would you expect the 3080 or 3090 cards if used in a eGPU setup (such as a Razer)?
(Also does anyone know if eGPU makers will be supporting the new 3080/3090?
Since laptops remain frustratingly hard (if not impossible) to upgrade hardware components in, is it worth getting a laptop (for deep learning) without a GPU and simply connecting one of the 3070/3080/3090 cards to it via USB-C Thunderbolt and an eGPU chassis?
For example the Dell XPS 17 comes in 3 options:
a) no GPU,
b) a GTX 1650 or
c) an RTX 2060.
But the card is soldered to the motherboard, making upgrading not recommended.
I’m curious (and skeptical) whether the crazy high TDP values of the 3080 and 3090 can be adapted to a laptop, or if the heat and power requirements would make this untenable.
What kind of performance drop would we expect from, say, a 3080 laptop-version card compared to the desktop variety?
Thanks Tim, love your review.
Tim Dettmers says
Hi Chris, I think the RTX 3080 and RTX 3090 should fit without any problem into eGPU setups (be aware of the power requirements, though). They should be compatible since the enclosure translates PCIe to Thunderbolt 3.0 and back to PCIe, and since PCIe is a unified protocol the transfers are guaranteed to be compatible.
Yes, I think a cheap laptop in addition to an eGPU is a very smart solution, especially if you are a heavy user and want to avoid cloud costs over the long term. A local GPU can still be useful for prototyping, and some like being able to run everything via a local IDE. But since your eGPU is close to you it should have low latency, and it is easy to set up IDEs to work on remote computers. So with a bit more effort, a laptop with no GPU should be just fine.
I could imagine that NVIDIA adapts the RTX 3080 for a laptop version. This could mean a smaller GPU (not really an RTX 3080 anymore) or lower clock rates (very likely).
These were good questions. Thanks for your comment 🙂
Andy says
Hi Tim,
Great article. Have you got an article, or do you know of a resource, that explains in detail the reasons for the memory requirements of different models? I.e., what is actually taking up the memory, such as the size of the batch, the size of the input, the depth and breadth of the model and all associated weights. What exactly happens during backpropagation in terms of memory, and what is stored?
I’m trying to understand how much memory I might need but I feel I need more information than the general guide you post here.
What is meant by “the model doesn’t fit in memory”? Surely we just reduce the batch size a bit? Or would we be talking here about situations where we are already at a batch size of 1? How effective is training if we have a batch size of 1?
Are you saying there is a difference (in terms of memory requirements) between using something like a state-of-the-art feature extractor, where we already have the model and just train the tail, and creating a huge CNN from scratch?
Could you give some idea of the size of a CNN that might fit in 10GB or 24GB for a given input size.
Thanks
Tim Dettmers says
Hi Andy, the best resource I know of that discusses this is: https://arxiv.org/abs/1904.10631. The technical report does not cover all memory saving techniques but some of the most common ones.
“The model doesn’t fit into memory” often means that batch size 1 does not even fit, but also it is common to use that expression if the batch size is so small that training is abysmally slow.
There is definitely a big difference between using a feature extractor + smaller network and training a large network. Since the feature extractor is not trained, you do not need to store gradients or activations for it. This allows you to reuse all the “dead” memory of the previous layers. Thus a feature extractor + small network will require very little memory.
The most significant factor for CNN memory requirements is the input size. Very large images as used in medical imaging need lots and lots of memory. Video-level CNNs also need a lot of memory. Beyond that, it is highly dependent on the network architecture. Early layers use much more memory than later layers. VGG uses quite a bit of memory for its depth when compared to a ResNet because ResNet has smaller early layers. More advanced networks that have branches within each layer are more difficult to analyze, and it is difficult to get a sense of how much memory a CNN needs just by looking at the architecture. With those networks, you need to do benchmarking with the actual network to understand where the memory is used.
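To make the frozen-feature-extractor point concrete, here is a minimal sketch (my own illustration, not from the post; the resnet50 backbone, head size, and batch size are just placeholders) of how freezing the extractor avoids storing its gradients and activations:

```python
# Minimal sketch: a frozen feature extractor needs no gradient buffers and,
# under torch.no_grad(), no stored activations, which keeps memory use low.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()            # keep only the feature extractor
for p in backbone.parameters():
    p.requires_grad = False            # no gradient buffers for these weights
backbone.eval()

head = nn.Linear(2048, 10)             # small trainable head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(32, 3, 224, 224)
y = torch.randint(0, 10, (32,))

with torch.no_grad():                  # backbone activations are not kept for backward
    features = backbone(x)

loss = nn.functional.cross_entropy(head(features), y)
loss.backward()                        # only the head's gradients are stored
optimizer.step()
```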
Marcin Bogdanski says
Hi Tim
First, thanks for putting all the effort into the great post; it is probably the best single resource on the internet. What do you think about an EPYC 7402P on a workstation motherboard as an alternative to a Threadripper 3960x for a 4x GPU build? The cost is very similar, and you can fit 4x GPUs and have spare PCIe slots left for SSDs or fast networking.
– What do you think about potential BIOS/linux/driver compatibility issues?
– Do you think they would handle 5 or 6 250W GPUs on risers (assuming cooling/mounting/power is resolved)?
ASRock ROMED8-2T – 7x PCIE x16 – $650
https://www.asrockrack.com/general/productdetail.asp?Model=ROMED8-2T
Gigabyte MZ32-AR0 – 6x PCIE x16 – $750
https://www.gigabyte.com/uk/Server-Motherboard/MZ32-AR0-rev-10
Thanks again!
Tim Dettmers says
EPYC CPUs are great! I think going the server-components route makes a lot of sense, especially with the RTX 3090, which needs more space, power, and cooling. The only issue is that it is often difficult to tell whether you are buying good-quality components, because gaming components are reviewed and judged by a community while server components are not. BIOS/Linux/driver compatibility should be no issue, especially if you only have a single CPU on the motherboard.
Marcin Bogdanski says
Thanks for your reply!
After doing more research, I found that Supermicro also carries similar boards for EPYC processors and seems to be a more reputable brand in the server space (if anyone goes that route: Supermicro website -> Building Block -> Server Boards -> EPYC; it never showed up in Google, I had to manually browse the website).
After much consideration I decided to purchase 2x 2nd-gen Threadripper 2950x systems for a total of 8 GPU slots. The price is slightly higher than a single 3rd-gen Threadripper 3960x, but not by much. PCIe 4.0 and fast inter-card communication are not a factor for me. Hope this may be a useful alternative to consider for others in a similar situation.
The main reason I decided against the server route is that a desktop build would still be limited by a 2000W PSU (so 4x GPUs anyway). For more power one would have to use server PSUs, which seem too loud for an office/home environment.
Hopefully by next time I will have a server room and will go full swing with a 10x GPU build!
Ryan Mink says
Hi Marcin!
I am also interested in doing a 4x GPU build.
I am having a hard time deciding what case to use if I go with air cooling.
If you are not doing liquid cooling, what case do you plan to use?
Marcin Bogdanski says
Hi Ryan
Yeah, I’m almost certainly going with air. Easier to swap, doesn’t leak, much cheaper. Hybrid cards should fit into a standard case, but at a significant price premium.
I haven’t decided yet, but I am considering a 6U server crypto-mining case. Google “6U mining case”; there are not many suppliers in the West with stock, but they pop up on eBay (they are on Alibaba as well).
Here is one in stock in UK
https://www.xcase.co.uk/collections/mining-chassis-and-cases
One thing to note is that GPUs in mining rigs may not be properly mounted; they just kind of lie there. I’m planning to attach each GPU to a vertical mount and then drill/screw the mount to the chassis. This way it will be properly mounted.
Example vertical GPU mount:
https://www.coolermaster.com/catalog/cases/accessories/universal-vertical-gpu-holder-kit-ver2/
Hope it helps!
Ryan Mink says
Thanks for the info.
I am considering an open air mining case like https://www.amazon.com/Veddha-Deluxe-Model-Stackable-Mining/dp/B0784LSPKV/ref=sr_1_2?dchild=1&keywords=veddha+gpu&qid=1599679247&sr=8-2.
What are your thoughts about this vs a closed air case like the one you posted above?
My only concern with an open air case is the possibility of more dust.
Marcin Bogdanski says
Hi Ryan
I don’t have much experience with open cases, sorry; it might be worth googling around and checking mining forums. For me the main reason to do a closed one is that technicians in the lab frown upon exposed electronics, and I want the ability to move it around easily.
Gustavo says
Why not this
https://www.supermicro.com/en/products/system/4U/7049/SYS-7049GP-TRT.cfm
Marcin Bogdanski says
Xeons are more expensive and have fewer cores than EPYC/Threadripper.
Ryan Mink says
Hi Tim!
Thanks a lot for sharing this. It is very informative!
Btw I am thinking of buying four FE 3090s for my research workstation.
Since there is no conventional case that can accommodate four of them, I am considering an open air case commonly used in a mining rig.
What are your thoughts about that?
Should I be concerned with dust if I am going to clean the system using an air compressor once or twice a month?
Tim Dettmers says
I think an open case makes a lot of sense, but I am not sure how it will perform over time. It might make sense to read a bit through cryptocurrency mining forums to see what people’s experiences are. However, mining rigs are often at 100% load 24/7 while GPUs are usually used only a small fraction of overall time — so overall the experience might not be representative. I think it is difficult to say what will work best because nobody used GPUs in such a way (open-air case + low utilization).
Ryan Mink says
I am also considering custom water cooling but I am not comfortable having the system run nonstop for days for training transformers due to potential leakage that can totally ruin the system.
Is it common to run a water-cooled system nonstop for days?
oarph says
Confused about the bar charts showing the RTX 30-series performance. Did you actually get a pre-release RTX 3090 etc to test, or are these estimates based upon the published specs? Furthermore, if you use a specific CNN and Transformer in the benchmarks, could you cite the models and/or publish them on Github? Thanks for this guide! It’s just confusing to me what the benchmarks represent.
Tim Dettmers says
I added a bit more detail to the benchmarking section now. Let me know if it clarifies the benchmarking a little bit.
Additionally, here is a comment I made on the Hacker News thread which gives you a little more information, in particular regarding transformers:
Frank says
The benchmarking section still doesn’t make it very clear that you didn’t actually test with all these cards. You should bold it or something, and list out which cards you DID test and which are interpolated or extrapolated (in the case of the 30XX, we all know the dangers of extrapolation…).
Tim Dettmers says
That is fair. I will have another look and see if I can make it clearer.
andrea de luca says
Hi Tim. Here are a couple of PSUs for 4×3090.
Both are 2000W Platinum, reliable brands. ATX format (but they are long, check your case clearance).
https://www.fsplifestyle.com/PROP182003192/ (~300Eur)
https://www.super-flower.com.tw/product-data.php?productID=67&lang=en (~370Eur)
Now, I was interested in sticking in a couple of GPUs as you did with yours. May I ask for some practical advice about how to secure them in order to avoid damage? Like you, I’d like to fix one of them to the front vents, and another to the bottom vents.
Another question: How can I make my case dust-proof without compromising the airflow? I’m quite worried about the amount of dust I find upon the fans and the heatsinks of my GPUs.
Tim Dettmers says
Thanks, Andrea! These PSUs look excellent. I will have another look and might update the blog post accordingly.
I tried the regular zip ties that come with the desktop case, but these are too short. I could imagine that using two long zip ties would work quite well to attach the GPUs to someplace in the case. The one GPU in the picture is just lying on the bottom of the desktop (over a vent) and it is unstable, but I also do not move my desktop that much. If I were to move it, I would probably first detach the GPU.
In general, though, the location of vents near GPUs is not that important. I find that GPUs just need some space from each other to run cool enough. The desktop air will heat up to 50C or so, but that is still very cool compared to the GPUs. In my setup in the picture, the Founders Edition cards run at 75C under full load while the blower GPUs throttle slightly at 80-82C, which is still pretty good.
I do not have any good solution for dust-proofing a case. The only solution that works for me is to just clean the desktop every 6 months or so.
andrea de luca says
Thanks, Tim.
I was worried about dust since it is the main factor in wearing out the GPU’s fan bearings. If one invests in a couple of 3090s (which, as you highlighted in your article, can last a few years at the very least), I think it’s better to prevent them from being broken by dust: replacement fans/heatsinks would probably prove impossible to find.
There actually is something on the market, like this:
https://www.silverstonetek.com/product.php?area=en&pid=525
But I don’t like the inverted layout, and the HEPA filters are not washable.
Tim Dettmers says
That is a very good point. I heard this happens a lot in cryptocurrency mining. A friend of mine has a pack of replacement fans for his cryptocurrency mining rigs. Often miners will choose GPUs where you can easily replace the fans.
I think right now it is difficult to say what will work well. I think time will tell what are the most robust cases for RTX 3090s.
andrea de luca says
The thing is, I want it badly, and I already had my mind set on ordering the Founders Edition as soon as Sep 24.
But you are right, alas! It would be better to wait and see what manufacturers spit out.
Thanks!
Alex Dubinsky says
FSP is a bad brand. I bought the 2000W PSU. It was broken from the start. It turned off when putting out even 1000W, possibly because of bad capacitors. The main problem is that FSP’s support is very bad. It’s confusing and difficult to file an RMA. It’s too bad I was past Amazon’s return period. Waste of $400. Better to use 2 PSUs.
Michael Balaban says
Just a heads up:
– The first PSU is only rated for 1500W @ 115-200Vac. 2000W requires 200-240Vac.
– The second PSU is only rated for 2000W at 230V.
As far as I know, there aren’t PSUs from quality companies rated beyond 1600W with 120V input.
Josh Scholar says
Is it possible to have a link to the old version of this article? It had helpful information for people who only have access to older hardware than Ampere.
Tim Dettmers says
That is a good point. I do not want to have two different blog posts online at the same time, but I could add some of the old content back into this blog post. Is there anything missing in particular? I could add some of the older GPUs to the charts. Would that be enough?
Josh Scholar says
I want to see the old charts comparing the cards I can get a hold of right now: the 20 series, the 1080 Ti, the Titan X Pascal, etc.
I would like to see them compared with TPUs.
I like that you used to compare them on 3 or 4 different kinds of tasks.
I was disappointed to see that the old article comparing all of those is gone. Why tell people to buy cards that no one has used yet and cannot buy?
andrea de luca says
I think that he eliminated the old version since the new cards are so superior to the older ones and so cheap (in relative terms) that it doesn’t make sense to buy Turing and Pascal cards anymore. Consider that you’ll probably find a 3070 for less than 500 USD.
For what it’s worth, here is a summary:
1. Pascal cards can do FP16, although with a very modest speedup. However, you will still see your effective VRAM almost doubled (something like an 18-20 GB FP32 equivalent). So the 1080 Ti is still a good card if you are on a budget, but be sure not to pay more than ~250-300 USD/EUR for it.
2. Forget about the other Pascal cards, unless you can get them as a gift.
3. Ampere has completely ruined the used GPU market, so you will probably find former top-notch Turing cards at very modest prices. My guess is that a 2080 Ti, used but in good condition, will be priced somewhere in between the 3070 and the 3080. Its performance would be a bit worse than that of the 3070 (allegedly), but then it has 11 GB. It’s still a capable card. My suggestion is to buy it only under 550-600 USD/EUR.
4. The true champion now is the 3090. With respect to the former 24 GB-class card, it almost doubles the number of CUDA cores and the performance increases by at least 50%, but it costs 1000 USD/EUR LESS than the Titan RTX (in fact in the EU, it costs 1200 EUR less than the Titan). So, if you can afford it, buy it and forget about Pascal and Turing.
Letmos says
It would be nice to have an update of the “GPU for Deep Learning” article that focuses on the brand new Nvidia Ampere graphics cards. Right now we have three models (3070, 3080, 3090), but there are rumors that soon we will also see a 3070 Ti (with 16 GB VRAM) and a 3080 Ti (20 GB VRAM). That sounds interesting and would change a lot in deep learning.
Taako says
Are you going to benchmark the 30xx series cards?
What about the new Big Navi (2x) AMD cards with ROCm?
Also, can you sort your comments reverse-chronologically? Lots of scrolling to get to this comment box and the most recent comments.
Tim Dettmers says
Thanks for the suggestion to reverse the comments! I did not know that this feature was added to WordPress — very useful!
I discuss AMD cards a bit in the new blog post update, but not sure if there is anything reliable yet on Big Navi.
Jake says
The new RTX 3000s specs are officially out: https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/?nvid=nv-int-gfhm-10484#cid=_nv-int-gfhm_en-us
I wonder what your thoughts on them are. Judging only from the specs, could the 3080 be your new tl;dr best-GPU-overall recommendation? Would a 3090 even be worth considering at the price of picking up two 3080s?
Kamil says
Hi Tim. I am a beginner in deep learning but I am thinking about it seriously. I do some data science work and am considering buying a new PC. Unfortunately, Nvidia will launch the 3000 series in two weeks and I am confused about which GPU I should buy. I started with the GeForce 1660 but after some reading about it, I am now considering adding some money to get a GeForce 2070 Windforce 2x. It is the cheapest 2070 I have found and… this could be a problem. I would like to avoid buying a new card after a year, and that is why I have a huge problem with the decision. Maybe it is better to wait 2 weeks and see what Nvidia introduces (although the 3070 and 3060 will appear in Q1 or Q2 of 2021, as I heard), or is this GeForce 2070 Windforce (maybe you have some recommendation about a version of this GPU?) good enough to start with DL? Maybe a 2060 or 2060 Super is good enough and there is no need to pay more for the cheapest version of the 2070? I am not an expert in the GPU market, so I wonder whether prices of the 2060 or 2070 will decrease or increase now.
Samuel Rodriguez says
It is a very informative and nice article.
Tamir says
Hi Tim, great work! very vital information!
1) To what type of RTX 2060 do you refer? 6GB/8GB? Super Version? If you can specify the exact model that would be great.
2) Do you have experience with working with two graphics cards, one basic for regular screen requirements, and another one that will be devoted only for deep learning work?
I tried it today and saw that both graphic devices allocated the same memory for screen processing although only one of them was supposed to allocate it.
3) What do you think is the minimum required memory for someone who buys a GPU today and would like to be able to run the common models?
Thanks a lot,
Tamir
Tim Dettmers says
I would always go for the Super. The Super cards in general are very good in cost/performance. However, if you need more memory you should just go for the max-memory version. There is no use in having a fast GPU if you do not have enough memory to run the models that you want to run.
Vinay says
Hi Tim,
Thanks for your insightful blog post. I am a beginner in deep learning and am planning to buy a new server tower powered with an Nvidia GPU. Could you please suggest the best possible and available system specification with a maximum budget of $4500-$5000, one which won’t require any change in hardware for a decent period of time? I would be obliged if you can help me here. Thanks in advance.
Tim Dettmers says
I would get the cheapest Threadripper v2 and 3-4x 2080 Ti with blower style fans depending on budget.
Vinay says
Thank you for your suggestion Tim.
I almost ended up with an Intel Xeon Silver 4214 processor (12 cores, 16.5 MB cache, 2.2 GHz),
1 x Nvidia GeForce RTX 2080Ti,
64 GB DDR4 ECC memory with hybrid disk.
I guess I will consider your opinion and look for an AMD Ryzen Threadripper 2950x with 2x Nvidia RTX 2080 Ti. I am still not sure whether Windows Server 2016 (or a higher version) / Windows 10 will be compatible with Threadripper, and whether it affects the processing speed.
Ivan says
Could you tell me why the relative performance between word RNN and char RNN (based on your figures) varies so much from card to card?
What parameters of my LSTM should I consider? I have about 20 features and a lookback of around 400.
Tim Dettmers says
I measured LSTM performance for a couple of cards and interpolated between the rest. In your case, 20 features and a sequence dimension of 400 is pretty small, and in terms of cost/performance a small GPU would be pretty optimal for that. Larger GPUs would not be fully utilized but should not be much slower. RTX cards have a different memory hierarchy, which makes LSTMs a bit slower than on GTX cards.
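For reference, a minimal sketch (sizes taken from the question; everything else, including the hidden size and head, is hypothetical) of what such an LSTM looks like in PyTorch; a model this small will indeed leave a big GPU mostly idle:

```python
# Minimal sketch: an LSTM with 20 input features and a lookback of 400.
import torch
import torch.nn as nn

batch, seq_len, n_features, hidden = 32, 400, 20, 128
lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
               num_layers=2, batch_first=True)
head = nn.Linear(hidden, 1)

x = torch.randn(batch, seq_len, n_features)   # (batch, lookback, features)
out, (h_n, c_n) = lstm(x)                     # out: (batch, seq_len, hidden)
prediction = head(out[:, -1, :])              # use the last time step for forecasting
print(prediction.shape)                       # torch.Size([32, 1])
```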
Nicolas says
Hi Tim,
Great blog, quick question. Have you tried more specialised GPU services from companies like Genesis Cloud (genesiscloud.com) or other startups in the area? It seems that they offer different types of GPUs and are very focused on cloud infrastructure for AI with relatively good prices.
Any recommendation in the area would be useful.
Cheers
Tim Dettmers says
I have not tried this. There are many companies that offer these services and it takes a bit too much time to explore these services on my own. If you tried some of the services and have some feedback I would be happy to hear more about it (either email or as a comment here).
Lucas says
I am looking at the mining GPU P106-100, which seems to be the same as a GTX 1060 without a video port.
I don’t know if it is CUDA compatible.
I want to run yolo over darknet. Will it work?
Thanks
Tim Dettmers says
It is CUDA compatible and you should be able to run yolo on it. You might need to downsample the images slightly but it should work smoothly.
Andrea says
Hi Tim,
Thank you very much for your article, it’s just great!
I have one question which I could not find explicitly mentioned in the posts or the article. Is it possible to prototype code leveraging Tensor Core operations using a GTX card? Do you have any reference for this? (I imagine there is some form of emulation of the Tensor Cores for GTX?)
I am asking because I am undecided between two laptops: one with GTX1650Ti and 16:10 aspect ratio (coming XPS 15), and the other RTX2060 but 16:9 (Razer 15).
Thank you again for your time!
Tim Dettmers says
Unfortunately, you cannot prototype Tensor Core code on a GTX card. What you can do is rent an AWS spot instance for test runs and otherwise prototype on a GTX card. If you do it like this, the costs should be minimal.
Paweł says
Hi,
I would like to buy an NVIDIA RTX 2070 GPU, but NVIDIA also offers the RTX 2070 Super. Does it make sense to buy the RTX 2070 Super version for machine learning?
Tim Dettmers says
Yes, buy the RTX 2070 Super instead of the RTX 2070. This blog post is a bit outdated.
Youcef says
Hi Tim,
Thank you for this excellent review, I am using it as a reference!
I am looking to buy a GPU to use for deep learning (computer vision, object detection & NLP using neural networks – RCNN, FCN, YOLO, SSD, CTPN, EAST, etc.).
I am a bit tight on budget. I was looking at the 1060 (6GB) as a minimum; how does it compare to the 1660 for the same purpose? I read in one of your comments above that the 1660 does not have Tensor Cores and may be good for gaming but not for deep learning…
which one do you suggest?
Does the 1060 have Tensor Cores, or is it only the 20xx series that is equipped with them?
Thanx
Tim Dettmers says
I would go with the 1660 if those are your only options. If you have a bit of extra money the RTX 2060 would be much better all-around.
Eslam Haroun says
Hi Tim,
I have GTX 1070.
Is it valid for these applications?
Thanks
Tim Dettmers says
A GTX 1070 is pretty good for these applications. For some models you might need to use some “memory tricks”, but overall you should have few problems; it’s a good GPU.
Eslam Haroun says
Some people said it will be suitable for prediction but not training.
Is this right?
Thanks
Tim Dettmers says
A GTX 1070 is pretty good for both prediction and training.
Fordjour K. says
Tim,
Is it a good idea to combine an RTX 2080 Ti Founders Edition with other GPUs like the RTX 2070 or RTX 2080 Super if you want to build a 3-GPU machine?
Will the computing power for ML and NLP models decline?
Thanks.
Tim Dettmers says
That works without any problem. However, you will not be able to parallelize models across different types of GPUs.
andrea de luca says
Hmm… I think he won’t even be able to parallelize *data* across different models of GPUs.
andrea de luca says
Tim, I need to connect my GPU (I’ll buy a Titan RTX soon) to different hosts.
What do you think about an external box like the Razer Core X? Will it work over Thunderbolt? The speed corresponds to PCIe x4 gen3… I don’t expect dramatic performance losses. Am I right? Thanks
Tim Dettmers says
The performance should be okay in most circumstances. Performance will be the same if you can transfer the dataset to the GPU before training. It will be also good for most NLP models. For computer vision you might see a drop of about 20-40% in performance depending on image size (the larger the image the worse the performance drop). However, for the performance drop you still get an excellent cost/performance since laptop GPUs are very expensive and weak and desktop GPUs require a full new desktop. As such, I think this is a very reasonable setup and while things are a bit slower you should be able to run any model which is very handy.
andrea de luca says
Thank you. I think I’ll go that way then…!
Matthew says
Another “would you rather” question:
Would you rather have a 1070 Ti or a 1660 Super? The 1660 Super’s GDDR6 memory greatly increases bandwidth, but it only comes with 6GB of memory vs 8GB for the 1070 Ti.
I find the prices for the two cards to be quite similar (new vs used), so that isn’t a driving issue at the moment.
Tim Dettmers says
I would definitely go for a 1660 Super in terms of performance. For the memory, it highly depends on what you are planning to do with it. If you just want to play around and test things, you can get most networks to work if you use gradient accumulation and full 16-bit computation (16-bit weights), but this requires more coding. If that is okay, the 1660 Super is right for you.
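For readers wondering what “gradient accumulation plus 16-bit” looks like in code, here is a minimal sketch (my own illustration; the toy model, batch sizes, and the use of PyTorch’s built-in torch.cuda.amp rather than the Apex route common at the time are assumptions):

```python
# Minimal sketch: gradient accumulation simulates a batch of
# micro_batch * accum_steps while only holding one micro-batch of
# activations in memory; autocast runs the math in 16-bit where safe.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

accum_steps = 4                                      # effective batch = 8 * 4 = 32
optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 128, device="cuda")           # small micro-batch
    y = torch.randint(0, 10, (8,), device="cuda")
    with torch.cuda.amp.autocast():                  # 16-bit compute where safe
        loss = criterion(model(x), y) / accum_steps  # scale loss for accumulation
    scaler.scale(loss).backward()                    # gradients accumulate in the buffers
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```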
Shafeez says
Hi Tim,
Thanks for the advice earlier. I might need a rig before the RTX 3000 series is released it seems like.
I came across an article on lambda labs on choosing a GPU for deep learning.
I am not sure if the person that wrote the article was using mixed precision for the RTX cards.
It seems like 2070 can do a lot! After scouring thru some of the comments here, I think I might settle for a dual RTX 2070 Super instead of a 2080TI. I think getting accustomed to working with data parallelism would increase my job prospects. I feel like that is one of the big reasons dual 2070 sounds like a better choice.
You mentioned that 2070S can offer up to 12GB of memory if I use mixed precision.
How much VRAM can I really squeeze out, out of a 2070 8GB if I were to use smaller batch size and gradient accumulation, on top using FP16?
Thank you.
Tim Dettmers says
It is reasonable that you can squeeze out an equivalent of about 24 GB. If you use gradient accumulation, then it is just a question of performance. If a model fits into the GPUs at batch size 1, you can train it, but it will be awfully slow. For good speed you want a batch size of at least 16-32 in most cases. I would not want to train transformers on two 2070S, but for computer vision models it might be a good fit.
Sergej says
I recently started with NLP and BERT-based models (PyTorch). Unfortunately my GTX 1070 8GB runs out of memory… I am thinking of getting a K80 with 24GB or a M6000 with 24GB. Which one would you recommend? Or should I get another card? I will definitely need more than 16GB of memory.
Thanks for this great article, it helped me a lot.
Tim Dettmers says
The K80 and M6000 will be quite slow. I would recommend getting a Titan RTX with 24 GB of memory. If that is too expensive, I would definitely go for the M6000. You can also think about multiple RTX 2080 Ti cards and parallel training. That will reduce the memory footprint slightly, especially if you use FP16 training. If you use mixed FP16 training it reduces the memory footprint by 25%; if you use pure FP16 training via Apex it reduces the footprint by 50%. Using 2 GPUs should decrease the footprint by about 20-30%. So 2x RTX 2080 Ti with pure FP16 training is roughly equivalent to 11/0.75/0.5 = 29 GB used by the K80 or M6000, but you train much, much faster.
Sergej says
Thank you!
Shafeez says
Hi Tim,
So the rumor is that the 3080 Ti could have VRAM anywhere between 12GB and 16GB. 12GB sounds very possible to me, but what is your opinion on it being 16GB? Would Nvidia really take a leap and add literally 5 more gigs over its predecessor, the 2080 Ti with 11GB?
I do ML for research. I will finish my undergraduate in Computer Science in 2 months. I might do grad school in a year or so, until then I want to carry on independent research – reading and implementing research papers.
I have read some research papers and have implemented some, and I realized memory is very important, especially after playing around with implementations of Mask R-CNN, and other object detection APIs.
I am trying to weigh in the option of waiting for the 3080TI or just buy the 2080TI now.
What is your opinion to the following question:
Is it better to wait for the 3080TI, especially if 3080TI will have 16gb of memory and 7nm architecture for the same price as 2080TI?
Thanks.
Tim Dettmers says
I think the best strategy for NVIDIA is to keep the RAM low so that deep learning researchers are forced to buy the more expensive GPUs. I would be surprised if the new GPU had 16 GB of RAM, but it might be possible.
Regarding an RTX 2080 Ti now vs. waiting for the RTX 3080 Ti: you are very right that 16 GB vs 12 GB would make a huge difference, and I would wait a bit longer until it is confirmed to be 16 GB. The performance gain is probably not that great. Tensor Core performance is better and that is where the vastly improved max FLOPS comes from, but in practice matrix multiplication is bandwidth-bound and will see little increase in speed. Convolutions might be a bit faster, but they will also likely be bandwidth-bound now. I would guess we will see around a 10-20% performance gain. Together with 16 GB of memory this would be a great and long-lasting GPU. If it is 12 GB it might be worth buying a used RTX 2080 Ti instead.
Aleksandr says
Hi Tim,
thanks for your posts. They together with comment sections helped me quite a lot to make up my mind about my new PC configuration. I decided that the best setup for me would be dual RTX 2070S + Ryzen 3700x. The problem is that cards with the best air cooling are almost 3 slots wide, so I’d need a 4 slot distance between them and the only motherboard I could find that has such distance and that allows to run both GPUs at x8 PCIe lanes is MSI MEG X570 GODLIKE which costs like HELL. There are a handful of cheaper motherboards with 4 slot spacing that can run in a dual GPU mode at PCIe 4.0 x16 / x4 (like ASRock X570 Steel Legend). I know that you recommend to have at least 8 lanes per GPU but that recommendation was for PCIe 3.0. Could we say that 4 lanes of PCIe 4.0 are roughly equivalent to 8 lanes PCIe 3.0 in terms of RTX 2070S performance and won’t cause any bottlenecks? If we just compare the bandwidths it should be fine but since RTX 2070S doesn’t support PCIe 4.0 I’m not sure how it will respond to 4 lanes and whether it would be able to utilize their full potential.
Tim Dettmers says
4x PCIe 4.0 lanes are equivalent to 8 PCIe 3.0 lanes, but I think you are creating artificial problems here. If you only have two GPUs you can easily get away with 2-wide GPUs for excellent cooling (as long as they are not directly next to each other). Otherwise, going with a different CPU-motherboard combo might be cheaper and will not reduce performance by much. If you want to be ready for PCIe 4.0 GPUs and want to keep the computer for many years, though, it makes sense to go with the CPU-motherboard combo that you selected.
Tania Farzana Keya says
Hi Tim,
Thanks for your hard work. This article saved a lot of time for me. I had some doubts.
You suggested,
“Start with an RTX 3070. If you are still serious after 6-9 months, sell your RTX 3070 and buy 4x RTX 3080. ”
According to Nvidia, the RTX 3080 doesn’t support NVLink. Is it possible to use 4x RTX 3080 for training (officially)?
Tim Dettmers says
Yes, using 4x RTX 3080 in parallel will be no problem since they can still communicate through PCIe with pretty respectable speeds — especially on a PCIe 4.0 board.
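For completeness, here is a minimal sketch of what data-parallel training over PCIe looks like with PyTorch’s DistributedDataParallel (the toy model, port, and hyperparameters are placeholders; this is not from the original post):

```python
# Minimal sketch: one process per GPU; gradients are all-reduced after
# backward() over PCIe (via NCCL), so no NVLink is required.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(128, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    for _ in range(10):                               # toy training loop
        x = torch.randn(64, 128, device=f"cuda:{rank}")
        y = torch.randint(0, 10, (64,), device=f"cuda:{rank}")
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()                               # gradient sync happens here
        optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4                                    # e.g., 4x RTX 3080
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```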
Tania Farzana Keya says
Thanks for replying, Tim. If possible, can you please write a detailed article on multi-GPU setups, including the training environment setup? It would be very helpful for students from underdeveloped & developing countries.
Ricardo Cruz says
Hi Tim,
This comparison benefits a lot of people, thank you for that!
A small request: Would it be possible to share the raw numbers behind Figure 2? It’s a little hard looking at the chart and it would allow us to more easily do our own cost comparisons. It could also potentially allow us to fortify your analysis; for example, by testing for correlations between performance and cuda cores and things like that.
Thank you!
Tim Dettmers says
This is a great idea! The next generation of GPUs will soon be released and for the next update of this blog post I will also publish the raw data.
Mark Hanslip says
Hi Tim,
Thanks for the post, super helpful. Apologies if this has been answered elsewhere, but can you tell me why the GTX cards perform so much better than RTX on WordRNN?
Best wishes,
Mark
Tim Dettmers says
I either made a benchmarking mistake or it has to do with the shared memory architecture of GTX GPUs being different. Since they have more shared memory, some algorithms that depend on very memory-intensive access patterns distributed in small pieces can have benefits on a GTX GPU.
Dong says
Hi Tim,
Thanks so much for this helpful blog post (and the other one about CPUs etc.). I’m thinking about buying RTX 8000s (I’ll start with two; alternatively I’m considering RTX Titans). From a company I received the following suggestion to go with the two RTX 8000s:
CPU: 2 x intel Xeon Silver 4110
RAM: 12 x 32 GB DDR4-2666MHz 2Rx4 ECC reg.
SSD: 2 x 480 GB 2,5” SATA 6Gb/s S4510 TLC
HDD: 2 x 1 TB Seagate ST1000NX0303
In particular: (1) the RAM strikes me as unnecessarily high (?); (2) I’m also curious to hear your thoughts regarding the SSD/HDD specs and what you would recommend; (3) does the CPU seem OK?
The goal is NLP research, and hopefully this should be useable for the next few years.
Tim Dettmers says
The hardware seems a little bit overkill compared to the GPUs. However, if you add more RTX 8000s over time this can be a pretty good build, and the memory will help you a lot if you are training big transformers. The RTX 8000 will get cheaper once the next generation of GPUs is released in a couple of months. The CPUs are great and the RAM amount is rather standard for servers (server RAM is usually much cheaper than consumer RAM). You can ask them to lower the RAM amount, but you would probably not save much money. If you want to save money I recommend a desktop with a Threadripper 2 and 4x RTX Titans with extenders (otherwise they run too hot). The 24 GB is often enough even for very big transformers.
andrea de luca says
I would not be so sure about high-end cards getting cheaper… Look, for example, at the GP100 (the Pascal top dog). Even now, it is not any cheaper…
Pascal says
Hi Tim,
Thanks for this great article. It is just what I was looking for and would like to follow through with a few questions.
Our research group is looking to buy a GPU to speed up data analysis. We mostly deal with genetic data, so we are talking about large data in the TBs. The main challenge has been the time it takes to run large models (including ML) and imputations, which can run into weeks on the CPU-based local server, hence the need to try out a GPU workstation. I looked up NVIDIA’s product line for data science and saw they promote the Quadro RTX cards and have CPU recommendations to go with either a single or dual GPU. See below.
“Single GPU – six-core Intel Xeon W-2135 CPU with a base clock speed of 3.7GHz and turbo frequency of 4.5GHz. At least 128GB of RAM.
Dual GPUs- dual Intel Xeon Silver 4110 CPUs, each with eight cores, a base clock speed of 2.1GHz, and a turbo frequency of 3GHz. At least 192GB of RAM.”
Going through your blog I see that you recommend the RTX 2070. I also came across another article which said that the GV100 provides hardware support for 64-bit fp while the RTX cards only go up to 32-bit. I was considering either the RTX 8000 or 6000 or the GV100 and wondering whether, for our work, having dual GPUs would have any added benefit. I am also wondering if the RTX 2070 would suffice. What would you recommend?
Thanks.
Tim Dettmers says
Hi Pascal,
thank you, this is an interesting request. For genetic data the important bit is how you load the dataset and how much of the data/model is held on the GPU at any time. What you can do without any problem is stream data from disk (preferably an NVMe SSD RAID 0) to the GPU. Depending on how much RAM you need, I would recommend the RTX 8000 or 6000 if you need 64-bit fp. Otherwise, the RTX Titan runs well, but it can have cooling problems. If you can run your program on 11 GB of memory and you only need 32-bit fp, then the RTX 2080 Ti is the GPU to go for! I would not go down to the RTX 2070 because it offers too little memory. You can definitely use 2 GPUs. I would use 2x RTX Titan (because 4x is difficult to cool) or 4x RTX 2080 Ti with blower fans. If you buy RTX 6000s or 8000s they usually come in servers with good cooling, so you can get 4 or 8 GPUs in a server. Let me know if you have more questions!
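As an illustration of the streaming idea, here is a minimal sketch (the file pattern, sample format, and loader settings are all hypothetical) of lazily reading samples from an NVMe drive and copying them to the GPU batch by batch:

```python
# Minimal sketch: lazy loading means samples are read from disk by worker
# processes and streamed to the GPU, so the full dataset never has to fit
# in RAM or VRAM at once.
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DiskStreamDataset(Dataset):
    def __init__(self, pattern):
        self.files = sorted(glob.glob(pattern))
    def __len__(self):
        return len(self.files)
    def __getitem__(self, idx):
        x = np.load(self.files[idx])              # read one sample from disk
        return torch.from_numpy(x).float()

loader = DataLoader(
    DiskStreamDataset("/data/genomics/*.npy"),    # hypothetical path/layout
    batch_size=64,
    num_workers=8,        # parallel reads keep the NVMe drive busy
    pin_memory=True,      # faster host-to-GPU copies
)

for batch in loader:
    batch = batch.cuda(non_blocking=True)         # overlap copy with compute
    # ... run the model on `batch` ...
```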
andrea de luca says
Tim, are you sure about FP64 capabilities? AFAIK all the Turing cards are fp64-castrated, and you require Volta for high precision fp arithmetic (minimum titan V or GV100)..
Shane says
Hi Tim, thanks for such in-depth reviews, very helpful.
I am looking to upgrade a machine I have for ML / DL purposes, with the end goal of being able to host one or two trained networks to provide inferences as part of a computer vision service while still having computing resources available for further training / testing. The machine is a dual Xeon dell R720, so I can fit two full size GPUs, including the passively cooled Tesla series….
Now, I know you recommend steering clear of Tesla cards due to price and the likely complications of external cooling solutions for consumers, but 24GB K80s are currently ~$300 and I already have the server infrastructure. I am considering buying two of them. This way, as each K80 has two GPUs with 12GB each onboard, I can host two nets on one of the cards, each net using up to 12GB dedicated, and still have two GPUs (one K80) for further testing/training. When operating in conjunction, it seems using both GPUs on a K80 gives the performance of one M60 or one 1080 Ti… but the benefit over either of those is that I can host 2x the nets/models for inference as a service.
Does this make sense? Can you poke holes in my logic of going this route over 2 x 1080 Ti or 2 x 2080?
Thanks!
Tim Dettmers says
A K80 for $300 is a pretty good price and it might well be worth it. They will be a bit slow, but if you do startup-type inference it will be more than sufficient, and with the extra memory you can do a lot of tricks / multiple models, which is great for startups. Otherwise I would probably recommend a GTX 1080 Ti for inference rather than an RTX 2080 due to the additional memory.
Andrey says
Hi Tim, thanks for this job!
Can you tell me what bandwidth the K80 has between the chips? And can I host one net using up to 24GB on a K80?
Tim Dettmers says
I am not sure if there are good numbers for the bandwidth between K80 chips. I remember that with old dual-GPU cards the bandwidth was better than PCIe 3.0, but I do not know the exact numbers. I think to combine the memories you need to use model-parallel code. I do not think you can combine them in a non-programmatic way.
Vicky Patel says
Hi Tim,
Thanks for this wonderful post.
I am a deep learning beginner on a tight budget.
Can you please suggest which I should choose between:
a GTX 1060 6GB or a GTX 1650 Super 4GB, as both are available at the same price to me.
Tim Dettmers says
I would go for the GTX 1060.
Aster says
Tim,
First things first… as many others have stated, thanks for taking the time to write/blog about your experiences, advice, etc.
Can I ask why you suggested the GTX 1060 6GB over the GTX 1650 Super 4GB? From NVIDIA’s website, looking at the specs…
GTX 1060 6GB
Compute Capability: 6.1
NVIDIA CUDA Cores: 1280
Base Clock (MHz): 1506
Memory Speed: 8 Gbps
Standard Memory Config: 6 GB GDDR5/X
Memory Interface Width: 192-bit
Memory Bandwidth (GB/sec): 192
Bus Support: PCIe 3.0
Graphics Card Power (W): 120 W
Recommended System Power (W): 400
Supplementary Power Connectors: 6-Pin
GTX 1650 Super 4GB
Compute Capability: 7.5
NVIDIA CUDA Cores: 1280
Base Clock (MHz): 1530
Memory Speed: 12 Gbps
Standard Memory Config: 4GB GDDR6
Memory Interface Width: 128-bit
Memory Bandwidth (GB/sec): 192
Bus Support: PCIe 3.0
Graphics Card Power (W): 100
Recommended System Power (W): 350
Supplementary Power Connectors: 6-Pin
… as far as what I have been able to gather, neither of these have Tensor cores.
Did you suggest the GTX 1060 because it has 6GB vs 4GB on the GTX 1650?
So in general, is it better to have more RAM than higher compute capability?
For the sake of argument, let’s say the GTX 1650 had 6GB; would you then suggest the GTX 1650 over the GTX 1060?
Thanks.
Aster
Tim Dettmers says
Hi Aster,
yes, it is mostly because of the memory. 6 GB is already pretty small. I guess for some applications, for example, to get started with deep learning, or to use a GPU for a class, the GTX 1650 would make a lot of sense. If it had more memory, I would recommend it instead of the GTX 1060.
Another reason is availability. In some countries, such as India, GPU supply can be quite erratic and older GPUs are often easier to find.
Hannes Zietsman says
There are more affordable alternatives to AWS and Google now. For example, https://vast.ai, which is a GPU-sharing platform offering anything from multiple RTX 2080 Tis to one or two GTX 1070s, and even some Teslas. You will need some knowledge of running your task in a Docker container.
Tim Dettmers says
I have to look into this. Thanks for sharing.
Filipp says
vast.ai is a fraud. Don’t use their service. I made a single one-time purchase at https://vast.ai/, but they continue to charge money to my card and refuse to remove my card from their system.
Behnam says
Thank you for your great explanation.
Nvidia, Gigabyte, MSI, and PNY produce the GeForce RTX 2080 Ti, and the processor manufacturer for three of them is Nvidia.
Is there a big difference between these? Which of them do you suggest?
Tim Dettmers says
No, there is no real difference. Take the one with the best cooler or the cheapest one.
GK says
Hi Tim,
Thank you so much for your post. You’ve mentioned in the main text that it’s better to have the same GPU types. What about different manufacturers? I am considering buying one GPU now, and plan to buy another later. Does it matter if I have one GigaByte RTX 3090 and one MSI RTX 3090?
Also, how about same manufacturer and model, but different specs? e.g. one GigaByte RTX 3090 Xtreme, and one GigaByte RTX 3090 Master.
Regards,
GK
Tim Dettmers says
Hi GK,
different manufacturers are fine and should not put you at any disadvantage. While overclocked GPUs are not much faster than normal GPUs, note that when parallelized, the GPUs will only perform as fast as the slowest GPU. But usually that does not matter, and I would just go for the cheapest GPUs that you can find.
Jörg says
Hi Tim,
I really appreciate your posts about ML hardware. I’m currently building an ML “workstation” based on the AMD X570 chipset and AM4 socket. I read that I should not think that much about PCIe lanes, but in fact I do because I am thinking about buying
(a) 2 * GeForce RTX 2080 SUPER
or
(b) 1 GeForce RTX 2080 Ti
Both options currently cost the same. The Ryzen CPUs have 24 PCIe lanes – 4 are connected to the X570 chipset. Of the remaining 20 lanes, 4 are connected to the NVMe slot. Hence the boards have the following options:
– run 1 GPU with full x16 speed
– run 2 GPUs x8/x8
– run 3 GPU s x8/x8/x4
My ML interests are mixed – but I think most of the stuff that runs on the GPU will be computer vision and not that much NLP. (Other ML stuff I experiment with uses scikit-learn, e.g. SVMs, which does not run on the GPU anyway, and because of that I include 64 GB RAM on the board – the maximum of 128 GB is much too expensive currently.)
My instinct currently tells me to start with 1 RTX 2080 Super and use 16-bit. If that does not work, buy a second RTX 2080 Super.
By the way, regarding memory, both cards do support unified memory using NVLink. I wonder whether, by using NVLink, it would be possible to run the network on one card but use the combined memory of the second (and the host’s memory)? (https://devblogs.nvidia.com/how-nvlink-will-enable-faster-easier-multi-gpu-computing/)
Regards
Jörg
Tim Dettmers says
If you want to combine the memory you need to use NVLink and model parallelism, which is not usually used. The x8/x8 is great for your use case. x8/x8/x4 is also fine, but make sure your motherboard supports this. 8 GB of memory on the RTX 2080 Super is sufficient if you use some memory tricks like gradient accumulation.
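For readers curious what model parallelism means in practice, here is a minimal, naive sketch (toy layer sizes; my own illustration, not from the original discussion) that splits one network across two GPUs so its memory use is divided between them:

```python
# Minimal sketch: naive model parallelism; each half of the network lives on
# its own GPU, and activations cross the PCIe/NVLink bus between the halves.
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")
    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # activations move between devices here

model = TwoGPUNet()
x = torch.randn(32, 1024)
out = model(x)                              # output lives on cuda:1
loss = out.sum()
loss.backward()                             # autograd routes gradients back across devices
```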
andrea de luca says
Would NVLink allow memory pooling even on consumer cards?
Jörg says
Thank you for the reply. I decided to use two cards too. My motherboard supports all these combinations. And doing 16-bit calculations will, I think, help to overcome the “small” memory size.
In addition, PyTorch is now able to do model parallelism:
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
Looking forward to testing that as well.
Waldemar says
Great article. Thank you so much. I went through the article and comments but couldn’t find the answer to this question:
why do you advise avoiding Founders Edition cards? What are the downsides?
I am new to ML and am planning to replace my GTX 1060 3GB anyway. Used GTX 1080/1080 Ti FE cards are easily accessible for reasonable prices.
It is easy to find watercooling kits for FE cards and they are cheaper.
Thank you so much for your time.
Tim Dettmers says
The RTX FE cards had major cooling problems, and usually FE cards are a bit more expensive at no real performance gain. If you get a cheap GTX 1000-series FE card, that is a pretty good deal though.
Lucas says
I believe that does not apply to the RTX 30 series anymore, as they totally redesigned the cooling of those cards and the FE cards are actually cheaper than the others (at least at MSRP). It might be a good update for this article.
Sandeep says
Hi Tim,
I got this advice from a vendor of GPU systems. I was arguing that for text models, FP16 and an RTX 2080 Ti are good enough and comparable to a V100. Also, in their benchmarking they did not test the RTX with NVLink, but the V100 was tested with FP16. I got this response. I just wanted to check whether NVLink is of no use when using the RTX 2080 Ti. Please advise; your input is much appreciated here as I would use it for my next purchase.
“Quadro series cards like the RTX 8000 and RTX 6000 have much more vRAM (respectively 48 GB and 24 GB) than does the RTX 2080 Ti (11 GB). Hence you can train much bigger networks on the RTX 6000, RTX 8000, and Titan RTX (24 GB vRAM) that you can on the RTX 2080 Ti. In terms of the number of GPU CUDA cores though, they are all very similar.
Quadro series GPUs scale much better in the sense that the advantage of the 8x RTX 6000 over 8x RTX 2080 Ti is disproportionately larger than the advantage of 2x RTX 6000 over 2x RTX 2080 Ti for multi-GPU training.
There are two main reasons for this,
First is peering.
GeForce cards, like the RTX 2080 Ti and Titan RTX, cannot peer. This means that if they need to communicate, they have to go through the CPU. With many GPUs, the CPU can become a bottleneck for communication with all of the GPUs trying to communicate with it at the same time. This is especially true when PLX switching is necessary to allocate the existing PCIe lane bandwidth amongst the GPUs. Hence, for multi-GPU training, GeForce cards do not scale very well because of this.
Quadro series GPUs, like the RTX 8000 and the RTX 6000 can peer. This means that they can communicate directly over PCIe, and not have to go through the CPU.
The second reason is NVLink.
For more than 2x GPUs, GeForce cards are not NVLinked. This is because for GeForce GPUs (which includes the Titan RTX and 2080 Ti), the physical NVLink bridge is either 3 or 4 slots wide. If it was 2 slots wide, at least the GPUs could be connected in pairs, but unfortunately that is not the case.
NVLink for Quadro series GPUs (like the RTX 6000 and the RTX 8000) is 2 slots wide. This means that the GPUs can be connected with it in pairs, which further enhances the available communication bandwidth between the cards in each pair.
That being said, in terms of performance/dollar, we very much recommend the GeForce cards, as the Quadro cards are much more expensive.”
Tim Dettmers says
I would say that analysis is very much on point. Quadro cards are more expensive, but also yield better parallel performance and if you train large models like transformers, the extra memory will also give you huge performance gains. So they can make sense in some cases, but their cost/performance is not ideal for many applications.
Yufeng says
Hi Tim,
This is very helpful — thank you for spending the time to help people like us.
I’m curious about whether you have any experience in double-precision computation. Say I have a logistic regression that I would like to estimate by maximum likelihood, but I’m interested in estimating the parameters precisely (rather than training a neural network for prediction). Could I still stick to FP32 or do I need to move over to FP64? This has a big impact on which hardware I choose.
Thanks in advance.
Yufeng
Tim Dettmers says
The question here is: how precise is “precisely”? How many significant digits do you need? A 32-bit float is accurate to about 7 significant digits, and a 64-bit float to about 16.
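A quick way to see those digit counts for yourself (a small illustrative snippet, not from the original reply):

```python
# Minimal sketch: how many significant digits survive in each precision.
import numpy as np

x64 = np.float64(1.0) + np.float64(1e-12)   # the small term survives in float64
x32 = np.float32(1.0) + np.float32(1e-12)   # the small term is lost in float32

print(f"{x64:.17f}")   # ~1.000000000001 (roughly 16 significant digits)
print(f"{x32:.17f}")   # 1.0 exactly     (roughly 7 significant digits)
print(np.finfo(np.float32).eps, np.finfo(np.float64).eps)  # machine epsilon of each
```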
Ken says
Can VRAM be pooled on Linux with NVLinked 2080 Supers or 2080 Tis?
Do AVX-512 and the MKL libraries still matter? I’m looking at building a Threadripper 3960x Linux box.
andrea de luca says
I don’t know about VRAM pooling on consumer cards, but I’d wager you won’t be able to do it. NVLink for consumer cards is very different from the one you find on the Titans, Teslas, and Quadros (RTX).
Yes, MKL is still very important for the preprocessing phases (data augmentation, mostly), but Zen 2 is good at it, in contrast to Zen and Zen+.
lihan says
Thank you for your valuable insights, Tim! I bought a 2080 Ti recently, and I’d like to ask 2 questions about the Apex amp framework (a lot of folks are encountering obstacle 1, as reflected in the repo issues, without satisfactory answers).
1. I am trying to do 16-bit training on a transformer with mostly nn.LayerNorms. Is that worthwhile? When I load a partly trained model, my loaded model returns a NaN loss (this does not occur in normal FP32 training). This leads amp.scale_loss to complain about gradient overflow until the scale approaches 0. Have you ever encountered this?
2. Can an FP32-trained model be converted to FP16 training without penalty? What are the drawbacks?
Thank you and continue the good work, please!
Tim Dettmers says
1. I encountered it before. What often helps is to start with a smaller warmup learning rate.
2. There should be little to no penalty in accuracy/perplexity/loss when you convert a model from fp32 to fp16. The problems usually stem from training and not prediction, so once a model is trained you probably will not lose much predictive performance.
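For point 1, here is a minimal sketch of a linear warmup schedule (the warmup length, base learning rate, and toy model are illustrative), which is the usual first remedy for early fp16 overflow/NaN issues:

```python
# Minimal sketch: ramp the learning rate linearly from ~0 to the base LR
# over warmup_steps so early fp16 gradients do not overflow.
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 4000
def warmup(step):                      # multiplier applied to the base LR
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)

for step in range(10):                 # inside the training loop:
    x = torch.randn(8, 512, device="cuda")
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                   # LR ramps up toward 1e-4 over warmup_steps
```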
Aditya says
Hi Tim,
Thanks for the wonderful post. It was really helpful in picking out a GPU. I have decided to go with RTX 2070 Super.
I would appreciate your input on picking a motherboard and a CPU.
I was initially looking at the latest Ryzen 5 cpu which costs around $200 and a motherboard which would support SLI, also around $200.
However, I noticed that there is a sale on 1st-generation Threadripper CPUs on Amazon. I can get one for almost $140. The biggest plus is that it has 64 PCIe lanes, but the biggest con is that the motherboard for it costs around $320. I know PCIe lanes are not the most important factor, so I was wondering what you think I should go with.
Tim Dettmers says
I think either option is a good choice. The Threadripper is a bit more powerful, but it is also $60 more expensive. I would take another look at the motherboards. I would base the decision on the expansion slots for NVMe SSDs and the number of supported GPUs. I believe a Threadripper board would allow for further expansion in the future and can usually support 3x NVMe SSDs without a problem. If you are eyeing future expansion, the Threadripper might be the better choice. Otherwise, the Ryzen 5 setup is just a bit cheaper, and why buy something more expensive that you do not need?
Eric Bohn says
For someone just starting out, how long does it typically take before they outgrow a single GPU and want to move toward a multi-GPU setup? Assume they work at it 10-20 hours per week on a variety of use cases.
Juan says
Hello, thanks, great post.
Would it be better to have 4x RTX 2080 Ti or 2x RTX Titan to work with Faster R-CNN/SSD/RetinaNet on images of 5472×3648 px, where some objects are only 50 px?
Thanks
Tim Dettmers says
5472×3648 px is very large! Even if you break the images down into 50×50 tiles, you will probably still gain a lot of ease from having larger GPU memory to work with this kind of data. I would recommend 2x RTX Titan for this kind of work.
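For what it’s worth, if you do end up tiling such images, a rough sketch of cutting them into overlapping crops could look like the following (tile and overlap sizes are illustrative, not a recommendation; the overlap is there so small ~50 px objects are less likely to be split at every border):

```python
import numpy as np

def tile_offsets(full, tile, stride):
    """Start offsets along one axis; the last tile is shifted so it touches the edge."""
    offs = list(range(0, max(full - tile, 0) + 1, stride))
    if offs[-1] != max(full - tile, 0):
        offs.append(max(full - tile, 0))
    return offs

def tile_image(image, tile=1024, overlap=128):
    """Cut an H x W x C image into overlapping tile-sized crops."""
    h, w = image.shape[:2]
    stride = tile - overlap
    return [((y, x), image[y:y + tile, x:x + tile])
            for y in tile_offsets(h, tile, stride)
            for x in tile_offsets(w, tile, stride)]

image = np.zeros((3648, 5472, 3), dtype=np.uint8)  # dummy image at the resolution in question
crops = tile_image(image)
print(len(crops), crops[0][1].shape)               # 24 crops of shape (1024, 1024, 3)
```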
Erick says
I’m trying to decide on an ML environment for my work. One question that pops up is: you seem to discourage Tesla cards, but according to https://cloud.google.com/compute/all Google only offers Tesla cards; why would that be?
Tim Dettmers says
NVIDIA has a policy of selling only Tesla cards, not consumer GPUs, to companies. That is the main reason, but there are also other reasons that involve a lot of details.
Joren vanGoethem says
My AI teacher (I just started at uni) prefers the 2070 Super because of its not much higher cost but significant performance increase. Could you possibly add the Super cards to the charts?
Tim Dettmers says
I might do so over the Christmas period. I do not have time for that right now.
andrea de luca says
We can guess that the 2070 Super will perform some 5% under the 2080 non-Super, at the cost of a regular 2070 (~$500).
Go for it if 8 GB is not an issue. Go for a blower model.
Recently I stacked three 2060 Supers (blower) for a bit more than $1000, and they are not bad. Almost any task (including transformers) can be parallelized across them, and you get 24 GB of VRAM in total, which is quite something.
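For reference, a minimal sketch of the kind of data parallelism meant here, using PyTorch’s nn.DataParallel (the model and tensor sizes are placeholders, not a benchmark setup):

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # e.g. three 2060 Supers -> 3-way data parallelism
model = model.cuda()

x = torch.randn(96, 1024).cuda()     # the batch is split evenly across the visible GPUs
out = model(x)                       # each GPU runs a full model copy on its slice
print(out.shape)                     # torch.Size([96, 10])
```

Note that each GPU holds a full copy of the model, so the combined 24 GB mostly buys you larger batches; it is not a single pooled 24 GB memory space for one huge model.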
Shahid Siddiqui says
Dear Tim! I was benchmarking two laptops for deep learning: 1) Core i7-8750H, 16 GB 2666 MHz, GTX 1070, NVMe SSD, Ubuntu 16.04; 2) Core i5-8300H, 8 GB 2666 MHz, GTX 1050, NVMe SSD, Windows 10. Running a simple two-layer network on CIFAR using only the CPUs, 1) had an average epoch time of 80s and 2) had an average epoch time of 180s. But when training on the GPUs, 1) gave an average of 88s while 2) gave only 70s. I made sure both PyTorch environments are exactly the same. What could be other possible reasons? Thank you
Tim Dettmers says
These are strange results. Can you try “OMP_NUM_THREADS=1 python …” for your scripts and see if the performance changes?
Shahid Siddiqui says
Sorry Tim, I freaked out too soon. It turned out that PyTorch was unable to use the 1070 GPU on Linux and was giving me CPU numbers. Anyway, I trained a VGG on both: the i5-8300H CPU takes 737s per epoch while the i7-8750H takes 263s. When using the GPUs, the 1050 took 31s while the 1070 took just 13s. Thank you for the prompt response.
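For anyone hitting the same issue, a quick sanity check (a minimal sketch) that PyTorch is actually seeing and using the GPU:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())           # False here means you are really timing the CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "GeForce GTX 1070"
    x = torch.randn(8, 8).cuda()           # fails loudly if the driver/CUDA setup is broken
    print(x.device)                        # cuda:0
```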
Jo says
If we really do want to use AMD GPUs, which ones are better than others for deep learning? I tried Googling this, but there is just too little useful information to be found.
Tim Dettmers says
It is difficult to say because, as you already said, there are too few reliable benchmarks. Going with the most recent model that fits your budget is probably the right call. In terms of GPU memory, the requirements are the same for AMD and NVIDIA GPUs, so for state-of-the-art models you want at least 11 GB. If you want to train big transformers, you want more. If you want to do Kaggle competitions, less is okay.
andrea de luca says
The Radeon VII seems to be the only viable option. It is the only one with 16 GB of VRAM; all the others are slow and limited in terms of memory.
Note also that Navi GPUs (the 5700 series) are not supported by ROCm.
Miguel says
Greetings. By “Navi GPUs (5700)”, do you mean the RX 5700 XT, for instance? So those are not compatible with ROCm?
Sorry, I am not that much of a connoisseur…
Will Stewart says
I am seeking to purchase a home computer for both general use and deep learning.
Criteria;
1. Mobility is highly preferred, e.g., laptop
2. Price is important, e.g., nothing like “the sky is the limit”
3. Energy efficiency is a goal. If I’m not using it specifically for deep learning, I’d greatly prefer to have a low consumption computer (for a low carbon footprint).
4. I’ll be happy to start with Win 10, though I will undoubtedly dual boot it to Ubuntu like I have with 4 other machines in the past, and use whichever OS provides the best overall results (noting some configuration and fan-speed quirks with Linux and eGPUs).
I had considered a purpose-built deep learning desktop, though it has no mobility and draws too much power in ‘normal’ operation.
I am considering the following;
Alienware 17 Gaming Laptop
https://www.costco.com/.product.1340132.html
Intel Core i7-8750H
16G DDR4 RAM
GeForce GTX 1070 Max-Q 8G
1 TB hybrid drive
1 Thunderbolt port, 3 USB 3.0 “Superspeed” ports
This gives me the flexibility to;
– Use the embedded GPU as at least a starting point for model experimentation, and then shift to a more powerful eGPU with more memory, an AWS p3 VM, or Kaggle. If I choose an eGPU, I would knowingly accept the 15-20% hit in training duration.
– Upgrade eGPUs as they improve in power and memory and come down in price
As deep learning can run 24/7 putting a significant thermal demand on laptop components, I’m paying *especially* careful attention to various cooling approaches, and will monitor CPU/GPU temperatures closely.
I may also add an SSD to store/stage the training data on during training, to avoid long waits for batch pulls from the HDD. I don’t know how to tell whether the motherboard (R5?) contains the Thunderbolt circuitry or whether it is on a daughter board.
All thoughts/critiques welcomed!
Tim Dettmers says
It sounds like you have thought about this pretty well! It seems like the perfect choice for you. I am not sure, though, what you mean about whether the Thunderbolt circuitry is on the motherboard or a daughter board. What is this referring to?
andrea de luca says
If you are going to purchase a laptop, I think you should be aware of some issues:
1. Laptop hardware scarcely tolerates high-demand workloads running 24/7. You could run into overheating issues on a gaming laptop with a discrete GPU.
2. Laptop GPUs are somewhat limited. As far as I know, no laptop GPU goes above 8 GB of VRAM. In other words, forget about training big transformers.
3. Pascal GPUs (like the 1070 you mentioned) do possess FP16 capability, but you will sadly observe that they have convergence problems when training in 16-bit. This will exacerbate memory scarcity. I strongly urge you to purchase at least a Turing laptop.
Stas says
Hi,
Can you say something about, say, 3x 2060 Super vs a 2080 Ti? Is it worth trying 3 lower-end cards vs one top-end card for DL?
Tim Dettmers says
It depends on your problem. If it does not require much memory, 3x 2060 Super can make sense. However, most modern models require a fair amount of GPU memory and run slowly otherwise. So carefully check whether you think the memory on the 2060 Super is sufficient.
wayne says
I am starting out in ML and wanted to get a laptop to serve multiple roles, including graphical work. You mentioned in your awesome blog that Quadros are a no-no. If I want to get something like a Lenovo P53 with a Quadro T1000 to fit within my budget and means, is this still OK?
Tim Dettmers says
Quadros usually have very poor cost/performance, but if you find a good deal that is fine.
Sreevanth says
Hi Tim,
Thanks for sharing; it has resolved the doubts I had. I am planning to buy an MSI GE63 laptop with an RTX 2070. I’m a beginner, but I would like to invest in a good laptop for the long term in deep learning. Is this a good laptop or do you suggest another?
Tim Dettmers says
An RTX 2070 in a laptop is pretty powerful as laptop GPUs go. I think it will last you quite a while.