Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How do you make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and offer advice that will help you make a choice that is right for you.
This blog post is designed to give you different levels of understanding of GPUs and the new Ada series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ada series, you can skip right to the performance and performance-per-dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post and the content surrounding it explains the details of what makes up GPU performance.
(2) If you have specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.
(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, it is best to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.
Overview
This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I then discuss the unique features of the new NVIDIA RTX 40 Ada GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.
How do GPUs work?
If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you to understand in which cases GPUs are fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:
This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.
The Most Important GPU Specs for Deep Learning Processing Speed
This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then the FLOPS of a GPU.
Tensor Cores
Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication, Tensor Cores are very useful. In fact, they are so powerful that I do not recommend any GPUs that do not have Tensor Cores.
It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show, for a simple example of A*B=C matrix multiplication where all matrices have a size of 32×32, what the computational pattern looks like with and without Tensor Cores. This is a simplified example, and not exactly how a high-performance matrix multiplication kernel would be written, but it covers all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.
To understand this example fully, you have to understand the concept of clock cycles. If a processor runs at 1GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operation needs to wait for the previous operation to finish. This is also called the latency of the operation.
Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. These numbers are for Ampere GPUs, which have relatively slow caches.
- Global memory access (up to 80GB): ~380 cycles
- L2 cache: ~200 cycles
- L1 cache or Shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
- Fused multiplication and addition, a*b+c (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.
For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.
To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.
Matrix multiplication without Tensor Cores
If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about six times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 float matrices into shared memory tiles can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.
To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:
200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles
Let’s look at the cycle cost of using Tensor Cores.
Matrix multiplication with Tensor Cores
With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with one memory transfer (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Core matrix multiplication, in this case, is:
200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.
This example is simplified; for instance, usually each thread needs to calculate which memory to read and write as it transfers data from global memory to shared memory. With the new Hopper (H100) architecture, we additionally have the Tensor Memory Accelerator (TMA), which computes these indices in hardware and thus helps each thread focus on computation rather than computing indices.
Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)
The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support for asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. The TMA unit combines asynchronous copies and index calculation for reads and writes, so each thread no longer needs to calculate which element to read next and can focus on doing more matrix multiplication calculations. This looks as follows.
The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats.
As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile. This means, the second read takes only 200 – 34 – 1 = 165 cycles.
Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.
165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.
This accelerates the matrix multiplication by another 15%.
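To make these numbers easy to play with, here is the same back-of-envelope arithmetic in a few lines of Python; the latency constants are the approximate Ampere values listed earlier, not measured values.

```python
# Back-of-envelope cycle counts for the simplified 32x32 matrix multiply
# examples above, using the approximate Ampere latencies from this post.
GLOBAL_MEM = 200   # cycles, global -> shared memory load
SHARED_MEM = 34    # cycles, shared memory access
FFMA = 4           # cycles, fused multiply-add
TENSOR_CORE = 1    # cycles, one Tensor Core matrix multiply

# Without Tensor Cores: 8 shared memory loads + 8 FFMA rounds.
no_tc = GLOBAL_MEM + 8 * SHARED_MEM + 8 * FFMA                 # 504 cycles

# With Tensor Cores: one shared memory transfer + one Tensor Core op.
with_tc = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE                # 235 cycles

# With async copies / TMA: the next global load overlaps with compute,
# so on average we only wait 200 - 34 - 1 = 165 cycles for it.
with_tma = (GLOBAL_MEM - SHARED_MEM - TENSOR_CORE) + SHARED_MEM + TENSOR_CORE  # 200 cycles

print(no_tc, with_tc, with_tma)                                 # 504 235 200
print(f"Tensor Core speedup: {no_tc / with_tc:.2f}x")           # ~2.14x
print(f"Extra speedup from async copies: {with_tc / with_tma:.2f}x")  # ~1.18x
```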
From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is by far the largest cycle cost for matrix multiplication with Tensor Cores, we would have even faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).
Memory Bandwidth
From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for these large neural networks, the Tensor Cores are idle about 50% of the time.
This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, the A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.
Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is.
While logically, L2 and L1 memory are the same, the L2 cache is larger, and thus the average physical distance that needs to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.
For matrix multiplication, we can use this hierarchical separation into smaller and smaller, and thus faster and faster, chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often just tiles for short.
We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.
Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.
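To make the idea of memory tiles concrete, below is a small NumPy sketch that computes C = A @ B tile by tile. The tile size of 32 and the matrix sizes are arbitrary illustration values, and the loops run sequentially on the CPU; a real GPU kernel picks tile sizes based on shared memory and register budgets and processes tiles in parallel across SMs.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B by accumulating products of (tile x tile) sub-matrices.
    This mirrors how a GPU kernel streams tiles of A and B through shared
    memory, but runs sequentially on the CPU for illustration only."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % tile == 0 and k % tile == 0 and m % tile == 0
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # This (i, j) output tile would live in registers/shared memory on a GPU.
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for kk in range(0, k, tile):
                # Load one tile of A and one tile of B ("global -> shared memory"),
                # then multiply-accumulate them (the "Tensor Core" work).
                acc += A[i:i+tile, kk:kk+tile] @ B[kk:kk+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```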
Each tile size is determined by how much memory we have per streaming multiprocessor (SM) and how much L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:
- Volta (Titan V): 128 KB shared memory / 6 MB L2
- Turing (RTX 20s series): 96 KB shared memory / 5.5 MB L2
- Ampere (RTX 30s series): 128 KB shared memory / 6 MB L2
- Ada (RTX 40s series): 128 KB shared memory / 72 MB L2
We see that Ada has a much larger L2 cache, allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not other GPUs). As such, data needs to be loaded from global memory only once, and then it is available through the L2 cache, making matrix multiplication about 1.5 – 2.0x faster on Ada. For larger models, the speedups are lower during training, but certain sweet spots exist which may make certain models much faster. Inference with a batch size larger than 8 can also benefit immensely from the larger L2 caches.
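As a back-of-envelope check of this claim, the following sketch estimates the memory footprint of one BERT-large feed-forward matrix multiplication in FP16. The hidden and feed-forward dimensions are the standard BERT-large values; the batch size of 32 and sequence length of 512 are assumptions.

```python
# Rough check of whether the matrices of one BERT-large FFN matmul fit into
# Ada's 72 MB L2 cache. hidden=1024 and ffn=4096 are standard BERT-large
# values; batch size 32 and sequence length 512 are assumptions.
bytes_fp16 = 2
hidden, ffn = 1024, 4096
batch, seq = 32, 512

weight = hidden * ffn * bytes_fp16               # ~8 MB weight matrix
activations = batch * seq * hidden * bytes_fp16  # ~32 MB input activations
total_mb = (weight + activations) / 1024**2

print(f"{total_mb:.1f} MB")  # ~40 MB -> fits in Ada's 72 MB L2, not in Ampere's 6 MB
```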
Estimating Ada / Hopper Deep Learning Performance
This section is for those who want to understand the more technical details of how I derive the performance estimates for Ada and Hopper GPUs. If you do not care about these technical aspects, it is safe to skip this section.
Practical Ada / Hopper Speed Estimates
Suppose we have an estimate for one GPU of a GPU architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 vs H100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and numbers of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.
To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.
As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
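Putting these corrections together, here is a small helper that debiases a reported 8x-GPU speedup; the 13.5% batch-size factor and the 5%/7% parallelization overheads are the figures from this section, and the example input is a hypothetical vendor number.

```python
def debias_speedup(raw_speedup_8x, batch_doubling_steps=0, parallel_overhead=0.05):
    """Adjust an 8x-GPU A100-vs-V100 speedup reported with unequal settings.

    raw_speedup_8x:        speedup taken directly from the vendor benchmark
    batch_doubling_steps:  how many times the A100 batch size was doubled
                           relative to the V100 run (each doubling ~ +13.5%)
    parallel_overhead:     extra scaling efficiency of the 8x A100 system
                           (5% for CNNs, 7% for transformers in this post)
    """
    speedup = raw_speedup_8x
    speedup /= 1.135 ** batch_doubling_steps  # remove the batch-size advantage
    speedup /= 1 + parallel_overhead          # remove the better NVLink scaling
    return speedup

# Hypothetical example: vendor reports 1.95x with a 2x larger batch on 8 GPUs.
print(f"{debias_speedup(1.95, batch_doubling_steps=1, parallel_overhead=0.07):.2f}x")
```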
Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:
- SE-ResNeXt101: 1.43x
- Mask R-CNN: 1.47x
- Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
Possible Biases in Estimates
The estimates above are for H100, A100, and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.
As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
Currently, no degradations for Ada GPUs are known, but I will update this post with news on this and let my followers on Twitter know.
Advantages and Problems for RTX 40 and RTX 30 Series
The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use feature, as they provide the same performance boost as Turing does but without any extra programming required.
The Ada RTX 40 series has even further advances, such as 8-bit Float (FP8) Tensor Cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables on RTX 40 cards can be easily prevented by connecting the power cable correctly.
Sparse Network Training
Ampere allows for automatic fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 of these 4 elements to be zero. Figure 1 shows what this could look like.

When you multiply this sparse weight matrix with some dense inputs, the sparse matrix Tensor Core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size, as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the Tensor Core, which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.
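As a small illustration of the 2:4 pattern, the following NumPy sketch prunes a weight matrix to two non-zero values per group of four and builds a values-plus-indices representation of half the width; the actual on-chip compressed format and metadata layout differ.

```python
import numpy as np

def prune_2_of_4(W):
    """Keep the 2 largest-magnitude values in every group of 4 along each row
    (the fine-grained 2:4 structured sparsity pattern Ampere accelerates)."""
    n, m = W.shape
    assert m % 4 == 0
    groups = W.reshape(n, m // 4, 4)
    # Indices of the 2 largest |values| within each group of 4.
    keep = np.argsort(-np.abs(groups), axis=-1)[..., :2]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (groups * mask).reshape(n, m), np.sort(keep, axis=-1)

W = np.random.randn(4, 8).astype(np.float32)
W_sparse, indices = prune_2_of_4(W)
# Compressed representation: 2 values + 2 small indices per group of 4,
# i.e. a dense matrix of half the width plus a little metadata.
values = np.take_along_axis(W_sparse.reshape(4, 2, 4), indices, axis=-1).reshape(4, 4)
print(W_sparse.shape, values.shape)  # (4, 8) (4, 4)
```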

I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.

While this feature is still experimental and training sparse networks is not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.
Low-precision Computation
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
![Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantizes the range [0, 0.9], while all previous bits are used for the exponent. This allows us to dynamically represent numbers that are both large and small with high precision.](https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=869%2C268&ssl=1)
Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, your gradients explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling, where the loss is multiplied by a scaling factor before backpropagation and the gradients are rescaled afterwards to keep them within the representable FP16 range.
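In PyTorch, this loss-scaling logic is already packaged in a gradient scaler. A minimal sketch of an FP16 training step follows, with a tiny toy model just to show the pattern; the dynamic scale factor handles the scaling and unscaling for you.

```python
import torch

# Minimal FP16 training step with dynamic loss scaling in PyTorch
# (tiny toy model and random data, just to show the loss-scaling pattern).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale the loss so gradients stay in FP16 range
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor dynamically
```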
The BrainFloat 16 format (BF16) uses more bits for the exponent, such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. So with BF16 you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability by using the BF16 format at the cost of a slight loss of precision.
What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TensorFloat-32 (TF32) precision, you get near-FP32 stability while providing speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!
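In PyTorch, the switch looks roughly like this. This is a sketch; whether TF32 is enabled by default depends on the PyTorch version you are using.

```python
import torch

# TF32: enable once, and FP32 matrix multiplications use Tensor Cores under the hood.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16: same autocast pattern as FP16, but no GradScaler / loss scaling needed
# because BF16 has the same dynamic range as FP32.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```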
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
Fan Designs and GPUs Temperature Issues
While the new fan design of the RTX 30 series performs very well to cool the GPU, different fan designs of non-founders edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution to this is to use PCIe extenders to create space between GPUs.
Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This has been running with no problems at all for 4 years now. It can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and cooling issue for a 4x RTX 4090 setup with a single simple solution.

3-slot Design and Power Issues
The RTX 3090 and RTX 4090 are 3-slot GPUs, so you will not be able to use them in a 4x setup with the default fan design from NVIDIA. This is kind of justified because they run at over 350W TDP, and they will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high-wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there are currently few standard desktop PSUs above 1600W on the market. If you get server or cryptomining PSUs, beware of the form factor — make sure they fit into your computer case.
Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might be just enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.

You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
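If you want to try this yourself, power limits can be set with nvidia-smi. Here is a minimal sketch; the 300 W value and the 4-GPU loop are just examples, and setting power limits requires root privileges and a driver/GPU that supports software power limits.

```python
import subprocess

# Example: set a 300 W power limit on each of 4 GPUs.
POWER_LIMIT_W = 300
for gpu_id in range(4):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
# Equivalent one-liner in a shell: sudo nvidia-smi -i 0 -pl 300
```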
RTX 4090s and Melting Power Connectors: How to Prevent Problems
There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem, and that the problem occurred due to user error. Here is a video that shows that the main problem is that cables were not inserted correctly.
So using RTX 4090 cards is perfectly safe if you follow the following install instructions:
- If you use an old cable or old GPU, make sure the contacts are free of debris / dust.
- Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
- Test for good fit by wiggling the power cable left to right. The cable should not move.
- Check the contact with the socket visually; there should be no gap between cable and socket.
8-bit Float Support in H100 and RTX 40 series GPUs
The support of the 8-bit Float (FP8) data type is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs, you can load the data for matrix multiplication twice as fast and store twice as many matrix elements in your caches, which in the Ada and Hopper architectures are very large, and with FP8 Tensor Cores you get 0.66 PFLOPS of compute for an RTX 4090 — this is more FLOPS than the entirety of the world's fastest supercomputer in 2007. 4x RTX 4090 with FP8 compute rival the fastest supercomputer in the world in 2010 (deep learning started to work just in 2009).
The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate nonsense during inference. I have written a paper about the emergence of instabilities in large language models, and I have also written a more accessible blog post.
The main takeaway is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision, everything works just fine.
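To see why a few high-precision dimensions are enough, here is a toy NumPy sketch of the decomposition idea: the few outlier feature dimensions are multiplied in full precision while the rest is absmax-quantized to Int8 (row-wise for the input, column-wise for the weight). The threshold of 6.0 and the matrix sizes are illustrative values; this is not the actual LLM.int8() implementation.

```python
import numpy as np

def mixed_precision_matmul(X, W, outlier_threshold=6.0):
    """Toy version of keeping outlier dimensions in high precision.
    Columns of X whose max |value| exceeds the threshold go through the
    full-precision path; the rest is absmax-quantized to Int8, vector-wise
    (per row of X, per column of W)."""
    outliers = np.abs(X).max(axis=0) > outlier_threshold
    # High-precision path for the few outlier dimensions.
    out_hi = X[:, outliers] @ W[outliers, :]
    # Int8 path for everything else: scale to [-127, 127], matmul, rescale.
    Xr, Wr = X[:, ~outliers], W[~outliers, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127 + 1e-8  # per-row scale of X
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127 + 1e-8  # per-column scale of W
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
    return out_hi + out_lo

X = np.random.randn(16, 512).astype(np.float32)
X[:, :4] *= 20  # a few outlier feature dimensions, as seen in large transformers
W = np.random.randn(512, 256).astype(np.float32)
err = np.abs(mixed_precision_matmul(X, W) - X @ W).mean()
print(f"mean abs error: {err:.3f}")
```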

But Int8 was already supported by the RTX 30 / A100 / Ampere generation GPUs, so why is FP8 in the RTX 40 series another big upgrade? The FP8 data type is much more stable than the Int8 data type, and it is easy to use it in functions like layer norm or non-linear functions, which are difficult to do with integer data types. This will make it very straightforward to use in training and inference. I think this will make FP8 training and inference relatively common in a couple of months.
If you want to read more about the advantages of Float vs Integer data types, you can read my recent paper about k-bit inference scaling laws. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that, bit for bit, the FP4 data type preserves more information than the Int4 data type and thus improves the mean LLM zero-shot accuracy across 4 tasks.

Raw Performance Ranking of GPUs
Below we see a chart of raw relative performance across all GPUs. We see that there is a gigantic gap in 8-bit performance between H100 GPUs and older cards that are optimized for 16-bit performance.

For this data, I did not model 8-bit compute for older GPUs. I did so because 8-bit inference and training are much more effective on Ada/Hopper GPUs thanks to the 8-bit Float (FP8) data type and the Tensor Memory Accelerator (TMA), which saves the overhead of computing read/write indices and is particularly helpful for 8-bit matrix multiplication.
I did not model numbers for 8-bit training because to model that I need to know the latency of the L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance could well be 3-4x of 16-bit training performance if the caches are as fast as rumored.
But even with the new FP8 Tensor Cores, there are some additional issues which are difficult to take into account when modeling GPU performance. For example, FP8 Tensor Cores do not support transposed matrix multiplication, which means backpropagation needs either a separate transpose before multiplication, or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weights when I experimented with Int8 training in my LLM.int8() project, and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like a transposed matrix multiplication for Tensor Cores can affect performance.
For old GPUs, Int8 inference performance is close to the 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant if you have relatively large models with 175B parameters or more. If you are interested in 8-bit performance of older GPUs, you can read the Appendix D of my LLM.int8() paper where I benchmark Int8 performance.
GPU Deep Learning Performance per Dollar
Below we see the chart for the performance per US dollar for all GPUs sorted by 8-bit inference performance. How to use the chart to find a suitable GPU for you is as follows:
- Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers)
- While 8-bit inference and training are experimental, they will become standard within 6 months. You might need to do some extra, more difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.
- Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need (a small code sketch of this selection logic follows below).
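Here is a small sketch of this selection logic in code. The GPU names, prices, memory sizes, and relative performance values are hypothetical placeholders; substitute numbers from the charts in this post.

```python
# Hypothetical example data -- substitute numbers from the charts in this post.
gpus = [
    # (name, memory_GB, price_USD, relative_8bit_inference_performance)
    ("GPU A", 12, 800, 0.45),
    ("GPU B", 24, 1600, 0.70),
    ("GPU C", 24, 1100, 0.60),
    ("GPU D", 48, 4800, 1.00),
]

def best_gpu(gpus, min_memory_gb):
    """Step 1: filter by required memory; step 3: maximize performance per dollar."""
    candidates = [g for g in gpus if g[1] >= min_memory_gb]
    return max(candidates, key=lambda g: g[3] / g[2])

print(best_gpu(gpus, min_memory_gb=24))  # ('GPU C', 24, 1100, 0.6)
```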
We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended, as they do not have sufficient memory for many use cases. However, they might be the ideal cards to get started on your deep learning journey. Since the way you work matters more for doing well in Kaggle competitions than model size, and one can often rely on smaller models there, many of these smaller GPUs are excellent for Kaggle competitions.
The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with the A6000 Turing). The H100 SXM GPU is also very cost-effective, has a lot of memory, and offers very strong performance. If I were to build a small cluster for a company/academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I got a good deal on L40 GPUs, I would also pick them instead of A6000, so you can always ask for a quote on these.

GPU Recommendations
I have created a recommendation flow chart that you can see below (click here for the interactive app from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost-effective GPU that still has enough GPU memory for your use case. You can estimate the GPU memory needed by running your problem on vast.ai or Lambda Cloud for a while so you know what you need. vast.ai or Lambda Cloud might also work well if you only need a GPU very sporadically (every couple of days for a few hours) and you do not need to download and process large datasets to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours each day). You can use the example in the “When is it better to use the cloud vs a dedicated GPU desktop/server?” section below to determine if cloud GPUs are good for you.

Is it better to wait for future GPUs for an upgrade? The future of GPUs.
To understand if it makes sense to skip this generation and buy the next generation of GPUs, it makes sense to talk a bit about what improvements in the future will look like.
In the past, it was possible to shrink the size of transistors to improve the speed of a processor. This is coming to an end now. For example, while shrinking SRAM increased its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores gets smaller, this does not necessarily make GPUs faster since the main problem for matrix multiplication is to get memory to the Tensor Cores, which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve the raw speed of GPUs is to use more power and more cooling, as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.
Chiplets, such as those used in AMD CPUs, are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high-speed on-chip network. You can think of them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD's next-generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.
However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units on the Ampere generation (RTX 30 / A100 / RTX 40) or the extension, the Tensor Memory Accelerator (TMA), both reduce the overhead of copying memory from the slow global memory to fast shared memory (caches) through specialized hardware and so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices which is particularly important for 8-bit computation where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.
Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exist, but research looks promising, and I expect the first high-performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years' time.
Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.
We will probably still be able to improve the combination of algorithms + hardware until the year 2032, but after that we will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at this point. It might be that you need an RTX 9090 to run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost for an RTX 9090, and you will want to use a laptop + API for image generation and other tasks.
Overall, I think investing in an 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small, and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.
Question & Answers & Misconceptions
Do I need PCIe 4.0 or PCIe 5.0?
Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.
Do I need 8x/16x PCIe lanes?
Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.
How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?
You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.
PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!
How do I cool 4x RTX 3090 or 4x RTX 3080?
See the previous section.
Can I use multiple GPUs of different GPU types?
Yes, you can! But you cannot parallelize efficiently across GPUs of different types since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually gradient update).
What is NVLink, and is it useful?
Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.
I do not have enough money, even for the cheapest GPUs you recommend. What can I do?
Definitely buy used GPUs. You can buy a small, cheap GPU for prototyping and testing and then roll out full experiments to the cloud, like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/run inference on large models only every now and then and spend more time prototyping on smaller models.
What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?
I built a carbon calculator for calculating your carbon footprint for academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?
I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.
I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
What do I need to parallelize across two machines?
If you want to be on the safe side, you should get at least +50Gbits/s network cards to gain speedups if you want to parallelize across machines. I recommend having at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s bandwidth. Two EDR cards with cable are about $500 on eBay.
In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).
Is the sparse matrix multiplication features suitable for sparse matrices in general?
It does not seem so. Since the granularity requires 2 zero-valued elements out of every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, which involves pooling 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.
Do I need an Intel CPU to power a multi-GPU setup?
I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.
Does computer case design matter for cooling?
No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 °C better temperatures, while space between GPUs will provide you with 10-30 °C improvements. The bottom line: if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.
Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?
Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.
AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors show that some data center card with Tensor Core equivalent is planned for 2020, but no new data emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.
Let’s say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems that if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.
However, even if you solve the software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner, while I know far fewer AMD GPU experts.
In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.
Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.
When is it better to use the cloud vs a dedicated GPU desktop/server?
Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.
Numbers in the following paragraphs are going to change, but it serves as a scenario that helps you to understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.
The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with an RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, electricity costs an additional $0.12 per kWh. This compares to $2.14 per hour for the AWS on-demand instance.
At 15% utilization per year, the desktop uses:
(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year
So 591 kWh of electricity per year, which is an additional $71.
The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of time during the day), would be about 300 days ($2,311 vs $2,270):
$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311
So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.
You can do similar calculations for any cloud service to decide whether a cloud service or a desktop is the better choice for you.
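As a reusable sketch, the break-even calculation above looks like this in code; the default arguments mirror the example numbers in this post and will change as prices change.

```python
def break_even_days(desktop_cost, gpu_watt, cpu_watt, kwh_price,
                    cloud_price_per_hour, utilization):
    """Days after which a desktop becomes cheaper than a cloud instance,
    given a daily utilization fraction. Prices change constantly; the
    example values below mirror the numbers used in this post."""
    hours_per_day = 24 * utilization
    cloud_per_day = cloud_price_per_hour * hours_per_day
    desktop_kwh_per_day = (gpu_watt + cpu_watt) / 1000 * hours_per_day
    desktop_per_day = desktop_kwh_per_day * kwh_price
    return desktop_cost / (cloud_per_day - desktop_per_day)

days = break_even_days(desktop_cost=2200, gpu_watt=350, cpu_watt=100,
                       kwh_price=0.12, cloud_price_per_hour=2.14,
                       utilization=0.15)
print(f"Break-even after about {days:.0f} days")  # ~293 days, close to the ~300 above
```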
Common utilization rates are the following:
- PhD student personal desktop: < 15%
- PhD student slurm GPU cluster: > 35%
- Company-wide slurm research cluster: > 60%
In general, utilization rates are lower for professions where thinking about cutting edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
Version History
- 2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.
- 2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.
- 2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
- 2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
- 2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
- 2018-11-26: Added discussion of overheating issues of RTX cards.
- 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
- 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
- 2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
- 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
- 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
- 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
- 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
- 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
- 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
- 2015-02-23: Updated GPU recommendations and memory calculations
- 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgments
I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.
For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.
Hannes says
Hello, now there are some very affordable used Tesla M40s with 24 GB memory on the market. I found them starting from 650 EUR (about 720 USD). Is this a good deal for some use cases?
Tim Dettmers says
A Tesla M40 is pretty slow. I would advise you to get a Titan RTX if you really need more memory, as they are still much more cost-efficient than M40s at 720 USD.
Sammy B says
Hi Tim,
After reading all of your excellent posts, I came up with the following build: https://pcpartpicker.com/list/4dj8Mc. Just to briefly explain my rationale: I’m buying a 2GPU machine on ~4k budget, but will probably upgrade to 4GPU in the next year. Hence, I’ve chosen to max out PSU power, RAM, and chassis size in anticipation. As for the CPU/mobo, it seems like AMD really gives you a lot more bang for the buck than Intel, hence TR+X399. And as per your posts, I’m going with blower style cards.
I’d greatly appreciate it if you could let me know whether I’m not missing any incompatibilities, and whether in my described situation I can still possibly shave off some of the costs. Two particular questions: will the CPU/mobo combo definitely support 4 2080s as well as 2+ NVME SSDs? Also, this will be used in a university setting, and I have the option of putting the build on a rack rather than in its own chassis; I’ve never done that before, any advice or resources I can look into for building on a rack?
Tim Dettmers says
Looks like a solid build. I would be careful about the case though. Often cases are just big enough to house 3 GPUs. Make sure it fits 4 GPUs.
Yes, it supports 4 RTX 2080 Tis and 2+ NVMe SSDs since you have a Threadripper with additional lanes. I use the same setup with 3 NVMe SSDs and it works great.
For a rack you just need the right case. You are probably looking to buy a 2U format. Ask your university which format they need the case to be and then look for a case of the right format that supports 4 GPUs.
Keshav says
Hi Tim,
Planning to buy a GPU. I predominantly work on NLP and most of my models require less than 8 GB, so I am planning on the RTX 2060 Super. But I would like to do some hobby projects on video analytics, and I would like to know if it would work if I just used the 2060S for coding/debugging and building large models, and then, just for training, moved to cloud compute like Azure or AWS with a bigger-spec machine. That way it would be quite cost-effective.
Thanks in advance
Tim Dettmers says
That sounds reasonable. However, it also sounds like you would be doing a lot of deep learning at work / as a hobby and it might be that a bigger GPU might be just better for you. If you do NLP you probably also want to use pretrained transformers. If that is the case a RTX 2080 Ti might be better. If you do not want to use transformers you might be fine with a RTX 2060S.
Paul says
I got a good deal for i9 9900k/64GB/rtx 2070super, so went with that one. Hope I won’t regret not buying 2080 ti at the cost of ram and CPU. But I guess since I’m new to deep learning it won’t matter that much at least in the beginning.
Thanks for your reply.
Tim Dettmers says
You can also always save GPU memory in different ways, for example by accumulating gradients over multiple mini-batches. So there are always workarounds — no worries!
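To illustrate the gradient accumulation trick, here is a minimal PyTorch sketch (the tiny linear model, the batch size of 16, and the accumulation factor of 4 are made-up placeholders, not code from this post):

import torch
import torch.nn as nn

model = nn.Linear(100, 10).cuda()                 # stand-in for a real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4                            # effective batch = 4 x 16 = 64
optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 100, device="cuda")       # small mini-batch that fits in memory
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = criterion(model(x), y) / accumulation_steps   # scale so gradients average correctly
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # one weight update per 4 mini-batches
        optimizer.zero_grad()

The effective batch size grows while peak memory stays at the size of one small mini-batch, at the cost of slightly slower training.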
Jared says
Hi, I’m an architecture student and I use mostly Revit, Rhino, Grasshopper3d, and Lumion along with Adobe Cloud. I am interested in incorporating AI for generative design and deep learning but I’m finding conflicting info online about what hardware is best for this workflow. Should I go with build 1:
Intel i9-9900KS, Gigabyte z390 Aorus Ultra, RTX 2080 ti, and 128 GB DDR4-2666 RAM?
Or build 2:
AMD Ryzen 9 3900x, ASRock x570 Taichi motherboard, RTX 2080 ti, and 128 GB DDR4-2666 RAM?
Or is there another set of components that would be better than either?
I really appreciate any feedback! Thanks.
Tim Dettmers says
Both systems look fine, I would go with the one that is cheaper (which is probably the AMD Ryzen one).
Mick Lalescu says
Tim,
Excellent article, very valuable info.
I have one comment and one question.
Comment: I think you should update the article now that the RTX Super series has become available. It could be that the best performance per dollar is now the RTX 2060 Super, which is roughly equivalent to the old RTX 2080 but still costs less.
Question: Would it make sense for a machine learning workstation to have two different GPUs:
– a more powerful one, an RTX 20xx, for executing machine learning workloads with no monitor attached
– a cheaper, less performant one for attaching two or three monitors, for regular UI desktop tasks (email, browsing, IDE, SSH terminals, etc.)
My concern here is that running the UI on the ML GPU could slow down the ML tasks. Is this concern valid? I am thinking about using a Radeon GPU for the UI. The Radeon choice is for two reasons: AMD is cheaper, plus it is easier to identify which GPU is used by the ML task.
Tim Dettmers says
Usually it is fine to have monitors attached to a GPU and also use it for models. I have 3 monitors on one GPU; occasionally I run into problems if I run a model in parallel and max out the GPU memory to the very last bit. However, often you just need to reduce the batch size slightly and it is fine.
Paul C says
Hey,
I’ve got ~$2000 to spend on a PC and I’d like to know your opinion on the hardware priorities for deep learning. Is it better to buy an RTX 2080 Ti, a Ryzen 7 2700X/some i5, and 16 GB RAM, or maybe an RTX 2070/2080 (Super?), an i9/Ryzen 7 3700X, and 32 GB (64?) RAM?
Tim Dettmers says
A Ryzen 7 2700X will be more than enough for 1 GPU. The 16 GB of RAM will also be more than enough for the GPU, but maybe not for certain applications. If you run just deep learning algorithms and not other CPU-based modeling algorithms, though, 16 GB should be fine. So I would go for the first build. If you need to run some CPU-based algorithms (sklearn, Kaggle competitions, etc.), then I would go for the second build.
It also depends on whether you want to run big state-of-the-art models. For those, the 8 GB of GPU RAM in the second option will not be enough. So in that case, also go for the first option.
Mofii says
Hi Tim,
Great post. It’s very helpful. I am working on large GANs (1024×1024) and the GPU memory would be very important to me. Karras recommended in pgGAN that “high-end NVIDIA Pascal or Volta GPUs with 16GB of DRAM” would be good. I am thinking about a RTX Titan. Do you think this would be enough? For me, money is a less important issue when it comes to the GPU performance and memory requirement. Thank you for any advice!
Best,
Mofii
Tim Dettmers says
If you can afford to spend more then you could get a Quadro RTX 8000 with 48 GB of RAM ($5.5k). Otherwise, the Titan RTX ($2.8k) is already pretty good with 24 GB. You can also use techniques like batch aggregation to train with a larger batch size while requiring less memory.
Mofii says
Hi Tim,
Thank you very much for the advice!
Best,
Mofii
andrea de luca says
The RTX 8000 mentioned by Tim is the best option, but I noticed that Tesla V100 SXM2 cards are dirt cheap on eBay these days.
If you can buy a used SXM2 platform (Supermicro, Gigabyte, etc.), you can purchase four of them for the price of a single Quadro 8000.
Mofii says
This sounds pretty cool. I’ll definitely check it out.
Thank you!
Best,
Mofii
Mira says
Hi Tim. Currently I am building a PC dedicated to CUDA de/compression calculations, so sadly NVIDIA is necessary (probably a 2070S or maybe a 2060S is enough). Under peak load it may need to de/compress 2 GB of data per second for a couple of minutes. At this point, is ECC memory necessary?
Eight CPU cores are fine for the background work. The most limiting requirement is the need for 2 x 10 Gb/s LAN cards. Therefore, x16/x8 PCIe lanes (GPU + LAN) on the motherboard is the best scenario.
I am thinking about the X570 platform as future-proof, but it may not be the best fit.
Then I am thinking about Threadripper, which has a lot of PCIe lanes and probably better compatibility with ECC modules (which I do not know whether I need).
What would you recommend?
Thank you a lot.
Tim Dettmers says
ECC is necessary if it is really important that your data is not corrupted. Usually, on a normal desktop, you should expect a couple of memory errors per month. If you compress/decompress a lot of data each day, then ECC memory can make good sense. If you do it for research purposes or only once a week/month, then you do not need ECC.
Even if you have a slow GPU, your bottleneck will still be the x16 lanes. The GPU can process your data at hundreds of GB/s while the PCIe bus can only transfer about 16 GB/s. However, it also depends on how intensive the de/compression is; you should be fine with most GPUs. If the de/compression is very intensive, then anything above an RTX 2070 SUPER will not be much faster. I think you should be fine with an RTX 2060S/2070S. Another factor could be memory, but I guess you would need a streaming pipeline for your data anyway, and that should work just fine with 8 GB of memory.
Ruscio says
Hi Tim,
for a Computer Vision build would you go with 1x 2080 Ti or 2x 2060s Super?
Tim Dettmers says
I would go with a RTX 2080 Ti. I think for many computer vision tasks the 8 GB of memory can be limiting even if you use 16-bit computation. If you think you will only fit smaller models, then two 2060 Supers might actually be better. You can also always fit a model somehow on a 2060 Super using tricks like aggregating the gradient over multiple small mini-batches, but this will also make training slower, so an RTX 2080 Ti might be more reliable after all.
andrea de luca says
For the price of a single 2080ti or 2x2060S, I would buy neither, opting for 2x1080ti.
Same cost, more value: you will get two cards with 11 GB each. Used in parallel, they will beat a single 2080 Ti in both fp16 and fp32, not to mention having twice the memory.
Compared to two 2060S, you will still have considerably more memory and more speed in fp32, and you will be just a bit slower in fp16.
But please appreciate that if your model doesn’t fit in memory, it just doesn’t fit, while if you have to wait a bit longer for your training, well... you just wait.
ttodd says
Actually, if we put the motherboard, SSD, and memory costs together to analyze cost-efficiency, it’s easy to find that the RTX series GPUs have almost the same performance per dollar, though the 2070S and 2080 Ti would have slightly higher performance per dollar.
And may I ask if I could use two 2070s, one with open-air fan cooling and another with blower (turbo) cooling? I don’t know if the blower would make too much noise, because I have to put it in my bedroom.
Tim Dettmers says
If you want silent operation I recommend an all-in-one (AIO) hybrid cooled GPU.
You are right, if you put all the other hardware costs on top of the GPU, more expensive GPUs are much better in cost performance. If you get 2 GPUs, the RTX 2080 Ti just looks very good.
dragon says
Excuse me, what is the difference between the RTX 2070 Max-Q and the RTX 2070 in deep learning performance?
Tim Dettmers says
MAX-Q is for laptops and is a bit less powerful than a RTX 2070.
Uday says
Hi Tim,
Can you help me decide between the GTX 1650 Max-Q and the GTX 1050 Ti? Both come at the same price fitted in an MSI laptop.
Tim Dettmers says
They are about the same. The GTX 1650 Max-Q is slightly better if you use convolutional networks and 16-bit though.
Rafael says
Hi Tim
Superb blog and dedication from you. Very inspiring. Thanks in advance
Intel Xeon e5-2697-v3 (14 Cores 28 Threads with 2.6-3.6Ghz )
Asus x99 E Ws (1x C612 Chipset Motherboard)
64GB DDR4 RAM (Samsung 2133 MHz, ECC, 2Rx4, 16 GBx4)
2 x NVIDIA Tesla K80, which could be upgraded to 4 K80s
250 GB Samsung 970 Evo Nvme M.2 SSD for OS Installation
1TB SSD Corsair MP510 for Data Storage
Fractal Design Define XL R2 titanium Cabinet
Noctua NH-D15S for CPU cooling
Fan type and configuration? Still an open question for 2 to 4 GPUs
EVGA 1600W 80+ Gold PSU (supports 2-4 GPUs)
I know passively cooled Tesla K80s or similar should be avoided in multi-GPU configurations due to thermal issues. However, we need high double-precision performance for running VASP or Quantum ESPRESSO in demanding DFT calculations. I am stuck on an optimal design of the case airflow for a 2x K80 configuration, considering that this could be extended to 4 GPUs. I have considered the following options:
(1) Using 2 powerful intake fans (high static pressure >4 mm H2O and high airflow, about 110-150 CFM each) placed at a distance of approx. 18 cm from the GPUs.
As an indication, commercial servers seem to push similar airflows (approx. 300 CFM per 4 GPUs) but at a shorter distance (5-8 cm) from the GPUs. In some cases, exhaust fans are additionally included outside the case to push the air directly out of the GPUs (97x97x33 mm fans exhausting around 40 CFM at max speed). I have doubts about this option for a 4-GPU configuration, but it might still work with 2 GPUs?
(2) Designing a 3D-printed shroud for each Tesla K80 with a coupled SUNON centrifugal fan (97x97x33 mm and 40-44 CFM at a max speed of 5400 rpm; very noisy 54 dB fans) for active air cooling. The airflow and power are similar to those of actively cooled GeForce GTX graphics cards. This extension of about 14 cm (shroud + fan) would leave a gap of around 6-8 cm between the centrifugal fan of each GPU and the intake fans of the case. In that case, should I consider fans that do not need such high static pressure but deliver a similar airflow as before (100-150 CFM each)?
What is your experience in thermal optimization of 2 or 4 GPUs?
Any recommendation?
Best regards
Tim Dettmers says
The distance of the fans to the GPUs can be a critical factor, as can the more streamlined airflow in commercial solutions. Since K80 GPUs are already quite expensive, I would try to get a cheap commercial 2-GPU solution. I am sure you can work out a solution, and you seem to have something that can work well, but the uncertainty of whether it works might be a bit unsettling. A commercial solution should just work out of the box.
Another option which might work much better is to get Titan V GPUs instead (better double-precision performance and cheaper, but no memory error correction!). These have high double-precision performance and come with a fan cooling solution. You still need good airflow to cool 4 of them, but cooling 2 is not a problem. You can also buy extenders and distribute the GPUs around the case if you have 4 of them, and with that, cooling should not be a problem.
Abhishek Rao says
Tim, if I have to make a choice between GTX 1060 and GTX 1660 Ti laptops, which would it be? Is it worth spending $150 more on the 1660 Ti?
Tim Dettmers says
The GTX 1660 Ti is slightly better. I personally would get the best possible because you will probably use the laptop for a long time and it is difficult and costly to upgrade your GPU for your laptop. I think the $150 could be worth it over the long-term.
anonymous says
Actually, that is no longer the case: last year AMD made in-depth support for deep learning algorithms available on its latest GPUs, which are considerably less costly than NVIDIA’s.
Tim Dettmers says
Do you know of any source that documents TensorFlow/PyTorch functionality on AMD GPUs?
Sandeep says
Hi Tim,
I work in research in NLP and speech, working on LSTMs and GANs with datasets of 10 million+ samples. I am usually wary of increasing the batch size given [N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” Sep. 2016], so I usually keep the batch size at 256. I am ordering new hardware in my organization and budget is not a real constraint. I am planning the following. Can you please let me know if there is any incompatibility? I plan to use TensorFlow, PyTorch, and spaCy too.
8 x NVidia RTX 2080 Ti OC edition GPU processors (for text processing)
4 x NVidia Titan RTX GPU processors (for speech and image/OCR processing)
SSD 500GB
2 x 2TB HDD 10KRPM
16 x PCI-e 3.0
Appreciate all the info you have put. Thank you
Tim Dettmers says
It sounds like what you want is a small GPU cluster (12 GPUs, right?). You can build it cheaply by buying a cheap EPYC board that supports 6+ GPUs and then buying InfiniBand cards to connect the machines. However, such a setup can have its issues, and it might be difficult to figure out what is wrong if something goes wrong. Once you go across multiple servers or you have 8 GPUs in a server, it is probably better to buy such a setup from a hardware vendor that sells the full machine. I tried it myself and it took more than a month to get all the software working well together — if you buy a cheap solution that you build yourself, this is what you can expect.
Host says
Great article. I have been following it for years until having the right budget for a good start. Is it a good idea to water-cool the 2 RTX 2070 Supers I’m about to get (I have one already)? If so, how much better will it get? Thanks
Tim Dettmers says
If there is an empty slot between the two RTX 2070s you do not need water cooling. Otherwise, if the slots are next to each other, water cooling might yield 0-15% better performance if you use both GPUs at the same time.
Eric Bohn says
What about blowers vs non-blowers for the 2x RTX 2070 Supers on a normal SLI board like the X470, X570, or Z390?
Andy says
Thanks for the great research here! I have a question. I have $3800 to spend on a laptop/mobile workstation. I’m only going to use it for deep learning and have to spend it all (otherwise the bureaucracy does not think I’m doing my job). Can you give me some advice on which laptop I should buy with the best GPU performance? It’s been exhausting looking around and I could use some advice. Thanks!
Tim Dettmers says
I would go with either RTX 2080 or with a Quadro 5000/6000. There are also the Max-Q variants, but they will have less performance it seems. Also balance the other aspects of a laptop in terms of practicality and how you use it in general.
Andy says
Thanks for the quick reply!
Mircea Giurgiu says
I am looking for a similar laptop. Could you give a link to a possible provider (model)?
chanhyuk jung says
Do you think getting a graphics card in a laptop only for debugging is a good idea? And is the price justified?
Kenneth N Fricklas says
Hey Tim,
I picked up a machine with an RX 5700 XT/8GB and an AMD Ryzen 3900X (12 cores). Is there a set of benchmarks you’d like me to run on it?
Tim Dettmers says
Thanks for the offer! I think the most useful thing would be to run all different kinds of models and see if something goes wrong. If the AMD card supports all the models flawlessly, then it might be worthwhile to do more careful benchmarking and, in turn, add these benchmarks to the blog post.
Faruk says
Hi Tim, I am considering buying an RTX 2070 for my thesis work, which is an NLP task with RNN/GPT models. But there are several models from different makers, and all of the comparisons are for gaming. Sure, I will use it for gaming as well, but I want to get the best one for deep learning. Do you have any info or suggestions? Thanks in advance.
Tim Dettmers says
All the RTX 2070 should have similar performance for deep learning. I would just make sure that you get one that has good cooling.
Faruk says
Ok, I think I’ve found one but got confused for another reason. Is it better to buy just a 2080 Ti or 2x 2070 Super? Which one would you pick?
Tim Dettmers says
2080 Ti if you need the memory. Otherwise 2x 2070 Super.
Faruk says
Well, that is the “if” I do not have an answer for. I will use it for LSTMs and deep neural networks. You suggested prioritizing memory bandwidth above all for LSTM networks, just ahead of 16-bit capability. As far as I understand, I can get an effective 32 GB of VRAM using 16-bit on 2x 2070S and 22 GB of VRAM with a 2080 Ti. Is that true?
CNNs, on the other hand, I may also train at some point, but I think the 2070 will handle those moderately well.
Wagner says
For the same price, which is better: a used 1080 Ti or a new 2070?
I do mostly computer vision.
Sorry if this was already asked, but I didn’t find it.
Tim Dettmers says
If you are comfortable with just using 16-bit models the RTX 2070 might be a bit better. The problem is really the 8 GB memory on the RTX 2070 which might be limiting. But otherwise, the RTX 2070 is clearly better than the GTX 1080 Ti.
andrea de luca says
Hi Tim. Right now, I got two 1080TIs (11Gb each). Before their street price begins to fall down, I’d like to sell them and acquire something newer and more future-proof.
With a budget of ~1500 EUR (1000 of which will come from my 1080 Tis), I can devise two viable options:
1. A single quadro RTX 5000. It has 16gb of vram and full-fledged fp32 accumulation when running in fp16 (which desktop RTXs do not possess).
2. Three RTX 2070. While operating in parallel, they get 24Gb of vram, which is a lot.
My main concern regards the actual vram constraints. In other words, is there anything, in both Pytorch and TF, which still needs to be done on a single GPU and cannot be parallelized across multiple GPUs?
Tim Dettmers says
I think the RTX Titan also has full fp32 accumulation (although the difference in performance is very minor with half-baked fp32 accumulation). The three RTX 2070s cannot be combined so easily to extend the memory. For parallelism, you are usually stuck with just the memory of the smallest GPU, that is, 8 GB in the case of the RTX 2070.
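To make the memory point concrete, here is a minimal sketch (with a made-up toy model) of why data parallelism does not add memory: PyTorch's nn.DataParallel replicates the full set of weights on every GPU and only splits the batch, so each card still has to hold the whole model plus its share of the activations.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model).cuda()   # a full copy of the weights lands on every visible GPU

x = torch.randn(256, 1024, device="cuda")
out = model(x)                          # the batch of 256 is split across GPUs; the weights are not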
andrea de luca says
Thanks for your quick reply, Tim.
Could you elaborate a bit regarding the difficulties in combining GPUs to extend the memory? Currently, I mainly use PyTorch for vision tasks. It is sufficient to use the DataParallel API to automatically get 22 GB available (on my two 1080 Tis). That is, for example, I can exactly double the size of my minibatches, or the resolution (number of pixels) of the images.
Could you also cite some examples in which that would not be possible? NLP is my next area of interest, so any example in such domain would be really useful.
Thanks!
Note: If I cannot combine the mem of three 2070 in other typical use cases, I think it would be better to stick with my good old couple of TIs.
mestrace says
Thanks for the blog.
I have a desktop with one GTX 980. I was considering upgrading the GPU for academic purposes (i.e., course projects, exploration). I was considering two options: buying an additional GTX 980 for SLI, or upgrading to an RTX 2060 (that’s what I can afford). Which way should I go? Can you give any suggestions?
Tim Dettmers says
If you can afford it, an RTX 2060 Super would be best. I would also choose a normal RTX 2060 over a GTX 980 in SLI.
Mridul Pandey says
Hey Tim,
I can get an RTX 2070 SUPER that supports NVLink at a good price. What do you think of two NVLinked RTX 2070 Supers; how would that work? Does NVLink double the VRAM? If so, then the problem of the small RAM on the 2070 is solved.
scoodood says
Dear Tim,
I really learned a lot from you, thanks for the great article. I took some intro courses on ML/DL/RL a while ago. Now I am getting a bit serious and would like to dive deeper into this area (mostly for fun). I am thinking about starting with one of these GPUs and adding more in the future. Which GPU would you recommend for someone like me?
2080 super (new, ~$760)
2070 super (new, ~$520)
1080 TI (used, ~$520)
Best Regard
Tim Dettmers says
The RTX 2070 Super is great. I would recommend that over a GTX 1080 Ti. If you are doing it for fun, I would not spend too much money. The RTX 2070 Super is already pretty good for any application.
Nader says
What are your thoughts on the new RTX 5000 GPUs that are in the ThinkPad P53 laptop?
Tim Dettmers says
If it is like a Quadro RTX 5000 it is pretty good! Must be expensive though. Not the most cost-efficient card.
Ziv Freund says
Hi Tim,
Is there any change to your recommendation now that the Super versions are out?
Now the RTX 2060 Super has 8 GB of memory like the RTX 2070 Super. Should I still pay the extra money?
Also, when speaking with NVIDIA guys, they claim that any GTX-based GPU is not suited for 24/7 training, and for that I must buy workstation GPUs (like the T4, V100, etc.). What is your comment on that?
Thanks !
Ziv
Tim Dettmers says
Super cards are great and should be considered if you can afford them. I will write an update sometime later, maybe in a week or two.
Vincent says
Hi Tim, what about the new RTX 2070 Super or RTX 2060 Super? Are they superior to the former RTX 2070 and RTX 2060?
Tim Dettmers says
Yes they are in most circumstances.
Devidas says
This is a really very good blog for beginners.
Ashish Duhan says
Hey Tim,
I can get a used GTX 1080 Ti and an RTX 2060 SUPER 8 GB (almost as good as an RTX 2070) for the same price. Which one should I choose? Does an RTX card almost always allow a double-size model with mixed precision? If so, I would have effectively almost 16 GB with the RTX 2060 SUPER (5 GB more than the 1080 Ti). In this case, do you see any advantage in choosing the GTX 1080 Ti over the RTX 2060 SUPER?
Please suggest.
Tim Dettmers says
The RTX 2060 SUPER is better.
User02 says
Hey Tim, I have just started out in the field of machine learning and seriously want to do this. I want to participate in Kaggle contests and build projects of my own. Currently, I own a 2017 MacBook Pro with 8 GB of RAM and was looking to buy an eGPU. The only thing is I am not completely sure what I should buy. I really liked the idea of installing Windows on my Mac and using an RTX 2070, and just wanted to get your views on it and what you would suggest I do.
Tim Dettmers says
If you want to do machine learning, I would try to stick with either the Mac or Linux. Windows is now also supported by most frameworks but it can be cumbersome and unreliable at times.
Warren says
Tim,
I am now learning AI/ML.
Really appreciate your blog as I am looking to buy a PC.
Can you please tell me:
1. You mentioned the limitation of 8 GB on, say, the RTX 2080 a couple of times. If I install 2 x 2080 (non-Ti), do I have 16 GB to use? Does it work that way?
2. Will the knowledge I gain from studying AI/ML (Udacity) using a personal GPU be easily transferable to Google TPUs down the road? Or is it going to be another battle when I need to use Google TPUs in the future?
Thanks
Warren
Tim Dettmers says
1. Usually your smallest GPU memory is what you get if you parallelize across GPUs. So 8 GB.
2. Yes, it should be transferable. The only difference is how you launch your programs (and some TPU related changes to batch size and layer sizes).
Jay says
Hi Tim,
You explain that for CNN, priority is Tensor Cores > FLOPs > Memory Bandwidth > 16-bit capability.
So purely in terms of performance (without taking cost-efficiency into consideration), for CNN would you recommend an RTX 2080 Max Q over an RTX 2070?
The 2080 Max Q has 28% more Tensor Cores, although it also has 14% less FLOPs and 14% less Memory Bandwidth than the 2070…
RTX 2080 Max Q:
– 368 Tensor Cores
– 12.89 TFLOPs (FP16)
– 384 GB/s Memory Bandwidth
RTX 2070:
– 288 Tensor Cores
– 14.93 TFLOPs (FP16)
– 448 GB/s Memory Bandwidth
Thanks for this super blog by the way, always looking forward to your updates and comment replies!
Tim Dettmers says
It is not so straightforward actually. I would assume that the RTX 2070 is slightly faster here, but probably both cards are very similar. However, since I have no benchmark data I cannot say this for sure. I think all the factors (less tensor cores, but high memory bandwidth and flops) cancel each other out and both GPUs have similar performance.
Adam Kantorik says
Hey Tim.
I had to sell my desktop because I travel too much. So now I am going to upgrade my laptop, and I am deciding between two graphics cards: a GTX 1660 Ti 6 GB vs an RTX 2060. I know that the RTX has the new Tensor Cores, but is it worth it on a laptop? Also, I do deep learning just 15% of my coding time or so (but I code every day, and when I do deep learning I go all the way – LSTMs, convolutional nets, RL, ...).
If it’s only going to be 10% faster, I don’t think I need to pay extra for it, but I don’t want to regret the choice later. What do you think?
Tim Dettmers says
If you find a laptop with a cheap GTX 1660 Ti, it might be right for you. It seems what is important for you is that you can run these models at a reasonable pace, but it does not need to be super fast. Also, for big models there is always the cloud. I guess the GTX 1660 Ti might just be right for you!
Michel Rathé says
Hi again Tim,
What is the current state of multi-GPU workstations in regard to throttling and heat?
Since the ASUS RTX Matrix, which has its own AIO loop for cooling, I haven’t seen any relevant build on the net. So, for 2 to 4 GPUs, what design/brand is optimal to keep the performance, reliability, and longevity of the cards? If a custom AIO loop (or hybrid) is the best choice, please give high-level specs. Are blowers still relevant? What are the true and thoughtful practitioners using for a workstation build (not a GPU server)?
The motherboard will be workstation class, e.g., an ASUS Dominus Extreme or Sage.
Thanks and keep on the good work,
Michel
Tim Dettmers says
I did not see any deep learning data for AIO-loop GPUs. For now, blowers do better if you put the cards side by side in a 4-GPU case, but what I found to be best is to use normal, non-blower fans and PCIe extenders to distribute the GPUs in the case. This keeps them much cooler. I would assume, though, that AIO loops provide even better performance (if you have the space for all those radiators). Otherwise, there are also custom-loop solutions, which definitely give you the highest performance, but also the highest effort/price point to set up a system.
island fox says
Hi, do you think an RTX 2080 (not Ti) is good enough for doing Kaggle? How about a 2080 Super?
Tim Dettmers says
Yes, both cards are good enough for most cases. The only problem is the smaller 8 GB of RAM when you face “deep learning competitions”, that is, computer vision competitions where convolutional networks are really important. But even then, you should get pretty far with 8 GB of RAM.
Island Fox says
Thanks for the response👍
Maurice Langner says
Hi Tim,
I already wrote you a message recently, but I had to delete the respective email address.
What I would like to know is what GPU setup you would recommend for NLP scientists working on speech recognition. I have the choice between 1-2 RTX 2080 Tis and a Titan RTX with a student discount, which makes it equal in cost to buying two 2080 Tis.
I am still not sure whether it is necessary to have 24 GB, when connecting two 2080 Tis would mean summing the memory to 22 GB and the CUDA core count to about 8000.
I would be really grateful if you could help me.
best,
Maurice
Tim Dettmers says
I would go for the two RTX 2080 Ti. The memory will still be limited to 11 GB for standard APIs in popular libraries (tensorflow, pytorch), but with some effort you could utilize the full 22 GB with a single model. However, probably you will be fine with 11 GB 95% of the time.
YOYO says
You say the best GPU for deep learning is the RTX 2070. How about the RTX 2060 SUPER, which has 8 GB and was released in July 2019? Could it be the best and most cost-efficient GPU? I am going to buy a used 2070 or a brand new 2060 Super.
Tim Dettmers says
The RTX 2060 Super might actually be better in terms of cost efficiency, but I do not have any hard data about this. In the worst case it is just a bit worse than an RTX 2070 — there is not much to lose!
Alon says
Hi,
I’ve been wanting to purchase an RTX 2070, but noticed a lot of complaints (on internet forums/review threads) about RTX GPUs failing. Do you think it’s a real issue? Have you encountered problems with RTX GPUs?
Thanks, Alon
Tim Dettmers says
It seems that only the first batch of RTX cards was affected. At the University of Washington I know of 2 failures out of 20+ GPUs, which is high but not as high as all the news would suggest. These cards also came mainly from the first batch of RTX cards.
Shadiq says
Hello, I absolutely respect the work you have done for this post. I have a question: since the RTX Super lineup is now available and replacing the RTX 20x0 cards, would you reckon the overall GPU recommendation list stays in the same order? So, for example, the best overall GPU is now the RTX 2070 Super, and so on. Or are there any further considerations before I purchase an RTX 2070 Super myself?
Tim Dettmers says
The RTX 2060 Super looks great! The RTX 2080/2070 Super should be more cost-effective than the RTX 2080/2070 but with fluctuating prices this is difficult to say for sure.
Keshav says
Hey, did you get the Super GPU? If so, how is it? Why is the RTX 2070 Super cheaper by $100 than the normal RTX 2070?
Harsh Mittal says
Hey Tim, thanks a lot for such useful information. I want to start Kaggle and I am serious about it.
As you recommend, the RTX 2070 would have been best, but since it’s not available right now, should I go with the RTX 2060 Super or the RTX 2070 Super?
Tim Dettmers says
If you have the money, the RTX 2070 Super is a good portion better than the RTX 2070 for conv nets/transformers. If you’d rather train RNNs and small CNNs, get an RTX 2060 Super.
Alex says
Hi Tim,
Thank you for the great post!
I am thinking about a PC for doing ML projects and taking part in Kaggle competitions. I usually work with image data, so I am thinking about buying one RTX 2070 Super. Would you recommend this GPU?
Tim Dettmers says
This works well for most competitions. For some computer vision competitions an RTX 2070 Super will not have enough RAM to get the best results, but you will still get good results.
Jürgen B says
Hi Tim,
thank you for this very profound analysis! I have just some questions. According to what I read on techpowerup.com the GTX 1060 works well for float32, but doesn’t do so well for float16. Does that match your experience? Would you still recommend the GTX 1060 if float16 is used, or would you then recommend something different?
I am a beginner in deep learning and would just like to try out some things, mainly related to convolutional nets, and later would like to check out an idea I have about training on short image sequences taken from a video (black and white pictures of size maybe 50x50 and up to 10 images combined in one input vector). Is it realistic to train something like this on such a card?
One other question: you recommend the GTX 1060 for starters. Do you prefer this card over the GTX 1660 because the latter is around 30% more expensive, or is it because the 1660 has no tensor cores (I just read this here)? I guess the 1060 has some, right?
Thank very much in advance
Jürgen
Here the articles I mentioned:
https://www.techpowerup.com/gpu-specs/geforce-gtx-1660-ti.c3364
https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb-gddr5x.c3328
Tim Dettmers says
The GTX 1060 can be run with float16 to save a little bit of memory, but it will not make training faster. 50x50x10 inputs are very doable with a 6 GB GTX 1060. If you get a GTX 1060 with less memory, it might run into some difficulties, but you might be able to get around that if you decrease the batch size to save some memory. For starters, I would instead recommend the RTX 2060 or the RTX 2060 Super (if you can afford it). A cheap GTX 1060 with 6 GB from eBay is great as well.
Jürgen says
Hi Tim, thank you for your reply. In the meantime I caught an RTX 2070 on eBay. But I couldn’t test it up to now, because Ubuntu 19 unfortunately installs CUDA 10.1, which is supported by neither PyTorch nor TensorFlow, so up to now the RTX is only an additional source of light due to the LED bar on its edge 🙁
Tim Dettmers says
You can try to download CUDA 10.0 and install it manually. The run files are usually easy to install.
Steve Ni says
Hi, Tim! Thanks for your great article!
I am now wondering whether to buy two GTX 1080s or a single RTX 2080 for the same money. (In my country the price of an RTX 2080 is twice that of a GTX 1080.) Which should I choose? (I generally do CNNs using TensorFlow at present, but maybe more models in the future.)
Tim Dettmers says
This is a tough choice. I think it is about the same. If you are using CNNs a lot, then the RTX 2080 will be great if you train large ones (16-bit compute is great for large CNNs). If your CNNs are a bit smaller, the two GTX 1080s are definitely better. I personally would probably go with the two GTX 1080s.
William says
Looking forward to seeing 2060/2070/2080 Super GPU and their performances, thanks!
Tim Dettmers says
I might work on this in late August / early September. Thanks for your patience!
Michael Mao says
How important is GPU clock speed for deep learning? I’m looking to spend some money on an RTX 2080 Ti as a second card alongside an RTX 2070 in my system. Since this is a much more substantial investment, I want to get the best bang for the buck in terms of deep learning performance among the RTX 2080 Tis on the market. I’m looking at the EVGA Black Edition and the Aorus 11GC, which are 150 dollars apart but have a boost clock difference of 150 MHz. It seems that all other aspects are pretty similar apart from the clock speed differences. Is it worth getting the pricier model? How much does clock speed affect deep learning performance?
Tim Dettmers says
The clock speed does not matter; the cooling solution matters much more. Cooling will make a difference of +/-30% while clock speed is about +/-5%. In terms of cooling, I heard good things about the RTX 2080 Ti Sea Hawk, and the Strix also seems to be decent (not sure about a Strix multi-GPU setup though).
Scarecrow says
Hi Tim!
I’m looking to buy a new laptop for deep learning. I’m not super into deep learning yet, but in the near future I would get into Kaggle competitions. I was looking at laptops from HP and Lenovo. I find almost all of the laptops within my budget have the following GPUs: MX150, GTX 1050, 1050 Ti, GTX 1650, and GTX 1660 Ti. Which one of these is better in 2019?
My main laptop specs would be an i7 (7th gen+) and 16 GB RAM.
Tim Dettmers says
I would only recommend the GTX 1660 Ti version. But note that this GPU is also not very good. You can do some prototyping but not run regular deep neural networks that are a bit larger.
Cindeep says
If I have an option to buy either GTX 1060 or GTX 1660 Ti, which one do you recommend? (considering that the latter has 1536 CUDA cores compared to 1280 from the former)
Tim Dettmers says
I would recommend a GTX 1660 Ti.
Mircea Giurgiu says
If I compare the price and performance of 4x RTX 2080 versus 4x RTX 5000, is there a reason to adopt the RTX 5000, given that the number of cores is lower?
Attila says
Your guide helped me a lot already, thank you. I’m about to become an NLP researcher, and I’d like to buy a PC for it. Until now I was thinking about going with an RTX 2070, since it just fits my budget. However RTX Super cards are arriving soon. My question is: should I go with the “old” 2070 or is the 2060 Super a better value for my purposes?
Tim Dettmers says
The RTX 2060 Super and RTX 2070 are about the same in performance. Buy the one that is cheaper.
MibaNafi says
RTX 2060 SUPER vs RTX 2070?
Tim Dettmers says
Both have about the same performance; buy the one that is cheaper.
MibaNafi says
Hi,
Which is better: 2x 1060 6 GB or 1 RTX 2070?
Nearly the same price...
Tim Dettmers says
I would go for the RTX 2070.
Malli says
Hi Tim
Great blog and I found it very useful to kickstart my deep learning journey. This is the configuration that I am going with-
Intel Core i7 9th Gen 9700K (8Cores 8Threads with 4.9Ghz )
MSI Z390 Chipset Motherboard with WIFI, Bluetooth and
64GB Ram Support
32GB G.Skill Ripjaws V 2400Mhz DDR4
Nvidia GeForce RTX 2070 8GB
1TB Samsung 970 Evo Nvme M.2 SSD for OS Installation
2TB Western Digital HDD for Data Storage
Cooler Master H500 Irony Grey Cabinet
Cooler Master MasterLiquid 240 Cooler
1200W 80+ Platinum power supply (supports 2 GPUs)
Any comments from your side would be great. Planning to get a second RTX 2070 in 6 months.
Tim Dettmers says
The PSU wattage is a bit high right now. I guess if you add more GPUs along the way it fits very well though!
Ken Fricklas says
Hey Tim – any opinions (Early on) about the new announcements from AMD?
Seems like the $750 Ryzen 9 3950X with PCIe 4.0 and a couple of $449 RX5700XTs could be a game changer in the sub-$2000 workstation market (to say nothing of gaming).
Tim Dettmers says
PCIe 4.0 is not needed at the moment and will not yield the greatest benefit unless you parallelize across 8 GPUs. For PCIe 3.0 CPUs, the EPYC CPUs are currently by far the best in terms of cost/performance. I am unsure about the RX 5700 XT. It is difficult to get reliable information about deep learning performance for AMD cards, and I am not sure I can make a recommendation without seeing multiple benchmarks on relevant deep learning models/tasks.
Mircea Giurgiu says
I have noticed at some hardware providers that:
a) a workstation with 4 x RTX 2080 Ti,
and
b) a workstation with 2 x Quadro RTX 8000 NVLinked (the other supporting hardware: memory, HDD, CPU, etc. is the same for (a) and (b)),
have, more or less, THE SAME PRICE.
Q1) Given that (a) has more CUDA cores/Tensor Cores in total, what would be the reason to purchase (b)?
Q2) What could be the reasons to purchase (a)?
Thank you in advance,
Mircea Giurgiu
Tim Dettmers says
If you have very large models (such as transformers) which need the full 48 GB of GPU memory the Quadro RTX 8000 will be a good choice. However, for normal transformers (BERT-style) or normal computer vision models, the RTX 2080 Ti is sufficient and preferred since they will yield better performance. I would recommend the 4x RTX 2080 Ti option in 9/10 cases — I think this is what you really want to get as well!
Mircea Giurgiu says
Thank you so much for the clear and prompt response. Congratulations for this blog !!
Yashovardhan Chaturvedi says
Hi,
What about image segmentation models? Will an RTX 8000 be better than 4 RTX 2080 Tis?
Tim Dettmers says
If your model fits into GPU memory (11 GB) then 4x RTX 2080 Ti will be much faster.
yash says
That’s the issue I am facing: my model is not fitting into the GPU. I can only fit about 8 images on each RTX 2080 Ti, and because of that my training time is currently 2 weeks, as the dataset is quite large. So I was wondering, in order to increase the batch size, should I go for the RTX 8000, which has 48 GB of GPU memory, so that I can get through my dataset faster?
Tim Dettmers says
An RTX 8000 is quite pricey, but if you encounter this problem often or expect to encounter it in the future, an RTX 8000 is a good choice. An RTX Titan might be a middle ground with 24 GB of memory, but it would be a shame if you need more than that and are stuck with 24 GB. So if you have the money, the RTX 8000 might be the right way to go.
Another way would be to buy multiple RTX 2080 Tis and split the model with model parallelism (grouped convolutions) across the two GPUs. That is more work programming-wise, but it could also work.
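As a rough sketch of what that model parallelism looks like in PyTorch (a toy two-layer example with made-up sizes, splitting by layers rather than by grouped convolutions, and assuming two GPUs are visible):

import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the network lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # only the activations move between GPUs

net = SplitNet()
out = net(torch.randn(8, 3, 224, 224))

Each GPU now holds only its share of the weights and activations, which is what frees up memory compared to data parallelism; the price is the extra transfers between the cards.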
yash says
Hey Tim,
Thank you for your reply. As this is in a company setting and not for personal use, money is not the issue. I was going through the specs of the V100 32 GB vs the Quadro RTX 8000; the 32-bit FLOPS of the V100 are quite high. So with a focus on data parallelism rather than model parallelism, to get through the whole dataset faster, what would you suggest: 8x V100 32 GB or 8x Quadro RTX 8000 with 48 GB each? Price is not an issue.
Joseph Quinn says
I can buy a 2070 and a 1080 Ti at almost the same price. If the trend stays the same after the NVIDIA Super cards launch, should I go with the GTX card for more RAM, or just go with the RTX?
Tim Dettmers says
This is a bit of a tough choice. I think if you just want ease of use, a GTX 1080 Ti might be better. If you are fine with doing a bit of extra work for 16-bit models and to fully utilize Tensor Cores, an RTX 2070 will be slightly better (but less convenient).
Joseph Quinn says
Thanks for your reply. Now that the RTX Super cards have arrived, I can buy a used 2070 for ~$40 less than a 2060 Super or a 1080 Ti. I don’t mind getting used to working with 16-bit models. Which of the three would be the best bet?
Is it worth spending $200-300 more for a 2070 Super or 2080?
Peter Voksa says
I would like to know the answer to this as well
Tim Dettmers says
The RTX 2070, RTX 2070 Super, RTX 2080, and RTX 2060 Super are all competitively priced. You will not get a much better deal for any particular card. However, for an RTX 2070 at $40 less than an RTX 2060 Super, I would go for that offer.
Mridul Pandey says
Hi Tim,
I am planning to build a computer and find myself in the “I started deep learning and I am serious about it” category. The RTX 2070 looks like the better option for me. I am going for a Ryzen 3600X CPU with it. I am planning to add more RTX 2070 cards in the future. Should I spend extra money on an SLI-supported motherboard, or can a normal motherboard with multiple GPU slots give similar performance? Please advise.
Tim Dettmers says
As long as you can fit multiple GPUs you should be fine. Make sure, however, that your motherboard not only has the slots but also supports the multiple GPUs. Also make sure your CPU supports the RTX 2070s that you like. Many Ryzen CPUs support 2 GPUs, but not more.
James says
Kaggle now provides a P100 with a 9-hour runtime, and I can run 4 kernels at a time. Is there any advantage for me in buying an RTX 2060?
Tim Dettmers says
It is really about whether the P100 limits your work or not. Sometimes when I prototype I need a single GPU to be productive. Sometimes I need to do a parameter search or run experiments for comparison, and then even 15 GPUs will not be enough. So think about different situations and whether 9 hours is enough for them or not.
Mantas says
Hello,
If you had to choose a GPU for a laptop, would you go with the RTX 2060 6 GB or the RTX 2070 8 GB Max-Q?
Tim Dettmers says
I personally would go with an RTX 2070 due to the extra memory, but it really depends what your applications are. If you want to do research then 8 GB is really the minimum.
Deep Khamaru says
I am a BTech student interested in learning machine learning. My choice of machine for the aforementioned task is the MSI GS65, with an 8th-gen i7 processor and a 6 GB 1060 GPU. Now please bear with me. I know a laptop is not what is intended for ML, but I desperately need portability of some form or other. So is this the laptop I should go for? Or something with a 2060? I don’t have the money to go higher. Thank you for this article on this particular topic, which is kind of rare.
Tim Dettmers says
You cannot do much on a GTX 1060 these days — this should only be an option for very basic deep learning workloads. I would recommend getting a cheap desktop + cheap laptop and then to work remotely on your desktop from your laptop. For $2000 you can buy a desktop with an RTX 2070 and some cheap components and an additional good netbook.
Rytiss says
Hi,
I am looking forward to building a cheap deep learning rig. I was really interested in the eGPU (external graphics card) option because I have a laptop with a pretty powerful processor (i5-8300H). Due to problems with eGPUs, I am now focusing on building a cheap rig mainly for deep learning. I am looking to build a mini-ITX system with an RTX 2060 or RTX 2070 (of course the RTX 2070 is the way to go 🙂 ). However, the CPU price itself increases the system price considerably (buying a better CPU also requires cooling fans, which also increases the price). I am thinking about a CPU as cheap as possible that can still handle the work related to deep learning on the RTX 2060/2070 nicely. I know about bottlenecks, etc., but would going with an Intel Pentium G5400 in a system meant for deep learning be a bad decision? I am training models with float32 and float16.
Tim Dettmers says
I had a similar machine. For an RTX 2070, the Intel Pentium G5400 should be a pretty good CPU. An Intel Pentium G5400 would work well for many applications but can be a bit slow if you want to train on datasets with large input-size/second, for example, ImageNet. But the performance loss on ImageNet should still be small (probably <20%).
Mary says
HI,
What are your thoughts on Arm-based CPUs & GPUs for ML? Especially since ML is moving towards the edge. Have you done any performance comparisons with the Cortex-M family?
Tim Dettmers says
It looks like these GPUs are mostly inference GPUs for mobile devices and would perform poorly for training. Or did I miss something here? If this is accurate, then I would believe that ordinary GPUs/TPUs/Graphcores would be used for training and you would then optimize trained networks for these mobile GPUs.
Nicolas CS says
Hello There Tim, thanks for this blog, it has been quite helpful for me.
I move a lot, and I am planning to buy a laptop for DL because of that. The difference in price between a laptop with a 1660 Ti Max-Q and one with a 2070 Max-Q is above 900 USD. As we are talking about the Max-Q versions, is it worth it to pay those 900 extra dollars?
Thank you again for saving us a lot of time in this kind of cost efficiency research.
Tim Dettmers says
900 USD is quite a big difference for those laptop GPUs! If you want to do deep learning, the 900 USD investment could make sense since the GTX 1660 Ti is not that good for deep learning. On the other hand, you can also get the 1660 Ti and just rely on cloud GPUs in case you need them. This could be a very competitive alternative, and cloud GPUs would also be much faster than a 2070 Max-Q.
Van Long says
I wonder why the RTX 2070 is less effective than the 1070/1070 Ti on word RNNs? That sounds weird...
Tim Dettmers says
It is likely due to the larger cache of the GTX 10 series.
Thanhlong says
Hi Tim, thanks for your great article about choosing a GPU for deep learning. I want to ask whether I should buy an RTX 2070 or a 1080 Ti. I think that in the future the RTX will take the lead because it will be better supported, despite the GTX 1080 Ti having 11 GB of VRAM versus the 8 GB on the RTX 2070. Can I have your opinion on it? Thank you so much.
Tim Dettmers says
The extra memory on the GTX 1080 Ti can be a blessing, but if you figure out how to use 16-bit computation on the RTX 2070 reliably you will get much more out of an RTX 2070 and it will last longer. It will be a bit more painful to adapt things to 16-bit, but with a bit of work it should work out just fine.
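For reference, a minimal sketch of what 16-bit (mixed-precision) training looks like with PyTorch's built-in AMP utilities; the tiny model and random data below are placeholders, not code from this post:

import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()              # rescales the loss to avoid fp16 underflow

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # runs eligible ops in fp16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

The autocast context handles the fp16/fp32 casting, and the gradient scaler keeps small gradients from vanishing in fp16, which is most of the "extra work" 16-bit training requires.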
Osama says
Hey Tim, should I go for a GTX 1660 Ti 6 GB or a GTX 1070 8 GB?
Tim Dettmers says
Tough choice. I would probably go with the GTX 1070 because of the additional memory. But either card is a good choice.
Alexander W. MacFarlane IV says
Hi Tim!
I am thinking about this from the supply side, and am curious why the option I have been considering was not mentioned.
My play is to take an oldish Xeon server that I have in my basement, install a couple of RTX2080Ti cards, and rent it out to people like you on Vast.ai.
https://vast.ai/
They seem to be capitalizing on the no RTX in data center rule by creating a service where buyers meet sellers, and my idea is to become one of the sellers.
Have you not heard of this or is there some other reason why you did not mention that as an option?
Alex
Tim Dettmers says
The problems here are reliability, customizability, and ease of use. AWS/Azure/Google Cloud is not much more expensive but just offers better service. It is difficult to beat that right now. Otherwise, in the long term, people are still better off investing in a GPU themselves.
Josh says
Thanks for the detailed breakdown! I have some questions regarding the performance comparison of the various cards for the various tasks (RNN, CNN, Transformer).
You state that memory bandwidth is crucial for RNN performance; however, in Figure 2 the 1080 Ti with 448 GB/s bandwidth has a far higher performance rating on the word RNN metric than the RTX 2080 Ti with 616 GB/s bandwidth. I guess my question is, what specs do you take into account when extrapolating from the RTX 2080 Ti to, e.g., the GTX 1080 Ti?
Also, I don’t understand the combination of the char RNN + Transformer metrics; from the preceding text it seems like those depend on different specs (memory bandwidth for RNNs, bandwidth + core speed for Transformers). Why do you combine those two?
Best,
Josh
Tim Dettmers says
Unfortunately, I do not have time to make a detailed analysis of this, but here is my best guess: cache size is an important factor for performance, and the GTX 1080 Ti has a larger cache than the RTX 2080 Ti. This is likely why the GTX 1080 Ti is faster for small matrix multiplications. Additionally, this could be an issue where some code paths are just better optimized for the GTX 1080 Ti, and with a different hidden size the RTX 2080 Ti would have caught up. If this is true, the performance might change with future releases of CUDA.
Alan Chen says
Big surprise. How could the 1660 Ti be worse than the 1060? The 1660 Ti beats it in all aspects and wins all gaming benchmarks.
Tim Dettmers says
They do not have Tensor Cores, which are not important for gaming but are important for deep learning.
Karan Sharma says
Hi Tim,
I am planning to begin my deep learning work. I am very serious about it and want to do more projects after getting done with an initial one.
I am stuck on the choice of which GPU I should opt for; Tensor Cores and cloud TPUs have put me in a bit of confusion.
Please help me pick an option. I will currently be working on CNNs.
Tim Dettmers says
For long-term work with CNNs I would recommend an RTX 2070 or RTX 2080 Ti if you can afford it.
Tim says
Google Colab recently upgraded its GPUs to the NVIDIA Turing T4. I posted a notebook that shows the upgrade below.
https://colab.research.google.com/drive/1eIAvJEHNnx-bHGGyf1bIU4h_07pW4W4t#scrollTo=aqSYfrO5oQFm
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf
Has anyone done benchmarks on this upgrade yet? Thanks.
Tim Dettmers says
The T4 is roughly in-between RTX 2060 and RTX 2070 in performance. So not the best GPU.
Laurene says
Can you please help me calculate a fair price if someone rented the following computing power from our PowerEdge server: use of 1 high-performance server node with 24 cores for 80 hours per month. If you could, please suggest the corresponding rates (per minute, per hour, and per week).
It would be most helpful if you could also help me estimate the same costs on a newer server with newer specs, like NVIDIA GPUs.
Thanks
Tim Dettmers says
I would look at cloud prices (AWS/Azure/Google) of similar hardware and compare the price this way. If you rent for a month you should pay much less (maybe half) to make it competitive with cloud services. Otherwise, it is not worth it.
Hannes van der Merwe says
HI Tim
Thanks for this article. I started my deep learning journey on a Dell laptop with a GeForce 845M running Windows, and that did not go too well. I worked my way through Deep Learning with Python by François Chollet, but my local machine was not always happy and ran out of GPU memory very fast. Google’s colab.research.google.com was a big help.
We are looking to buy a PC for deep learning. We are most interested in image classification, facial recognition, and number plate reading. We would later like to classify video sequences into known human actions, like running or jumping.
We also plan to dabble a bit with AR so that we can virtually define different areas of detection rules for live feeds. And lastly we will venture into autonomous drones/robots.
To start off we simply want to retrain an image classifier to better fit our solution, but when spending this kind of money on a device we would really like it to cover as much of our future training as possible.
The best card I could find at my supplier was a MSI RTX 2080 TI LIGHT Z – 11GB GDDR6, HDMIX1, DPX3
How far will that get me? We are currently using a Caffe model, but we are not really fixated on a specific solution yet. The other sample models I could find were slow to run, but the Caffe model ran at close to 17 frames per second.
Thanks
Hannes
Tim Dettmers says
The RTX 2080 Ti should enable you to train everything that is out there. If something is too big, use a smaller batch size or sub-mini-batches and update the gradient every k sub-mini-batches — should work like a charm!
Nick says
Hi Tim,
Your website has been very helpful for learning about deep learning. This fall I will be starting an MS in Stats program that is computer intensive. The program has some courses in basic deep learning, and they require the use of a laptop. I was wondering if you could provide some advice on what would be the best setup for a laptop to learn basic to moderate deep learning.
I plan to purchase a laptop with the RTX 2070. There are 3 laptop versions of this GPU: RTX 2070, RTX 2070 Max-Q 90W, RTX 2070 Max-Q 80W. Is the difference between these three variations going to make a noticeable difference in machine learning?
Most of the laptops I’m considering have (or will have) the i7-9750 CPU with 6 cores, but I may splurge on one with an i9-9880 with 8 cores. Is the extra money better spent on a bigger hard drive or on a faster CPU with more cores?
I plan to have 32 GB RAM on the laptop. Is this enough RAM to get started with deep learning?
I plan to have two SSD hard drives on the laptop. Ideally I would like two 1 TB 970 Evo Plus drives. One would be committed to Windows and the other to Linux. Is this an excessive amount of space for my beginner level of deep learning? Would two 500 GB drives be enough space?
The laptops I’m considering are the following:
TLX by Falcon Northwest
Aurora by Digital Storm
P75 Creator by MSI through Xotic PC
Raider by MSI through Xotic PC
Evo17-S by Origin PC
Any advice you can provide will be greatly appreciated. Thank You!
Tim Dettmers says
Two 500 GB drives will be enough for almost all deep learning applications. I would get the best GPU you can afford since you will use your laptop for quite a while. Get a regular RTX 2070; compared to the Max-Q versions it will give slightly better deep learning performance, though it will suck away more battery.
MM says
Hi Tim. I have a similar case of buying a laptop specifically for deep learning. I narrowed it down to 2 options: 1) the MSI GS65 Stealth 9SF and 2) the MSI P65 Creator 9SE. Both are quite similar (with an i7-9750H CPU and a 512 GB SSD), yet the first one has an RTX 2070 Max-Q while the latter has an RTX 2060. The first one costs about 150 USD more. Is it worth paying that extra money? Also, both have 16 GB DDR4 RAM. Do you think 16 GB is a reasonable option, or should I look for the 32 GB variants considering the graphics cards?
Tim Dettmers says
I would say the extra 150 USD is worth it. I think 16 GB is enough for most applications.
julian says
Hey Tim,
thanks a lot for the great blog and all the work you put into answering people’s questions.
I would like to replicate or run at least a similar benchmark as your word-level biLSTM ((2) For word and char RNNs I benchmarked state-of-the-art biLSTM models).
I couldn’t find a reference to the benchmark implementation in your blog.
Could you point me to it, or do you have more information on the benchmark you ran?
Thanks for your help
Tim Dettmers says
That sounds great! I used this repository: https://github.com/salesforce/awd-lstm-lm
Please let me know what you find — thank you!
ALEX says
Hi, I see the LSTM benchmark uses torch.nn.LSTM.
I don’t use PyTorch myself but TensorFlow, and I use TensorFlow a lot with LSTMs.
I noticed the vanilla LSTM implementation in TensorFlow is not very good: it uses a “for loop” over each timestep, and in that case it can’t fully utilize the GPU. For instance, my GTX 1660 runs at 75% with TensorFlow’s basic LSTM.
But there are other, more efficient LSTM implementations, like the block-fused LSTM in TensorFlow; it keeps my GPU at 95% utilization and is 230% faster.
So I wonder if you could run another kind of LSTM for these cards, like the block-fused LSTM?
Here is a repository that is well known in NLP, one of the best for the NER task: https://github.com/guillaumegenthial/tf_ner
Thanks a lot, I really appreciate your post. I am now thinking of buying a 2080 Ti, but I found it is slower than the 1080 Ti for most of my projects, which is really frustrating.
Should I just buy a 1080 Ti?
Tim Dettmers says
I think the standard PyTorch LSTM implementation uses a fused implementation, but I am not entirely sure. I will be a bit more careful in the next iteration of this blog post. However, I was also aiming at benchmarking code that most people would use rather than the most optimal code. This approach reflects the typical experience users would have with a specific GPU. Do you think this makes sense?
Raj says
What is the idle temperature of your GPUs? You seem to have put them so close to each other that they may not be able to get enough air circulation, I guess.
Tim Dettmers says
Depends on the cooler on the GPU and the room temperature. Usually, you can get 30-40C in a normal office; in a hot server room multiple GPUs next to each other might idle at 50-60C.
Atralb says
Hi Tim, Thx an awful lot for all your articles. These are genuinely the best available publicly.
I have a couple of questions:
– Would you buy an RTX 2070 for 475€ or a used GTX 1080 Ti for 650€?
– Do you think the RTX has enough memory for audio analysis (in particular music)?
– I am the “new to deep learning but serious about it” type of guy. Do you think having only 1 of those cards will be enough to get good experience quickly, or would you rather recommend buying 2-3 GTX 1060s for the same price?
Tim Dettmers says
– Definitely the RTX 2070!
– I am not sure, I have never worked with audio. Usually, you can get away with a low amount of memory if you use the right techniques such as 16-bit compute and gradient aggregation and so forth — so it might work well unless you are aiming to develop state-of-the-art models.
– I think a single RTX 2070 will give you a pretty good experience. I would recommend 2 RTX 2060 if you want multiple GPUs instead of GTX 1060s.
Can says
Hi Tim, thank you for this detailed blog post.
I’m not sure whether mixed precision training really helps to decrease memory consumption at the moment when using Apex. Normalization layers use fp32 and the optimizer keeps parameters in fp32 for stable results. Therefore, we almost have an fp32 replica of every fp16 tensor when training a deep CNN/MLP. This might even increase memory usage.
Mixed precision increases the arithmetic intensity of GEMM/Convolution kernels and allows these kernels to utilize the Tensor Cores on modern RTX cards. Thus, mixed precision only speeds up heavy lifting operations without any significant decrease in memory used at the moment.
Thanks.
Tim Dettmers says
The weights of the network usually constitute only a very small part of the memory for the network, thus 16-bit activations can help to reduce memory. If you do 16-bit computation but aggregation in 32-bits it will have no benefit — you are right — but generally, you want to avoid that anyway. If I run 16-bit transformers I can increase the batch-size by quite a bit — it is not double the amount, but I can increase it by half and more and use the same amount of memory. I am sure it is similar for CNNs.
Yaroslav Smolin says
Thank you for the great post. Just one question: as I understood it, to achieve GPU-to-GPU parallelization I need to have the same GPU architecture. But what does this mean? For example, the RTX 2070 and RTX 2080 Ti have the same Turing chip architecture, but you mentioned that they are not compatible. So basically GPUs must have the same name? I would really appreciate your answer.
Tim Dettmers says
I am actually not quite sure whether peer access can be enabled between different chips of the same architecture — I have never tested that! From what I can tell from the NVIDIA documentation, it might just work, but I cannot give you any guarantees.
PH says
Hi Tim Dettmers:
Thanks for the blog posts.
You wrote:
>>”…. I would never recommend buying an XP Titan, Titan V, any Quadro cards, or any Founders Edition GPUs.”
Can you explain why?
I am not sure where to post the following question, but maybe this is as good a place as any since you do not have a similar topic. Here goes…
I am trying to set up a 2-to-4-GPU deep learning system using Ubuntu 18.04.
One pair of GPUs is GeForce GTX 1080 Ti.
The NVIDIA drivers I have used were 390.xx and 418.xx.
They both gave me the same problem, as described below.
Here is the problem:
When I have just ONE GPU, the display works, and the OS boots up just fine.
After installing TWO GPUs (and I have done this multiple times),
when I try to load the OS, the display is blank.
But if I remove the 2nd GPU, everything works again!
Note that the display was still connected to the 1st GPU when I have two GPUs setup!
What do you think is causing this problem?
Since I am using my GPUs for deep learning,
I only need one display connected to ONLY one GPU..
I did not see you have a blog for step by step installation of two or more GPUs
on a Linux system.
Do you know of one or more websites where there are step by step instructions
on installing two or more GPUs for deep learning?
Thanks for sharing your knowledge.
Tim Dettmers says
This is strange; usually it just works. I guess there is something wrong with your NVIDIA driver installation or with your X server config (Xorg).
Ed Austin says
Great article.
On a budget, I purchased a GTX 670 with 4GB, which, although limited, at least has a decent memory capacity. I was thinking of adding a second GTX 670 with another 4GB later; combined CUDA benchmarks (assuming 100% efficiency) would give me 1060-level compute capacity with 8GB – wishful thinking – but are my assumptions too wacky?
Thanks!
Tim Dettmers says
Yeah, it does not really work like that. I would not recommend GTX 670 cards as they will be quite slow and the memory will not be sufficient. Adding a second GPU does not double your memory in most cases (only if you use model parallelism, but no library supports this well enough). You would be better off with a GTX 1060 or, even better, an RTX 2060 if you can afford it.
SIP says
Hey Tim, thank you for your in depth post. I do a lot of deep learning for my job and am building a machine to do personal experimentation outside work. I do a lot of vision stuff and I’ve been doing some RL stuff too. Previously, I’ve done most of my work with 1080 ti’s but from time to time I’ve run into issues with memory. I’m trying to decide whether the difference in performance and particularly memory between the 2070 and 2080 ti would be significant for my use case. I preferably would want 2 gpu’s so I could train multiple models at the same time and use data parallelism. I can afford 2 2080 ti’s but would rather go with 2 2070’s if my use case wouldn’t take advantage of the 2080 ti’s. I additionally will likely be going to grad school in 1-2 years so I don’t know if it makes more sense to get something cheaper now and then upgrade then. Do you have any suggestions?
Tim Dettmers says
If you ran into memory problems with GTX 1080 Tis then RTX 2070s might not have enough memory in some instances. However, you can use techniques where you chunk each mini-batch into multiple pieces (or in other words, you aggregate smaller mini-batches) to do a single update. This saves a lot of memory. If you have not used this technique before, I would go with the RTX 2070 and use this technique together with 16-bit. Otherwise, go for a single RTX 2080 Ti.
Aron Boettcher says
So, regarding your list of GPUs to avoid: is the argument here based solely on the price/performance relationship?
Also, how relevant do you think the Tensor core technology is? If I can only get one card, is it worth it to go with the Titan RTX (rather than doing multiple cards in SLI?) for the tensor cores?
Tim Dettmers says
Yes, I do not recommend GPUs which are a waste of money. You can get the same GPU for less money, no reason to buy these expensive ones!
Tensor Cores are good, but all RTX cards have them. You should buy a Titan RTX only if you need the additional memory.
andrea de luca says
Let’s compare two 2080 Tis (or even two 1080 Tis) with a single Titan RTX, and let us do it purely on a memory basis (that is, we neglect the training speed). Since you can have more or less the same memory as the Titan with two relatively inexpensive cards, I’d like to know whether there are use cases in which a model necessarily has to reside in a single card’s VRAM. In other words, are there use cases that one can handle with the Titan but NOT with two 11GB cards? Thanks!
Aron Boettcher says
Thanks for the response Tim,
Since I’m looking to build a machine that my company will be paying for (it will be my primary machine), it is very unlikely that I would get it upgraded in the future, or that I would add an additional card (our IT is very backwards, and although I can and have built my personal PCs, I won’t be able to change my work machine without a lengthy and expensive involvement with IT).
In this sense, I’m less concerned about the price than I am about keeping the configuration simple enough that the IT/finance people don’t get confused, and about it being forward-compatible enough that I’ll be able to use this machine for at least a few years. Likewise, with the models we’ve been building, we’re already hitting memory limitations.
Just letting you know these things so that you or your audience can understand that sometimes the decision is more complex than just price/performance.
Things would be very different if it were my personal machine or my company wasn’t intending to foot the bill.
Fred Chang says
Hi,
It’s my first time buying a GPU. In this article you write, “However, note that through 16-bit training you virtually have 16 GB of memory and any standard model should fit into your RTX 2070 easily if you use 16-bits”, but the spec of the RTX 2070 is 256-bit. Does that mean the RTX 2070 can run with 16-bit? Thanks!
Tim Dettmers says
256-bit is the width of the memory controller; 16-bit refers to the precision of the compute units.
Fred Chang says
Thanks! Tim, you are really a great helper!
Fred Chang says
Tim:
Thanks for the reply. I have another question: is it necessary to have two GPUs, one for display and one for computation? Thanks!
Tim Dettmers says
No, it is not necessary; I use the same GPU for display and computation.
Mike says
Hi Tim
Very nice article. Can you comment more on your dislike of Quadros? I ask because this is obviously a fast-moving field and Dell now seems to be doing an excellent workstation deal with either single or dual Quadro RTX 6000 cards and lots of memory (24GB), e.g.
https://www.dell.com/en-uk/work/shop/desktop-and-all-in-one-pcs/precision-7920-tower/spd/precision-7920-workstation/XCTOPT7920EMEA?selectionState=eyJPQyI6InhjdG9wdDc5MjBlbWVhIiwiTW9kcyI6W3siSWQiOjMsIk9wdHMiOlt7IklkIjoiNjRHNFIifV19LHsiSWQiOjYsIk9wdHMiOlt7IklkIjoiR0hFSTNVOSJ9XX0seyJJZCI6MTEsIk9wdHMiOlt7IklkIjoiRzVLQVkyMyJ9XX0seyJJZCI6MTQ2LCJPcHRzIjpbeyJJZCI6IkQ0MTEwIn1dfSx7IklkIjozNzIsIk9wdHMiOlt7IklkIjoiTk9PUFQifV19LHsiSWQiOjQxMiwiT3B0cyI6W3siSWQiOiJESEVBVFNLIn1dfSx7IklkIjoxMDAyLCJPcHRzIjpbeyJJZCI6IjUxODk0NSJ9XX0seyJJZCI6MTAwMywiT3B0cyI6W3siSWQiOiJVQlVOVFUifV19LHsiSWQiOjIwMDA3NiwiT3B0cyI6W3siSWQiOiI3NzYzMjcifV19XX0%3D
Is it just a matter of price for such GPUs, or are there bigger issues?
Thanks
Mike
Tim Dettmers says
Yes, it is just the price. These GPUs are very cost-inefficient. Personally, I would also not buy server hardware with less than 4 GPUs.
Marv_ says
What a nice article!! Based on your advice, I bought a 2070 recently. Thank you!
I am wondering why RTX cards are not good at word RNNs? In Figure 2, they do not seem to do well in that area.
If I want to do some DL projects with two cards, would you recommend the 1660 Ti or the 2060? I am shifting from CV to NLP.
Waiting for your reply. Thx again~
Tim Dettmers says
I am not entirely sure why this is the case. I speculate that the decrease in shared memory per SM from 96 KB (Pascal) to 64 KB (Volta/Turing) decreased the performance of small matrix multiplications and thus slows down RNNs that use short sequences of length 100 or below.
Marv_ says
Wow. Your reply came quickly. Thx~
I am wondering whether a cheap Titan X Pascal is a good choice now. I don’t know how cheap is cheap enough; 500-600 I guess?
If I am only looking at the new Turing cards, would you recommend the 1660 Ti? There are no Tensor Cores in it.
Thanks again~
chanhyuk jung says
What is better in terms of cost efficiency; cheapest rtx 2070 or highest boost clock rtx 2070?
Tim Dettmers says
Cheapest RTX 2070 by far.
Aaron says
Scaleway offers Cheaper Cloud Products Compared to AWS & Google.
They also have a GPU Instance https://blog.scaleway.com/2019/gpu-instances-using-deep-learning-to-obtain-frontal-rendering-of-facial-images/
Tim Dettmers says
Looks like a good alternative, but I have no time to evaluate it in detail. Note that AWS, Azure, and Google offer more than just a GPU for a low price, but if one is fishing for cost-performance just in terms of compute this might be a good service.
James says
Tim, thanks for the article. I’m just starting out with Keras. Would you recommend RTX 2060 or GTX 1060?
Tim Dettmers says
RTX 2060 if you can afford it. If your budget is tight, go for a GTX 1060.
Zack says
Hi, I’m a degree student currently doing a project on detecting tree disease from leaves, so my dataset is pictures of leaves with diseases and some normal ones. Which one is better, the GTX 1080 Ti or the RTX 2070? I will be using CNNs.
Tim Dettmers says
I would go for the RTX 2070 and learn how to do 16-bit training in your training framework.
Sachin says
Google offers the K80 card as a GPU option when you configure a cloud VM. Is it worth choosing it, when what you want to do is train LSTMs and Transformers?
Tim Dettmers says
Yes, the K80 is a good card for that. Whether it makes sense to use it compared to other GPUs depends on the price. For LSTMs a K80 works very well, but for Transformers the price should be at least 3 times better than for V100 GPUs, or otherwise a V100 is more cost-efficient.
Sachin says
Thank you Tim
Mikołaj says
Tim, I love your work. But the question for me is still whether to choose the RTX 2060 or the GTX 1070, as they are priced similarly?
The main problem with the 2060 is memory, which is only 6GB. I am planning to do NLP stuff that may require big dictionaries, and I am not sure I would be able to process them even with the addition of mixed precision (FP16 support). You advise treating FP16 as an additional 50% of memory because not everything can be done without FP32, but that gives me a *possible* 1 GB more than the 8GB I get for sure with the 1070. Is it worth it?
The second thing: going FP16 with the 1070 should also give me more memory, but not necessarily better performance (as GTX cards are not optimized for that), right? So overall, if fitting the model in memory is most important for me, is the 1070 the better choice in this price range?
Tim Dettmers says
You are correct in that a GTX 1070 with 16-bit will yield an additional memory benefit. If you really think the 8 GB will not be sufficient then going for a GTX 1070 might be the right choice. However, I would probably go for an RTX 2060, use 16-bit and use a small batch-size and aggregate the gradient of multiple batches before doing the weight update. Note that even with this it will be difficult to train standard-sized or big transformers. You would also run into problems when you use a GTX 1070 for such big models, but with 16-bit, small batches and gradient accumulation you might be able to fit and train a big transformer.
Mircea Giurgiu says
What about RTX8000?
Tim Dettmers says
Quadros are very cost-inefficient. I do not recommend them.
Zhihui Chan says
I am a senior student from China. Thanks for your helpful blog, but I still have a question. When using multiple GPUs, for example two different ones such as one RTX 2080 Ti and one GTX 1080 Ti, what will happen?
Tim Dettmers says
Peer-to-peer GPU communication will not be available so you cannot do transfers like these: RTX 2080 Ti -> GTX 1080 Ti. Instead you need to make transfers like these RTX 2080 Ti -> CPU -> GTX 1080 Ti. This can make parallel training quite slow. So for parallel training you will need two GPUs of the same kind.
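If you want to check what your own machine supports, a quick sketch in PyTorch (assuming a CUDA build and at least two GPUs installed) is:
import torch

# Reports whether the driver allows direct GPU-to-GPU (peer-to-peer) copies
# between device 0 and device 1; mixed card models usually report False.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.get_device_name(0), "<->", torch.cuda.get_device_name(1))
    print("Peer-to-peer possible:", torch.cuda.can_device_access_peer(0, 1))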
mnawar says
Thank you for your great post.
What do you think I should go with in March 2019?
I mainly work in deep reinforcement learning and need to upgrade to a new 2070 or a used 1080 Ti; they are almost the same price in Egypt. I would buy either a 2070 with international warranty from Amazon or a used 1080 Ti with local warranty.
My two years 1060 is bought from Amazon and I didn’t face any problems with it.
I’m biased towards 2070 only to get hands-on with 16-bit models and it may be more future-proof.
Tim Dettmers says
I agree that the RTX 2070 is a little better, but if you need to import it the costs might be steeper. The GTX 1080 Ti is also an excellent card and probably good for another 1-2 years.
mnawar says
No, the import fees are included in the prices. So should I get the 2070?
I’m also running a Z400 workstation with a W3530 and have no problem getting a W3690. I’m not sure if I should upgrade the CPU and PSU or build a complete PC for the new GPU, as the Z400 has PCIe 2.0 and a maximum memory of 24GB.
Zoran says
Hi,
How much slower should a GTX 1050 Ti be than a GTX 1080? I have built, for the first time, a Windows-based system with a 1050 Ti and a Core 2 Quad CPU, and I find that the GPU is only 5 times faster than the CPU, while the same test with a GTX 1080 is around 70x faster than the CPU. I expected the GTX 1050 Ti to be at least 20x faster than the CPU, because I think the 1080 should not be more than 3 times faster than the 1050 Ti.
Tim Dettmers says
Yeah, those numbers do not sound quite right. They are close, but still a bit too far off. I am not sure what is happening. A GTX 1080 should be about 5-8x faster than a GTX 1050 Ti. I cannot think of a scenario where the GTX 1080 is 14x faster compared to a GTX 1050 Ti. Is the GTX 1080 using the same CPU?
Damian says
I am thinking of building a budget PC with a few (4-7) cheap GPU cards. Do you think, for example, 6x 1060 3GB + a 120GB SSD + 8 or 16GB RAM and a cheap processor (dual CPU?) will be OK for AI/deep learning?
Tim Dettmers says
I would stay away from dual CPU motherboards etc. Just get a regular motherboard and 4 GPUs, that makes a much more solid system with less problems.
Andrea de Luca says
Agreed, but you said that 8 lanes per card are a bit limiting if you have 4 cards, and boards equipped with PLX/PXE are not viable due to nvidia drivers issues..
Nikos says
Linux drivers were not affected last time I checked, i.e. they worked fine with the PLX 8747. However, I’ve personally lost confidence in NVIDIA, since it changed the rules after the products were purchased. I’m still on June 2017 drivers to use 4x GTX 1080 Ti cards with an Asus X299 Sage (which uses PLX), and I’m in a constant fight with Windows, as it sometimes upgrades the drivers without my consent. Hence, I can’t recommend a mobo with PLX.
Tim Dettmers says
8 Lanes per card are totally fine if you run on 4 GPUs.
andrea de luca says
Tim & Nikos: Thanks. Given this, it seems that the 1080 Ti is still one of the most attractive GPUs in terms of price/performance ratio. Note that you can train in FP16 even on a Pascal card (your effective VRAM will be doubled; you’ll just get a more modest speedup, which amounts to ~10-15%, for example on ResNet-50).
For something like 2000 euros, you can get four cards and be more competitive than with two 2080 Tis or a single Titan RTX. Tell me if you concur…
Bull Shark says
I can get a tesla M40 24GB for 800. I can also get an rtx 2080ti for 1100. Which one would be the better choice? Given that in the case of M40 I take care of the thermals.
Dale Smith says
Thanks for the very interesting comments.
We just purchased an Intel Compute Stick 2 for $100. Being a startup, that fits our pre-seed budget.
Does anyone have benchmarks comparing this stick to deep learning on a RTX 2070 or a GTX 1060?
This is a pretty good summary of a case study using the NC stick. https://software.intel.com/en-us/articles/detecting-invasive-ductal-carcinoma-with-convolutional-neural-networks
Tim Dettmers says
If you take the theoretical maximum compute of an Intel Compute Stick 2 and compare it to the real/practical compute of an RTX 2070, the Intel Compute Stick 2 comes out at about 50% of the speed of an RTX 2070. The real number is probably closer to 25%. On top of this, you should consider software: Intel software is terrible and I would not recommend the Intel Compute Stick 2 for this reason. However, if you are in a low-watt setting, the Intel Compute Stick 2 might be a reasonable option if you are willing to accept software nightmares.
Pengfei Zhang says
Hi,
Can I use a SUPERMICRO SuperO MBD-C9Z390-PGW-O LGA 1151 (300 Series) Intel Z390 HDMI SATA 6Gb/s USB 3.1 ATX Intel motherboard, which has a PLX chip, to let my i7-8700K support four 1080 Tis for machine learning?
Cheers!
Tim Dettmers says
That could work. Just make sure that you have some form of confirmation that this setup actually works and then you will be fine.
Pengfei Zhang says
Hi, thx for your post, it really helped lots of people!
I’m building a machine for computer vision, especially video analysis, so memory is a big issue for me. Considering memory/price, I decided to upgrade my 2x 1080 Ti machine to 4x 1080 Ti (the 1080 Ti can do fp16 as well, just slower…).
I have an i7-8700K CPU and a Prime Z370-A (which supports 2x 8x PCI-E). I just wonder: can I replace the motherboard with something like an ASUS WS Z390 PRO and keep my CPU? From the comments here I realized it only has 16 lanes, so it seems it can only fit 2 GPUs. However, I’m wondering whether I can use it for 4 GPUs, maybe at lower speed, and how slow that would be? Otherwise I need to sell it and get an AMD combo, and that’s another big amount of money 🙁
Tim Dettmers says
Indeed, getting more GPUs instead of faster ones has the problem that you need the CPU to support it. Sometimes CPUs support running 4 GPUs with fewer lanes, but I am not sure if that works with an i7-8700K. Best is to look for other people that use a 4 GPU setup with that CPU. Otherwise, it might make sense to upgrade to 2 more powerful GPUs and keep your CPU and motherboard.
yang cd says
How about 1660 ti ?
Ugo says
Yeah, it would be nice to include the 1660 if it’s not too hard! (or just an overall comment on it)
Michel Rathé says
Hi Tim,
Since multi-GPU setups actually come with a lot of challenges, could the upcoming Asus RTX 2080 Ti Matrix, with its Infinity Loop cooling, be the long-awaited optimal solution?
Thanks again for your constant expertise,
Muhammad Fazalul Rahman says
Hi Tim,
First of all thanks for the insightful article. To date, this seems to be the most reliable article to find for comparing GPUs for deep learning.
I was considering getting 4x RTX 2080 Ti for a workstation for my lab, but then I came across an article (https://www.titancomputers.com/Articles.asp?ID=258) that compares workstation and desktop GPUs. In short, they were talking about the fact that workstation GPUs, while costlier than their desktop counterparts, are built for stability and efficiency, and are meant to run at 100% for several days, while desktop GPUs are not meant for it. Taking into consideration the recent news about the dying RTX 2080 Tis, do you think that they can withstand week long training tasks?
Tim Dettmers says
These vendors have a self-serving incentive to tell you that workstation cards are designed like that — they are not. They are the very same chip as consumer cards. What changes in workstations is often that (1) these cards have no fan but larger passive cooling elements, and (2) workstation servers have loud, strong airflow which transports away the heat effectively. Thus the real reason is the strong airflow through the case (and not the GPU itself), and this is difficult to achieve with consumer GPUs. Especially the RTX 2080 Ti has problems with cooling, but there are some good cooling solutions which work and do not require spending the extra money on server hardware. I might update the blog about this next week or so.
Jonathan ALIBERT says
Hi Tim, great post, thank you very much.
Do you know what bandwidth is needed between GPU and CPU/RAM during training, depending on the discipline studied? I’m curious about using one GPU on PCI Express 3.0 x1 (984.6 MB/s) to do single-GPU training (or multiple GPUs with one model per GPU, not distributed models, where it should be totally inefficient). It should work on paper, because CUDA works perfectly in this configuration.
If you think these tests should be run, can you give me some benchmarks to do the job ?
Tim Dettmers says
The bandwidth is usually quite low if you work with larger models. Smaller models with large inputs usually need larger PCIe bandwidth. Never seen 1x PCIe in deep learning. Would be curious if it works for you. Please let us know if you have some results.
jimmy Gu says
Hi,Tim
The RTX 2060 was released almost one month ago. What about this card compared to the 1070 or 1070 Ti?
Also, several RTX card users (they are all gamers) report that their RTX cards (including the 2080 Ti, 2080, and 2070) have blurred-screen issues; there are over 100 reports on Chinese hardware forums. Have you heard about RTX 20-series issues?
Jason says
Tim,
I can’t thank you enough for your article and your updates to it that keep it current. I am new to machine learning and think your recommendation of multi-GPU for gaining feedback faster is great. It very much is a psychological gain and makes for quicker learning. I bought an RTX 2070 to add to my system with a GTX 1080 Ti. I am having trouble running a model on one GPU while using the other GPU for research and running another model. I use separate Jupyter notebook instances, and specify which GPU to use by nesting the model build and training code in a "with tf.device('/gpu:1')" block. I always get errors when I try to train on the GPU not already active. Can you point me in the right direction for learning how to use two GPUs at the same time for different models?
Tim Dettmers says
Difficult to say where the problem is as I am not using TensorFlow. Have you tried setting CUDA_VISIBLE_DEVICES for each process so that each notebook only sees one GPU? That might help.
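As a rough sketch of the environment-variable approach (set it before TensorFlow is imported; the listing call shown is the TF 2.x API):
import os

# Pin this notebook/process to the second physical GPU *before* importing
# TensorFlow; inside the process that GPU then shows up as '/gpu:0'.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should list exactly one GPU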
Ganesh says
Is a GTX 1070 max Q 8GB > RTX 2060 6GB for deep learning?
Also how much better is the 1060 max q than 1070 or 2060? The normalised ratio is hard to determine as the difference between these cards get squashed by the TPU. Thanks!
Tim Dettmers says
The RTX 2060 probably has better performance, but the memory can be a bit small if you use 32-bit.
Anthony Cai says
Hi, Tim
I plan to upgrade my laptop with an eGPU, but the CPU is just an i5-5200U (16GB RAM, 500GB SSD). Is it worth upgrading this laptop with a GTX 980 Ti eGPU? Thanks!
Tim Dettmers says
Your existing CPU and RAM are a good match for a GTX 980 Ti as an eGPU. Deep learning performance should be good.
Joshua Marsh says
Hi Tim,
Thank you for the phenomenal article!
I’m currently in the process of deciding on a GPU configuration for an AI build geared towards training models that take about a week on 8 Tesla v100s. Since my budget is limited by scholarship money, I’m currently trying to decide between:
2 x RTX 2080 Ti
4 x RTX 2070
Initially, I was leaning towards the quad RTX 2070 setup due to increased flexibility in simultaneously training smaller models etc, but the complexity of setting up the custom water cooling system (I’m not sure the 2070 even has a water block yet) seems a bit daunting for a first time pc builder, plus the fact that I would need to replace the entire water cooling system when I eventually upgrade to 4 RTX 2080 Ti’s. Water cooling is a must because it will be in my dorm room near my bed and I don’t think I will be able to bear the 24/7 noise of the blower style fans.
So that leads me a dual RTX 2080 Ti setup. The main benefit seems to be that it won’t require a complex water cooling system (initially), I won’t need any noisy blower fans, and that it is easy to upgrade to a quad setup (with water cooling). I’m just concerned that I’ll be losing an unacceptable amount of performance and flexibility in comparison to the quad RTX 2070 setup. I’m also generally concerned about the long term reliability of a custom water cooling system. I’d like to be able to leave it running for weeks on end and not have to worry about anything.
So yeah, that’s my dilemma #firstworldproblems. Any insight you could give that will help me decide would be incredibly appreciated. Thank you so much!
Tim Dettmers says
You analyzed the situation very well — it is just a tough choice. I think, however, that the stability of two RTX 2080 Tis and avoiding all the mess might be an advantage. You could also get a big case, buy some PCIe extenders, and then zip-tie two air-cooled RTX 2070s to different locations, thus avoiding the heat issues with 4 GPUs. Some here at the University of Washington use this solution. But I have no experience myself with the zip-tie method, so I don’t know yet if this is a good solution.
Panand says
Consider also the heat that is generated by those GPUs. I have two GTX 1080Ti’s with waterblocks and they heat up the room quickly. Going for 4xRTX 2070 and then switching to 4xRTX2080ti requires selling those 2070s and their waterblocks and setting up 4 GPU watercooling up front, or using PCIE extenders. Used waterblocks won’t be of much value, as reliability is important.
Maybe go with dual 2080 Tis and later add a custom water loop. A two-GPU water loop should be a gentler introduction to water cooling than four.
Tim Dettmers says
This is great information and feedback! Thank you for sharing!
ThanosPAS says
Hi Tim,
Thank you for all these very informative articles. You are a point of reference! I am going to study Bioinformatics in my Master’s and I want to focus on ML and DL in these 2 years. Of course these tasks will be only a portion of what I will be required to do, but I believe there won’t be anything more demanding than these tasks. Since I need to move a computer around with me, I am going to buy a laptop, and using a cooling base underneath it, I was thinking of doing my model training and so on on this machine. I am a little worried about the weight, but you can’t have it all, I guess. The specs I am thinking of buying are the following:
i7-8750H or i7-9750H (if it comes out in Q2 ’19 – more powerful cores, no Hyper Threading)
RTX2070 (the laptop I am eyeing draws max ~115-120 watts for the GPU)
32 GB RAM (2666mhz)
970 EVO PLUS 2 TB NVMe (it will be released in April)
gaming cooling system – Cryonat paste
17.3” display
1.Do you find this rig adequate for someone like me just starting ML training?
2. I’ve read in your article that downclocking doesn’t play a big role in ML performance, but do you think undervolting the CPU in case of throttling improves the overall ML/DL performance/experience like it does in certain gaming scenarios, for example? (if you need to keep clocks high for single-threaded performance)
Thank you for your time and patience!
Tim Dettmers says
You can also consider buying a desktop and a small laptop. Then you can always move around and ssh into your desktop when you need your GPU. Another option is to get an eGPU, but then you can only run at one place and not move around. The “mobile but heavy laptop” is also a good solution though it is definitely adequate for a large proportion of deep learning problems and models. If something does not fit into the GPU memory, you can always get a cloud GPU instance from somewhere to do your work while you use your laptop for prototyping. All of these solutions have advantages and drawbacks and it is a bit of a personal choice. I personally would get a desktop and ssh into it with my laptop.
ThanosPAS says
Thank you for taking the time to answer! I appreciate it 🙂 I will consider your suggestion 🙂
Khaled Mohammad says
Hi Tim,
I am really confused: should I get the RTX 2060, as this is what is in my budget, or should I get something else? Please can you push an update to this article including the RTX 2060!
Thank you very much!
Tim Dettmers says
Will do this sometime the next weekend.
Tim says
I haven’t gotten my hands on an RTX card yet. Is there a way to force Keras to use Tensor Cores or utilize fp16 that you are aware of? Does the following work, or is there another method? Thanks.
from keras import backend as K
K.set_floatx('float16')
Val Schmidt says
Hello Tim!
Thanks for writing such an informative post. Really terrific!
I have a naive noobie question.
Can you comment on strategies for training a CNN whose final application will be to do detection/classification on a platform with constrained (lesser) compute resources than the training machine? Being new to this we are fearful that we will purchase a high end GPU, tune our algorithm to optimize its capability and speed for training on that card, and then find that we cannot fit the model into the system we’ll use to deploy it in the field in real-time. Is there a strategy to ensure we don’t have this problem you could recommend, other than using the deployment machine for training, which would work but presumably be much slower.
Thanks!
-Val
Tim Dettmers says
Usually, you can use a CPU for inference after you have trained a model; since you will often be processing one sample at a time, a CPU is quite good for this task. So one way would be to see if the processing time on your CPU is acceptable. If it is, then everything is fine. If not, you need to make the model smaller through distillation/truncation/sparsification/quantization or by simply training a smaller model.
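As one hedged example of the quantization route, recent PyTorch versions offer dynamic quantization, which stores Linear/LSTM weights as int8 for CPU inference. A minimal sketch with a hypothetical stand-in model:
import torch
import torch.nn as nn

# Hypothetical small model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: Linear weights are stored as int8, shrinking the model
# and often speeding up CPU inference; activations stay in floating point.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same output shape, smaller and usually faster on CPU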
George M says
I’ve spammed this thread quite a bit now without knowing much at all, but I really am passionate about figuring out the FP16 puzzle before investing in an RTX card. I think I have found a data point strongly against buying, if you happen to use the esteemed fastai library (which I do). Here’s a relevant thread from their forum:
https://forums.fast.ai/t/how-to-install-rtx-enabled-fastai-cuda10/29092/23
I don’t guarantee I have this right, but at this point it sounds like there is no way to train in FP16 without halving the batch size, which pretty much defeats the purpose. Add in the half-speed FP32 accumulate, and it sure sounds like it isn’t worth it to buy. Again, I assume this is specific to fastai only – there may still be benefits if you don’t use that. Thoughts welcome.
G. says
Good news: Here’s a followup that negates the above. It appears the OP made a mistake somewhere, because others are getting fine results. Curiously, FP16 shows improvement even on a 1080ti, though not as much as on a 2080ti.
https://forums.fast.ai/t/comparision-between-to-fp16-and-to-fp32-with-mnist-sample-on-rtx-2070/35693
The usual George says
In counterpoint to the above, I have found the following:
https://forums.fast.ai/t/comparision-between-to-fp16-and-to-fp32-with-mnist-sample-on-rtx-2070/35693
It appears the OP may have made a mistake, because others are reporting FP16 working fine. Batch size was not 2x but closer to 1.8x.
Tim Dettmers says
The problem with the benchmark is that PyTorch allocates memory but does not free it, so it can reuse it in the future (saving the call to cudaMalloc for higher performance). To actually release the memory you need to call a function in PyTorch. You will not see any speedups from 16-bit in this case since the model is just too small and not compute-bound.
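To illustrate the caching behavior, here is a minimal sketch, assuming a CUDA-enabled PyTorch install:
import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes held by live tensors
del x
print(torch.cuda.memory_allocated())  # drops, but nvidia-smi still shows the cached block as used
torch.cuda.empty_cache()              # returns the cached blocks to the driver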
Joakim Edin says
Great post! I am considering buying 2 RTX 2070, but I discovered that they do not support NVlink nor SLI. Is this a problem when using Deep Learning?
Tim Dettmers says
NVLink and SLI are mostly gaming constructs. You will be fine and you should have no problems parallelizing across your GPUs.
Tim says
I see RTX 2060 and Titan RTX benchmarks are up. https://www.phoronix.com/scan.php?page=article&item=plaidml-nvidia-amd&num=4
Tim Dettmers says
Thank you for the link! Unfortunately, these are OpenCL benchmarks which do not utilize Tensor Cores — so not the best benchmarks to compare different RTX cards.
Ethan Zou says
Could you please also add RTX 2060 to the comparison? I’m very curious how this card performs since I’m short of money but still want to try the mixed precision.
yang cd says
How about the 2060 compared with the 2070?
Mircea says
Thanks for the comprehensive guide, Tim. I’d like to weigh in on the blower-style cooler recommendation. I tested an MSI 2070 Aero (blower-style) and when installed next to my 1060 it reached 80 degrees Celsius and the puny fan was spinning at more than 3000 RPM. Needless to say, I returned it, and now have the Asus 2070 ROG, which has three fans. The card does not exceed 64 degrees Celsius and the fans don’t exceed 1900 RPM. It is phenomenally quiet despite the fact that it is factory overclocked.
So my recommendation is to NEVER get a blower-style card unless you can deal with the whiny fan noise. A large case with good airflow, two fans in the front and one in the back (creating more pressure inside to force dust and hot air out) is the way to go.
Tim Dettmers says
Sorry for your experience with a blower-style fan and thank you for your feedback. I have had a different experience with blower-style fans with GTX series cards. I think the RTX cards just might be a bit different here, which is surprising since they have lower TDP. I consider adding your experience to the next blog post update.
George M says
Mircea, if I’m understanding you correctly, this noise is called coil whine and may be caused by the individual card that you had, not necessarily the fact of a blower model. I have encountered this situation on non-blower cards as well, including AMD cards. Unfortunately, it is fairly common and there is no way to know if you have it or not, without actually testing the card.
soldierofhell says
Hi, but what about size? The ROG is 4.89 cm thick, which is more than 2 PCI slots, so on a typical TR4/X399 motherboard you can’t fit 4 such GPUs, am I right?
Actually, all non-blower cards seem to be more than 2 slots wide.
Tim Dettmers says
Never thought about that. Indeed, if a GPU is larger than 2 PCI slots I would not buy it.
Rob says
Hi Tim,
Your last update was last year, and I am just wondering if you think Vega 20 or the MI60 was able to reach the goal of competing within range of dedicated Tensor Cores?
And thank you for your blog post, it was super informative. I’m hoping to do more in this space with my 1070 before I go out and buy something.
Tim Dettmers says
I think the new AMD cards might be competitive, but I need to see actual benchmarks to come to a definite opinion.
Sadak Vali says
Thanks for this article
What are the advantages of buying an RTX 2060 6GB over a GTX 1060 6GB
in terms of 16-bit training, model training time, value for money, and in general?
Tim Dettmers says
The RTX 2060 has 16-bit training, faster model training, I have not calculated the value for money yet, but in general, it is faster but more expensive.
Martin Mocko says
For me RTX 2060 is also a very interesting card. Relatively cheap, but should offer FP16 training with 32bit accumulate in the dot product, so if I definitely need FP16 could this be the right choice? (I don’t have too much money to spare)
Tim Dettmers says
Yes, it is probably the best cheap but fast card right now. It is a perfect card to get started with deep learning.
Sadak Vali says
Can I do 16-bit training on GTX 1060 6GB?
I am planning to buy one of these 2 cards;
please suggest which one to buy.
George M says
While you can technically do 16-bit, there will be no appreciable speedup, because FP16 is deliberately crippled on the 10-series cards. With the 20-series, FP16 essentially runs at full speed, except FP32 accumulate will run at half speed. So it will be faster than full FP32 training, but not by exactly double.
Bull Shark says
Hi Tim, AMD’s Radeon VII has recently been announced. What are your thoughts about this card? Two important specs are the 1024GB/s memory bandwidth and the 13/14 TFLOPS. It also features 16GB of HBM2 with a 4096-bit memory bus. This all sounds pretty good to me, if I don’t consider the lack of deep learning library support for AMD. What do you think?
Another question: there’s this deep learning library called PlaidML that supports some AMD ROCm hardware. Do you know whether they will be supporting Vega 20 (so the Radeon VII)?
Thanks in advance !
Yaohua Guo says
Hi Tim,
Thanks for a great article, it helped a lot.
Recently I have been thinking about purchasing a GPU, and I have some questions about the choice of memory. Is the RTX 2070’s 8GB of memory enough for object detection (YOLO, SSD, Faster R-CNN) and NLP (Transformer) training tasks?
A friend of mine trained Fast R-CNN on a GTX 1080 Ti; if the batch size is greater than 2, it runs out of memory. Is the RTX 2070’s 8GB enough for these training models? For which tasks is it enough, and for which is it not?
Thank you.
Tim Dettmers says
If you use 16-bit then theoretically the models that fit into a GTX 1080 Ti also fit into an RTX 2070. However, it is often not that straightforward since frameworks often store 32-bit weights alongside the 16-bit weights to do more accurate updates. You can ask your friend to use 16-bit mode and weights on the GTX 1080 Ti, and then you will know how much memory the code consumes and whether it fits into an RTX 2070.
Michael says
I hope that comparisons and diagrams for RTX 2060 and RTX Titan will appear as soon as possible, a very nice article.
Houssem MENHOUR says
Hi,
Thanks for the continuously updated guide. With the release of the AMD Radeon VII and its 16GB memory, is there any chance for it to perform well enough in DL tasks? I know that it won’t beat Cuda and that the software support is not quite there yet, but I’m curious.
Tim Dettmers says
It probably performs very well. However, since I do not have the GPU myself I will only be able to discuss it in detail if benchmarks are released.
Ken Fricklas says
It’s pretty much identical in real use with the RTX2080Ti (both Inception3 at ~190). And it’s much less expensive.
ABHINAV MATHUR says
The guide is detailed and enabled my organisation to buy the perfect GPU server for AI workflows; we bought an 8x Tesla-based server and it is quite powerful.
Paul says
Hello, Tim.
I carefully read your article and deeply appreciate it.
Now I am considering buying my first deep learning system and I really need your help!
If you had about $3000 for GPUs, what would you buy?
1. One RTX Titan
2. Two RTX 2080Ti
3. Four RTX 2070
4. Other option?
(I am interested in analysis of medical imaging or clinical photographs)
Thank you very much and happy new year!
Tim Dettmers says
Medical imaging and clinical photographs usually need quite a bit of RAM. I would either go with two RTX 2080 Tis or one Titan RTX. RTX 2080 Tis are most cost-efficient, but the Titan RTX might be useful in some cases.
Rahul S says
Tim,
I’ve just got a new rig setup. An AMD Threadripper 1920, with 32GB DDR4 and a GeForce RTX 2070. I’ve got the software setup and stable now on Linux Mint 19.
Do I need to perform any “tuning” to extract the performance I can expect from this setup? Or is it simply plug and play? I haven’t done any overclocking or setup apart from the basics to get things running.
Thank you
Tim Dettmers says
I did some tests myself with a similar setup, and it seems for PyTorch that installing via Anaconda or compiling from source yields about the same performance. However, on the other hand, I heard that some people were reporting better performance with compiled source code. I have not done any tests with TensorFlow though. In general, compilation should always yield optimal performance, but of course, it is less convenient.
Ray Donnelly says
> In general, compilation should always yield optimal performance
I disagree. Anaconda compiles software very well; most naive attempts at compilation will result in slower binaries. OK, you can get CPU flags more suited to your hardware, such that AVX-512 may become available, but that isn’t what is used; your GPU is, and that will be the only bottleneck, and it is driven by CUDA, which is closed source, and GPU drivers, which are also closed source.
Tim Dettmers says
That makes sense. Thank you for your feedback! I will incorporate this into the next update.
edison says
Hi
Will you test the RTX 2060 (6GB) in the future? And when?
thx 🙂
Tim Dettmers says
I will probably look for some benchmarks next weekend and then push an update which also includes the Titan RTX.
Kevin says
Ditto on the request for including the 2060 🙂
George M says
Here’s a good start: Eric Perbos-Brinck has graciously shared his benchmarks on CIFAR, and actually got better results on a 2060 (in FP16 mode) than a 1080ti. Here’s the link:
https://towardsdatascience.com/rtx-2060-vs-gtx-1080ti-in-deep-learning-gpu-benchmarks-cheapest-rtx-vs-most-expensive-gtx-card-cd47cd9931d2
I finally took the plunge and bought a dual 2060 machine, and can confirm similar results. Note we both have dual card systems so they are running on 8 PCIE lanes. A single card at 16 may run faster, though not by much as Tim noted earlier.
I will say this, the thermals with dual 2060’s (3 120mm fans) are not good. I got up to 85C on one card and I’m sure that automatically throttles the speed, not to mention being bad for long-term life span. Will have to experiment with fan speed curves, extra fans, a new case or even a new cooling solution.
Michael says
+1
Stanley Chen says
+1
Rushi says
+1
Long-Van says
+1 for comparision !
Mikolaj says
+1
Alex Dai says
Hey Tim!
Great article!
I followed your recommendation in buying a RTX 2070, however, when testing it out straight out of the box using some benchmarks, it seemed to be performing noticeably worse than the 1080Ti, even when utilizing half precision in both training and inference.
The benchmarks I used are contained here:
https://github.com/ryujaehun/pytorch-gpu-benchmark
Is this because the benchmarks weren’t utilizing tensor cores?
Or is it because I am missing some fine tuning steps of my GPU?
Thanks,
Alex
Tim Dettmers says
The benchmarks that you linked show that the RTX 2080 Ti (16-bit) should be about twice as fast as a GTX 1080 Ti (32-bit) for ResNet-152. To see what kind of GPU kernels were utilized you could run "nvprof python your_program.py --arg1 abc --arg2 def"; this will log the kernels that were used. What you want to see is that 16-bit/half-precision Tensor Core kernels were used (their names typically contain strings such as "884" or "1688", though this depends on the CUDA/cuDNN version). If this is not the case, something might be off with your configuration/install.
Alex Dai says
Is the 1080Ti unable to utilize FP16?
Some of the benchmarks you’ve linked have shown some results for the 1080Ti for FP16.
Is the 30-40% figure you cite for the 2070 in FP16 vs. the 1080Ti in FP32? If so, is that a fair comparison?
The benchmarks you linked:
https://github.com/u39kun/deep-learning-benchmark
https://github.com/stefan-it/dl-benchmarks
Show that the RTX is ~10-15% faster when both the RTX 2070 and 1080Ti are in FP16 mode.
https://imgur.com/a/lClO6iU
Thanks again, Tim!
Tim Dettmers says
The GTX 1080 Ti does not support 16-bit computation. If you use 16-bit, what the code does is cast the 16-bit values to either 24 bits (in some matrix multiplication kernels) or 32 bits (all other code) and then perform 32-bit computation. The results are then cast back to 16-bit. This is not any faster than 32-bit execution in most cases.
RTX cards are really bad at 32-bit computation. A fair comparison would be 16-bit vs 16-bit, but since 16-bit computation is 32-bit computation under the hood for the GTX 1080 and lower, the comparison is still quite fair.
Alexandre Soares says
Hi, everyone who’s reading this post. I know the Titan V is not recommended by Tim, the writer. However, would the Titan V be a good deal if it was being sold used for $1,200? Or would buying a new 2080 ti be a better choice?
Tim Dettmers says
A Titan V for $1,200 would be a good deal. The Titan V is more powerful than the RTX cards and does not have so many issues with cooling. If you can get one for $1,200 I just go for it!
Alexandre Soares da Silva says
Thank you for your answer, Tim. This is a great source of suggestions and recommendations!
Richard S. says
Hey, I’m starting a long-term project in deep learning/neural nets and am thinking about buying a GPU for work and (personal) learning at home. A bit of time into the project, my uni will acquire a dedicated PC for it, so I’m wondering whether a GTX 1070 would be a sensible option for now and for myself.
As far as I see, the RTX 2070 has substantially better performance in deep learning applications, but I’m unsure whether that is important for learning and testing my ideas for the project; right now I’d get a 1070 for around 350€, and a 2070 for around 520€.
What do you think?
Tim Dettmers says
A GTX 1070 is an excellent option. If money is a constraint then a GTX 1070 is a very good option — go for it! However, note that memory might be a problem sometimes (you cannot run the biggest models), but there is no cheap option with which you can do this. So I think a GTX 1070 is the best option for you.
Damien says
I think you should revisit your performance measurements… The 2070 shows very similar deep learning speed to the 1080 Ti, with less memory. I don’t see how it could score higher in any reasonable ranking… Or just link to (or perform) benchmarks!
Tim Dettmers says
I base my results on 7 benchmark results which I link in the blog below Figure 3. Two rigorous benchmarks indicate that the RTX 2070 is about 30% faster than a GTX 1080 Ti for convolution. For LSTMs, the RTX 2080 Ti is about 40% faster than a GTX 1080 Ti, and for an RTX 2070 this is about 30%.
Jerry says
Hi Tim,
Thanks for your effort to write these helpful guides.
I am going to buy an RTX 2070 with a blower fan first since I just graduated this year. When I have sufficient money to buy an RTX Titan in 2019, will it be possible to speed up deep neural network training with these two different models of graphics cards?
Tim Dettmers says
You will not be able to parallelize an RTX 2070 and a RTX Titan together, but you will be able to run separate models on each of those GPUs without any problem.
Mikhail says
This is excellent, thank you so much for the insight!
Answered my question about whether I should get RTX 2070 🙂
Rinish says
Hi Tim,
I have been trying to install TensorFlow GPU and CUDA 10 but with no success. Can you help me with the process or point me to some source that would be helpful? I am using Ubuntu 18.04 and my graphics card is an RTX 2080.
George M says
Just learned that the 2080 ti (not the Titan RTX) runs FP32 accumulate at half speed:
https://devtalk.nvidia.com/default/topic/1042897/cuda-programming-and-performance/is-geforce-rtx-2080-slower-than-geforce-gtx-1080-on-small-matrix-matrix-multiplication-/
I’m not familiar enough with mixed precision training to know – is this significant?
George M says
Correction: this should read FP32 *during mixed precision training mode.* Of course regular FP32 runs fine.
Rahul Sangole says
Tim,
I’m setting up my first deep learning system. As I search for RTX 2070 on Amazon, there seem to be many choices – EVGA, Zotac, MSI. Does it matter which rtx 2070 I pick up?
Thanks
Tim Dettmers says
Not really. However, if you want to get multiple RTX 2070s, I recommend going with a brand that offers blower-style fans or even all-in-one water cooling. Otherwise, it does not matter.
Rahul S says
Thanks for the response Tim. I did end up setting up my first system with a Gigabyte RTX 2070.
Setting everything up is certainly no joke. Lots of version-compatibility issues I had to resolve before getting Python, Keras and TF to recognize the GPU. Finally having got over that hurdle, I still can’t get R or PyTorch to recognize the GPU.
While I attack that issue myself, could you point me towards how we leverage “lower precision” training on this GPU to double the memory? Where can I learn more about this?
Tim Dettmers says
Make sure you have installed the correct video driver and that CUDA is visible to the software (try typing nvcc into your terminal). The easiest way to make sure that everything is working is to install PyTorch via Anaconda. To use 16-bit in PyTorch, it can be as simple as calling .half() on your model. For more info you can see the NVIDIA 16-bit repo: https://github.com/NVIDIA/apex
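A minimal sketch of plain .half() usage (assuming a CUDA GPU; for training, the Apex mixed-precision route is more numerically robust than casting everything):
import torch
import torch.nn as nn

# Cast the model and inputs to half precision; on RTX cards the matrix
# multiplications can then run on Tensor Cores.
model = nn.Linear(512, 512).cuda().half()
x = torch.randn(64, 512, device="cuda").half()
print(model(x).dtype)  # torch.float16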
Rahul Sangole says
Thanks Tim. I was able to get Python+R+Tensorflow+Keras working. Not able to get Pytorch working, but no matter. I’m happy with the setup so far.
Within Tensorflow, is running half precision on the RTX2070 card a matter of following the instructions in : https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow
Tony Qi says
Hi Tim,
Your article about choosing GPUs is excellent.
I’m a master’s student who recently got into the machine learning area, focusing on deep learning models. My lab doesn’t have the hardware environment and I need to make a hardware list. This area is totally new for me and I haven’t gotten a budget yet, so I am just making a list. So I wonder: is the RTX 2070 enough, or would two GTX 1080 Tis be better?
Thanks
Tony
Tim Dettmers says
Hi Tony, two GTX 1080 Ti will be better, but probably also more pricey. If you can get two GTX 1080 Ti at the price of one RTX 2070, definitely go for the GTX 1080 Tis.
Tony Qi says
The 1080 Ti is almost 1100 CAD, the 2070 is about 800 CAD. So I would choose two RTX 2070s. Better than two 1080 Tis? Future study will focus on deep learning models, like compression or parameter optimization. The RAM of the 2070 is 8GB, slightly smaller than the 1080 Ti (11GB). Could it satisfy the basic needs?
Yaohua Guo says
Hello, I am an NLP developer from China. Thanks for sharing.
I have a question: can multiple GPUs make the total RAM larger? For example, for BERT the authors say it can only run on a GPU with more than 12GB of RAM (like the GTX 1080, etc.). Can I use two RTX 2070s to do this? Can 2x 8GB run BERT-base?
Is the RAM of two GPUs like the RTX 2070 (8GB each) equal to one GPU with 16GB of RAM?
thanks for your answer!
Tim Dettmers says
Unfortunately, this does not work this way. What you describe would be some form of model parallelism. However, there is no deep learning framework which supports model parallelism in a straightforward way. Thus the only option is to get a GPU with larger memory and/or to use 16-bit computations and weights.
Yaohua Guo says
Thanks for your reply. I still do not understand: can two RTX 2070s share RAM? A big model like BERT needs a lot of RAM; can we use both GPUs’ RAM and just one GPU to compute?
If we can’t do this, why is the RAM of all GPUs allocated when I run a model, with only one GPU computing?
Thanks for your answer.
Tim Dettmers says
Two RTX 2070 generally cannot share RAM. With special code this is possible, but this is too cumbersome to really be practical. If you want to run BERT and you run into memory problems then the best bet is to get a bigger GPU. The new Titan RTX might be a good fit for you in this situation, or you can get a RTX 2080 Ti and work with 16-bits.
Yaohua Guo says
Thanks for your reply. I have another question: how can I make a model work with 16-bit? Do I use tf.float16 in TensorFlow? And since BERT has been pre-trained, how can I rebuild the model in 16-bit and use the pre-trained checkpoints which the author shared on GitHub? Can we do this?
THANKS!
Yaohua Guo says
Will a 16-bit model be harmful to model performance? If so, by how much?
wilkas says
How come the GTX Titan (Pascal) is listed as a cost-efficient and cheap option, if even a pre-owned card without warranty costs more than a brand-new RTX 2070, which is more capable?
Tim Dettmers says
That is a good point, I might have made a mistake here. I should not recommend the GTX Titan X (Pascal) anymore. It could also be that the RTX 2070 was more expensive and the GTX Titan (Pascal) cheaper on eBay at the time. I will fix this later today.
Umit says
Hello Tim,
Thanks for the great post. I am currently doing research on GANs and other DNNs. You mentioned that we should look at CUDA cores as well as FLOPS. Clearly the memory of the card is another thing to consider. Comparing the two, a 1080 Ti has 3584 CUDA cores vs. a 2080 which has 2944 CUDA cores. I see that the 1080 Ti has a lower clock speed than the 2080. Is this why there is such a large difference in performance? Because other than the clock speed it seems the 1080 Ti wins in all other categories, including a 3 GB memory advantage. I was able to get my hands on an MSI 1080 Ti Gaming X which is stable at ~1.8 GHz overclocked for ~$630 USD. The cheapest I’m finding a 2070 is ~$500 and a 2080 is ~$700. Do the 2070 and 2080 really outperform the 1080 Ti? All the gaming benchmarks rank the 1080 Ti higher.
Tim Dettmers says
Hi Umit. What I meant with cores is Tensor Cores. CUDA cores do not map very nicely to compute performance. The RTX 2080 is faster despite the small difference in clock etcetera because it has Tensor Cores which speed up computation for 16-bits. To understand more about this issue you can read my blog post about TPU vs GPU which also discusses low-bit computation and why it is such a big advantage.
Umit says
Ah, so while the 1080 Ti will outperform the 2070 and 2080 in gaming, when we are training nets both RTX cards will outperform the 1080 Ti since they have Tensor Cores? Do you see any benefit in the memory size of the 1080 Ti vs. the 2070/80 cards in terms of not running out of space when training? In the past I worked off a 1050 Ti and have run into the error that a batch size of 1 would not fit into memory.
Tim Dettmers says
Since you run models in 16-bit with RTX 2070 cards you will consume less memory than with 32-bit training. As such there is only a small difference in memory compared to a GTX 1080 Ti.
Umit says
Hello Tim, $100 here and there does not bother me much. Would you recommend a 1080 Ti or a 2080 for deep learning? You say the 2080 is faster, but can it handle as much data at once? Thank you.
Tim Dettmers says
Hi Umit, if you get an RTX 2080 you would use 16-bit training and as such you would be able to store much more data on the GPU than with 32-bits. In terms of throughput of data, the RTX 2080 will also be better. However, both GPUs are great — if you find a good offer for a GTX 1080 Ti this might be very well worth it over an RTX 2080!
Xunfei says
Thanks a lot, I will get an RTX 2070. Just for others’ information, the RTX 2070 has encountered lots of crash/artifact incidents so far.
Tim Dettmers says
Same for other RTX cards. The issue arises if you use multiple of them. Try to get an RTX 2070 with a blower-design fan — that should be a bit better in any case. If you can find an all-in-one water-cooled solution I would buy that!
andrea de luca says
Could you elaborate a bit further about that artifact-crash thing? Thanks.
Xunfei says
Refer to this post
https://wccftech.com/geforce-rtx-2080tis-are-dying-and-there-are-different-rtx-2070-chips/
Also, plenty of reviews on Newegg/Amazon claim the card died after one or two weeks of use. So I suggest that if you want to buy it, buy from a notable retailer only, to save the hassle if any issue occurs.
Amit Garud says
Hi Tim,
Now that the RTX Titan is announced today for $2500, would you recommend that over a pair of 2080TIs with nvlink for the same price? I understand it depends on the network models being used, but is there really a model you know of that can use the 24GB of the single RTX Titan?
Tim Dettmers says
I would not recommend getting the Titan RTX. The performance is not much better than an RTX 2080 Ti but the card costs nearly twice as much. The only reason to get a Titan RTX is the 24 GB memory. If the 11 GB of 16-bit precision is not enough (equivalent to 22 GB of 32-bit precision) then the Titan RTX is a fair choice.
John L. says
Hi Tim,
I found that the RTX 2080 and RTX 2080 Ti don’t have P2P access. When I run the simpleP2P test from the CUDA samples, all of these cards show as capable of P2P, but they cannot access each other. May I ask: without P2P access, how big is the impact on multi-card training? Even though NVIDIA sells an NVLink bridge, only a 2-way bridge adapter is available.
Tim Dettmers says
The problem is that if you have different kinds of GPUs you cannot do direct GPU transfers among them. With the two GPUs that you have and convolutional nets, you should still see okay performance. If you train something like transformers you will have problems with parallelization.
Eri Rubin says
There actually seems to be an issue with P2P on the 2080Tis, we have a few of them, exactly the same card, and can’t seem to get P2P to work.
Tim Dettmers says
Someone mentioned something like that before — I personally have no data on this and I am unable to help you. However, if you can figure out the problem it would be great if you can report back here so others can benefit from your finding. Good luck!
James says
Hello, I was searching the comments but couldn’t find a comparison between the GTX 1080 and RTX 2070, since they have pretty much the same price. I know the rule of thumb: bandwidth if you use RNNs and FLOPS if you use convolutions. So is the GTX better for CNNs?
Sorry if you have to repeat yourself.
Tim Dettmers says
It is not as straightforward as looking at FLOPS because the RTX 2070 has Tensor Cores and 16-bit computation. For CNNs the RTX 2070 should be better if you use 16-bit, but the GTX 1080 Ti will be better if you use 32-bit.
James says
I didn’t mean the GTX 1080 Ti, but the plain GTX 1080 (not Ti). Does what you said still apply to the GTX 1080 (not Ti)?
Tim Dettmers says
Ah, my bad. If I had to choose between an RTX 2070 and a GTX 1080 I would definitely go for the RTX 2070. However, the point about 32-bit processing still applies, although the gap is not as large anymore.
Kevin says
Hi Tim,
Hands down, the best article I’ve been able to find on deep learning hardware. I just have one question though. From what I could tell, you only recommended the blower-style cooler on the 2080 Ti. Did you mean this? If so, why only that model and not the lower models (e.g. 2070)? I plan to start with a 2070 and add more cards as I am financially able (and as my skills progress). Would it be wise to get the blower model in anticipation of running a multi-GPU setup, or does it not matter for the 2070s?
Thanks for your great contributions to the community!
Kevin
Tim Dettmers says
The RTX 2070 has a much lower TDP of 175 watts compared to the RTX 2080 Ti with 250 watts. However, if you have a multi-GPU setup, I definitely recommend blower-style fans for the RTX 2070 as well. I will update my blog post to reflect that.
Bruce says
Hey Tim – thanks for the helpful post.
Have you seen NVDA’s T4 for inference? Any thoughts on it?
What are the switching costs of existing work done on NVDA if you want to swap out the hardware to AMD or a TPU?
Do you think the hyperscale cloud players use the software libraries of NVDA to a similar extent as everyone else? Do they build software around AI that is embedded in NVDA so that the switching costs are pretty high?
Thanks!
Bruce
Tim Dettmers says
The T4, just like Quadro and Tesla cards, is very cost-inefficient and thus I would not recommend it.
Switching depends on the context. I do not think flawless switching is possible if you used NVIDIA cards in an industrial setting. You might be able to switch NVIDIA->TPU if you use TensorFlow. For AMD I am not sure; you will always need to make some adjustments to your code and it will take more or less time depending on the project.
Alfonso Campos says
It’s already been said, but I would be very interested if you had a look at Azure with the flexibility of its latest ML Service & Batch AI. In a nutshell, you can target different compute environments and easily swap between local and cloud GPU clusters (Horovod).
Additionally, you can train on FPGAs. I would like to hear your thoughts on that as well, particularly vs. GPU clusters and TPUs.
Finally, if you could at some point extend the article to include memory usage for completeness, that would also be great.
Thanks for sharing this!
Tim Dettmers says
These are good points and I have not been keeping up on this. I will try to include it in the next iteration of my blog post.
Michel Rathé says
Hi Tim,
We cannot thank you enough for such relevant facts and knowledge.
You make sense of everything we find on the internet, especially at a time when hardware, software, AI and participants seem to be limited only by creativity.
Actually, I got caught up in all this, as did (I presume) a lot of your readers.
The question is: should I leverage my current workstation, or go ‘all in’ on the hype of the next Xeon Cascade that will supposedly dethrone GPUs? And 2, 3 or 4 GPUs in a workstation or cluster (noise, heat and choice of GPU form factor, i.e. 2 or 2.7 slots)?
I mainly use MATLAB, Tableau and Excel on a multi-monitor (3) setup.
I’m a market investor seeking to transfer and expand my knowledge into several aspects of machine learning. Though I cannot scope, at this point, the full size and depth of my projects, I do not want to be limited over a 24-36 month horizon.
I now have an Asus X99-E WS with an i7-5930K (6 cores, 40 lanes), 32 GB of DDR4 (2133), and 2 Asus GTX 750 Tis (I know they are not relevant to machine learning, but I would keep one if needed for the monitors). The OS is Win 7 on a Samsung 850 Pro 512 GB, in a Define R2 case with a 750 W PSU. No water cooling.
Knowing that MATLAB is highly vectorized, I tend to improve performance on the best-practices programming side. I also ran some tests with the Parallel toolbox. I’m mostly experimenting with MATLAB now because I’m still developing so many features (though it’s an ongoing process).
From reading all of your posts, my setup still seems to be relevant.
My first inclination (reaction) was to get a newer platform (i9 or Xeon W) with 18 cores (are those factors really significant?).
From all your posts, the 2080 Ti GPU is a no-brainer (I hope they will correct the current flaws of that GPU). I can run these on the current or the next platform.
I’m looking for the sweet spot to advance to a certain level in my current and future projects.
My first line of thought would be: (1) the current workstation with 1 or 2 Asus RTX 2080 Tis, but it seems that I have to make the right call at the beginning for the model, i.e. Turbo (single fan), in case I expand to 4. But I’m really concerned about the heat. (2) Is water cooling a real game changer for keeping performance and longevity? And should I upgrade to 64 GB of memory (does 2666 really improve over 2133)?
I’d appreciate your overall thoughts and considerations on my objectives as a high-level starting point. I’m mostly at an orientation (making sense of things) phase at this point.
Thanks immensely,
Michel
Tim Dettmers says
Hi Michel, I would stick with your Asus X99-E WS for now and just upgrade your GPUs. You can keep the 750 Ti for your monitors and just add an additional RTX 2080 Ti for your work. Multiple big GPUs are usually only needed if you are already experienced with deep networks and you find yourself limited by runtime. From what I read, you are mostly in a prototyping stage where you would like to experiment with greater flexibility. A single RTX 2080 Ti will not limit you here. I think upgrading your CPU and motherboard will only yield small gains (larger gains when you work on problems beyond deep learning).
I would also suggest getting the blower-style single fan version to avoid heat problems. The ASUS RTX 2080 Ti should be the right option for you (also if you want to upgrade to 4 GPUs). Water cooling for GPUs can be a curse but also a blessing and I would not recommend water cooling in your case where you want to not touch your system for many months (too unreliable).
64 GB of RAM is good but stick to 2133. The 2666 one has no real benefit in performance.
I hope this helps you to orient yourself and come to a decision.
Michel Rathé says
Hi Tim,
Thank you for the precise and valuable information.
That basically confirms what I expected. By splitting the time horizon into 18-30 months we can only expect interesting leaps in hardware/software. On that basis, and from a cost/benefit standpoint, it makes sense to leverage the current setup. What I’m trying to achieve, based on financial wisdom, is to leverage and roll over the hardware on an ongoing basis, up to the point where hardware that is no longer relevant (for me) can still be sold or reused in a complementary setup.
I hope you continue the great work you do in presenting relevant advice to us wishful “deep learners”.
Thanks
Haider Alwasiti says
I am not very interested in running 1 model on several GPUs. I wanted the GPUs to run several models separately.
16 lanes for 2 GPUs, plus lanes coming from the chipset at x4 speed for the 3rd GPU and SSD, etc.
And theoretically, AMD CPUs with 128 lanes can dedicate them to 4 GPU slots and use the extra lanes from the chipset for other stuff, just like the Z390-E Asus motherboard that I have.
wesley faria says
Great article Tim Dettmers, congratulations!!!
I’m a beginner in ML/DL, but I’m working on a challenging project and I need to buy hardware, and I have a lot of questions.
The project aims to recognize tiny objects in images. Let’s say we work with a dataset of around 50 classes, each class with 1000 images, each image 1024×1024 in size. I cannot greatly shrink the images because I need to keep the details.
I need equipment to start the project; in the future I can buy more hardware, but right now I can’t spend much money.
My questions are:
1 – Which NVIDIA card should I use? My options are the GTX 1080 Ti, RTX 2080, or another card with a similar or cheaper price. I would consider using multiple cards.
2 – Do I need a lot of high-speed RAM? I’m thinking around 16 GB DDR4. Is that good?
3 – Do I need an ultra-fast CPU? Is an Intel Core 7th or 8th generation good?
4 – Do I need an ultra-fast disk like an NVMe SSD?
Thanks a lot.
Tim Dettmers says
1) A GTX 1070 Ti, RTX 2070, or GTX 1060 (6 GB) should all be fine for you.
2) 16 GB of DDR4 RAM is good. Clock does not matter. Latencies do matter for some tasks, but not for deep learning.
3) You do not need a fast CPU for deep learning. A 7th-generation CPU with 4 cores is more than you would need.
4) You do not need a fast disk. A simple hard drive would suffice. However, I like to have an SSD so the OS runs more smoothly.
Peixiang says
Hi Tim,
Thank you very much for your detailed and helpful post.
I was planning to buy two used 1080 Tis, but after reading your article I’m thinking about two new 2070s since there are no used ones. The prices are roughly the same. Which setup do you recommend? I run experiments mostly using LSTMs.
Also, right now it’s not easy to directly use 16-bit in frameworks like PyTorch. Do you think using 16-bit will be more prevalent in the near future (end of 2018)?
One final question :) I’m planning to buy a used E5-2670 v2 and 4x16 GB ECC 1866 MHz memory. Are the CPU and memory the bottleneck of my system?
Thank you
Peixiang
Tim Dettmers says
Everybody will use 16-bit soon — there is no reason not to do so. I would probably go with the RTX 2070s. The CPU looks more than fine to me — you should have no problem with the CPU being a bottleneck. One thing to note, though, is the DDR3 memory. If you are preprocessing a lot of data, DDR4 memory might be nicer. But on the other hand, you get a fast and cheap CPU with that. I think it could work quite well!
Subash says
Hi Tim,
I want to set up a build with RTX 2070 GPUs; however, in the market I see different vendors (ASUS, Gigabyte, PALIT, EVGA, MSI, Zotac, etc.) and different models as well, like GeForce, OC Edition, Turbo, ROG Strix, etc. Do you have any specific preference in terms of a particular brand or model for an ML/DL setup, or is the difference negligible?
Regards
Subash
AV says
For multi-GPU setups, go with a card that blows hot air out of the case. These are called Turbo, Blower or Aero (MSI). I have only tested the Asus 2070 Turbo and MSI 2070 Aero, and the latter had a 5 degrees lower temperature.
AV says
Hi Tim,
Did you see this test by Puget systems last month? https://www.pugetsystems.com/labs/hpc/RTX-2080Ti-with-NVLINK—TensorFlow-Performance-Includes-Comparison-with-GTX-1080Ti-RTX-2070-2080-2080Ti-and-Titan-V-1267/
What do you think explains the difference between your estimate of the 2070/2080 Ti performance gap and theirs? You suggest that the 2070 is not far from the 2080 Ti in terms of performance with RNNs. Puget suggests that the 2080 Ti is about twice as fast as the 2070. Many thanks.
Tim Dettmers says
See the note in the blog post, that the memory was not large enough to run a larger batch size on RTX 2070 and RTX 2080. If you compare both RTX 2070 and RTX 2080, both of which cannot run larger batch sizes, then you see that they have the same performance on this task. If you would use a smaller batch size for the RTX 2080 Ti, you would probably see the same result.
AV says
But why would you use a smaller batch size with the 2080 Ti?
Tim Dettmers says
A smaller mini-batch size yields a faster descent to a local minimum. Then you usually turn off momentum and you may also increase the batch size. People in NLP often use batch sizes around 32.
xu says
Hi Tim, you suggested: “If you already have a GTX 1080 Ti or GTX Titan (Pascal) you might want to wait until the RTX Titan is released. Your GPUs are still okay.” However, I already have a GTX 1070 Ti. Should I upgrade to an RTX 2070 or wait until the RTX 2080/2080 Ti’s price stabilizes?
Tim Dettmers says
It might be worth waiting a bit more to see how the prices play out. I would also only upgrade if you feel unsatisfied with your current GPU and have the spare money. Upgrading just to get a faster GPU is often not the right reason for a new GPU. If you find yourself limited by the GTX 1070 Ti some of the time, then that is a good reason to wait for prices to stabilize and go for an RTX 2080 / RTX 2080 Ti.
Subash says
Hi Tim,
Thank you for the wonderful article.
I am setting up with the config mentioned below. I am planning to start with one RTX 2070 initially and plan to add one more 8-12 months down the line. My primary objective is to learn ML & DL and start doing some Kaggle projects for 1 or 2 years to gain expertise.
Should I spend money on an Intel i7 now, considering that I may get additional GPUs in the future, and would it be a bottleneck with 6 cores & 12 threads? Or is it better to get an AMD Ryzen, where the additional cores & threads would be helpful when I add more GPUs?
ASUS ROG Strix RTX 2070 GPU & Z390-E motherboard
32 GB RAM
1 TB M.2 SSD & 2 TB SATA
Intel i7-8700K
Thanks in advance.
Cheers,
Subash
Tim Dettmers says
The CPU does not matter too much; both options will give you more than enough power to use two GPUs efficiently. A better CPU is particularly good if you do a lot of preprocessing, that is, wrangling data in Kaggle competitions — so it might be worth it for you for that reason, but not for the reason of keeping your GPUs busy.
Leo says
Hi Tim,
If I do CNNs and R-CNNs, what card should I get? I have a budget of 1000-1300 USD. Should I get 2 GTX 1080 Tis for 1200, or 2 RTX 2070s for 1000? My motherboard will fit 4 GPUs for any future upgrade, but I am not planning on upgrading anytime soon. Thank you!
Tim Dettmers says
I would definitely go with two RTX 2070. They are a much better long-term investment.
Leo says
Does the fact that RTX 2070 doesn’t support NVLink SLI bridge affect the power of using multiple 2070s together?
Tim Dettmers says
NVLink is a bit useless for the RTX series. For 2 GPUs it does not matter if you use it or PCIe. For 4 GPUs it would give you a benefit, but since only two GPUs can be coupled at any time, one would need complex communication patterns to get a payoff. So no: The lack of NVLink has no effect on RTX 2070 parallel performance.
Tyler says
Thanks for this valuable info. I was thinking of getting a 2080 instead of a 2070 because the latter does not support NVLink. Now I will definitely go for the 2070; actually, I just placed an order.
chris says
Tim, can you help me choose? Should I buy (1 RTX 2070) or (2 GTX 1070 Ti), and why?
I mainly train models using FP32 VGG16, ResNet50 & AlexNet. I use TensorFlow.
Waiting for your reply.
Tim Dettmers says
It is a big change to train your models in 16-bit, but it’s worth it because it will be the standard in the future. If you want no hassle and just want to do your work, two GTX 1070 Tis might be simpler to use. Also note that two GTX 1070 Tis are much faster if you are training two independent models at the same time. On the other hand, the RTX 2070 will yield good performance, and with 16-bit training you will have much more memory. If you want to train big models this could be an important point.
thanh tung hoang says
Thank you for your tutorial and advice. I currently have a desktop with a GTX 1080. I plan to add another card next year but 1080 is no longer on sale. Can I run a model on a system with a 1080 and a 2070/2080?
Tim Dettmers says
Yes, that works without a problem if you do not use parallelism. If you want to parallelize across GPUs you can only do that on multiple 1080 or multiple 2070/2080. I do not think parallelism is very necessary, but if you think it is you could try to find a used GTX 1080 and buy that card.
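As an illustration of the non-parallel setup, here is a minimal sketch (my own illustration; the device indices and tiny models are placeholders), where each model and its data simply stay on one card:

```python
# Two independent models, one per GPU; any mix of cards works in this mode.
import torch
import torch.nn as nn

model_a = nn.Linear(512, 10).to('cuda:0')   # e.g., the GTX 1080
model_b = nn.Linear(512, 10).to('cuda:1')   # e.g., the RTX 2070

x_a = torch.randn(32, 512, device='cuda:0')
x_b = torch.randn(32, 512, device='cuda:1')

out_a = model_a(x_a)   # computed on GPU 0
out_b = model_b(x_b)   # computed on GPU 1
```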
Mario says
Hi Tim,
What were the price points used for each GPU in the comparison? This would have been an excellent piece of information.
Right now, like many, I am not clear about the GTX 1080 Ti vs. the RTX 2080. Gaming versions of the GTX 1080 Ti are about €700 in the EU and the RTX 2080 is about €750. The eBay market is CRAZY here — GTX 1080 Tis are selling for €600 on eBay. Then there is the issue that warranties are not transferable for second-hand purchases.
All in all, this is going to be a dilemma.
Mario
Tim Dettmers says
I used eBay and Amazon in the US. I could have added these data as well, but the charts already contain too much information. Maybe I will add just a price chart so people can compare with the prices in their own countries.
AV says
Thank you, Tim. What about PCIe 3.0 at x8 speed? Does that impact RTX performance compared to the x16 setting? Some C422 motherboards are able to divide 48 lanes over six PCIe slots running at x8. Would you recommend running six 2070s in parallel at x8? Many thanks!
Sam Karopoulos says
According to this article (https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-with-4-x-Titan-V-GPUs-for-Machine-Learning-1167/#results), there is negligible difference between using x16 and x8 connections. The top comment there explains that the theoretical limit for x8 is 8 GB/s and suggests that unless you are using a batch size larger than 8 GB, there will be negligible difference.
Tim Dettmers says
Indeed, people usually overestimate the importance of lanes. I have stated that before. However, the article might be a bit misleading because for certain architectures the penalties for fewer lanes can be much larger. For example, for VGG lanes are a bit more important than for GoogLeNet or ResNet. The difference, however, only matters if you are using 4 GPUs. For 2-3 GPUs, x8 lanes are just fine, and for 4 GPUs there are no standard options that give you 16x/16x/16x/16x setups anyway. For LSTMs there might also be a slowdown if you use shorter sequences, but with longer sequences lanes are also a non-issue.
However, the argument about a batch size larger than 8 GB makes no sense really. You will see a performance penalty long before that.
andrea de luca says
TBH, there are some standard options that give you 16x/16x/16x/16x: think of the X99-E WS, to be bought used on eBay (or the C422-SAGE if you want socket 2066). Note that they support 16x/16x/16x/16x regardless of whether the processor has 40 or 48 lanes.
Nikolaos Tsarmpopoulos says
This is not entirely correct.
The X99-E WS uses the PLX PEX8747 PCIe switch, which communicates with the CPU using 32 bi-directional lanes of PCIe 3.0. The switch also communicates with each GPU using 16 bi-directional lanes of PCIe 3.0.
Hence, the CPU can’t transmit-only, or receive-only, data to/from 4 GPUs using x16 lanes per GPU concurrently. It is limited by the 32 lanes to the switch.
However, if the software schedules the data transfers so that only two concurrent READs and two concurrent WRITEs take place in parallel (at most), the algorithm will take advantage of the PCIe switch.
Note that NVIDIA’s recent drivers for GTX graphics cards have been unstable in Windows 10 since August 2017, causing BSODs. The latest stable driver for 3 or 4 GPUs on motherboards featuring the PLX PEX8747 switch is version 382.53. This is a known issue and NVIDIA does not seem interested in fixing it.
andrea de luca says
Thanks Nikolaos, these are useful pieces of information. I suppose the same issues affect the C422-SAGE 10G (it employs two PLXs). I was not aware of the driver problem. Note that 382.53 is insufficient for using the latest versions of the most popular frameworks, not to mention Tensor Cores, FP16, and so on.
So, apart from dual-processor boards, I reckon that no motherboard can, as of today, allow 16x/16x/16x/16x. Not even Threadripper boards.
Nikolaos Tsarmpopoulos says
To the best of my knowledge, all motherboards that feature the PLX PEX8747 are affected.
Regarding dual-processor systems, each CPU has access to half of the system RAM. The CPUs (and the devices attached to different CPUs) exchange data via Quick Path Interconnect (QPI) or UltraPath Interconnect (UPI), with a maximum aggregate throughput of about ~9-10 GT/s or ~20 GB/s.
In comparison, PCIe 3.0 x16 delivers ~15.8 GB/s. Two GPUs using x16 lanes each could facilitate ~31 GB/s.
That means that if 2 GPUs connected to CPU1 need to be fed with data stored in system RAM attached to CPU2, the data transfer will be limited to ~20 GB/s by QPI or UPI.
Hence, to take advantage of systems with multiple GPUs and CPUs, the software needs to be NUMA-aware (non-uniform memory access).
andrea de luca says
Thanks.
Just out of curiosity, how did NVIDIA manage to overcome such limitations on its DGX Station deep learning workstation?
Nikolaos Tsarmpopoulos says
I think NVLink does the magic.
Haider Alwasiti says
There are these new AMD CPUs announced a few months ago that give you 64 lanes without a PLX. So I think if you want 4 GPUs, then AMD is your friend.
AMD Ryzen Threadripper 2990WX
32 cores/64 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,799
Availability: Aug 13, 2018
AMD Ryzen Threadripper 2970WX
24 cores/48 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,299
Availability: Oct 2018
AMD Ryzen Threadripper 2950X
16 cores/32 threads
4.4GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $899
Availability: Aug 31, 2018
AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018
Source:
https://www.zdnet.com/article/amd-unveils-world-record-breaking-intel-beating-2nd-generation-ryzen-threadripper-processors/
It seems Intel cannot (or does not want to?) make CPUs with 64 lanes.
For the NVIDIA DGX-2 with 16 GPUs, they seem to rely on the NVLinks between the V100s. They are using 2 Xeon processors (dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores), each with 48 lanes (6 lanes/GPU?), but that is not a problem with NVLink.
source:
http://images.nvidia.com/content/pdf/dgx-2-print-datasheet-738070-nvidia-a4-web.pdf
DGX-1:
2x Intel Xeon E5-2698 v3 (16 cores, Haswell-EP) with 40 lanes each
8x NVIDIA Tesla P100 (3584 CUDA cores) with NVLink
= 10 lanes/GPU
But again, there are NVLinks between the 8 P100 GPUs.
source:
https://www.anandtech.com/show/10229/nvidia-announces-dgx1-server
I think only the new AMD processors can do 4 GPUs efficiently for us without server-grade Tesla GPUs.
My question, though: if we go for the cheapest AMD with 64 lanes, how is the performance and, most importantly, the compatibility of AMD CPUs with the DL frameworks? Like this one at $650:
AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018
That said, I have recently purchased the newly announced Core i9-9900K ($550) for building a system with 3 GPUs.
I just don’t feel comfortable going for AMD CPUs, fearing compatibility issues and lower performance for DL or other compute tasks that I am interested in (rendering, Ansys simulations, etc.).
andrea says
Mh, I was thinking: suppose you have a mainboard which supports even less than x8 (e.g. a mixture of x16, x8, and x4 slots, perhaps x1), but has enough room to accommodate 4 cards.
Can you still attain full speed with NVLink?
andrea de luca says
@Haider:
note that any Threadripper motherboard allows (at best) for 16x/16x/8x/8x, and that’s because a good number of lanes are used for storage and other similar stuff.
andrea de luca says
@Haider:
How do you attain 3 GPUs with a 9900K? It just has 16 lanes…
AV says
Hi Tim, can you please comment on the necessity of NVLink for parallel training? NVLink is missing on the 2070. Do the benefits differ by network architecture? I found this presentation https://www.dcs.warwick.ac.uk/pmbs/pmbs/PMBS/pres/paper1.pdf
Is there any truth to it? Thanks so much for all your input!
Tim Dettmers says
Currently, NVLink is only usable for 2 GPUs on RTX cards. For 2 GPUs NVLink is quite useless, because each GPU just has to send 1 message, which can be done in parallel. For 8 GPUs you need to send 7 messages, of which only 2 can be sent in parallel on each PCIe root complex. This means that you need to wait for the equivalent of at least 8 messages, so the bandwidth requirements for equal performance are 8 times higher than in the 2 GPU case. What this means is that NVLink is great for 8 GPUs but unnecessary for 2 GPUs.
Jannes says
In the “TL;DR: I am an NLP researcher” section you mention a “RTX 2070 Ti” which is not a thing yet. Did you mean the RTX 2070 or the RTX 2080 Ti?
Tim Dettmers says
Thanks, I correct this right away.
Sam Karopoulos says
Hi Tim – thanks for updating your article – it’s super helpful 🙂
I’m trying to build the most cost efficient machine learning workstation possible. Based on the August 21st revision of your article, I originally was planning the following build:
https://pcpartpicker.com/list/8Pd9Bb
I didn’t include storage as I already have plenty of SSDs and HDDs to use (and my understanding of machine learning is that the storage speed doesn’t matter much). There are 4 MSI RTX 2080 Sea Hawks in this build, with PCI-e extenders on two of the cards to make it all fit. I chose these as I’ve read that a blower card or a card with AIO cooling works better to keep a multi GPU build cool. I chose a Threadripper 1900X for its 64 PCI-e lanes. The Gigabyte X399 Designare EX has 5 PCI-e x16 slots: 2 run at x16, 2 run at x8, 1 runs at x4. I’ve read that x8 runs at about 90% efficiency compared to x16 for ML, which is an acceptable hit to performance for me.
With the November 5th revision of your article, I’m now considering using RTX2070s instead of RTX2080s. However, it looks like there aren’t any RTX2070s which come with AIO cooling, so I’m worried about cooling.
I’d really appreciate it if you could provide some advice about how you would build the most cost effective machine learning workstation possible.
Thanks 🙂
Tim Dettmers says
Look out for new RTX 2070 cards as they come out. Also, have a look at gaming reviews and see if some cards have better temperatures. Also note that the RTX 2070 has lower wattage, so it should generate less heat and thus be more manageable than other cards.
peter says
Hi Tim, how are these GPUs ranked relative to each other in terms of doing DL on a laptop (NVIDIA 1050 Ti Max-Q or Quadro P1000, P2000, all 4 GB)?
George M says
Thanks for this helpful article. I have a very wide-open question but I will do my best. What kinds of models are currently stretching (or will soon stretch) the limits of a single video card? By that I don’t mean models that run slow, I mean they simply will not fit regardless of FP16/batch size/accumulation type workarounds. I am just getting started in DL so it’s mostly a theoretical question at the moment, but I may buy my own hardware in a few months. I’d prefer to wait until I at least know what I’m doing and the current epidemic of RTX hardware failures gets somewhat resolved.
TLDR, I want to get a sense of whether I can get by with a single 8gb 2080 and the aforementioned techniques, or whether the day is fast approaching (or maybe already has…) when there are many important models that require a dual 2080ti system or more to run at all. I do not have a specific use case in mind yet, so let’s assume all fields are fair game – images, language, audio, video, etc. (Told you it was a wide open question) What are your thoughts?
Tim Dettmers says
I think if you use 16-bits then you should be able to run anything because it virtually extends the memory to 16 GB while the largest consumer-grade GPUs are currently at 12 GB — so you should have no problems. If people come up with even larger models then you can just use the tricks that I mentioned and you will be fine.
Bull Shark says
Which GPU would be best to choose, spec-wise:
- RTX 2080 Ti
- less VRAM
- faster?
- AMD Pro Duo 32 GB (yes, I know, PlaidML, but I’m just talking spec-wise)
- more VRAM
- slower?
I’m mostly interested in training really large deep learning models. For these models 11 GB is not enough. However, the AMD card might be so slow that training a large model on it isn’t much good either…
What would you recommend? I expect to be using the hardware pretty intensively, so cloud solutions are not a cost-efficient option. Maybe you know of other faster high-memory GPUs <1500?
Thank you in advance!
Tim Dettmers says
Buy an RTX 2080 Ti and train with 16-bit only (virtually doubles memory to 22GB). If that is not enough memory: (1) use a small batch size, (2) accumulate gradients over multiple batches, (3) apply the gradient. With this, you should virtually have about 30-40GB of memory.
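To illustrate the small-batch plus gradient-accumulation trick, here is a minimal sketch (my own illustration; the model, dummy data, and accumulation factor are placeholders):

```python
# Accumulate gradients over several small batches, then apply one update.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(8, 1024), torch.randint(0, 10, (8,))) for _ in range(16)]  # dummy data
accumulation_steps = 4   # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = F.cross_entropy(model(x.cuda()), y.cuda())
    (loss / accumulation_steps).backward()     # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # apply the accumulated gradient once
        optimizer.zero_grad()
```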
Bull Shark says
Thanks Tim!
I’m now considering the following options:
- upgrade my old system with an RTX 2080 Ti. It has an i5 4670, 8 GB of DDR3 RAM, no SSD, and a motherboard with the older PCIe 2.0 (will it bottleneck?)
- build a new PC with something like an i3 or a cheap Ryzen 2 processor, 8 GB DDR4 and an SSD, with the RTX 2080 Ti and a motherboard with PCIe 3.0
- same as above but with 2x 1080 Ti
- build a new PC with a Ryzen 2700X, 16 GB DDR4, an SSD and a GTX 1080 Ti.
andrea de luca says
I’ll give you my two cents.
1. In my experience, FP16 cannot double your memory, since you cannot train in FP16 alone. What you will do is a mixture of FP32 and FP16. You will have benefits in terms of memory, but you won’t double it.
2. If possible, avoid AMD processors. I’m sorry to say this, but with an Intel CPU you can benefit from Intel MKL, which helps you with a lot of collateral tasks.
3. Gen2 x16 won’t bottleneck you, but 8 GB of RAM definitely will. With 22 GB of VRAM, which becomes more in mixed-precision training, I’d settle for 64 GB of RAM, and not less.
Bull Shark says
So which of the four options would be best? I can upgrade my old system’s RAM to a max of 32 GB DDR3. There is also no second PCIe slot for 2x GTX 1080 Ti, so that option only belongs to the ‘build a new PC’ options.
Bull Shark says
Would it, for example, be a good idea to put an RTX 2080 Ti in my old system? Or is this system just too ‘old’ to live up to the power of the 2080 Ti, and would it be a better idea to just build an all-new PC?
Bull Shark says
What would you think about this Tim?
Nguyen says
Hi Tim,
for various reasons, we can only order a workstation from the Dell website. We only have a few choices of Quadro cards, such as 2x GP100 (16 GB) and 3-4 other cards (P4000, P5000, or P6000). We will do a lot of image processing and machine learning. What would you suggest: the dual GP100 (+ 10.868,00 €), 3x P5000 (+ 3.956,55 €), or 3x P6000 (+ 11.188,45 €)? We can pay if we have to, but we also prefer to do the same for less. Thanks in advance! Best, TR
Tim Dettmers says
These are all terrible options, I would not recommend any of them. Consider getting a Hetzner machine with a GTX 1080 ($130 per month) for prototyping and then rent AWS/Azure/TPUs to run jobs once you prototyped your models. You might also like to privately buy an RTX 2070 equipped desktop for prototyping and then just run cloud jobs on AWS/Azure/TPUs.
Anuj says
Why is it always advised to prototype models in-house and not in the cloud? I would like to know the negatives of prototyping a model in the cloud and then buying a workstation to tune the hyperparameters.
Tim Dettmers says
To make prototyping cost-efficient you would need to ramp up and shut down a spot instance every time you made a fix/change. This would make prototyping very slow. Debugging your models locally is much more cost-efficient. You can do this on a cheap/slow GPU (which will still be more than fast enough for debugging). Once you have debugged your model you can run it in parallel on a big cloud instance.
If you train a lot of models, it is always more cost-efficient to buy a workstation, even if you do extensive hyperparameter searches. However, if time is short or you do not run models frequently, the cloud will be faster or more cost-efficient.
I guess it all depends on the exact setup. I am sure there are cloud providers which offer a reasonable latency to power up an instance for prototyping, but it is cumbersome and requires an initial time investment to set up such a development procedure. So I guess what you propose can make sense, but it is just difficult to execute — not for your regular user.
Ruirui Liu says
Which solution is more efficient for deep learning: 2x RTX 2070 graphics cards or a single RTX 2080 Ti?
Thank you,
Tim Dettmers says
2x RTX 2070 is much better if the memory on them is enough for you.
andrea says
Hi Tim. The 2070 is out, along with its specifications. The street price is ~550 euros.
It seems it *has* Tensor Cores and (consequently) FP16 capability. I’d like to hear your opinion about its price/performance ratio.
Tim Dettmers says
I waited for the RTX 2070 to be released for an update. I will update the blog post today or tomorrow.
Brian A. Mulrooney says
I think the $499.99 models of the 2070 with 2304 CUDA cores could be the new price/performance king. The cheapest 2080 is $799.99 with 2944 CUDA cores. Adjusted for price per core, you are looking at ~20% more cores per dollar with the 2070. Consider that these cards can be manually clocked about ~200 MHz above the boost clock; then you have a very serious contender at the $500 price point. Likely within 30% of the performance, at two-thirds the cost of the 2080.
Tim Dettmers says
You are right. After doing my calculations the RTX 2070 is on top. One issue though is that it only has 8 GB of memory. That practically turns to 16GB though if you use 16-bit compute — which is an absolute requirement to get good performance from the RTX series — so maybe not the biggest issue.
Nile Furth says
One drawback to the RTX 2070 is that it does not have NVlink capability. That’s been a trend with Nvidia: with each successive generation, fewer and fewer cards have SLI/multi-link support. I’m still not clear on whether RTX video memory can be pooled/shared in Linux (Puget Systems indicates this cannot currently be done in Windows), but if that is your desired use case, or you want to play Tomb Raider at 4K between jobs, you might have to go with the 2080 or 2080ti instead.
Tim Dettmers says
I think for deep learning NVLink will not make the biggest difference because it is limited to two GPUs. For two GPUs the PCIe bus is not the biggest bottleneck if you want to parallelize via data parallelism. NVLink enables possibilities for model parallelism which were not possible before, which means not faster models but first and foremost bigger models. But I think the option to do efficient model parallelism is not that important because currently there are no real deep learning models which would require this. So overall I would not worry if you cannot use NVLink. The RTX 2070 is fine without.
andrea says
Tim, as you surely noticed, you cannot just do everything in FP16; e.g. you calculate the gradients in FP32, then backpropagate them in FP16, and then do the update step again in FP32. This is what they call mixed-precision training. So, depending on the implementation, you will NOT double the amount of memory. The exact gains in terms of memory have to be evaluated by experimentation. Maybe you could do such experiments and let us know (I don’t have any card with FP16 Tensor Cores as of yet). Thanks!!
Tim Dettmers says
You should see a halving of memory if you train in straight 16-bit. Usually, you do not see a decrease in performance when training in 16-bit if you use gradient scaling, that is, multiplying the error by a big number during backprop and dividing by that number for weight updates. If you go below 10 bits you usually see some form of decrease in predictive performance, but we are not there yet! 16-bit is very safe to train in.
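To illustrate the gradient-scaling idea, here is a minimal sketch of straight 16-bit training with a scaled loss (my own illustration; the model is a placeholder and the scale factor 1024 is an arbitrary example value):

```python
# Scale the loss up before backprop, scale the gradients back down before the update.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(1024, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scale = 1024.0   # arbitrary example value; keeps small gradients representable in FP16

x = torch.randn(32, 1024, device='cuda').half()
y = torch.randint(0, 10, (32,), device='cuda')

loss = F.cross_entropy(model(x), y)
(loss * scale).backward()            # multiply the error by a big number during backprop
for p in model.parameters():
    p.grad.div_(scale)               # divide by that number again before the weight update
optimizer.step()
optimizer.zero_grad()
```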
George says
Informative and well-written article.
p says
Hi Tim, between the Quadro P2000 4 GB GDDR5 and the GTX 1050 Ti Max-Q 4 GB GDDR5, which has better performance? Is the difference noticeable?
Sourabh says
Hi Tim,
Thank you for your wonderful analysis.
We want to set up a lab for deep learning, mostly training CNNs or ResNets.
Which NVIDIA GPU should we procure?
Option a) GeForce RTX 2080 Ti (Turing microarchitecture)
Option b) NVIDIA TITAN V (Volta microarchitecture)
According to your blog:
“Now a combination of bandwidth, FLOPS, and Tensor Cores are the best indicator for the performance of a GPU”
“So overall, the best rule of thumb would be: Look at bandwidth if you use RNNs; look at FLOPS if you use convolution; get Tensor Cores if you can afford them”
On this wiki page (https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
we observe that Option b) has more Tensor Cores, GFLOPS & bandwidth. But you have not discussed/considered Option b) in your blog post. Please let us know which one is better, especially for training computer vision deep learning tasks.
Thank you.
Tim Dettmers says
Hi Sourabh, the Titan V is very inefficient in terms of cost/performance. I would definitely go for the RTX 2080 Ti! First benchmarks also show that their performance is comparable and thus you will probably get roughly the same performance for $1800 less if you buy RTX 2080 Ti GPUs.
Rinish says
Hi Tim, I preordered an RTX 2080. I was wondering whether the 2080 has better performance than the 1080 Ti. Should I cancel my 2080 and get a 1080 Ti?
Tim Dettmers says
The RTX 2080 will be faster than the GTX 1080 Ti — at least 25% faster. However, these data are somewhat unclear and no high-quality CUDA 10 Tensor Core benchmarks exist yet. This means that the RTX 2080 will likely be even faster for convolutions than 25% compared to the GTX 1080 Ti.
Artur says
Hi Tim,
Thank you very much for the in depth article, very useful!
I was wondering: if I want to train two models at the same time, am I better off with one very good GPU, or with training each model on one of two lesser GPUs?
Artur
Tim Dettmers says
Hi Artur, In your case, two slower GPUs are usually better and even necessary if your networks consume a good amount of memory. If you train one model only, then you could also use the two GPUs with data parallelism for good speedups. In both cases, two small GPUs should be better. The only disadvantage is if you want to get more GPUs and they do not fit on your motherboard. Also, two smaller GPUs are often a bit more expensive than a single big one.
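For reference, here is a minimal sketch of the data-parallel option mentioned above, using PyTorch’s built-in wrapper (my own illustration; the model and batch size are placeholders):

```python
# One model replicated across two GPUs; each batch is split between them.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()   # splits each batch across GPUs 0 and 1

x = torch.randn(64, 1024).cuda()   # this batch is processed 32/32 on the two cards
out = model(x)                     # outputs are gathered back on GPU 0
```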
Artur says
Thank you very much Tim.
Another question: everyone is praising GPUs versus CPUs for different reasons that I do understand. However, I was not able to find a single study that compares the performance of a GPU with that of a good CPU (e.g. an AMD Ryzen Threadripper 2950X). All studies use outdated/standard CPUs and compare them with very good GPUs. Have you actually compared both? People I know in the 3D field are back to CPUs (but the best ones) and were disappointed by GPUs.
Tim Dettmers says
GPUs are not good for some computational problems, especially if you access data selectively, that is, you do not access 128 bytes sequentially (for example, 32 sequential positions in a 32-bit floating point array). This might be why GPUs are not suitable for some problems in 3D.
Art Lee says
Hi Tim, I was wondering if you could comment on which of the RTX 2080 or GTX 1080 Ti one should get for deep learning on a budget. These two cards are very close in price. The 2080 has Tensor Cores but the 1080 Ti has 11 GB of RAM. I don’t mind sacrificing a bit of speed if I can train larger networks with the 1080 Ti.
Tim Dettmers says
Hi Art, I think you should go for a used (eBay or otherwise) GTX 1080 Ti — this is a very solid choice if you are on a budget right now and the memory on the RTX 2080 is not enough for you.
Frixos says
Dear Tim,
I recently discovered your blog, and it seems at just the right time! I am looking to purchase a 1080 Ti soon. At first, I wanted to buy 2 more GTX 970s to pair with the one I have now, but I’d have to change my motherboard and PSU accordingly so I can use a multi-GPU system. It seems the 970 is severely out of date, though.
I wanted to ask if you have any experience or suggestions about prices when it comes to the GTX 1080 Ti. My main goal was to buy one around the Black Friday period. Do you think that’s a good idea? Or should I just get one at the beginning of October?
Mainly, I am asking because I’ve been reading that the value of the GTX 1080 Ti is crazy good now compared with the RTX 2080. Thus, I am afraid that the prices won’t drop much during Black Friday, whereas right now the price drop is fairly decent for the 1080 Ti.
P.S.: I’d love to hear your thoughts regarding a single 1080 Ti vs. multiple 970s. I am a postgraduate student whose main concern is entering Kaggle competitions.
Tim Dettmers says
Hi Frixos, I think going for a GTX 1080 Ti over a GTX 970 or RTX 2080 is a good choice if you need memory and are on a budget. It is difficult to predict what the market will look like around Black Friday though: Gamers were dissatisfied with the RTX 2080, but now that benchmarks are coming out they seem more positive. This might increase demand so that GTX 1080 Ti prices will drop — but this could totally change in a couple of days. There are also rumors that NVIDIA has a very large stockpile of RTX cards, which might make them very cheap if their sales are lower than predicted.
I would go for a used GTX 1080 Ti now on eBay (or similar sites) if I were you. That should be quite cheap right now and should be a solid choice, especially if you are on a budget.
Well Honey says
Here is a benchmark with TensorFlow Deepfakes. It seems like RTX 2080 Ti is nearly the same performance as Titan V.
https://www.youtube.com/watch?v=1KEHi-7r8VE at 16:45. Also CNN CIFAR-10 at 16:01
Tim Dettmers says
Thanks for making me aware of this! Deepfakes and CIFAR-10 are not the best performance benchmarks, but it hints at a general direction. It seems that the RTX 2080 Ti is closer to the Titan V than I thought and the RTX 2080 is closer to the RTX 2080 Ti. This might change for particular cases like LSTM benchmarks and straight ImageNet benchmarks. I will probably update the blog post at the end of the week.
Well Honey says
from Pugetsystems:
https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/
peter says
Hi Tim, am I correct that a laptop with an NVIDIA GeForce MX150 2 GB GDDR5 won’t be able to run the TensorFlow GPU version because it is not listed in the list of GPUs compatible with CUDA?
stkarlos says
Any update on this?
Bordeaux25 says
Hi Tim,
thank you for your very thorough guide.
I was wondering if a quad-core CPU like the AMD Ryzen 2400G is too much of a limiting factor for a DL build based on a single 1080 Ti or on an RTX 2080 (Ti).
Thanks
Tim Dettmers says
I think the AMD Ryzen 2400G is perfect for a single GTX 1080 Ti or an RTX 2080 Ti — so from a deep learning perspective, this is totally fine. The question is if you have other workloads which are CPU heavy and depending on the answer you might want to upgrade your CPU.
Chip Reuben says
When you use the GPUs for machine learning, what gets used as your video card? Say I am installing two EVGA GeForce GTX 1080 Tis on an MSI X299M Pro Carbon AC along with an Intel Core i9-7900X 3.3 GHz ten-core 13.37 MB 140 W CPU (and the necessary RAM, fans, etc.).
Does one get used as the video card and one for GPU computing? And what if I use an SLI bridge — what gets used as the video card then?
Or let’s just say I start with just one EVGA GeForce GTX 1080 Ti. What gets used as the video card?
Tim Dettmers says
You can use both for GPU computing, but one card will consume a bit of RAM for the monitor (usually 100-300 MB, depending on resolution etc.) and will consume a bit of compute to render the display (less than 5%; usually close to 0%). With an SLI bridge both GPUs will drive the display(s), but you can still use both GPUs and both GPUs’ RAM for CUDA computations — so no worry, this will not affect your CUDA / deep learning experience at all!
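If you want to check this yourself, here is a small sketch (my own illustration) that prints what a PyTorch process sees per GPU; the display’s 100-300 MB show up in nvidia-smi rather than in these per-process numbers:

```python
# Print total memory and the memory allocated by this PyTorch process on each GPU.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated_mb = torch.cuda.memory_allocated(i) / 1e6
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB total, "
          f"{allocated_mb:.0f} MB allocated by this process")
```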
Willfried Wienholt says
Thanks, Tim, for your excellent blog post including the RTX 20 series.
I wonder in what way the new series puts higher demands on desktop systems for deep learning purposes. What kind of CPU, RAM, and motherboard would you consider for excellent performance today if datasets are reasonably small (several hundred MB) but networks designed for NLP, classification, or time series prediction might need quite some performance?
Tim Dettmers says
I should update my other blog post about hardware. The CPU does not matter that much if you have a single GPU, but it can matter if you have four. Supported PCIe lanes on both the CPU and motherboard are an important criterion if you have more than 2 GPUs — aim for at least 8 lanes per GPU. Any DDR4 RAM is good for deep learning; however, if the bulk of your work is in machine learning or you need to preprocess data a lot, I recommend getting fast DDR4 RAM. Get at least the same amount of RAM as your GPUs have combined (if you have 4 GPUs you can go a bit lower; instead of 44 GB (4x RTX 2080 Ti) you can go with 32 GB, for example). If you want an NVMe SSD because you process datasets a lot, make sure that you have enough lanes for the GPUs — remember, 8 lanes for each GPU + 4 PCIe lanes for the NVMe SSD.
Alexander/O says
On the Supermicro 4028GR-TR2 the PCIe PLX switches have 96 lanes each. There are two of them connected to CPU1.
This might be the problem with respect to the bandwidth issue: http://www.supermicro.com/support/faqs/faq.cfm?faq=20732
I will test it on Monday. I did check the 4027GR-TRT today and the ACSCtl flags are all negative (correct for maximum bandwidth throughput).
Tim says
Will you update this with the T4?
Thomas says
Hi Tim,
Very nice article! I just bought a GTX 1080 Ti for 650 euros, which is around 100 euros cheaper than the average price of a 1080 Ti in my country (around 750-800 euros). The cheaper price is due to a one-day-only discount, so I decided to just go for it, since I can always return it within 2 weeks if I decide to go for something else.
It was really hard to choose with the RTX 2080 on the horizon with Tensor Cores. However, the cheapest RTX 2080 card here currently sells for 850 euros, which is significantly more expensive. Plus I am also a bit wary because of the lack of benchmarks.
What do you think, will I be good with the 1080 Ti, also considering the price? Or should I definitely return it and buy the RTX 2080?
Tim Dettmers says
Hi Thomas, congrats on the deal! I think this is really determined by up-to-date benchmarks. Given the numbers that I calculated, both choices definitely make sense. If you see benchmarks that favor RTX cards also make sure to have a look at software support for TensorCores. If people complain that they cannot run their models reliably with TensorCores then the GTX 1080 Ti will be better — at least until the problems with the TensorCores are fixed in TensorFlow / PyTorch.
Thomas says
Thanks a lot, I am very happy with it. The cheapest RTX 2080 is about 30% more expensive than the GTX 1080 Ti I have now, but I really doubt that the RTX 2080 will perform 30% better. But we will have to see, of course.
Also, as you said: software support will probably be lacking for at least a little while. It is still bleeding-edge tech.
Remi Cadene says
The best card for deep learning is the Titan V.
The best ratio between computing capability and price is the Titan Xp.
The RTX 2080 and RTX 2080 Ti are not made for deep learning; they are made for gaming and computer-animated movies.
A benchmark with VGG16 and ResNet50 in 16-bit and 32-bit is needed!!!!
Also, 16-bit training does not work well. Mixed precision (16-bit + 32-bit) is needed to converge to good local minima: http://on-demand.gputechconf.com/gtc/2018/video/S81012/
Tim Dettmers says
RTX 2080 and RTX 2080 Ti GPUs have Tensor Cores which are made for deep learning — so I do not see your point.
I did research on low-precision deep learning myself and I trained convolutional networks with 8-bit activations and gradients — that was not a big problem and the results were statistically the same. Training neural networks with 16-bit is very practical. In some cases you will see a decrease in performance, but it is not enough to justify sticking to 32-bit. There is even work that shows you can train neural networks without any loss of precision with 10-14 bits. Of course, for people who want to squeeze the last bit of performance out of a network (usually only computer vision researchers), software needs to be adjusted to work under these circumstances, but if there is a need these things will be developed.
It is in NVIDIA’s interest to make us believe that mixed precision is the only way to go. I do not believe that until I see more evidence for it.
Vadim says
Hello, Tim!
Thanks for the great article!
What do you think about ready-to-use PCs (where you don’t need to buy parts separately and somehow connect them, especially if you don’t know how)?
Can you recommend some models?
Thanks in advance!
Tim Dettmers says
There are some deep-learning-branded desktop PCs, but they are very expensive. I would recommend going with a gaming PC with a GPU that you want. See if the computer allows for additional GPUs. This will allow you to add another GPU later on if you like. Exchanging GPUs is easier than building a computer on your own, but building a computer is also not that difficult and it is a good skill to have. But if you insist on not building your own PC, a gaming PC is a good option.
Otherwise, there are also services where you buy parts and they then put all the parts together for you. This could also be an option if you can find a service like that. A local computer shop might do that for you, and then you also always have somebody to talk to if something is wrong with the hardware of your computer.
Alexander says
If the cost of the RTX Titan is going to be anything like the Titan V, then the end cost in USD of GTX 10xx + Titan RTX will be $4000 + opportunity cost.
I just had my company send back to NVIDIA 4 (out of 8) cards we bought on Aug. 8, because I could not in good faith allow them to spend $12,000 when I can use that money to buy ten (10) overclocked cards (thinking water-cooled EVGA RTX 2080 Ti) and fully populate our 10-GPU Supermicro.
Just buy a highly overclocked 2080 Ti, and I bet it will be faster than the RTX Titan, because there is very little extra hardware on that Turing chip to activate to convert it into a Titan.
Tim Dettmers says
That makes sense. The only big difference between a Titan and an XX80 is usually a bit of RAM. The XX80 Ti, however, often closes that gap. A Titan also makes sense if it is important to squeeze every bit of performance out of a GPU slot. However, if you already have a 10-GPU server and just need to swap GPUs, then the RTX 2080 Ti is an excellent choice. I would have done the same in your situation.
Alexander/O says
I am observing very bad bandwidth benchmarks for Titan V GPUs, plugged into a Supermicro single root complex system. I have no idea why.
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 552.51 5.73 5.73
1 5.76 555.65 5.72
2 5.77 5.71 555.65
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2
0 554.08 4.21 4.18
1 4.14 558.04 4.18
2 4.14 4.21 556.45
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 561.65 6.12 6.14
1 6.11 561.24 6.12
2 6.11 6.10 562.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 563.67 8.04 8.06
1 8.05 563.27 8.09
2 8.06 8.10 560.84
P2P=Disabled Latency Matrix (us)
GPU 0 1 2
0 2.20 16.68 17.39
1 17.02 2.18 16.88
2 16.73 16.95 2.20
CPU 0 1 2
0 4.71 11.17 11.20
1 11.18 4.75 11.10
2 11.15 10.93 4.58
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2
0 2.16 1.65 1.65
1 1.65 2.19 1.64
2 1.69 1.69 2.21
CPU 0 1 2
0 4.69 3.22 3.20
1 3.20 4.57 3.16
2 3.22 3.17 4.63
Tim Dettmers says
I cannot quickly find the exact benchmark that you are using. But the simpleP2P benchmark from the CUDA samples uses 64 MB buffers. If you are running on 8 lanes per GPU on a single root complex, then 4-5 GB/s is not unreasonable. In usual applications, you will not see bandwidths above 7 GB/s. You can achieve the theoretical 8 GB/s if you use very large buffers, but that will not happen in practice. In my theoretical model for deep learning parallelism, I benchmarked the PCIe bandwidth for typical sizes of gradients and activations and found that it is about 5 GB/s for 8 PCIe lanes, very similar to your numbers. If you increase the buffer size, you should see higher numbers.
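To see the buffer-size effect yourself, you can time copies for a few buffer sizes. The minimal sketch below (assuming PyTorch and a single CUDA GPU; the sizes are arbitrary examples) measures plain host-to-device bandwidth, which shows the same small-buffer penalty as the peer-to-peer test:

# Minimal sketch: host-to-device PCIe copy bandwidth for several buffer sizes.
# Assumes PyTorch and one CUDA GPU; buffer sizes are arbitrary examples.
import time
import torch

def h2d_bandwidth(num_bytes, repeats=20):
    # Pinned host memory gives the most realistic PCIe transfer numbers.
    host = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        dev.copy_(host, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return num_bytes * repeats / elapsed / 1e9  # GB/s

for mb in (1, 8, 64, 256):
    print(f"{mb:4d} MB buffer: {h2d_bandwidth(mb * 1024**2):.2f} GB/s")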
Alexander/O says
Numbers for Supermicro 4027GR-TRT with 4 Titan X (Maxwell, EVGA Hybrid)
developer@theano:~/NVIDIA_CUDA-9.2_Samples/1_Utilities/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest |more
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX TITAN X, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX TITAN X, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX TITAN X, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX TITAN X, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 253.07 9.61 10.76 10.62
1 9.63 257.11 10.71 10.65
2 11.06 10.92 256.94 9.67
3 10.95 10.78 9.63 253.54
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 254.21 13.18 10.27 10.27
1 13.18 258.23 10.09 10.27
2 10.27 10.18 258.28 13.18
3 10.27 10.17 13.18 255.75
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 256.31 10.32 18.56 18.20
1 10.12 257.80 18.54 17.98
2 18.25 18.53 258.08 10.14
3 18.36 18.21 10.18 255.89
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 256.58 25.40 18.62 18.62
1 25.41 258.61 18.60 18.62
2 18.62 18.62 258.56 25.41
3 18.61 18.61 25.41 256.49
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 3.01 11.34 18.27 11.97
1 17.15 2.89 13.50 14.48
2 11.68 12.73 2.95 12.17
3 13.64 14.89 17.64 2.94
CPU 0 1 2 3
0 7.64 13.85 14.21 12.92
1 13.70 7.00 13.61 13.25
2 13.71 13.65 7.25 13.76
3 14.04 13.48 13.78 7.22
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 2.91 1.22 1.61 1.62
1 1.18 2.88 1.61 1.61
2 1.62 1.63 3.01 1.20
3 1.62 1.62 1.19 2.93
CPU 0 1 2 3
0 7.25 3.98 3.86 3.86
1 3.64 6.59 3.60 3.70
2 3.63 3.61 6.89 3.62
3 3.63 3.58 3.76 6.63
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Alexander/O says
On the Supermicro 4028GR-TR2 the PCIe PLX switches have 96 lanes each. There are two of them connected to CPU1.
This might be the problem http://www.supermicro.com/support/faqs/faq.cfm?faq=20732
Will test it on Monday. I did check the 4027GR-TRT today and ACSCtl are all negative (correct to get maximum bandwidth numbers).
Emmanuel says
Hello,
Thanks for this very useful article!
I wanted to use some GTX 1080 Ti cards (or maybe RTX 2080 Ti after reading this) to start doing deep learning on images. As I am at a university, I will not be a data center, but we have one, and maybe my GPUs will be used by other people (with very restricted access). I'm wondering what this new policy agreement for CUDA looks like, but I am not able to find where it is specified that CUDA should not be used in data centers with GTX or RTX cards.
Do you have any other information on this, or do you know where I can find this policy agreement?
Thanks a lot for your help !
Tim Dettmers says
The thing is that it is nowhere defined what a "data center" is. But I think as long as you are using a desktop computer (literally a computer under your desk) you will have no problems, even if other people log into that computer remotely. This policy's main purpose is to prevent a company/university from buying hundreds of GPUs and putting them into networked computers. I think this would also be a good definition for the "no data center" policy: dozens to hundreds of networked computers.
Mathias says
There is no mention of Azure?
Tim Dettmers says
You are right, I should add a few words about Azure. I will try to do that in the next update. Thanks for the feedback.
Facundo Calcagno says
I dealt with an issue like this last month. In my company they decided to start with an Amazon p2.xlarge instance that included a K80 GPU. Everything moved really slowly in terms of training. Hence, I tried the same PyTorch code on my personal computer, which has a Titan Xp, and noticed a 5x speedup.
I use PyTorch 0.4, a 3D CNN (modified C3D), data augmentation, and a 4GB dataset.
Can you explain why there is such a big difference?
Tim Dettmers says
I do not think this is a GPU issue since the difference between the GPUs should be much smaller, and the compiled code is very similar, so this should not be an issue of compiling with the wrong flags. What I suspect is that the CPU or the PCIe transfer might be a bottleneck here. The data augmentation that you mention, do you do it before you pass data to the GPU? If so, this would be my main suspect. Another thing is that 3D data can be very large; if you do not use asynchronous CPU->GPU transfers, this might be very slow on AWS where the PCIe architecture is often degraded through virtualization. I think the new V100 GPUs have better virtualization and you might be able to get around this issue with a newer AWS instance. Hope this helps.
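To check whether the copy is the bottleneck, you can overlap the CPU-side augmentation and the CPU->GPU copy with GPU compute. A minimal PyTorch sketch (the fake dataset, batch size, and worker count are placeholder assumptions, not your actual settings):

# Sketch: overlap CPU data loading/augmentation with GPU compute in PyTorch.
# The dataset, batch size, and worker count below are placeholder assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 16, 112, 112),   # fake 3D clips
                        torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=4,     # augmentation would run in worker processes
                    pin_memory=True)   # enables fast asynchronous host-to-device copies

device = torch.device("cuda")
for clips, labels in loader:
    # non_blocking=True lets the copy overlap with GPU work from the previous step
    clips = clips.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass goes here ...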
CALCAGNO Facundo says
Yes indeed, for data augmentation I'm using the PyTorch pipeline that performs the augmentation on NumPy arrays before sending them to the GPU.
Thanks!
Unknown says
Thanks a lot!
It was great!
Cherla Sri Krishna Kiran says
Hi Tim,
I would like your opinion and advice. I am trying to purchase a GPU for Kaggle competitions. The GTX 1080 Ti and RTX 2080 cost the same in my country. The 1080 Ti has 11GB VRAM but the 2080 has Tensor Cores. Which one is better in the long run?
Thanks,
CSKK
Tim Dettmers says
Difficult to say for sure for now since benchmarks for the RTX 2080 are lacking. However, I believe you will get better performance from an RTX 2080 if you use 16-bit precision.
Norman Heckscher says
It’s worth considering a free option from Google. Basic hardware, yet, more than enough to get started.
https://colab.research.google.com/
Tim Dettmers says
That is a good point. I should have looked into that. I will try to include it in an update with the hard RTX performance numbers.
Tim O'Hear says
This is even more the case nowadays. You can get Colab Pro for $10/month, which has fewer usage restrictions (standard Colab bans you for 12 hours after 12 hours of use).
Nowadays you typically get a P100 16GB, and V100 16GB cards are starting to turn up on Colab Pro.
Tim Dettmers says
Thanks for your feedback. From the overall feedback that I got, a broader discussion of cloud solutions was one of the main issues mentioned. I will consider adding a bit more content and writing a small update that focuses more on the cloud.
pep says
If I use an external GPU enclosure, which card makes the most sense to buy? I mean, given the fact that Thunderbolt 3 might be a bottleneck, at what point does it become pointless to buy a better and quicker card?
Thanks
Tim Dettmers says
The bandwidth on Thunderbolt 3 is pretty high and you should not see a serious dent in performance. I would expect a performance hit of about 10-15% at most. The performance penalty is also determined more by the task than by the GPU. The simpler the task, the higher the performance penalty. If you are running a large ResNet model on a large image or running a BiDAF model on a paragraph, you see almost no performance degradation.
gaoyuanbo says
Hi Tim, I am a TensorFlow user. I am using a V100 to train a model; my network has a lot of LSTMs. After reading your blog, I replaced tf.float32 with tf.float16, but I found the speed did not become faster. Can you give me some advice? Thank you very much.
Tim Dettmers says
It might need more configuration. The LSTM benchmarks that I quote use this setup:
Marcin says
I wonder: you mentioned the RTX 2080 and 2080 Ti on your list. However, have you heard any confirmation that their Tensor Cores are actually going to be exposed via CUDA? I mean, it IS a gaming card by its nature and it could cannibalize NVIDIA's Quadro lineup. At the very least, I have heard nothing about it ACTUALLY being supported yet.
Tim Dettmers says
I do not think they would have two different kinds of Tensor Cores, one that works for Tesla and Quadro and is programmable, and one that only works for gaming, but we will see in the coming days and weeks.
Ani says
Hi Tim,
Thanks for the informative post.
I just learnt about “NVIDIA® JETSON XAVIER™ DEVELOPER KIT”.
https://developer.nvidia.com/embedded/jetson-xavier-faq
I know, it’s an edge device, but they are offering it at a great discount to the retail price.
I don't currently own an NVIDIA GPU. Can this be used as a platform for learning ML and for training models?
I would really appreciate your opinion on this one.
Thanks in advance,
– A
Tim Dettmers says
I had a Jetson before, and one problem can be its CPU, which makes installing software more complicated since not all packages support ARM CPUs. I had additional problems with a CPU that only supports 32-bit (the newer Jetsons should support 64-bit), but still, you will spend some time setting everything up if you are planning to use it via SSH from a laptop or desktop. The performance is rather poor compared to dedicated GPUs (even though there is a discount). I would only get it if you need it for robotics etc., that is, only if you actually need a small, portable GPU.
Joseph says
I have read your comments about the previous Jetsons, but the new Xavier looks very promising.
The Xavier reportedly operates at 30 teraflops, which makes it roughly 3x faster than a GTX 1080 Ti. So for $2500 you get a full deep learning computer which appears competitive with what most hobbyists would build for that amount of money.
So what are the downsides?
Joseph says
One last thing. The product page says:
“Can NVIDIA GPUs be used with the Jetson AGX Xavier Developer Kit?
The current early access JetPack release does not support this; support will be added in a future release.”
If I’m understanding correctly, does that mean that you could add another GPU card to the Jetson AGX Xavier?
Tim Dettmers says
As I understand it, this is meant to enable prototyping for the Xavier Jetson on your desktop GPU and, when finished, rolling your code out to the Jetson AGX Xavier. This would mean that a company that uses the Xavier for mobile applications does not need one Xavier per developer; developers can work from their desktop stations.
Tim Dettmers says
Since the introduction of Tensor Cores, the interpretation of these numbers is no longer straightforward: (1) The Xavier Jetson can theoretically operate at 30 teraOPS, which means 8-bit or 4-bit compute (not teraFLOPS for 32-bit or 16-bit compute), and (2) the theoretical FLOPS, while realistic for standard GPUs, are very unrealistic for Tensor Cores at 8-bit because your algorithms will generally be limited by bandwidth and not compute, since the Xavier memory is very slow at 137 GB/s. This bottleneck is not as strong for RTX cards, which have about 600 GB/s. You can expect a Xavier Jetson to be about half as fast or less compared to a GTX 1080 Ti for training networks.
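As a back-of-the-envelope illustration of why bandwidth dominates, you can compare the time a layer needs for memory traffic with the time it needs for compute. The sketch below uses a weight matrix applied to a small batch, the typical shape in an RNN step or small-batch inference; all numbers are illustrative assumptions, and the peak compute is held fixed to isolate the bandwidth effect:

# Rough estimate: is a small-batch matrix multiplication bandwidth- or compute-bound?
# All numbers are illustrative assumptions, not measurements.
n, b = 2048, 32                    # weight matrix n x n applied to a batch of b vectors
ops = 2 * n * n * b                # multiply-accumulate operations
bytes_moved = n * n + 2 * n * b    # weights + inputs + outputs at 1 byte each (8-bit)

peak_ops = 30e12                   # ~30 TOPS of 8-bit compute
for name, bw in [("Xavier-like, 137 GB/s", 137e9), ("RTX-class, 600 GB/s", 600e9)]:
    t_compute = ops / peak_ops
    t_memory = bytes_moved / bw
    print(f"{name}: compute {t_compute*1e6:.0f} us vs memory {t_memory*1e6:.0f} us")
# On the slow-memory device the memory term dominates, so the theoretical TOPS are
# never reached; faster memory shrinks that gap.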
Wingman says
It is now 2020, and NVIDIA recently upgraded the RAM to 32GB on the Xavier dev kit, and the current price is $700. I've used the TX1 dev kit as a headless training box in the past (yes, not fantastic performance), but I'd like to upgrade to something that allows me to do much larger batch sizes on CNNs. The next step in my home project is to take my dataset, which now has >100K images, and see if I can bootstrap e.g. MobileNetV2 from scratch. (Let's just say I'm … patient.) Now, newer GPUs are definitely faster, but for getting a larger batch size, I don't see too many options more cost-efficient than the $700 Xavier dev kit for the RAM. What do you think? Is the computational power tradeoff for the larger (albeit shared) RAM maybe a reasonable deal?
Tim Dettmers says
I would only recommend the Xavier kit for regular training/development if you also want to use it for some hacky mobile stuff like vision for robotics. For a headless GPU box I would maybe rent a computer in the cloud or buy a small computer with a small GPU ($600 – $700).
Wingman says
Thanks Tim for the advice on the Xavier! I did more research and decided to go with a used SuperMicro GPU server hosting a K80 (and room for one more). The speed might be slower but the VRAM/$ ratio is hard to beat and I get a good CPU/RAM on the main board too. I’m new to the world of servers so it was a fun journey: https://wingman-jr.blogspot.com/2020/03/the-quest-for-new-hardware-pt-1.html
Ps Z says
Thanks for sharing; this is really helping a lot. However, I'm just curious how you got the performance data for the RTX 2080 Ti right after it came out. Did you get these cards in advance from NVIDIA?
Tim Dettmers says
Thank you for your comment. This is a bit buried in the blog post, but I mention my methodology briefly:
You can also find the LSTM and convolutional benchmarks for the V100 and Titan V in the blog post. From other sources I know how many Tensor Cores the RTX 2080 and RTX 2080 Ti have. If I combine these numbers with the method above, I arrive at the numbers in the charts. These numbers are rough estimates though, and I am curious how close they are to reality once benchmark data appears for the first time. I will update the blog post once those numbers become available.
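For readers curious what this kind of extrapolation looks like mechanically, here is a toy sketch. The Tensor Core counts are public specs, but the baseline throughput is a placeholder you would replace with a measured V100 benchmark, and naive core-count scaling ignores clocks and memory bandwidth, which is why the resulting numbers are only rough:

# Toy sketch: extrapolate relative Tensor Core performance from core counts.
# Core counts are public specs; v100_images_per_sec is a placeholder, not a measurement.
tensor_cores = {"V100": 640, "RTX 2080 Ti": 544, "RTX 2080": 368}

v100_images_per_sec = 1000.0   # placeholder: insert a measured V100 benchmark here

for gpu, cores in tensor_cores.items():
    estimate = v100_images_per_sec * cores / tensor_cores["V100"]
    print(f"{gpu}: ~{estimate:.0f} images/sec (naive core-count scaling)")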
Ps Z says
Sorry I didn’t notice that. Thanks a lot!
Dave says
Hi Tim,
Firstly, thanks A LOT for your work on this article. Secondly, I agree with you (and other folks); let's wait for reliable benchmarks before buying the 2080 Ti/2080.
One last thing: I'm trying to buy a new GTX 1080 Ti but it is always out of stock on the NVIDIA website. Does anyone know how often it is in stock?
Best,
Dave
Tim Dettmers says
I think NVIDIA does not have them in stock anymore on purpose. They want you to buy the new RTX cards. Try another vendor; they should still have GTX 1080 Ti cards. I would also recommend getting a cheap one on eBay!
James says
Hi Tim,
Thanks for the update, this helped me to decide on a 1070 second hand.
For the cloud options, is there any reason why Azure has been left out? They now offer their Data Science Virtual Machine on Linux which can be equipped with K80, P100 or the new V100 all in multiples of 1 to 4 GPUs.
I don’t have any experience with Azure DSVM, but wondered if it had been left out for a reason?
Thanks
andrea de luca says
About the RTX cards: since they have double-fan ventilation, do you think one can stack four cards in a single box?
Tim Dettmers says
If you look at the display connectors of the current cards, you can see that they will occupy two slots. This might differ for future designs, but keep in mind that all GPUs that have a high TDP are usually 2 slots high because the heat would be difficult to dissipate otherwise. I do not think the fans and heatsinks on the RTX 2080 can be made so much better that they would allow for a single slot since the TDP is still quite high.
Ting says
Thanks for your article. I plan to get a GPU to run convolutional neural networks, but the IT support staff said the GTX 1080 requires a different power supply and thus cannot be considered. They recommended the NVIDIA Quadro P2000, 5GB, but it seems someone said this card is not for neural networks but for CAD. Do you know this GPU and whether it is a fit for a CNN?
Thanks!
Tim Dettmers says
A Quadro P2000 would be fine for deep learning. It is slower than a GTX 1080, but you can still run most networks, especially if the input size is not too large. However, the P2000 is a bit pricey. I would recommend a GTX 1050 Ti with 4 GB of RAM, which is 25% slower than a P2000 but half the price.
Peter Bartlett says
According to this:
https://cloud.google.com/gpu/
I can get a P100 GPU for $1.49/hr.
I’m thinking this will be my new ‘I have no money’ option 🙂
Tim Dettmers says
That is about $250 a week if you run it non-stop. I would recommend a Hetzner GPU instance, which is 99 euros per month for a GTX 1080.
Aparna Ravishankar says
Hi Tim,
I would like your opinion and advice. I would like to set up my own workstation at home to pursue my interest in parallel programming (CUDA GPU programming). I'm also keen on running experiments with Webots (a robotics simulator). I was doing a bit of research on various workstations on the market. I am particularly keen on this one: BOSS Xeon E5-2630 v4 Titan Xp SLI. I would like your opinion on it. Is it worth the money or would you recommend something else?
The workstation is NOT for gaming, but for programming and simulations.
Looking forward to your reply,
Thanking you,
Aparna.
Tim Dettmers says
It seems a bit overpriced and does not have the best value for a deep learning desktop. If you need 10 cores for your robotics simulator it might be different, but the CPU is too powerful for deep learning alone. I would save some money on that. You also do not need ECC DDR4 memory; I would go with regular DDR4 memory. These are the things you pay extra for when you buy that machine. The problem is that there is currently no good vendor that sells good desktops for deep learning. I would recommend buying parts and building the PC yourself. You can put together a PC on https://pcpartpicker.com/ and follow a few YouTube videos to learn how to assemble your desktop.
Chirjot Singh says
GeForce MX150 GDDR5 2GB or GeForce 940MX DDR3 2GB for deep learning?
Tim Dettmers says
The GeForce MX150 will be faster. I have not looked at the price though, and the 940MX might also have good cost/performance. I would probably go with the MX150, however.
Kevin says
Hey Tim, thanks for the fantastic article. I'm still in uni and finding my way into the deep learning world, but I was planning on building my own machine for the purpose and had some questions about hardware that I was struggling to find answers for in terms of deep learning, because most PC hardware sources seem to be about gaming. (I'm not sure specifically what programs or datasets I would be using, so these questions are meant pretty generally.) I was planning on starting with a GTX 1080 Ti and adding another later when my budget allowed for it.
My first question is whether my system's cooling would be fine with a CPU AIO cooler and the two 1080 Tis in SLI left on air cooling, or whether it would be worth investing in a custom liquid cooling setup (especially since I'd have to take it apart to add a second GPU later on).
And secondly, I've done some research on the topic and it seems two good CPU choices for this build are the Intel i7 8700K and the Intel i7 6850K. They both have 6 cores, but the 8700K has a higher clock speed and only 16 PCIe lanes, whereas the 6850K has a lower clock speed but 40 PCIe lanes. I was wondering, for deep learning generally, would it be better to bottleneck the two GPUs at x8 lanes in exchange for the higher clock speed of the 8700K, or would it be better to opt for the lower clock speed of the 6850K, knowing that the two GPUs won't hit the PCIe lane bottleneck? From what I can tell the 8700K is better for gaming, but because that is not my primary goal, I can't tell if that translates to deep learning given the PCIe lane bottleneck I may be putting on myself.
Thanks in advance!
Tim Dettmers says
GTX 1080 Tis on air are fine. Liquid is always better, but I do not think it is worth the trouble. You also do not need 16 PCIe lanes per GPU. Even for parallelism, if you only have 2 GPUs, 8 lanes will not make a big difference compared to 16 lanes. With respect to CPUs, I think both are fine. I personally would go with the cheapest Threadripper, but in any case, it does not make a big difference for deep learning.
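A quick back-of-the-envelope calculation shows why the lane count matters less than it sounds for two GPUs; the model size and achievable bandwidths below are illustrative assumptions:

# Rough estimate: extra gradient-sync time per step at x8 vs x16 PCIe 3.0 with 2 GPUs.
# All numbers are illustrative assumptions.
params = 100e6                 # a 100M-parameter model
grad_bytes = params * 4        # FP32 gradients
bw_x8, bw_x16 = 6e9, 12e9      # realistically achievable bandwidth in bytes/s
t_x8 = grad_bytes / bw_x8
t_x16 = grad_bytes / bw_x16
print(f"x8: {t_x8*1e3:.0f} ms per sync, x16: {t_x16*1e3:.0f} ms per sync")
# If a training step takes a few hundred milliseconds of compute and the copy partly
# overlaps with the backward pass, the extra ~33 ms rarely dominates.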
Age says
Wow, so many comments to your great post. If you or anyone could help with recommendations that would be appreciated.
I have one GeForce GTX 1080 Ti with an open-style cooler that blows air into the case (bought new) and one GTX 1080 (bought used) that also has an open-style cooler that blows hot air into the case. (I've been using DL/ML for a couple of years but only last week read that some GPUs funnel hot air out of the case rather than into it.)
I have 2 machines and a GPU in each. One machine can fit 2x Xeons for a total of 80 lanes and a 3x16, 1x8 PCIe 3 config (currently 1x Xeon).
I would like to get another GPU and am not sure if I should get a 1080 ($650 AUD used) or a 1080 Ti ($1060 AUD new). I mainly use DL for time series and NLP work, with occasional convnets.
If I got a Ti, I'd need to water cool the two Tis, which adds complexity but should help with GPU lifespan (currently 75-80 degrees during training). One case is big enough for 420/360mm water cooling radiators.
The other option is to get a used 1080, run it with the second 1080 air-cooled at probably relatively high temps, and hope they last a few years.
Cheers
Tim Dettmers says
The deal on the GTX 1080 looks better — for the application to sequential data 8 GB should be okay. If you want to train something big like BiDAF it can be tricky, but often it will train just fine with a smaller batch size. From my experience, the ventilation of air inside the case makes differences in the range between 2-4 °C and is not really worth the money. Water cooling, on the other hand, is very effective. If you can get a cheap water cooling solution, it might also make sense to invest in GTX 1080 Ti — as you mention the lifetime will be improved (although I never had a GPU die) and the cooling will provide additional performance which will make the GTX 1080 Ti competitive with an air-cooled 4 GPU node with next-generation GPUs.
Personally, I would probably go with the GTX 1080 on air and upgrade to a liquid cooled GTX 1180 or Vega 20 (if it is < $2000) when they become available. But a water-cooled GTX 1080 Ti would give you more performance now, and okay performance for the future, so you might want to skip a generation of GPUs and invest in GTX 1280 or Navi GPUs that come in 1-2 years. Both are good decisions. I do not know if I would buy a liquid cooled GTX 1080 since it will be outdated soon and the cooler might not fit newer models of the GTX 1100 series or AMD GPUs.
Marko Rantala says
Nice, analytical view.
teja says
I am using deep learning code based on the TensorFlow library with a GoogLeNet or ResNet CNN; there are around 1,000 images in each dataset and there will be 4 datasets in total. My laptop has a 7th-generation i5 CPU (2.5 GHz, up to 3.1 GHz, 2 cores, 3MB cache) and an NVIDIA 940MX (2GB, DDR3). Is it possible to use my laptop for my deep learning purposes? If not, please suggest some good and cheap alternatives. I am a researcher.
Tim Dettmers says
Training on your GPU will be a bit slow. Your CPU seems to be okay and you can expect to train a model on it in about 12 to 24 hours. You could run a model overnight. If this is too inconvenient for you, you might want to use GPU spot instances in the cloud, with which the training time should be a couple of minutes (about 5 minutes). If you plan to do a lot of work in deep learning in the future, I would recommend getting a desktop instead of a cloud server. You can get a small desktop with a GTX 1050 Ti for about $500; if you want to do more you can upgrade the GTX 1050 Ti once the GTX 11 series hits the market (2018 Q3).
Michael says
Dear Tim,
thanks a lot for this thorough comparison and the multiple updates over more than two years! Your blog is a very helpful source of information!
While I would be able to make up my mind on which GPU to buy given the state of mid-2017, I have a bit of trouble finding such excellent comparisons for spring 2018. Do you have any recommendations for later GPU models, or can you point to newer sources of information?
Thanks a lot, Michael
Tim Dettmers says
I will publish an update which will include TPUs, AMD GPUs and GTX 11 series sometime in the next month. In the meantime, I would suggest getting a cheap GPU to get your task done and wait for updated GPUs in 2018 Q3 which will increase performance by roughly 50% — so it is well worth the wait. Otherwise, if you do not want to wait, the recommendations of this blog post still stand.
Sunny says
An updated blog post would be great. I am particularly excited about the rumored Vega 20, with 16-32 GB, 1 TB/s, ECC, 150-175 W ... probably too good to be true, but Q3 might certainly have some important hardware releases for HPC and ML from both team green and team red.
Tim Dettmers says
Indeed the Vega 20 looks exciting! The question is really if the price is right. With a good price this could be the turning point from NVIDIA to AMD (albeit a very slow turning, at least initially).
sslz says
The Xeon Phi is one of my main considerations as a hardware accelerator for deep learning. Have you tried the latest Intel compiler suite? Is there better support for Python?
BTW, would the Xeon Phi become a better option than NVIDIA if I can write my own library in C and C++?
Tim Dettmers says
I do not recommend using the Xeon Phi; it will be a very frustrating experience. The experience might be better than what I had about 2 years ago, but it will still be bad. Writing code for the Xeon Phi is terrible, and you will still not be able to come even close to the practical performance of cuDNN because Xeon Phis are too difficult to optimize fully: L1 cache management is an absolute mess; I never bothered with register optimization, but I do not see any tools that would make it work. So I strongly recommend against using Xeon Phis if you care about using your hardware to do something useful. If you are interested in expanding the Xeon Phi code base for its own sake it might be an option, but note that even if you publish good code, few people are likely to use it along with the Xeon Phi.
P says
Hello, I can't find the questions I posted a few days ago, so maybe they did not go through.
How is the performance of the 1060 (6GB) compared with the 1070 Ti? For example, if the 1070 Ti can complete a simulation in one hour, about how much longer will the 1060 take? Given my background, is the 1060 sufficient for me to learn about DL for the first few months? I did some work on NNs in grad school, long before most people had heard the term.
Tim Dettmers says
A GTX 1070 Ti should be about 25% faster than a GTX 1060. If you get the version with 6GB of RAM, a GTX 1060 will be sufficient to explore and learn almost all areas of deep learning, so it is a good choice for the start!
Bruno says
Hi Tim, thanks so much for the efforts put in this article and the comments.
Question for you: I am considering buying an NVIDIA AORUS GTX 1080 Gaming Box for deep learning purposes, as I have a Lenovo laptop with a Thunderbolt 3 port.
I know it is probably less efficient than an internally installed 1080 card; still, is there any other reason why you think it is not a good idea?
Thanks a lot!
Tim Dettmers says
This is a very good choice. With Thunderbolt 3 you will have almost the same performance as a dedicated machine. The performance penalty should be between 0-10%, and for most tasks, it will be 1-3%.
Tchicken says
Hello from France…
I have a small budget…
Someone gave me two NVIDIA GTX 590 GPUs that I am thinking of running in SLI on an MSI Z270 SLI PLUS motherboard with an i7-7700K. Can I do deep learning properly with this, or should I look for something else?
In advance thank you for your answers, friendly, Michel POULET.
Tim Dettmers says
A GTX 590 will not be sufficient. To run cuDNN which is often a requirement for fast CNNs and RNNs you need at least a GTX 600 series or better.
tchicken says
Is a GTX 1060 3GB good?
Thanks.
Tchicken says
Sorry to bother you again, I found this PC:
Processor: Core i3 3220, 2x 3.3 GHz
CPU cooler: Arctic Cooling Freezer 13
Motherboard: ASUS P8H77-M micro-ATX
Memory: 8GB (2x 4GB) DDR3 Crucial Tactical Tracer 1600 MHz
GPU: MSI GTX 660 Twin Frozr III 2GB GDDR5
2x DVI, 1x HDMI, 1x DP
Storage: SSD Crucial M4 128GB
HDD: Western Digital Caviar Green 500GB
Case: Zalman H11 Plus blue, 2x 120mm + 2x 90mm
PSU: Corsair CX430M v2
Can it allow me to start in deep learning?
In advance thank you for your response.
P says
Hello Tim, how does the NVIDIA 1060 compare with the 1070 Ti? For example, how much faster does a 1070 Ti get simulations done compared with the 1060?
William Benjamin says
Hi Tim,
I already have a 1080 Ti and was planning to add another card to the setup. Does it make more sense to:
1) Get another 1080 Ti, or
2) Get a Titan Xp? Based on current prices, there is a $200 difference between a 1080 Ti and a Titan Xp.
Tim Dettmers says
I think the Titan Xp for $200 more is a good deal. However, if you get a GTX 1080 Ti, it might be easier to parallelize across your GPUs. Theoretically, you will also have that option with the Titan Xp since both run the same chipset with the same compute capability, but parallelism might not be effective with two devices of different speeds. If you want to use parallelism a lot, then I would go with a GTX 1080 Ti; otherwise, if you only use parallelism from time to time, I would go with the Titan Xp deal.
jims1990 says
Hi Tim,
Great article, it helped me a lot!
I have read all comments and didn’t find answer for question:
Let’s assume some budget, let it be (10-13k USD).
What will be the best option for deep learning (mostly image recognition task):
a) buying 4x GTX 1080 Ti 11 GB GDDR5X
b) buying 1x Tesla P100 16 GB HBM2
c) anything else?
Let’s assume that other parts in setup are the same (like intel i7 or i9, 128 GB RAM, SSD disk) – it’s all about performance of GPU.
For now I am thinking that 4x 1080 Ti will be best, mostly because of the parallelization possibilities.
I found some benchmark comparisons between the 1080 Ti and P100, but no discussion about 4x 1080 Ti vs 1x P100. I guess this will be helpful not only for me.
Thank you in advance for your answer.
Best regards!
Tim Dettmers says
I would definitely go with 4x GTX 1080 Ti. You have much more flexibility in how to use them. They are faster than the P100 if you use parallelization, they are better if you work with multiple people on the same server, and they are better if you want to run multiple models at the same time. If I were buying a system for research, I would personally go with 4x GTX 1080 Ti at the moment.
P says
Hello Tim, are laptops with a GeForce GTX 1050-1080 (with or without Ti) sufficient to do DL work? How much RAM in the GPU and laptop would be sufficient? I know the more the better.
Tim Dettmers says
It depends on what you are trying to do. Cutting-edge research is not possible, but with any card above 6GB you can do a lot of nice things. You can do Kaggle, use state-of-the-art models on smaller datasets, and train many different models in NLP. With an 8GB card you can do most things except memory-hungry computer vision models on ImageNet and the like. The GTX 10 series laptop GPUs are very powerful, so it's definitely a good option!
P says
Thanks Tim. I have worked on neural networks in grad school but haven't done any deep learning work. I am trying to decide whether to build a high-end workstation now or buy a cheap laptop/desktop with hardware "just good enough" for the next few months of learning. Do you think an NVIDIA 1050 GPU, 16GB RAM, and an i7-7700 or AMD Ryzen 7 1700 would be sufficient to get me started until fall? I am also concerned about whether AMD CPUs such as the Ryzen 7 1700, 1800X, and Threadripper are compatible with TensorFlow and other deep learning frameworks and libraries. In general, is it better to get an Intel CPU? Does it matter much?
Tim Dettmers says
Threadripper is a fantastic CPU for deep learning with its 64 PCIe lanes; I am using it too! It's fully compatible with all deep learning software, and the advantage is that you can run (multiple) NVMe SSDs without a deep learning penalty to your GPUs.
The laptop with the GTX 1050 can get you started. You will be able to run code and algorithms, but mostly on small datasets. A laptop with a GTX can come in handy later since you can program and test code locally on your laptop and later run it on a big machine. This is, for example, nice when you are at airports with shitty wifi.
P says
Thanks Tim. Do I have to worry that the Threadripper CPUs do not support as many instruction sets as Intel's CPUs? Is compatibility with AVX, AVX2, and AVX-512 important? I think I read somewhere that Python or TensorFlow can take advantage of these instruction sets. A few months ago, there were some issues running Ubuntu on Threadripper systems. For example, some users complained about PCIe-related compatibility issues. Since we use GPUs, I am concerned that these issues might affect us. Have these issues been resolved?
Tim Dettmers says
I did not experience any problems so far. If you do not get a GPU you should pay attention to CPU performance, but I would not worry about it if you get a GPU, since you will use the CPU mostly for prototyping if at all.
If I look at benchmarks of general matrix-matrix multiplication then the Threadripper fares well even without those features though — so it should be fine.
P says
Hello Tim, thanks. In the case of using only one 1080 Ti, is it true that even the lowest-end Threadripper 1900X performs better than an Intel 7900X, which supports AVX-512? How about the case of using two 1080 Tis? I am trying to choose between the two CPUs. Which do you recommend?
Marcus says
Hi Tim,
thanks a lot for this interesting article.
What is your opinion about other NVIDIA models like Quadro or Tesla? I have to assemble a system (or some systems) for development in a research group for different purposes. Tasks will be: object recognition / computer vision for collision avoidance, mapping of the environment, predictive maintenance based on systems monitoring data and, to bring everything together, decision making / reasoning based on all of the above inputs to create "intelligent" behaviour within a graphically intense simulation. On top of that, the developed functions should be transferred to a mobile robot equipped with a Jetson TX2 board for real-world testing (MATLAB/Simulink).
Is there a chance to find one CUDA card configuration compatible with all these tasks? I hope this idea/question is not absolutely hare-brained; if so, I apologize as a total newbie to this field 🙂 .
Many greetings,
Marcus
Tim Dettmers says
Tesla and Quadro cards are rebranded GTX cards; they are not really worth the price. However, the CUDA license agreement states that you are not allowed to use GTX cards in “data centers”. So you might want to avoid building a data center and instead give each person in the research group a GTX GPU.
The applications that you mentioned are mainly robotics and computer vision. Some of them might require high memory; others might not be compute and memory intensive. It's difficult to say from this high-level description, but it might be best to aim for a GTX 1080 Ti or a Titan Xp. With these cards you will fulfill the requirements, and since GTX 1070/1080 cards are expensive due to crypto-mining, a 1080 Ti and a Titan Xp are good choices in themselves, so I would go with either of those cards.
Marcus says
Thanks a lot for your tips! I'll do it this way.
dougr says
Tim,
You have previously recommended against the Titan Xp due to the price delta… however, with the current availability and pricing of 1080 Ti cards, most of the time the price differential between a 1080 Ti and a Titan Xp (direct from NVIDIA) is less than $200. Independent of the current limbo status, if you needed a GPU in the near term, what premium would you place on a Titan Xp vs a 1080 Ti? I'm patient enough to wait for any NVIDIA announcements at GTC in late March, but lacking something there, I do not think I have the patience to wait for AMD to come around on the software side.
On a related note, there are a few articles out there discussing GDDR6 in the next round of consumer cards and estimating memory bandwidth based on predicted bus widths and GDDR6 speeds. Most seem to assume that whenever NVIDIA gets around to releasing their next architecture, the initial top-end card will have 16GB of GDDR6 and memory bandwidth nearing 600GB/s (and well above that if they use a wider bus in the later enthusiast card). Do you think those are reasonable assumptions given the market dynamics for consumer cards (gaming, crypto, ML)?
Always impressed with your thoughts on the state of the market, Tim… great resource.
Tim Dettmers says
I agree, if you can snatch a cheap Titan Xp it is well worth it. There also seem to be recent announcements that no new GPU will be introduced at the major GPU conferences. The concept will be introduced, but not the GPU itself. NVIDIA's strategy is likely to introduce a gaming GPU so that deep learning folks have to buy the Titan V if they want to do deep learning. If this is really true, then investing in a Titan Xp makes a lot of sense.
I think your predictions for GDDR6 could make sense. GDDR6 is probably also cheaper to produce than HBM2 memory, so I expect that we will see a lot of cards with it, but as mentioned above, it might be that we see no deep learning cards with it. We will see in the next months.
dougr says
Ended up picking up a water-cooled 1080 Ti in a bundle with a CPU cooler from EVGA for $920… might use the cooler, or just resell it. $300 is too much of a premium for the Xp for me.
Amin says
Hello, thanks for the excellent information.
I want to know how much VRAM I need.
Software:
Ubuntu 16.04 , Caffe , Mobilenet SSD
In CPU-only mode, training consumes 6-8 GB of RAM.
But I have a weak system (with core i5 2400 and 10 GB 1333 RAM) right now and I want to find a good system build.
Training speed = 75 iterations/hour!
The GTX 690 seems to be a good choice because of its 384 GB/s memory bandwidth, its roughly 3,000 CUDA cores, and its low price.
But I'm not sure about its 4 GB of VRAM.
I would also be very thankful if you could give me an estimate of the training speed with this build:
GTX 690
Core i7 6700k (4-4.2 GHz, 8 MB cache)
16 GB DDR4 2133 dual
Tim Dettmers says
I would get a more modern GPU. For cheap GPUs I would recommend a GTX 1050Ti 4GB with 16-bit training, GTX 1060 6GB or a GTX 1070 (with 8GB). I am not sure where the memory consumption comes from, but you want to make sure that the dataset is not stored on the GPU.
The new CPU and RAM will hardly affect training performance at all (maybe 10-15% faster) and your money is better invested in buying a better GPU.
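The point about keeping the dataset off the GPU is easiest to show in a short PyTorch-style sketch (the tensors below are hypothetical placeholders; the same principle applies to a Caffe data layer that stays on the CPU):

# Sketch: keep the full dataset in CPU RAM and move only the current batch to the GPU.
# The tensors here are hypothetical placeholders.
import torch

dataset = torch.randn(2000, 3, 224, 224)          # ~1.2 GB, stays in CPU RAM
labels = torch.randint(0, 21, (2000,))

device = torch.device("cuda")
batch_size = 16
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i + batch_size].to(device)  # only ~10 MB per step lives in VRAM
    target = labels[i:i + batch_size].to(device)
    # ... forward/backward pass goes here ...
# By contrast, dataset.to(device) up front would put the whole ~1.2 GB into VRAM.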
Amin says
Thanks.
I don't have a GPU yet.
(system = core i5 2400 and 10 GB 1333 RAM)
In this system, training consumes 6-8 GB of RAM.
I want to know, if I buy a new system that has a GPU, how much VRAM will be consumed or needed?
Because I want to decide about GPU model.
I don’t want to change software and network parameters. (Except using CUDA!)
Will all of this RAM consumption go to VRAM?
I decided to buy Lenovo Y900 RE gaming tower.
GTX 1080
Core i7 6700k
16 GB DDR4 2133
Bill Ross says
Would a GTX 1030 be a valid option? I have a 1080 Ti, but it is hardly used at all with some of my models (0-13% volatile GPU utilization for predictions), and if cuDNN is supported, I wonder if this would do for predictions, which run for ~8-10 days:
input shape: 3794
model file size: 93040732
keras/tensorflow
E.g. I loaded/ran 83 models in one session on the 1080 ti before running out of memory (forgot to add keras.clear_session()), running 100M cases through each.
I have a dedicated python data prep thread with a queue 2 deep, bringing the process to 150% .
KR says
Hi Tim,
What are your thoughts on putting 7 GPUs in a single machine, or are 4 GPUs the absolute limit? To make it work you would need a motherboard with 7 PCIe slots, water cooling to make the GPUs fit in a single slot, and two power supplies.
See the link below:
http://rawandrendered.com/Octane-Render-Hepta-GPU-Build
Is this a bad idea?
Thanks for the great post.
Tim Dettmers says
I assume that you want to parallelize across 7 cards. If not, there is no reason to get a 7 GPU computer, as there are many hardware problems and bottlenecks associated with this. The build that you linked is not good for deep learning because you have few PCIe lanes for parallelization and it would be very slow. If you want to get a 7 GPU system, the only way to go currently is with an EPYC CPU and this motherboard: http://b2b.gigabyte.com/Server-Motherboard/AMD-EPYC-7000
Even then there might be problems, but the motherboard above is the only situation where you have enough PCIe lanes and avoid issues with PCIe switches and software. With other CPUs/motherboards you would need 2 CPUs to get the required PCIe lanes for good parallelization performance, and you cannot do CUDA-aware MPI parallelization across the different PCIe root complexes that 2-CPU designs use.
In general, I would only advise you to build such a system if you really need parallelization across 7 GPUs. If you can do your work with 4 GPUs I would definitely recommend going with a 4 GPU setup as there are no software/hardware problems with that — a 4 GPU system is very straightforward.
Nikos Tsarmpopoulos says
The MZ31-AR0-rev-10 doesn’t appear to feature 7x PCI-E x16 slots. It features five full length x16 PCI-E ports, and two x8 ports.
jungju says
Hello, I have one question. Is it possible to do deep learning on a computer with only one PCIe slot available (3.0 x16)? (I've bought a motherboard for my web server which has only one PCIe slot.)
Tim Dettmers says
Yes this will work without any problem.
Jagan Babu says
Dear Tim,
I am currently working on deep learning model training for image recognition.
I want to create my own work environment and am stuck choosing between the MSI 1080 Ti X Trio and the EVGA 1080 Ti FTW3 with ICX technology. I will be running huge data models (>250GB). I am looking for a robust card without burn-related issues, and the cost factor does not matter. Please advise.
Tim Dettmers says
Those cards are very much the same. If you have benchmarks about how their fans perform for cooling, go with the cooler fans. Other than that, there will be no difference. If you are worried about temperatures, you might want to invest in a liquid cooled GPU. They will be much, much cooler (and also faster!).
Jagan Babu says
Thank you Tim…:)
Dave says
Hello Tim, how does the performance of 2-3 NVIDIA 1080 Ti cards installed in a computer compare with one of NVIDIA's new Titan V cards installed in the same computer?
Tim Dettmers says
For LSTMs the 3x GTX 1080Ti will be faster, for convolution the Titan V. Overall I would prefer the GTX 1080Ti as it is much easier to run multiple networks at the same time. This does not work well on a single GPU and is slow.
Lety says
Hi Tim
I'm planning to buy a 1070 Ti, any opinion? It's not in your benchmark analysis.
Thanks.
Tim Dettmers says
Since GPUs can be expensive due to cryptocurrency mining I would keep my eyes open for a cheap GTX 1070/1070Ti/1080 and grab the first cheap card that you can find. All these cards have similar cost/performance. If you can find a used GTX 1070/1080 this might be a bit cheaper and I personally would prefer such a card compared to a brand new GTX 1070Ti.
Bruce Wang says
Brilliant analysis and conclusion.
Thanks a million.
Mathieu says
Hello, and many thanks for your article, which is very useful and relevant, especially the Xeon Phi part (for me) and Intel's behavior (BTW, did you have a premium account which gives you the VIP/direct feedback? I have it and they are rather fast and efficient!). A few of the comments are also very interesting. I'm doing private research in computer vision and AI, and I practice HPC daily.
Do you think you might update your article with the Quadro P6000, which is very relevant when we need to hold all the data on the device? And then the Titan V, which is expensive too, and at the same time not that expensive if the "business" is using FP16 and lots of tensor math? For example, it might be interesting to compare 1-2 Titan Vs with 3-4 1080 Tis, for deep learning, yes, but also for other algorithms which scale well without NVLink or even peer-to-peer communication.
Best
Tim Dettmers says
I think these cards are too expensive for most people, so I will not cover them here. I also do not have enough data on these cards to compare them directly.
Asaf Oron says
Thank you very much Tim for this post.
I am a bit confused: when I look for a certain card, e.g. a GTX 1060, I find the card from multiple vendors, e.g. Zotac or Asus. Some of them have the NVIDIA logo on the box, some don't. Are these the same cards? How do I choose?
I'm looking for my first GPU for a Windows PC capable of running convnets. I have a budget of around $300. What would you recommend?
Tim Dettmers says
Yes, they are the same. The vendors buy GPUs from NVIDIA and put their own drivers, fans, etc. on them, but essentially it is still an NVIDIA GPU in any case.
Sunny says
Hi Tim,
Looking to build a deep learning PC, and I am pretty new to the hardware side of things. Any comments on the recently released NVIDIA Titan V? I am specifically interested in a setup with two Titan V GPUs (with a Xeon or Ryzen), but have been reading that this card has SLI/NVLink disabled. Is it still worthwhile to shell out $6K to have a powerful deep learning setup that will be viable for at least a few years?
Tim Dettmers says
I do not recommend buying Titan Vs. They are too cost-ineffective. I will write a new blog post about this topic in the next few days.
Amir H. Jadidinejad says
Would you please review the functionality of the new Titan V in the field of deep learning and compare it with others such as 1080 ti?
EricPB says
Hi Tim,
NVIDIA just made a surprise announcement yesterday: they are releasing, for immediate purchase, a Titan V priced at $3000 with specs almost identical to the Tesla V100 PCIe ($10,000).
https://www.anandtech.com/show/12135/nvidia-announces-nvidia-titan-v-video-card-gv100-for-3000-dollars
For $3000, you can get either four units of a GTX 1080 Ti ($750 apiece) or a single Titan V.
Which option would you go for, for deep learning?
Cheers,
E.
Haider says
Hi Tim,
I have one 1080 Ti and want to buy another one.
I am thinking of adding a third GPU, a 1060 6GB, so that the PC will not be sluggish when I want to use it for coding or other purposes while training neural networks on the other two 1080 Ti GPUs. And perhaps I would use this third GPU for NN training as well when I don't use the PC, so I can train another NN architecture in the meantime.
I am new to deep learning, but when I used my only 1080 Ti for 3ds Max rendering using V-Ray RT (GPU rendering), it became annoyingly slow.
The question is how many lanes my CPU supports:
The Core i7-3770K spec says up to either 1x16 or 2x8 or 1x8 & 2x4, and it is not clear what the maximum number of lanes is, but it seems to be 16 lanes.
Motherboard GA-Z77X-D3H specs says (under the Expansion slots):
1 x PCI Express x16 slot, running at x16 (PCIEX16)
* For optimum performance, if only one PCI Express graphics card is to be installed, be sure to install it in the PCIEX16 slot.
1 x PCI Express x16 slot, running at x8 (PCIEX8)
* The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot is populated, the PCIEX16 slot will operate at up to x8 mode.
1 x PCI Express x16 slot, running at x4 (PCIEX4)
* The PCIEX4 slot shares bandwidth with the PCIEX1_1/2/3 slots. The PCIEX1_1/2/3 slots will become unavailable when a PCIe x4 expansion card is installed.
3 x PCI Express x1 slots
(The PCIEX4 and PCIEX1 slots conform to PCI Express 2.0 standard.)
This is a bit confusing. Now, can I connect the two GTX 1080 Ti GPUs with PCIe 3.0 running at x8, and the GTX 1060 with PCIe 2.0 running at x4? Or is the maximum 16 lanes, so there is no room for the third GPU?
Perhaps the Motherboard Block Diagram is more enlightening at page 8 here:
http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
It seems the CPU indeed has a maximum of 16 PCIe 3.0 lanes, but the extra 4 lanes are not coming from the CPU's PCI Express bus. They are coming from the Intel Z77 chip, which has its own PCIe 2.0 bus running at x4.
What do you think?
Many thanks!
References:
https://ark.intel.com/products/65523/Intel-Core-i7-3770K-Processor-8M-Cache-up-to-3_90-GHz
https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#sp
https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#support-manual
http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
Jack Marck says
Hi Tim,
I'm interested in deep reinforcement learning for robotics. Since the training episodes for this type of work are often done in real time, is there any tangible benefit to training on a beefy GPU?
Jack Marck says
What are your thoughts on the recent availability of Volta cards via AWS? By my estimate, it would take about 245hrs on AWS On-Demand to break even with a 1080ti.
deepuser says
Hi Tim, Excellent article! Thanks very much for the detailed analysis!
We are looking to build a machine for running deep learning algorithms for computer vision on relatively large datasets (many terabytes). We are deciding between a machine with 4 Titan Xps and a machine with, say, one P100. Some background: we envision this machine being used not for experimentation or for training different models on different GPUs; it will be used to train "a" model and do inference. So, for us, 4 GPUs are useful if we can use multi-GPU training/inference (data/model parallelization) and if it gives us a significant performance improvement. The other option is to have a single high-performance GPU like the P100. But we certainly don't require double precision accuracy. This being the case, do you have any suggestions on either of the two options or something else altogether? Thanks again for your advice!
Scott says
Tim, what about the latest Amazon AWS EC2 P3 instances based on the Volta V100?
Is their price/performance competitive?
ArtVandelay says
Hello Tim,
Thanks for this write up, it’s been very helpful.
You say that for computer vision, a Titan Xp would be recommended for some of the leading models, and also for computer vision with datasets larger than 250GB (or something along those lines).
So I'm asking with regard to DeepMind and their PySC2 research. They are aimed at image-based machine learning AIs.
So would a Titan Xp be better here? Or would a 1080 Ti suffice?
Also, I imagine you know of Matthew Fisher and his blog. He’s done some very interesting things with AI and computer vision.
I am interested in doing what he did with SC2: intercept the Direct3D 9 API to allow his AI to interact with the game. For something like this, would a 1080 Ti be okay? Or would a Titan Xp make a huge difference?
Thank you
sharath says
Hi Tim,
I want to purchase a good GPU specifically for natural language processing computations. Could you suggest a good GPU from both the NVIDIA and AMD lineups that can handle a good amount of NLP tasks?
Also, please suggest the best API (OpenCL, OpenGL, Vulkan) for NLP computations on AMD GPUs for both Windows and Linux.
Paco says
Hello Tim, really useful stuff on your site. Congratulations. I have been reading around that the Tesla V100 has 900GB/s memory bandwidth, but when I go to AWS to read the P3 specs it says EBS bandwidth 1.5 - 14 Gbps. Is that the difference in performance due to virtualization that you were talking about, or is this another metric? Many thanks.
ANkit Dhingra says
Hi, thanks for this well-written article. It is really helpful.
I am a beginner in deep learning and am going to start building deep learning models soon. The data is not too big, rather small.
I have a 16 Core Xeon 2.4GHz CPU, 32GB RAM.
It comes preloaded with Nvidia Quadro k620 (2GB) GPU card.
Will this small GPU work on small data and small deep networks for our development phase, or do we have to buy a new GPU right now? By next month we expect a new machine with a better GPU.
Thanks in Advance
Ankit Dhingra
Arturo Pablo Rocha Ramirez says
Which card should I choose to start in deep learning: an old/new GTX 960/970 or a new GTX 1050 Ti? The prices of the three cards are very similar. Or could I choose an older generation like the 770 or 780?
Thanks for this good post!
Iago Burduli says
Will 2x 1080 Ti have a bottleneck because of the 28 PCIe lanes of the Intel Core i7-7820X, as they will work in a 16+8 lane scheme?
Tim Dettmers says
It's not a bottleneck, but if you run parallel algorithms, you will see a decrease in performance of 0-10% compared to a 16/16 lane setup.
Robin Colclough says
You can upgrade to 64 PCIe lanes by using the AMD Threadripper 1900X, which now retails for US$449 on Amazon.
Combine this with AMD Radeon Instinct AI cards, and the Cuda-cross library, and the performance/$ cost of AI reduces by over 50%.
John A says
thanks for the deep information 🙂
Tim Dettmers says
You are welcome! 🙂
Marian says
Thanks for the great article on GPU selection. Any chance you could offer updated advice on system building? About a year ago, I built a system with 4 1080s based on your recommendations. Now I am interested in building an 8 GPU system, but it seems this will require some rather specialized hardware, and I am not sure what to buy or if it is even practical for an individual or small company to build this kind of machine. Also, I am curious about the new AMD processors that may have more PCI lanes and whether they would be preferred over Xeon now. Also, it would be great to see an article about getting visibility into PCI bus usage and whether it is a bottleneck or not. This is something I have often wondered about.
Tim Dettmers says
Thanks for your feedback. Indeed these are some issues which have been raised again and again and I should write about it to clarify these points.
To give you a few short answers:
– I do not recommend 8 GPU systems; for optimal performance, they require special code which no framework supports. Some frameworks should work with less optimal algorithms without a problem though (for example, roughly a 6.5x speedup instead of a 7.9x speedup for convnets). Another problem with such a system is the power supply (multiple PSUs are usually needed) and cooling (special cooling solutions are often needed). My recommendation is to go with independent 4 GPU systems instead of 8 GPU systems.
– AMD processors do not have any big advantage over Intel CPUs; the extra lanes make almost no difference. Pay attention to cooling solutions first. If you have a liquid cooling solution and you parallelize most of your workloads, only then it may make sense to look into more lanes. Usually, parallel algorithms have a much larger effect than lanes. Good algorithm + few lanes > bad algorithm + maximum lanes.
Marc-Philippe Huget says
Dear Tim,
Have you checked posts from people working on developing Ethereum mining rigs? They consider up to 8-12 GPUs (or claim it is possible) in a single machine. This could be effective for multi-GPU simulations. Any thoughts about that?
Regards,
mph
new_dl_learner says
Hello Tim, how do the mobile versions of the NVIDIA 1040, 1050, and 1060 with/without Ti perform? Are they not as good as the desktop versions? I am considering getting a new laptop, such as a Surface Book 2 or a Lenovo. Thanks
Maximilian says
Hello,
Thank you very much for the tutorial!
I am sorry if I have missed the point but I am a skint student haha.
So I have three options for a GPU under £200: a GTX 1060 6GB for ~£200, a 980 Ti 6GB for a similar price (but a much better spec), or a 780 Ti 2GB for ~£60.
I am planning on running a facial detection model on the GPU.
What are the positives and negatives of these GPUs? Why wouldn't I get the 780 Ti, which has the best CUDA score (https://browser.geekbench.com/cuda-benchmarks) for the price? Do I need more than 2GB (the model is unlikely to be anywhere near that big?), and what is wrong with it being so old?
Any advice would be greatly appreciated.
Thank you
Tim Dettmers says
2 GB can be a bit short on the memory side. There are some sophisticated facial detection models, and they might need more RAM than that. I would go for the GTX 980 Ti if I were you. If you just run inference on pretrained models, though, 2GB should be enough, and then a GTX 780 Ti would be a good choice.
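A quick way to sanity-check whether a given pretrained model even fits is to sum its parameter memory; activations come on top of this, so treat the result as a lower bound. A minimal sketch (ResNet-50 is just an example model, not a facial detection network):

# Estimate the parameter memory of a pretrained model (a lower bound on VRAM use;
# activations, and for training also gradients and optimizer state, come on top).
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # example model only
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.0f} MB")    # roughly 98 MB for ResNet-50 in FP32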
huichen says
HI Tim
Comparing the specs of the GTX 780 and the GTX 1050 Ti,
including bandwidth, bus width, CUDA cores, and TFLOPS,
the GTX 780 is always better than the GTX 1050 Ti.
And they have almost the same price!
But in your opinion, the GTX 780 < GTX 1050 Ti in DL?
Thanks!
Tim Dettmers says
Good catch! The problem is that you cannot compare bandwidth, bus width, and CUDA cores and say "X is better than Y". The architecture of the chip determines how efficiently these things can be used, and the GTX 10 architecture is much more efficient than the GTX 7 architecture. However, I have no hard data, and it might still be that the GTX 780 > GTX 1050 Ti, but unless somebody has some deep learning benchmarks on, say, ResNet performance, I would still assume that GTX 1050 Ti > GTX 780.
Kenneth Hesse says
Hi Tim,
Thanks for the post.
I have been studying machine learning theory for the past three months and I'm itching to start experimenting. I'll start with feed-forward networks, but I am most interested in sequence learning using recurrent networks. I want to experiment with single- and double-layered networks using LSTM cells. I want to mess around with bidirectional network architectures as well. But from reading UCSD's critical review, I'm left with the impression that researchers mostly use Titan Xps for these types of things (Lipton, Berkowitz, Elkan [2015]). If so, I'll focus my experimentation on feed-forward networks for now and wait for grad school to mess around with sequence-capable architectures. Do you know if the GeForce 1080 will be sufficient for training recurrent networks?
Here is my build so far:
Case: Corsair 540 Mid-Tower
Motherboard: MSI