Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How do you make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and offer advice that will help you make a choice that is right for you.
This blog post is designed to give you different levels of understanding of GPUs and the new Ada series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ada series, you can skip right to the performance and performance-per-dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post and the content surrounding it explains the details of what makes up GPU performance.
(2) If you have specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.
(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, it is best to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.
Overview
This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I then discuss the unique features of the new NVIDIA RTX 40 Ada GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.
How do GPUs work?
If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you to understand in which cases GPUs are fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:
This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.
The Most Important GPU Specs for Deep Learning Processing Speed
This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by the memory bandwidth of a GPU, the cache hierarchy, and only then the FLOPS of a GPU.
Tensor Cores
Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication, Tensor Cores are very useful. In fact, they are so powerful that I do not recommend any GPUs that do not have Tensor Cores.
It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show, for a simple example of A*B=C matrix multiplication where all matrices have a size of 32×32, what the computational pattern looks like with and without Tensor Cores. This is a simplified example, and not exactly how a high-performance matrix multiplication kernel would be written, but it covers all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.
To understand this example fully, you have to understand the concept of clock cycles. If a processor runs at 1GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operation needs to wait for the previous operation to finish. This is also called the latency of the operation.
Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. These numbers are for Ampere GPUs, which have relatively slow caches.
- Global memory access (up to 80GB): ~380 cycles
- L2 cache: ~200 cycles
- L1 cache or Shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
- Fused multiplication and addition, a*b+c (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.
For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.
To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.
Matrix multiplication without Tensor Cores
If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about six times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 float matrices into shared memory tiles can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.
To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:
200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles
Let’s look at the cycle cost of using Tensor Cores.
Matrix multiplication with Tensor Cores
With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with one memory transfer (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Core matrix multiplication, in this case, is:
200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.
This example is simplified; for instance, usually each thread needs to calculate which memory to read and write as it transfers data from global memory to shared memory. With the new Hopper (H100) architecture, we additionally have the Tensor Memory Accelerator (TMA), which computes these indices in hardware and thus helps each thread focus on computation rather than computing indices.
Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)
The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support for asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. The TMA unit combines asynchronous copies and index calculation for reads and writes, so each thread no longer needs to calculate which element to read next and can focus on doing more matrix multiplication calculations. This looks as follows.
The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats.
As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile. This means, the second read takes only 200 – 34 – 1 = 165 cycles.
Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.
165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.
This accelerates the matrix multiplication by another 15%.
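To make these numbers easy to play with, here is the same back-of-envelope arithmetic in a few lines of Python; the latency constants are the approximate Ampere values listed earlier, not measured values.

```python
# Back-of-envelope cycle counts for the simplified 32x32 matrix multiply
# examples above, using the approximate Ampere latencies from this post.
GLOBAL_MEM = 200   # cycles, global -> shared memory load
SHARED_MEM = 34    # cycles, shared memory access
FFMA = 4           # cycles, fused multiply-add
TENSOR_CORE = 1    # cycles, one Tensor Core matrix multiply

# Without Tensor Cores: 8 shared memory loads + 8 FFMA rounds.
no_tc = GLOBAL_MEM + 8 * SHARED_MEM + 8 * FFMA                 # 504 cycles

# With Tensor Cores: one shared memory transfer + one Tensor Core op.
with_tc = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE                # 235 cycles

# With async copies / TMA: the next global load overlaps with compute,
# so on average we only wait 200 - 34 - 1 = 165 cycles for it.
with_tma = (GLOBAL_MEM - SHARED_MEM - TENSOR_CORE) + SHARED_MEM + TENSOR_CORE  # 200 cycles

print(no_tc, with_tc, with_tma)                                 # 504 235 200
print(f"Tensor Core speedup: {no_tc / with_tc:.2f}x")           # ~2.14x
print(f"Extra speedup from async copies: {with_tc / with_tma:.2f}x")  # ~1.18x
```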
From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is by far the largest cycle cost for matrix multiplication with Tensor Cores, we would have even faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).
Memory Bandwidth
From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for these large neural networks, the Tensor Cores are idle about 50% of the time.
This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, the A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.
Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is.
While logically, L2 and L1 memory are the same, the L2 cache is larger, and thus the average physical distance that needs to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.
For matrix multiplication, we can use this hierarchical separation into smaller and smaller, and thus faster and faster, chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often just tiles for short.
We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.
Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.
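To make the idea of memory tiles concrete, below is a small NumPy sketch that computes C = A @ B tile by tile. The tile size of 32 and the matrix sizes are arbitrary illustration values, and the loops run sequentially on the CPU; a real GPU kernel picks tile sizes based on shared memory and register budgets and processes tiles in parallel across SMs.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B by accumulating products of (tile x tile) sub-matrices.
    This mirrors how a GPU kernel streams tiles of A and B through shared
    memory, but runs sequentially on the CPU for illustration only."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % tile == 0 and k % tile == 0 and m % tile == 0
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # This (i, j) output tile would live in registers/shared memory on a GPU.
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for kk in range(0, k, tile):
                # Load one tile of A and one tile of B ("global -> shared memory"),
                # then multiply-accumulate them (the "Tensor Core" work).
                acc += A[i:i+tile, kk:kk+tile] @ B[kk:kk+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```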
Each tile size is determined by how much memory we have per streaming multiprocessor (SM) and how much L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:
- Volta (Titan V): 128 KB shared memory / 6 MB L2
- Turing (RTX 20s series): 96 KB shared memory / 5.5 MB L2
- Ampere (RTX 30s series): 128 KB shared memory / 6 MB L2
- Ada (RTX 40s series): 128 KB shared memory / 72 MB L2
We see that Ada has a much larger L2 cache, allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not other GPUs). As such, data needs to be loaded from global memory only once, and then it is available through the L2 cache, making matrix multiplication about 1.5 – 2.0x faster on Ada. For larger models, the speedups are lower during training, but certain sweet spots exist which may make certain models much faster. Inference with a batch size larger than 8 can also benefit immensely from the larger L2 caches.
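As a back-of-envelope check of this claim, the following sketch estimates the memory footprint of one BERT-large feed-forward matrix multiplication in FP16. The hidden and feed-forward dimensions are the standard BERT-large values; the batch size of 32 and sequence length of 512 are assumptions.

```python
# Rough check of whether the matrices of one BERT-large FFN matmul fit into
# Ada's 72 MB L2 cache. hidden=1024 and ffn=4096 are standard BERT-large
# values; batch size 32 and sequence length 512 are assumptions.
bytes_fp16 = 2
hidden, ffn = 1024, 4096
batch, seq = 32, 512

weight = hidden * ffn * bytes_fp16               # ~8 MB weight matrix
activations = batch * seq * hidden * bytes_fp16  # ~32 MB input activations
total_mb = (weight + activations) / 1024**2

print(f"{total_mb:.1f} MB")  # ~40 MB -> fits in Ada's 72 MB L2, not in Ampere's 6 MB
```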
Estimating Ada / Hopper Deep Learning Performance
This section is for those who want to understand the more technical details of how I derive the performance estimates for Ada and Hopper GPUs. If you do not care about these technical aspects, it is safe to skip this section.
Practical Ada / Hopper Speed Estimates
Suppose we have an estimate for one GPU of a GPU architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 vs H100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and numbers of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.
To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.
As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
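Putting these corrections together, here is a small helper that debiases a reported 8x-GPU speedup; the 13.5% batch-size factor and the 5%/7% parallelization overheads are the figures from this section, and the example input is a hypothetical vendor number.

```python
def debias_speedup(raw_speedup_8x, batch_doubling_steps=0, parallel_overhead=0.05):
    """Adjust an 8x-GPU A100-vs-V100 speedup reported with unequal settings.

    raw_speedup_8x:        speedup taken directly from the vendor benchmark
    batch_doubling_steps:  how many times the A100 batch size was doubled
                           relative to the V100 run (each doubling ~ +13.5%)
    parallel_overhead:     extra scaling efficiency of the 8x A100 system
                           (5% for CNNs, 7% for transformers in this post)
    """
    speedup = raw_speedup_8x
    speedup /= 1.135 ** batch_doubling_steps  # remove the batch-size advantage
    speedup /= 1 + parallel_overhead          # remove the better NVLink scaling
    return speedup

# Hypothetical example: vendor reports 1.95x with a 2x larger batch on 8 GPUs.
print(f"{debias_speedup(1.95, batch_doubling_steps=1, parallel_overhead=0.07):.2f}x")
```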
Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:
- SE-ResNeXt101: 1.43x
- Mask R-CNN: 1.47x
- Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
Possible Biases in Estimates
The estimates above are for H100, A100, and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.
As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
Currently, no degradations for Ada GPUs are known, but I will update this post with news on this and let my followers on Twitter know.
Advantages and Problems for RTX 40 and RTX 30 Series
The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use feature, as they provide the same performance boost as Turing does but without any extra programming required.
The Ada RTX 40 series has even further advances, such as 8-bit Float (FP8) Tensor Cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables on RTX 40 cards can be easily prevented by connecting the power cable correctly.
Sparse Network Training
Ampere allows for automatic fine-grained structured sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 of these 4 elements to be zero. Figure 1 shows what this could look like.

When you multiply this sparse weight matrix with some dense inputs, the sparse matrix Tensor Core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size, as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the Tensor Core, which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.
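As a small illustration of the 2:4 pattern, the following NumPy sketch prunes a weight matrix to two non-zero values per group of four and builds a values-plus-indices representation of half the width; the actual on-chip compressed format and metadata layout differ.

```python
import numpy as np

def prune_2_of_4(W):
    """Keep the 2 largest-magnitude values in every group of 4 along each row
    (the fine-grained 2:4 structured sparsity pattern Ampere accelerates)."""
    n, m = W.shape
    assert m % 4 == 0
    groups = W.reshape(n, m // 4, 4)
    # Indices of the 2 largest |values| within each group of 4.
    keep = np.argsort(-np.abs(groups), axis=-1)[..., :2]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (groups * mask).reshape(n, m), np.sort(keep, axis=-1)

W = np.random.randn(4, 8).astype(np.float32)
W_sparse, indices = prune_2_of_4(W)
# Compressed representation: 2 values + 2 small indices per group of 4,
# i.e. a dense matrix of half the width plus a little metadata.
values = np.take_along_axis(W_sparse.reshape(4, 2, 4), indices, axis=-1).reshape(4, 4)
print(W_sparse.shape, values.shape)  # (4, 8) (4, 4)
```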

I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.

While this feature is still experimental and training sparse networks is not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.
Low-precision Computation
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
![Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantizes the range [0, 0.9], while all previous bits are used for the exponent. This allows us to dynamically represent numbers that are both large and small with high precision.](https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=869%2C268&ssl=1)
Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, your gradients explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling, where the loss is multiplied by a scaling factor before backpropagation and the gradients are rescaled afterwards to keep them within the representable FP16 range.
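In PyTorch, this loss-scaling logic is already packaged in a gradient scaler. A minimal sketch of an FP16 training step follows, with a tiny toy model just to show the pattern; the dynamic scale factor handles the scaling and unscaling for you.

```python
import torch

# Minimal FP16 training step with dynamic loss scaling in PyTorch
# (tiny toy model and random data, just to show the loss-scaling pattern).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale the loss so gradients stay in FP16 range
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor dynamically
```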
The BrainFloat 16 format (BF16) uses more bits for the exponent, such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. So with BF16 you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability by using the BF16 format at the cost of a slight loss of precision.
What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With TensorFloat-32 (TF32) precision, you get near-FP32 stability while providing speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!
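In PyTorch, the switch looks roughly like this. This is a sketch; whether TF32 is enabled by default depends on the PyTorch version you are using.

```python
import torch

# TF32: enable once, and FP32 matrix multiplications use Tensor Cores under the hood.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16: same autocast pattern as FP16, but no GradScaler / loss scaling needed
# because BF16 has the same dynamic range as FP32.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```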
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
Fan Designs and GPUs Temperature Issues
While the new fan design of the RTX 30 series performs very well to cool the GPU, different fan designs of non-founders edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution to this is to use PCIe extenders to create space between GPUs.
Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This has been running with no problems at all for 4 years now. It can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and cooling issue for a 4x RTX 4090 setup with a single simple solution.

3-slot Design and Power Issues
The RTX 3090 and RTX 4090 are 3-slot GPUs, so you will not be able to use them in a 4x setup with the default fan design from NVIDIA. This is kind of justified because they run at over 350W TDP, and they will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high-wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there are currently few standard desktop PSUs above 1600W on the market. If you get server or cryptomining PSUs, beware of the form factor — make sure they fit into your computer case.
Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might be just enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup, cooling and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.

You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
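If you want to try this yourself, power limits can be set with nvidia-smi. Here is a minimal sketch; the 300 W value and the 4-GPU loop are just examples, and setting power limits requires root privileges and a driver/GPU that supports software power limits.

```python
import subprocess

# Example: set a 300 W power limit on each of 4 GPUs.
POWER_LIMIT_W = 300
for gpu_id in range(4):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
# Equivalent one-liner in a shell: sudo nvidia-smi -i 0 -pl 300
```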
RTX 4090s and Melting Power Connectors: How to Prevent Problems
There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem, and that the problem occurred due to user error. Here is a video that shows that the main problem is that cables were not inserted correctly.
So using RTX 4090 cards is perfectly safe if you follow the following install instructions:
- If you use an old cable or old GPU, make sure the contacts are free of debris / dust.
- Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
- Test for good fit by wiggling the power cable left to right. The cable should not move.
- Check the contact with the socket visually; there should be no gap between cable and socket.
8-bit Float Support in H100 and RTX 40 series GPUs
The support of the 8-bit Float (FP8) data type is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs, you can load the data for matrix multiplication twice as fast and store twice as many matrix elements in your caches, which in the Ada and Hopper architectures are very large, and with FP8 Tensor Cores you get 0.66 PFLOPS of compute for an RTX 4090 — this is more FLOPS than the entirety of the world's fastest supercomputer in 2007. 4x RTX 4090 with FP8 compute rival the fastest supercomputer in the world in 2010 (deep learning started to work just in 2009).
The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate nonsense during inference. I have written a paper about the emergence of instabilities in large language models, and I have also written a more accessible blog post.
The main takeaway is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision, everything works just fine.
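To see why a few high-precision dimensions are enough, here is a toy NumPy sketch of the decomposition idea: the few outlier feature dimensions are multiplied in full precision while the rest is absmax-quantized to Int8 (row-wise for the input, column-wise for the weight). The threshold of 6.0 and the matrix sizes are illustrative values; this is not the actual LLM.int8() implementation.

```python
import numpy as np

def mixed_precision_matmul(X, W, outlier_threshold=6.0):
    """Toy version of keeping outlier dimensions in high precision.
    Columns of X whose max |value| exceeds the threshold go through the
    full-precision path; the rest is absmax-quantized to Int8, vector-wise
    (per row of X, per column of W)."""
    outliers = np.abs(X).max(axis=0) > outlier_threshold
    # High-precision path for the few outlier dimensions.
    out_hi = X[:, outliers] @ W[outliers, :]
    # Int8 path for everything else: scale to [-127, 127], matmul, rescale.
    Xr, Wr = X[:, ~outliers], W[~outliers, :]
    sx = np.abs(Xr).max(axis=1, keepdims=True) / 127 + 1e-8  # per-row scale of X
    sw = np.abs(Wr).max(axis=0, keepdims=True) / 127 + 1e-8  # per-column scale of W
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
    return out_hi + out_lo

X = np.random.randn(16, 512).astype(np.float32)
X[:, :4] *= 20  # a few outlier feature dimensions, as seen in large transformers
W = np.random.randn(512, 256).astype(np.float32)
err = np.abs(mixed_precision_matmul(X, W) - X @ W).mean()
print(f"mean abs error: {err:.3f}")
```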

But Int8 was already supported by the RTX 30 / A100 / Ampere generation GPUs, so why is FP8 in the RTX 40 series another big upgrade? The FP8 data type is much more stable than the Int8 data type, and it is easy to use it in functions like layer norm or non-linear functions, which are difficult to do with integer data types. This will make it very straightforward to use in training and inference. I think this will make FP8 training and inference relatively common in a couple of months.
If you want to read more about the advantages of Float vs Integer data types, you can read my recent paper about k-bit inference scaling laws. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that, bit for bit, the FP4 data type preserves more information than the Int4 data type and thus improves the mean LLM zero-shot accuracy across 4 tasks.

Raw Performance Ranking of GPUs
Below we see a chart of raw relative performance across all GPUs. We see that there is a gigantic gap in 8-bit performance between H100 GPUs and older cards that are optimized for 16-bit performance.

For this data, I did not model 8-bit compute for older GPUs. I did so because 8-bit inference and training are much more effective on Ada/Hopper GPUs thanks to the 8-bit Float (FP8) data type and the Tensor Memory Accelerator (TMA), which saves the overhead of computing read/write indices and is particularly helpful for 8-bit matrix multiplication.
I did not model numbers for 8-bit training because to model that I need to know the latency of the L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance could well be 3-4x of 16-bit training performance if the caches are as fast as rumored.
But even with the new FP8 Tensor Cores, there are some additional issues which are difficult to take into account when modeling GPU performance. For example, FP8 Tensor Cores do not support transposed matrix multiplication, which means backpropagation needs either a separate transpose before multiplication, or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weights when I experimented with Int8 training in my LLM.int8() project, and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like a transposed matrix multiplication for Tensor Cores can affect performance.
For old GPUs, Int8 inference performance is close to the 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant if you have relatively large models with 175B parameters or more. If you are interested in 8-bit performance of older GPUs, you can read the Appendix D of my LLM.int8() paper where I benchmark Int8 performance.
GPU Deep Learning Performance per Dollar
Below we see the chart for the performance per US dollar for all GPUs sorted by 8-bit inference performance. How to use the chart to find a suitable GPU for you is as follows:
- Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers)
- While 8-bit inference and training are experimental, they will become standard within 6 months. You might need to do some extra, more difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.
- Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need (a small code sketch of this selection logic follows below).
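Here is a small sketch of this selection logic in code. The GPU names, prices, memory sizes, and relative performance values are hypothetical placeholders; substitute numbers from the charts in this post.

```python
# Hypothetical example data -- substitute numbers from the charts in this post.
gpus = [
    # (name, memory_GB, price_USD, relative_8bit_inference_performance)
    ("GPU A", 12, 800, 0.45),
    ("GPU B", 24, 1600, 0.70),
    ("GPU C", 24, 1100, 0.60),
    ("GPU D", 48, 4800, 1.00),
]

def best_gpu(gpus, min_memory_gb):
    """Step 1: filter by required memory; step 3: maximize performance per dollar."""
    candidates = [g for g in gpus if g[1] >= min_memory_gb]
    return max(candidates, key=lambda g: g[3] / g[2])

print(best_gpu(gpus, min_memory_gb=24))  # ('GPU C', 24, 1100, 0.6)
```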
We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended, as they do not have sufficient memory for many use cases. However, they might be the ideal cards to get started on your deep learning journey. Since the way you work matters more for doing well in Kaggle competitions than model size, and one can often rely on smaller models there, many of these smaller GPUs are excellent for Kaggle competitions.
The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with the A6000 Turing). The H100 SXM GPU is also very cost-effective, has a lot of memory, and offers very strong performance. If I were to build a small cluster for a company/academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I got a good deal on L40 GPUs, I would also pick them instead of A6000, so you can always ask for a quote on these.

GPU Recommendations
I have created a recommendation flow chart that you can see below (click here for the interactive app from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost-effective GPU that still has enough GPU memory for your use case. You can estimate the GPU memory needed by running your problem on vast.ai or Lambda Cloud for a while so you know what you need. vast.ai or Lambda Cloud might also work well if you only need a GPU very sporadically (every couple of days for a few hours) and you do not need to download and process large datasets to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours each day). You can use the example in the “When is it better to use the cloud vs a dedicated GPU desktop/server?” section below to determine if cloud GPUs are good for you.

Is it better to wait for future GPUs for an upgrade? The future of GPUs.
To understand if it makes sense to skip this generation and buy the next generation of GPUs, it makes sense to talk a bit about what improvements in the future will look like.
In the past, it was possible to shrink the size of transistors to improve the speed of a processor. This is coming to an end now. For example, while shrinking SRAM increased its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores gets smaller, this does not necessarily make GPUs faster since the main problem for matrix multiplication is to get memory to the Tensor Cores, which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve the raw speed of GPUs is to use more power and more cooling, as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.
Chiplets, such as those used in AMD CPUs, are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high-speed on-chip network. You can think of them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD's next-generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.
However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units on the Ampere generation (RTX 30 / A100 / RTX 40) or the extension, the Tensor Memory Accelerator (TMA), both reduce the overhead of copying memory from the slow global memory to fast shared memory (caches) through specialized hardware and so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices which is particularly important for 8-bit computation where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.
Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exist, but research looks promising, and I expect the first high-performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years' time.
Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.
We will probably still be able to improve the combination of algorithms + hardware until the year 2032, but after that we will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at this point. It might be that you need an RTX 9090 to run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost for an RTX 9090, and you will want to use a laptop + API for image generation and other tasks.
Overall, I think investing in an 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small, and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.
Question & Answers & Misconceptions
Do I need PCIe 4.0 or PCIe 5.0?
Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.
Do I need 8x/16x PCIe lanes?
Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.
How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?
You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.
PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!
How do I cool 4x RTX 3090 or 4x RTX 3080?
See the previous section.
Can I use multiple GPUs of different GPU types?
Yes, you can! But you cannot parallelize efficiently across GPUs of different types since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually gradient update).
What is NVLink, and is it useful?
Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.
I do not have enough money, even for the cheapest GPUs you recommend. What can I do?
Definitely buy used GPUs. You can buy a small, cheap GPU for prototyping and testing and then roll out full experiments to the cloud, like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/run inference on large models only every now and then and spend more time prototyping on smaller models.
What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?
I built a carbon calculator for calculating your carbon footprint for academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in your area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?
I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.
I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
What do I need to parallelize across two machines?
If you want to be on the safe side, you should get at least +50Gbits/s network cards to gain speedups if you want to parallelize across machines. I recommend having at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s bandwidth. Two EDR cards with cable are about $500 on eBay.
In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).
Is the sparse matrix multiplication features suitable for sparse matrices in general?
It does not seem so. Since the granularity requires 2 zero-valued elements out of every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, which involves pooling 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.
Do I need an Intel CPU to power a multi-GPU setup?
I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU build, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.
Does computer case design matter for cooling?
No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 °C better temperatures, while space between GPUs will provide you with 10-30 °C improvements. The bottom line: if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.
Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?
Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.
AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors show that some data center card with Tensor Core equivalent is planned for 2020, but no new data emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.
Let’s say AMD introduces a Tensor-Core-like hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems that if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.
However, even if you solve the software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner, while I know far fewer AMD GPU experts.
In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.
Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.
When is it better to use the cloud vs a dedicated GPU desktop/server?
Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.
Numbers in the following paragraphs are going to change, but it serves as a scenario that helps you to understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.
The exact point in time when a cloud GPU becomes more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with an RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, electricity costs an additional $0.12 per kWh. This compares to $2.14 per hour for the AWS on-demand instance.
At 15% utilization per year, the desktop uses:
(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year
So 591 kWh of electricity per year, which is an additional $71.
The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of time during the day), would be about 300 days ($2,311 vs $2,270):
$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311
So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.
You can do similar calculations for any cloud service to decide whether a cloud service or a desktop is the better choice for you.
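As a reusable sketch, the break-even calculation above looks like this in code; the default arguments mirror the example numbers in this post and will change as prices change.

```python
def break_even_days(desktop_cost, gpu_watt, cpu_watt, kwh_price,
                    cloud_price_per_hour, utilization):
    """Days after which a desktop becomes cheaper than a cloud instance,
    given a daily utilization fraction. Prices change constantly; the
    example values below mirror the numbers used in this post."""
    hours_per_day = 24 * utilization
    cloud_per_day = cloud_price_per_hour * hours_per_day
    desktop_kwh_per_day = (gpu_watt + cpu_watt) / 1000 * hours_per_day
    desktop_per_day = desktop_kwh_per_day * kwh_price
    return desktop_cost / (cloud_per_day - desktop_per_day)

days = break_even_days(desktop_cost=2200, gpu_watt=350, cpu_watt=100,
                       kwh_price=0.12, cloud_price_per_hour=2.14,
                       utilization=0.15)
print(f"Break-even after about {days:.0f} days")  # ~293 days, close to the ~300 above
```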
Common utilization rates are the following:
- PhD student personal desktop: < 15%
- PhD student slurm GPU cluster: > 35%
- Company-wide slurm research cluster: > 60%
In general, utilization rates are lower for professions where thinking about cutting edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.
Version History
- 2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.
- 2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.
- 2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
- 2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
- 2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
- 2018-11-26: Added discussion of overheating issues of RTX cards.
- 2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
- 2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
- 2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
- 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
- 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
- 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
- 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
- 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
- 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
- 2015-02-23: Updated GPU recommendations and memory calculations
- 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgments
I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.
For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.
Hannes says
Hello, now there are some very affordable used Tesla M40s with 24 GB memory on the market. I found them starting from 650 EUR (about 720 USD). Is this a good deal for some use cases?
Tim Dettmers says
A Tesla M40 is pretty slow. I would advise you to get a Titan RTX if you really need more memory, as they are still much more cost-efficient than M40s at 720 USD.
Sammy B says
Hi Tim,
After reading all of your excellent posts, I came up with the following build: https://pcpartpicker.com/list/4dj8Mc. Just to briefly explain my rationale: I’m buying a 2GPU machine on ~4k budget, but will probably upgrade to 4GPU in the next year. Hence, I’ve chosen to max out PSU power, RAM, and chassis size in anticipation. As for the CPU/mobo, it seems like AMD really gives you a lot more bang for the buck than Intel, hence TR+X399. And as per your posts, I’m going with blower style cards.
I’d greatly appreciate it if you could let me know whether I’m not missing any incompatibilities, and whether in my described situation I can still possibly shave off some of the costs. Two particular questions: will the CPU/mobo combo definitely support 4 2080s as well as 2+ NVME SSDs? Also, this will be used in a university setting, and I have the option of putting the build on a rack rather than in its own chassis; I’ve never done that before, any advice or resources I can look into for building on a rack?
Tim Dettmers says
Looks like a solid build. I would be careful about the case though. Often cases are just big enough to house 3 GPUs. Make sure it fits 4 GPUs.
Yes, it supports 4 RTX 2080 Tis and 2+ NVMe SSDs since you have a Threadripper with additional lanes. I use the same setup with 3 NVMe SSDs and it works great.
For a rack you just need the right case. You are probably looking to buy a 2U format. Ask your university which format they need the case to be and then look for a case of the right format that supports 4 GPUs.
Keshav says
Hi Tim,
Planning to buy a GPU. I predominantly work on NLP and most of my models require less than 8 GB, so I am planning on the RTX 2060 Super. But I would like to do some hobby projects on video analytics, and I would like to know if it would work if I just used the 2060S for coding/debugging and building large models, and then, just for training, moved to cloud compute like Azure or AWS with a bigger-spec machine. That way it would be quite cost-effective.
Thanks in advance
Tim Dettmers says
That sounds reasonable. However, it also sounds like you would be doing a lot of deep learning at work / as a hobby and it might be that a bigger GPU might be just better for you. If you do NLP you probably also want to use pretrained transformers. If that is the case a RTX 2080 Ti might be better. If you do not want to use transformers you might be fine with a RTX 2060S.
Paul says
I got a good deal for i9 9900k/64GB/rtx 2070super, so went with that one. Hope I won’t regret not buying 2080 ti at the cost of ram and CPU. But I guess since I’m new to deep learning it won’t matter that much at least in the beginning.
Thanks for your reply.
Tim Dettmers says
You can also always save GPU memory in different ways, for example by accumulating gradients over multiple mini-batches. So there are always workarounds — no worries!
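To illustrate the gradient accumulation trick, here is a minimal PyTorch sketch (the tiny linear model, the batch size of 16, and the accumulation factor of 4 are made-up placeholders, not code from this post):

import torch
import torch.nn as nn

model = nn.Linear(100, 10).cuda()                 # stand-in for a real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4                            # effective batch = 4 x 16 = 64
optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 100, device="cuda")       # small mini-batch that fits in memory
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = criterion(model(x), y) / accumulation_steps   # scale so gradients average correctly
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # one weight update per 4 mini-batches
        optimizer.zero_grad()

The effective batch size grows while peak memory stays at the size of one small mini-batch, at the cost of slightly slower training.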
Jared says
Hi, I’m an architecture student and I use mostly Revit, Rhino, Grasshopper3d, and Lumion along with Adobe Cloud. I am interested in incorporating AI for generative design and deep learning but I’m finding conflicting info online about what hardware is best for this workflow. Should I go with build 1:
Intel i9-9900KS, Gigabyte z390 Aorus Ultra, RTX 2080 ti, and 128 GB DDR4-2666 RAM?
Or build 2:
AMD Ryzen 9 3900x, ASRock x570 Taichi motherboard, RTX 2080 ti, and 128 GB DDR4-2666 RAM?
Or is there another set of components that would be better than either?
I really appreciate any feedback! Thanks.
Tim Dettmers says
Both systems look fine, I would go with the one that is cheaper (which is probably the AMD Ryzen one).
Mick Lalescu says
Tim,
Excellent article, very valuable info.
I have one comment and one question.
Comment: I think you should update the article now that the RTX Super series has become available. It could be that the best performance per dollar is now the RTX 2060 Super, which is roughly equivalent to the old RTX 2080 but still costs less.
Question: Would it make sense for a machine learning workstation to have two different GPUs:
– a more powerful one, an RTX 20xx, for executing machine learning workloads with no monitor attached
– a cheaper, less performant one for attaching two or three monitors, for regular UI desktop tasks (email, browsing, IDE, SSH terminals, etc.)
My concern here is that running the UI on the ML GPU could slow down the ML tasks. Is this concern valid? I am thinking about using a Radeon GPU for the UI. The Radeon choice is for two reasons: AMD is cheaper, plus it is easier to identify which GPU is used by the ML task.
Tim Dettmers says
Usually it is fine to have monitors attached to a GPU and also use it for models. I have 3 monitors on one GPU; occasionally I run into problems if I run a model in parallel and max out the GPU memory to the very last bit. However, often you just need to reduce the batch size slightly and it is fine.
Paul C says
Hey,
I’ve got ~$2000 to spend on a PC and I’d like to know your opinion on the hardware priorities for deep learning. Is it better to buy an RTX 2080 Ti, a Ryzen 7 2700X/some i5, and 16 GB RAM, or maybe an RTX 2070/2080 (Super?), an i9/Ryzen 7 3700X, and 32 GB (64?) RAM?
Tim Dettmers says
A Ryzen 7 2700X will be more than enough for 1 GPU. The 16 GB of RAM will also be more than enough for the GPU, but maybe not for certain applications. If you run just deep learning algorithms and not other CPU-based modeling algorithms, though, 16 GB should be fine. So I would go for the first build. If you need to run some CPU-based algorithms (sklearn, Kaggle competitions, etc.), then I would go for the second build.
It also depends on whether you want to run big state-of-the-art models. For those, the 8 GB of GPU RAM in the second option will not be enough. So in that case, also go for the first option.
Mofii says
Hi Tim,
Great post. It’s very helpful. I am working on large GANs (1024×1024) and the GPU memory would be very important to me. Karras recommended in pgGAN that “high-end NVIDIA Pascal or Volta GPUs with 16GB of DRAM” would be good. I am thinking about a RTX Titan. Do you think this would be enough? For me, money is a less important issue when it comes to the GPU performance and memory requirement. Thank you for any advice!
Best,
Mofii
Tim Dettmers says
If you can afford to spend more then you could get a Quadro RTX 8000 with 48 GB of RAM ($5.5k). Otherwise, the Titan RTX ($2.8k) is already pretty good with 24 GB. You can also use techniques like batch aggregation to train with a larger batch size while requiring less memory.
Mofii says
Hi Tim,
Thank you very much for the advice!
Best,
Mofii
andrea de luca says
The RTX 8000 mentioned by Tim is the best option, but I noticed that Tesla V100 SXM2 cards are dirt cheap on eBay these days.
If you can buy a used SXM2 platform (Supermicro, Gigabyte, etc.), you can purchase four of them for the price of a single Quadro 8000.
Mofii says
This sounds pretty cool. I’ll definitely check it out.
Thank you!
Best,
Mofii
Mira says
Hi Tim. Currently I am building a PC dedicated to CUDA de/compression calculations, so sadly NVIDIA is necessary (probably a 2070S or maybe a 2060S is enough). Under peak load it may need to de/compress 2 GB of data per second for a couple of minutes. At this point, is ECC memory necessary?
Eight CPU cores are fine for the background work. The most limiting requirement is the need for 2 x 10 Gb/s LAN cards. Therefore, x16/x8 PCIe lanes (GPU + LAN) on the motherboard is the best scenario.
I am thinking about the X570 platform as future-proof, but it may not be the best fit.
Then I am thinking about Threadripper, which has a lot of PCIe lanes and probably better compatibility with ECC modules (which I do not know whether I need).
What would you recommend?
Thank you a lot.
Tim Dettmers says
ECC is necessary if it is really important that your data is not corrupted. Usually, on a normal desktop, you should expect a couple of memory errors per month. If you compress/decompress a lot of data each day, then ECC memory can make good sense. If you do it for research purposes or only once a week/month, then you do not need ECC.
Even if you have a slow GPU, your bottleneck will still be the x16 lanes. The GPU can process your data at hundreds of GB/s while the PCIe bus can only transfer about 16 GB/s. However, it also depends on how intensive the de/compression is; you should be fine with most GPUs. If the de/compression is very intensive, then anything above an RTX 2070 SUPER will not be much faster. I think you should be fine with an RTX 2060S/2070S. Another factor could be memory, but I guess you would need a streaming pipeline for your data anyway, and that should work just fine with 8 GB of memory.
Ruscio says
Hi Tim,
for a Computer Vision build would you go with 1x 2080 Ti or 2x 2060s Super?
Tim Dettmers says
I would go with a RTX 2080 Ti. I think for many computer vision tasks the 8 GB of memory can be limiting even if you use 16-bit computation. If you think you will only fit smaller models, then two 2060 Supers might actually be better. You can also always fit a model somehow on a 2060 Super using tricks like aggregating the gradient over multiple small mini-batches, but this will also make training slower, so an RTX 2080 Ti might be more reliable after all.
andrea de luca says
For the price of a single 2080ti or 2x2060S, I would buy neither, opting for 2x1080ti.
Same cost, more value: you will get two cards with 11 GB each. Used in parallel, they will beat a single 2080 Ti in both fp16 and fp32, not to mention having twice the memory.
Compared to two 2060S, you will still have considerably more memory and more speed in fp32, and you will be just a bit slower in fp16.
But please appreciate that if your model doesn’t fit in memory, it just doesn’t fit, while if you have to wait a bit longer for your training, well... you just wait.
ttodd says
Actually, if we put the motherboard, SSD, and memory costs together to analyze cost-efficiency, it’s easy to find that the RTX series GPUs have almost the same performance per dollar, though the 2070S and 2080 Ti would have slightly higher performance per dollar.
And may I ask if I could use two 2070s, one with open-air fan cooling and another with blower (turbo) cooling? I don’t know if the blower would make too much noise, because I have to put it in my bedroom.
Tim Dettmers says
If you want silent operation I recommend an all-in-one (AIO) hybrid cooled GPU.
You are right, if you put all the other hardware costs on top of the GPU, more expensive GPUs are much better in cost performance. If you get 2 GPUs, the RTX 2080 Ti just looks very good.
dragon says
Excuse me, what is the difference between the RTX 2070 Max-Q and the RTX 2070 in deep learning performance?
Tim Dettmers says
MAX-Q is for laptops and is a bit less powerful than a RTX 2070.
Uday says
Hi Tim,
Can you help me decide between the GTX 1650 Max-Q and the GTX 1050 Ti? Both come at the same price fitted in an MSI laptop.
Tim Dettmers says
They are about the same. The GTX 1650 Max-Q is slightly better if you use convolutional networks and 16-bit though.
Rafael says
Hi Tim
Superb blog and dedication from you. Very inspiring. Thanks in advance
Intel Xeon e5-2697-v3 (14 Cores 28 Threads with 2.6-3.6Ghz )
Asus x99 E Ws (1x C612 Chipset Motherboard)
64GB DDR4 RAM (Samsung 2133 MHz, ECC, 2Rx4, 16 GBx4)
2 x NVIDIA Tesla K80, which could be upgraded to 4 K80s
250 GB Samsung 970 Evo Nvme M.2 SSD for OS Installation
1TB SSD Corsair MP510 for Data Storage
Fractal Design Define XL R2 titanium Cabinet
Noctua NH-D15S for CPU cooling
Fan type and configuration? Still an open question for 2 to 4 GPUs
EVGA 1600W 80+ Gold PSU (supports 2-4 GPUs)
I know passively cooled Tesla K80s or similar should be avoided in multi-GPU configurations due to thermal issues. However, we need high double-precision performance for running VASP or Quantum ESPRESSO in demanding DFT calculations. I am stuck on an optimal design of the case airflow for a 2x K80 configuration, considering that this could be extended to 4 GPUs. I have considered the following options:
(1) Using 2 powerful intake fans (high static pressure >4 mm H2O and high airflow, about 110-150 CFM each) placed at a distance of approx. 18 cm from the GPUs.
As an indication, commercial servers seem to push similar airflows (approx. 300 CFM per 4 GPUs) but at a shorter distance (5-8 cm) from the GPUs. In some cases, exhaust fans are additionally included outside the case to push the air directly out of the GPUs (97x97x33 mm fans exhausting around 40 CFM at max speed). I have doubts about this option for a 4-GPU configuration, but it might still work with 2 GPUs?
(2) Designing a 3D-printed shroud for each Tesla K80 with a coupled SUNON centrifugal fan (97x97x33 mm and 40-44 CFM at a max speed of 5400 rpm; very noisy 54 dB fans) for active air cooling. The airflow and power are similar to those of actively cooled GeForce GTX graphics cards. This extension of about 14 cm (shroud + fan) would leave a gap of around 6-8 cm between the centrifugal fan of each GPU and the intake fans of the case. In that case, should I consider fans that do not need such high static pressure but deliver a similar airflow as before (100-150 CFM each)?
What is your experience in thermal optimization of 2 or 4 GPUs?
Any recommendation?
Best regards
Tim Dettmers says
The distance of the fans to the GPUs can be a critical factor, as can the more streamlined airflow in commercial solutions. Since K80 GPUs are already quite expensive, I would try to get a cheap commercial 2-GPU solution. I am sure you can work out a solution, and you seem to have something that can work well, but the uncertainty of whether it works might be a bit unsettling. A commercial solution should just work out of the box.
Another option which might work much better is to get Titan V GPUs instead (better double-precision performance and cheaper, but no memory error correction!). These have high double-precision performance and come with a fan cooling solution. You still need good airflow to cool 4 of them, but cooling 2 is not a problem. You can also buy extenders and distribute the GPUs around the case if you have 4 of them, and with that, cooling should not be a problem.
Abhishek Rao says
Tim, if I have to make a choice between GTX 1060 and GTX 1660 Ti laptops, which would it be? Is it worth spending $150 more on the 1660 Ti?
Tim Dettmers says
The GTX 1660 Ti is slightly better. I personally would get the best possible because you will probably use the laptop for a long time and it is difficult and costly to upgrade your GPU for your laptop. I think the $150 could be worth it over the long-term.
anonymous says
Actually, that is no longer the case: last year AMD made in-depth support for deep learning algorithms available on its latest GPUs, which are considerably less costly than NVIDIA’s.
Tim Dettmers says
Do you know of any source that documents TensorFlow/PyTorch functionality on AMD GPUs?
Sandeep says
Hi Tim,
I work in research in NLP and speech, working on LSTMs and GANs with datasets of 10 million+ samples. I am usually wary of increasing the batch size given [N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” Sep. 2016], so I usually keep the batch size at 256. I am ordering new hardware in my organization and budget is not a real constraint. I am planning the following. Can you please let me know if there is any incompatibility? I plan to use TensorFlow, PyTorch, and spaCy too.
8 x NVidia RTX 2080 Ti OC edition GPU processors (for text processing)
4 x NVidia Titan RTX GPU processors (for speech and image/OCR processing)
SSD 500GB
2 x 2TB HDD 10KRPM
16 x PCI-e 3.0
Appreciate all the info you have put. Thank you
Tim Dettmers says
It sounds like what you want is a small GPU cluster (12 GPUs, right?). You can build it cheaply by buying a cheap EPYC board that supports 6+ GPUs and then buying InfiniBand cards to connect the machines. However, such a setup can have its issues, and it might be difficult to figure out what is wrong if something goes wrong. Once you go across multiple servers or you have 8 GPUs in a server, it is probably better to buy such a setup from a hardware vendor that sells the full machine. I tried it myself and it took more than a month to get all the software working well together — if you buy a cheap solution that you build yourself, this is what you can expect.
Host says
Great article. I have been following it for years until having the right budget for a good start. Is it a good idea to water-cool the 2 RTX 2070 Supers I’m about to get (I have one already)? If so, how much better will it get? Thanks
Tim Dettmers says
If there is an empty slot between the two RTX 2070s you do not need water cooling. Otherwise, if the slots are next to each other, water cooling might yield 0-15% better performance if you use both GPUs at the same time.
Eric Bohn says
What about blowers vs non-blowers for the 2x RTX 2070 Supers on a normal SLI board like the X470, X570, or Z390?
Andy says
Thanks for the great research here! I have a question. I have $3800 to spend on a laptop/mobile workstation. I’m only going to use it for deep learning and have to spend it all (otherwise the bureaucracy does not think I’m doing my job). Can you give me some advice on which laptop I should buy with the best GPU performance? It’s been exhausting looking around and I could use some advice. Thanks!
Tim Dettmers says
I would go with either RTX 2080 or with a Quadro 5000/6000. There are also the Max-Q variants, but they will have less performance it seems. Also balance the other aspects of a laptop in terms of practicality and how you use it in general.
Andy says
Thanks for the quick reply!
Mircea Giurgiu says
I am looking for a similar laptop. Could you give a link to a possible provider (model)?
chanhyuk jung says
Do you think getting a graphics card in a laptop only for debugging is a good idea? And is the price justified?
Kenneth N Fricklas says
Hey Tim,
I picked up a machine with an RX 5700 XT/8GB and an AMD Ryzen 3900X (12 cores). Is there a set of benchmarks you’d like me to run on it?
Tim Dettmers says
Thanks for the offer! I think the most useful thing would be to run all different kinds of models and see if something goes wrong. If the AMD card supports all the models flawlessly, then it might be worthwhile to do more careful benchmarking and, in turn, add these benchmarks to the blog post.
Faruk says
Hi Tim, I am considering buying an RTX 2070 for my thesis work, which is an NLP task with RNN/GPT models. But there are several models from different makers, and all of the comparisons are for gaming. Sure, I will use it for gaming as well, but I want to get the best one for deep learning. Do you have any info or suggestions? Thanks in advance.
Tim Dettmers says
All the RTX 2070 should have similar performance for deep learning. I would just make sure that you get one that has good cooling.
Faruk says
Ok, I think I’ve found one but got confused for another reason. Is it better to buy just a 2080 Ti or 2x 2070 Super? Which one would you pick?
Tim Dettmers says
2080 Ti if you need the memory. Otherwise 2x 2070 Super.
Faruk says
Well, that is the “if” I do not have an answer for. I will use it for LSTMs and deep neural networks. You suggested prioritizing memory bandwidth above all for LSTM networks, just ahead of 16-bit capability. As far as I understand, I can get an effective 32 GB of VRAM using 16-bit on 2x 2070S and 22 GB of VRAM with a 2080 Ti. Is that true?
CNNs, on the other hand, I may also train at some point, but I think the 2070 will handle those moderately well.
Wagner says
For the same price, which is better: a used 1080 Ti or a new 2070?
I do mostly computer vision.
Sorry if this was already asked, but I didn’t find it.
Tim Dettmers says
If you are comfortable with just using 16-bit models the RTX 2070 might be a bit better. The problem is really the 8 GB memory on the RTX 2070 which might be limiting. But otherwise, the RTX 2070 is clearly better than the GTX 1080 Ti.
andrea de luca says
Hi Tim. Right now, I got two 1080TIs (11Gb each). Before their street price begins to fall down, I’d like to sell them and acquire something newer and more future-proof.
With a budget of ~1500 EUR (1000 of which will come from my 1080 Tis), I can devise two viable options:
1. A single quadro RTX 5000. It has 16gb of vram and full-fledged fp32 accumulation when running in fp16 (which desktop RTXs do not possess).
2. Three RTX 2070. While operating in parallel, they get 24Gb of vram, which is a lot.
My main concern regards the actual vram constraints. In other words, is there anything, in both Pytorch and TF, which still needs to be done on a single GPU and cannot be parallelized across multiple GPUs?
Tim Dettmers says
I think the RTX Titan also has full fp32 accumulation (although the difference in performance is very minor with half-baked fp32 accumulation). The three RTX 2070s cannot be combined so easily to extend the memory. For parallelism, you are usually stuck with just the memory of the smallest GPU, that is, 8 GB in the case of the RTX 2070.
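To make the memory point concrete, here is a minimal sketch (with a made-up toy model) of why data parallelism does not add memory: PyTorch's nn.DataParallel replicates the full set of weights on every GPU and only splits the batch, so each card still has to hold the whole model plus its share of the activations.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model).cuda()   # a full copy of the weights lands on every visible GPU

x = torch.randn(256, 1024, device="cuda")
out = model(x)                          # the batch of 256 is split across GPUs; the weights are not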
andrea de luca says
Thanks for your quick reply, Tim.
Could you elaborate a bit regarding the difficulties in combining GPUs to extend the memory? Currently, I mainly use PyTorch for vision tasks. It is sufficient to use the DataParallel API to automatically get 22 GB available (on my two 1080 Tis). That is, for example, I can exactly double the size of my minibatches, or the resolution (number of pixels) of the images.
Could you also cite some examples in which that would not be possible? NLP is my next area of interest, so any example in such domain would be really useful.
Thanks!
Note: If I cannot combine the mem of three 2070 in other typical use cases, I think it would be better to stick with my good old couple of TIs.
mestrace says
Thanks for the blog.
I have a desktop with one GTX 980. I was considering upgrading the GPU for academic purposes (i.e., course projects, exploration). I was considering two options: buying an additional GTX 980 for SLI, or upgrading to an RTX 2060 (that’s what I can afford). Which way should I go? Can you give any suggestions?
Tim Dettmers says
If you can afford it, an RTX 2060 Super would be best. I would also choose a normal RTX 2060 over a GTX 980 in SLI.
Mridul Pandey says
Hey Tim,
I can get an RTX 2070 SUPER that supports NVLink at a good price. What do you think of two NVLinked RTX 2070 Supers; how would that work? Does NVLink double the VRAM? If so, then the problem of the small RAM on the 2070 is solved.
scoodood says
Dear Tim,
I really learned a lot from you, thanks for the great article. I took some intro courses on ML/DL/RL a while ago. Now I am getting a bit serious and would like to dive deeper into this area (mostly for fun). I am thinking about starting with one of these GPUs and adding more in the future. Which GPU would you recommend for someone like me?
2080 super (new, ~$760)
2070 super (new, ~$520)
1080 TI (used, ~$520)
Best Regard
Tim Dettmers says
The RTX 2070 Super is great. I would recommend that over a GTX 1080 Ti. If you are doing it for fun, I would not spend too much money. The RTX 2070 Super is already pretty good for any application.
Nader says
What are your thoughts on the new RTX 5000 GPUs that are in the ThinkPad P53 laptop?
Tim Dettmers says
If it is like a Quadro RTX 5000 it is pretty good! Must be expensive though. Not the most cost-efficient card.
Ziv Freund says
Hi Tim,
Is there any change to your recommendation now that the Super versions are out?
Now the RTX 2060 Super has 8 GB of memory like the RTX 2070 Super. Should I still pay the extra money?
Also, when speaking with NVIDIA guys, they claim that any GTX-based GPU is not suited for 24/7 training, and for that I must buy workstation GPUs (like the T4, V100, etc.). What is your comment on that?
Thanks !
Ziv
Tim Dettmers says
Super cards are great and should be considered if you can afford them. I will write an update sometime later, maybe in a week or two.
Vincent says
Hi Tim, what about the new RTX 2070 Super or RTX 2060 Super? Are they superior to the former RTX 2070 and RTX 2060?
Tim Dettmers says
Yes they are in most circumstances.
Devidas says
This is a really very good blog for beginners.
Ashish Duhan says
Hey Tim,
I can get a used GTX 1080 Ti and an RTX 2060 SUPER 8 GB (almost as good as an RTX 2070) for the same price. Which one should I choose? Does an RTX card almost always allow a double-size model with mixed precision? If so, I would have effectively almost 16 GB with the RTX 2060 SUPER (5 GB more than the 1080 Ti). In this case, do you see any advantage in choosing the GTX 1080 Ti over the RTX 2060 SUPER?
Please suggest.
Tim Dettmers says
The RTX 2060 SUPER is better.
User02 says
Hey Tim, I have just started out in the field of machine learning and seriously want to do this. I want to participate in Kaggle contests and build projects of my own. Currently, I own a 2017 MacBook Pro with 8 GB of RAM and was looking to buy an eGPU. The only thing is I am not completely sure what I should buy. I really liked the idea of installing Windows on my Mac and using an RTX 2070, and just wanted to get your views on it and what you would suggest I do.
Tim Dettmers says
If you want to do machine learning, I would try to stick with either the Mac or Linux. Windows is now also supported by most frameworks but it can be cumbersome and unreliable at times.
Warren says
Tim,
I am now learning AI/ML.
Really appreciate your blog as I am looking to buy a PC.
Can you please tell me:
1. You mentioned the limitation of 8 GB on, say, the RTX 2080 a couple of times. If I install 2 x 2080 (non-Ti), do I have 16 GB to use? Does it work that way?
2. Will the knowledge I gain from studying AI/ML (Udacity) using a personal GPU be easily transferable to Google TPUs down the road? Or is it going to be another battle when I need to use Google TPUs in the future?
Thanks
Warren
Tim Dettmers says
1. Usually your smallest GPU memory is what you get if you parallelize across GPUs. So 8 GB.
2. Yes, it should be transferable. The only difference is how you launch your programs (and some TPU related changes to batch size and layer sizes).
Jay says
Hi Tim,
You explain that for CNN, priority is Tensor Cores > FLOPs > Memory Bandwidth > 16-bit capability.
So purely in terms of performance (without taking cost-efficiency into consideration), for CNN would you recommend an RTX 2080 Max Q over an RTX 2070?
The 2080 Max Q has 28% more Tensor Cores, although it also has 14% less FLOPs and 14% less Memory Bandwidth than the 2070…
RTX 2080 Max Q:
– 368 Tensor Cores
– 12.89 TFLOPs (FP16)
– 384 GB/s Memory Bandwidth
RTX 2070:
– 288 Tensor Cores
– 14.93 TFLOPs (FP16)
– 448 GB/s Memory Bandwidth
Thanks for this super blog by the way, always looking forward to your updates and comment replies!
Tim Dettmers says
It is not so straightforward actually. I would assume that the RTX 2070 is slightly faster here, but probably both cards are very similar. However, since I have no benchmark data I cannot say this for sure. I think all the factors (less tensor cores, but high memory bandwidth and flops) cancel each other out and both GPUs have similar performance.
Adam Kantorik says
Hey Tim.
I had to sell my desktop because I travel too much. So now I am going to upgrade my laptop, and I am deciding between two graphics cards: a GTX 1660 Ti 6 GB vs an RTX 2060. I know that the RTX has the new Tensor Cores, but is it worth it on a laptop? Also, I do deep learning just 15% of my coding time or so (but I code every day, and when I do deep learning I go all the way – LSTMs, convolutional nets, RL, ...).
If it’s only going to be 10% faster, I don’t think I need to pay extra for it, but I don’t want to regret the choice later. What do you think?
Tim Dettmers says
If you find a laptop with a cheap GTX 1660 Ti, it might be right for you. It seems what is important for you is that you can run these models at a reasonable pace, but it does not need to be super fast. Also, for big models there is always the cloud. I guess the GTX 1660 Ti might just be right for you!
Michel Rathé says
Hi again Tim,
What is the current state of multi-GPU workstations in regard to throttling and heat?
Since the ASUS RTX Matrix, which has its own AIO loop for cooling, I haven’t seen any relevant build on the net. So, for 2 to 4 GPUs, what design/brand is optimal to keep the performance, reliability, and longevity of the cards? If a custom AIO loop (or hybrid) is the best choice, please give high-level specs. Are blowers still relevant? What are the true and thoughtful practitioners using for a workstation build (not a GPU server)?
The motherboard will be workstation class, e.g., an ASUS Dominus Extreme or Sage.
Thanks and keep on the good work,
Michel
Tim Dettmers says
I did not see any deep learning data for AIO-loop GPUs. For now, blowers do better if you put the cards side by side in a 4-GPU case, but what I found to be best is to use normal, non-blower fans and PCIe extenders to distribute the GPUs in the case. This keeps them much cooler. I would assume, though, that AIO loops provide even better performance (if you have the space for all those radiators). Otherwise, there are also custom-loop solutions, which definitely give you the highest performance, but also the highest effort/price point to set up a system.
island fox says
Hi, do you think an RTX 2080 (not Ti) is good enough for doing Kaggle? How about a 2080 Super?
Tim Dettmers says
Yes, both cards are good enough for most cases. The only problem is the smaller 8 GB of RAM when you face “deep learning competitions”, that is, computer vision competitions where convolutional networks are really important. But even then, you should get pretty far with 8 GB of RAM.
Island Fox says
Thanks for the response👍
Maurice Langner says
Hi Tim,
I already wrote you a message recently, but I had to delete the respective email address.
What I would like to know is what GPU setup you would recommend for NLP scientists working on speech recognition. I have the choice between 1-2 RTX 2080 Tis and a Titan RTX with a student discount, which makes it equal in cost to buying two 2080 Tis.
I am still not sure whether it is necessary to have 24 GB, when connecting two 2080 Tis would mean summing the memory to 22 GB and the CUDA core count to about 8000.
I would be really grateful if you could help me.
best,
Maurice
Tim Dettmers says
I would go for the two RTX 2080 Ti. The memory will still be limited to 11 GB for standard APIs in popular libraries (tensorflow, pytorch), but with some effort you could utilize the full 22 GB with a single model. However, probably you will be fine with 11 GB 95% of the time.
YOYO says
You say the best GPU for deep learning is the RTX 2070. How about the RTX 2060 SUPER, which has 8 GB and was released in July 2019? Could it be the best and most cost-efficient GPU? I am going to buy a used 2070 or a brand new 2060 Super.
Tim Dettmers says
The RTX 2060 Super might actually be better in terms of cost efficiency, but I do not have any hard data about this. In the worst case it is just a bit worse than an RTX 2070 — there is not much to lose!
Alon says
Hi,
I’ve been wanting to purchase an RTX 2070, but noticed a lot of complaints (on internet forums/review threads) about RTX GPUs failing. Do you think it’s a real issue? Have you encountered problems with RTX GPUs?
Thanks, Alon
Tim Dettmers says
It seems that only the first batch of RTX cards was affected. At the University of Washington I know of 2 failures out of 20+ GPUs, which is high but not as high as all the news would suggest. These cards also came mainly from the first batch of RTX cards.
Shadiq says
Hello, I absolutely respect the work you have done for this post. I have a question: since the RTX Super lineup is now available and replacing the RTX 20x0 cards, would you reckon the overall GPU recommendation list stays in the same order? So, for example, the best overall GPU is now the RTX 2070 Super, and so on. Or are there any further considerations before I purchase an RTX 2070 Super myself?
Tim Dettmers says
The RTX 2060 Super looks great! The RTX 2080/2070 Super should be more cost-effective than the RTX 2080/2070 but with fluctuating prices this is difficult to say for sure.
Keshav says
Hey, did you get the Super GPU? If so, how is it? Why is the RTX 2070 Super cheaper by $100 than the normal RTX 2070?
Harsh Mittal says
Hey Tim, thanks a lot for such useful information. I want to start Kaggle and I am serious about it.
As you recommend, the RTX 2070 would have been best, but since it’s not available right now, should I go with the RTX 2060 Super or the RTX 2070 Super?
Tim Dettmers says
If you have the money, the RTX 2070 Super is a good portion better than the RTX 2070 for conv nets/transformers. If you’d rather train RNNs and small CNNs, get an RTX 2060 Super.
Alex says
Hi Tim,
Thank you for the great post!
I am thinking about a PC for doing ML projects and taking part in Kaggle competitions. I usually work with image data, so I am thinking about buying one RTX 2070 Super. Would you recommend this GPU?
Tim Dettmers says
This works well for most competitions. For some computer vision competitions an RTX 2070 Super will not have enough RAM to get the best results, but you will still get good results.
Jürgen B says
Hi Tim,
thank you for this very profound analysis! I have just some questions. According to what I read on techpowerup.com the GTX 1060 works well for float32, but doesn’t do so well for float16. Does that match your experience? Would you still recommend the GTX 1060 if float16 is used, or would you then recommend something different?
I am a beginner in deep learning and would just like to try out some things, mainly related to convolutional nets, and later would like to check out an idea I have about training on short image sequences taken from a video (black and white pictures of size maybe 50x50 and up to 10 images combined in one input vector). Is it realistic to train something like this on such a card?
One other question: you recommend the GTX 1060 for starters. Do you prefer this card over the GTX 1660 because the latter is around 30% more expensive, or is it because the 1660 has no tensor cores (I just read this here)? I guess the 1060 has some, right?
Thank very much in advance
Jürgen
Here the articles I mentioned:
https://www.techpowerup.com/gpu-specs/geforce-gtx-1660-ti.c3364
https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb-gddr5x.c3328
Tim Dettmers says
The GTX 1060 can be run with float16 to save a little bit of memory, but it will not make training faster. 50x50x10 inputs are very doable with a 6 GB GTX 1060. If you get a GTX 1060 with less memory, it might run into some difficulties, but you might be able to get around that if you decrease the batch size to save some memory. For starters, I would instead recommend the RTX 2060 or the RTX 2060 Super (if you can afford it). A cheap GTX 1060 with 6 GB from eBay is great as well.
Jürgen says
Hi Tim, thank you for your reply. In the meantime I caught an RTX 2070 on eBay. But I couldn’t test it up to now, because Ubuntu 19 unfortunately installs CUDA 10.1, which is supported by neither PyTorch nor TensorFlow, so up to now the RTX is only an additional source of light due to the LED bar on its edge 🙁
Tim Dettmers says
You can try to download CUDA 10.0 and install it manually. The run files are usually easy to install.
Steve Ni says
Hi, Tim! Thanks for your great article!
I am now wondering whether to buy two GTX 1080s or a single RTX 2080 for the same money. (In my country the price of an RTX 2080 is twice that of a GTX 1080.) Which should I choose? (I generally do CNNs using TensorFlow at present, but maybe more models in the future.)
Tim Dettmers says
This is a tough choice. I think it is about the same. If you are using CNNs a lot, then the RTX 2080 will be great if you train large ones (16-bit compute is great for large CNNs). If your CNNs are a bit smaller, the two GTX 1080s are definitely better. I personally would probably go with the two GTX 1080s.
William says
Looking forward to seeing 2060/2070/2080 Super GPU and their performances, thanks!
Tim Dettmers says
I might work on this in late August / early September. Thanks for your patience!
Michael Mao says
How important is GPU clock speed for deep learning? I’m looking to spend some money on an RTX 2080 Ti as a second card alongside an RTX 2070 in my system. Since this is a much more substantial investment, I want to get the best bang for the buck in terms of deep learning performance among the RTX 2080 Tis on the market. I’m looking at the EVGA Black Edition and the Aorus 11GC, which are 150 dollars apart but have a boost clock difference of 150 MHz. It seems that all other aspects are pretty similar apart from the clock speed differences. Is it worth getting the pricier model? How much does clock speed affect deep learning performance?
Tim Dettmers says
The clock speed does not matter; the cooling solution matters much more. Cooling will make a difference of +/-30% while clock speed is about +/-5%. In terms of cooling, I heard good things about the RTX 2080 Ti Sea Hawk, and the Strix also seems to be decent (not sure about a Strix multi-GPU setup though).
Scarecrow says
Hi Tim!
I’m looking to buy a new laptop for deep learning. I’m not super into deep learning yet, but in the near future I would get into Kaggle competitions. I was looking at laptops from HP and Lenovo. I find almost all of the laptops within my budget have the following GPUs: MX150, GTX 1050, 1050 Ti, GTX 1650, and GTX 1660 Ti. Which one of these is better in 2019?
My main laptop specs would be an i7 (7th gen+) and 16 GB RAM.
Tim Dettmers says
I would only recommend the GTX 1660 Ti version. But note that this GPU is also not very good. You can do some prototyping but not run regular deep neural networks that are a bit larger.
Cindeep says
If I have an option to buy either GTX 1060 or GTX 1660 Ti, which one do you recommend? (considering that the latter has 1536 CUDA cores compared to 1280 from the former)
Tim Dettmers says
I would recommend a GTX 1660 Ti.
Mircea Giurgiu says
If I compare the price and performance of 4x RTX 2080 versus 4x RTX 5000, is there a reason to adopt the RTX 5000, given that the number of cores is lower?
Attila says
Your guide helped me a lot already, thank you. I’m about to become an NLP researcher, and I’d like to buy a PC for it. Until now I was thinking about going with an RTX 2070, since it just fits my budget. However RTX Super cards are arriving soon. My question is: should I go with the “old” 2070 or is the 2060 Super a better value for my purposes?
Tim Dettmers says
The RTX 2060 Super and RTX 2070 are about the same in performance. Buy the one that is cheaper.
MibaNafi says
RTX 2060 SUPER vs RTX 2070?
Tim Dettmers says
Both have about the same performance; buy the one that is cheaper.
MibaNafi says
Hi,
Which is better: 2x 1060 6 GB or 1 RTX 2070?
Nearly the same price...
Tim Dettmers says
I would go for the RTX 2070.
Malli says
Hi Tim
Great blog and I found it very useful to kickstart my deep learning journey. This is the configuration that I am going with-
Intel Core i7 9th Gen 9700K (8Cores 8Threads with 4.9Ghz )
MSI Z390 Chipset Motherboard with WIFI, Bluetooth and
64GB Ram Support
32GB G.Skill Ripjaws V 2400Mhz DDR4
Nvidia GeForce RTX 2070 8GB
1TB Samsung 970 Evo Nvme M.2 SSD for OS Installation
2TB Western Digital HDD for Data Storage
Cooler Master H500 Irony Grey Cabinet
Cooler Master MasterLiquid 240 Cooler
1200W 80+ Platinum power supply (supports 2 GPUs)
Any comments from your side would be great. Planning to get a second RTX 2070 in 6 months.
Tim Dettmers says
The PSU wattage is a bit high right now. I guess if you add more GPUs along the way it fits very well though!
Ken Fricklas says
Hey Tim – any opinions (Early on) about the new announcements from AMD?
Seems like the $750 Ryzen 9 3950X with PCIe 4.0 and a couple of $449 RX5700XTs could be a game changer in the sub-$2000 workstation market (to say nothing of gaming).
Tim Dettmers says
PCIe 4.0 is not needed at the moment and will not yield the greatest benefit unless you parallelize across 8 GPUs. For PCIe 3.0 CPUs, the EPYC CPUs are currently by far the best in terms of cost/performance. I am unsure about the RX 5700 XT. It is difficult to get reliable information about deep learning performance for AMD cards, and I am not sure I can make a recommendation without seeing multiple benchmarks on relevant deep learning models/tasks.
Mircea Giurgiu says
I have noticed at some hardware providers that:
a) a workstation with 4 x RTX 2080 Ti,
and
b) a workstation with 2 x Quadro RTX 8000 NVLinked (the other supporting hardware: memory, HDD, CPU, etc. is the same for (a) and (b)),
have, more or less, THE SAME PRICE.
Q1) Given that (a) has more CUDA cores/Tensor Cores in total, what would be the reason to purchase (b)?
Q2) What could be the reasons to purchase (a)?
Thank you in advance,
Mircea Giurgiu
Tim Dettmers says
If you have very large models (such as transformers) which need the full 48 GB of GPU memory the Quadro RTX 8000 will be a good choice. However, for normal transformers (BERT-style) or normal computer vision models, the RTX 2080 Ti is sufficient and preferred since they will yield better performance. I would recommend the 4x RTX 2080 Ti option in 9/10 cases — I think this is what you really want to get as well!
Mircea Giurgiu says
Thank you so much for the clear and prompt response. Congratulations for this blog !!
Yashovardhan Chaturvedi says
Hi,
What about image segmentation models? Will an RTX 8000 be better than 4 RTX 2080 Tis?
Tim Dettmers says
If your model fits into GPU memory (11 GB) then 4x RTX 2080 Ti will be much faster.
yash says
That’s the issue I am facing: my model is not fitting into the GPU. I can only fit about 8 images on each RTX 2080 Ti, and because of that my training time is currently 2 weeks, as the dataset is quite large. So I was wondering, in order to increase the batch size, should I go for the RTX 8000, which has 48 GB of GPU memory, so that I can get through my dataset faster?
Tim Dettmers says
An RTX 8000 is quite pricey, but if you encounter this problem often or expect to encounter it in the future, an RTX 8000 is a good choice. An RTX Titan might be a middle ground with 24 GB of memory, but it would be a shame if you need more than that and are stuck with 24 GB. So if you have the money, the RTX 8000 might be the right way to go.
Another way would be to buy multiple RTX 2080 Tis and split the model with model parallelism (grouped convolutions) across the two GPUs. That is more work programming-wise, but it could also work.
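As a rough sketch of what that model parallelism looks like in PyTorch (a toy two-layer example with made-up sizes, splitting by layers rather than by grouped convolutions, and assuming two GPUs are visible):

import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the network lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # only the activations move between GPUs

net = SplitNet()
out = net(torch.randn(8, 3, 224, 224))

Each GPU now holds only its share of the weights and activations, which is what frees up memory compared to data parallelism; the price is the extra transfers between the cards.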
yash says
Hey Tim,
Thank you for your reply. As this is in a company setting and not for personal use, money is not the issue. I was going through the specs of the V100 32 GB vs the Quadro RTX 8000; the 32-bit FLOPS of the V100 are quite high. So with a focus on data parallelism rather than model parallelism, to get through the whole dataset faster, what would you suggest: 8x V100 32 GB or 8x Quadro RTX 8000 with 48 GB each? Price is not an issue.
Joseph Quinn says
I can buy a 2070 and a 1080 Ti at almost the same price. If the trend stays the same after the NVIDIA Super cards launch, should I go with the GTX card for more RAM, or just go with the RTX?
Tim Dettmers says
This is a bit of a tough choice. I think if you just want ease of use, a GTX 1080 Ti might be better. If you are fine with doing a bit of extra work for 16-bit models and to fully utilize Tensor Cores, an RTX 2070 will be slightly better (but less convenient).
Joseph Quinn says
Thanks for your reply. Now that the RTX Super cards have arrived, I can buy a used 2070 for ~$40 less than a 2060 Super or a 1080 Ti. I don’t mind getting used to working with 16-bit models. Which of the three would be the best bet?
Is it worth spending $200-300 more for a 2070 Super or 2080?
Peter Voksa says
I would like to know the answer to this as well
Tim Dettmers says
The RTX 2070, RTX 2070 Super, RTX 2080, and RTX 2060 Super are all competitively priced. You will not get a much better deal for any particular card. However, for an RTX 2070 at $40 less than an RTX 2060 Super, I would go for that offer.
Mridul Pandey says
Hi Tim,
I am planning to build a computer and find myself in the “I started deep learning and I am serious about it” category. The RTX 2070 looks like the better option for me. I am going for a Ryzen 3600X CPU with it. I am planning to add more RTX 2070 cards in the future. Should I spend extra money on an SLI-supported motherboard, or can a normal motherboard with multiple GPU slots give similar performance? Please advise.
Tim Dettmers says
As long as you can fit multiple GPUs you should be fine. Make sure, however, that your motherboard not only has the slots but also supports the multiple GPUs. Also make sure your CPU supports the RTX 2070s that you like. Many Ryzen CPUs support 2 GPUs, but not more.
James says
Kaggle now provides a P100 with a 9-hour runtime, and I can run 4 kernels at a time. Is there any advantage for me in buying an RTX 2060?
Tim Dettmers says
It is really about whether the P100 limits your work or not. Sometimes when I prototype I need a single GPU to be productive. Sometimes I need to do a parameter search or run experiments for comparison, and then even 15 GPUs will not be enough. So think about different situations and whether 9 hours is enough for them or not.
Mantas says
Hello,
If you had to choose a GPU for a laptop, would you go with the RTX 2060 6 GB or the RTX 2070 8 GB Max-Q?
Tim Dettmers says
I personally would go with an RTX 2070 due to the extra memory, but it really depends what your applications are. If you want to do research then 8 GB is really the minimum.
Deep Khamaru says
I am a BTech student interested in learning machine learning. My choice of machine for the aforementioned task is the MSI GS65, with an 8th-gen i7 processor and a 6 GB 1060 GPU. Now please bear with me. I know a laptop is not what is intended for ML, but I desperately need portability of some form or other. So is this the laptop I should go for? Or something with a 2060? I don’t have the money to go higher. Thank you for this article on this particular topic, which is kind of rare.
Tim Dettmers says
You cannot do much on a GTX 1060 these days — this should only be an option for very basic deep learning workloads. I would recommend getting a cheap desktop + cheap laptop and then to work remotely on your desktop from your laptop. For $2000 you can buy a desktop with an RTX 2070 and some cheap components and an additional good netbook.
Rytiss says
Hi,
I am looking forward to building a cheap deep learning rig. I was really interested in the eGPU (external graphics card) option because I have a laptop with a pretty powerful processor (i5-8300H). Due to problems with eGPUs, I am now focusing on building a cheap rig mainly for deep learning. I am looking to build a mini-ITX system with an RTX 2060 or RTX 2070 (of course the RTX 2070 is the way to go 🙂 ). However, the CPU price itself increases the system price considerably (buying a better CPU also requires cooling fans, which also increases the price). I am thinking about a CPU as cheap as possible that can still handle the work related to deep learning on the RTX 2060/2070 nicely. I know about bottlenecks, etc., but would going with an Intel Pentium G5400 in a system meant for deep learning be a bad decision? I am training models with float32 and float16.
Tim Dettmers says
I had a similar machine. For an RTX 2070, the Intel Pentium G5400 should be a pretty good CPU. An Intel Pentium G5400 would work well for many applications but can be a bit slow if you want to train on datasets with large input-size/second, for example, ImageNet. But the performance loss on ImageNet should still be small (probably <20%).
Mary says
HI,
What are your thoughts on Arm-based CPUs & GPUs for ML? Especially since ML is moving towards the edge. Have you done any performance comparisons with the Cortex-M family?
Tim Dettmers says
It looks like these GPUs are mostly inference GPUs for mobile devices and would perform poorly for training. Or did I miss something here? If this is accurate, then I would believe that ordinary GPUs/TPUs/Graphcores would be used for training and you would then optimize trained networks for these mobile GPUs.
Nicolas CS says
Hello There Tim, thanks for this blog, it has been quite helpful for me.
I move a lot, and I am planning to buy a laptop for DL because of that. The difference in price between a laptop with a 1660 Ti Max-Q and one with a 2070 Max-Q is above 900 USD. As we are talking about the Max-Q versions, is it worth it to pay those 900 extra dollars?
Thank you again for saving us a lot of time in this kind of cost efficiency research.
Tim Dettmers says
900 USD is quite a big difference for those laptop GPUs! If you want to do deep learning, the 900 USD investment could make sense since the GTX 1660 Ti is not that good for deep learning. On the other hand, you can also get the 1660 Ti and just rely on cloud GPUs in case you need them. This could be a very competitive alternative, and cloud GPUs would also be much faster than a 2070 Max-Q.
Van Long says
I wonder why the RTX 2070 is less effective than the 1070/1070 Ti on word RNNs? That sounds weird...
Tim Dettmers says
It is likely due to the larger cache of the GTX 10 series.
Thanhlong says
Hi Tim, thanks for your great article about choosing a GPU for deep learning. I want to ask whether I should buy an RTX 2070 or a 1080 Ti. I think that in the future the RTX will take the lead because it will be better supported, despite the GTX 1080 Ti having 11 GB of VRAM versus the 8 GB on the RTX 2070. Can I have your opinion on it? Thank you so much.
Tim Dettmers says
The extra memory on the GTX 1080 Ti can be a blessing, but if you figure out how to use 16-bit computation on the RTX 2070 reliably you will get much more out of an RTX 2070 and it will last longer. It will be a bit more painful to adapt things to 16-bit, but with a bit of work it should work out just fine.
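For reference, a minimal sketch of what 16-bit (mixed-precision) training looks like with PyTorch's built-in AMP utilities; the tiny model and random data below are placeholders, not code from this post:

import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()              # rescales the loss to avoid fp16 underflow

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # runs eligible ops in fp16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

The autocast context handles the fp16/fp32 casting, and the gradient scaler keeps small gradients from vanishing in fp16, which is most of the "extra work" 16-bit training requires.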
Osama says
Hey Tim, should I go for a GTX 1660 Ti 6 GB or a GTX 1070 8 GB?
Tim Dettmers says
Tough choice. I would probably go with the GTX 1070 because of the additional memory. But either card is a good choice.
Alexander W. MacFarlane IV says
Hi Tim!
I am thinking about this from the supply side, and am curious why the option I have been considering was not mentioned.
My play is to take an oldish Xeon server that I have in my basement, install a couple of RTX2080Ti cards, and rent it out to people like you on Vast.ai.
https://vast.ai/
They seem to be capitalizing on the no RTX in data center rule by creating a service where buyers meet sellers, and my idea is to become one of the sellers.
Have you not heard of this or is there some other reason why you did not mention that as an option?
Alex
Tim Dettmers says
The problems here are reliability, customizability, and ease of use. AWS/Azure/Google Cloud is not much more expensive but just offers better service. It is difficult to beat that right now. Otherwise, in the long term, people are still better off investing in a GPU themselves.
Josh says
Thanks for the detailed breakdown! I have some questions regarding the performance comparison of the various cards for the various tasks (RNN, CNN, Transformer).
You state that memory bandwidth is crucial for RNN performance; however, in Figure 2 the 1080 Ti with 448 GB/s bandwidth has a far higher performance rating on the word RNN metric than the RTX 2080 Ti with 616 GB/s bandwidth. I guess my question is, what specs do you take into account when extrapolating from the RTX 2080 Ti to, e.g., the GTX 1080 Ti?
Also, I don’t understand the combination of the char RNN + Transformer metrics; from the preceding text it seems like those depend on different specs (memory bandwidth for RNNs, bandwidth + core speed for Transformers). Why do you combine those two?
Best,
Josh
Tim Dettmers says
Unfortunately, I do not have time to make a detailed analysis of this, but here is my best guess: cache size is an important factor for performance, and the GTX 1080 Ti has a larger cache than the RTX 2080 Ti. This is likely why the GTX 1080 Ti is faster for small matrix multiplications. Additionally, this could be an issue where some code paths are just better optimized for the GTX 1080 Ti, and with a different hidden size the RTX 2080 Ti would have caught up. If this is true, the performance might change with future releases of CUDA.
Alan Chen says
Big surprise. How could the 1660 Ti be worse than the 1060? The 1660 Ti beats it in all aspects and wins all gaming benchmarks.
Tim Dettmers says
They do not have Tensor Cores, which are not important for gaming but are important for deep learning.
Karan Sharma says
Hi Tim,
I am planning to begin my deep learning work. I am very serious about it and want to do more projects after getting done with an initial one.
I am stuck on the choice of which GPU I should opt for; Tensor Cores and cloud TPUs have put me in a bit of confusion.
Please help me pick an option. I will currently be working on CNNs.
Tim Dettmers says
For long-term work with CNNs I would recommend an RTX 2070 or RTX 2080 Ti if you can afford it.
Tim says
Google Colab recently upgraded its GPUs to the NVIDIA Turing T4. I posted a notebook that shows the upgrade below.
https://colab.research.google.com/drive/1eIAvJEHNnx-bHGGyf1bIU4h_07pW4W4t#scrollTo=aqSYfrO5oQFm
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf
Has anyone done benchmarks on this upgrade yet? Thanks.
Tim Dettmers says
The T4 is roughly in-between RTX 2060 and RTX 2070 in performance. So not the best GPU.
Laurene says
Can you please help me calculate a fair price if someone rented the following computing power from our PowerEdge server: use of 1 high-performance server node with 24 cores for 80 hours per month. If you could, please suggest the corresponding rates (per minute, per hour, and per week).
It would be most helpful if you could also help me estimate the same costs on a newer server with newer specs, like NVIDIA GPUs.
Thanks
Tim Dettmers says
I would look at cloud prices (AWS/Azure/Google) of similar hardware and compare the price this way. If you rent for a month you should pay much less (maybe half) to make it competitive with cloud services. Otherwise, it is not worth it.
Hannes van der Merwe says
HI Tim
Thanks for this article. I started my deep learning journey on a Dell laptop with a GeForce 845M running Windows, and that did not go too well. I worked my way through Deep Learning with Python by François Chollet, but my local machine was not always happy and ran out of GPU memory very fast. Google’s colab.research.google.com was a big help.
We are looking to buy a PC for deep learning. We are most interested in image classification, facial recognition, and number plate reading. We would later like to classify video sequences into known human actions, like running or jumping.
We also plan to dabble a bit with AR so that we can virtually define different areas of detection rules for live feeds. And lastly we will venture into autonomous drones/robots.
To start off we simply want to retrain an image classifier to better fit our solution, but when spending this kind of money on a device we would really like it to cover as much of our future training as possible.
The best card I could find at my supplier was a MSI RTX 2080 TI LIGHT Z – 11GB GDDR6, HDMIX1, DPX3
How far will that get me? We are currently using a Caffe model, but we are not really fixated on a specific solution yet. The other sample models I could find were slow to run, but the Caffe model ran at close to 17 frames per second.
Thanks
Hannes
Tim Dettmers says
The RTX 2080 Ti should enable you to train everything that is out there. If something is too big, use a smaller batch size or sub-mini-batches and update the gradient every k sub-mini-batches — should work like a charm!
Nick says
Hi Tim,
Your website has been very helpful for learning about deep learning. This fall I will be starting an MS in Stats program that is computer intensive. The program has some courses in basic deep learning, and they require the use of a laptop. I was wondering if you could provide some advice on what would be the best setup for a laptop to learn basic to moderate deep learning.
I plan to purchase a laptop with the RTX 2070. There are 3 laptop versions of this GPU: RTX 2070, RTX 2070 Max-Q 90W, RTX 2070 Max-Q 80W. Is the difference between these three variations going to make a noticeable difference in machine learning?
Most of the laptops I’m considering have (or will have) the i7-9750 CPU with 6 cores, but I may splurge on one with an i9-9880 with 8 cores. Is the extra money better spent on a bigger hard drive or on a faster CPU with more cores?
I plan to have 32 GB RAM on the laptop. Is this enough RAM to get started with deep learning?
I plan to have two SSD hard drives on the laptop. Ideally I would like two 1 TB 970 Evo Plus drives. One would be committed to Windows and the other to Linux. Is this an excessive amount of space for my beginner level of deep learning? Would two 500 GB drives be enough space?
The laptops I’m considering are the following:
TLX by Falcon Northwest
Aurora by Digital Storm
P75 Creator by MSI through Xotic PC
Raider by MSI through Xotic PC
Evo17-S by Origin PC
Any advice you can provide will be greatly appreciated. Thank You!
Tim Dettmers says
Two 500 GB drives will be enough for almost all deep learning applications. I would get the best GPU you can afford since you will use your laptop for quite a while. Get a regular RTX 2070; compared to the Max-Q versions it will give slightly better deep learning performance, though it will suck away more battery.
MM says
Hi Tim. I have a similar case of buying a laptop specifically for deep learning. I narrowed it down to 2 options: 1) the MSI GS65 Stealth 9SF and 2) the MSI P65 Creator 9SE. Both are quite similar (with an i7-9750H CPU and a 512 GB SSD), yet the first one has an RTX 2070 Max-Q while the latter has an RTX 2060. The first one costs about 150 USD more. Is it worth paying that extra money? Also, both have 16 GB DDR4 RAM. Do you think 16 GB is a reasonable option, or should I look for the 32 GB variants considering the graphics cards?
Tim Dettmers says
I would say the extra 150 USD is worth it. I think 16 GB is enough for most applications.
julian says
Hey Tim,
thanks a lot for the great blog and all the work you put into answering people’s questions.
I would like to replicate or run at least a similar benchmark as your word-level biLSTM ((2) For word and char RNNs I benchmarked state-of-the-art biLSTM models).
I couldn’t find a reference to the benchmark implementation in your blog.
Could you point me to it, or do you have more information on the benchmark you ran?
Thanks for your help
Tim Dettmers says
That sounds great! I used this repository: https://github.com/salesforce/awd-lstm-lm
Please let me know what you find — thank you!
ALEX says
Hi, I see the LSTM benchmark uses torch.nn.LSTM.
I don’t use PyTorch myself but TensorFlow, and I use TensorFlow a lot with LSTMs.
I noticed the vanilla LSTM implementation in TensorFlow is not very good: it uses a “for loop” over each timestep, and in that case it can’t fully utilize the GPU. For instance, my GTX 1660 runs at 75% with TensorFlow’s basic LSTM.
But there are other, more efficient LSTM implementations, like the block-fused LSTM in TensorFlow; it keeps my GPU at 95% utilization and is 230% faster.
So I wonder if you could run another kind of LSTM for these cards, like the block-fused LSTM?
Here is a repository that is well known in NLP, one of the best for the NER task: https://github.com/guillaumegenthial/tf_ner
Thanks a lot, I really appreciate your post. I am now thinking of buying a 2080 Ti, but I found it is slower than the 1080 Ti for most of my projects, which is really frustrating.
Should I just buy a 1080 Ti?
Tim Dettmers says
I think the standard PyTorch LSTM implementation uses a fused implementation, but I am not entirely sure. I will be a bit more careful in the next iteration of this blog post. However, I was also aiming at benchmarking code that most people would use rather than the most optimal code. This approach reflects the typical experience users would have with a specific GPU. Do you think this makes sense?
Raj says
What is the idle temperature of your GPUs? You seem to have put them so close to each other that they may not be able to get enough air circulation, I guess.
Tim Dettmers says
Depends on the cooler on the GPU and the room temperature. Usually, you can get 30-40C in a normal office; in a hot server room multiple GPUs next to each other might idle at 50-60C.
Atralb says
Hi Tim, Thx an awful lot for all your articles. These are genuinely the best available publicly.
I have a couple of questions:
– Would you buy an RTX 2070 for 475€ or a used GTX 1080 Ti for 650€?
– Do you think the RTX has enough memory for audio analysis (in particular music)?
– I am the “new to deep learning but serious about it” type of guy. Do you think having only 1 of those cards will be enough to get good experience quickly, or would you rather recommend buying 2-3 GTX 1060s for the same price?
Tim Dettmers says
– Definitely the RTX 2070!
– I am not sure, I have never worked with audio. Usually, you can get away with a low amount of memory if you use the right techniques such as 16-bit compute and gradient aggregation and so forth — so it might work well unless you are aiming to develop state-of-the-art models.
– I think a single RTX 2070 will give you a pretty good experience. I would recommend 2 RTX 2060 if you want multiple GPUs instead of GTX 1060s.
Can says
Hi Tim, thank you for this detailed blog post.
I’m not sure whether mixed precision training really helps to decrease memory consumption at the moment when using Apex. Normalization layers use fp32 and the optimizer keeps parameters in fp32 for stable results. Therefore, we almost have an fp32 replica of every fp16 tensor when training a deep CNN/MLP. This might even increase memory usage.
Mixed precision increases the arithmetic intensity of GEMM/Convolution kernels and allows these kernels to utilize the Tensor Cores on modern RTX cards. Thus, mixed precision only speeds up heavy lifting operations without any significant decrease in memory used at the moment.
Thanks.
Tim Dettmers says
The weights of the network usually constitute only a very small part of the memory for the network, thus 16-bit activations can help to reduce memory. If you do 16-bit computation but aggregation in 32-bits it will have no benefit — you are right — but generally, you want to avoid that anyway. If I run 16-bit transformers I can increase the batch-size by quite a bit — it is not double the amount, but I can increase it by half and more and use the same amount of memory. I am sure it is similar for CNNs.
Yaroslav Smolin says
Thank you for the great post. Just one question: as I understood it, to achieve GPU-to-GPU parallelization I need to have the same GPU architecture. But what does this mean? For example, the RTX 2070 and RTX 2080 Ti have the same Turing chip architecture, but you mentioned that they are not compatible. So basically GPUs must have the same name? I would really appreciate your answer.
Tim Dettmers says
I am actually not quite sure whether peer access can be enabled between different chips of the same architecture — I have never tested that! From what I can tell from the NVIDIA documentation, it might just work, but I cannot give you any guarantees.
PH says
Hi Tim Dettmers:
Thanks for the blog posts.
You wrote:
>>”…. I would never recommend buying an XP Titan, Titan V, any Quadro cards, or any Founders Edition GPUs.”
Can you explain why?
I am not sure where to post the following question, but maybe this is as good a place as any since you do not have a similar topic. Here goes…
I am trying to set up a 2-to-4-GPU deep learning system using Ubuntu 18.04.
One pair of GPUs is GeForce GTX 1080 Ti.
The NVIDIA drivers I have used were 390.xx and 418.xx.
They both gave me the same problem, as described below.
Here is the problem:
When I have just ONE GPU, the display works, and the OS boots up just fine.
After installing TWO GPUs (and I have done this multiple times),
when I try to load the OS, the display is blank.
But if I remove the 2nd GPU, everything works again!
Note that the display was still connected to the 1st GPU when I have two GPUs setup!
What do you think is causing this problem?
Since I am using my GPUs for deep learning,
I only need one display connected to ONLY one GPU..
I did not see you have a blog for step by step installation of two or more GPUs
on a Linux system.
Do you know of one or more websites where there are step by step instructions
on installing two or more GPUs for deep learning?
Thanks for sharing your knowledge.
Tim Dettmers says
This is strange; usually it just works. I guess there is something wrong with your NVIDIA driver installation or with your X server config (Xorg).
Ed Austin says
Great article.
On a budget, I purchased a GTX 670 with 4GB, which, although limited, at least has a decent memory capacity. I was thinking of adding a second GTX 670 with another 4GB later; combined CUDA benchmarks (assuming 100% efficiency) would give me 1060-level compute capacity with 8GB – wishful thinking – but are my assumptions too wacky?
Thanks!
Tim Dettmers says
Yeah, it does not really work like that. I would not recommend GTX 670 cards as they will be quite slow and the memory will not be sufficient. Adding a second GPU does not double your memory in most cases (only if you use model parallelism, but no library supports this well enough). You would be better off with a GTX 1060 or, even better, an RTX 2060 if you can afford it.
SIP says
Hey Tim, thank you for your in depth post. I do a lot of deep learning for my job and am building a machine to do personal experimentation outside work. I do a lot of vision stuff and I’ve been doing some RL stuff too. Previously, I’ve done most of my work with 1080 ti’s but from time to time I’ve run into issues with memory. I’m trying to decide whether the difference in performance and particularly memory between the 2070 and 2080 ti would be significant for my use case. I preferably would want 2 gpu’s so I could train multiple models at the same time and use data parallelism. I can afford 2 2080 ti’s but would rather go with 2 2070’s if my use case wouldn’t take advantage of the 2080 ti’s. I additionally will likely be going to grad school in 1-2 years so I don’t know if it makes more sense to get something cheaper now and then upgrade then. Do you have any suggestions?
Tim Dettmers says
If you ran into memory problems with GTX 1080 Tis then RTX 2070s might not have enough memory in some instances. However, you can use techniques where you chunk each mini-batch into multiple pieces (or in other words, you aggregate smaller mini-batches) to do a single update. This saves a lot of memory. If you have not used this technique before, I would go with the RTX 2070 and use this technique together with 16-bit. Otherwise, go for a single RTX 2080 Ti.
Aron Boettcher says
So, regarding your list of GPUs to avoid: is the argument here based solely on the price/performance relationship?
Also, how relevant do you think the Tensor core technology is? If I can only get one card, is it worth it to go with the Titan RTX (rather than doing multiple cards in SLI?) for the tensor cores?
Tim Dettmers says
Yes, I do not recommend GPUs which are a waste of money. You can get the same GPU for less money, no reason to buy these expensive ones!
Tensor Cores are good, but all RTX cards have them. You should buy a Titan RTX only if you need the additional memory.
andrea de luca says
Let’s compare two 2080 Tis (or even two 1080 Tis) with a single Titan RTX, and let us do it purely on a memory basis (that is, we neglect the training speed). Since you can have more or less the same memory as the Titan with two relatively inexpensive cards, I’d like to know whether there are use cases in which a model necessarily has to reside in a single card’s VRAM. In other words, are there use cases that one can handle with the Titan but NOT with two 11GB cards? Thanks!
Aron Boettcher says
Thanks for the response Tim,
Since I’m looking to build a machine that my company will be paying for (it will be my primary machine), it is very unlikely that I would get it upgraded in the future, or that I would add an additional card (our IT is very backwards, and although I can and have built my personal PCs, I won’t be able to change my work machine without a lengthy and expensive involvement with IT).
In this sense, I’m less concerned about the price than I am about keeping the configuration simple enough that the IT/finance people don’t get confused, and about it being forward-compatible enough that I’ll be able to use this machine for at least a few years. Likewise, with the models we’ve been building, we’re already hitting memory limitations.
Just letting you know these things so that you or your audience can understand that sometimes the decision is more complex than just price/performance.
Things would be very different if it were my personal machine or my company wasn’t intending to foot the bill.
Fred Chang says
Hi,
It’s my first time buying a GPU. In this article you write, “However, note that through 16-bit training you virtually have 16 GB of memory and any standard model should fit into your RTX 2070 easily if you use 16-bits”, but the spec of the RTX 2070 is 256-bit. Does that mean the RTX 2070 can run with 16-bit? Thanks!
Tim Dettmers says
256-bit is the width of the memory controller; 16-bit refers to the precision of the compute units.
Fred Chang says
Thanks! Tim, you are really a great helper!
Fred Chang says
Tim:
Thanks for the reply. I have another question: is it necessary to have two GPUs, one for display and one for computation? Thanks!
Tim Dettmers says
No, it is not necessary; I use the same GPU for display and computation.
Mike says
Hi Tim
Very nice article. Can you comment more on your dislike of Quadros? I ask because this is obviously a fast-moving field and Dell now seems to be doing an excellent workstation deal with either single or dual Quadro RTX 6000 cards and lots of memory (24GB), e.g.
https://www.dell.com/en-uk/work/shop/desktop-and-all-in-one-pcs/precision-7920-tower/spd/precision-7920-workstation/XCTOPT7920EMEA?selectionState=eyJPQyI6InhjdG9wdDc5MjBlbWVhIiwiTW9kcyI6W3siSWQiOjMsIk9wdHMiOlt7IklkIjoiNjRHNFIifV19LHsiSWQiOjYsIk9wdHMiOlt7IklkIjoiR0hFSTNVOSJ9XX0seyJJZCI6MTEsIk9wdHMiOlt7IklkIjoiRzVLQVkyMyJ9XX0seyJJZCI6MTQ2LCJPcHRzIjpbeyJJZCI6IkQ0MTEwIn1dfSx7IklkIjozNzIsIk9wdHMiOlt7IklkIjoiTk9PUFQifV19LHsiSWQiOjQxMiwiT3B0cyI6W3siSWQiOiJESEVBVFNLIn1dfSx7IklkIjoxMDAyLCJPcHRzIjpbeyJJZCI6IjUxODk0NSJ9XX0seyJJZCI6MTAwMywiT3B0cyI6W3siSWQiOiJVQlVOVFUifV19LHsiSWQiOjIwMDA3NiwiT3B0cyI6W3siSWQiOiI3NzYzMjcifV19XX0%3D
Is it just a matter of price for such GPUs, or are there bigger issues?
Thanks
Mike
Tim Dettmers says
Yes, it is just the price. These GPUs are very cost-inefficient. Personally, I would also not buy server hardware with less than 4 GPUs.
Marv_ says
What a nice article!! Based on your advice, I bought a 2070 recently. Thank you!
I am wondering why RTX cards are not good at word RNNs? In Figure 2, they do not seem to do well in that area.
If I want to do some DL projects with two cards, would you recommend the 1660 Ti or the 2060? I am shifting from CV to NLP.
Waiting for your reply. Thx again~
Tim Dettmers says
I am not entirely sure why this is the case. I speculate that the decrease in shared memory per SM from 96 KB (Pascal) to 64 KB (Volta/Turing) decreased the performance of small matrix multiplications and thus slows down RNNs that use short sequences of length 100 or below.
Marv_ says
Wow. Your reply came quickly. Thx~
I am wondering whether a cheap Titan X Pascal is a good choice now. I don’t know how cheap is cheap enough; 500-600 I guess?
If I am only looking at the new Turing cards, would you recommend the 1660 Ti? There are no Tensor Cores in it.
Thanks again~
chanhyuk jung says
What is better in terms of cost efficiency; cheapest rtx 2070 or highest boost clock rtx 2070?
Tim Dettmers says
Cheapest RTX 2070 by far.
Aaron says
Scaleway offers Cheaper Cloud Products Compared to AWS & Google.
They also have a GPU Instance https://blog.scaleway.com/2019/gpu-instances-using-deep-learning-to-obtain-frontal-rendering-of-facial-images/
Tim Dettmers says
Looks like a good alternative, but I have no time to evaluate it in detail. Note that AWS, Azure, and Google offer more than just a GPU for a low price, but if one is fishing for cost-performance just in terms of compute this might be a good service.
James says
Tim, thanks for the article. I’m just starting out with Keras. Would you recommend RTX 2060 or GTX 1060?
Tim Dettmers says
RTX 2060 if you can afford it. If your budget is tight, go for a GTX 1060.
Zack says
Hi, I’m a degree student currently doing a project on detecting tree disease from leaves, so my dataset is pictures of leaves with diseases and some normal ones. Which one is better, the GTX 1080 Ti or the RTX 2070? I will be using CNNs.
Tim Dettmers says
I would go for the RTX 2070 and learn how to do 16-bit training in your training framework.
Sachin says
Google offers the K80 card as a GPU option when you configure a cloud VM. Is it worth choosing it, when what you want to do is train LSTMs and Transformers?
Tim Dettmers says
Yes, the K80 is a good card for that. Whether it makes sense to use it compared to other GPUs depends on the price. For LSTMs a K80 works very well, but for Transformers the price should be at least 3 times better than for V100 GPUs, or otherwise a V100 is more cost-efficient.
Sachin says
Thank you Tim
Mikołaj says
Tim, I love your work. But the question for me is still whether to choose the RTX 2060 or the GTX 1070, as they are priced similarly?
The main problem with the 2060 is memory, which is only 6GB. I am planning to do NLP stuff that may require big dictionaries, and I am not sure I would be able to process them even with the addition of mixed precision (FP16 support). You advise treating FP16 as an additional 50% of memory because not everything can be done without FP32, but that gives me a *possible* 1 GB more than the 8GB I get for sure with the 1070. Is it worth it?
The second thing: going FP16 with the 1070 should also give me more memory, but not necessarily better performance (as GTX cards are not optimized for that), right? So overall, if fitting the model in memory is most important for me, is the 1070 the better choice in this price range?
Tim Dettmers says
You are correct in that a GTX 1070 with 16-bit will yield an additional memory benefit. If you really think the 8 GB will not be sufficient then going for a GTX 1070 might be the right choice. However, I would probably go for an RTX 2060, use 16-bit and use a small batch-size and aggregate the gradient of multiple batches before doing the weight update. Note that even with this it will be difficult to train standard-sized or big transformers. You would also run into problems when you use a GTX 1070 for such big models, but with 16-bit, small batches and gradient accumulation you might be able to fit and train a big transformer.
Mircea Giurgiu says
What about RTX8000?
Tim Dettmers says
Quadros are very cost-inefficient. I do not recommend them.
Zhihui Chan says
I am a senior student from China. Thanks for your helpful blog, but I still have a question. When using multiple GPUs, for example two different ones such as one RTX 2080 Ti and one GTX 1080 Ti, what will happen?
Tim Dettmers says
Peer-to-peer GPU communication will not be available so you cannot do transfers like these: RTX 2080 Ti -> GTX 1080 Ti. Instead you need to make transfers like these RTX 2080 Ti -> CPU -> GTX 1080 Ti. This can make parallel training quite slow. So for parallel training you will need two GPUs of the same kind.
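If you want to check what your own machine supports, a quick sketch in PyTorch (assuming a CUDA build and at least two GPUs installed) is:
import torch

# Reports whether the driver allows direct GPU-to-GPU (peer-to-peer) copies
# between device 0 and device 1; mixed card models usually report False.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.get_device_name(0), "<->", torch.cuda.get_device_name(1))
    print("Peer-to-peer possible:", torch.cuda.can_device_access_peer(0, 1))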
mnawar says
Thank you for your great post.
What do you think I should go with in March 2019?
I mainly work in deep reinforcement learning and need to upgrade to a new 2070 or a used 1080 Ti; they are almost the same price in Egypt. I would buy either a 2070 with international warranty from Amazon or a used 1080 Ti with local warranty.
My two years 1060 is bought from Amazon and I didn’t face any problems with it.
I’m biased towards 2070 only to get hands-on with 16-bit models and it may be more future-proof.
Tim Dettmers says
I agree that the RTX 2070 is a little better, but if you need to import it the costs might be steeper. The GTX 1080 Ti is also an excellent card and probably good for another 1-2 years.
mnawar says
No, the import fees are included in the prices. So should I get the 2070?
I’m also running a Z400 workstation with a W3530 and have no problem getting a W3690. I’m not sure if I should upgrade the CPU and PSU or build a complete PC for the new GPU, as the Z400 has PCIe 2.0 and a maximum memory of 24GB.
Zoran says
Hi,
How much slower should a GTX 1050 Ti be than a GTX 1080? I have built, for the first time, a Windows-based system with a 1050 Ti and a Core 2 Quad CPU, and I find that the GPU is only 5 times faster than the CPU, while the same test with a GTX 1080 is around 70x faster than the CPU. I expected the GTX 1050 Ti to be at least 20x faster than the CPU, because I think the 1080 should not be more than 3 times faster than the 1050 Ti.
Tim Dettmers says
Yeah, those numbers do not sound quite right. They are close, but still a bit too far off. I am not sure what is happening. A GTX 1080 should be about 5-8x faster than a GTX 1050 Ti. I cannot think of a scenario where the GTX 1080 is 14x faster compared to a GTX 1050 Ti. Is the GTX 1080 using the same CPU?
Damian says
I am thinking of building a budget PC with a few (4-7) cheap GPU cards. Do you think, for example, 6x 1060 3GB + a 120GB SSD + 8 or 16GB RAM and a cheap processor (dual CPU?) will be OK for AI/deep learning?
Tim Dettmers says
I would stay away from dual CPU motherboards etc. Just get a regular motherboard and 4 GPUs, that makes a much more solid system with less problems.
Andrea de Luca says
Agreed, but you said that 8 lanes per card are a bit limiting if you have 4 cards, and boards equipped with PLX/PXE are not viable due to nvidia drivers issues..
Nikos says
Linux drivers were not affected last time I checked, i.e. they worked fine with the PLX 8747. However, I’ve personally lost confidence in NVIDIA, since it changed the rules after the products were purchased. I’m still on June 2017 drivers to use 4x GTX 1080 Ti cards with an Asus X299 Sage (which uses PLX), and I’m in a constant fight with Windows, as it sometimes upgrades the drivers without my consent. Hence, I can’t recommend a mobo with PLX.
Tim Dettmers says
8 Lanes per card are totally fine if you run on 4 GPUs.
andrea de luca says
Tim & Nikos: Thanks. Given this, it seems that the 1080 Ti is still one of the most attractive GPUs in terms of price/performance ratio. Note that you can train in FP16 even on a Pascal card (your effective VRAM will be doubled; you’ll just get a more modest speedup, which amounts to ~10-15%, for example on ResNet-50).
For something like 2000 euros, you can get four cards and be more competitive than with two 2080 Tis or a single Titan RTX. Tell me if you concur…
Bull Shark says
I can get a tesla M40 24GB for 800. I can also get an rtx 2080ti for 1100. Which one would be the better choice? Given that in the case of M40 I take care of the thermals.
Dale Smith says
Thanks for the very interesting comments.
We just purchased an Intel Compute Stick 2 for $100. Being a startup, that fits our pre-seed budget.
Does anyone have benchmarks comparing this stick to deep learning on a RTX 2070 or a GTX 1060?
This is a pretty good summary of a case study using the NC stick. https://software.intel.com/en-us/articles/detecting-invasive-ductal-carcinoma-with-convolutional-neural-networks
Tim Dettmers says
If you take the theoretical maximum compute of an Intel Compute Stick 2 and compare it to the real/practical compute of an RTX 2070, the Intel Compute Stick 2 comes out at about 50% of the speed of an RTX 2070. The real number is probably closer to 25%. On top of this, you should consider software: Intel software is terrible and I would not recommend the Intel Compute Stick 2 for this reason. However, if you are in a low-watt setting, the Intel Compute Stick 2 might be a reasonable option if you are willing to accept software nightmares.
Pengfei Zhang says
Hi,
Can I use a SUPERMICRO SuperO MBD-C9Z390-PGW-O LGA 1151 (300 Series) Intel Z390 HDMI SATA 6Gb/s USB 3.1 ATX Intel motherboard, which has a PLX chip, to let my i7-8700K support four 1080 Tis for machine learning?
Cheers!
Tim Dettmers says
That could work. Just make sure that you have some form of confirmation that this setup actually works and then you will be fine.
Pengfei Zhang says
Hi, thx for your post, it really helped lots of people!
I’m building a machine for computer vision, especially video analysis, so memory is a big issue for me. Considering memory/price, I decided to upgrade my 2x 1080 Ti machine to 4x 1080 Ti (the 1080 Ti can do fp16 as well, just slower…).
I have an i7-8700K CPU and a Prime Z370-A (which supports 2x 8x PCI-E). I just wonder: can I replace the motherboard with something like an ASUS WS Z390 PRO and keep my CPU? From the comments here I realized it only has 16 lanes, so it seems it can only fit 2 GPUs. However, I’m wondering whether I can use it for 4 GPUs, maybe at lower speed, and how slow that would be? Otherwise I need to sell it and get an AMD combo, and that’s another big amount of money 🙁
Tim Dettmers says
Indeed, getting more GPUs instead of faster ones has the problem that you need the CPU to support it. Sometimes CPUs support running 4 GPUs with fewer lanes, but I am not sure if that works with an i7-8700K. Best is to look for other people that use a 4 GPU setup with that CPU. Otherwise, it might make sense to upgrade to 2 more powerful GPUs and keep your CPU and motherboard.
yang cd says
How about 1660 ti ?
Ugo says
Yeah, it would be nice to include the 1660 if it’s not too hard! (or just an overall comment on it)
Michel Rathé says
Hi Tim,
Since multi-GPU setups actually come with a lot of challenges, could the upcoming Asus RTX 2080 Ti Matrix, with its Infinity Loop cooling, be the long-awaited optimal solution?
Thanks again for your constant expertise,
Muhammad Fazalul Rahman says
Hi Tim,
First of all thanks for the insightful article. To date, this seems to be the most reliable article to find for comparing GPUs for deep learning.
I was considering getting 4x RTX 2080 Ti for a workstation for my lab, but then I came across an article (https://www.titancomputers.com/Articles.asp?ID=258) that compares workstation and desktop GPUs. In short, they were talking about the fact that workstation GPUs, while costlier than their desktop counterparts, are built for stability and efficiency, and are meant to run at 100% for several days, while desktop GPUs are not meant for it. Taking into consideration the recent news about the dying RTX 2080 Tis, do you think that they can withstand week long training tasks?
Tim Dettmers says
These vendors have a self-serving incentive to tell you that workstation cards are designed like that — they are not. They are the very same chip as consumer cards. What changes in workstations is often that (1) these cards have no fan but larger passive cooling elements, and (2) workstation servers have loud, strong airflow which transports away the heat effectively. Thus the real reason is the strong airflow through the case (and not the GPU itself), and this is difficult to achieve with consumer GPUs. Especially the RTX 2080 Ti has problems with cooling, but there are some good cooling solutions which work and do not require spending the extra money on server hardware. I might update the blog about this next week or so.
Jonathan ALIBERT says
Hi Tim, great post, thank you very much.
Do you know what bandwidth is needed between GPU and CPU/RAM during training, depending on the discipline studied? I’m curious about using one GPU on PCI Express 3.0 x1 (984.6 MB/s) to do single-GPU training (or multiple GPUs with one model per GPU, not distributed models, where it should be totally inefficient). It should work on paper, because CUDA works perfectly in this configuration.
If you think these tests should be run, can you give me some benchmarks to do the job ?
Tim Dettmers says
The bandwidth is usually quite low if you work with larger models. Smaller models with large inputs usually need larger PCIe bandwidth. Never seen 1x PCIe in deep learning. Would be curious if it works for you. Please let us know if you have some results.
jimmy Gu says
Hi,Tim
The RTX 2060 was released almost one month ago. What about this card compared to the 1070 or 1070 Ti?
Also, several RTX card users (they are all gamers) report that their RTX cards (including the 2080 Ti, 2080, and 2070) have blurred-screen issues; there are over 100 reports on Chinese hardware forums. Have you heard about RTX 20-series issues?
Jason says
Tim,
I can’t thank you enough for your article and your updates to it that keep it current. I am new to machine learning and think your recommendation of multi-GPU for gaining feedback faster is great. It very much is a psychological gain and makes for quicker learning. I bought an RTX 2070 to add to my system with a GTX 1080 Ti. I am having trouble running a model on one GPU while using the other GPU for research and running another model. I use separate Jupyter notebook instances, and specify which GPU to use by nesting the model build and training code in a "with tf.device('/gpu:1')" block. I always get errors when I try to train on the GPU not already active. Can you point me in the right direction for learning how to use two GPUs at the same time for different models?
Tim Dettmers says
Difficult to say where the problem is as I am not using TensorFlow. Have you tried setting CUDA_VISIBLE_DEVICES for each process so that each notebook only sees one GPU? That might help.
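As a rough sketch of the environment-variable approach (set it before TensorFlow is imported; the listing call shown is the TF 2.x API):
import os

# Pin this notebook/process to the second physical GPU *before* importing
# TensorFlow; inside the process that GPU then shows up as '/gpu:0'.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should list exactly one GPU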
Ganesh says
Is a GTX 1070 max Q 8GB > RTX 2060 6GB for deep learning?
Also how much better is the 1060 max q than 1070 or 2060? The normalised ratio is hard to determine as the difference between these cards get squashed by the TPU. Thanks!
Tim Dettmers says
The RTX 2060 probably has better performance, but the memory can be a bit small if you use 32-bit.
Anthony Cai says
Hi, Tim
I plan to upgrade my laptop with an eGPU, but the CPU is just an i5-5200U (16GB RAM, 500GB SSD). Is it worth upgrading this laptop with a GTX 980 Ti eGPU? Thanks!
Tim Dettmers says
Your existing CPU and RAM are a good match for a GTX 980 Ti as an eGPU. Deep learning performance should be good.
Joshua Marsh says
Hi Tim,
Thank you for the phenomenal article!
I’m currently in the process of deciding on a GPU configuration for an AI build geared towards training models that take about a week on 8 Tesla v100s. Since my budget is limited by scholarship money, I’m currently trying to decide between:
2 x RTX 2080 Ti
4 x RTX 2070
Initially, I was leaning towards the quad RTX 2070 setup due to increased flexibility in simultaneously training smaller models etc, but the complexity of setting up the custom water cooling system (I’m not sure the 2070 even has a water block yet) seems a bit daunting for a first time pc builder, plus the fact that I would need to replace the entire water cooling system when I eventually upgrade to 4 RTX 2080 Ti’s. Water cooling is a must because it will be in my dorm room near my bed and I don’t think I will be able to bear the 24/7 noise of the blower style fans.
So that leads me a dual RTX 2080 Ti setup. The main benefit seems to be that it won’t require a complex water cooling system (initially), I won’t need any noisy blower fans, and that it is easy to upgrade to a quad setup (with water cooling). I’m just concerned that I’ll be losing an unacceptable amount of performance and flexibility in comparison to the quad RTX 2070 setup. I’m also generally concerned about the long term reliability of a custom water cooling system. I’d like to be able to leave it running for weeks on end and not have to worry about anything.
So yeah, that’s my dilemma #firstworldproblems. Any insight you could give that will help me decide would be incredibly appreciated. Thank you so much!
Tim Dettmers says
You analyzed the situation very well — it is just a tough choice. I think, however, that the stability of two RTX 2080 Tis and avoiding all the mess might be an advantage. You could also get a big case, buy some PCIe extenders, and then zip-tie two air-cooled RTX 2070s to different locations, thus avoiding the heat issues with 4 GPUs. Some here at the University of Washington use this solution. But I have no experience myself with the zip-tie method, so I don’t know yet if this is a good solution.
Panand says
Consider also the heat that is generated by those GPUs. I have two GTX 1080Ti’s with waterblocks and they heat up the room quickly. Going for 4xRTX 2070 and then switching to 4xRTX2080ti requires selling those 2070s and their waterblocks and setting up 4 GPU watercooling up front, or using PCIE extenders. Used waterblocks won’t be of much value, as reliability is important.
Maybe go with dual 2080 Tis and later add a custom water loop. A two-GPU water loop should be a gentler introduction to water cooling than four.
Tim Dettmers says
This is great information and feedback! Thank you for sharing!
ThanosPAS says
Hi Tim,
Thank you for all these very informative articles. You are a point of reference! I am going to study Bioinformatics in my Master’s and I want to focus on ML and DL in these 2 years. Of course these tasks will be only a portion of what I will be required to do, but I believe there won’t be anything more demanding than these tasks. Since I need to move a computer around with me, I am going to buy a laptop, and using a cooling base underneath it, I was thinking of doing my model training and so on on this machine. I am a little worried about the weight, but you can’t have it all, I guess. The specs I am thinking of buying are the following:
i7-8750H or i7-9750H (if it comes out in Q2 ’19 – more powerful cores, no Hyper Threading)
RTX2070 (the laptop I am eyeing draws max ~115-120 watts for the GPU)
32 GB RAM (2666mhz)
970 EVO PLUS 2 TB NVMe (it will be released in April)
gaming cooling system – Cryonat paste
17.3” display
1.Do you find this rig adequate for someone like me just starting ML training?
2. I’ve read in your article that downclocking doesn’t play a big role in ML performance, but do you think undervolting the CPU in case of throttling improves the overall ML/DL performance/experience like it does in certain gaming scenarios, for example? (if you need to keep clocks high for single-threaded performance)
Thank you for your time and patience!
Tim Dettmers says
You can also consider buying a desktop and a small laptop. Then you can always move around and ssh into your desktop when you need your GPU. Another option is to get an eGPU, but then you can only run at one place and not move around. The “mobile but heavy laptop” is also a good solution though it is definitely adequate for a large proportion of deep learning problems and models. If something does not fit into the GPU memory, you can always get a cloud GPU instance from somewhere to do your work while you use your laptop for prototyping. All of these solutions have advantages and drawbacks and it is a bit of a personal choice. I personally would get a desktop and ssh into it with my laptop.
ThanosPAS says
Thank you for taking the time to answer! I appreciate it 🙂 I will consider your suggestion 🙂
Khaled Mohammad says
Hi Tim,
I am really confused: should I get the RTX 2060, as this is what is in my budget, or should I get something else? Please can you push an update to this article including the RTX 2060!
Thank you very much!
Tim Dettmers says
Will do this sometime the next weekend.
Tim says
I haven’t gotten my hands on an RTX card yet. Is there a way to force Keras to use Tensor Cores or utilize fp16 that you are aware of? Does the following work, or is there another method? Thanks.
from keras import backend as K
K.set_floatx('float16')
Val Schmidt says
Hello Tim!
Thanks for writing such an informative post. Really terrific!
I have a naive noobie question.
Can you comment on strategies for training a CNN whose final application will be to do detection/classification on a platform with constrained (lesser) compute resources than the training machine? Being new to this we are fearful that we will purchase a high end GPU, tune our algorithm to optimize its capability and speed for training on that card, and then find that we cannot fit the model into the system we’ll use to deploy it in the field in real-time. Is there a strategy to ensure we don’t have this problem you could recommend, other than using the deployment machine for training, which would work but presumably be much slower.
Thanks!
-Val
Tim Dettmers says
Usually, you can use a CPU for inference after you have trained a model; since you will often be processing one sample at a time, a CPU is quite good for this task. So one way would be to see if the processing time on your CPU is acceptable. If it is, then everything is fine. If not, you need to make the model smaller through distillation/truncation/sparsification/quantization or by simply training a smaller model.
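As one hedged example of the quantization route, recent PyTorch versions offer dynamic quantization, which stores Linear/LSTM weights as int8 for CPU inference. A minimal sketch with a hypothetical stand-in model:
import torch
import torch.nn as nn

# Hypothetical small model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: Linear weights are stored as int8, shrinking the model
# and often speeding up CPU inference; activations stay in floating point.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same output shape, smaller and usually faster on CPU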
George M says
I’ve spammed this thread quite a bit now without knowing much at all, but I really am passionate about figuring out the FP16 puzzle before investing in an RTX card. I think I have found a data point strongly against buying, if you happen to use the esteemed fastai library (which I do). Here’s a relevant thread from their forum:
https://forums.fast.ai/t/how-to-install-rtx-enabled-fastai-cuda10/29092/23
I don’t guarantee I have this right, but at this point it sounds like there is no way to train in FP16 without halving the batch size, which pretty much defeats the purpose. Add in the half-speed FP32 accumulate, and it sure sounds like it isn’t worth it to buy. Again, I assume this is specific to fastai only – there may still be benefits if you don’t use that. Thoughts welcome.
G. says
Good news: Here’s a followup that negates the above. It appears the OP made a mistake somewhere, because others are getting fine results. Curiously, FP16 shows improvement even on a 1080ti, though not as much as on a 2080ti.
https://forums.fast.ai/t/comparision-between-to-fp16-and-to-fp32-with-mnist-sample-on-rtx-2070/35693
The usual George says
In counterpoint to the above, I have found the following:
https://forums.fast.ai/t/comparision-between-to-fp16-and-to-fp32-with-mnist-sample-on-rtx-2070/35693
It appears the OP may have made a mistake, because others are reporting FP16 working fine. Batch size was not 2x but closer to 1.8x.
Tim Dettmers says
The problem with the benchmark is that PyTorch allocates memory but does not free it, so it can reuse it in the future (saving the call to cudaMalloc for higher performance). To actually release the memory you need to call a function in PyTorch. You will not see any speedups from 16-bit in this case since the model is just too small and not compute-bound.
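To illustrate the caching behavior, here is a minimal sketch, assuming a CUDA-enabled PyTorch install:
import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes held by live tensors
del x
print(torch.cuda.memory_allocated())  # drops, but nvidia-smi still shows the cached block as used
torch.cuda.empty_cache()              # returns the cached blocks to the driver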
Joakim Edin says
Great post! I am considering buying 2 RTX 2070, but I discovered that they do not support NVlink nor SLI. Is this a problem when using Deep Learning?
Tim Dettmers says
NVLink and SLI are mostly gaming constructs. You will be fine and you should have no problems parallelizing across your GPUs.
Tim says
I see RTX 2060 and Titan RTX benchmarks are up. https://www.phoronix.com/scan.php?page=article&item=plaidml-nvidia-amd&num=4
Tim Dettmers says
Thank you for the link! Unfortunately, these are OpenCL benchmarks which do not utilize Tensor Cores — so not the best benchmarks to compare different RTX cards.
Ethan Zou says
Could you please also add RTX 2060 to the comparison? I’m very curious how this card performs since I’m short of money but still want to try the mixed precision.
yang cd says
How about the 2060 compared with the 2070?
Mircea says
Thanks for the comprehensive guide, Tim. I’d like to weigh in on the blower-style cooler recommendation. I tested an MSI 2070 Aero (blower-style) and when installed next to my 1060 it reached 80 degrees Celsius and the puny fan was spinning at more than 3000 RPM. Needless to say, I returned it, and now have the Asus 2070 ROG, which has three fans. The card does not exceed 64 degrees Celsius and the fans don’t exceed 1900 RPM. It is phenomenally quiet despite the fact that it is factory overclocked.
So my recommendation is to NEVER get a blower-style card unless you can deal with the whiny fan noise. A large case with good airflow, two fans in the front and one in the back (creating more pressure inside to force dust and hot air out) is the way to go.
Tim Dettmers says
Sorry for your experience with a blower-style fan and thank you for your feedback. I have had a different experience with blower-style fans with GTX series cards. I think the RTX cards just might be a bit different here, which is surprising since they have lower TDP. I consider adding your experience to the next blog post update.
George M says
Mircea, if I’m understanding you correctly, this noise is called coil whine and may be caused by the individual card that you had, not necessarily the fact of a blower model. I have encountered this situation on non-blower cards as well, including AMD cards. Unfortunately, it is fairly common and there is no way to know if you have it or not, without actually testing the card.
soldierofhell says
Hi, but what about size? The ROG is 4.89 cm thick, which is more than 2 PCI slots, so on a typical TR4/X399 motherboard you can’t fit 4 such GPUs, am I right?
Actually, all non-blower cards seem to be more than 2 slots wide.
Tim Dettmers says
Never thought about that. Indeed, if a GPU is larger than 2 PCI slots I would not buy it.
Rob says
Hi Tim,
Your last update was last year, and I am just wondering if you think Vega 20 or the MI60 was able to reach the goal of competing within range of dedicated Tensor Cores?
And thank you for your blog post, it was super informative. I’m hoping to do more in this space with my 1070 before I go out and buy something.
Tim Dettmers says
I think the new AMD cards might be competitive, but I need to see actual benchmarks to come to a definite opinion.
Sadak Vali says
Thanks for this article
What are the advantages of buying an RTX 2060 6GB over a GTX 1060 6GB
in terms of 16-bit training, model training time, value for money, and in general?
Tim Dettmers says
The RTX 2060 has 16-bit training, faster model training, I have not calculated the value for money yet, but in general, it is faster but more expensive.
Martin Mocko says
For me RTX 2060 is also a very interesting card. Relatively cheap, but should offer FP16 training with 32bit accumulate in the dot product, so if I definitely need FP16 could this be the right choice? (I don’t have too much money to spare)
Tim Dettmers says
Yes, it is probably the best cheap but fast card right now. It is a perfect card to get started with deep learning.
Sadak Vali says
Can I do 16-bit training on GTX 1060 6GB?
I am planning to buy one of these 2 cards;
please suggest which one to buy.
George M says
While you can technically do 16-bit, there will be no appreciable speedup, because FP16 is deliberately crippled on the 10-series cards. With the 20-series, FP16 essentially runs at full speed, except FP32 accumulate will run at half speed. So it will be faster than full FP32 training, but not by exactly double.
Bull Shark says
Hi Tim, AMD’s Radeon VII has recently been announced. What are your thoughts about this card? Two important specs are the 1024GB/s memory bandwidth and the 13/14 TFLOPS. It also features 16GB of HBM2 with a 4096-bit memory bus. This all sounds pretty good to me, if I don’t consider the lack of deep learning library support for AMD. What do you think?
Another question: there’s this deep learning library called PlaidML that supports some AMD ROCm hardware. Do you know whether they will be supporting Vega 20 (so the Radeon VII)?
Thanks in advance !
Yaohua Guo says
Hi Tim,
Thanks for a great article, it helped a lot.
Recently I have been thinking about purchasing a GPU, and I have some questions about the choice of memory. Is the RTX 2070’s 8GB of memory enough for object detection (YOLO, SSD, Faster R-CNN) and NLP (Transformer) training tasks?
A friend of mine trained Fast R-CNN on a GTX 1080 Ti; if the batch size is greater than 2, it runs out of memory. Is the RTX 2070’s 8GB enough for these training models? For which tasks is it enough, and for which is it not?
Thank you.
Tim Dettmers says
If you use 16-bit then theoretically the models that fit into a GTX 1080 Ti also fit into an RTX 2070. However, it is often not that straightforward since frameworks often store 32-bit weights alongside the 16-bit weights to do more accurate updates. You can ask your friend to use 16-bit mode and weights on the GTX 1080 Ti, and then you will know how much memory the code consumes and whether it fits into an RTX 2070.
Michael says
I hope that comparisons and diagrams for RTX 2060 and RTX Titan will appear as soon as possible, a very nice article.
Houssem MENHOUR says
Hi,
Thanks for the continuously updated guide. With the release of the AMD Radeon VII and its 16GB memory, is there any chance for it to perform well enough in DL tasks? I know that it won’t beat Cuda and that the software support is not quite there yet, but I’m curious.
Tim Dettmers says
It probably performs very well. However, since I do not have the GPU myself I will only be able to discuss it in detail if benchmarks are released.
Ken Fricklas says
It’s pretty much identical in real use with the RTX2080Ti (both Inception3 at ~190). And it’s much less expensive.
ABHINAV MATHUR says
The guide is detailed and enabled my organisation to buy the perfect GPU server for AI workflows; we bought an 8x Tesla-based server and it is quite powerful.
Paul says
Hello, Tim.
I carefully read your article and deeply appreciate it.
Now I am considering buying my first deep learning system and I really need your help!
If you had about $3000 for GPUs, what would you buy?
1. One RTX Titan
2. Two RTX 2080Ti
3. Four RTX 2070
4. Other option?
(I am interested in analysis of medical imaging or clinical photographs)
Thank you very much and happy new year!
Tim Dettmers says
Medical imaging and clinical photographs usually need quite a bit of RAM. I would either go with two RTX 2080 Tis or one Titan RTX. RTX 2080 Tis are most cost-efficient, but the Titan RTX might be useful in some cases.
Rahul S says
Tim,
I’ve just got a new rig setup. An AMD Threadripper 1920, with 32GB DDR4 and a GeForce RTX 2070. I’ve got the software setup and stable now on Linux Mint 19.
Do I need to perform any “tuning” to extract the performance I can expect from this setup? Or is it simply plug and play? I haven’t done any overclocking or setup apart from the basics to get things running.
Thank you
Tim Dettmers says
I did some tests myself with a similar setup, and it seems for PyTorch that installing via Anaconda or compiling from source yields about the same performance. However, on the other hand, I heard that some people were reporting better performance with compiled source code. I have not done any tests with TensorFlow though. In general, compilation should always yield optimal performance, but of course, it is less convenient.
Ray Donnelly says
> In general, compilation should always yield optimal performance
I disagree. Anaconda compiles software very well; most naive attempts at compilation will result in slower binaries. OK, you can get CPU flags more suited to your hardware, such that AVX-512 may become available, but that isn’t what is used; your GPU is, and that will be the only bottleneck, and it is driven by CUDA, which is closed source, and GPU drivers, which are also closed source.
Tim Dettmers says
That makes sense. Thank you for your feedback! I will incorporate this into the next update.
edison says
Hi
Will you test the RTX 2060 (6GB) in the future? And when?
thx 🙂
Tim Dettmers says
I will probably look for some benchmarks next weekend and then push an update which also includes the Titan RTX.
Kevin says
Ditto on the request for including the 2060 🙂
George M says
Here’s a good start: Eric Perbos-Brinck has graciously shared his benchmarks on CIFAR, and actually got better results on a 2060 (in FP16 mode) than a 1080ti. Here’s the link:
https://towardsdatascience.com/rtx-2060-vs-gtx-1080ti-in-deep-learning-gpu-benchmarks-cheapest-rtx-vs-most-expensive-gtx-card-cd47cd9931d2
I finally took the plunge and bought a dual 2060 machine, and can confirm similar results. Note we both have dual card systems so they are running on 8 PCIE lanes. A single card at 16 may run faster, though not by much as Tim noted earlier.
I will say this, the thermals with dual 2060’s (3 120mm fans) are not good. I got up to 85C on one card and I’m sure that automatically throttles the speed, not to mention being bad for long-term life span. Will have to experiment with fan speed curves, extra fans, a new case or even a new cooling solution.
Michael says
+1
Stanley Chen says
+1
Rushi says
+1
Long-Van says
+1 for comparision !
Mikolaj says
+1
Alex Dai says
Hey Tim!
Great article!
I followed your recommendation in buying a RTX 2070, however, when testing it out straight out of the box using some benchmarks, it seemed to be performing noticeably worse than the 1080Ti, even when utilizing half precision in both training and inference.
The benchmarks I used are contained here:
https://github.com/ryujaehun/pytorch-gpu-benchmark
Is this because the benchmarks weren’t utilizing tensor cores?
Or is it because I am missing some fine tuning steps of my GPU?
Thanks,
Alex
Tim Dettmers says
The benchmarks that you linked show that the RTX 2080 Ti (16-bit) should be about twice as fast as a GTX 1080 Ti (32-bit) for ResNet-152. To see what kind of GPU kernels were utilized you could run "nvprof python your_program.py --arg1 abc --arg2 def"; this will log the kernels that were used. What you want to see is that 16-bit/half-precision Tensor Core kernels were used (their names typically contain strings such as "884" or "1688", though this depends on the CUDA/cuDNN version). If this is not the case, something might be off with your configuration/install.
Alex Dai says
Is the 1080Ti unable to utilize FP16?
Some of the benchmarks you’ve linked have shown some results for the 1080Ti for FP16.
Is the 30-40% figure you cite for the 2070 in FP16 vs. the 1080Ti in FP32? If so, is that a fair comparison?
The benchmarks you linked:
https://github.com/u39kun/deep-learning-benchmark
https://github.com/stefan-it/dl-benchmarks
Show that the RTX is ~10-15% faster when both the RTX 2070 and 1080Ti are in FP16 mode.
https://imgur.com/a/lClO6iU
Thanks again, Tim!
Tim Dettmers says
The GTX 1080 Ti does not support 16-bit computation. If you use 16-bit, what the code does is cast the 16-bit values to either 24 bits (in some matrix multiplication kernels) or 32 bits (all other code) and then perform 32-bit computation. The results are then cast back to 16-bit. This is not any faster than 32-bit execution in most cases.
RTX cards are really bad at 32-bit computation. A fair comparison would be 16-bit vs 16-bit, but since 16-bit computation is 32-bit computation under the hood for the GTX 1080 and lower, the comparison is still quite fair.
Alexandre Soares says
Hi, everyone who’s reading this post. I know the Titan V is not recommended by Tim, the writer. However, would the Titan V be a good deal if it was being sold used for $1,200? Or would buying a new 2080 ti be a better choice?
Tim Dettmers says
A Titan V for $1,200 would be a good deal. The Titan V is more powerful than the RTX cards and does not have so many issues with cooling. If you can get one for $1,200 I just go for it!
Alexandre Soares da Silva says
Thank you for your answer, Tim. This is a great source of suggestions and recommendations!
Richard S. says
Hey, I’m starting a long-term project in deep learning/neural nets and am thinking about buying a GPU for work and (personal) learning at home. A bit of time into the project, my uni will acquire a dedicated PC for it, so I’m wondering whether a GTX 1070 would be a sensible option for now and for myself.
As far as I see, the RTX 2070 has substantially better performance in deep learning applications, but I’m unsure whether that is important for learning and testing my ideas for the project; right now I’d get a 1070 for around 350€, and a 2070 for around 520€.
What do you think?
Tim Dettmers says
A GTX 1070 is an excellent option. If money is a constraint then a GTX 1070 is a very good option — go for it! However, note that memory might be a problem sometimes (you cannot run the biggest models), but there is no cheap option with which you can do this. So I think a GTX 1070 is the best option for you.
Damien says
I think you should revisit your performance measurements… The 2070 shows very similar deep learning speed to the 1080 Ti, with less memory. I don’t see how it could score higher in any reasonable ranking… Or just link to (or perform) benchmarks!
Tim Dettmers says
I base my results on 7 benchmark results which I link in the blog below Figure 3. Two rigorous benchmarks indicate that the RTX 2070 is about 30% faster than a GTX 1080 Ti for convolution. For LSTMs, the RTX 2080 Ti is about 40% faster than a GTX 1080 Ti, and for an RTX 2070 this is about 30%.
Jerry says
Hi Tim,
Thanks for your effort to write these helpful guides.
I am going to buy an RTX 2070 with a blower fan first since I just graduated this year. When I have sufficient money to buy an RTX Titan in 2019, will it be possible to speed up deep neural network training with these two different models of graphics cards?
Tim Dettmers says
You will not be able to parallelize an RTX 2070 and a RTX Titan together, but you will be able to run separate models on each of those GPUs without any problem.
Mikhail says
This is excellent, thank you so much for the insight!
Answered my question about whether I should get RTX 2070 🙂
Rinish says
Hi Tim,
I have been trying to install TensorFlow GPU and CUDA 10 but with no success. Can you help me with the process or point me to some source that would be helpful? I am using Ubuntu 18.04 and my graphics card is an RTX 2080.
George M says
Just learned that the 2080 ti (not the Titan RTX) runs FP32 accumulate at half speed:
https://devtalk.nvidia.com/default/topic/1042897/cuda-programming-and-performance/is-geforce-rtx-2080-slower-than-geforce-gtx-1080-on-small-matrix-matrix-multiplication-/
I’m not familiar enough with mixed precision training to know – is this significant?
George M says
Correction: this should read FP32 *during mixed precision training mode.* Of course regular FP32 runs fine.
Rahul Sangole says
Tim,
I’m setting up my first deep learning system. As I search for RTX 2070 on Amazon, there seem to be many choices – EVGA, Zotac, MSI. Does it matter which rtx 2070 I pick up?
Thanks
Tim Dettmers says
Not really. However, if you want to get multiple RTX 2070s, I recommend going with a brand that offers blower-style fans or even all-in-one water cooling. Otherwise, it does not matter.
Rahul S says
Thanks for the response Tim. I did end up setting up my first system with a Gigabyte RTX 2070.
Setting everything up is certainly no joke. Lots of version-compatibility issues I had to resolve before getting Python, Keras and TF to recognize the GPU. Finally having got over that hurdle, I still can’t get R or PyTorch to recognize the GPU.
While I attack that issue myself, could you point me towards how we leverage “lower precision” training on this GPU to double the memory? Where can I learn more about this?
Tim Dettmers says
Make sure you have installed the correct video driver and that CUDA is visible to the software (try typing nvcc into your terminal). The easiest way to make sure that everything is working is to install PyTorch via Anaconda. To use 16-bit in PyTorch, it can be as simple as calling .half() on your model. For more info you can see the NVIDIA 16-bit repo: https://github.com/NVIDIA/apex
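A minimal sketch of plain .half() usage (assuming a CUDA GPU; for training, the Apex mixed-precision route is more numerically robust than casting everything):
import torch
import torch.nn as nn

# Cast the model and inputs to half precision; on RTX cards the matrix
# multiplications can then run on Tensor Cores.
model = nn.Linear(512, 512).cuda().half()
x = torch.randn(64, 512, device="cuda").half()
print(model(x).dtype)  # torch.float16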
Rahul Sangole says
Thanks Tim. I was able to get Python+R+Tensorflow+Keras working. Not able to get Pytorch working, but no matter. I’m happy with the setup so far.
Within Tensorflow, is running half precision on the RTX2070 card a matter of following the instructions in : https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow
Tony Qi says
Hi Tim,
Your article about choosing GPUs is excellent.
I’m a master’s student who recently got into the machine learning area, focusing on deep learning models. My lab doesn’t have the hardware environment and I need to make a hardware list. This area is totally new for me and I haven’t gotten a budget yet, so I am just making a list. So I wonder: is the RTX 2070 enough, or would two GTX 1080 Tis be better?
Thanks
Tony
Tim Dettmers says
Hi Tony, two GTX 1080 Ti will be better, but probably also more pricey. If you can get two GTX 1080 Ti at the price of one RTX 2070, definitely go for the GTX 1080 Tis.
Tony Qi says
The 1080 Ti is almost 1100 CAD, the 2070 is about 800 CAD. So I would choose two RTX 2070s. Better than two 1080 Tis? Future study will focus on deep learning models, like compression or parameter optimization. The RAM of the 2070 is 8GB, slightly smaller than the 1080 Ti (11GB). Could it satisfy the basic needs?
Yaohua Guo says
Hello, I am an NLP developer from China. Thanks for sharing.
I have a question: can multiple GPUs make the total RAM larger? For example, for BERT the authors say it can only run on a GPU with more than 12GB of RAM (like the GTX 1080, etc.). Can I use two RTX 2070s to do this? Can 2x 8GB run BERT-base?
Is the RAM of two GPUs like the RTX 2070 (8GB each) equal to one GPU with 16GB of RAM?
thanks for your answer!
Tim Dettmers says
Unfortunately, this does not work this way. What you describe would be some form of model parallelism. However, there is no deep learning framework which supports model parallelism in a straightforward way. Thus the only option is to get a GPU with larger memory and/or to use 16-bit computations and weights.
Yaohua Guo says
Thanks for your reply. I still do not understand: can two RTX 2070s share RAM? A big model like BERT needs a lot of RAM; can we use both GPUs’ RAM and just one GPU to compute?
If we can’t do this, why is the RAM of all GPUs allocated when I run a model, with only one GPU computing?
Thanks for your answer.
Tim Dettmers says
Two RTX 2070 generally cannot share RAM. With special code this is possible, but this is too cumbersome to really be practical. If you want to run BERT and you run into memory problems then the best bet is to get a bigger GPU. The new Titan RTX might be a good fit for you in this situation, or you can get a RTX 2080 Ti and work with 16-bits.
Yaohua Guo says
Thanks for your reply. I have another question: how can I make a model work with 16-bit? Do I use tf.float16 in TensorFlow? And since BERT has been pre-trained, how can I rebuild the model in 16-bit and use the pre-trained checkpoints which the author shared on GitHub? Can we do this?
THANKS!
Yaohua Guo says
Will a 16-bit model be harmful to model performance? If so, by how much?
wilkas says
How come the GTX Titan (Pascal) is listed as a cost-efficient and cheap option, if even a pre-owned card without warranty costs more than a brand-new RTX 2070, which is more capable?
Tim Dettmers says
That is a good point, I might have made a mistake here. I should not recommend the GTX Titan X (Pascal) anymore. It could also be that the RTX 2070 was more expensive and the GTX Titan (Pascal) cheaper on eBay at the time. I will fix this later today.
Umit says
Hello Tim,
Thanks for the great post. I am currently doing research on GANs and other DNNs. You mentioned that we should look at CUDA cores as well as FLOPS. Clearly the memory of the card is another thing to consider. Comparing the two, a 1080 Ti has 3584 CUDA cores vs. a 2080 which has 2944 CUDA cores. I see that the 1080 Ti has a lower clock speed than the 2080. Is this why there is such a large difference in performance? Because other than the clock speed it seems the 1080 Ti wins in all other categories, including a 3 GB memory advantage. I was able to get my hands on an MSI 1080 Ti Gaming X which is stable at ~1.8 GHz overclocked for ~$630 USD. The cheapest I’m finding a 2070 is ~$500 and a 2080 is ~$700. Do the 2070 and 2080 really outperform the 1080 Ti? All the gaming benchmarks rank the 1080 Ti higher.
Tim Dettmers says
Hi Umit. What I meant with cores is Tensor Cores. CUDA cores do not map very nicely to compute performance. The RTX 2080 is faster despite the small difference in clock etcetera because it has Tensor Cores which speed up computation for 16-bits. To understand more about this issue you can read my blog post about TPU vs GPU which also discusses low-bit computation and why it is such a big advantage.
Umit says
Ah, so while the 1080 Ti will outperform the 2070 and 2080 in gaming, when we are training nets both RTX cards will outperform the 1080 Ti since they have Tensor Cores? Do you see any benefit in the memory size of the 1080 Ti vs. the 2070/80 cards in terms of not running out of space when training? In the past I worked off a 1050 Ti and have run into the error that a batch size of 1 would not fit into memory.
Tim Dettmers says
Since you run models in 16-bit with RTX 2070 cards you will consume less memory than with 32-bit training. As such there is only a small difference in memory compared to a GTX 1080 Ti.
Umit says
Hello Tim, $100 here and there does not bother me much. Would you recommend a 1080 Ti or a 2080 for deep learning? You say the 2080 is faster, but can it handle as much data at once? Thank you.
Tim Dettmers says
Hi Umit, if you get an RTX 2080 you would use 16-bit training and as such you would be able to store much more data on the GPU than with 32-bits. In terms of throughput of data, the RTX 2080 will also be better. However, both GPUs are great — if you find a good offer for a GTX 1080 Ti this might be very well worth it over an RTX 2080!
Xunfei says
Thanks a lot, I will get an RTX 2070. Just for others’ information, the RTX 2070 has encountered lots of crash/artifact incidents so far.
Tim Dettmers says
Same for other RTX cards. The issue arises if you use multiple of them. Try to get an RTX 2070 with a blower-design fan — that should be a bit better in any case. If you can find an all-in-one water-cooled solution I would buy that!
andrea de luca says
Could you elaborate a bit further about that artifact-crash thing? Thanks.
Xunfei says
Refer to this post
https://wccftech.com/geforce-rtx-2080tis-are-dying-and-there-are-different-rtx-2070-chips/
Also, plenty of reviews on Newegg/Amazon claim the card died after one or two weeks of use. So I suggest that if you want to buy it, buy from a notable retailer only, to save the hassle if any issue occurs.
Amit Garud says
Hi Tim,
Now that the RTX Titan is announced today for $2500, would you recommend that over a pair of 2080TIs with nvlink for the same price? I understand it depends on the network models being used, but is there really a model you know of that can use the 24GB of the single RTX Titan?
Tim Dettmers says
I would not recommend getting the Titan RTX. The performance is not much better than an RTX 2080 Ti but the card costs nearly twice as much. The only reason to get a Titan RTX is the 24 GB memory. If the 11 GB of 16-bit precision is not enough (equivalent to 22 GB of 32-bit precision) then the Titan RTX is a fair choice.
John L. says
Hi Tim,
I found that the RTX 2080 and RTX 2080 Ti don’t have P2P access. When I run the simpleP2P test from the CUDA samples, all of these cards show as capable of P2P, but they cannot access each other. May I ask: without P2P access, how big is the impact on multi-card training? Even though NVIDIA sells an NVLink bridge, only a 2-way bridge adapter is available.
Tim Dettmers says
The problem is that if you have different kinds of GPUs you cannot do direct GPU transfers among them. With the two GPUs that you have and convolutional nets, you should still see okay performance. If you train something like transformers you will have problems with parallelization.
Eri Rubin says
There actually seems to be an issue with P2P on the 2080Tis, we have a few of them, exactly the same card, and can’t seem to get P2P to work.
Tim Dettmers says
Someone mentioned something like that before — I personally have no data on this and I am unable to help you. However, if you can figure out the problem it would be great if you can report back here so others can benefit from your finding. Good luck!
James says
Hello, I was searching the comments but couldn’t find a comparison between the GTX 1080 and RTX 2070, since they have pretty much the same price. I know the rule of thumb: bandwidth if you use RNNs and FLOPS if you use convolutions. So is the GTX better for CNNs?
Sorry if you have to repeat yourself.
Tim Dettmers says
It is not as straightforward as looking at FLOPS because the RTX 2070 has Tensor Cores and 16-bit computation. For CNNs the RTX 2070 should be better if you use 16-bit, but the GTX 1080 Ti will be better if you use 32-bit.
James says
I didn’t mean the GTX 1080 Ti, but the plain GTX 1080 (not Ti). Does what you said still apply to the GTX 1080 (not Ti)?
Tim Dettmers says
Ah, my bad. If I had to choose between an RTX 2070 and a GTX 1080 I would definitely go for the RTX 2070. However, the point about 32-bit processing still applies, although the gap is not as large anymore.
Kevin says
Hi Tim,
Hands down, the best article I’ve been able to find on deep learning hardware. I just have one question though. From what I could tell, you only recommended the blower-style cooler on the 2080 Ti. Did you mean this? If so, why only that model and not the lower models (e.g. 2070)? I plan to start with a 2070 and add more cards as I am financially able (and as my skills progress). Would it be wise to get the blower model in anticipation of running a multi-GPU setup, or does it not matter for the 2070s?
Thanks for your great contributions to the community!
Kevin
Tim Dettmers says
The RTX 2070 has a much lower TDP of 175 watts compared to the RTX 2080 Ti with 250 watts. However, if you have a multi-GPU setup, I definitely recommend blower-style fans for the RTX 2070 as well. I will update my blog post to reflect that.
Bruce says
Hey Tim – thanks for the helpful post.
Have you seen NVDA’s T4 for inference? Any thoughts on it?
What are the switching costs of existing work done on NVDA if you want to swap out the hardware to AMD or a TPU?
Do you think the hyperscale cloud players use the software libraries of NVDA to a similar extent as everyone else? Do they build software around AI that is embedded in NVDA so that the switching costs are pretty high?
Thanks!
Bruce
Tim Dettmers says
The T4, just like Quadro and Tesla cards, is very cost-inefficient and thus I would not recommend it.
Switching depends on the context. I do not think flawless switching is possible if you used NVIDIA cards in an industrial setting. You might be able to switch NVIDIA->TPU if you use TensorFlow. For AMD I am not sure; you will always need to make some adjustments to your code and it will take more or less time depending on the project.
Alfonso Campos says
It’s already been said, but I would be very interested if you had a look at Azure with the flexibility of its latest ML Service & Batch AI. In a nutshell, you can target different compute environments and easily swap between local and cloud GPU clusters (Horovod).
Additionally, you can train on FPGAs. I would like to hear your thoughts on that as well, particularly vs. GPU clusters and TPUs.
Finally, if you could at some point extend the article to include memory usage for completeness, that would also be great.
Thanks for sharing this!
Tim Dettmers says
These are good points and I have not been keeping up on this. I will try to include it in the next iteration of my blog post.
Michel Rathé says
Hi Tim,
We cannot thank you enough for such relevant facts and knowledge.
You make sense of everything we find on the internet, especially at a time when hardware, software, AI and participants seem to be limited only by creativity.
Actually, I got caught up in all this, as did (I presume) a lot of your readers.
The question is: should I leverage my current workstation, or go ‘all in’ on the hype of the next Xeon Cascade that will supposedly dethrone GPUs? And 2, 3 or 4 GPUs in a workstation or cluster (noise, heat and choice of GPU form factor, i.e. 2 or 2.7 slots)?
I mainly use MATLAB, Tableau and Excel on a multi-monitor (3) setup.
I’m a market investor seeking to transfer and expand my knowledge into several aspects of machine learning. Though I cannot scope, at this point, the full size and depth of my projects, I do not want to be limited over a 24-36 month horizon.
I now have an Asus X99-E WS with an i7-5930K (6 cores, 40 lanes), 32 GB of DDR4 (2133), and 2 Asus GTX 750 Tis (I know they are not relevant to machine learning, but I would keep one if needed for the monitors). The OS is Win 7 on a Samsung 850 Pro 512 GB, in a Define R2 case with a 750 W PSU. No water cooling.
Knowing that MATLAB is highly vectorized, I tend to improve performance on the best-practices programming side. I also ran some tests with the Parallel toolbox. I’m mostly experimenting with MATLAB now because I’m still developing so many features (though it’s an ongoing process).
From reading all of your posts, my setup still seems to be relevant.
My first inclination (reaction) was to get a newer platform (i9 or Xeon W) with 18 cores (are those factors really significant?).
From all your posts, the 2080 Ti GPU is a no-brainer (I hope they will correct the current flaws of that GPU). I can run these on the current or the next platform.
I’m looking for the sweet spot to advance to a certain level in my current and future projects.
My first line of thought would be: (1) the current workstation with 1 or 2 Asus RTX 2080 Tis, but it seems that I have to make the right call at the beginning for the model, i.e. Turbo (single fan), in case I expand to 4. But I’m really concerned about the heat. (2) Is water cooling a real game changer for keeping performance and longevity? And should I upgrade to 64 GB of memory (does 2666 really improve over 2133)?
I’d appreciate your overall thoughts and considerations on my objectives as a high-level starting point. I’m mostly at an orientation (making sense of things) phase at this point.
Thanks immensely,
Michel
Tim Dettmers says
Hi Michel, I would stick with your Asus X99-E WS for now and just upgrade your GPUs. You can keep the 750 Ti for your monitors and just add an additional RTX 2080 Ti for your work. Multiple big GPUs are usually only needed if you are already experienced with deep networks and you find yourself limited by runtime. From what I read, you are mostly in a prototyping stage where you would like to experiment with greater flexibility. A single RTX 2080 Ti will not limit you here. I think upgrading your CPU and motherboard will only yield small gains (larger gains when you work on problems beyond deep learning).
I would also suggest getting the blower-style single fan version to avoid heat problems. The ASUS RTX 2080 Ti should be the right option for you (also if you want to upgrade to 4 GPUs). Water cooling for GPUs can be a curse but also a blessing and I would not recommend water cooling in your case where you want to not touch your system for many months (too unreliable).
64 GB of RAM is good but stick to 2133. The 2666 one has no real benefit in performance.
I hope this helps you to orient yourself and come to a decision.
Michel Rathé says
Hi Tim,
Thank you for the precise and valuable information.
That basically confirms what I expected. By splitting the time horizon into 18-30 months we can only expect interesting leaps in hardware/software. On that basis, and from a cost/benefit standpoint, it makes sense to leverage the current setup. What I’m trying to achieve, based on financial wisdom, is to leverage and roll over the hardware on an ongoing basis, up to the point where hardware that is no longer relevant (for me) can still be sold or reused in a complementary setup.
I hope you continue the great work you do in presenting relevant advice to us wishful “deep learners”.
Thanks
Haider Alwasiti says
I am not very interested in running 1 model on several GPUs. I wanted the GPUs to run several models separately.
16 lanes for 2 GPUs, plus lanes coming from the chipset at x4 speed for the 3rd GPU and SSD, etc.
And theoretically, AMD CPUs with 128 lanes can dedicate them to 4 GPU slots and use the extra lanes from the chipset for other stuff, just like the Z390-E Asus motherboard that I have.
wesley faria says
Great article Tim Dettmers, congratulations!!!
I’m a beginner in ML/DL, but I’m working on a challenging project and I need to buy hardware, and I have a lot of questions.
The project aims to recognize tiny objects in images. Let’s say we work with a dataset of around 50 classes, each class with 1000 images, each image 1024×1024 in size. I cannot greatly shrink the images because I need to keep the details.
I need equipment to start the project; in the future I can buy more hardware, but right now I can’t spend much money.
My questions are:
1 – Which NVIDIA card should I use? My options are the GTX 1080 Ti, RTX 2080, or another card with a similar or cheaper price. I would consider using multiple cards.
2 – Do I need a lot of high-speed RAM? I’m thinking around 16 GB DDR4. Is that good?
3 – Do I need an ultra-fast CPU? Is an Intel Core 7th or 8th generation good?
4 – Do I need an ultra-fast disk like an NVMe SSD?
Thanks a lot.
Tim Dettmers says
1) A GTX 1070 Ti, RTX 2070, or GTX 1060 (6 GB) should all be fine for you.
2) 16 GB of DDR4 RAM is good. Clock does not matter. Latencies do matter for some tasks, but not for deep learning.
3) You do not need a fast CPU for deep learning. A 7th-generation CPU with 4 cores is more than you would need.
4) You do not need a fast disk. A simple hard drive would suffice. However, I like to have an SSD so the OS runs more smoothly.
Peixiang says
Hi Tim,
Thank you very much for your detailed and helpful post.
I was planning to buy two used 1080 Tis, but after reading your article I’m thinking about two new 2070s since there are no used ones. The prices are roughly the same. Which setup do you recommend? I run experiments mostly using LSTMs.
Also, right now it’s not easy to directly use 16-bit in frameworks like PyTorch. Do you think using 16-bit will be more prevalent in the near future (end of 2018)?
One final question :) I’m planning to buy a used E5-2670 v2 and 4x16 GB ECC 1866 MHz memory. Are the CPU and memory the bottleneck of my system?
Thank you
Peixiang
Tim Dettmers says
Everybody will use 16-bit soon — there is no reason not to do so. I would probably go with the RTX 2070s. The CPU looks more than fine to me — you should have no problem with the CPU being a bottleneck. One thing to note, though, is the DDR3 memory. If you are preprocessing a lot of data, DDR4 memory might be nicer. But on the other hand, you get a fast and cheap CPU with that. I think it could work quite well!
Subash says
Hi Tim,
I want to set up a build with RTX 2070 GPUs; however, in the market I see different vendors (ASUS, Gigabyte, PALIT, EVGA, MSI, Zotac, etc.) and different models as well, like GeForce, OC Edition, Turbo, ROG Strix, etc. Do you have any specific preference in terms of a particular brand or model for an ML/DL setup, or is the difference negligible?
Regards
Subash
AV says
For multi-GPU setups, go with a card that blows hot air out of the case. These are called Turbo, Blower or Aero (MSI). I have only tested the Asus 2070 Turbo and MSI 2070 Aero, and the latter had a 5 degrees lower temperature.
AV says
Hi Tim,
Did you see this test by Puget systems last month? https://www.pugetsystems.com/labs/hpc/RTX-2080Ti-with-NVLINK—TensorFlow-Performance-Includes-Comparison-with-GTX-1080Ti-RTX-2070-2080-2080Ti-and-Titan-V-1267/
What do you think explains the difference between your estimate of the 2070/2080 Ti performance gap and theirs? You suggest that the 2070 is not far from the 2080 Ti in terms of performance with RNNs. Puget suggests that the 2080 Ti is about twice as fast as the 2070. Many thanks.
Tim Dettmers says
See the note in the blog post, that the memory was not large enough to run a larger batch size on RTX 2070 and RTX 2080. If you compare both RTX 2070 and RTX 2080, both of which cannot run larger batch sizes, then you see that they have the same performance on this task. If you would use a smaller batch size for the RTX 2080 Ti, you would probably see the same result.
AV says
But why would you use a smaller batch size with the 2080 Ti?
Tim Dettmers says
A smaller mini-batch size yields a faster descent to a local minimum. Then you usually turn off momentum and you may also increase the batch size. People in NLP often use batch sizes around 32.
xu says
Hi Tim, you suggested: “If you already have a GTX 1080 Ti or GTX Titan (Pascal) you might want to wait until the RTX Titan is released. Your GPUs are still okay.” However, I already have a GTX 1070 Ti. Should I upgrade to an RTX 2070 or wait until the RTX 2080/2080 Ti’s price stabilizes?
Tim Dettmers says
It might be worth waiting a bit more to see how the prices play out. I would also only upgrade if you feel unsatisfied with your current GPU and have the spare money. Upgrading just to get a faster GPU is often not the right reason for a new GPU. If you find yourself limited by the GTX 1070 Ti some of the time, then that is a good reason to wait for prices to stabilize and go for an RTX 2080 / RTX 2080 Ti.
Subash says
Hi Tim,
Thank you for the wonderful article.
I am setting up with the config mentioned below. I am planning to start with one RTX 2070 initially and plan to add one more 8-12 months down the line. My primary objective is to learn ML & DL and start doing some Kaggle projects for 1 or 2 years to gain expertise.
Should I spend money on an Intel i7 now, considering that I may get additional GPUs in the future, and would it be a bottleneck with 6 cores & 12 threads? Or is it better to get an AMD Ryzen, where the additional cores & threads would be helpful when I add more GPUs?
ASUS ROG Strix RTX 2070 GPU & Z390-E motherboard
32 GB RAM
1 TB M.2 SSD & 2 TB SATA
Intel i7-8700K
Thanks in advance.
Cheers,
Subash
Tim Dettmers says
The CPU does not matter too much; both options will give you more than enough power to use two GPUs efficiently. A better CPU is particularly good if you do a lot of preprocessing, that is, wrangling data in Kaggle competitions — so it might be worth it for you for that reason, but not for the reason of keeping your GPUs busy.
Leo says
Hi Tim,
If I do CNNs and R-CNNs, what card should I get? I have a budget of 1000-1300 USD. Should I get 2 GTX 1080 Tis for 1200, or 2 RTX 2070s for 1000? My motherboard will fit 4 GPUs for any future upgrade, but I am not planning on upgrading anytime soon. Thank you!
Tim Dettmers says
I would definitely go with two RTX 2070. They are a much better long-term investment.
Leo says
Does the fact that RTX 2070 doesn’t support NVLink SLI bridge affect the power of using multiple 2070s together?
Tim Dettmers says
NVLink is a bit useless for the RTX series. For 2 GPUs it does not matter if you use it or PCIe. For 4 GPUs it would give you a benefit, but since only two GPUs can be coupled at any time, one would need complex communication patterns to get a payoff. So no: The lack of NVLink has no effect on RTX 2070 parallel performance.
Tyler says
Thanks for this valuable info. I was thinking of getting a 2080 instead of a 2070 because the latter does not support NVLink. Now I will definitely go for the 2070; actually, I just placed an order.
chris says
Tim, can you help me choose? Should I buy (1 RTX 2070) or (2 GTX 1070 Ti), and why?
I mainly train models using FP32 VGG16, ResNet50 & AlexNet. I use TensorFlow.
Waiting for your reply.
Tim Dettmers says
It is a big change to train your models in 16-bit, but it’s worth it because it will be the standard in the future. If you want no hassle and just want to do your work, two GTX 1070 Tis might be simpler to use. Also note that two GTX 1070 Tis are much faster if you are training two independent models at the same time. On the other hand, the RTX 2070 will yield good performance, and with 16-bit training you will have much more memory. If you want to train big models this could be an important point.
thanh tung hoang says
Thank you for your tutorial and advice. I currently have a desktop with a GTX 1080. I plan to add another card next year but 1080 is no longer on sale. Can I run a model on a system with a 1080 and a 2070/2080?
Tim Dettmers says
Yes, that works without a problem if you do not use parallelism. If you want to parallelize across GPUs you can only do that on multiple 1080 or multiple 2070/2080. I do not think parallelism is very necessary, but if you think it is you could try to find a used GTX 1080 and buy that card.
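As an illustration of the non-parallel setup, here is a minimal sketch (my own illustration; the device indices and tiny models are placeholders), where each model and its data simply stay on one card:

```python
# Two independent models, one per GPU; any mix of cards works in this mode.
import torch
import torch.nn as nn

model_a = nn.Linear(512, 10).to('cuda:0')   # e.g., the GTX 1080
model_b = nn.Linear(512, 10).to('cuda:1')   # e.g., the RTX 2070

x_a = torch.randn(32, 512, device='cuda:0')
x_b = torch.randn(32, 512, device='cuda:1')

out_a = model_a(x_a)   # computed on GPU 0
out_b = model_b(x_b)   # computed on GPU 1
```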
Mario says
Hi Tim,
What were the price points used for each GPU in the comparison? This would have been an excellent piece of information.
Right now, like many, I am not clear about the GTX 1080 Ti vs. the RTX 2080. Gaming versions of the GTX 1080 Ti are about €700 in the EU and the RTX 2080 is about €750. The eBay market is CRAZY here — GTX 1080 Tis are selling for €600 on eBay. Then there is the issue that warranties are not transferable for second-hand purchases.
All in all, this is going to be a dilemma.
Mario
Tim Dettmers says
I used eBay and Amazon in the US. I could have added these data as well, but the charts already contain too much information. Maybe I will add just a price chart so people can compare with the prices in their own countries.
AV says
Thank you, Tim. What about PCIe 3.0 at x8 speed? Does that impact RTX performance compared to the x16 setting? Some C422 motherboards are able to divide 48 lanes over six PCIe slots running at x8. Would you recommend running six 2070s in parallel at x8? Many thanks!
Sam Karopoulos says
According to this article (https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-with-4-x-Titan-V-GPUs-for-Machine-Learning-1167/#results), there is negligible difference between using x16 and x8 connections. The top comment there explains that the theoretical limit for x8 is 8 GB/s and suggests that unless you are using a batch size larger than 8 GB, there will be negligible difference.
Tim Dettmers says
Indeed, people usually overestimate the importance of lanes. I have stated that before. However, the article might be a bit misleading because for certain architectures the penalties for fewer lanes can be much larger. For example, for VGG lanes are a bit more important than for GoogLeNet or ResNet. The difference, however, only matters if you are using 4 GPUs. For 2-3 GPUs, x8 lanes are just fine, and for 4 GPUs there are no standard options that give you 16x/16x/16x/16x setups anyway. For LSTMs there might also be a slowdown if you use shorter sequences, but with longer sequences lanes are also a non-issue.
However, the argument about a batch size larger than 8 GB makes no sense really. You will see a performance penalty long before that.
andrea de luca says
TBH, there are some standard options that give you 16x/16x/16x/16x: think of the X99-E WS, to be bought used on eBay (or the C422-SAGE if you want socket 2066). Note that they support 16x/16x/16x/16x regardless of whether the processor has 40 or 48 lanes.
Nikolaos Tsarmpopoulos says
This is not entirely correct.
The X99-E WS uses the PLX PEX8747 PCIe switch, which communicates with the CPU using 32 bi-directional lanes of PCIe 3.0. The switch also communicates with each GPU using 16 bi-directional lanes of PCIe 3.0.
Hence, the CPU can’t transmit-only, or receive-only, data to/from 4 GPUs using x16 lanes per GPU concurrently. It is limited by the 32 lanes to the switch.
However, if the software schedules the data transfers so that only two concurrent READs and two concurrent WRITEs take place in parallel (at most), the algorithm will take advantage of the PCIe switch.
Note that NVIDIA’s recent drivers for GTX graphics cards have been unstable in Windows 10 since August 2017, causing BSODs. The latest stable driver for 3 or 4 GPUs on motherboards featuring the PLX PEX8747 switch is version 382.53. This is a known issue and NVIDIA does not seem interested in fixing it.
andrea de luca says
Thanks Nikolaos, these are useful pieces of information. I suppose the same issues affect the C422-SAGE 10G (it employs two PLXs). I was not aware of the driver problem. Note that 382.53 is insufficient for using the latest versions of the most popular frameworks, not to mention Tensor Cores, FP16, and so on.
So, apart from dual-processor boards, I reckon that no motherboard can, as of today, allow 16x/16x/16x/16x. Not even Threadripper boards.
Nikolaos Tsarmpopoulos says
To the best of my knowledge, all motherboards that feature the PLX PEX8747 are affected.
Regarding dual-processor systems, each CPU has access to half of the system RAM. The CPUs (and the devices attached to different CPUs) exchange data via Quick Path Interconnect (QPI) or UltraPath Interconnect (UPI), with a maximum aggregate throughput of about ~9-10 GT/s or ~20 GB/s.
In comparison, PCIe 3.0 x16 delivers ~15.8 GB/s. Two GPUs using x16 lanes each could facilitate ~31 GB/s.
That means that if 2 GPUs connected to CPU1 need to be fed with data stored in system RAM attached to CPU2, the data transfer will be limited to ~20 GB/s by QPI or UPI.
Hence, to take advantage of systems with multiple GPUs and CPUs, the software needs to be NUMA-aware (non-uniform memory access).
andrea de luca says
Thanks.
Just out of curiosity, how did NVIDIA manage to overcome such limitations on its DGX Station deep learning workstation?
Nikolaos Tsarmpopoulos says
I think NVLink does the magic.
Haider Alwasiti says
There are these new AMD CPUs announced a few months ago that give you 64 lanes without a PLX. So I think if you want 4 GPUs, then AMD is your friend.
AMD Ryzen Threadripper 2990WX
32 cores/64 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,799
Availability: Aug 13, 2018
AMD Ryzen Threadripper 2970WX
24 cores/48 threads
4.2GHz boost/3.0GHz base
64MB L3 cache
250W TDP
64 PCIe Gen 3.0 lanes
Price: $1,299
Availability: Oct 2018
AMD Ryzen Threadripper 2950X
16 cores/32 threads
4.4GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $899
Availability: Aug 31, 2018
AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018
Source:
https://www.zdnet.com/article/amd-unveils-world-record-breaking-intel-beating-2nd-generation-ryzen-threadripper-processors/
It seems Intel cannot (or does not want to?) make CPUs with 64 lanes.
For the NVIDIA DGX-2 with 16 GPUs, they seem to rely on the NVLinks between the V100s. They are using 2 Xeon processors (dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores), each with 48 lanes (6 lanes/GPU?), but that is not a problem with NVLink.
source:
http://images.nvidia.com/content/pdf/dgx-2-print-datasheet-738070-nvidia-a4-web.pdf
DGX-1:
2x Intel Xeon E5-2698 v3 (16 cores, Haswell-EP) with 40 lanes each
8x NVIDIA Tesla P100 (3584 CUDA cores) with NVLink
= 10 lanes/GPU
But again, there are NVLinks between the 8 P100 GPUs.
source:
https://www.anandtech.com/show/10229/nvidia-announces-dgx1-server
I think only the new AMD processors can do 4 GPUs efficiently for us without server-grade Tesla GPUs.
My question, though: if we go for the cheapest AMD with 64 lanes, how is the performance and, most importantly, the compatibility of AMD CPUs with the DL frameworks? Like this one at $650:
AMD Ryzen Threadripper 2920X
12 cores/24 threads
4.3GHz boost/3.5GHz base
32MB L3 cache
180W TDP
64 PCIe Gen 3.0 lanes
Price: $649
Availability: Oct 2018
That said, I have recently purchased the newly announced Core i9-9900K ($550) for building a system with 3 GPUs.
I just don’t feel comfortable going for AMD CPUs, fearing compatibility issues and lower performance for DL or other compute tasks that I am interested in (rendering, Ansys simulations, etc.).
andrea says
Mh, I was thinking: suppose you have a mainboard which supports even less than x8 (e.g. a mixture of x16, x8, and x4 slots, perhaps x1), but has enough room to accommodate 4 cards.
Can you still attain full speed with NVLink?
andrea de luca says
@Haider:
note that any Threadripper motherboard allows (at best) for 16x/16x/8x/8x, and that’s because a good number of lanes are used for storage and other similar stuff.
andrea de luca says
@Haider:
How do you attain 3 GPUs with a 9900K? It just has 16 lanes…
AV says
Hi Tim, can you please comment on the necessity of NVLink for parallel training? NVLink is missing on the 2070. Do the benefits differ by network architecture? I found this presentation https://www.dcs.warwick.ac.uk/pmbs/pmbs/PMBS/pres/paper1.pdf
Is there any truth to it? Thanks so much for all your input!
Tim Dettmers says
Currently, NVLink is only usable for 2 GPUs on RTX cards. For 2 GPUs NVLink is quite useless, because each GPU just has to send 1 message, which can be done in parallel. For 8 GPUs you need to send 7 messages, of which only 2 can be sent in parallel on each PCIe root complex. This means that you need to wait for the equivalent of at least 8 messages, so the bandwidth requirements for equal performance are 8 times higher than in the 2 GPU case. What this means is that NVLink is great for 8 GPUs but unnecessary for 2 GPUs.
Jannes says
In the “TL;DR: I am an NLP researcher” section you mention a “RTX 2070 Ti” which is not a thing yet. Did you mean the RTX 2070 or the RTX 2080 Ti?
Tim Dettmers says
Thanks, I correct this right away.
Sam Karopoulos says
Hi Tim – thanks for updating your article – it’s super helpful 🙂
I’m trying to build the most cost efficient machine learning workstation possible. Based on the August 21st revision of your article, I originally was planning the following build:
https://pcpartpicker.com/list/8Pd9Bb
I didn’t include storage as I already have plenty of SSDs and HDDs to use (and my understanding of machine learning is that the storage speed doesn’t matter much). There are 4 MSI RTX 2080 Sea Hawks in this build, with PCI-e extenders on two of the cards to make it all fit. I chose these as I’ve read that a blower card or a card with AIO cooling works better to keep a multi GPU build cool. I chose a Threadripper 1900X for its 64 PCI-e lanes. The Gigabyte X399 Designare EX has 5 PCI-e x16 slots: 2 run at x16, 2 run at x8, 1 runs at x4. I’ve read that x8 runs at about 90% efficiency compared to x16 for ML, which is an acceptable hit to performance for me.
With the November 5th revision of your article, I’m now considering using RTX2070s instead of RTX2080s. However, it looks like there aren’t any RTX2070s which come with AIO cooling, so I’m worried about cooling.
I’d really appreciate it if you could provide some advice about how you would build the most cost effective machine learning workstation possible.
Thanks 🙂
Tim Dettmers says
Look out for new RTX 2070 cards as they come out. Also, have a look at gaming reviews and see if some cards have better temperatures. Also note that the RTX 2070 has lower wattage, so it should generate less heat and thus be more manageable than other cards.
peter says
Hi Tim, how are these GPUs ranked relative to each other in terms of doing DL on a laptop (NVIDIA 1050 Ti Max-Q or Quadro P1000, P2000, all 4 GB)?
George M says
Thanks for this helpful article. I have a very wide-open question but I will do my best. What kinds of models are currently stretching (or will soon stretch) the limits of a single video card? By that I don’t mean models that run slow, I mean they simply will not fit regardless of FP16/batch size/accumulation type workarounds. I am just getting started in DL so it’s mostly a theoretical question at the moment, but I may buy my own hardware in a few months. I’d prefer to wait until I at least know what I’m doing and the current epidemic of RTX hardware failures gets somewhat resolved.
TLDR, I want to get a sense of whether I can get by with a single 8gb 2080 and the aforementioned techniques, or whether the day is fast approaching (or maybe already has…) when there are many important models that require a dual 2080ti system or more to run at all. I do not have a specific use case in mind yet, so let’s assume all fields are fair game – images, language, audio, video, etc. (Told you it was a wide open question) What are your thoughts?
Tim Dettmers says
I think if you use 16-bits then you should be able to run anything because it virtually extends the memory to 16 GB while the largest consumer-grade GPUs are currently at 12 GB — so you should have no problems. If people come up with even larger models then you can just use the tricks that I mentioned and you will be fine.
Bull Shark says
Which GPU would be best to choose, spec-wise:
- RTX 2080 Ti
- less VRAM
- faster?
- AMD Pro Duo 32 GB (yes, I know, PlaidML, but I’m just talking spec-wise)
- more VRAM
- slower?
I’m mostly interested in training really large deep learning models. For these models 11 GB is not enough. However, the AMD card might be so slow that training a large model on it isn’t much good either…
What would you recommend? I expect to be using the hardware pretty intensively, so cloud solutions are not a cost-efficient option. Maybe you know of other faster high-memory GPUs <1500?
Thank you in advance!
Tim Dettmers says
Buy an RTX 2080 Ti and train with 16-bit only (virtually doubles memory to 22GB). If that is not enough memory: (1) use a small batch size, (2) accumulate gradients over multiple batches, (3) apply the gradient. With this, you should virtually have about 30-40GB of memory.
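To illustrate the small-batch plus gradient-accumulation trick, here is a minimal sketch (my own illustration; the model, dummy data, and accumulation factor are placeholders):

```python
# Accumulate gradients over several small batches, then apply one update.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(8, 1024), torch.randint(0, 10, (8,))) for _ in range(16)]  # dummy data
accumulation_steps = 4   # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = F.cross_entropy(model(x.cuda()), y.cuda())
    (loss / accumulation_steps).backward()     # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # apply the accumulated gradient once
        optimizer.zero_grad()
```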
Bull Shark says
Thanks Tim!
I’m now considering the following options:
- upgrade my old system with an RTX 2080 Ti. It has an i5 4670, 8 GB of DDR3 RAM, no SSD, and a motherboard with the older PCIe 2.0 (will it bottleneck?)
- build a new PC with something like an i3 or a cheap Ryzen 2 processor, 8 GB DDR4 and an SSD, with the RTX 2080 Ti and a motherboard with PCIe 3.0
- same as above but with 2x 1080 Ti
- build a new PC with a Ryzen 2700X, 16 GB DDR4, an SSD and a GTX 1080 Ti.
andrea de luca says
I’ll give you my two cents.
1. In my experience, FP16 cannot double your memory, since you cannot train in FP16 alone. What you will do is a mixture of FP32 and FP16. You will have benefits in terms of memory, but you won’t double it.
2. If possible, avoid AMD processors. I’m sorry to say this, but with an Intel CPU you can benefit from Intel MKL, which helps you with a lot of collateral tasks.
3. Gen2 x16 won’t bottleneck you, but 8 GB of RAM definitely will. With 22 GB of VRAM, which becomes more in mixed-precision training, I’d settle for 64 GB of RAM, and not less.
Bull Shark says
So which of the four options would be best? I can upgrade my old system’s RAM to a max of 32 GB DDR3. There is also no second PCIe slot for 2x GTX 1080 Ti, so that option only belongs to the ‘build a new PC’ options.
Bull Shark says
Would it, for example, be a good idea to put an RTX 2080 Ti in my old system? Or is this system just too ‘old’ to live up to the power of the 2080 Ti, and would it be a better idea to just build an all-new PC?
Bull Shark says
What would you think about this Tim?
Nguyen says
Hi Tim,
for various reasons, we can only order a workstation from the Dell website. We only have a few choices of Quadro cards, such as 2x GP100 (16 GB) and 3-4 other cards (P4000, P5000, or P6000). We will do a lot of image processing and machine learning. What would you suggest: the dual GP100 (+ 10.868,00 €), 3x P5000 (+ 3.956,55 €), or 3x P6000 (+ 11.188,45 €)? We can pay if we have to, but we also prefer to do the same for less. Thanks in advance! Best, TR
Tim Dettmers says
These are all terrible options, I would not recommend any of them. Consider getting a Hetzner machine with a GTX 1080 ($130 per month) for prototyping and then rent AWS/Azure/TPUs to run jobs once you prototyped your models. You might also like to privately buy an RTX 2070 equipped desktop for prototyping and then just run cloud jobs on AWS/Azure/TPUs.
Anuj says
Why is it always advised to prototype models in-house and not in the cloud? I would like to know the negatives of prototyping a model in the cloud and then buying a workstation to tune the hyperparameters.
Tim Dettmers says
To make prototyping cost-efficient you would need to ramp up and shut down a spot instance every time you made a fix/change. This would make prototyping very slow. Debugging your models locally is much more cost-efficient. You can do this on a cheap/slow GPU (which will still be more than fast enough for debugging). Once you have debugged your model you can run it in parallel on a big cloud instance.
If you train a lot of models, it is always more cost-efficient to buy a workstation, even if you do extensive hyperparameter searches. However, if time is short or you do not run models frequently, the cloud will be faster or more cost-efficient.
I guess it all depends on the exact setup. I am sure there are cloud providers which offer a reasonable latency to power up an instance for prototyping, but it is cumbersome and requires an initial time investment to set up such a development procedure. So I guess what you propose can make sense, but it is just difficult to execute — not for your regular user.
Ruirui Liu says
Which solution is more efficient for deep learning: 2x RTX 2070 graphics cards or a single RTX 2080 Ti?
Thank you,
Tim Dettmers says
2x RTX 2070 is much better if the memory on them is enough for you.
andrea says
Hi Tim. The 2070 is out, along with its specifications. The street price is ~550 euros.
It seems it *has* Tensor Cores and (consequently) FP16 capability. I’d like to hear your opinion about its price/performance ratio.
Tim Dettmers says
I waited for the RTX 2070 to be released for an update. I will update the blog post today or tomorrow.
Brian A. Mulrooney says
I think the $499.99 models of the 2070 with 2304 CUDA cores could be the new price/performance king. The cheapest 2080 is $799.99 with 2944 CUDA cores. Adjusted for price per core, you are looking at ~20% more cores per dollar with the 2070. Consider that these cards can be manually clocked about ~200 MHz above the boost clock; then you have a very serious contender at the $500 price point. Likely within 30% of the performance, at two-thirds the cost of the 2080.
Tim Dettmers says
You are right. After doing my calculations the RTX 2070 is on top. One issue though is that it only has 8 GB of memory. That practically turns to 16GB though if you use 16-bit compute — which is an absolute requirement to get good performance from the RTX series — so maybe not the biggest issue.
Nile Furth says
One drawback to the RTX 2070 is that it does not have NVlink capability. That’s been a trend with Nvidia: with each successive generation, fewer and fewer cards have SLI/multi-link support. I’m still not clear on whether RTX video memory can be pooled/shared in Linux (Puget Systems indicates this cannot currently be done in Windows), but if that is your desired use case, or you want to play Tomb Raider at 4K between jobs, you might have to go with the 2080 or 2080ti instead.
Tim Dettmers says
I think for deep learning NVLink will not make the biggest difference because it is limited to two GPUs. For two GPUs the PCIe bus is not the biggest bottleneck if you want to parallelize via data parallelism. NVLink enables possibilities for model parallelism which were not possible before, which means not faster models but first and foremost bigger models. But I think the option to do efficient model parallelism is not that important because currently there are no real deep learning models which would require this. So overall I would not worry if you cannot use NVLink. The RTX 2070 is fine without.
andrea says
Tim, as you surely noticed, you cannot just do everything in FP16; e.g. you calculate the gradients in FP32, then backpropagate them in FP16, and then do the update step again in FP32. This is what they call mixed-precision training. So, depending on the implementation, you will NOT double the amount of memory. The exact gains in terms of memory have to be evaluated by experimentation. Maybe you could do such experiments and let us know (I don’t have any card with FP16 Tensor Cores as of yet). Thanks!!
Tim Dettmers says
You should see a halving of memory if you train in straight 16-bit. Usually, you do not see a decrease in performance when training in 16-bit if you use gradient scaling, that is, multiplying the error by a big number during backprop and dividing by that number for weight updates. If you go below 10 bits you usually see some form of decrease in predictive performance, but we are not there yet! 16-bit is very safe to train in.
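To illustrate the gradient-scaling idea, here is a minimal sketch of straight 16-bit training with a scaled loss (my own illustration; the model is a placeholder and the scale factor 1024 is an arbitrary example value):

```python
# Scale the loss up before backprop, scale the gradients back down before the update.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(1024, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scale = 1024.0   # arbitrary example value; keeps small gradients representable in FP16

x = torch.randn(32, 1024, device='cuda').half()
y = torch.randint(0, 10, (32,), device='cuda')

loss = F.cross_entropy(model(x), y)
(loss * scale).backward()            # multiply the error by a big number during backprop
for p in model.parameters():
    p.grad.div_(scale)               # divide by that number again before the weight update
optimizer.step()
optimizer.zero_grad()
```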
George says
Informative and well-written article.
p says
Hi Tim, between the Quadro P2000 4 GB GDDR5 and the GTX 1050 Ti Max-Q 4 GB GDDR5, which has better performance? Is the difference noticeable?
Sourabh says
Hi Tim,
Thank you for your wonderful analysis.
We want to set up a lab for deep learning, mostly training CNNs or ResNets.
Which NVIDIA GPU should we procure?
Option a) GeForce RTX 2080 Ti (Turing microarchitecture)
Option b) NVIDIA TITAN V (Volta microarchitecture)
According to your blog:
“Now a combination of bandwidth, FLOPS, and Tensor Cores are the best indicator for the performance of a GPU”
“So overall, the best rule of thumb would be: Look at bandwidth if you use RNNs; look at FLOPS if you use convolution; get Tensor Cores if you can afford them”
On this wiki page (https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
we observe that Option b) has more Tensor Cores, GFLOPS & bandwidth. But you have not discussed/considered Option b) in your blog post. Please let us know which one is better, especially for training computer vision deep learning tasks.
Thank you.
Tim Dettmers says
Hi Sourabh, the Titan V is very inefficient in terms of cost/performance. I would definitely go for the RTX 2080 Ti! First benchmarks also show that their performance is comparable and thus you will probably get roughly the same performance for $1800 less if you buy RTX 2080 Ti GPUs.
Rinish says
Hi Tim, I preordered an RTX 2080. I was wondering whether the 2080 has better performance than the 1080 Ti. Should I cancel my 2080 and get a 1080 Ti?
Tim Dettmers says
The RTX 2080 will be faster than the GTX 1080 Ti — at least 25% faster. However, these data are somewhat unclear and no high-quality CUDA 10 Tensor Core benchmarks exist yet. This means that the RTX 2080 will likely be even faster for convolutions than 25% compared to the GTX 1080 Ti.
Artur says
Hi Tim,
Thank you very much for the in depth article, very useful!
I was wondering: if I want to train two models at the same time, am I better off with one very good GPU, or with training each model on one of two lesser GPUs?
Artur
Tim Dettmers says
Hi Artur, In your case, two slower GPUs are usually better and even necessary if your networks consume a good amount of memory. If you train one model only, then you could also use the two GPUs with data parallelism for good speedups. In both cases, two small GPUs should be better. The only disadvantage is if you want to get more GPUs and they do not fit on your motherboard. Also, two smaller GPUs are often a bit more expensive than a single big one.
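For reference, here is a minimal sketch of the data-parallel option mentioned above, using PyTorch’s built-in wrapper (my own illustration; the model and batch size are placeholders):

```python
# One model replicated across two GPUs; each batch is split between them.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()   # splits each batch across GPUs 0 and 1

x = torch.randn(64, 1024).cuda()   # this batch is processed 32/32 on the two cards
out = model(x)                     # outputs are gathered back on GPU 0
```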
Artur says
Thank you very much Tim.
Another question: everyone is praising GPUs versus CPUs for different reasons that I do understand. However, I was not able to find a single study that compares the performance of a GPU with that of a good CPU (e.g. an AMD Ryzen Threadripper 2950X). All studies use outdated/standard CPUs and compare them with very good GPUs. Have you actually compared both? People I know in the 3D field are back to CPUs (but the best ones) and were disappointed by GPUs.
Tim Dettmers says
GPUs are not good for some computational problems, especially if you access data selectively, that is, you do not access 128 bytes sequentially (for example, 32 sequential positions in a 32-bit floating point array). This might be why GPUs are not suitable for some problems in 3D.
Art Lee says
Hi Tim, I was wondering if you could comment on which of the RTX 2080 or GTX 1080 Ti one should get for deep learning on a budget. These two cards are very close in price. The 2080 has Tensor Cores but the 1080 Ti has 11 GB of RAM. I don’t mind sacrificing a bit of speed if I can train larger networks with the 1080 Ti.
Tim Dettmers says
Hi Art, I think you should go for a used (eBay or otherwise) GTX 1080 Ti — this is a very solid choice if you are on a budget right now and the memory on the RTX 2080 is not enough for you.
Frixos says
Dear Tim,
I recently discovered your blog, and it seems at just the right time! I am looking to purchase a 1080 Ti soon. At first, I wanted to buy 2 more GTX 970s to pair with the one I have now, but I’d have to change my motherboard and PSU accordingly so I can use a multi-GPU system. It seems the 970 is severely out of date, though.
I wanted to ask if you have any experience or suggestions about prices when it comes to the GTX 1080 Ti. My main goal was to buy one around the Black Friday period. Do you think that’s a good idea? Or should I just get one at the beginning of October?
Mainly, I am asking because I’ve been reading that the value of the GTX 1080 Ti is crazy good now compared with the RTX 2080. Thus, I am afraid that the prices won’t drop much during Black Friday, whereas right now the price drop is fairly decent for the 1080 Ti.
P.S.: I’d love to hear your thoughts regarding a single 1080 Ti vs. multiple 970s. I am a postgraduate student whose main concern is entering Kaggle competitions.
Tim Dettmers says
Hi Frixos, I think going for a GTX 1080 Ti over a GTX 970 or RTX 2080 is a good choice if you need memory and are on a budget. It is difficult to predict what the market will look like around Black Friday though: Gamers were dissatisfied with the RTX 2080, but now that benchmarks are coming out they seem more positive. This might increase demand so that GTX 1080 Ti prices will drop — but this could totally change in a couple of days. There are also rumors that NVIDIA has a very large stockpile of RTX cards, which might make them very cheap if their sales are lower than predicted.
I would go for a used GTX 1080 Ti now on eBay (or similar sites) if I were you. That should be quite cheap right now and should be a solid choice, especially if you are on a budget.
Well Honey says
Here is a benchmark with TensorFlow Deepfakes. It seems like RTX 2080 Ti is nearly the same performance as Titan V.
https://www.youtube.com/watch?v=1KEHi-7r8VE at 16:45. Also CNN CIFAR-10 at 16:01
Tim Dettmers says
Thanks for making me aware of this! Deepfakes and CIFAR-10 are not the best performance benchmarks, but it hints at a general direction. It seems that the RTX 2080 Ti is closer to the Titan V than I thought and the RTX 2080 is closer to the RTX 2080 Ti. This might change for particular cases like LSTM benchmarks and straight ImageNet benchmarks. I will probably update the blog post at the end of the week.
Well Honey says
from Pugetsystems:
https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/
peter says
Hi Tim, am I correct that a laptop with an NVIDIA GeForce MX150 2 GB GDDR5 won’t be able to run the TensorFlow GPU version because it is not listed in the list of GPUs compatible with CUDA?
stkarlos says
Any update on this?
Bordeaux25 says
Hi Tim,
thank you for your very thorough guide.
I was wondering if a quad-core CPU like the AMD Ryzen 2400G is too much of a limiting factor for a DL build based on a single 1080 Ti or on an RTX 2080 (Ti).
Thanks
Tim Dettmers says
I think the AMD Ryzen 2400G is perfect for a single GTX 1080 Ti or an RTX 2080 Ti — so from a deep learning perspective, this is totally fine. The question is if you have other workloads which are CPU heavy and depending on the answer you might want to upgrade your CPU.
Chip Reuben says
When you use the GPUs for machine learning, what gets used as your video card? Say I am installing two EVGA GeForce GTX 1080 Tis on an MSI X299M Pro Carbon AC along with an Intel Core i9-7900X 3.3 GHz ten-core 13.37 MB 140 W CPU (and the necessary RAM, fans, etc.).
Does one get used as the video card and one for GPU computing? And what if I use an SLI bridge — what gets used as the video card then?
Or let’s just say I start with just one EVGA GeForce GTX 1080 Ti. What gets used as the video card?
Tim Dettmers says
You can use both for GPU computing, but one card will consume a bit of RAM for the monitor (usually 100-300 MB, depending on resolution etc.) and will consume a bit of compute to render the display (less than 5%; usually close to 0%). With an SLI bridge both GPUs will drive the display(s), but you can still use both GPUs and both GPUs’ RAM for CUDA computations — so no worry, this will not affect your CUDA / deep learning experience at all!
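If you want to check this yourself, here is a small sketch (my own illustration) that prints what a PyTorch process sees per GPU; the display’s 100-300 MB show up in nvidia-smi rather than in these per-process numbers:

```python
# Print total memory and the memory allocated by this PyTorch process on each GPU.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated_mb = torch.cuda.memory_allocated(i) / 1e6
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB total, "
          f"{allocated_mb:.0f} MB allocated by this process")
```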
Willfried Wienholt says
Thanks, Tim, for your excellent blog post including the RTX 20 series.
I wonder in what way the new series puts higher demands on desktop systems for deep learning purposes. What kind of CPU, RAM, and motherboard would you consider for excellent performance today if datasets are reasonably small (several hundred MB) but networks designed for NLP, classification, or time series prediction might need quite some performance?
Tim Dettmers says
I should update my other blog post about hardware. The CPU does not matter that much if you have a single GPU, but it can matter if you have four. Supported PCIe lanes on both the CPU and motherboard are an important criterion if you have more than 2 GPUs — aim for at least 8 lanes per GPU. Any DDR4 RAM is good for deep learning; however, if the bulk of your work is in machine learning or you need to preprocess data a lot, I recommend getting fast DDR4 RAM. Get at least the same amount of RAM as your GPUs have combined (if you have 4 GPUs you can go a bit lower; instead of 44 GB (4x RTX 2080 Ti) you can go with 32 GB, for example). If you want an NVMe SSD because you process datasets a lot, make sure that you have enough lanes for the GPUs — remember, 8 lanes for each GPU + 4 PCIe lanes for the NVMe SSD.
Alexander/O says
On the Supermicro 4028GR-TR2 the PCIe PLX switches have 96 lanes each. There are two of them connected to CPU1.
This might be the problem with respect to the bandwidth issue: http://www.supermicro.com/support/faqs/faq.cfm?faq=20732
I will test it on Monday. I did check the 4027GR-TRT today and the ACSCtl flags are all negative (correct for maximum bandwidth throughput).
Tim says
Will you update this with the T4?
Thomas says
Hi Tim,
Very nice article! I just bought a GTX 1080 Ti for 650 euros, which is around 100 euros cheaper than the average price of a 1080 Ti in my country (around 750-800 euros). The cheaper price is due to a one-day-only discount, so I decided to just go for it, since I can always return it within 2 weeks if I decide to go for something else.
It was really hard to choose with the RTX 2080 on the horizon with Tensor Cores. However, the cheapest RTX 2080 card here currently sells for 850 euros, which is significantly more expensive. Plus I am also a bit wary because of the lack of benchmarks.
What do you think, will I be good with the 1080 Ti, also considering the price? Or should I definitely return it and buy the RTX 2080?
Tim Dettmers says
Hi Thomas, congrats on the deal! I think this is really determined by up-to-date benchmarks. Given the numbers that I calculated, both choices definitely make sense. If you see benchmarks that favor RTX cards also make sure to have a look at software support for TensorCores. If people complain that they cannot run their models reliably with TensorCores then the GTX 1080 Ti will be better — at least until the problems with the TensorCores are fixed in TensorFlow / PyTorch.
Thomas says
Thanks a lot, I am very happy with it. The cheapest RTX 2080 is about 30% more expensive than the GTX 1080 Ti I have now, but I really doubt that the RTX 2080 will perform 30% better. But we will have to see, of course.
Also, as you said: software support will probably be lacking for at least a little while. It is still bleeding-edge tech.
Remi Cadene says
The best card for deep learning is the Titan V.
The best ratio between computing capability and price is the Titan Xp.
The RTX 2080 and RTX 2080 Ti are not made for deep learning; they are made for gaming and computer-animated movies.
A benchmark with VGG16 and ResNet50 in 16-bit and 32-bit is needed!!!!
Also, 16-bit training does not work well. Mixed precision (16-bit + 32-bit) is needed to converge to good local minima: http://on-demand.gputechconf.com/gtc/2018/video/S81012/
Tim Dettmers says
RTX 2080 and RTX 2080 Ti GPUs have Tensor Cores which are made for deep learning — so I do not see your point.
I did research on low-precision deep learning myself and I trained convolutional networks with 8-bit activations and gradients — that was not a big problem and the results were statistically the same. Training neural networks with 16-bit is very practical. In some cases you will see a decrease in performance, but it is not enough to justify sticking to 32-bit. There is even work that shows you can train neural networks without any loss of precision with 10-14 bits. Of course, for people who want to squeeze the last bit of performance out of a network (usually only computer vision researchers), software needs to be adjusted to work under these circumstances, but if there is a need these things will be developed.
It is in NVIDIA’s interest to make us believe that mixed precision is the only way to go. I do not believe that until I see more evidence for it.
Vadim says
Hello, Tim!
Thanks for the great article!
What do you think about ready-to-use PCs (where you don’t need to buy parts separately and somehow connect them, especially if you don’t know how)?
Can you recommend some models?
Thanks in advance!
Tim Dettmers says
There are some deep-learning-branded desktop PCs, but they are very expensive. I would recommend going with a gaming PC with a GPU that you want. See if the computer allows for additional GPUs. This will allow you to add another GPU later on if you like. Exchanging GPUs is easier than building a computer on your own, but building a computer is also not that difficult and it is a good skill to have. But if you insist on not building your own PC, a gaming PC is a good option.
Otherwise, there are also services where you buy parts and they then put all the parts together for you. This could also be an option if you can find a service like that. A local computer shop might do that for you, and then you also always have somebody to talk to if something is wrong with the hardware of your computer.
Alexander says
If the cost of the RTX Titan is going to be anything like the Titan V, then the end cost in USD of GTX 10xx + Titan RTX will be $4000 + opportunity cost.
I just had my company send back to NVIDIA 4 (out of 8) cards we bought on Aug. 8, because I could not in good faith allow them to spend $12,000 when I can use that money to buy ten (10) overclocked cards (thinking water-cooled EVGA RTX 2080 Ti) and fully populate our 10-GPU Supermicro.
Just buy a highly overclocked 2080 Ti, and I bet it will be faster than the RTX Titan, because there is very little extra hardware on that Turing chip to activate to convert it into a Titan.
Tim Dettmers says
That makes sense. The only big difference between a Titan and an XX80 is usually a bit of RAM. The XX80 Ti, however, often closes that gap. A Titan also makes sense if it is important to squeeze every bit of performance out of a GPU slot. However, if you already have a 10-GPU server and just need to swap GPUs, then the RTX 2080 Ti is an excellent choice. I would have done the same in your situation.
Alexander/O says
I am observing very bad bandwidth benchmarks for Titan V GPUs, plugged into a Supermicro single root complex system. I have no idea why.
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 552.51 5.73 5.73
1 5.76 555.65 5.72
2 5.77 5.71 555.65
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2
0 554.08 4.21 4.18
1 4.14 558.04 4.18
2 4.14 4.21 556.45
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 561.65 6.12 6.14
1 6.11 561.24 6.12
2 6.11 6.10 562.46
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 563.67 8.04 8.06
1 8.05 563.27 8.09
2 8.06 8.10 560.84
P2P=Disabled Latency Matrix (us)
GPU 0 1 2
0 2.20 16.68 17.39
1 17.02 2.18 16.88
2 16.73 16.95 2.20
CPU 0 1 2
0 4.71 11.17 11.20
1 11.18 4.75 11.10
2 11.15 10.93 4.58
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2
0 2.16 1.65 1.65
1 1.65 2.19 1.64
2 1.69 1.69 2.21
CPU 0 1 2
0 4.69 3.22 3.20
1 3.20 4.57 3.16
2 3.22 3.17 4.63
Tim Dettmers says
I cannot quickly find the exact benchmark that you are using. But the simpleP2P benchmark from the CUDA samples uses 64 MB buffers. If you are running on 8 lanes per GPU on a single root complex, then 4-5 GB/s is not unreasonable. In usual applications, you will not see bandwidths above 7 GB/s. You can achieve the theoretical 8 GB/s if you use very large buffers, but that will not happen in practice. In my theoretical model for deep learning parallelism, I benchmarked the PCIe bandwidth for typical sizes of gradients and activations and found that it is about 5 GB/s for 8 PCIe lanes, very similar to your numbers. If you increase the buffer size, you should see higher numbers.
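To see the buffer-size effect yourself, you can time copies for a few buffer sizes. The minimal sketch below (assuming PyTorch and a single CUDA GPU; the sizes are arbitrary examples) measures plain host-to-device bandwidth, which shows the same small-buffer penalty as the peer-to-peer test:

# Minimal sketch: host-to-device PCIe copy bandwidth for several buffer sizes.
# Assumes PyTorch and one CUDA GPU; buffer sizes are arbitrary examples.
import time
import torch

def h2d_bandwidth(num_bytes, repeats=20):
    # Pinned host memory gives the most realistic PCIe transfer numbers.
    host = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        dev.copy_(host, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    return num_bytes * repeats / elapsed / 1e9  # GB/s

for mb in (1, 8, 64, 256):
    print(f"{mb:4d} MB buffer: {h2d_bandwidth(mb * 1024**2):.2f} GB/s")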
Alexander/O says
Numbers for Supermicro 4027GR-TRT with 4 Titan X (Maxwell, EVGA Hybrid)
developer@theano:~/NVIDIA_CUDA-9.2_Samples/1_Utilities/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest |more
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX TITAN X, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX TITAN X, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX TITAN X, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX TITAN X, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 253.07 9.61 10.76 10.62
1 9.63 257.11 10.71 10.65
2 11.06 10.92 256.94 9.67
3 10.95 10.78 9.63 253.54
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 254.21 13.18 10.27 10.27
1 13.18 258.23 10.09 10.27
2 10.27 10.18 258.28 13.18
3 10.27 10.17 13.18 255.75
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 256.31 10.32 18.56 18.20
1 10.12 257.80 18.54 17.98
2 18.25 18.53 258.08 10.14
3 18.36 18.21 10.18 255.89
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 256.58 25.40 18.62 18.62
1 25.41 258.61 18.60 18.62
2 18.62 18.62 258.56 25.41
3 18.61 18.61 25.41 256.49
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 3.01 11.34 18.27 11.97
1 17.15 2.89 13.50 14.48
2 11.68 12.73 2.95 12.17
3 13.64 14.89 17.64 2.94
CPU 0 1 2 3
0 7.64 13.85 14.21 12.92
1 13.70 7.00 13.61 13.25
2 13.71 13.65 7.25 13.76
3 14.04 13.48 13.78 7.22
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 2.91 1.22 1.61 1.62
1 1.18 2.88 1.61 1.61
2 1.62 1.63 3.01 1.20
3 1.62 1.62 1.19 2.93
CPU 0 1 2 3
0 7.25 3.98 3.86 3.86
1 3.64 6.59 3.60 3.70
2 3.63 3.61 6.89 3.62
3 3.63 3.58 3.76 6.63
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Alexander/O says
On the Supermicro 4028GR-TR2 the PCIe PLX switches have 96 lanes each. There are two of them connected to CPU1.
This might be the problem http://www.supermicro.com/support/faqs/faq.cfm?faq=20732
Will test it on Monday. I did check the 4027GR-TRT today and ACSCtl are all negative (correct to get maximum bandwidth numbers).
Emmanuel says
Hello,
Thanks for this very useful article!
I wanted to use some GTX 1080 Ti cards (or maybe RTX 2080 Ti after reading this) to start doing deep learning on images. As I am at a university, I will not be a data center, but we have one, and maybe my GPUs will be used by other people (with very restricted access). I'm wondering what this new policy agreement for CUDA looks like, but I am not able to find where it is specified that CUDA should not be used in data centers with GTX or RTX cards.
Do you have any other information on this, or do you know where I can find this policy agreement?
Thanks a lot for your help !
Tim Dettmers says
The thing is that it is nowhere defined what a "data center" is. But I think as long as you are using a desktop computer (literally a computer under your desk) you will have no problems, even if other people log into that computer remotely. This policy's main purpose is to prevent a company/university from buying hundreds of GPUs and putting them into networked computers. I think this would also be a good definition for the "no data center" policy: dozens to hundreds of networked computers.
Mathias says
There is no mention of Azure?
Tim Dettmers says
You are right, I should add a few words about Azure. I will try to do that in the next update. Thanks for the feedback.
Facundo Calcagno says
I dealt with an issue like this last month. In my company they decided to start with an Amazon p2.xlarge instance that included a K80 GPU. Everything moved really slowly in terms of training. Hence, I tried the same PyTorch code on my personal computer, which has a Titan Xp, and noticed a 5x speedup.
I use PyTorch 0.4, a 3D CNN (modified C3D), data augmentation, and a 4GB dataset.
Can you explain why there is such a big difference?
Tim Dettmers says
I do not think this is a GPU issue since the difference between the GPUs should be much smaller, and the compiled code is very similar, so this should not be an issue of compiling with the wrong flags. What I suspect is that the CPU or the PCIe transfer might be a bottleneck here. The data augmentation that you mention, do you do it before you pass data to the GPU? If so, this would be my main suspect. Another thing is that 3D data can be very large; if you do not use asynchronous CPU->GPU transfers, this might be very slow on AWS where the PCIe architecture is often degraded through virtualization. I think the new V100 GPUs have better virtualization and you might be able to get around this issue with a newer AWS instance. Hope this helps.
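To check whether the copy is the bottleneck, you can overlap the CPU-side augmentation and the CPU->GPU copy with GPU compute. A minimal PyTorch sketch (the fake dataset, batch size, and worker count are placeholder assumptions, not your actual settings):

# Sketch: overlap CPU data loading/augmentation with GPU compute in PyTorch.
# The dataset, batch size, and worker count below are placeholder assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 16, 112, 112),   # fake 3D clips
                        torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=4,     # augmentation would run in worker processes
                    pin_memory=True)   # enables fast asynchronous host-to-device copies

device = torch.device("cuda")
for clips, labels in loader:
    # non_blocking=True lets the copy overlap with GPU work from the previous step
    clips = clips.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass goes here ...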
CALCAGNO Facundo says
Yes indeed, for data augmentation I'm using the PyTorch pipeline that performs the augmentation on NumPy arrays before sending them to the GPU.
Thanks!
Unknown says
Thanks a lot!
It was great!
Cherla Sri Krishna Kiran says
Hi Tim,
I would like your opinion and advice. I am trying to purchase a GPU for Kaggle competitions. The GTX 1080 Ti and RTX 2080 cost the same in my country. The 1080 Ti has 11GB VRAM but the 2080 has Tensor Cores. Which one is better in the long run?
Thanks,
CSKK
Tim Dettmers says
Difficult to say for sure for now since benchmarks for the RTX 2080 are lacking. However, I believe you will get better performance from an RTX 2080 if you use 16-bit precision.
Norman Heckscher says
It’s worth considering a free option from Google. Basic hardware, yet, more than enough to get started.
https://colab.research.google.com/
Tim Dettmers says
That is a good point. I should have looked into that. I will try to include it in an update with the hard RTX performance numbers.
Tim O'Hear says
This is even more the case nowadays. You can get Colab Pro for $10/month, which has fewer usage restrictions (standard Colab bans you for 12 hours after 12 hours of use).
Nowadays you typically get a P100 16GB, and V100 16GB cards are starting to turn up on Colab Pro.
Tim Dettmers says
Thanks for your feedback. From the overall feedback that I got, a broader discussion of cloud solutions was one of the main issues mentioned. I will consider adding a bit more content and writing a small update that focuses more on the cloud.
pep says
If I use an external GPU enclosure, which card makes the most sense to buy? I mean, given the fact that Thunderbolt 3 might be a bottleneck, at what point does it become pointless to buy a better and quicker card?
Thanks
Tim Dettmers says
The bandwidth on Thunderbolt 3 is pretty high and you should not see a serious dent in performance. I would expect a performance hit of about 10-15% at most. The performance penalty is also determined more by the task than by the GPU. The simpler the task, the higher the performance penalty. If you are running a large ResNet model on a large image or running a BiDAF model on a paragraph, you see almost no performance degradation.
gaoyuanbo says
Hi Tim, I am a TensorFlow user. I am using a V100 to train a model; my network has a lot of LSTMs. After reading your blog, I replaced tf.float32 with tf.float16, but I found the speed did not become faster. Can you give me some advice? Thank you very much.
Tim Dettmers says
It might need more configuration. The LSTM benchmarks that I quote use this setup:
Marcin says
I wonder: you mentioned the RTX 2080 and 2080 Ti on your list. However, have you heard any confirmation that their Tensor Cores are actually going to be exposed via CUDA? I mean, it IS a gaming card by its nature and it could cannibalize NVIDIA's Quadro lineup. At the very least, I have heard nothing about it ACTUALLY being supported yet.
Tim Dettmers says
I do not think they would have two different kinds of Tensor Cores, one that works for Tesla and Quadro and is programmable, and one that only works for gaming, but we will see in the coming days and weeks.
Ani says
Hi Tim,
Thanks for the informative post.
I just learnt about “NVIDIA® JETSON XAVIER™ DEVELOPER KIT”.
https://developer.nvidia.com/embedded/jetson-xavier-faq
I know, it’s an edge device, but they are offering it at a great discount to the retail price.
I don't currently own an NVIDIA GPU. Can this be used as a platform for learning ML and for training models?
I would really appreciate your opinion on this one.
Thanks in advance,
– A
Tim Dettmers says
I had a Jetson before, and one problem can be its CPU, which makes installing software more complicated since not all packages support ARM CPUs. I had additional problems with a CPU that only supports 32-bit (the newer Jetsons should support 64-bit), but still, you will spend some time setting everything up if you are planning to use it via SSH from a laptop or desktop. The performance is rather poor compared to dedicated GPUs (even though there is a discount). I would only get it if you need it for robotics etc., that is, only if you actually need a small, portable GPU.
Joseph says
I have read your comments about the previous Jetsons, but the new Xavier looks very promising.
The Xavier reportedly operates at 30 teraflops, which makes it roughly 3x faster than a GTX 1080 Ti. So for $2500 you get a full deep learning computer which appears competitive with what most hobbyists would build for that amount of money.
So what are the downsides?
Joseph says
One last thing. The product page says:
“Can NVIDIA GPUs be used with the Jetson AGX Xavier Developer Kit?
The current early access JetPack release does not support this; support will be added in a future release.”
If I’m understanding correctly, does that mean that you could add another GPU card to the Jetson AGX Xavier?
Tim Dettmers says
As I understand it, this is meant to enable prototyping for the Xavier Jetson on your desktop GPU and, when finished, rolling your code out to the Jetson AGX Xavier. This would mean that a company that uses the Xavier for mobile applications does not need one Xavier per developer; developers can work from their desktop stations.
Tim Dettmers says
Since the introduction of Tensor Cores, the interpretation of these numbers is no longer straightforward: (1) The Xavier Jetson can theoretically operate at 30 teraOPS, which means 8-bit or 4-bit compute (not teraFLOPS for 32-bit or 16-bit compute), and (2) the theoretical FLOPS, while realistic for standard GPUs, are very unrealistic for Tensor Cores at 8-bit because your algorithms will generally be limited by bandwidth and not compute, since the Xavier memory is very slow at 137 GB/s. This bottleneck is not as strong for RTX cards, which have about 600 GB/s. You can expect a Xavier Jetson to be about half as fast or less compared to a GTX 1080 Ti for training networks.
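As a back-of-the-envelope illustration of why bandwidth dominates, you can compare the time a layer needs for memory traffic with the time it needs for compute. The sketch below uses a weight matrix applied to a small batch, the typical shape in an RNN step or small-batch inference; all numbers are illustrative assumptions, and the peak compute is held fixed to isolate the bandwidth effect:

# Rough estimate: is a small-batch matrix multiplication bandwidth- or compute-bound?
# All numbers are illustrative assumptions, not measurements.
n, b = 2048, 32                    # weight matrix n x n applied to a batch of b vectors
ops = 2 * n * n * b                # multiply-accumulate operations
bytes_moved = n * n + 2 * n * b    # weights + inputs + outputs at 1 byte each (8-bit)

peak_ops = 30e12                   # ~30 TOPS of 8-bit compute
for name, bw in [("Xavier-like, 137 GB/s", 137e9), ("RTX-class, 600 GB/s", 600e9)]:
    t_compute = ops / peak_ops
    t_memory = bytes_moved / bw
    print(f"{name}: compute {t_compute*1e6:.0f} us vs memory {t_memory*1e6:.0f} us")
# On the slow-memory device the memory term dominates, so the theoretical TOPS are
# never reached; faster memory shrinks that gap.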
Wingman says
It is now 2020, and NVIDIA recently upgraded the RAM to 32GB on the Xavier dev kit, and the current price is $700. I've used the TX1 dev kit as a headless training box in the past (yes, not fantastic performance), but I'd like to upgrade to something that allows me to do much larger batch sizes on CNNs. The next step in my home project is to take my dataset, which now has >100K images, and see if I can bootstrap e.g. MobileNetV2 from scratch. (Let's just say I'm … patient.) Now, newer GPUs are definitely faster, but for getting a larger batch size, I don't see too many options more cost-efficient than the $700 Xavier dev kit for the RAM. What do you think? Is the computational power tradeoff for the larger (albeit shared) RAM maybe a reasonable deal?
Tim Dettmers says
I would only recommend the Xavier kit for regular training/development if you also want to use it for some hacky mobile stuff like vision for robotics. For a headless GPU box I would maybe rent a computer in the cloud or buy a small computer with a small GPU ($600 – $700).
Wingman says
Thanks Tim for the advice on the Xavier! I did more research and decided to go with a used SuperMicro GPU server hosting a K80 (and room for one more). The speed might be slower but the VRAM/$ ratio is hard to beat and I get a good CPU/RAM on the main board too. I’m new to the world of servers so it was a fun journey: https://wingman-jr.blogspot.com/2020/03/the-quest-for-new-hardware-pt-1.html
Ps Z says
Thanks for sharing; this is really helping a lot. However, I'm just curious how you got the performance data for the RTX 2080 Ti right after it came out. Did you get these cards in advance from NVIDIA?
Tim Dettmers says
Thank you for your comment. This is a bit buried in the blog post, but I mention my methodology briefly:
You can also find the LSTM and convolutional benchmarks for the V100 and Titan V in the blog post. From other sources I know how many Tensor Cores the RTX 2080 and RTX 2080 Ti have. If I combine these numbers with the method above, I arrive at the numbers in the charts. These numbers are rough estimates though, and I am curious how close they are to reality once benchmark data appears for the first time. I will update the blog post once those numbers become available.
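For readers curious what this kind of extrapolation looks like mechanically, here is a toy sketch. The Tensor Core counts are public specs, but the baseline throughput is a placeholder you would replace with a measured V100 benchmark, and naive core-count scaling ignores clocks and memory bandwidth, which is why the resulting numbers are only rough:

# Toy sketch: extrapolate relative Tensor Core performance from core counts.
# Core counts are public specs; v100_images_per_sec is a placeholder, not a measurement.
tensor_cores = {"V100": 640, "RTX 2080 Ti": 544, "RTX 2080": 368}

v100_images_per_sec = 1000.0   # placeholder: insert a measured V100 benchmark here

for gpu, cores in tensor_cores.items():
    estimate = v100_images_per_sec * cores / tensor_cores["V100"]
    print(f"{gpu}: ~{estimate:.0f} images/sec (naive core-count scaling)")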
Ps Z says
Sorry I didn’t notice that. Thanks a lot!
Dave says
Hi Tim,
Firstly, thanks A LOT for your work on this article. Secondly, I agree with you (and other folks); let's wait for reliable benchmarks before buying the 2080 Ti/2080.
One last thing: I'm trying to buy a new GTX 1080 Ti but it is always out of stock on the NVIDIA website. Does anyone know how often it is in stock?
Best,
Dave
Tim Dettmers says
I think NVIDIA does not have them in stock anymore on purpose. They want you to buy the new RTX cards. Try another vendor; they should still have GTX 1080 Ti cards. I would also recommend getting a cheap one on eBay!
James says
Hi Tim,
Thanks for the update, this helped me to decide on a 1070 second hand.
For the cloud options, is there any reason why Azure has been left out? They now offer their Data Science Virtual Machine on Linux which can be equipped with K80, P100 or the new V100 all in multiples of 1 to 4 GPUs.
I don’t have any experience with Azure DSVM, but wondered if it had been left out for a reason?
Thanks
andrea de luca says
About the RTX cards: since they have double-fan ventilation, do you think one can stack four cards in a single box?
Tim Dettmers says
If you look at the display connectors of the current cards, you can see that they will occupy two slots. This might differ for future designs, but keep in mind that all GPUs that have a high TDP are usually 2 slots high because the heat would be difficult to dissipate otherwise. I do not think the fans and heatsinks on the RTX 2080 can be made so much better that they would allow for a single slot since the TDP is still quite high.
Ting says
Thanks for your article. I plan to get a GPU to run convolutional neural networks, but the IT support staff said the GTX 1080 requires a different power supply and thus cannot be considered. They recommended the NVIDIA Quadro P2000, 5GB, but it seems someone said this card is not for neural networks but for CAD. Do you know this GPU and whether it is a fit for a CNN?
Thanks!
Tim Dettmers says
A Quadro P2000 would be fine for deep learning. It is slower than a GTX 1080, but you can still run most networks, especially if the input size is not too large. However, the P2000 is a bit pricey. I would recommend a GTX 1050 Ti with 4 GB of RAM, which is 25% slower than a P2000 but half the price.
Peter Bartlett says
According to this:
https://cloud.google.com/gpu/
I can get a P100 GPU for $1.49/hr.
I’m thinking this will be my new ‘I have no money’ option 🙂
Tim Dettmers says
That is about $250 a week if you run it non-stop. I would recommend a Hetzner GPU instance, which is 99 euros per month for a GTX 1080.
Aparna Ravishankar says
Hi Tim,
I would like your opinion and advice. I would like to set up my own workstation at home to pursue my interest in parallel programming (CUDA GPU programming). I'm also keen on running experiments with Webots (a robotics simulator). I was doing a bit of research on various workstations on the market. I am particularly keen on this one: BOSS Xeon E5-2630 v4 Titan Xp SLI. I would like your opinion on it. Is it worth the money or would you recommend something else?
The workstation is NOT for gaming, but for programming and simulations.
Looking forward to your reply,
Thanking you,
Aparna.
Tim Dettmers says
It seems a bit overpriced and does not have the best value for a deep learning desktop. If you need 10 cores for your robotics simulator it might be different, but the CPU is too powerful for deep learning alone. I would save some money on that. You also do not need ECC DDR4 memory; I would go with regular DDR4 memory. These are the things you pay extra for when you buy that machine. The problem is that there is currently no good vendor that sells good desktops for deep learning. I would recommend buying parts and building the PC yourself. You can put together a PC on https://pcpartpicker.com/ and follow a few YouTube videos to learn how to assemble your desktop.
Chirjot Singh says
GeForce MX150 GDDR5 2GB or GeForce 940MX DDR3 2GB for deep learning?
Tim Dettmers says
The GeForce MX150 will be faster. I have not looked at the price though, and the 940MX might also have good cost/performance. I would probably go with the MX150, however.
Kevin says
Hey Tim, thanks for the fantastic article. I'm still in uni and finding my way into the deep learning world, but I was planning on building my own machine for the purpose and had some questions about hardware that I was struggling to find answers for in terms of deep learning, because most PC hardware sources seem to be about gaming. (I'm not sure specifically what programs or datasets I would be using, so these questions are meant pretty generally.) I was planning on starting with a GTX 1080 Ti and adding another later when my budget allowed for it.
My first question is whether my system's cooling would be fine with a CPU AIO cooler and the two 1080 Tis in SLI left on air cooling, or whether it would be worth investing in a custom liquid cooling setup (especially since I'd have to take it apart to add a second GPU later on).
And secondly, I've done some research on the topic and it seems two good CPU choices for this build are the Intel i7 8700K and the Intel i7 6850K. They both have 6 cores, but the 8700K has a higher clock speed and only 16 PCIe lanes, whereas the 6850K has a lower clock speed but 40 PCIe lanes. I was wondering, for deep learning generally, would it be better to bottleneck the two GPUs at x8 lanes in exchange for the higher clock speed of the 8700K, or would it be better to opt for the lower clock speed of the 6850K, knowing that the two GPUs won't hit the PCIe lane bottleneck? From what I can tell the 8700K is better for gaming, but because that is not my primary goal, I can't tell if that translates to deep learning given the PCIe lane bottleneck I may be putting on myself.
Thanks in advance!
Tim Dettmers says
GTX 1080 Tis on air are fine. Liquid is always better, but I do not think it is worth the trouble. You also do not need 16 PCIe lanes per GPU. Even for parallelism, if you only have 2 GPUs, 8 lanes will not make a big difference compared to 16 lanes. With respect to CPUs, I think both are fine. I personally would go with the cheapest Threadripper, but in any case, it does not make a big difference for deep learning.
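A quick back-of-the-envelope calculation shows why the lane count matters less than it sounds for two GPUs; the model size and achievable bandwidths below are illustrative assumptions:

# Rough estimate: extra gradient-sync time per step at x8 vs x16 PCIe 3.0 with 2 GPUs.
# All numbers are illustrative assumptions.
params = 100e6                 # a 100M-parameter model
grad_bytes = params * 4        # FP32 gradients
bw_x8, bw_x16 = 6e9, 12e9      # realistically achievable bandwidth in bytes/s
t_x8 = grad_bytes / bw_x8
t_x16 = grad_bytes / bw_x16
print(f"x8: {t_x8*1e3:.0f} ms per sync, x16: {t_x16*1e3:.0f} ms per sync")
# If a training step takes a few hundred milliseconds of compute and the copy partly
# overlaps with the backward pass, the extra ~33 ms rarely dominates.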
Age says
Wow, so many comments to your great post. If you or anyone could help with recommendations that would be appreciated.
I have one GeForce GTX 1080 Ti with an open-style cooler that blows air into the case (bought new) and one GTX 1080 (bought used) that also has an open-style cooler that blows hot air into the case. (I've been using DL/ML for a couple of years but only last week read that some GPUs funnel hot air out of the case rather than into it.)
I have 2 machines and a GPU in each. One machine can fit 2x Xeons for a total of 80 lanes and a 3x16, 1x8 PCIe 3 config (currently 1x Xeon).
I would like to get another GPU and am not sure if I should get a 1080 ($650 AUD used) or a 1080 Ti ($1060 AUD new). I mainly use DL for time series and NLP work, with occasional convnets.
If I got a Ti, I'd need to water cool the two Tis, which adds complexity but should help with GPU lifespan (currently 75-80 degrees during training). One case is big enough for 420/360mm water cooling radiators.
The other option is to get a used 1080, run it with the second 1080 air-cooled at probably relatively high temps, and hope they last a few years.
Cheers
Tim Dettmers says
The deal on the GTX 1080 looks better — for the application to sequential data 8 GB should be okay. If you want to train something big like BiDAF it can be tricky, but often it will train just fine with a smaller batch size. From my experience, the ventilation of air inside the case makes differences in the range between 2-4 °C and is not really worth the money. Water cooling, on the other hand, is very effective. If you can get a cheap water cooling solution, it might also make sense to invest in GTX 1080 Ti — as you mention the lifetime will be improved (although I never had a GPU die) and the cooling will provide additional performance which will make the GTX 1080 Ti competitive with an air-cooled 4 GPU node with next-generation GPUs.
Personally, I would probably go with the GTX 1080 on air and upgrade to a liquid cooled GTX 1180 or Vega 20 (if it is < $2000) when they become available. But a water-cooled GTX 1080 Ti would give you more performance now, and okay performance for the future, so you might want to skip a generation of GPUs and invest in GTX 1280 or Navi GPUs that come in 1-2 years. Both are good decisions. I do not know if I would buy a liquid cooled GTX 1080 since it will be outdated soon and the cooler might not fit newer models of the GTX 1100 series or AMD GPUs.
Marko Rantala says
Nice, analytical view.
teja says
I am using deep learning code based on the TensorFlow library with a GoogLeNet or ResNet CNN; there are around 1,000 images in each dataset and there will be 4 datasets in total. My laptop has a 7th-generation i5 CPU (2.5 GHz, up to 3.1 GHz, 2 cores, 3MB cache) and an NVIDIA 940MX (2GB, DDR3). Is it possible to use my laptop for my deep learning purposes? If not, please suggest some good and cheap alternatives. I am a researcher.
Tim Dettmers says
Training on your GPU will be a bit slow. Your CPU seems to be okay and you can expect to train a model on it in about 12 to 24 hours. You could run a model overnight. If this is too inconvenient for you, you might want to use GPU spot instances in the cloud, with which the training time should be a couple of minutes (about 5 minutes). If you plan to do a lot of work in deep learning in the future, I would recommend getting a desktop instead of a cloud server. You can get a small desktop with a GTX 1050 Ti for about $500; if you want to do more you can upgrade the GTX 1050 Ti once the GTX 11 series hits the market (2018 Q3).
Michael says
Dear Tim,
thanks a lot for this thorough comparison and the multiple updates over more than two years! Your blog is a very helpful source of information!
While I would be able to make up my mind on which GPU to buy given the state of mid-2017, I have a bit of trouble finding such excellent comparisons for spring 2018. Do you have any recommendations for later GPU models, or can you point to newer sources of information?
Thanks a lot, Michael
Tim Dettmers says
I will publish an update which will include TPUs, AMD GPUs and GTX 11 series sometime in the next month. In the meantime, I would suggest getting a cheap GPU to get your task done and wait for updated GPUs in 2018 Q3 which will increase performance by roughly 50% — so it is well worth the wait. Otherwise, if you do not want to wait, the recommendations of this blog post still stand.
Sunny says
An updated blog post would be great. I am particularly excited about the rumored Vega 20, with 16-32 GB, 1 TB/s, ECC, 150-175 W ... probably too good to be true, but Q3 might certainly have some important hardware releases for HPC and ML from both team green and team red.
Tim Dettmers says
Indeed the Vega 20 looks exciting! The question is really if the price is right. With a good price this could be the turning point from NVIDIA to AMD (albeit a very slow turning, at least initially).
sslz says
The Xeon Phi is one of my main considerations as a hardware accelerator for deep learning. Have you tried the latest Intel compiler suite? Is there better support for Python?
BTW, would the Xeon Phi become a better option than NVIDIA if I can write my own library in C and C++?
Tim Dettmers says
I do not recommend using the Xeon Phi; it will be a very frustrating experience. The experience might be better than what I had about 2 years ago, but it will still be bad. Writing code for the Xeon Phi is terrible, and you will still not be able to come even close to the practical performance of cuDNN because Xeon Phis are too difficult to optimize fully: L1 cache management is an absolute mess; I never bothered with register optimization, but I do not see any tools that would make it work. So I strongly recommend against using Xeon Phis if you care about using your hardware to do something useful. If you are interested in expanding the Xeon Phi code base for its own sake it might be an option, but note that even if you publish good code, few people are likely to use it along with the Xeon Phi.
P says
Hello, I can't find the questions I posted a few days ago, so maybe they did not go through.
How is the performance of the 1060 (6GB) compared with the 1070 Ti? For example, if the 1070 Ti can complete a simulation in one hour, about how much longer will the 1060 take? Given my background, is the 1060 sufficient for me to learn about DL for the first few months? I did some work on NNs in grad school, long before most people had heard the term.
Tim Dettmers says
A GTX 1070 Ti should be about 25% faster than a GTX 1060. If you get the version with 6GB of RAM, a GTX 1060 will be sufficient to explore and learn almost all areas of deep learning, so it is a good choice for the start!
Bruno says
Hi Tim, thanks so much for the efforts put in this article and the comments.
Question for you: I am considering buying an NVIDIA AORUS GTX 1080 Gaming Box for deep learning purposes, as I have a Lenovo laptop with a Thunderbolt 3 port.
I know it is probably less efficient than an internally installed 1080 card; still, is there any other reason why you think it is not a good idea?
Thanks a lot!
Tim Dettmers says
This is a very good choice. With Thunderbolt 3 you will have almost the same performance as a dedicated machine. The performance penalty should be between 0-10%, and for most tasks, it will be 1-3%.
Tchicken says
Hello from France…
I have a small budget…
Someone gave me two NVIDIA GTX 590 GPUs that I am thinking of running in SLI on an MSI Z270 SLI PLUS motherboard with an i7-7700K. Can I do deep learning properly with this, or should I look for something else?
In advance thank you for your answers, friendly, Michel POULET.
Tim Dettmers says
A GTX 590 will not be sufficient. To run cuDNN which is often a requirement for fast CNNs and RNNs you need at least a GTX 600 series or better.
tchicken says
Is a GTX 1060 3GB good?
Thanks.
Tchicken says
Sorry to bother you again, I found this PC:
Processor: Core i3 3220, 2x 3.3 GHz
CPU cooler: Arctic Cooling Freezer 13
Motherboard: ASUS P8H77-M micro-ATX
Memory: 8GB (2x 4GB) DDR3 Crucial Tactical Tracer 1600 MHz
GPU: MSI GTX 660 Twin Frozr III 2GB GDDR5
2x DVI, 1x HDMI, 1x DP
Storage: SSD Crucial M4 128GB
HDD: Western Digital Caviar Green 500GB
Case: Zalman H11 Plus blue, 2x 120mm + 2x 90mm
PSU: Corsair CX430M v2
Can it allow me to start in deep learning?
In advance thank you for your response.
P says
Hello Tim, how does the NVIDIA 1060 compare with the 1070 Ti? For example, how much faster does a 1070 Ti get simulations done compared with the 1060?
William Benjamin says
Hi Tim,
I already have a 1080 Ti and was planning to add another card to the setup. Does it make more sense to:
1) Get another 1080 Ti, or
2) Get a Titan Xp? Based on current prices, there is a $200 difference between a 1080 Ti and a Titan Xp.
Tim Dettmers says
I think the Titan Xp for $200 more is a good deal. However, if you get a GTX 1080 Ti, it might be easier to parallelize across your GPUs. Theoretically, you will also have that option with the Titan Xp since both run the same chipset with the same compute capability, but parallelism might not be effective with two devices of different speeds. If you want to use parallelism a lot, then I would go with a GTX 1080 Ti; otherwise, if you only use parallelism from time to time, I would go with the Titan Xp deal.
jims1990 says
Hi Tim,
Great article, it helped me a lot!
I have read all comments and didn’t find answer for question:
Let’s assume some budget, let it be (10-13k USD).
What will be the best option for deep learning (mostly image recognition task):
a) buying 4x GTX 1080 Ti 11 GB GDDR5X
b) buying 1x Tesla P100 16 GB HBM2
c) anything else?
Let’s assume that other parts in setup are the same (like intel i7 or i9, 128 GB RAM, SSD disk) – it’s all about performance of GPU.
For now I am thinking that 4x 1080 Ti will be best, mostly because of the parallelization possibilities.
I found some benchmark comparisons between the 1080 Ti and P100, but no discussion about 4x 1080 Ti vs 1x P100. I guess this will be helpful not only for me.
Thank you in advance for your answer.
Best regards!
Tim Dettmers says
I would definitely go with 4x GTX 1080 Ti. You have much more flexibility in how to use them. They are faster than the P100 if you use parallelization, they are better if you work with multiple people on the same server, and they are better if you want to run multiple models at the same time. If I were buying a system for research, I would personally go with 4x GTX 1080 Ti at the moment.
P says
Hello Tim, are laptops with a GeForce GTX 1050-1080 (with or without Ti) sufficient to do DL work? How much RAM in the GPU and laptop would be sufficient? I know the more the better.
Tim Dettmers says
It depends on what you are trying to do. Cutting-edge research is not possible, but with any card above 6GB you can do a lot of nice things. You can do Kaggle, use state-of-the-art models on smaller datasets, and train many different models in NLP. With an 8GB card you can do most things except memory-hungry computer vision models on ImageNet and the like. The GTX 10 series laptop GPUs are very powerful, so it's definitely a good option!
P says
Thanks Tim. I have worked on neural networks in grad school but haven't done any deep learning work. I am trying to decide whether to build a high-end workstation now or buy a cheap laptop/desktop with hardware "just good enough" for the next few months of learning. Do you think an NVIDIA 1050 GPU, 16GB RAM, and an i7-7700 or AMD Ryzen 7 1700 would be sufficient to get me started until fall? I am also concerned about whether AMD CPUs such as the Ryzen 7 1700, 1800X, and Threadripper are compatible with TensorFlow and other deep learning frameworks and libraries. In general, is it better to get an Intel CPU? Does it matter much?
Tim Dettmers says
Threadripper is a fantastic CPU for deep learning with its 64 PCIe lanes; I am using it too! It's fully compatible with all deep learning software, and the advantage is that you can run (multiple) NVMe SSDs without a deep learning penalty to your GPUs.
The laptop with the GTX 1050 can get you started. You will be able to run code and algorithms, but mostly on small datasets. A laptop with a GTX can come in handy later since you can program and test code locally on your laptop and later run it on a big machine. This is, for example, nice when you are at airports with shitty wifi.
P says
Thanks Tim. Do I have to worry that the Threadripper CPUs do not support as many instruction sets as Intel's CPUs? Is compatibility with AVX, AVX2, and AVX-512 important? I think I read somewhere that Python or TensorFlow can take advantage of these instruction sets. A few months ago, there were some issues running Ubuntu on Threadripper systems. For example, some users complained about PCIe-related compatibility issues. Since we use GPUs, I am concerned that these issues might affect us. Have these issues been resolved?
Tim Dettmers says
I did not experience any problems so far. If you do not get a GPU you should pay attention to CPU performance, but I would not worry about it if you get a GPU, since you will use the CPU mostly for prototyping if at all.
If I look at benchmarks of general matrix-matrix multiplication then the Threadripper fares well even without those features though — so it should be fine.
P says
Hello Tim, thanks. In the case of using only one 1080 Ti, is it true that even the lowest-end Threadripper 1900X performs better than an Intel 7900X, which supports AVX-512? How about the case of using two 1080 Tis? I am trying to choose between the two CPUs. Which do you recommend?
Marcus says
Hi Tim,
thanks a lot for this interesting article.
What is your opinion about other NVIDIA models like Quadro or Tesla? I have to assemble a system (or some systems) for development in a research group for different purposes. Tasks will be: object recognition / computer vision for collision avoidance, mapping of the environment, predictive maintenance based on systems monitoring data and, to bring everything together, decision making / reasoning based on all of the above inputs to create "intelligent" behaviour within a graphically intense simulation. On top of that, the developed functions should be transferred to a mobile robot equipped with a Jetson TX2 board for real-world testing (MATLAB/Simulink).
Is there a chance to find one CUDA card configuration compatible with all these tasks? I hope this idea/question is not absolutely hare-brained; if so, I apologize as a total newbie to this field 🙂 .
Many greetings,
Marcus
Tim Dettmers says
Tesla and Quadro cards are rebranded GTX cards; they are not really worth the price. However, the CUDA license agreement states that you are not allowed to use GTX cards in “data centers”. So you might want to avoid building a data center and instead give each person in the research group a GTX GPU.
The applications that you mentioned are mainly robotics and computer vision. Some of them might require high memory; others might not be compute and memory intensive. It's difficult to say from this high-level description, but it might be best to aim for a GTX 1080 Ti or a Titan Xp. With these cards you will fulfill the requirements, and since GTX 1070/1080 cards are expensive due to crypto-mining, a 1080 Ti and a Titan Xp are good choices in themselves, so I would go with either of those cards.
Marcus says
Thanks a lot for your tips! I'll do it this way.
dougr says
Tim,
You have previously recommended against the Titan Xp due to the price delta… however, with the current availability and pricing of 1080 Ti cards, most of the time the price differential between a 1080 Ti and a Titan Xp (direct from NVIDIA) is less than $200. Independent of the current limbo status, if you needed a GPU in the near term, what premium would you place on a Titan Xp vs a 1080 Ti? I'm patient enough to wait for any NVIDIA announcements at GTC in late March, but lacking something there, I do not think I have the patience to wait for AMD to come around on the software side.
On a related note, there are a few articles out there discussing GDDR6 in the next round of consumer cards and estimating memory bandwidth based on predicted bus widths and GDDR6 speeds. Most seem to assume that whenever NVIDIA gets around to releasing their next architecture, the initial top-end card will have 16GB of GDDR6 and memory bandwidth nearing 600GB/s (and well above that if they use a wider bus in the later enthusiast card). Do you think those are reasonable assumptions given the market dynamics for consumer cards (gaming, crypto, ML)?
Always impressed with your thoughts on the state of the market, Tim… great resource.
Tim Dettmers says
I agree, if you can snatch a cheap Titan Xp it is well worth it. There also seem to be recent announcements that no new GPU will be introduced at the major GPU conferences. The concept will be introduced, but not the GPU itself. NVIDIA's strategy is likely to introduce a gaming GPU so that deep learning folks have to buy the Titan V if they want to do deep learning. If this is really true, then investing in a Titan Xp makes a lot of sense.
I think your predictions for GDDR6 could make sense. GDDR6 is probably also cheaper to produce than HBM2 memory, so I expect that we will see a lot of cards with it, but as mentioned above, it might be that we see no deep learning cards with it. We will see in the next months.
dougr says
Ended up picking up a water-cooled 1080 Ti in a bundle with a CPU cooler from EVGA for $920… might use the cooler, or just resell it. $300 is too much of a premium for the Xp for me.
Amin says
Hello, thanks for the excellent information.
I want to know how much VRAM I need.
Software:
Ubuntu 16.04 , Caffe , Mobilenet SSD
In CPU-only mode, training consumes 6-8 GB of RAM.
But I have a weak system (with core i5 2400 and 10 GB 1333 RAM) right now and I want to find a good system build.
Training speed = 75 iterations/hour!
The GTX 690 seems to be a good choice because of its 384 GB/s memory bandwidth, its roughly 3,000 CUDA cores, and its low price.
But I'm not sure about its 4 GB of VRAM.
I would also be very thankful if you could give me an estimate of the training speed with this build:
GTX 690
Core i7 6700k (4-4.2 GHz, 8 MB cache)
16 GB DDR4 2133 dual
Tim Dettmers says
I would get a more modern GPU. For cheap GPUs I would recommend a GTX 1050Ti 4GB with 16-bit training, GTX 1060 6GB or a GTX 1070 (with 8GB). I am not sure where the memory consumption comes from, but you want to make sure that the dataset is not stored on the GPU.
The new CPU and RAM will hardly affect training performance at all (maybe 10-15% faster) and your money is better invested in buying a better GPU.
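The point about keeping the dataset off the GPU is easiest to show in a short PyTorch-style sketch (the tensors below are hypothetical placeholders; the same principle applies to a Caffe data layer that stays on the CPU):

# Sketch: keep the full dataset in CPU RAM and move only the current batch to the GPU.
# The tensors here are hypothetical placeholders.
import torch

dataset = torch.randn(2000, 3, 224, 224)          # ~1.2 GB, stays in CPU RAM
labels = torch.randint(0, 21, (2000,))

device = torch.device("cuda")
batch_size = 16
for i in range(0, len(dataset), batch_size):
    batch = dataset[i:i + batch_size].to(device)  # only ~10 MB per step lives in VRAM
    target = labels[i:i + batch_size].to(device)
    # ... forward/backward pass goes here ...
# By contrast, dataset.to(device) up front would put the whole ~1.2 GB into VRAM.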
Amin says
Thanks.
I don't have a GPU yet.
(system = core i5 2400 and 10 GB 1333 RAM)
In this system, training consumes 6-8 GB of RAM.
I want to know, if I buy a new system that has a GPU, how much VRAM will be consumed or needed?
Because I want to decide about GPU model.
I don’t want to change software and network parameters. (Except using CUDA!)
Will all of this RAM consumption go to VRAM?
I decided to buy Lenovo Y900 RE gaming tower.
GTX 1080
Core i7 6700k
16 GB DDR4 2133
Bill Ross says
Would a GTX 1030 be a valid option? I have a 1080 Ti, but it is hardly used at all with some of my models (0-13% volatile GPU utilization for predictions), and if cuDNN is supported, I wonder if this would do for predictions, which run for ~8-10 days:
input shape: 3794
model file size: 93040732
keras/tensorflow
E.g. I loaded/ran 83 models in one session on the 1080 ti before running out of memory (forgot to add keras.clear_session()), running 100M cases through each.
I have a dedicated python data prep thread with a queue 2 deep, bringing the process to 150% .
KR says
Hi Tim,
What are your thoughts on putting 7 GPUs in a single machine, or are 4 GPUs the absolute limit? To make it work you would need a motherboard with 7 PCIe slots, water cooling to make the GPUs fit in a single slot, and two power supplies.
See the link below:
http://rawandrendered.com/Octane-Render-Hepta-GPU-Build
Is this a bad idea?
Thanks for the great post.
Tim Dettmers says
I assume that you want to parallelize across 7 cards. If not, there is no reason to get a 7 GPU computer, as there are many hardware problems and bottlenecks associated with this. The build that you linked is not good for deep learning because you have few PCIe lanes for parallelization and it would be very slow. If you want to get a 7 GPU system, the only way to go currently is with an EPYC CPU and this motherboard: http://b2b.gigabyte.com/Server-Motherboard/AMD-EPYC-7000
Even then there might be problems, but the motherboard above is the only situation where you have enough PCIe lanes and avoid issues with PCIe switches and software. With other CPUs/motherboards you would need 2 CPUs to get the required PCIe lanes for good parallelization performance, and you cannot do CUDA-aware MPI parallelization across the different PCIe root complexes that 2-CPU designs use.
In general, I would only advise you to build such a system if you really need parallelization across 7 GPUs. If you can do your work with 4 GPUs I would definitely recommend going with a 4 GPU setup as there are no software/hardware problems with that — a 4 GPU system is very straightforward.
Nikos Tsarmpopoulos says
The MZ31-AR0-rev-10 doesn’t appear to feature 7x PCI-E x16 slots. It features five full length x16 PCI-E ports, and two x8 ports.
jungju says
Hello, I have one question. Is it possible to do deep learning on a computer with only one PCIe slot available (3.0 x16)? (I've bought a motherboard for my web server which has only one PCIe slot.)
Tim Dettmers says
Yes this will work without any problem.
Jagan Babu says
Dear Tim,
I am currently working on deep learning model training for image recognition.
I want to create my own work environment and am stuck choosing between the MSI 1080 Ti X Trio and the EVGA 1080 Ti FTW3 with ICX technology. I will be running huge data models (>250GB). I am looking for a robust card without burn-related issues, and the cost factor does not matter. Please advise.
Tim Dettmers says
Those cards are very much the same. If you have benchmarks about how their fans perform for cooling, go with the cooler fans. Other than that, there will be no difference. If you are worried about temperatures, you might want to invest in a liquid cooled GPU. They will be much, much cooler (and also faster!).
Jagan Babu says
Thank you Tim…:)
Dave says
Hello Tim, how does the performance of 2-3 NVIDIA 1080 Ti cards installed in a computer compare with one of NVIDIA's new Titan V cards installed in the same computer?
Tim Dettmers says
For LSTMs the 3x GTX 1080Ti will be faster, for convolution the Titan V. Overall I would prefer the GTX 1080Ti as it is much easier to run multiple networks at the same time. This does not work well on a single GPU and is slow.
Lety says
Hi Tim
I'm planning to buy a 1070 Ti, any opinion? It's not in your benchmark analysis.
Thanks.
Tim Dettmers says
Since GPUs can be expensive due to cryptocurrency mining I would keep my eyes open for a cheap GTX 1070/1070Ti/1080 and grab the first cheap card that you can find. All these cards have similar cost/performance. If you can find a used GTX 1070/1080 this might be a bit cheaper and I personally would prefer such a card compared to a brand new GTX 1070Ti.
Bruce Wang says
Brilliant analysis and conclusion.
Thanks a million.
Mathieu says
Hello, and many thanks for your article, which is very useful and relevant, especially the Xeon Phi part (for me) and Intel's behavior (BTW, did you have a premium account which gives you the VIP/direct feedback? I have it and they are rather fast and efficient!). A few of the comments are also very interesting. I'm doing private research in computer vision and AI, and I practice HPC daily.
Do you think you might update your article with the Quadro P6000, which is very relevant when we need to hold all the data on the device? And then the Titan V, which is expensive too, and at the same time not that expensive if the "business" is using FP16 and lots of tensor math? For example, it might be interesting to compare 1-2 Titan Vs with 3-4 1080 Tis, for deep learning, yes, but also for other algorithms which scale well without NVLink or even peer-to-peer communication.
Best
Tim Dettmers says
I think these cards are too expensive for most people, so I will not cover them here. I also do not have enough data on these cards to compare them directly.
Asaf Oron says
Thank you very much Tim for this post.
I am a bit confused: when I look for a certain card, e.g. a GTX 1060, I find the card from multiple vendors, e.g. Zotac or Asus. Some of them have the NVIDIA logo on the box, some don't. Are these the same cards? How do I choose?
I'm looking for my first GPU for a Windows PC capable of running convnets. I have a budget of around $300. What would you recommend?
Tim Dettmers says
Yes, they are the same. The vendors buy GPUs from NVIDIA and put their own drivers, fans, etc. on them, but essentially it is still an NVIDIA GPU in any case.
Sunny says
Hi Tim,
Looking to build a deep learning PC, and I am pretty new to the hardware side of things. Any comments on the recently released NVIDIA Titan V? I am specifically interested in a setup with two Titan V GPUs (with a Xeon or Ryzen), but have been reading that this card has SLI/NVLink disabled. Is it still worthwhile to shell out $6K to have a powerful deep learning setup that will be viable for at least a few years?
Tim Dettmers says
I do not recommend buying Titan Vs. They are too cost-ineffective. I will write a new blog post about this topic in the next few days.
Amir H. Jadidinejad says
Would you please review the functionality of the new Titan V in the field of deep learning and compare it with others such as 1080 ti?
EricPB says
Hi Tim,
NVIDIA just made a surprise announcement yesterday: they are releasing, for immediate purchase, a Titan V priced at $3000 with specs almost identical to the Tesla V100 PCIe ($10,000).
https://www.anandtech.com/show/12135/nvidia-announces-nvidia-titan-v-video-card-gv100-for-3000-dollars
For $3000, you can get either four units of a GTX 1080 Ti ($750 apiece) or a single Titan V.
Which option would you go for, for deep learning?
Cheers,
E.
Haider says
Hi Tim,
I have one 1080 Ti and want to buy another one.
I am thinking of adding a third GPU, a 1060 6GB, so that the PC will not be sluggish when I want to use it for coding or other purposes while training neural networks on the other two 1080 Ti GPUs. And perhaps I would use this third GPU for NN training as well when I don't use the PC, so I can train another NN architecture in the meantime.
I am new to deep learning, but when I used my only 1080 Ti for 3ds Max rendering using V-Ray RT (GPU rendering), it became annoyingly slow.
The question is how many lanes my CPU supports:
The Core i7-3770K spec says up to either 1x16 or 2x8 or 1x8 & 2x4, and it is not clear what the maximum number of lanes is, but it seems to be 16 lanes.
Motherboard GA-Z77X-D3H specs says (under the Expansion slots):
1 x PCI Express x16 slot, running at x16 (PCIEX16)
* For optimum performance, if only one PCI Express graphics card is to be installed, be sure to install it in the PCIEX16 slot.
1 x PCI Express x16 slot, running at x8 (PCIEX8)
* The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot is populated, the PCIEX16 slot will operate at up to x8 mode.
1 x PCI Express x16 slot, running at x4 (PCIEX4)
* The PCIEX4 slot shares bandwidth with the PCIEX1_1/2/3 slots. The PCIEX1_1/2/3 slots will become unavailable when a PCIe x4 expansion card is installed.
3 x PCI Express x1 slots
(The PCIEX4 and PCIEX1 slots conform to PCI Express 2.0 standard.)
This is a bit confusing. Now, can I connect the two GTX 1080 Ti GPUs with PCIe 3.0 running at x8, and the GTX 1060 with PCIe 2.0 running at x4? Or is the maximum 16 lanes, so there is no room for the third GPU?
Perhaps the Motherboard Block Diagram is more enlightening at page 8 here:
http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
It seems the CPU indeed has a maximum of 16 PCIe 3.0 lanes, but the extra 4 lanes are not coming from the CPU's PCI Express bus. They are coming from the Intel Z77 chip, which has its own PCIe 2.0 bus running at x4.
What do you think?
Many thanks!
References:
https://ark.intel.com/products/65523/Intel-Core-i7-3770K-Processor-8M-Cache-up-to-3_90-GHz
https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#sp
https://www.gigabyte.com/Motherboard/GA-Z77X-D3H-rev-11#support-manual
http://download.gigabyte.eu/FileList/Manual/mb-manual_ga-z77x-d3h_v1.1_e.pdf
Jack Marck says
Hi Tim,
I'm interested in deep reinforcement learning for robotics. Since the training episodes for this type of work are often done in real time, is there any tangible benefit to training on a beefy GPU?
Jack Marck says
What are your thoughts on the recent availability of Volta cards via AWS? By my estimate, it would take about 245hrs on AWS On-Demand to break even with a 1080ti.
deepuser says
Hi Tim, Excellent article! Thanks very much for the detailed analysis!
We are looking to build a machine for running deep learning algorithms for computer vision on relatively large datasets (many terabytes). We are deciding between a machine with 4 Titan Xps and a machine with, say, one P100. Some background: we envision this machine being used not for experimentation or for training different models on different GPUs; it will be used to train "a" model and do inference. So, for us, 4 GPUs are useful if we can use multi-GPU training/inference (data/model parallelization) and if it gives us a significant performance improvement. The other option is to have a single high-performance GPU like the P100. But we certainly don't require double precision accuracy. This being the case, do you have any suggestions on either of the two options or something else altogether? Thanks again for your advice!
Scott says
Tim, what about the latest Amazon AWS EC2 P3 instances based on the Volta V100?
Is their price/performance competitive?
ArtVandelay says
Hello Tim,
Thanks for this write up, it’s been very helpful.
You say that for computer vision, a Titan Xp would be recommended for some of the leading models, and also for computer vision with datasets larger than 250GB (or something along those lines).
So I'm asking with regard to DeepMind and their PySC2 research. They are aimed at image-based machine learning AIs.
So would a Titan Xp be better here? Or would a 1080 Ti suffice?
Also, I imagine you know of Matthew Fisher and his blog. He’s done some very interesting things with AI and computer vision.
I am interested in doing what he did with SC2: intercept the Direct3D 9 API to allow his AI to interact with the game. For something like this, would a 1080 Ti be okay? Or would a Titan Xp make a huge difference?
Thank you
sharath says
Hi Tim,
I want to purchase a good GPU specifically for natural language processing computations. Could you suggest a good GPU from both the NVIDIA and AMD lineups that can handle a good amount of NLP tasks?
Also, please suggest the best API (OpenCL, OpenGL, Vulkan) for NLP computations on AMD GPUs for both Windows and Linux.
Paco says
Hello Tim, really useful stuff on your site. Congratulations. I have been reading around that the Tesla V100 has 900GB/s memory bandwidth, but when I go to AWS to read the P3 specs it says EBS bandwidth 1.5 - 14 Gbps. Is that the difference in performance due to virtualization that you were talking about, or is this another metric? Many thanks.
ANkit Dhingra says
Hi, thanks for this well-written article. It is really helpful.
I am a beginner in deep learning and am going to start building deep learning models soon. The data is not too big, rather small.
I have a 16 Core Xeon 2.4GHz CPU, 32GB RAM.
It comes preloaded with Nvidia Quadro k620 (2GB) GPU card.
Will this small GPU work on small data and small deep networks for our development phase, or do we have to buy a new GPU right now? By next month we expect a new machine with a better GPU.
Thanks in Advance
Ankit Dhingra
Arturo Pablo Rocha Ramirez says
Which card should I choose to start in deep learning: an old/new GTX 960/970 or a new GTX 1050 Ti? The prices of the three cards are very similar. Or could I choose an older generation like the 770 or 780?
Thanks for this good post!
Iago Burduli says
Will 2x 1080 Ti have a bottleneck because of the 28 PCIe lanes of the Intel Core i7-7820X, as they will work in a 16+8 lane scheme?
Tim Dettmers says
It's not a bottleneck, but if you run parallel algorithms, you will see a decrease in performance of 0-10% compared to a 16/16 lane setup.
Robin Colclough says
You can upgrade to 64 PCIe lanes by using the AMD Threadripper 1900X, which now retails for US$449 on Amazon.
Combine this with AMD Radeon Instinct AI cards, and the Cuda-cross library, and the performance/$ cost of AI reduces by over 50%.
John A says
thanks for the deep information 🙂
Tim Dettmers says
You are welcome! 🙂
Marian says
Thanks for the great article on GPU selection. Any chance you could offer updated advice on system building? About a year ago, I built a system with 4 1080s based on your recommendations. Now I am interested in building an 8 GPU system, but it seems this will require some rather specialized hardware, and I am not sure what to buy or if it is even practical for an individual or small company to build this kind of machine. Also, I am curious about the new AMD processors that may have more PCI lanes and whether they would be preferred over Xeon now. Also, it would be great to see an article about getting visibility into PCI bus usage and whether it is a bottleneck or not. This is something I have often wondered about.
Tim Dettmers says
Thanks for your feedback. Indeed these are some issues which have been raised again and again and I should write about it to clarify these points.
To give you a few short answers:
– I do not recommend 8 GPU systems; for optimal performance, they require special code which no framework supports. Some frameworks should work with less optimal algorithms without a problem though (for example, roughly a 6.5x speedup instead of a 7.9x speedup for convnets). Another problem with such a system is the power supply (multiple PSUs are usually needed) and cooling (special cooling solutions are often needed). My recommendation is to go with independent 4 GPU systems instead of 8 GPU systems.
– AMD processors do not have any big advantage over Intel CPUs; the extra lanes make almost no difference. Pay attention to cooling solutions first. If you have a liquid cooling solution and you parallelize most of your workloads, only then it may make sense to look into more lanes. Usually, parallel algorithms have a much larger effect than lanes. Good algorithm + few lanes > bad algorithm + maximum lanes.
Marc-Philippe Huget says
Dear Tim,
Have you checked posts from people working on developing Ethereum mining rigs? They consider up to 8-12 GPUs (or claim it is possible) in a single machine. This could be effective for multi-GPU simulations. Any thoughts about that?
Regards,
mph
new_dl_learner says
Hello Tim, how do the mobile versions of the NVIDIA 1040, 1050, and 1060 with/without Ti perform? Are they not as good as the desktop versions? I am considering getting a new laptop, such as a Surface Book 2 or a Lenovo. Thanks
Maximilian says
Hello,
Thank you very much for the tutorial!
I am sorry if I have missed the point but I am a skint student haha.
So I have three options for a GPU under £200: a GTX 1060 6GB for ~£200, a 980 Ti 6GB for a similar price (but a much better spec), or a 780 Ti 2GB for ~£60.
I am planning on running a facial detection model on the GPU.
What are the positives and negatives of these GPUs? Why wouldn't I get the 780 Ti, which has the best CUDA score (https://browser.geekbench.com/cuda-benchmarks) for the price? Do I need more than 2GB (the model is unlikely to be anywhere near that big?), and what is wrong with it being so old?
Any advice would be greatly appreciated.
Thank you
Tim Dettmers says
2 GB can be a bit short on the memory side. There are some sophisticated facial detection models, and they might need more RAM than that. I would go for the GTX 980 Ti if I were you. If you just run inference on pretrained models, though, 2GB should be enough, and then a GTX 780 Ti would be a good choice.
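A quick way to sanity-check whether a given pretrained model even fits is to sum its parameter memory; activations come on top of this, so treat the result as a lower bound. A minimal sketch (ResNet-50 is just an example model, not a facial detection network):

# Estimate the parameter memory of a pretrained model (a lower bound on VRAM use;
# activations, and for training also gradients and optimizer state, come on top).
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # example model only
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.0f} MB")    # roughly 98 MB for ResNet-50 in FP32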
huichen says
HI Tim
Comparing the specs of the GTX 780 and the GTX 1050 Ti,
including bandwidth, bus width, CUDA cores, and TFLOPS,
the GTX 780 is always better than the GTX 1050 Ti.
And they have almost the same price!
But in your opinion, the GTX 780 < GTX 1050 Ti in DL?
Thanks!
Tim Dettmers says
Good catch! The problem is that you cannot compare bandwidth, bus width, and CUDA cores and say "X is better than Y". The architecture of the chip determines how efficiently these things can be used, and the GTX 10 architecture is much more efficient than the GTX 7 architecture. However, I have no hard data, and it might still be that the GTX 780 > GTX 1050 Ti, but unless somebody has some deep learning benchmarks on, say, ResNet performance, I would still assume that GTX 1050 Ti > GTX 780.
Kenneth Hesse says
Hi Tim,
Thanks for the post.
I have been studying machine learning theory for the past three months and I'm itching to start experimenting. I'll start with feed-forward networks, but I am most interested in sequence learning using recurrent networks. I want to experiment with single- and double-layered networks using LSTM cells. I want to mess around with bidirectional network architectures as well. But from reading UCSD's critical review, I'm left with the impression that researchers mostly use Titan Xps for these types of things (Lipton, Berkowitz, Elkan [2015]). If so, I'll focus my experimentation on feed-forward networks for now and wait for grad school to mess around with sequence-capable architectures. Do you know if the GeForce 1080 will be sufficient for training recurrent networks?
Here is my build so far:
Case: Corsair 540 Mid-Tower
Motherboard: MSI