Hardware Archives — Tim Dettmers

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

Tim Dettmers — Mon, 30 Jan 2023 15:50:00 +0000

Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.

This blog post is designed to give you different levels of understanding of GPUs and the new Ampere series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ampere series, you can skip right to the performance and performance per dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post and the content surrounding it explains the details of what makes up GPU performance.

(2) If you worry about specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.

(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.

Overview

This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I discuss the unique features of the new NVIDIA RTX 40 Ampere GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.

How do GPUs work?

If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you to undstand cases where are GPUs fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:

Read Tim Dettmers‘ answer to Why are GPUs well-suited to deep learning? on Quora

This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.

The Most Important GPU Specs for Deep Learning Processing Speed

This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by memory bandwidth of a GPU, the cache hierachy, and only then FLOPS of a GPU.

Tensor Cores

Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication Tensor Cores are very useful. In fast, they are so powerful, that I do not recommend any GPUs that do not have Tensor Cores.

It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show you a simple example of A*B=C matrix multiplication, where all matrices have a size of 32×32, what a computational pattern looks like with and without Tensor Cores. This is a simplified example, and not the exact way how a high performing matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.

To understand this example fully, you have to understand the concepts of cycles. If a processor runs at 1GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operations needs to wait for the next operation to finish. This is also called the latency of the operation.

Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. These numbers are for Ampere GPUs, which have relatively slow caches.

Global memory access (up to 80GB): ~380 cycles
L2 cache: ~200 cycles
L1 cache or Shared memory access (up to 128 kb per Streaming Multiprocessor): ~34 cycles
Fused multiplication and addition, a*b+c (FFMA): 4 cycles
Tensor Core matrix multiply: 1 cycle

Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.

For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.

To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.

Matrix multiplication without Tensor Cores

If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about five times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 floats into a shared memory tile can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.

To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:

200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles

Let’s look at the cycle cost of using Tensor Cores.

Matrix multiplication with Tensor Cores

With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Cores operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with 1 memory transfers (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Cores matrix multiplication, in this case, is:

200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.

Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.

This example is simplified, for example, usually each thread needs to calculate which memory to read and write to as you transfer data from global memory to shared memory. With the new Hooper (H100) architectures we additionally have the Tensor Memory Accelerator (TMA) compute these indices in hardware and thus help each thread to focus on more computation rather than computing indices.

Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)

The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support to perform asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. the TMA unit combines asynchronous copies and index calculation for read and writes simultaneously — so each thread no longer needs to calculate which is the next element to read and each thread can focus on doing more matrix multiplication calculations. This looks as follows.

The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats.

As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile. This means, the second read takes only 200 – 34 – 1 = 165 cycles.

Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.

165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.

Which accelerates the matrix multiplication by another 15%.

From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is the by far the largest cycle cost for matrix multiplication with Tensor Cores, we would even have faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).

Memory Bandwidth

From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for the large neural networks about 50% of the time, Tensor Cores are idle.

This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, The A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.

L2 Cache / Shared Memory / L1 Cache / Registers

Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.

To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is.

While logically, L2 and L1 memory are the same, L2 cache is larger and thus the average physical distance that need to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.

For matrix multiplication we can use this hierarchical separate into smaller and smaller and thus faster and faster chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often for short just tiles.

We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.

Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.

Each tile size is determined by how much memory we have per streaming multiprocessor (SM) and how much we L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:

Volta (Titan V): 128kb shared memory / 6 MB L2
Turing (RTX 20s series): 96 kb shared memory / 5.5 MB L2
Ampere (RTX 30s series): 128 kb shared memory / 6 MB L2
Ada (RTX 40s series): 128 kb shared memory / 72 MB L2

We see that Ada has a much larger L2 cache allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not other Us). As such, data needs to be loaded from global memory only once and then data is available throught the L2 cache, making matrix multiplication about 1.5 – 2.0x faster for this architecture for Ada. For larger models the speedups are lower during training but certain sweetspots exist which may make certain models much faster. Inference, with a batch size larger than 8 can also benefit immensely from the larger L2 caches.

Estimating Ada / Hopper Deep Learning Performance

This section is for those who want to understand the more technical details of how I derive the performance estimates for Ampere GPUs. If you do not care about these technical aspects, it is safe to skip this section.

Practical Ada / Hopper Speed Estimates

Suppose we have an estimate for one GPU of a GPU-architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 vs H100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and the number of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.

To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.

Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.

As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.

Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:

SE-ResNeXt101: 1.43x
Masked-R-CNN: 1.47x
Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x

Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).

The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.

Possible Biases in Estimates

The estimates above are for H100, A100 , and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.

As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.

Currently, no degradation for Ada GPUs are known, but I update this post with news on this and let my followers on twitter know.

Advantages and Problems for RTX40 and RTX 30 Series

The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use-feature as they provide the same performance boost as Turing does but without any extra programming required.

The Ada RTX 40 series has even further advances like 8-bit Float (FP8) tensor cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables in the RTX 40 can be easily prevented by connecting the power cable correctly.

Sparse Network Training

Ampere allows for fine-grained structure automatic sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 elements of these 4 to be zero. Figure 1 shows how this could look like.

Figure X: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. Figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=249%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=321%2C387&ssl=1" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=258%2C311&ssl=1" alt="Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool's GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA." class="wp-image-935" width="258" height="311" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?w=321&ssl=1 321w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=249%2C300&ssl=1 249w" sizes="(max-width: 258px) 100vw, 258px" data-recalc-dims="1" />

Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

When you multiply this sparse weight matrix with some dense inputs, the sparse matrix tensor core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the tensor core which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.

Figure X: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. Figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=300%2C181&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=1024%2C619&ssl=1" width="1024" height="619" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&ssl=1" alt="Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. " class="wp-image-934" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=300%2C181&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=768%2C464&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?w=1055&ssl=1 1055w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.

Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=300%2C145&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1024%2C493&ssl=1" width="1024" height="493" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&ssl=1" alt="Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post." class="wp-image-779" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=300%2C145&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=768%2C370&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?w=1096&ssl=1 1096w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post.

While this feature is still experimental and training sparse networks are not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.

Low-precision Computation

In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.

Figure X: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent large numbers and small numbers with high precision.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=300%2C93&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=869%2C268&ssl=1" width="869" height="268" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=869%2C268&ssl=1" alt="Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision." class="wp-image-941" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?w=869&ssl=1 869w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=300%2C93&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=768%2C237&ssl=1 768w" sizes="(max-width: 869px) 100vw, 869px" data-recalc-dims="1" />

Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision.

Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, your gradients explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling where you multiply the loss by a small number before backpropagating to prevent this gradient explosion.

The BrainFloat 16 format (BF16) uses more bits for the exponent such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is significant digits, but gradient precision is not that important for learning. So what BF16 does is that you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability by using the BF16 format as a slight loss of precision.

What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With 32-bit TensorFloat (TF32) precision, you get near FP32 stability while giving the speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!

Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.

Fan Designs and GPUs Temperature Issues

While the new fan design of the RTX 30 series performs very well to cool the GPU, different fan designs of non-founders edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution to this is to use PCIe extenders to create space between GPUs.

Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This has been running with no problems at all for 4 years now. It can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090 and spread them with PCIe extenders within the case. With this, you might solve both the space issue and cooling issue for a 4x RTX 4090 setup with a single simple solution.

4x GPUs with PCIe extenders

" data-image-caption="

4x GPUs with PCIe extenders

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=225%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=768%2C1024&ssl=1" width="768" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders.jpg?resize=768%2C1024&ssl=1" alt="Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 2 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs." class="wp-image-861" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=768%2C1024&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=225%2C300&ssl=1 225w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1152%2C1536&ssl=1 1152w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1536%2C2048&ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?w=1920&ssl=1 1920w" sizes="(max-width: 768px) 100vw, 768px" data-recalc-dims="1" />

Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 4 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs.

3-slot Design and Power Issues

The RTX 3090 and RTX 4090 are 3-slot GPUs, so one will not be able to use it in a 4x setup with the default fan design from NVIDIA. This is kind of justified because it runs at over 350W TDP, and it will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.

It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there are currently few standard desktop PSUs above 1600W on the market. If you get a server or cryptomining PSUs, beware of the form factor — make sure it fits into your computer case.

Power Limiting: An Elegant Solution to Solve the Power Problem?

It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of their standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to build a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setups, cooling, and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.

Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

" data-image-caption="

Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=298%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=1017%2C1024&ssl=1" width="1017" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&ssl=1" alt="Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent." class="wp-image-933" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&ssl=1 1017w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=298%2C300&ssl=1 298w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=150%2C150&ssl=1 150w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=768%2C773&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?w=1187&ssl=1 1187w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I choose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.

Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

" data-image-caption="

Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

" data-medium-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" data-large-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" width="853" height="703" src="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" alt="Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer)." class="wp-image-939"/>

Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.

RTX 4090s and Melting Power Connectors: How to Prevent Problems

There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem and the problem occured due to user error. Here a video that shows that the main problem is that cables were not inserted correctly.

So using RTX 4090 cards is perfectly safe if you follow the following install instructions:

If you use an old cable or old GPU make sure the contacts are free of debri / dust.
Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
Test for good fit by wiggling the power cable left to right. The cable should not move.
Check the contact with the socket visually, there should be no gap between cable and socket.

8-bit Float Support in H100 and RTX 40 series GPUs

The support of the 8-bit Float (FP8) is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs it allows you to load the data for matrix multiplication twice as fast, you can store twice as much matrix elements in your caches which in the Ada and Hopper architecture are very large, and now with FP8 tensor cores you get 0.66 PFLOPS of compute for a RTX 4090 — this is more FLOPS then the entirety of the worlds fastest supercomputer in year 2007. 4x RTX 4090 with FP8 compute rival the faster supercomputer in the world in year 2010 (deep learning started to work just in 2009).

The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate non-sense during inference. I have written a paper about the emergence of instabilities in large language models and I also written a more accessible blog post.

The main take-way is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision everything works just fine.

Main results from my work on 8-bit matrix multiplication for Large Language Models (LLMs). We can see that the best 8-bit baseline fails to deliver good zero-shot performance. The method that I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.

But Int8 was already supported by the RTX 30 / A100 / Ampere generation GPUs, why is FP8 in the RTX 40 another big upgrade? The FP8 data type is much more stable than the Int8 data type and its easy to use it in functions like layer norm or non-linear functions, which are difficult to do with Integer data types. This will make it very straightforward to use it in training and inference. I think this will make FP8 training and inference relatively common in a couple of months.

If you want to read more about the advantages of Float vs Integer data types you can read my recent paper about k-bit inference scaling laws. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that bit-by-bit, the FP4 data type preserve more information than Int4 data type and thus improves the mean LLM zeroshot accuracy across 4 tasks.

4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=300%2C226&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=1024%2C773&ssl=1" width="1024" height="773" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&ssl=1" alt="" class="wp-image-1151" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=300%2C226&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=768%2C580&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?w=1159&ssl=1 1159w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.

Raw Performance Ranking of GPUs

Below we see a chart of raw relevative performance across all GPUs. We see that there is a gigantic gap in 8-bit performance of H100 GPUs and old cards that are optimized for 16-bit performance.

Shown is raw relative performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=300%2C295&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=1024%2C1006&ssl=1" width="1024" height="1006" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&ssl=1" alt="" class="wp-image-1160" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=300%2C295&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=768%2C754&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1536%2C1509&ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?w=1703&ssl=1 1703w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Shown is raw relative transformer performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.

For this data, I did not model 8-bit compute for older GPUs. I did so, because 8-bit Inference and training are much more effective on Ada/Hopper GPUs because of the 8-bit Float data type and Tensor Memory Accelerator (TMA) which saves the overhead of computing read/write indices which is particularly helpful for 8-bit matrix multiplication. Ada/Hopper also have FP8 support, which makes in particular 8-bit training much more effective.

I did not model numbers for 8-bit training because to model that I need to know the latency of L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance can well be 3-4x of 16-bit training performance if the caches are as fast as rumored.

But even with the new FP8 tensor cores there are some additional issues which are difficult to take into account when modeling GPU performance. For example, FP8 tensor cores do not support transposed matrix multiplication which means backpropagation needs either a separate transpose before multiplication or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weight when I experimented with Int8 training in my LLM.int8() project and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like a transposed matrix multiplication for tensor cores can affect performance.

For old GPUs, Int8 inference performance is close to the 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant if you have relatively large models with 175B parameters or more. If you are interested in 8-bit performance of older GPUs, you can read the Appendix D of my LLM.int8() paper where I benchmark Int8 performance.

GPU Deep Learning Performance per Dollar

Below we see the chart for the performance per US dollar for all GPUs sorted by 8-bit inference performance. How to use the chart to find a suitable GPU for you is as follows:

Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers)
While 8-bit inference and training is experimental, it will become standard within 6 months. You might need to do some extra difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.
Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need.

We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended as they do not have sufficient memory for many use-cases. However, it might be the ideal cards to get started on your deep learning journey. Some of these GPUs are excellent for Kaggle competition where one can often rely on smaller models. Since to do well in Kaggle competitions the method of how you work is more important than the models size, many of these smaller GPUs are excellent for Kaggle competitions.

The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with A6000 Turing). The H100 SXM GPU is also very cost effective and has high memory and very strong performance. If I would build a small cluster for a company/academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I get a good deal on L40 GPUs, I would also pick them instead of A6000, so you can always ask for a quote on these.

Shown is relative performance per US Dollar of GPUs normalized by the cost for a desktop computer and the average Amazon and eBay price for each GPU. Additionally, the electricity cost of ownership for 5 years is added with an electricity price of 0.175 USD per kWh and a 15% GPU utilization rate. The electricity cost for a RTX 4090 is about $100 per year. How to read and interpret the chart: a desktop computer with RTX 4070 Ti cards owned for 5 years yields about 2x more 8-bit inference performance per dollar compared to a RTX 3090 GPU.

GPU Recommendations

I have a create a recommendation flow-chart that you can see below (click here for interactive app from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost effective GPU that still has enough GPU memory for your use-case. You can estimate the GPU memory needed by running your problem in the vast.ai or Lambda Cloud for a while so you know what you need. The vast.ai or Lambda Cloud might also work well if you only need a GPU very sporadically (every couple of days for a few hours) and you do not need to download and process large dataset to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours each day). You can use the example in the “When is it better to use the cloud vs a dedicated GPU desktop/server?” section below to determine if cloud GPUs are good for you.

GPU recommendation chart for Ada/Hopper GPUs. Follow the answers to the Yes/No questions to find the GPU that is most suitable for you. While this chart works well in about 80% of cases, you might end up with a GPU that is too expensive. Use the cost/performance charts above to make a selection instead.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=300%2C244&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=845%2C686&ssl=1" width="845" height="686" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=845%2C686&ssl=1" alt="" class="wp-image-1173" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?w=845&ssl=1 845w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=300%2C244&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=768%2C623&ssl=1 768w" sizes="(max-width: 845px) 100vw, 845px" data-recalc-dims="1" />

Is it better to wait for future GPUs for an upgrade? The future of GPUs.

To understand if it makes sense to skip this generation and buy the next generation of GPUs, it makes sense to talk a bit about what improvements in the future will look like.

In the past it was possible to shrink the size of transistors to improve speed of a processor. This is coming to an end now. For example, while shrinking SRAM increased its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores get smaller, this does not necessarily make GPU faster since the main problem for matrix multiplication is to get memory to the tensor cores which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve raw speed of GPUs is to use more power and more cooling as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.

Chiplets such as used by AMD CPUs are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high speed on-chip network. You can think about them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD’s next generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.

However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units on the Ampere generation (RTX 30 / A100 / RTX 40) or the extension, the Tensor Memory Accelerator (TMA), both reduce the overhead of copying memory from the slow global memory to fast shared memory (caches) through specialized hardware and so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices which is particularly important for 8-bit computation where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.
Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exists, but research looks promising and I expect the first high performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years time.

Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.

We will probably be able to still improve the combination of algorithms + hardware to the year 2032, but after that will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at this point. It might be that you need an RTX 9090 to run run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost for a RTX 9090 and you want to use a laptop + API for image generation and other tasks.

Overall, I think investing into a 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.

Question & Answers & Misconceptions

Do I need PCIe 4.0 or PCIe 5.0?

Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.

Do I need 8x/16x PCIe lanes?

Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.

How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?

You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.

PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!

How do I cool 4x RTX 3090 or 4x RTX 3080?

See the previous section.

Can I use multiple GPUs of different GPU types?

Yes, you can! But you cannot parallelize efficiently across GPUs of different types since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually gradient update).

What is NVLink, and is it useful?

Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.

I do not have enough money, even for the cheapest GPUs you recommend. What can I do?

Definitely buy used GPUs. You can buy a small cheap GPU for prototyping and testing and then roll out for full experiments to the cloud like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/inference on large models only every now and then and spent more time protoyping on smaller models.

What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?

I built a carbon calculator for calculating your carbon footprint for academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in our area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?

I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.

I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.

What do I need to parallelize across two machines?

If you want to be on the safe side, you should get at least +50Gbits/s network cards to gain speedups if you want to parallelize across machines. I recommend having at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s bandwidth. Two EDR cards with cable are about $500 on eBay.

In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).

Is the sparse matrix multiplication features suitable for sparse matrices in general?

It does not seem so. Since the granularity of the sparse matrix needs to have 2 zero-valued elements, every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, which involves that you pool 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.

Do I need an Intel CPU to power a multi-GPU setup?

I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU built, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.

Does computer case design matter for cooling?

No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 C better temperatures, space between GPUs will provide you with 10-30 C improvements. The bottom line, if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.

Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?

Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.

AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors show that some data center card with Tensor Core equivalent is planned for 2020, but no new data emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.

Let’s say AMD introduces a Tensor-Core-like-hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come to a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems, if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.

However, if you solve software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner while I knew much less AMD GPU experts.

In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.

Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.

When is it better to use the cloud vs a dedicated GPU desktop/server?

Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.

Numbers in the following paragraphs are going to change, but it serves as a scenario that helps you to understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.

For the exact point in time when a cloud GPU is more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, there is an additional $0.12 per kWh for electricity. This compares to $2.14 per hour for the AWS on-demand instance.

At 15% utilization per year, the desktop uses:

(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year

So 591 kWh of electricity per year, that is an additional $71.

The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of time during the day), would be about 300 days ($2,311 vs $2,270):

$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311

So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.

You can do similar calculations for any cloud service to make the decision if you go for a cloud service or a desktop.

Common utilization rates are the following:

PhD student personal desktop: < 15%
PhD student slurm GPU cluster: > 35%
Company-wide slurm research cluster: > 60%

In general, utilization rates are lower for professions where thinking about cutting edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.

Version History

2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.
2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.
2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
2018-11-26: Added discussion of overheating issues of RTX cards.
2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
2017-03-19: Cleaned up blog post; added GTX 1080 Ti
2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
2015-02-23: Updated GPU recommendations and memory calculations
2014-09-28: Added emphasis for memory requirement of CNNs

Acknowledgments

I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.

For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.

The post Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning appeared first on Tim Dettmers.

A Full Hardware Guide to Deep Learning

Tim Dettmers — Sun, 16 Dec 2018 18:25:41 +0000

Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.

Over the years, I build a total of 7 different deep learning workstations and despite careful research and reasoning, I made my fair share of mistake in selecting hardware parts. In this guide, I want to share my experience that I gained over the years so that you do not make the same mistakes that I did before.

The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.

GPU

This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is just the heart of deep learning applications – the improvement in processing speed is just too huge to ignore.

I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.

For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).

Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bits, can train models which are twice as big with the same memory compared to GTX cards. As such RTX cards have a memory advantage and picking RTX cards and learn how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:

Research that is hunting state-of-the-art scores: >=11 GB
Research that is hunting for interesting architectures: >=8 GB
Any other research: 8 GB
Kaggle: 4 – 8 GB
Startups: 8 GB (but check the specific application area for model sizes)
Companies: 8 GB for prototyping, >=11 GB for training

Another problem to watch out for, especially if you buy multiple RTX cards is cooling. If you want to stick GPUs into PCIe slots which are next to each other you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues and your GPUs will be slower (about 30%) and die faster.

Suspect line-up
Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?

RAM

The main mistakes with RAM is to buy RAM with a too high clock rate. The second mistake is to buy not enough RAM to have a smooth prototyping experience.

Needed RAM Clock Rate

RAM clock rates are marketing stints where RAM companies lure you into buying “faster” RAM which actually yields little to no performance gains. This is best explained by “Does RAM speed REALLY matter?” video on RAM von Linus Tech Tips.

Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you used pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory the performance gains of fast vs slow RAMs is about 0-3% — spend your money elsewhere!

RAM Size

RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to comfortable work with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.

The problem with this “match largest GPU memory in RAM” strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU and if you feel that you do not have enough RAM just buy some more.

A different strategy is influenced by psychology: Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration on more pressing matters if you have more RAM. With a lot of RAM you can avoid those bottlenecks, save time and increase productivity on more pressing problems. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing then additional RAM might be a good choice. So with this strategy, you want to have more, cheap RAM now rather than later.

CPU

The main mistake that people make is that people pay too much attention to PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up if your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.

CPU and PCI-Express

People go crazy about PCIe lanes! However, the thing is that it has almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. However, an ImageNet batch of 32 images (32x225x225x3) and 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretic numbers, and in practice you often see PCIe be twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range and thus latency can be ignored.

Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:

Forward and backward pass: 216 milliseconds (ms)
16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!

When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.

PCIe Lanes and Multi-GPU Parallelism

Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR2016, and I can tell you if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get a support of 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs as a rule of thumb: Do not spend extra money to get more PCIe lanes per GPU — it does not matter!

Needed CPU Cores

To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.

By far the most useful application for your CPU is data preprocessing. There are two different common data processing strategies which have different CPU needs.

The first strategy is preprocessing while you train:

Loop:

Load mini-batch
Preprocess mini-batch
Train on mini-batch

The second strategy is preprocessing before any training:

Preprocess data
Loop:
1. Load preprocessed mini-batch
2. Train on mini-batch

For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.

For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.

Needed CPU Clock Rate (Frequency)

When people think about fast CPUs they usually first think about the clock rate. 4GHz is better than 3.5GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors. Also, it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU: Increase a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.

While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.

CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 100 epochs MNIST or half an epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a base line for each CPU. For comparison: Upgrading from a GTX 580 to a GTX Titan is about +20% performance; from GTX Titan to GTX 980 another +30% performance; GPU overclocking yields about +5% performance for any GPU

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?fit=300%2C202&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?fit=603%2C406&ssl=1" class="wp-image-161" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cpu_underclocking2.png?resize=804%2C541" alt="CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 100 epochs MNIST or half an epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a base line for each CPU. For comparison: Upgrading from a GTX 580 to a GTX Titan is about +20% performance; from GTX Titan to GTX 980 another +30% performance; GPU overclocking yields about +5% performance for any GPU" width="804" height="541" data-recalc-dims="1" />

CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 200 epochs MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a baseline for each CPU. For comparison: Upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU

Note that these experiments are on a hardware that is dated, however, these results should still be the same for modern CPUs/GPUs.

Hard drive/SSD

The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: If you read your data from disk when they are needed (blocking wait) then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example torch vision loaders), then you will have loaded the mini-batch in 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty since you load the next mini-batch while the current is still computing.

However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.

Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.

Power supply unit (PSU)

Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while so a good PSU is a good investment.

You can calculate the required watts by adding up the watt of your CPU and GPUs with an additional 10% of watts for other components and as a buffer for power spikes. For example, if you have 4 GPUs with each 250 watts TDP and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 Watts. I would round up in this case an get a 1400 watts PSU.

One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!

Another important thing is to buy a PSU with high power efficiency rating – especially if you run many GPUs and will run them for a longer time.

Running a 4 GPU system on full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price is for a 100% efficiency, then training such a net with an 80% power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.

Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplane) and other factors that contribute to your footprint. If you want to be responsible, please consider going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is easy to do, cheap, and should be standard for deep learning researchers.

CPU and GPU Cooling

Cooling is important and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or all-in-one (AIO) water cooling solution for your CPU, but what for your GPU you will need to make special considerations.

Air Cooling GPUs

Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes can be made when you try to cool 3-4 GPUs and you need to think carefully about your options in this case.

Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.

However, typical pre-programmed schedules for fan speeds are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is a decreased performance (0-10%) which can be significant for multiple GPUs (10-25%) where the GPU heat up each other.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.

The only option under Linux is to use to set a configuration for your Xorg server (Ubuntu) where you set the option “coolbits”. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.

The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The “blower” fan design pushes the air out to the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air in the vincity of the GPU and cool the GPU. However, if you have multiple GPUs next to each other then there is no cool air around and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.

Water Cooling GPUs For Multiple GPUs

Another, more costly, and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in 3-4 GPU board). However, water cooling makes sure that even the beefiest GPU stay cool in a 4 GPU setup which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.

A Big Case for Cooling?

I bought large towers for my deep learning cluster, because they have additional fans for the GPU area, but I found this to be largely irrelevant: About 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs but thats it!

Conclusion Cooling

So in the end it is simple: For 1 GPU air cooling is best. For multiple GPUs, you should get blower-style air cooling and accept a tiny performance penalty (10-15%), or you pay extra for water cooling which is also more difficult to setup correctly and you have no performance penalty. Air and water cooling are all reasonable choices in certain situations. I would however recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to user water cooling try to find all-in-one (AIO) water cooling solutions for GPUs.

Motherboard

Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find information in this if you search your motherboard of choice on newegg and look at PCIe section on the specification page.

Computer Case

When you select a case, you should make sure that it supports full length GPUs that sit on top of your motherboard. Most cases support full length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a google image search of that model and see if you find pictures with GPUs in them.

If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into the GPU.

Monitors

I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.

The money I spent on my 3 27 inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?

Typical layout when I do deep learning: Left: Papers, google searcheres, gmail, stackoverflow threads; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.

" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?fit=300%2C169&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?fit=1024%2C576&ssl=1" class="wp-image-123 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/2015-03-04-13-58-10.jpg?resize=700%2C394" alt="2015-03-04 13.58.10" width="700" height="394" data-recalc-dims="1" />

Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.

Some words on building a PC

Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple as components that do not belong together do not fit together. The motherboard manual is often very specific how to assemble everything and there are tons of guides and step by step videos which guide you through the process if you have no experience.

The great thing about building a computer is, that you know everything that there is to know about building a computer when you did it once, because all computer are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So no reason to hold back!

Conclusion / TL;DR

GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU depending how you preprocess data. > 2GHz; CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.

RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least as much CPU RAM to match the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.

Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.

PSU:
– Add up watts of GPUs + CPU. Then multiply the total by 110% for required Wattage.
– Get a high efficiency rating if you use a multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6+8pins)

Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set coolbits flag in your Xorg config to control fan speeds

Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)

Monitors:
– An additional monitor might make you more productive than an additional GPU.

Update 2018-12-14: Reworked entire blog post with up-to-date recommendations.
Update 2015-04-22: Removed recommendation for GTX 580

The post A Full Hardware Guide to Deep Learning appeared first on Tim Dettmers.

TPUs vs GPUs for Transformers (BERT)

Tim Dettmers — Wed, 17 Oct 2018 18:13:03 +0000

On the computational side, there have been confusions about how TPUs and GPUs relate to BERT. BERT base was trained with 4 TPU pods (16 TPU chips) in 4 days and BERT large with 16 TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean that GPUs are dead? There are two fundamental things to understand here: (1) A TPU is a matrix multiplication engine — it does matrix multiplication and matrix operations, but not much else. It is fast at computing matrix multiplication, but one has to understand that (2) the slowest thing in matrix multiplication is to get the elements from the main memory and load it into the processing unit. In other words, the most expensive part in matrix multiplication is memory loads. Note the computational load for BERT should be about 90% for matrix multiplication. From these facts, we can do a small technical analysis on this topic.

Bandwidth Model for TPUs and GPUs

Transformers for TPUs

A common operation in BERT is matrix multiplication A*B=C where A is 256×1024 and B is 1024×1024 in dimension. A TPU computes such a matrix multiplication by splitting the matrix into many smaller 128×128 matrix multiplications. This means we need to load 16 128×128 matrix tiles from matrix A — and due to the nature of matrix multiplication — we need to load 64 tiles from B for every tile in A. This is a total of 16*64=1024 128×128 loads. At 16-bit that is a total of 32 MB of data.

Now we make a simplification: We assume that there is no latency if we do two memory loads after each other, which is actually not too unreasonable since often you can hide memory access latency under thread parallelism. In simple words, this means: While we wait for one 128×128 matrix copy to complete, we already do the next one. In doing it this way, we only wait for the first memory copy and we do not wait for other copies. This is a core reason why GPUs are fast and why we use many threads in GPUs thus 0 latency for overlapping memory transfers is not too far off from the real world. Using this simplification, we can now plainly use the memory bandwidth to compute the time needed to load the memory for the matrix multiplication. If we look at the bandwidth of the TPU we find that we have 600 GB/s, so we need 5.2e-05 seconds to transfer the 32 MB of data.

Transformers on GPUs

For a GPU we have the same process, but we use smaller tiles with more processors. Similarly to the TPU, we use two loads in parallel to hide memory latency. For GPUs, however, we would have a tile size of 96×96 for 16-bit data. If we take a V100 Tesla GPU, then we can run 160 of these in parallel at full bandwidth with low memory latency. What this means compared to a TPU: Instead of 2 matrix units which can hold 128×128 matrices, the GPU has 160 units (80 SMs, 160 thread blocks, each thread block has two 96×96 matrices) which hold two 96×96 matrices. Again this ensures that we can hide the memory latency through parallelism.

If we repeat the calculation from the top we receive the following: For matrix A with 256×1024 we have 33 96×96 tiles; for B with 1024×1024 we have 121 96×96 tiles. In total, we need to do 33*121=3993 loads of size 96×96 for a total of 70 MB. A V100 runs at 900 GB/s and so the memory loads will take 7.6e-05 seconds. Thus our model predicts that a GPU is 32% slower than a TPU for this specific scenario. Note that matrix tiles stay the same for an RTX 2080 Ti GPU, but the memory bandwidth decreases to 616 GB/s. Which means an RTX 2080 Ti is 54% slower than a TPU.

Note that both TPU and GPUs with Tensor Cores compute the respective matrix multiplication tile in one cycle. Thus the computation is about equally fast — the difference is only in how the memory is loaded.

BERT Training Time Estimate for GPUs

Using this data, a GPU cluster of V100s/RTX 2080 Tis with good networking (Infiniband +56GBits/s) and good parallelization algorithms (for example using Microsoft’s CNTK) we can expect to train BERT large on 64 GPUs (the equivalent to 16 TPUs) or BERT base on 16 GPUs in 5 1/3 days or 8 1/2 days. On an 8 GPU machine for V100/RTX 2080 Tis with any software and any parallelization algorithm (PyTorch, TensorFlow) one can expect to train BERT large in 21 days or 34 days and BERT base in 10 2/3 or 17 days. For a standard 4 GPU desktop with RTX 2080 Ti (much cheaper than other options), one can expect to replicate BERT large in 68 days and BERT base in 34 days.

Limitations of the Bandwidth Model

Note that all models are wrong, but some are useful. I would expect that this bandwidth model is in about 30% of the correct runtime values for TPU vs GPU.

The biggest limitation is that these calculations are for specific matrices sizes. Computational differences can be amplified for certain sizes. For example, if your batch-size is 128, there is a slight speedup for GPUs compared to TPUs. If you go below a batch size of 128 you can expect GPUs to be significantly faster; increasing the matrix B further makes TPUs better and better compared to GPUs. Decreasing the size of matrix B will make the performance of GPUs better. Note that the BERT paper optimized matrix A and B sizes for the TPU — one would not choose these dimensions if you train with a GPU. So this comparison might favor TPUs slightly.

Further direct limitations include fused operations. The TPU can calculate additional element-wise operations such as a non-linear activation function or a bias on the fly within a matrix multiplication. This means that the TPU does not need to load from slow global memory as often as a GPU. The GPU also supports these operations but NVIDIA has not implemented them and thus GPU users will not be able to benefit from this. Thus one can expect a slowdown of about 1.6% (loading and storing a 256×1024 matrix) for each element-wise operation for a GPU. For example, if you apply a non-linear function and a bias, then the TPU would be about 3.2% faster compared to GPUs in this scenario.

The Importance of 32-bit vs 16-bit vs 8-bit

If we repeat the same calculations from above for 32-bit values (64x64x tiles) we find that TPUs would be 5.3x faster. So the datatype size has a much larger effect than switching from TPU to GPU and vice versa.

TPUs do not support 8-bit training, but Turing GPUs do. So we can also have a look at how 8-bit matrix multiplication would impact performance. I published research on 8-bit models and it is not too difficult to train them with 8-bit alone. In fact, the literature on low-bit computing is quite rich. With 32-bit accumulation as supported by Turing GPUs 8-bit training should be even easier. If we can make 8-bit computing work for general models this would entail huge speedups for transformers. If we repeat the above calculations for 8-bit for GPUs (128×128 tile) we find that GPUs are 3.0x faster than TPUs. 8-bit computation on an affordable standard machine with 4 RTX 2080 Ti would take about 11 days for BERT base and 22 days for BERT large. All of this makes 16-bit computational ability for a GPU and important criterion if you are looking for a GPU to work with transformers.

Conclusion

TPUs are about 32% to 54% faster for training BERT-like models. One can expect to replicate BERT base on an 8 GPU machine within about 10 to 17 days. On a standard, affordable GPU machine with 4 GPUs one can expect to train BERT base for about 34 days using 16-bit or about 11 days using 8-bit.

The post TPUs vs GPUs for Transformers (BERT) appeared first on Tim Dettmers.

Deep Learning Hardware Limbo

Tim Dettmers — Thu, 21 Dec 2017 11:10:23 +0000

With the release of the Titan V, we now entered deep learning hardware limbo. It is unclear if NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018 and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?

NVIDIA has decided that it needs to cash-in on its monopoly position before the competition emerges. It needs the cash in order to defend itself in the next 1-2 years. This is reflected by the choice to price the Titan V at $3000. With TensorCores the Titan V has a new shiny deep learning feature, but at the same time, its cost/performance ratio is abysmal. This makes the Titan V very unattractive. But because there is no alternative, people will need to eat what there are served – at least for now.

The competition is strong. We have AMD whose hardware is now already better than NVIDIA’s and plans to get itself together to produce some deep learning software which is actually usable. With this step, the cost/performance ratio will easily outmatch NVIDIA cards and AMD will become the new standard. NVIDIA’s cash advantage will help fight AMD off so that we might see very cheap NVIDIA cards in the future. Note that this will only happen if AMD is able to push forward with good software — if AMD falters, NVIDIA cards will remain expensive and AMD will have lost its opportunity to grab the throne.

There is also a new contender in town: The Neural Network Processor (NNP) form Intel Nervana. With several unique features, it packs quite a punch. These new features make me drool — they are exactly what I want as a CUDA developer. The NNP solves most problems I face when I want to write CUDA kernels which are optimized for deep learning. This chip is the first true deep learning chip.

In general, for a 1-chip vs 1-chip ranking, we will see Nervana > AMD > NVIDIA, just because NVIDIA has to service gaming/deep learning/high-performance computing at once, while AMD only needs to service gaming/deep learning, whereas Nervana can just concentrate on deep learning – a huge advantage. The more concentrated a designed architecture, the less junk is on the chip for deep learning.

However, the winner is not determined by pure performance, and not even by pure cost/performance. It is determined by cost/performance + community + deep learning frameworks.

Let’s have a closer look at the individual positions of Nervana, AMD, and NVIDIA to see where they stand.

Nervana’s Neural Network Processors

Why @NaveenGRao is excited about #Intel Nervana NNP: https://t.co/DAqgOYFtoR #AI pic.twitter.com/dbLMdnp63t

— Intel AI (@IntelAI) October 31, 2017

Nervana’s design is very special mainly due to its large programmable caches (similar to CUDA shared memory) which are 10 times bigger per chip compared to GPUs and 50 times bigger per compute unit compared to GPUs. With this one will be able to design in-cache algorithms and models. This will speed up inference by at least an order of magnitude and one will be able to easily train on terabytes of data with small in-cache deep learning models, say, a multi-layer LSTM with 200 units. This will make this chip very attractive for startups and larger companies. Due to a special datatype, Flexpoint, one is able to store more data in caches/RAM and compute faster yielding even more benefits. All of this could mean speedup of about 10x compared to current NVIDIA GPUs for everybody. But this is only so if the main obstacles can be overcome: Community and software.

For the normal users and researchers, it will all depend on the community. Without community, we will not see in-cache algorithms. Without community, we will not see good software frameworks and it will be difficult to work with the chip. Everybody wants to use solid deep learning frameworks and it is questionable if Neon, Nervana’s deep learning framework, is up for the task. Software comes before hardware. If Nervana only ships pretty chips and does not push the software and community aspect effectively it will lose out to AMD and NVIDIA.

The community and software question is tightly bound to the price. If the price is too high, and students are not able to afford the NNP then no community can manifest itself around it. You do not get robust communities by just catering for industry. Although industry yields the main income for hardware companies, students are the main driver for the community. So if the price is right and students can afford it, then the community and the software will follow. Anything above $3000 will not work out. Anything above $2000 is critical and one would require special discounts for students to create a robust community. An NNP priced at $2000 will be manageable and find some adoption. Anything below $1500 will make Nervana the market leader for at least 2-3 years. An NNP at $1000 would make it extremely tough for NVIDIA and AMD to compete — software would not even be a question here, it follows automatically.

I personally will switch to NNPs if they are priced below $2500. They are just so much superior to GPUs for deep learning and I will be able to do things which are just impossible with NVIDIA hardware. If they are over $2500 then it also reaches my pain point for good hardware. I save up a lot of money to buy hardware — good hardware is just important to me — but I have to live from something.

For usual consumers not only the price will be important, but also how the community is handled. If we do not see Intel immediately pumping resources into the community to start up a solid software machinery then the NNP is likely to stagnate and die off. Unfortunately, Intel has a good history of mismanaging communities — it would be a shame if this happens because I really would like to see Nervana succeed.

In summary, we will see Nervana’s NNP will emerge as a clear winner if it will be priced below $2000 and if we see strong community and software development within the first few months after its release. With a higher price and less community support, the NNP will be strong, but might not be able to surpass other solutions in terms of cost/performance and convenience. If the software and community efforts fail or if the NNP is priced at $4000 it will likely fail. A price above $2000 will require significant discounts for students for the NNP to be viable.

AMD: Cheap and Powerful – If You Can Use It

Source: Wikipedia

AMDs cards are incredible. The Vega Frontier Edition series clearly outmatches NVIDIA counterparts, and, from unbiased benchmarks of Volta vs Pascal, it seems that the Vega Frontier will be on-a-par or better compared to a Titan V if it is liquid cooled. Note that the Vega is based on an old architecture while the Titan V is brand new. The new AMD architecture, which will be released in 2018Q3 will increase performance further still.

AMD hopes to advance deep learning hardware by just switching from 32-bit floats to 16-bit floats. This is a very simple and powerful strategy. The chips will not be useful for high-performance computing, but they will be solid for gamers and the deep learning community while development costs will be low because 16-bit float computation is straightforward.

They will not be able to compete in terms of performance with Nervana’s NNP, but the cost/performance might outmatch everything on the market. You can get a liquid cooled Vega Frontier for $700 which might be just a little worse than a $3000 Titan V.

The problem is software. Even if you have this powerful AMD GPU, you will hardly be able to use it – no major framework supports AMD GPUs well enough.

AMD is in limbo itself – in software limbo. It seems they want to abandon OpenCL for HIP but currently they officially still push and support the OpenCL path. If they push through with HIP, and if they put some good deep learning software on the market (not only libraries for convolution and matrix multiplication but full deep learning frameworks, say, HIP support for PyTorch) in the next 9 months then their release of their new GPU in 2018Q3 has the potential to demolish all competitors.

So in summary, if AMD gets its shit together in terms of software, it might become the dominating deep learning hardware solution.

NVIDIA: The Titan

NVIDIA’s position is solid. They have the best software, the best tools, their hardware is good and the community is large, strong and well integrated.

NVIDIA’s main issue is that they have to serve multiple communities: High-performance computing people, deep learning people, and gamers, and this is a huge strain on their hardware. It is expensive to design chips which are custom made for these communities and NVIDIA’s strategy was currently to design a one-size-fits-all architecture. This worked until it didn’t. The Titan V is just mediocre all-around.

With the emerging competitors, NVIDIA has two choices. (1) Push the price on their cards down until they starve the competition to death, or (2) they can develop specialized deep learning hardware on their own. NVIDIA has the resources to pursue the first strategy, and it also has the expertise for the second strategy. A new design, however, will take some time and NVIDIA might lose the throne to another company in the meantime. So we might see both strategies played out a once: Starving competitors so that NVIDIA can compete until their own deep learning chip hits the market.

In summary, NVIDIAs throne is threatened, but it has the resources and the expertise to fight against emerging players. We will probably see cheaper NVIDIA cards in the future and chips which are more specialized for deep learning. If NVIDIA does not lower its prices it might (temporarily) pass the throne to another player.

Conclusion

Deep learning hardware limbo means that it makes no sense to invest in deep learning hardware right now, but it also means we will have cheaper NVIDIA cards, usable AMD cards, and ultra-fast Nervana cards quite soon. It is an exciting time and we consumers will profit from this immensely. But for now, we have to be patient. We have to wait. I will keep you updated as the situation changes.

The post Deep Learning Hardware Limbo appeared first on Tim Dettmers.

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near

Tim Dettmers — Mon, 27 Jul 2015 10:20:05 +0000

In this blog post I will delve into the brain and explain its basic information processing machinery and compare it to deep learning. I do this by moving step-by-step along with the brains electrochemical and biological information processing pipeline and relating it directly to the architecture of convolutional nets. Thereby we will see that a neuron and a convolutional net are very similar information processing machines. While performing this comparison, I will also discuss the computational complexity of these processes and thus derive an estimate for the brains overall computational power. I will use these estimates, along with knowledge from high performance computing, to show that it is unlikely that there will be a technological singularity in this century.

This blog post is complex as it arcs over multiple topics in order to unify them into a coherent framework of thought. I have tried to make this article as readable as possible, but I might have not succeeded in all places. Thus, if you find yourself in an unclear passage it might become clearer a few paragraphs down the road where I pick up the thought again and integrate it with another discipline.

First I will give a brief overview about the predictions for a technological singularity and topics which are aligned with that. Then I will start the integration of ideas between the brain and deep learning. I finish with discussing high performance computing and how this all relates to predictions about a technological singularity.

The part which compares the brains information processing steps to deep learning is self-contained, and readers which are not interested in predictions for a technological singularity may skip to this part.

Part I: Evaluating current predictions of a technological singularity

There were a lot of headlines recently about predictions that artificial intelligence will reach super-human intelligence as early as 2030 and that this might herald the beginning of human extinction, or at least dramatically altering everyday life. How was this prediction made?

Factors which help to predict a singularity

Ray Kurzweil has made many very accurate predictions and his methods to reach these predictions are quite simple for computing devices: Look at the exponential growth of computing power, efficiency, and size, and then extrapolate. This way, you could easily predict the emergence of small computers which fit into your hands and with a bit of creativity, one could imagine that one day there would be tablets and smartphones. The trends were there, you just needed to imagine what could be done with computers which you can hold in your hand.

Similarly, Ray Kurzweil predicted the emergence of strong AI which is as intelligent or more intelligent than humans. For this prediction he also used data for the exponential growth of computing power and compared this to an estimate for the computational power of the brain.

He also acknowledges that the software will be as important as the hardware, and that the software development of strong AI will take longer because such software can only be developed once fast computer systems are available. This can be felt in the area of deep learning, where solid ideas of the 1990s were unfeasible due to the slow computers. Once graphic processing units (GPUs) were used, these computing limitations were quickly removed and rapid progress could be made.

However, Kurzweil also stresses that once the hardware level is reached, first “simple” strong AI systems will be developed quickly. He sets the date for brain-like computational power to 2020 and the emergence of strong AI (first human like intelligence or better) to 2030. Why these numbers? With persisting growth in computing power in 2019 we will reach the computing power which is equivalent to the human brain — or will we?

This estimate is based on two things: (1) The estimate for the complexity of the brain, (2) the estimate for the growth in computing power. As we will see, both these estimates are not up-to-date with current technology and knowledge about neuroscience and high performance computing.

Our knowledge of neuroscience doubles about every year. Using this doubling period, in the year of 2005 we would only have possessed about 0.098% of the neuroscience knowledge that we have today. This number is a bit off, because the doubling time was about 2 years in 2005 while it is less than a year now, but overall it is way below 1 %.

The thing is that Ray Kurzweil based his predictions on the neuroscience of 2005 and never updated them. An estimate for the brains computational power based on 1% of the neuroscience knowledge does not seem right. Here is small list of a few important discoveries made in the last two years which increase the computing power of the brain by many orders of magnitude:

It was shown that brain connections rather than being passive cables, can themselves process information and alter the behavior of neurons in meaningful ways, e.g. brain connections help you to see the objects in everyday life. This fact alone increases brain computational complexity by several orders of magnitude
Neurons which do not fire still learn: There is much more going on than electrical spikes in neurons and brain connections: Proteins, which are the little biological machines which make everything in your body work, combined with local electric potential do a lot of information processing on their own — no activation of the neuron required
Neurons change their genome dynamically to produce the right proteins to handle everyday information processing tasks. Brain: “Oh you are reading a blog. Wait a second, I just upregulate this reading-gene to help you understand the content of the blog better.” (This is an exaggeration — but it is not too far off)

Before we look at the complexity of the brain, let us first look at brain simulations. Brain simulations are often used to predict human-like intelligence. If we can simulate a human brain, then it will not be long until we are able to develop human-like intelligence, right? So the next paragraph looks at this reasoning. Can brain simulations really provide reliable evidence for predicting the emergence of artificial intelligence?

The problems with brain simulations

Brain simulations simulate the electrical signals which are emitted by neurons and the size of the connections between neurons. A brain simulation starts with random signals and the whole system stabilizes according to rules which are thought to govern information processing steps in the brain. After running these rules for some time, stable signals may form which can be compared to the signals of the brain. If the signals of the simulation are similar to recordings of the brain, this increases our confidence that our chosen rules are somewhat similar to the rules that the brain uses. Thus we can validate large scale information processing rules in the brain. However, the big problem with brain simulations is, that this is pretty much all we can do.

We do not gain any understanding what these signals mean or what function they could possess. We cannot test any meaningful hypotheses with this brain model other than the vague “our rules produce similar activity”. The lack of precise hypotheses which make accurate predictions (“If the activity is like this, then the circuit detected an apple instead of an orange”) is one of the loudest criticism of the European brain simulation project. The brain project is regarded as rather useless by many neuroscientists and even dangerous, because it sucks away money for useful neuroscience projects which actually shed light on neural information processing.

Another problem is that these brain simulations rely on models which are outdated, incomplete and which dismiss many biological parts in neurological information processing. This is mainly so, because the electrical information processing in the brain is much better understood. Another more conveniently reason is, that current models are already able to reproduce the needed output patterns (which is the main goal after all) and so there is no need to update these models to be more brain-like.

So to summarize, the problems with brain simulations are:

Not possible to test specific scientific hypotheses (compare this to the large hadron collider project with its perfectly defined hypotheses)
Does not simulate real brain processing (no firing connections, no biological interactions)
Does not give any insight into the functionality of brain processing (the meaning of the simulated activity is not assessed)

The last point is the most important argument against the usefulness of brain processing for strong-AI estimation. If we could develop a brain simulation of the visual system, which would do well on say, the MNIST and ImageNet data sets, this would be useful to estimate progress in brain-like AI. But without this, or any similar observable function, brain simulations remain rather useless with respect to AI.

With this said, brain simulations are still valuable to test hypothesized general rules of information processing in the brain —we have nothing better for this — but they are quite useless to make sense of what the information processing in the brain means, and thus constitute unreliable evidence for predicting the progress in AI. Anything that relies on brain simulation as evidence for predictions of future strong-AI should be looked at with great skepticism.

Estimating the brains computational complexity

As mentioned in the introduction, the estimates of the brain’s complexity are a decade old and many new discoveries made this old estimate obsolete. I never came across an estimate which is up to date, so here I derive my own estimate. While doing this, I will focus mostly on the electrochemical information processing and neglect the biological interactions within the neuron, because they are too complex (and this blog post is already very long). Therefore the estimate that is derived here can be thought of as a lower bound of complexity — it should always be assumed that the brain is more complex than this.

During the construction of this model of complexity, I will also relate every step in the model with its deep learning equivalents. This will give you a better understanding of how close deep learning is related to and how fast deep learning really is compared to the human brain

Defining reference numbers for the model

We know some facts and estimates which help us to start with our model building:

The brain uses learning algorithms which are very different from deep learning, but the architecture of neurons is similar to convolutional nets
The adult brain has 86 billion neurons, about 10 trillion synapse, and about 300 billion dendrites (tree-like structures with synapses on them)
The brain of a child has far more than 100 billion neurons, and has synapses and dendrites in excess of 15 trillion and 150 billion, respectively
The brain of a fetus has more than a trillion neurons; neurons which are misplaced die quickly (this is also the reason why adults have fewer neurons than children)

Location of the cerebellum which contains roughly 3/4 of all neurons and connections. Image source: 1

Location of the cerebrum; also referred to as “the cortex”. More precisely, the cortex is the outer layer of the brain, which contains most neurons of the cerebrum. Image source: 1

The cerebellum, the super computer of the brain, contains roughly ¾ of all neurons (this ratio is consistent in most mammal species)
The cerebrum, the main driver of “intelligence”, contains roughly ¼ of all neurons
An average neuron in the cerebellum has about 25000 synapses
An average neuron in the cerebrum has about 5000-15000 synapses

The number of neurons is well known; the number of synapses and dendrites is only known within a certain boundary and I chose conservative estimates here.

The average synapses per neuron differ wildly between neurons, and the number here is a rough average. It is known that most synapses in the cerebellum are made between dendrites of Purkinje neurons and two different types of neurons that make connections that “climb up” or “cross parallel” with the Purkinje’s synapses. It is known that Purkinje cells have about 100000 synapses each. Because these cells have by far the largest weight in the cerebellum, one can estimate the complexity of the brain best if one looks at these neurons and at the interactions that they make.

There are many hundreds of different types of neurons; here some of the more common neurons. Thanks to Robert Stufflebeam for this image (source).

It is important to differentiate between the complexity of a brain region and its functional importance. While almost all computation is carried out by the cerebellum, almost all important functions are carried out by the cerebrum (or cortex). The cortex uses the cerebellum to generate predictions, corrections and conclusions, but the cortex accumulates these insights and acts upon them.

For the cerebrum it is known that neurons almost never have more than 50000 synapses, and unlike the cerebellum, most neurons have a number of synapses within the range of 5000-15000.

How do we use these numbers?

A common approach for estimating the computational complexity of the brain is to assume all information processing in the brain can be represented by the combination of impulses when a neuron fires (action potentials) and the size (mostly number of receptors) of the synapses that each neuron has. Thus one can multiply the estimates for the number of neurons and their synapses and add everything together. Then one multiplies this by the rate of fire for the average neurons which is about 200 action potentials per second. This model is what Ray Kurzweil uses to create his estimate. While this model was okay a few decades ago, it is not suitable to model the brain from a modern view point, as it leaves out much of the important neurological information processing which is so much more than mere firing neurons.

A model which approximates the behavior of neurons more accurately is the extended linear-nonlinear-Poisson cascade model (LNP). The extended LNP model is currently viewed as an accurate model of how neurons process information. However, the extended LNP model still leaves out some fine details, which are deemed unimportant to model large scale brain function. Indeed adding these fine details to the model will add almost no additional computational complexity, but makes the model more complex to understand — thus including these details in simulations would violate the scientific method which seeks to find the simplest models for a given theory. However, this extended model is actually very similar to deep learning and thus I will include these details here.

There are other good models that are also suitable for this. The primary reason why I chose the LNP model is that it is very close to deep learning. This makes this model perfect to compare the architecture of a neuron to the architecture of a convolutional net. I will do this in the next section and at the same time I will derive an estimate for the complexity of the brain.

Part II: The brain vs. deep learning — a comparative analysis

Now I will explain step by step how the brain processes information. I will mention the steps of information processing which are well understood and which are supported by reliable evidence. On top of these steps, there are many intermediary steps at the biological level (proteins and genes) which are still poorly understood but known to be very important for information processing. I will not go into depth into these biological processes but provide a short outline, which might help the knowledge hungry readers to delve into these depths themselves. We now begin this journey from the neurotransmitters released from a firing neuron and walk along all its processes until we reach the point where the next neuron releases its neurotransmitters, so that we return to where we started.

The next section introduces a couple of new terms which are necessary to follow the rest of the blog post, so read it carefully if you are not familiar with basic neurobiology.

Image sources: 1,2,3,4

Neurons use the axon — a tube like structure— to transmit their electric signals over long stretches in the brain. When a neuron fires, it fires an action potential — an electrical signal— down its axon which branches into a tree of small endings, called axon terminals. On the ending of each of these axon terminals sit some proteins which convert this electrical message back into a chemical one: Small balls — called synaptic vesicles — filled with a couple of neurotransmitters each are released into an area outside of the neuron, called synaptic cleft. This area separates the axon terminal from the beginning of the next neuron (a synapse) and allows the neurotransmitter to move freely to pursue different tasks.

The synapses are most commonly located at a structure which looks very much like the roots of a tree or plant; this is the dendritic tree composed of dendrites which branch into larger arms (this represents the connections between neurons in a neural network), which finally reach the core of the cell, which is called soma. These dendrites hold almost all synapses which connect one neuron to the next and thus form the principal connections. A synapse may hold hundreds of receptors to which neurotransmitter can bind themselves.

You can imagine this compound of axon terminal and synapses at a dendrite as the (dense) input layer (of an image if you will) into a convolutional net. Each neuron may have less than 5 dendrites or as many as a few hundred thousand. Later we will see that the function of the dendritic tree is similar to the combination of a convolutional layer followed by max-pooling in a convolutional network.

Going back to the biological process, the synaptic vesicles merge with the surface of the axon terminal and turn themselves inside-out spilling their neurotransmitters into the synaptic cleft. There the neurotransmitters drift in a vibrating motion due to the temperature in the environment, until they (1) find a fitting lock (receptor protein) which fits their key (the neurotransmitter), (2) the neurotransmitters encounter a protein which disintegrates them, or (3) the neurotransmitters encounter a protein which pulls them back into the axon (reuptake) where they are reused. Antidepressants mostly work by (3) preventing, or (4) enhancing the reuptake of the neurotransmitter serotonin; (3) preventing reuptake will yield changes in information processing after some days or weeks, while (4) enhancing reuptake leads to changes within seconds or minutes. So neurotransmitter reuptake mechanisms are integral for minute to minute information processing. Reuptake is ignored in the LNP model.

However, the combination of the amount of neurotransmitters released, the number of synapses for a given neurotransmitter, and how many neurotransmitters actually make it into a fitting protein on the synapse can be thought of as the weight parameter in a densely (fully) connected layer of a neural network, or in other words, the total input to a neuron is the sum of all axon-terminal-neurotransmitter-synapse interactions. Mathematically, we can model this as the dot product between two matrices (A dot B; [amount of neurotransmitters of all inputs] dot [amount of fitting proteins on all synapses]).

After a neurotransmitter has locked onto a fitting protein on a synapse, it can do a lot of different things: Most commonly, neurotransmitters will just (1) open up channels, to let charged particles flow (through diffusion) into the dendrites, but it can also cause a rarer effect with huge consequences: The neurotransmitter (2) binds to a G-protein which then produces a protein signaling cascade which, (2a) activates (upregulates) a gene which is then used to produce a new protein which is integrated into either the surface of the neuron, its dendrites, and/or its synapses; which (2b) alerts existing proteins to do a certain function at a specific site (create or remove more synapses, unblock some entrances, attach new proteins to the surface of the synapse). This is ignored in the NLP model.

Once the channels are open, negatively or positively charged particles enter into the dendritic spine. A dendritic spine is a small mushroom-like structure on to which the synapse is attached. These dendritic spines can store electric potential and have their own dynamics of information processing. This is ignored in the NLP model.

Dendritic spines have their own internals information processing dynamics which is largely determined by its shape and size. Image source: 1,2

The charge of the particles that may enter the dendritic spine are either negatively or positively charged — some neurotransmitters only open channels for negative particles, others only for positive ones. There are also channels which let positively charged particles leave the neuron, thus increasing the negativity of the electric potential (a neuron “fires” if it becomes too positive). The size and shape of the mushroom-like dendritic spine corresponds to its behavior. This is ignored in the NLP model.

Once particles entered the spine, there are many things they can affect. Most commonly, they will (1) just travel along the dendrites to the cell body in the neuron and then, if the cell gets too positively charged (depolarization) they induce an action potential (the neuron “fires”). But other actions are also common: The charged particles accumulate in the dendritic spine directly and (2) open up voltage-gated channels which may polarize the cell further (this is an example of the dendritic spine information processing mentioned above). Another very important process are (3) dendritic spikes.

Dendritic spikes

Dendritic spikes are a phenomenon which has been known to exist for some years, but only in 2013 the techniques were advanced enough to collect the data to show that these spikes were important for information processing. To measure dendritic spikes, you have to attach some very tiny clamps onto dendrites with the help of a computer which moves the clamp with great precision. To have some sort of idea where your clamp is, you need a special microscope to observe the clamp as you progress onto a dendrite. Even then you mostly attach the clamp in a rather blind matter because at such tiny scale every movement made is a rather giant leap. Only a few teams in the world have the equipment and skill to attach such clamps onto dendrites.

However, the direct data gathered by those few teams was enough to establish dendritic spikes as important information processing events. Due to the introduction of dendritic spikes into computational models of neurons, the complexity of a single neuron has become very similar to a convolutional net with two convolutional layers. As we see later the LNP model also uses non-linearities very similar to a rectified linear function, and also makes use of a spike generator which is very similar to dropout – so a neuron is very much like an entire convolutional net. But more about that later and back to dendritic spikes and what exactly they are.

Dendritic spikes occur when a critical level of depolarization is reached in a dendrite. The depolarization discharges as an electric potential along the walls of the dendrite and may trigger voltage-gated channels along its way through the dendritic tree and eventually, if strong enough, the electric potential reaches the core of the neuron where it may trigger a true action potential. If the dendritic spike fails to trigger an action potential, the opened voltage-gated channels in neighboring dendrites may do exactly that a split second later. Due to channels opened from the dendritic spike more charged particles enter the neuron, which then may either trigger (common) or stifle (rare) a full action potential at the neurons cell body (soma).

A shows a computer model of a neuron that does not model dendritic spikes; B models simple dynamics of dendritic spikes; C models more complex dynamics of dendritic spikes which takes into account the one dimensional diffusion of particles (which is similar to a convolution operation). Take note that these images are only snapshots in a particular moment of time. A big thanks to Berd Kuhn. Image copyright © 2014 Anwar, Roome, Nedelescu, Chen, Kuhn and De Schutter as published in Frontiers in Cellular Neuroscience (Anwar et al. 2014)

This process is very similar to max-pooling, where a single large activation “overwrites” other neighboring values. However, after a dendritic spike, neighboring values are not overwritten like during max-pooling used in deep learning, but the opening of voltage-gated channels greatly amplifies the signals in all neighboring branches within the dendritic tree. Thus a dendritic spike may heighten the electrochemical levels in neighboring dendrites to a level which is more similar to the maximum input — this effect is close to max-pooling.

Indeed it was shown that dendritic spikes in the visual system serve the same purpose as max pooling in convolutional nets for object recognition: In deep learning, max-pooling is used to achieve (limited) rotation, translation, and scale invariance (meaning that our algorithm can detect an object in an image where the object is rotated, moved, or shrunk/enlarged by a few pixels). One can think of this process as setting all surrounding pixels to the same large activation and make each activation share the weight to the next layer (in software the values are discarded for computational efficiency — this is mathematically equivalent). Similarly, it was shown that dendritic spikes in the visual system are sensitive to the orientation of an object. So dendritic spikes do not only have computational similarity, but also similarities in function.

The analogy does not end here. During neural back-propagation — that is when the action potential travels from the cell body back into the dendritic tree — the signal cannot backpropagate into the dendritic branch where the dendritic spike originated because these are “deactivated” due to the recent electrical activity. Thus a clear learning signal is sent to inactivated branches. At first this may seem like the exact opposite from the backpropagation used for max-pooling, where everything but the max-pooling activation is backpropagated. However, the absence of a backpropagation signal in a dendrite is a rare event and represents a learning signal on its own. Thus, dendrites which produce dendritic spikes have special learning signals just like activated units in max-pooling.

To better understand what dendritic spikes are and what they look like, I very much want to encourage you to watch this video (for which I do not have the copyright). The video shows how two dendritic spikes lead to an action potential.

This combination of dendritic spikes and action potentials and the structure of the dendritic tree has been found to be critical for learning and memory in the hippocampus, the main brain region responsible for forming new memories and writing them to our “hard drive” at night.

Dendritic spikes are one of the main drivers of computational complexity which have been left out from past models of the complexity of the brain. Also, these new findings show that neural back-propagation does not have to be neuron-to-neuron in order to learn complex functions; a single neuron already implements a convolutional net and thus has enough computational complexity to model complex phenomena. As such, there is little need for learning rules that span multiple neurons — a single neuron can produce the same outputs we create with our convolutional nets today.

But these findings about dendritic spikes are not the only advance made in our understanding of the information processing steps during this stage of the neural information processing pathway. Genetic manipulation and targeted protein synthesis are sources that increase computational complexity by orders of magnitude, and only recently we made advances which reveal the true extend of biological information processing.

Protein signaling cascades

As I said in the introduction of this part, I will not cover the parts of biological information processing extensively, but I want to give you enough information so that you can start learning more from here.

One thing one has to understand is that a cell looks much different from how it is displayed in text books. Cells crawl with proteins: There are about 10 billion proteins in any given human cell and these proteins are not idle: They combine with other proteins, work on a task, or jitter around to find new tasks to work on.

All the functions described above are the work of proteins. For example the key-and-lock mechanism and the channels that play the gatekeeper for the charged particles that leave and enter the neuron are all proteins. The proteins I mean in this paragraph are not these common proteins, but proteins with special biological functions.

As an example the abundant neurotransmitter glutamate may bind to a NDMA receptor which then opens up its channels for many different kinds of charged particles and after being opened, the channel only closes when the neuron fires. The strength of synapses is highly dependent on this process, where the synapse is adjusted according to the location of the NDMA receptor and the timing of signals which are backpropagated to the synapses. We know this process is critical to learning in the brain, but it is only a small piece in a large puzzle.

The charged particles which may enter the neuron may additionally induce protein signaling cascades own their own. For example the cascade below shows how an activated NMDA receptor (green) lets charged calcium CA2+ inside which triggers a cascade which eventually leads to AMPAR receptors (violet) being trafficked and installed on the synapse.

Image source: 1

It was shown again and again that these special proteins have a great influence on the information processing in neurons, but it is difficult to pick out a specific type of protein from this seemingly chaotic soup of 10 billion proteins and study its precise function. Findings are often complex with a chain of reactions involving many different proteins until a desired end-product or end-function is reached. Often the start and end functions are known but not the exact path which led from one to the other. Sophisticated technology helped greatly to study proteins in detail, and as technology gets better and better we will further our understanding of biological information processing in neurons.

Genetic manipulation

The complexity of biological information processing does not end with protein signaling cascades, the 10 billion proteins are not a random soup of workers that do their tasks, but these workers are designed in specific quantities to serve specific functions that are relevant at the moment. All this is controlled by a tight feedback loop involving helper proteins, DNA, and messenger RNA (mRNA).

If we use programming metaphors to describe this whole process, then the DNA represents the whole github website with all its public packages, and messenger RNA is a big library which features many other smaller libraries with different functions (something like the C++ boost library).

It all begins with a programming problem you want to solve (a biological problem is detected). You use google and stackoverflow to find recommendations for libraries which you can use to solve the problem and soon you find a post that suggests that you use library X to solve problem Y (problem Y is detected on a local level in a cell with known solution of protein X; the protein that detected this defect then cascades into a chain of protein signals which leads to the upregulation of the gene G which can produce protein X; here upregulation is a “Hey! Produce more of this, please!” signal to the nucleus of the cell where the DNA lies). You download the library and compile it (the gene G is copied (transcribed) as a short string of mRNA from the very long string of DNA). You then do configure the install (the mRNA leaves the core) with the respective configuration (the mRNA is translated into a protein, the protein may be adjusted by other proteins after this), and install the library in a global “/lib” directory (the protein folds itself into its correct form after which it is fully functional). After you have installed the library, you import the needed part of the library to your program (the folded protein travels (randomly) to the site where it is needed) and you use certain functions of this library to solve your problem (the protein does some kind of work to solve the problem).

Additional to this, neurons may also dynamically alter their genome, that is they can dynamically change their github repository to add or remove libraries.

To understand this process further, you may want to watch the following video, which shows how HIV produces its proteins and how the virus can change the host DNA to suit its needs. The process described in this video animation is very similar to what is going on in neurons. To make it more similar to the process in neurons, imagine that HIV is a neurotransmitter and that everything contained in the HIV cell is in the neuron in the first place. What you have then is an accurate representation of how neurons make use of theirs genes and proteins:

You may ask, isn’t it so that every cell in your body has (almost) the same DNA in order to be able to replicate itself? Generally, this is true for most cells, but not true for most neurons. Neurons will typically have a genome that is different from the original genome that you were assigned to at birth. Neurons may have additional or fewer chromosomes and have sequences of information removed or added from certain chromosomes.

It was shown, that this behavior is important for information processing and if gone awry, this may contribute to brain disorders like depression or Alzheimer’s disease. Recently it was also shown, that neurons change their genome on a daily basis to improve information processing demands.

So when you sit at your desk for five days, and then on the weekend decide to go on a hike, it makes good sense that the brain adapts its neurons for this new task, because entirely different information processing is needed after this change of environment.

Equally, in an evolutionary sense, it would be beneficial to have different “modes” for hunting/gathering and social activity within the village — and it seems that this function might be for something like this. In general, the biological information processing apparatus is extremely efficient in responding to slower information processing demands that range from minutes to hours.

With respect to deep learning, an equivalent function would be to alter the function of a trained convolutional net in significant but rule-based ways; for example to apply a transformation to all parameters when changing from one to another task (recognition of street numbers -> transform parameters -> recognition of pedestrians).

Nothing of this biological information processing is modeled by the LNP model.

Looking back at all this, it seems rather strange that so many researchers think they that they can replicate the brain’s behavior by concentrating on the electrochemical properties and inter-neuron interactions only. Imagine that every unit in a convolutional network has its own github, from which it learns to dynamically download, compile and use the best libraries to solve a certain task. From all this you can see that a single neuron is probably more complex than an entire convolutional net, but we continue from here in our focus on electrochemical processes and see where it leads us.

Back to the LNP model

After all this above, there is only one more relevant step in information processing for our model. Once a critical level of depolarization is reached, a neuron will most often fire, but not always. There are mechanisms that prevent a neuron from firing. For example shortly after a neuron fired, its electric potential is too positive to produce a fully-fledged action potential, and thus it cannot fire again. This blockage may be present even when a sufficient electric potential is reached, because this blockade is a biological function and not a physical switch.

In the LNP model, this blockage of an action potential is modeled as an inhomogeneous Poisson process which has a Poisson distribution. A Poisson process with a Poisson distribution as a model means that the neuron has a very high probability to fire the first or second time it reached its threshold potential, but it may also be (with a exponentially decreasing probability) that a neuron may not fire for many more times.

A Poisson(0.5) distribution with a randomly drawn sample. Here 0,1,2,3 represents the waiting time until the neuron fires, thus 0 means it fires without delay, while 2 means it will not fire for two cycles even if it could fire physically.

There are exceptions to this rule, where neurons disable this mechanism and fire continuously at the rates which are governed by the physics alone — but these are special events which I will ignore at this point. Generally, this whole process is very similar to dropout used in deep learning which uses a uniform distribution instead of a Poisson distribution; thus this process can be viewed as some kind of regularization method that the brain uses instead of dropout.

In the next step, if the neuron fires, it releases an action potential. The action potential has very little difference in its amplitude, meaning the electric potential generated by the neuron almost always has the same magnitude, and thus is a reliable signal. As this signal travels down the axon it gets weaker and weaker. When it flows into the branches of the axon terminal, its final strength will be dependent on the shape and length of these branches; so each axon terminal will receive a different amount of electrical potential. This spatial information, together with the temporal information due to the spiking pattern of action potentials, is then translated into electrochemical information (it was shown that they are translated into spikes of neurotransmitters themselves that last about 2ms). To adjust the output signal, the axon terminal can move, grow or shrink (spatial), or it may alter its protein makeup which is responsible for releasing the synaptic vesicles (temporal).

Now we are back at the beginning: Neurotransmitters are released from the axon terminal (which can be modeled as a dense matrix multiplication) and the steps repeat themselves.

Learning and memory in the brain

Now that we went through the whole process back to back, let us put all this into context to see how the brain uses all this in concert.

Most neurons repeat the process of receive-inputs-and-fire about 50 to 1000 times per second; the firing frequency is highly dependent on the type of neuron and if a neuron is actively processesing tasks. Even if a neuron does not process a task it will fire continuously in a random fashion. Once some meaningful information is processed, this random firing activity makes way for a highly synchronized activity between neighboring neurons in a brain region. This synchronized activity is poorly understood, but is thought to be integral to understanding information processing in the brain and how it learns.

Currently, it is not precisely known how the brain learns. We do know that it adjusts synapses with some sort of reinforcement learning algorithm in order to learn new memories, but the precise details are unclear and the weak and contradicting evidence indicates that we are missing some important pieces of the puzzle. We got the big picture right, but we cannot figure out the brain’s learning algorithm without the fine detail which we are still lacking.

Concerning memories, we know that some memories are directly stored in the hippocampus, the main learning region of the brain (if you lose your hippocampus in each brain hemisphere, you cannot form new memories). However, most long-term memories are created and integrated with other memories during your REM sleep phase, when so called sleep spindles unwind the information of your hippocampus to all other brain areas. Long-term memories are generally all local: Your visual memories are stored in the visual system; your memories for your tongue (taste, texture) are stored in the brain region responsible for your tongue, etcetera.

It is also known, that the hippocampus acts as a memory buffer. Once it is full, you need to sleep to empty its contents to the rest of your brain (through sleep spindles during REM sleep); this might be why babies sleep so much and so irregularly —once their learning buffer is full, they sleep to quickly clear their buffer in order to learn more after they wake. You can still learn when this memory buffer is full, but retention is much worse and new memories might wrangle with other memories in the buffer for space and displace them —so really get your needed amount of sleep. Sleeping less and irregularly is unproductive, especially for students who need to learn.

The hippocampus in each hemisphere is shown in red. Image source: 1

Because memories are integrated with other memories during your “write buffer to hard-drive” stage, sleep is also very important for creativity. The next time you recall a certain memory after you slept, it might be altered with some new information that your brain thought to be fitting to attach to that memory.

I think we all had this: We wake up with some crazy new idea, only to see that it was quite nonsensical in the first place — so our brain is not perfect either and makes mistakes. But other times it just works: One time I tortured myself with a math problem for 7 hours non-stop, only to go to bed disappointed with only about a quarter of the whole problem solved. After I woke, I immediately had two new ideas how to solve the problem: The first did not work; but second made things very easy and I could sketch a solution to the math problem within 15 minutes — an ode to sleep!

Now why do I talk about memories when this blog post is about computation? The thing is that memory creation — or in other words — a method to store computed results for a long time, is critical for any intelligence. In brain simulations, one is satisfied if the synapse and activations occur in the same distribution as they do in the real brain, but one does not care if these synapses or activations correspond to anything meaningful — like memories or “distributed representations” needed for functions such as object recognition. This is a great flaw. Brain simulations have no memories.

In brain simulation, the diffusion of electrochemical particles is modeled by differential equations. These differential equations are complex, but can be modeled with simple techniques like Euler’s method to approximate these complex differential equations. The result has poor accuracy (meaning high error) but the algorithm is very computationally efficient and the accuracy is sufficient to reproduce the activities of real neurons along with their size and distribution of synapses. The great disadvantage is that we generally cannot learn parameters from a method like this — we cannot create meaningful memories.

However, as I have shown in my blog post about convolution, we can also model diffusion by applying convolution — a very computationally complex operation. The advantage about convolution is that we can use methods like maximum-likelihood estimation with backpropagation to learn parameters which lead to meaningful representations which are akin to memories (just like we do in convolutional nets). This is exactly akin to the LNP model with its convolution operation.

So besides its great similarity to deep learning models, the LNP model is also justified in that it is actually possible to learn parameters which yield meaningful memories (where with memories I mean here distributed representations like those we find in deep learning algorithms).

This then also justifies the next point where I estimate the brain’s complexity by using convolution instead of Euler’s method on differential equations.

Another point to take away from for our model is, that we currently have no complexity assigned for the creation of memories (we only modeled the forward pass, not the backward pass with backpropagation). As such, we underestimate the complexity of the brain, but because we do not know how the brain learns, we cannot make any accurate estimates for the computational complexity of learning. With that said and kept in the back of our mind, let us move on to bringing the whole model together for a lower bound of computational complexity.

Bringing it all together for a mathematical estimation of complexity

The next part is a bit tricky: We need to estimate the numbers for N, M, n and m and these differ widely among neurons.

We know that 50 of the 86 billion neurons in the brain are cerebellar granule neurons, so these neurons and their connection will be quite important in our estimation.

Cerebellar granule neurons are very tiny neurons with about 4 dendrites. Their main input is from the cortex. They integrate these signals and then send them along a T-shaped axon which feeds into the dendrites of Purkinje neurons.

Purkinje neurons are by far the most complex neurons, but there are only about 100 million of them. They may have more than a 100000 synapses each and about 1000 dendrites. Multiple Purkinje neurons bundle their outputs in about a dozen deep nuclei (a bunch of densely packed neurons) which then send signals back to the cortex.

This process is very crucial for non-verbal intelligence, abstract thinking and abstract creativity (creativity: Name as many words beginning with the letter A; abstract creativity: What if gravity bends space-time (general relativity)? What if these birds belonged to the same species when they came to this island (evolution)?). It was thought a few decades ago that the cerebellum only computes outputs for movement; for example while Einstein’s cerebrum was handled and studied carefully, his cerebellum was basically just cut off and put away, because it was regarded as a “primitive” brain part.

But since then it was shown that the cerebellum forms 1:1 connections with most brain regions of the cortex. Indeed, changes in the front part of the cerebellum during the ages 23 to 25 may change your non-verbal IQ by up to 30 points, and changes of 10-15 IQ points are common. This is very useful in most instances, whereas we lose neurons which perform a function which we do not need in everyday lives (calculus, or the foreign language which you learned but never used).

So it is crucial to get the estimation of the cerebellum right not only because it contains most neurons, but also because it is important for intelligence and information processing in general.

Estimation of cerebellar filter dimensions

Now if we look at a single dendrite, it branches off into a few branches and thus has a tree like structure. Along its total length it is usually packed with synapses. Dendritic spikes can originate in any branch of a dendrite (spatial dimension). When we take 3 branches per dendrite, and 4 dendrites in total we have a convolutional filter of size 3 and 4 for cerebellar granule neurons. Since linear convolution over two dimensions is the same as convolution over one dimension followed by convolution over the other dimension, we can also model this as a single 3×4 convolution operation. Also note that this is mathematically identical to a model that describes the diffusion of particles originating from different sources (feature map) which diffuse according to a rule in their neighborhood (kernel) — this is exactly what happens at a physical level. More on this view in my blog post about convolution.

Here I have chosen to represent the spatial domain with a single dimension. It was shown that the shape of the dendritic tree is also important in the resulting information processing and thus we would need two dimensions for the spatial domain. However, data is lacking to represent this mathematically in a meaningful way and thus I proceed with the simplification to one spatial dimension.

The temporal dimension is also important here: Charged particles may linger for a while until they are pumped out of the neuron. It is difficult to estimate a meaningful time frame, because the brain uses continuous time while our deep learning algorithms only know discrete time steps.

No single estimate makes sense from a biological perspective, but from a psychological perspective we know that the brain can take up unconscious information that is presented in an image in about 20 milliseconds (this involves only some fast, special parts of the brain). For conscious recognition of an object we need more time — at least 65 milliseconds, and on average about 80-200 milliseconds for reliable conscious recognition. This involves all the usual parts that are active for object recognition.

From these estimates, one can think about this process as “building up the information of the seen image over time within a neuron”. However, a neuron can only process information if it can differentiate meaningful information from random information (remember, neurons fire randomly if they do not actively process information). Once a certain level of “meaningful information” is present, the neuron actively reacts to that information. So in a certain sense information processing can be thought of as an epidemic of useful information that spreads across the brain: Information can only spread to one neuron, if the neighboring neuron is already infected with this information. Thinking in this way, such an epidemic of information infects all neurons in the brain within 80-200 milliseconds.

As such we can say that, while the object lacks details in the first 20 milliseconds, there is full detail at about 80-200 milliseconds. If we translate this into discrete images at the rate of 30 frames per second (normal video playback) —or in other words time steps — then 20 milliseconds would be 0.6 time steps, and 80-200 milliseconds 2.4-6 time steps. This means, that all the visual information that a neuron needs for its processing will be present in the neuron within 2.4 to 6 frames.

To make calculations easier, I here now choose a fixed time dimension of 5 time steps for neural processes. This means for the dendrites we have spatio-temporal convolutional filters of size 3x4x5 for cerebellar granule neurons. For Purkinje neurons a similar estimate would be filters of a size of about 10x1000x5. The non-linearity then reduces these inputs to a single number for each dendrite. This number represents an instantaneous firing rate, that is, the number represents how often the neuron fires in the respective interval of time, for example at 5 Hz, 100 Hz, 0 Hz etcetera. If the potential is too negative, no spike will result (0 HZ); if the potential is positive enough, then the magnitude of the spike is often proportional to the magnitude of the electric potential —but not always.

It was shown that dendritic summation of this firing rate can be linear (the sum), sub-linear (less than the sum), supra-linear (more than the sum) or bistable (less than the sum, or more than the sum, depending on the respective input); these behaviors of summation often differ from neuron to neuron. It is known that Purkinje neurons use linear summation, and thus their summation to form a spike rate is very similar to the rectified linear function max(0,x) which is commonly used in deep learning. Non-linear sums can be thought of different activation functions. It is important to add, that the activation function is determined by the type of the neuron.

The filters in the soma (or cell body) can be thought of as an additional temporal convolutional filter with a size of 1 in the spatial domain. So this is a filter that reduces the input to a single dimension with a time dimension of 5, that is, a 1x1x5 convolutional filter (this will be the same for all neurons).

Again, the non-linearity then reduces this to an instantaneous firing rate, which then is dropped out by a Poisson process, which is then fed into a weight-matrix.

At this point I want to again emphasize, that it is not correct to view the output of a neuron as binary; the information conveyed by a firing neuron is more like an if-then-else branch: “if(fire == True and dropout == False){ release_ neurotransmitters(); }else{ sleep(0.02); }”

The neurotransmitters are the true output of a neuron, but this is often confused. The source of this confusion is that it is very difficult to study neurotransmitter release and its dynamics with a synapse, while it is ridiculously easy to study action potentials. Most models of neurons thus model the output as action potentials because we have a lot of reliable data here; we do not have such data for neurotransmitter interactions at a real-time level. This is why action potentials are often confused as the true outputs of neurons when they are not.

When a neuron fires, this impulse can be thought of as being converted to a discrete number at the axon terminal (number of vesicles which are released) and is multiplied by another discrete number which represents the amount of receptors on the synapse (this whole process corresponds to a dense or fully connected weight in convolutional nets). In the next step of information processing, charged particles flow into the neuron and build up a real-valued electric potential. This has also some similarities to batch-normalization, because values are normalized into the range [0,threshold] (neuron: relative to the initial potential of the neuron; convolutional net: relative to the mean of activations in batch-normalization). When we look at this whole process, we can model it as a matrix multiplication between two real-valued matrices (doing a scaled normalization before or after this is mathematically equivalent, because matrix multiplication is a linear operation).

Therefore we can think of axon-terminal-synapse interactions between neurons as a matrix multiplication between two real-valued matrices.

Estimation of cerebellar input/output dimensions

Cerebellar granule neurons typically receive inputs from about four axons (most often connections from the cortex). Each axon forms about 3-4 synapses with the dendritic claw of the granule neuron (a dendrite ending shaped as if you would hold a tennis ball in your hand) so there are a total of about 15 inputs via synapses to the granule neurons. The granule neuron itself ends in a T shaped axon which crosses directly through the dendrites of Purkinje neurons with which it forms about 100 synapses.

Purkinje neurons receive inputs from about 100000 connections made with granule neurons and they themselves make about 1000 connections in the deep nuclei. There are estimates which are much higher and no accurate number for the number of synapses exists as far as I know. The number of 100000 synapses might be a slight overestimate (but 75000 would be too conservative), but I use it anyways to make the math simpler.

All these dimensions are taken times the time dimension as discussed above, so that the input for granule neurons for example has a dimensionality of 15×5.

So with this we can finally calculate the complexity of a cerebellar granule neuron together with the Purkinje neurons.

So my estimate would be 1.075×10^21 FLOPS for the brain, the fastest computer on earth as of July 2013 has 0.58×10^15 FLOPS for practical application (more about this below).

Part III: Limitations and criticism

While I discussed how the brain is similar to deep learning, I did not discuss how the brain is different. One great disparity is that the dropout in the brain works with respect to all inputs, while dropout in a convolutional network works with respect to each single unit. What the brain is doing makes little sense in deep learning right now; however, if you think about combining millions of convolutional nets with each other, it makes good sense to do as the brain does. The dropout of the brain certainly would work well to decouple the activity of neurons from each other, because no neuron can depend on information from a single other neuron (because it might be dropped out), so that it is forced to take into account all the neurons it is connected with, thus eliminating biased computation (which is basically regularization).

Another limitation of the model is that it is a lower bound. This estimate does not take into account:

Backpropagation, i.e. signals that travel from the soma to the dendrites; the action potential is reflected within the axon and travels backwards (these two things may almost double the complexity)
Axon terminal information processing
Multi-neurotransmitter vesicles (can be thought of multiple output channels or filters, just as an image has multiple colors)
Geometrical shape of the dendritic tree
Dendritic spine information processing
Non-axodendritic synapses (axon-axon and axon-soma connections)
Electrical synapses
Neurotransmitter induced protein activation and signaling
Neurotransmitter induced gene regulation
Voltage induced (dendritic spikes and backpropagating signals) gene regulation
Voltage induced protein activation and signaling
Glia cells (besides having an extremely abnormal brain (about one in a billion), Einstein also had abnormally high levels of glia cells)

All these things have been shown to be important for information processing in the brain. I did not include them in my estimate because this would have made everything:

Too complex: What I have discussed so far is extremely simple if you compare that to the vastness and complexity of biological information processing
Too special: Non-axodendritic synapses can have unique information processing algorithms completely different from everything listed here, e.g. direct electrical communication between a neighboring bundle of neurons
And/or evidence is lacking to create a reliable mathematical model: Neural backpropagation, geometry of the dendritic trees, and dendritic spines

Remember that these estimates are for the whole brain. Local brain regions might have higher computational processing speed than this average when they are actively processing stimuli. Also remember that the cerebellum makes up almost all computational processing. Other brain regions integrate the knowledge of the cerebellum, but the cerebellum acts as a transformation and abstraction module for almost all information in the brain (except vision and hearing).

But wait, but we can do all this with much less computational power! We already have super-human performance in computer vision!

I would not say that we have super-human performance in computer vision. What we have is a system that beats human at naming things in images that are taken out of context of the real world (what happens before we see something in the real world shapes our perception dramatically). We almost always can recognize things in our environment, but we most often just do not know (or care about) the name of what we see.

Humans do not have the visual system to label things. Try to make a list of 1000 common physical objects in the real world —not an easy task.

To not recognize an object for us humans would mean that we see an object but cannot make sense of it. If you forgot the name of an old classmate, it does not mean you did not recognize her; it just means you forgot her name. Now imagine you get off a train stop and you know a good friend is waiting for you somewhere at the stop. You see somebody 300 meters away waving their hands who is looking in your direction — is it your friend? You do not know; you cannot recognize if it is her. That’s the difference between mere labels and object recognition.

Now if you cannot recognize something in a 30×30 pixel image, but the computer can, this also does not necessarily mean that the computer has super-human object recognition performance. First and foremost this means that your visual system does not work well for pixeled information. Our eyes are just not used to that.

Now take a look outside a window and try to label all the things you see. It will be very easy for most things, but for other things you do not know the correct labels! For example, I do not know the name for a few plants that I see when I look out of my window. However, we are fully aware what it is what we see and can name many details of the object. For example, alone by assessing their appearance, I know a lot about how much water and sunshine the unknown plants need, how fast they grow, in which way they grow, if they are old or young specimens; I know how they feel like if I touch them — or more generally — I know how these plants grow biologically and how they produce energy, and so on. I can do all this without knowing its name. Current deep learning systems cannot do this and will not do this for quite some time. Human-level performance in computer vision is far away indeed! We just reached the very first step (object recognition) and now the task is to make computer vision smart, rather than making it just good at labeling things.

Evolutionarily speaking, the main functions of our visual system have little to do with naming things that we see: Hunt and avoid being hunted, to orient ourselves in nature during foraging and make sure we pick the right berries and extract roots efficiently— these are all important functions, but probably one of the most important functions of our vision is the social function within a group or relationship.

If you Skype with someone it is quite a different communication when they have their camera enabled compared to if they have not. It is also very different to communicate with someone whose image is on a static 2D surface compared to communicating in person. Vision is critical for communication.

Our deep learning cannot do any of this efficiently.

Making sense of a world without labels

One striking case which also demonstrates the power of vision for true understanding of the environment without any labels is the case of Genie. Genie was strapped into place and left alone in a room at the age of 20 months. She was found with severe malnutrition 12 years later. She had almost no social interaction during this time and thus did not acquire any form of verbal language.

Once she got in contact with other human beings she was taught English as a language (and later also sign language), but she never really mastered it. Instead she quickly mastered non-verbal language and was truly exceptional at that.

To strangers she almost exclusively communicated with non-verbal language. There are instances where these strangers would stop in their place, leave everything behind, walk up to her and hand her a toy or another item — that item was always something that was known to be something liked and desired.

In one instance a woman got out of her car at a stoplight at an intersection, emptied her purse and handed it to Genie. The woman and Genie did not exchange a word; they understood each other completely non-verbally.

So what Genie did, was to pick up cues with her visual system and translated the emotional and cognitive state of that woman into non-verbal cues and actions, which she would then use to change the mental state of the woman. In turn that the woman would then desire to give the purse to Genie (which Genie probably could not even see).

Clearly, Genie was very exceptional at non-verbal communication — but what would happen if you pitched her against a deep learning object recognition system? The deep learning system would be much better than Genie on any data set you would pick. Do you think it would be fair to say that the convolutional net is better at object recognition than Genie is? I do not think so.

This shows how primitive and naïve our approach to computer vision is. Object recognition is a part of human vision, but it is not what makes it exceptional.

Can we do with less computational power?

“We do not need as much computational power as the brain has, because our algorithms are (will be) better than that of the brain.”

I hope you can see after the descriptions in this blog post that this statement is rather arrogant.

We do not know how the brain really learns. We do not understand information processing in the brain in detail. And yet we dare to say we can do better?

Even if we did know how the brain works in all its details, it would still be rather naïve to think we could create general intelligence with much less. The brain developed during many hundreds of millions of years through evolution. Evolutionary, it is the most malleable organ there is: The human cortex shrunk by about 10% during the last 20000 years, and the human brain adapted rapidly to the many ways we use verbal language — a very recent development in evolutionary terms.

It was also shown that the number of neurons in each animal’s brain is almost exactly the amount which it can sustain through feeding (we probably killed off the majority of all mammoths by about 20000 years ago). We humans have such large brains because we invented fire and cooking with which we could predigest food which made it possible to sustain more neurons. Without cooking, the intake of calories would not be high enough to sustain our brains and we would helplessly starve (at least a few thousand years ago; now you could survive on a raw vegan diet easily — just walk into a supermarket and buy a lot of calorie-dense foods). With this fact, it is very likely that brains are optimized exhaustively to create the best information processing which is possible for the typical calorie intake of the respective species — the function which is most expensive in an animal will be most ruthlessly optimized to enhance survival and procreation. This is also very much in line with all the complexity of the brain; every little function is optimized thoroughly and only as technology advances we can understand step by step what this complexity is made for.

There are many hundreds of different types of neurons in the brain, each with their designated function. Indeed, neuroscientists often can differentiate different brain regions and their function by looking at the changing architecture and neuron types in a brain region. Although we do not understand the details of how the circuits perform information processing, we can see that each of these unique circuits is designed carefully to perform a certain kind of function. These circuits are often replicated in evolutionary distinct species which share a common ancestor that branched off into these different species hundreds of millions of years ago, showing that such structures are evolutionarily optimal for the tasks they are processing.

The equivalent in deep learning would be, if we had 10000 different architectures of convolutional nets (with its own set of activation functions and more) which we combine meticulously to improve the overall function of our algorithm ― do you really think we can build something which can produce as complex information processing, but which follows a simple general architecture?

It is rather naïve to think that we can out-wit this fantastically complex organ when we are not even able to understand its learning algorithms.

On top of this, the statement that we will develop better algorithms than the brain uses is unfalsifiable. We can only prove it when we achieve it, we cannot disprove it. Thus it is a rather nonsensical statement that has little practical value. Theories are usually useful even when there is not enough evidence to show that they are correct.

The standard model of physics is an extremely useful theory used by physicists and engineers around the world in their daily life to develop the high tech products we enjoy; and yet this theory is not complete, it was amended just a few days ago when a new particle was proven to exist in the LHC experiment.

Imagine if there were another model, but you would only be able to use it when we have proven the existence of all particles. This model would then be rather useless. When it makes no predictions at all about the behavior in the world, we would be unable to manufacture and develop electronics with this theory. Similarly, the statement that we can develop more efficient algorithms than the brain does not help; it rather makes it more difficult to make further progress. The brain should really be our main point of orientation.

Another argument, which would be typical for Yann LeCun (he made a similar argument during a panel) would be: Arguably, airplanes are much better at flying than birds are; yet, if you describe the flight of birds it is extremely complex and every detail counts, while the flight of airplanes is described simply by the fluid flow around an airfoil. Why is it wrong to expect this simplicity from deep learning when compared to the brain?

I think this argument has some truth in it, but essentially, it asks the wrong question. I think it is clear that we need not to replicate everything in detail in order to achieve artificial intelligence, but the real question is: Where do we draw the line? If you get to know that neurons can be modeled in ways that closely resemble convolutional nets, would you go so far and say, that this model is too complex and we need to make it simpler?

Part IV: Predicting the growth of practical computational power

There is one dominant measure of performance in high-performance computing (HPC) and this measure is floating point operations per second (FLOPS) on the High Performance LINPACK (HPL) benchmark – which measures how many computations a system can do in a second when doing distributed dense matrix operations on hundreds or thousands of computers. There exists the TOP 500 list of supercomputers, which is a historical list based on this benchmark which is the main reference point for the performance of a new supercomputer system.

But a big but comes with the LINPACK benchmark. It does not reflect the performance in real, practical applications which run on modern supercomputers on a daily basis, and thus, the fastest computers on the TOP 500 list are not necessarily the fastest computers for practical applications.

Everybody in the high performance computing community knows this, but it is so entrenched in the business routine in this area, that when you design a new supercomputer system, you basically have to show that your system will be able to get a good spot on the TOP 500 in order to get funding for that supercomputer.

Sometimes such systems are practically unusable, like the Tianhe-2 supercomputer which still holds the top spot on the LINPACK benchmark after more than three years. The potential of this supercomputer goes largely unused because it is too expensive to run (electricity) and the custom hardware (custom network, Intel Xeon Phi) requires new software, which would need years of development to reach the levels of sophistication of standard HPC software. The Tianhe-2 runs only at roughly one third of its capacity, or in other words, it practically stands idle for nearly 2 out of 3 minutes. The predecessor of the Tianhe-2, the Tianhe-1, fastest computer in the world in 2010 (according to LINPACK), has not been used since 2013 due to bureaucracy reasons.

While outside of China, other supercomputers of similar design fare better, they typically do not perform so well in practical applications. This is so, because the used accelerators like graphic processing units (GPUs) or Intel Xeon Phis can deliver high FLOPS in such a setup, but they are severely limited by network bandwidth bottlenecks.

To correct the growing uselessness of the LINPACK benchmark a new measure of performance was developed: The high performance conjugate gradient benchmark (HPCG). This benchmark performs conjugate gradient, which requires more communication than LINPACK and as such comes much closer to performance numbers for real applications. I will use this benchmark to create my estimates for a singularity.

The TOP500 for the last decade and some data for the HPCG (data collection only began recently). The dashed lines indicate a forecast. The main drivers of computational growth are also shown: Multicore CPU, GPU, and in 2016-2017 3D memory, and some new unknown technology in 2020. Will this growth be sustainable?

However, this benchmark still dramatically overestimates the computing power that can be reached for artificial intelligence applications when we assume that these applications are based on deep learning.

Deep learning is currently the most promising technique for reaching artificial intelligence. It is certain that deep learning — as it is now — will not be enough, but one can say for sure that something similar to deep learning will be involved in reaching strong AI.

Deep learning, unlike other applications has an unusually high demand for network bandwidth. It is so high that for some supercomputer designs which are in the TOP 500 a deep learning application would run slower than on your desktop computer. Why is this so? Because parallel deep learning involves massive parameter synchronization which requires extensive network bandwidth: If your network bandwidth is too slow, then at some point deep learning gets slower and slower the more computers you add to your system. As such, very large systems which are usually quite fast may be extremely slow for deep learning.

The problem with all this is that the development of new network interconnects which enable high bandwidth is difficult and advances are made much more slowly than the advances of computing modules, like CPUs, GPUs and other accelerators. Just recently, Mellanox reached a milestone where they could manufacture switches and InfiniBand cards which operate at 100Gbits per second. This development is still rather experimental, and it is difficult to manufacture fiber-optic cables which can operate at this speed. As such, no supercomputer implements this new development as of yet. But with this milestone reached, there will not be another milestone for many quite a while. The doubling time for network interconnect bandwidth is about 3 years.

Similarly, there is a memory problem. While the speed of theoretical processing power of CPUs and GPUs keeps increasing, the memory bandwidth of RAM is almost static. This is a great problem, because now we are at a point where it costs more time to move the data to the compute circuits than to actually make computations with it.

With new developments such as 3D memory one can be sure that further increases in memory bandwidth will be achieved, but we have nothing after that to increase the performance further. We need new ideas and new technology. Memory will not scale itself by getting smaller and smaller.

However, currently the biggest hurdle of them all is power consumption. The Tianhe-2 uses 24 megawatts of power, which totals to $65k-$100k in electricity cost per day, or about $23 million per year. The power consumed by the Tianhe-2 would be sufficient to power about 6000 homes in Germany or 2000 homes in the US (A/C usage).

An overview about how the performance constraints changed from old to new supercomputers. Adapted from Horst Simon‘s presentation

Physical limitations

Furthermore, there are physical problems around the corner. Soon, our circuits will be so small that electrons will start to show quantum effects. One such quantum effect is quantum tunneling. In quantum tunneling an electron sits in two neighboring circuits at once, and decides randomly to which of these two locations it will go next.

If this would happen at a larger scale, it would be like charging your phone right next to your TV, and the electrons decide they want to go to your cell phone cable rather than to your TV; so they jump over to the phone cable cutting off the power to your TV. Quantum tunneling will become relevant in 2016-2017 and has to be taken into account from there on. New materials and “insulated” circuits are required to make everything work from here on.

With new materials, we need new production techniques which will be very costly because all computer chips relied on the same, old but reliable production process. We need research and development to make our known processes working with these new materials and this will not only cost money but also cost time. This will also fuel a continuing trend where the cost for producing computer chips increases exponentially (and growth may slow due to costs). Currently, the tally is at $9bn for such a semiconductor fabrication plant (fab) increasing at a relatively stable rate of about 7-10% higher costs per year for the past decades.

After this, we are at the plain physical limits. A transistor will be composed of not much more than a handful of atoms. We cannot go smaller than this, and this level of manufacturing will require extensive efforts in order to get such devices working properly. This will start to happen around 2025 and the growth may slow from here due to physical limitations.

Recent trends in the growth of computational power

So to summarize the previous section: (1) LINPACK performance does not reflect practical performance because it does not test memory and network bandwidth constraints; (2) memory and network bandwidth are now more important than computational power, however (3) advances in memory and network bandwidth will be sporadic and cannot compete with the growth in computational power; (4) electrical costs are a severe limitation (try to justify a dedicated power plant for a supercomputer if citizen face sporadic power outages), and also (5) computational power will be limited by physical boundaries in the next couple of years.

It may not come to a surprise then that the growth in computational power has been slowing down in recent years; this is mainly due to power efficiencies which will only be improved gradually, but the other factors also take its toll, like network interconnects which cannot keep up with accelerators like GPUs.

If one takes the current estimate of practical FLOPS of the fastest supercomputer, the Tianhe-2 with 0.58 petaflops on HPCG, then it would take 21 doubling periods until the lower bound of the brain’s computational power is reached. If one uses Moore’s Law, we would reach that by 2037; if we take the growth of the last 60 years, which is about 1.8 years per doubling period, we will reach this in the year 2053. If we take a lower estimate of 3 years for the doubling period due to the problems listed above we will reach this in 2078. While for normal supercomputing applications memory bandwidth is the bottleneck for practical applications as of now, this may soon change to networking bandwidth, which doubles about every 3 years. So the 2078 estimate might be quite accurate.

Growth in computing performance with respect to the HPCG benchmark. Both computing performance and factory costs are assumed to keep growing steadily at an exponential rate with doubling period of 18 or 36 months, respectively.

Now remember that, (1) the HPCG benchmark has much higher performance than typical deep learning applications which rely much more on network and memory bandwidth, and (2) that my estimate for the computational complexity of the brain is a lower bound. One can see that an estimate beyond 2100 might be not too far off. To sustain such a long and merciless increase in computation performance will require that we develop and implement many new ideas while operating at the border of physical limitations as soon as by 2020. Will this be possible?

Where there’s a will, there’s a way — the real question is: Are we prepared to pay the costs?

Conclusion

Here I have discussed the information processing steps of the brain and their complexity and compared them to those of deep learning algorithms. I focused on a discussion of basic electrochemical information processing and neglected biological information processing.

I used an extended linear-nonlinear-Poisson cascade model as groundwork and related it to convolutional architectures.

With this model, I could show that a single neuron has an information processing architecture which is very similar to current convolutional nets, featuring convolutional stages with rectified non-linearities which activities are then regularized by a dropout-like method. I also established a connection between max-pooling and voltage-gated channels which are opened by dendritic spikes. Similarities to batch-normalization exist.

This straightforward similarity gives strong reason to believe that deep learning is really on the right path. It also indicates that ideas borrowed from neurobiological processes are useful for deep learning (the problem was that progress in deep learning architectures often preceded knowledge in neurobiological processes).

My model shows that it can be estimated that the brain operates at least 10x^21 operations per second. With current rates of growth in computational power we could achieve supercomputers with brain-like capabilities by the year 2037, but estimates after the year 2080 seem more realistic when all evidence is taken into account. This estimate only holds true if we succed to stomp limitations like physical barriers (for example quantum-tunneling), capital costs for semiconductor fabrication plants, and growing electrical costs. At the same time we constantly need to innovate to solve memory bandwidth and network bandwidth problems which are or will be the bottlenecks in supercomputing. With these considerations taken into account, it is practically rather unlikely that we will achieve human-like processing capabilities anytime soon.

Closing remarks

My philosophy of this blog post was to present all information on a single web-page rather than scatter information around. I think this design helps to create a more sturdy fabric of knowledge, which, with its interwoven strains of different fields, helps to create a more thorough picture of the main ideas involved. However, it has been quite difficult to organize all this information into a coherent picture and some points might be more confusing than enlightening. Please leave a comment below to let me know if the structure and content need improvement, so that I can adjust my next blog post accordingly.

I would also love general feedback for this blog post.

Also make sure to share this blog post with your fellow deep learning colleagues. People with raw computer science backgrounds often harbor misconceptions about the brain, its parts and how it works. I think this blog post could be a suitable remedy for that.

The next blog post

The second post in this series on neuroscience and psychology will focus on the most important brain regions and their function and connectivity. The last and third part in the series will focus on psychological processes, such as memory and learning, and what we can learn from that with respect to deep learning.

Acknowledgments

I would like to thank Alexander Tonn for his useful advice and for proofreading this blog post.

Important references and sources

Neuroscience

Brunel, N., Hakim, V., & Richardson, M. J. (2014). Single neuron dynamics and computation. Current opinion in neurobiology, 25, 149-155.

Chadderton, P., Margrie, T. W., & Häusser, M. (2004). Integration of quanta in cerebellar granule cells during sensory processing. Nature, 428(6985), 856-860.

De Gennaro, L., & Ferrara, M. (2003). Sleep spindles: an overview. Sleep medicine reviews, 7(5), 423-440.

Ji, D., & Wilson, M. A. (2007). Coordinated memory replay in the visual cortex and hippocampus during sleep. Nature neuroscience, 10(1), 100-107.

Liaw, J. S., & Berger, T. W. (1999). Dynamic synapse: Harnessing the computing power of synaptic dynamics. Neurocomputing, 26, 199-206.

Ramsden, S., Richardson, F. M., Josse, G., Thomas, M. S., Ellis, C., Shakeshaft, C., … & Price, C. J. (2011). Verbal and non-verbal intelligence changes in the teenage brain. Nature, 479(7371), 113-116.

Smith, S. L., Smith, I. T., Branco, T., & Häusser, M. (2013). Dendritic spikes enhance stimulus selectivity in cortical neurons in vivo. Nature, 503(7474), 115-120.

Stoodley, C. J., & Schmahmann, J. D. (2009). Functional topography in the human cerebellum: a meta-analysis of neuroimaging studies. Neuroimage,44(2), 489-501.

High performance computing

Dongarra, J., & Heroux, M. A. (2013). Toward a new metric for ranking high performance computing systems. Sandia Report, SAND2013-4744, 312.

PDF: HPCG Specification

Interview: Why there will be no exascale computing before 2020

Slides: Why there will be no exascale computing before 2020

Interview: Challenges of exascale computing

Image references

Anwar, H., Roome, C. J., Nedelescu, H., Chen, W., Kuhn, B., & De Schutter, E. (2014). Dendritic diameters affect the spatial variability of intracellular calcium dynamics in computer models. Frontiers in cellular neuroscience, 8.

The post The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near appeared first on Tim Dettmers.

How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism

Tim Dettmers — Sun, 09 Nov 2014 19:16:55 +0000

In my last blog post I explained what model and data parallelism is and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.

To recap, model parallelism is, when you split the model among GPUs and use the same data for each model; so each GPU works on a part of the model rather than a part of the data. In deep learning, one approach is to do this by splitting the weights, e.g. a 1000×1000 weight matrix would be split into a 1000×250 matrix if you use four GPUs.

Model parallelism diagram. Synchronizing communication is needed after each dot product with the weight matrix for both forward and backward pass.

One advantage of this approach is immediately apparent: If we split the weights among the GPUs we can have very large neural networks which weights would not fit into the memory of a single GPU. In part I mentioned this in an earlier blog post, where I also said that such large neural networks are largely unnecessary. However, for very big unsupervised learning tasks – which will become quite important in the near future – such large networks will be needed in order to learn fine grained features that could learn “intelligent” behavior.

How does a forward and backward pass work with such split matrices? This is most obvious when we do the matrix algebra step by step:

We start looking at which would be the dot matrix multiply for the usual forward pass case. The dimensions for using model parallelism with two GPUs for a batch size of 128 and a 1000×500 weight matrix would be:

Standard: 128×1000 dot 1000×500 = 128×500

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> add matrices

Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -> stack matrices

To calculate the errors in the layer below we need to pass the current error through to the next layer, or more mathematically, we calculate the deltas by taking the dot product of the error of the previous layer and the weights that connect to the next layer , i.e. :

Standard: 128×500 dot 500×1000 = 128×1000

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> stack matrices

Split by weight matrix second dimension: 128×250 dot 250×1000 = 128×1000 -> add matrices

We see here, we need to synchronize (adding or stacking weights) after each dot product and you may think that this is slow when compared to data parallelism, where we synchronize only once. But one can quickly see that this is not so for most cases if we do the math: In data parallelism a 1000×500 gradient needs to be transferred once for the 1000×500 layer – that’s 500000 elements; for model parallelism we just need to transfer a small matrix for each forward and backward pass with a total of 128000 or 160000 elements – that’s nearly 4 times less data! So the network card bandwidth is still the main bottleneck in the whole application, but much less so than in the data parallelism case.

This is of course all relative and depends on the network architecture. Data parallelism will be quite fast for small networks and very slow for large networks, the opposite is true for model parallelism. The more parameters we have, the more beneficial is model parallelism. Its true strength comes to play if you have neural networks where the weights do not fit into a single GPU memory. Here model parallelism might achieve that for which one would need thousands of CPUs.

However, if you run small networks where the GPUs are not saturated and have some free capacity (not all cores are running), then model parallelism will be slow. Unlike data parallelism, there are no tricks you can use to hide the communication needed for synchronization, this is because we have only partial information for the whole batch. With this partial information we cannot compute the activities in the next layer and thus have to wait for the completion of the synchronization to move forward.

How the advantages and disadvantages can be combined is best shown by Alex Krizhevsky who demonstrates the efficiency of using data parallelism in the convolutional layers and model parallelism in the dense layers of a convolutional neural network.

The post How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism appeared first on Tim Dettmers.

How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism

Tim Dettmers — Thu, 09 Oct 2014 14:59:09 +0000

In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers and using MPI in your programming will make things much easier than to use the options available in CUDA itself.

In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and downfalls are for such algorithms. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.

So what are these two? Data parallelism is when you use the same model for every thread, but feed it with different parts of the data; model parallelism is when you use the same data for every thread, but split the model among threads.

For neural networks this means that data parallelism uses the same weights and but different mini-batches in each thread; the gradients need to be synchronized, i.e. averaged, after each pass through a mini-batch.

Model parallelism splits the weights of the net equally among the threads and all threads work on a single mini-batch; here the generated output after each layer needs to be synchronized, i.e. stacked, to provide the input to the next layer.

Each method has its advantages and disadvantages which change from architecture to architecture. Let us look at data parallelism first and its bottlenecks first and in the next post I will look at model parallelism.

Severity of the network bottleneck of data parallelism

The idea of data parallelism is simple. If you have, say, 4 GPUs you split a mini-batch into parts for each of them, say, you split a mini-batch with 128 examples into 32 examples for each GPU. Then you feed the respective batch through the net and obtain gradients for each split of the mini-batch. You then use MPI to collect all the gradients and update the parameters with the overall average.

Data parallelism diagram. There is no communication in the forward pass, and during the backward pass you synchronize gradients.

The biggest problem with this approach is that during the backward pass you have to pass the whole gradient to the all other GPUs. If you have a 1000×1000 weight matrix then you need to pass 4000000 bytes to each network. If we take a 40Gbit/s network card – which is already quite fast – then you will need to pass the data from one node to another node (however, there is some additional overhead that is neglected here). If you have six GPUs in two nodes you need to pass the data to five other GPUs, three of which need to go through the network card (3x 0.75ms), while two can use PCIe 3.0 to pass the data to the other two GPUs (about three times as fast; 2x 0.25ms). However, the PCIe pass is independent of the network card pass, so the time needed is determined by the network card time alone, i.e. 2.25ms. However, only one GPU can transfer data through the network card at any one time in any one node, so that we have to multiply that time by three, i.e. 7.75ms. Now the bottom line is, that we just need about 0.2ms for a matrix multiply through that layer (100×1000 dot 1000×1000) and about twice as much for the backward pass. We can pass the gradient while we work on the next layer, but in the end the network card speed limits our overall computation by quite a bit. This is more marked the larger you scale your system: A four node system working on the same problem needs about 20.25ms to pass the gradients around to the other GPUs. One can easily see that data parallelism does not scale with size of the cluster.

To counter this bottleneck is to reduce the parameters of the gradient through max pooling, maxout units or by simply using convolution. Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.

Another thing you can do when you use computationally intensive optimization techniques is to hide latency of networking under the computation of the gradients. This means while you passing the first gradient to all other nodes, you can already start a big RMSProp computation asynchronously for the next layer. This technique can give a speedup of about 0-20 % depending on network architecture.

But this is not the only problem with data parallelism. There is a very technical bottleneck hidden in the GPU architecture which took me quite a while to understand. To understand why the GPU architecture is a problem we first need to look at the usage and purpose of mini-batches.

A divergence: Why do we use mini-batches?

If we start with randomly initialized parameters or even if we start with pretrained parameters, we do not need a pass through all the data to get an accurate gradient update that will head into the direction of a local minimum. If we take MNIST as an example, if we have a gradient which includes 10 common mistakes that the network does for each class (mini-batch size of about 128), then we will go into a direction that reduces the error greatly already as the gradient captures rough and common mistakes. If we choose a greater batch size (say 512) then we not only capture common errors, but also catch errors that are more subtle. However, it is not very sensible to fine-tune a system if you know it still has major errors. So overall we gain little by increasing the batch size. We need more computation to do roughly the same and this is the main argument why we use a mini-batch size as small as possible. However, if we choose a mini-batch size that is too small, then we do not capture all the common errors which are relevant for the data set and thus our gradient might not head near a local optimum, so there is a limit how small you can make mini-batches.

How does this relate to data parallelism? If we want a mini-batch size of 128 and use data parallelism to divide it among, say, eight GPUs, then each net calculates gradients for 16 samples which is then averages with the data from the other GPUs. And exactly here kicks the hardware bottleneck in.

Memory tiles: Patches of fast GPU memory for efficient dot product calculations

To calculate dot products on the GPU, you need to copy small patches, called memory tiles, into shared memory, i.e. very fast but very small memory (limited to a few kilobytes). The problem is that the standard cuBLAS uses either a 64×128 memory tiles and when you have a batch size less than 64 you waste a lot of precious shared memory. Also if you use a batch size not equal to a multiple of 32 you equally waste shared memory (threads are only started in blocks of 32 threads), so one should use a batch size which is a multiple of 32 or multiple of 64 if possible. For data parallelism this means that you lose significant processing speed once you go below a batch size of 64 for each GPU. If you have many GPUs this can be quite limiting and this is yet another reason why the data parallelism approach does not scale well beyond a certain point.

All in all this sounds quite dire for data parallelism, but data parallelism has its uses. If you know the bottlenecks, you can wield data parallelism as a might tool for certain applications. This is demonstrated by Alex Krishevsky in his paper where he uses data parallelism in the convolutional layers of his net, and thus achieves a speedup of 3.74x by using four GPUs and 6.25x using eight GPUs. His system features two CPUs and 8 GPUs in one node, so he can use the full PCIe speed for the two sets of four GPUs and relatively fast PCIe connection between CPUs to distribute the data among all eight GPUs.

Besides convolutional neural networks, another use of data parallelism might be to use it in recurrent neural networks, which typically have less parameters and highly computationally intensive gradient updates – both are wins for data parallelism.

In my next blog post I will focus on model parallelism, which is efficient for large networks and scales well to larger clusters.

The post How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism appeared first on Tim Dettmers.

How To Build and Use a Multi GPU System for Deep Learning

Tim Dettmers — Sun, 21 Sep 2014 15:52:40 +0000

When I started using GPUs for deep learning my deep learning skills improved quickly. When you can run experiments of algorithms and algorithms with different parameters and gain rapid feedback you can just learn much more quickly. At the beginning, deep learning is a lot of trial and error: You have to get a feel what parameters need to be adjusted, or what puzzle piece is missing in order to get a good result. A GPU helps you to fail quickly and learn important lessons so that you can keep improving. Soon my deep learning skills were sufficient to take the 2^nd place in the Crowdflower competition where the task was to predict weather labels from given tweets (sunny, raining etc.).

After this success I was tempted to use multiple GPUs in order to train deep learning algorithms even faster. I also took interest in learning very large models which do not fit into a single GPU. I thus wanted to build a little GPU cluster and explore the possibilities to speed up deep learning with multiple nodes with multiple GPUs. At the same time I was offered to do contract work as a data base developer through my old employer. This gave me opportunity to get the money to build the GPU cluster I thought of.

Important components in a GPU cluster

When I did my research on which hardware to buy I soon realized, that the main bottleneck will be the network bandwidth, i.e. how much data can be transferred from computer to computer per second. The network bandwidth of network cards (affordable cards are at about 4GB/s) does not come even close to the speed of PCIe 3.0 bandwidth (15.75 GB/s). So GPU-to-GPU communication within a computer will be fast, but it will be slow between computers. On top of that most network card only work with memory that is registered with the CPU and so the GPU to GPU transfer between two nodes would be like this: GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2. What this means is, if one chooses a slow network card then there might be no speedups over a single computer. Even with fast network cards, if the cluster is large, one does not even get speedups from GPUs when compared to CPUs as the GPUs just work too fast for the network cards to keep up with them.

This is the reason why many big companies like Google and Microsoft are using CPU rather than GPU clusters to train their big neural networks. Luckily, Mellanox and Nvidia recently came together to work on that problem and the result is GPUDirect RDMA, a network card driver that can make sense of GPU memory addresses and thus can transfer data directly from GPU to GPU between computers.

NVIDIA GPUDirect RDMA can bypass the CPU for inter-node communication – data is directly transfered between two GPUs.

Generally your best bet for cheap network cards is eBay. I won an auction for a set of 40Gbit/s Mellanox network cards that support GPUDirect RDMA along with the fitting fibre cable on eBay. I already had two GTX Titan GPUs with 6GB of memory and as I wanted to build huge models that do not fit into a single memory, so I decided to keep the 6GB cards and buy more of them to build a cluster that features 24GB memory. In retrospect this was a rather foolish (and expensive) idea, but little did I know about the performance of such large models and how to evaluate the performance of GPUs. All the lessons I learned from this can be found here. Besides that the hardware is rather straightforward. For fast inter-node communication PCIe 3.0 is faster than PCIe 2.0, so I got a PCIe 3.0 board. It is also a good idea to have about two times the RAM than you have GPU memory to be able to work more freely to handle big nets. As deep learning programs use a single thread for a GPU most of the time, a CPU with as many cores as GPUs you have is often sufficient.

Hardware: Check. Software: ?

There are basically two options how to do multi-GPU programming. You do it in CUDA and have a single thread and manage the GPUs directly by setting the current device and by declaring and assigning a dedicated memory-stream to each GPU, or the other options is to use CUDA-aware MPI where a single thread is spawned for each GPU and all communication and synchronization is handled by MPI. The first method is rather complicated as you need to create efficient abstractions where you loop through the GPUs and handle streaming and computing. Even with efficient abstractions your code can blow up quickly in line count making it less readable and maintainable.

Some sample MPI code. The first action spreads one chuck of data to all others computer in the network; the second action receives one chuck of data from every process. That is all you need to do, it is very easy!

The second option is much more efficient and clean. MPI is the standard in high performance computing and its standardized library means that you can be sure that a MPI method really does what it is supposed to do. Underlying MPI the same principles are used as in the first method describes above, but the abstraction is so good that it is quite easy to adapt single GPU code to multiple GPU code (at least for data parallelism). The result is clean and maintainable code and as such I would always recommend using MPI for multi-GPU computing. As MPI libraries come in many languages and you can pair them with the language of your choice. With these two components you are ready to go and can immediately start programming deep learning algorithms for multiple GPUs.

[Image source: NVIDIA GPUDirect Key Technologies]

The post How To Build and Use a Multi GPU System for Deep Learning appeared first on Tim Dettmers.