Deep Learning Archives — Tim Dettmers

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

Tim Dettmers — Mon, 30 Jan 2023 15:50:00 +0000

Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.

This blog post is designed to give you different levels of understanding of GPUs and the new Ampere series GPUs from NVIDIA. You have the choice: (1) If you are not interested in the details of how GPUs work, what makes a GPU fast compared to a CPU, and what is unique about the new NVIDIA RTX 40 Ampere series, you can skip right to the performance and performance per dollar charts and the recommendation section. The cost/performance numbers form the core of the blog post and the content surrounding it explains the details of what makes up GPU performance.

(2) If you worry about specific questions, I have answered and addressed the most common questions and misconceptions in the later part of the blog post.

(3) If you want to get an in-depth understanding of how GPUs, caches, and Tensor Cores work, the best is to read the blog post from start to finish. You might want to skip a section or two based on your understanding of the presented topics.

Overview

This blog post is structured in the following way. First, I will explain what makes a GPU fast. I will discuss CPUs vs GPUs, Tensor Cores, memory bandwidth, and the memory hierarchy of GPUs and how these relate to deep learning performance. These explanations might help you get a more intuitive sense of what to look for in a GPU. I discuss the unique features of the new NVIDIA RTX 40 Ampere GPU series that are worth considering if you buy a GPU. From there, I make GPU recommendations for different scenarios. After that follows a Q&A section of common questions posed to me in Twitter threads; in that section, I will also address common misconceptions and some miscellaneous issues, such as cloud vs desktop, cooling, AMD vs NVIDIA, and others.

How do GPUs work?

If you use GPUs frequently, it is useful to understand how they work. This knowledge will help you to undstand cases where are GPUs fast or slow. In turn, you might be able to understand better why you need a GPU in the first place and how other future hardware options might be able to compete. You can skip this section if you just want the useful performance numbers and arguments to help you decide which GPU to buy. The best high-level explanation for the question of how GPUs work is my following Quora answer:

Read Tim Dettmers‘ answer to Why are GPUs well-suited to deep learning? on Quora

This is a high-level explanation that explains quite well why GPUs are better than CPUs for deep learning. If we look at the details, we can understand what makes one GPU better than another.

The Most Important GPU Specs for Deep Learning Processing Speed

This section can help you build a more intuitive understanding of how to think about deep learning performance. This understanding will help you to evaluate future GPUs by yourself. This section is sorted by the importance of each component. Tensor Cores are most important, followed by memory bandwidth of a GPU, the cache hierachy, and only then FLOPS of a GPU.

Tensor Cores

Tensor Cores are tiny cores that perform very efficient matrix multiplication. Since the most expensive part of any deep neural network is matrix multiplication Tensor Cores are very useful. In fast, they are so powerful, that I do not recommend any GPUs that do not have Tensor Cores.

It is helpful to understand how they work to appreciate the importance of these computational units specialized for matrix multiplication. Here I will show you a simple example of A*B=C matrix multiplication, where all matrices have a size of 32×32, what a computational pattern looks like with and without Tensor Cores. This is a simplified example, and not the exact way how a high performing matrix multiplication kernel would be written, but it has all the basics. A CUDA programmer would take this as a first “draft” and then optimize it step-by-step with concepts like double buffering, register optimization, occupancy optimization, instruction-level parallelism, and many others, which I will not discuss at this point.

To understand this example fully, you have to understand the concepts of cycles. If a processor runs at 1GHz, it can do 10^9 cycles per second. Each cycle represents an opportunity for computation. However, most of the time, operations take longer than one cycle. Thus we essentially have a queue where the next operations needs to wait for the next operation to finish. This is also called the latency of the operation.

Here are some important latency cycle timings for operations. These times can change from GPU generation to GPU generation. These numbers are for Ampere GPUs, which have relatively slow caches.

Global memory access (up to 80GB): ~380 cycles
L2 cache: ~200 cycles
L1 cache or Shared memory access (up to 128 kb per Streaming Multiprocessor): ~34 cycles
Fused multiplication and addition, a*b+c (FFMA): 4 cycles
Tensor Core matrix multiply: 1 cycle

Each operation is always performed by a pack of 32 threads. This pack is termed a warp of threads. Warps usually operate in a synchronous pattern — threads within a warp have to wait for each other. All memory operations on the GPU are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes, exactly 32 floats, exactly one float for each thread in a warp. We can have up to 32 warps = 1024 threads in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. The resources of an SM are divided up among all active warps. This means that sometimes we want to run fewer warps to have more registers/shared memory/Tensor Core resources per warp.

For both of the following examples, we assume we have the same computational resources. For this small example of a 32×32 matrix multiply, we use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM.

To understand how the cycle latencies play together with resources like threads per SM and shared memory per SM, we now look at examples of matrix multiplication. While the following example roughly follows the sequence of computational steps of matrix multiplication for both with and without Tensor Cores, please note that these are very simplified examples. Real cases of matrix multiplication involve much larger shared memory tiles and slightly different computational patterns.

Matrix multiplication without Tensor Cores

If we want to do an A*B=C matrix multiply, where each matrix is of size 32×32, then we want to load memory that we repeatedly access into shared memory because its latency is about five times lower (200 cycles vs 34 cycles). A memory block in shared memory is often referred to as a memory tile or just a tile. Loading two 32×32 floats into a shared memory tile can happen in parallel by using 2*32 warps. We have 8 SMs with 8 warps each, so due to parallelization, we only need to do a single sequential load from global to shared memory, which takes 200 cycles.

To do the matrix multiplication, we now need to load a vector of 32 numbers from shared memory A and shared memory B and perform a fused multiply-and-accumulate (FFMA). Then store the outputs in registers C. We divide the work so that each SM does 8x dot products (32×32) to compute 8 outputs of C. Why this is exactly 8 (4 in older algorithms) is very technical. I recommend Scott Gray’s blog post on matrix multiplication to understand this. This means we have 8x shared memory accesses at the cost of 34 cycles each and 8 FFMA operations (32 in parallel), which cost 4 cycles each. In total, we thus have a cost of:

200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles

Let’s look at the cycle cost of using Tensor Cores.

Matrix multiplication with Tensor Cores

With Tensor Cores, we can perform a 4×4 matrix multiplication in one cycle. To do that, we first need to get memory into the Tensor Core. Similarly to the above, we need to read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need to do 8×8=64 Tensor Cores operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores — just the number that we need! We can transfer the data from shared memory to the Tensor Cores with 1 memory transfers (34 cycles) and then do those 64 parallel Tensor Core operations (1 cycle). This means the total cost for Tensor Cores matrix multiplication, in this case, is:

200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.

Thus we reduce the matrix multiplication cost significantly from 504 cycles to 235 cycles via Tensor Cores. In this simplified case, the Tensor Cores reduced the cost of both shared memory access and FFMA operations.

This example is simplified, for example, usually each thread needs to calculate which memory to read and write to as you transfer data from global memory to shared memory. With the new Hooper (H100) architectures we additionally have the Tensor Memory Accelerator (TMA) compute these indices in hardware and thus help each thread to focus on more computation rather than computing indices.

Matrix multiplication with Tensor Cores and Asynchronous copies (RTX 30/RTX 40) and TMA (H100)

The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support to perform asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. the TMA unit combines asynchronous copies and index calculation for read and writes simultaneously — so each thread no longer needs to calculate which is the next element to read and each thread can focus on doing more matrix multiplication calculations. This looks as follows.

The TMA unit fetches memory from global to shared memory (200 cycles). Once the data arrives, the TMA unit fetches the next block of data asynchronously from global memory. While this is happening, the threads load data from shared memory and perform the matrix multiplication via the tensor core. Once the threads are finished they wait for the TMA unit to finish the next data transfer, and the sequence repeats.

As such, due to the asynchronous nature, the second global memory read by the TMA unit is already progressing as the threads process the current shared memory tile. This means, the second read takes only 200 – 34 – 1 = 165 cycles.

Since we do many reads, only the first memory access will be slow and all other memory accesses will be partially overlapped with the TMA unit. Thus on average, we reduce the time by 35 cycles.

165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.

Which accelerates the matrix multiplication by another 15%.

From these examples, it becomes clear why the next attribute, memory bandwidth, is so crucial for Tensor-Core-equipped GPUs. Since global memory is the by far the largest cycle cost for matrix multiplication with Tensor Cores, we would even have faster GPUs if the global memory latency could be reduced. We can do this by either increasing the clock frequency of the memory (more cycles per second, but also more heat and higher energy requirements) or by increasing the number of elements that can be transferred at any one time (bus width).

Memory Bandwidth

From the previous section, we have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices — the larger, the better for Tensor Cores — we have a Tensor Core TFLOPS utilization of about 45-65%, meaning that even for the large neural networks about 50% of the time, Tensor Cores are idle.

This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, The A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.

L2 Cache / Shared Memory / L1 Cache / Registers

Since memory transfers to the Tensor Cores are the limiting factor in performance, we are looking for other GPU attributes that enable faster memory transfer to Tensor Cores. L2 cache, shared memory, L1 cache, and amount of registers used are all related. To understand how a memory hierarchy enables faster memory transfers, it helps to understand how matrix multiplication is performed on a GPU.

To perform matrix multiplication, we exploit the memory hierarchy of a GPU that goes from slow global memory, to faster L2 memory, to fast local shared memory, to lightning-fast registers. However, the faster the memory, the smaller it is.

While logically, L2 and L1 memory are the same, L2 cache is larger and thus the average physical distance that need to be traversed to retrieve a cache line is larger. You can see the L1 and L2 caches as organized warehouses where you want to retrieve an item. You know where the item is, but to go there takes on average much longer for the larger warehouse. This is the essential difference between L1 and L2 caches. Large = slow, small = fast.

For matrix multiplication we can use this hierarchical separate into smaller and smaller and thus faster and faster chunks of memory to perform very fast matrix multiplications. For that, we need to chunk the big matrix multiplication into smaller sub-matrix multiplications. These chunks are called memory tiles, or often for short just tiles.

We perform matrix multiplication across these smaller tiles in local shared memory that is fast and close to the streaming multiprocessor (SM) — the equivalent of a CPU core. With Tensor Cores, we go a step further: We take each tile and load a part of these tiles into Tensor Cores which is directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster than the global GPU memory, whereas the Tensor Cores’ registers are ~200x faster than the global GPU memory.

Having larger tiles means we can reuse more memory. I wrote about this in detail in my TPU vs GPU blog post. In fact, you can see TPUs as having very, very, large tiles for each Tensor Core. As such, TPUs can reuse much more memory with each transfer from global memory, which makes them a little bit more efficient at matrix multiplications than GPUs.

Each tile size is determined by how much memory we have per streaming multiprocessor (SM) and how much we L2 cache we have across all SMs. We have the following shared memory sizes on the following architectures:

Volta (Titan V): 128kb shared memory / 6 MB L2
Turing (RTX 20s series): 96 kb shared memory / 5.5 MB L2
Ampere (RTX 30s series): 128 kb shared memory / 6 MB L2
Ada (RTX 40s series): 128 kb shared memory / 72 MB L2

We see that Ada has a much larger L2 cache allowing for larger tile sizes, which reduces global memory access. For example, for BERT large during training, the input and weight matrix of any matrix multiplication fit neatly into the L2 cache of Ada (but not other Us). As such, data needs to be loaded from global memory only once and then data is available throught the L2 cache, making matrix multiplication about 1.5 – 2.0x faster for this architecture for Ada. For larger models the speedups are lower during training but certain sweetspots exist which may make certain models much faster. Inference, with a batch size larger than 8 can also benefit immensely from the larger L2 caches.

Estimating Ada / Hopper Deep Learning Performance

This section is for those who want to understand the more technical details of how I derive the performance estimates for Ampere GPUs. If you do not care about these technical aspects, it is safe to skip this section.

Practical Ada / Hopper Speed Estimates

Suppose we have an estimate for one GPU of a GPU-architecture like Hopper, Ada, Ampere, Turing, or Volta. It is easy to extrapolate these results to other GPUs from the same architecture/series. Luckily, NVIDIA already benchmarked the A100 vs V100 vs H100 across a wide range of computer vision and natural language understanding tasks. Unfortunately, NVIDIA made sure that these numbers are not directly comparable by using different batch sizes and the number of GPUs whenever possible to favor results for the H100 GPU. So in a sense, the benchmark numbers are partially honest, partially marketing numbers. In general, you could argue that using larger batch sizes is fair, as the H100/A100 GPU has more memory. Still, to compare GPU architectures, we should evaluate unbiased memory performance with the same batch size.

To get an unbiased estimate, we can scale the data center GPU results in two ways: (1) account for the differences in batch size, (2) account for the differences in using 1 vs 8 GPUs. We are lucky that we can find such an estimate for both biases in the data that NVIDIA provides.

Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. I benchmarked the same problem for transformers on my RTX Titan and found, surprisingly, the very same result: 13.5% — it appears that this is a robust estimate.

As we parallelize networks across more and more GPUs, we lose performance due to some networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — this is another confounding factor. Looking directly at the data from NVIDIA, we can find that for CNNs, a system with 8x A100 has a 5% lower overhead than a system of 8x V100. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.

Using these figures, we can estimate the speedup for a few specific deep learning architectures from the direct data that NVIDIA provides. The Tesla A100 offers the following speedup over the Tesla V100:

SE-ResNeXt101: 1.43x
Masked-R-CNN: 1.47x
Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x

Thus, the figures are a bit lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations that are needed to prepare the matrix multiplication like img2col or Fast Fourier Transform (FFT), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of the specific architectures (grouped convolution).

The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.

Possible Biases in Estimates

The estimates above are for H100, A100 , and V100 GPUs. In the past, NVIDIA sneaked unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. It might be possible that there are unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.

As of now, one of these degradations was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning purposes. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.

Currently, no degradation for Ada GPUs are known, but I update this post with news on this and let my followers on twitter know.

Advantages and Problems for RTX40 and RTX 30 Series

The new NVIDIA Ampere RTX 30 series has additional benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as an ease-of-use-feature as they provide the same performance boost as Turing does but without any extra programming required.

The Ada RTX 40 series has even further advances like 8-bit Float (FP8) tensor cores. The RTX 40 series also has similar power and temperature issues compared to the RTX 30. The issue of melting power connector cables in the RTX 40 can be easily prevented by connecting the power cable correctly.

Sparse Network Training

Ampere allows for fine-grained structure automatic sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 elements of these 4 to be zero. Figure 1 shows how this could look like.

Figure X: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. Figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=249%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?fit=321%2C387&ssl=1" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=258%2C311&ssl=1" alt="Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool's GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA." class="wp-image-935" width="258" height="311" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?w=321&ssl=1 321w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matrix_ampere.png?resize=249%2C300&ssl=1 249w" sizes="(max-width: 258px) 100vw, 258px" data-recalc-dims="1" />

Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

When you multiply this sparse weight matrix with some dense inputs, the sparse matrix tensor core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size as can be seen in Figure 2. After this compression, the densely compressed matrix tile is fed into the tensor core which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup since the bandwidth requirements during matrix multiplication from shared memory are halved.

Figure X: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. Figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=300%2C181&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?fit=1024%2C619&ssl=1" width="1024" height="619" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&ssl=1" alt="Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. " class="wp-image-934" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=1024%2C619&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=300%2C181&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?resize=768%2C464&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/sparse_matmul.png?w=1055&ssl=1 1055w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.

I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.

Figure X: The sparse training algorithm developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layers.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=300%2C145&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1024%2C493&ssl=1" width="1024" height="493" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&ssl=1" alt="Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post." class="wp-image-779" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=300%2C145&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=768%2C370&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?w=1096&ssl=1 1096w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post.

While this feature is still experimental and training sparse networks are not commonplace yet, having this feature on your GPU means you are ready for the future of sparse training.

Low-precision Computation

In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.

Figure X: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent large numbers and small numbers with high precision.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=300%2C93&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?fit=869%2C268&ssl=1" width="869" height="268" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=869%2C268&ssl=1" alt="Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision." class="wp-image-941" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?w=869&ssl=1 869w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=300%2C93&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/8-bit_data_types.png?resize=768%2C237&ssl=1 768w" sizes="(max-width: 869px) 100vw, 869px" data-recalc-dims="1" />

Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantized the range [0, 0.9] while all previous bits are used for the exponent. This allows to dynamically represent numbers that are both large and small with high precision.

Currently, if you want to have stable backpropagation with 16-bit floating-point numbers (FP16), the big problem is that ordinary FP16 data types only support numbers in the range [-65,504, 65,504]. If your gradient slips past this range, your gradients explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling where you multiply the loss by a small number before backpropagating to prevent this gradient explosion.

The BrainFloat 16 format (BF16) uses more bits for the exponent such that the range of possible numbers is the same as for FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is significant digits, but gradient precision is not that important for learning. So what BF16 does is that you no longer need to do any loss scaling or worry about the gradient blowing up quickly. As such, we should see an increase in training stability by using the BF16 format as a slight loss of precision.

What this means for you: With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups. With 32-bit TensorFloat (TF32) precision, you get near FP32 stability while giving the speedups close to FP16. The good thing is, to use these data types, you can just replace FP32 with TF32 and FP16 with BF16 — no code changes required!

Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.

Fan Designs and GPUs Temperature Issues

While the new fan design of the RTX 30 series performs very well to cool the GPU, different fan designs of non-founders edition GPUs might be more problematic. If your GPU heats up beyond 80C, it will throttle itself and slow down its computational speed / power. This overheating can happen in particular if you stack multiple GPUs next to each other. A solution to this is to use PCIe extenders to create space between GPUs.

Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This has been running with no problems at all for 4 years now. It can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090 and spread them with PCIe extenders within the case. With this, you might solve both the space issue and cooling issue for a 4x RTX 4090 setup with a single simple solution.

4x GPUs with PCIe extenders

" data-image-caption="

4x GPUs with PCIe extenders

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=225%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?fit=768%2C1024&ssl=1" width="768" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders.jpg?resize=768%2C1024&ssl=1" alt="Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 2 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs." class="wp-image-861" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=768%2C1024&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=225%2C300&ssl=1 225w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1152%2C1536&ssl=1 1152w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?resize=1536%2C2048&ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/4x_RTX2080Ti_desktop_extenders-scaled.jpg?w=1920&ssl=1 1920w" sizes="(max-width: 768px) 100vw, 768px" data-recalc-dims="1" />

Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 4 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs.

3-slot Design and Power Issues

The RTX 3090 and RTX 4090 are 3-slot GPUs, so one will not be able to use it in a 4x setup with the default fan design from NVIDIA. This is kind of justified because it runs at over 350W TDP, and it will be difficult to cool in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.

It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W to power the CPU and motherboard can be too tight. The components’ maximum power is only used if the components are fully utilized, and in deep learning, the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build, it is better to look for high wattage PSUs (+1700W). Some of my followers have had great success with cryptomining PSUs — have a look in the comment section for more info about that. Otherwise, it is important to note that not all outlets support PSUs above 1600W, especially in the US. This is the reason why in the US, there are currently few standard desktop PSUs above 1600W on the market. If you get a server or cryptomining PSUs, beware of the form factor — make sure it fits into your computer case.

Power Limiting: An Elegant Solution to Solve the Power Problem?

It is possible to set a power limit on your GPUs. So you would be able to programmatically set the power limit of an RTX 3090 to 300W instead of their standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to build a 4x RTX 3090 system with a 1600W PSU feasible. It also helps to keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setups, cooling, and power, at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but this resolves the PSU problem.

Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

" data-image-caption="

Figure X: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=298%2C300&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?fit=1017%2C1024&ssl=1" width="1017" height="1024" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&ssl=1" alt="Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent." class="wp-image-933" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=1017%2C1024&ssl=1 1017w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=298%2C300&ssl=1 298w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=150%2C150&ssl=1 150w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?resize=768%2C773&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2020/09/power_limit_nvidia_smi.png?w=1187&ssl=1 1187w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.

You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits to test this. I benchmarked the time for 500 mini-batches for BERT Large during inference (excluding the softmax layer). I choose BERT Large inference since, from my experience, this is the deep learning model that stresses the GPU the most. As such, I would expect power limiting to have the most massive slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns that you can expect. The results are shown in Figure 7.

Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

" data-image-caption="

Figure 6: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

" data-medium-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" data-large-file="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" width="853" height="703" src="https://timdettmers.com/wp-content/uploads/2020/09/RTX-2080-Ti-Slowdown-vs-Power-Limit.svg" alt="Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer)." class="wp-image-939"/>

Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).

As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.

RTX 4090s and Melting Power Connectors: How to Prevent Problems

There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem and the problem occured due to user error. Here a video that shows that the main problem is that cables were not inserted correctly.

So using RTX 4090 cards is perfectly safe if you follow the following install instructions:

If you use an old cable or old GPU make sure the contacts are free of debri / dust.
Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
Test for good fit by wiggling the power cable left to right. The cable should not move.
Check the contact with the socket visually, there should be no gap between cable and socket.

8-bit Float Support in H100 and RTX 40 series GPUs

The support of the 8-bit Float (FP8) is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs it allows you to load the data for matrix multiplication twice as fast, you can store twice as much matrix elements in your caches which in the Ada and Hopper architecture are very large, and now with FP8 tensor cores you get 0.66 PFLOPS of compute for a RTX 4090 — this is more FLOPS then the entirety of the worlds fastest supercomputer in year 2007. 4x RTX 4090 with FP8 compute rival the faster supercomputer in the world in year 2010 (deep learning started to work just in 2009).

The main problem with using 8-bit precision is that transformers can get very unstable with so few bits and crash during training or generate non-sense during inference. I have written a paper about the emergence of instabilities in large language models and I also written a more accessible blog post.

The main take-way is this: Using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision everything works just fine.

Main results from my work on 8-bit matrix multiplication for Large Language Models (LLMs). We can see that the best 8-bit baseline fails to deliver good zero-shot performance. The method that I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.

But Int8 was already supported by the RTX 30 / A100 / Ampere generation GPUs, why is FP8 in the RTX 40 another big upgrade? The FP8 data type is much more stable than the Int8 data type and its easy to use it in functions like layer norm or non-linear functions, which are difficult to do with Integer data types. This will make it very straightforward to use it in training and inference. I think this will make FP8 training and inference relatively common in a couple of months.

If you want to read more about the advantages of Float vs Integer data types you can read my recent paper about k-bit inference scaling laws. Below you can see one relevant main result for Float vs Integer data types from this paper. We can see that bit-by-bit, the FP4 data type preserve more information than Int4 data type and thus improves the mean LLM zeroshot accuracy across 4 tasks.

4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=300%2C226&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?fit=1024%2C773&ssl=1" width="1024" height="773" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&ssl=1" alt="" class="wp-image-1151" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=1024%2C773&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=300%2C226&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?resize=768%2C580&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/pythia_4bit_datatypes2.png?w=1159&ssl=1 1159w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

4-bit Inference scaling laws for Pythia Large Language Models for different data types. We see that bit-by-bit, 4-bit float data types have better zeroshot accuracy compared to the Int4 data types.

Raw Performance Ranking of GPUs

Below we see a chart of raw relevative performance across all GPUs. We see that there is a gigantic gap in 8-bit performance of H100 GPUs and old cards that are optimized for 16-bit performance.

Shown is raw relative performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=300%2C295&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?fit=1024%2C1006&ssl=1" width="1024" height="1006" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&ssl=1" alt="" class="wp-image-1160" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1024%2C1006&ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=300%2C295&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=768%2C754&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?resize=1536%2C1509&ssl=1 1536w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/GPUS_Ada_raw_performance3.png?w=1703&ssl=1 1703w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Shown is raw relative transformer performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.

For this data, I did not model 8-bit compute for older GPUs. I did so, because 8-bit Inference and training are much more effective on Ada/Hopper GPUs because of the 8-bit Float data type and Tensor Memory Accelerator (TMA) which saves the overhead of computing read/write indices which is particularly helpful for 8-bit matrix multiplication. Ada/Hopper also have FP8 support, which makes in particular 8-bit training much more effective.

I did not model numbers for 8-bit training because to model that I need to know the latency of L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance can well be 3-4x of 16-bit training performance if the caches are as fast as rumored.

But even with the new FP8 tensor cores there are some additional issues which are difficult to take into account when modeling GPU performance. For example, FP8 tensor cores do not support transposed matrix multiplication which means backpropagation needs either a separate transpose before multiplication or one needs to hold two sets of weights — one transposed and one non-transposed — in memory. I used two sets of weight when I experimented with Int8 training in my LLM.int8() project and this reduced the overall speedups quite significantly. I think one can do better with the right algorithms/software, but this shows that missing features like a transposed matrix multiplication for tensor cores can affect performance.

For old GPUs, Int8 inference performance is close to the 16-bit inference performance for models below 13B parameters. Int8 performance on old GPUs is only relevant if you have relatively large models with 175B parameters or more. If you are interested in 8-bit performance of older GPUs, you can read the Appendix D of my LLM.int8() paper where I benchmark Int8 performance.

GPU Deep Learning Performance per Dollar

Below we see the chart for the performance per US dollar for all GPUs sorted by 8-bit inference performance. How to use the chart to find a suitable GPU for you is as follows:

Determine the amount of GPU memory that you need (rough heuristic: at least 12 GB for image generation; at least 24 GB for work with transformers)
While 8-bit inference and training is experimental, it will become standard within 6 months. You might need to do some extra difficult coding to work with 8-bit in the meantime. Is that OK for you? If not, select for 16-bit performance.
Using the metric determined in (2), find the GPU with the highest relative performance/dollar that has the amount of memory you need.

We can see that the RTX 4070 Ti is most cost-effective for 8-bit and 16-bit inference while the RTX 3080 remains most cost-effective for 16-bit training. While these GPUs are most cost-effective, they are not necessarily recommended as they do not have sufficient memory for many use-cases. However, it might be the ideal cards to get started on your deep learning journey. Some of these GPUs are excellent for Kaggle competition where one can often rely on smaller models. Since to do well in Kaggle competitions the method of how you work is more important than the models size, many of these smaller GPUs are excellent for Kaggle competitions.

The best GPUs for academic and startup servers seem to be A6000 Ada GPUs (not to be confused with A6000 Turing). The H100 SXM GPU is also very cost effective and has high memory and very strong performance. If I would build a small cluster for a company/academic lab, I would use 66-80% A6000 GPUs and 20-33% H100 SXM GPUs. If I get a good deal on L40 GPUs, I would also pick them instead of A6000, so you can always ask for a quote on these.

Shown is relative performance per US Dollar of GPUs normalized by the cost for a desktop computer and the average Amazon and eBay price for each GPU. Additionally, the electricity cost of ownership for 5 years is added with an electricity price of 0.175 USD per kWh and a 15% GPU utilization rate. The electricity cost for a RTX 4090 is about $100 per year. How to read and interpret the chart: a desktop computer with RTX 4070 Ti cards owned for 5 years yields about 2x more 8-bit inference performance per dollar compared to a RTX 3090 GPU.

GPU Recommendations

I have a create a recommendation flow-chart that you can see below (click here for interactive app from Nan Xiao). While this chart will help you in 80% of cases, it might not quite work for you because the options might be too expensive. In that case, try to look at the benchmarks above and pick the most cost effective GPU that still has enough GPU memory for your use-case. You can estimate the GPU memory needed by running your problem in the vast.ai or Lambda Cloud for a while so you know what you need. The vast.ai or Lambda Cloud might also work well if you only need a GPU very sporadically (every couple of days for a few hours) and you do not need to download and process large dataset to get started. However, cloud GPUs are usually not a good option if you use your GPU for many months with a high usage rate each day (12 hours each day). You can use the example in the “When is it better to use the cloud vs a dedicated GPU desktop/server?” section below to determine if cloud GPUs are good for you.

GPU recommendation chart for Ada/Hopper GPUs. Follow the answers to the Yes/No questions to find the GPU that is most suitable for you. While this chart works well in about 80% of cases, you might end up with a GPU that is too expensive. Use the cost/performance charts above to make a selection instead.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=300%2C244&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?fit=845%2C686&ssl=1" width="845" height="686" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=845%2C686&ssl=1" alt="" class="wp-image-1173" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?w=845&ssl=1 845w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=300%2C244&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2023/01/gpu_recommendations.png?resize=768%2C623&ssl=1 768w" sizes="(max-width: 845px) 100vw, 845px" data-recalc-dims="1" />

Is it better to wait for future GPUs for an upgrade? The future of GPUs.

To understand if it makes sense to skip this generation and buy the next generation of GPUs, it makes sense to talk a bit about what improvements in the future will look like.

In the past it was possible to shrink the size of transistors to improve speed of a processor. This is coming to an end now. For example, while shrinking SRAM increased its speed (smaller distance, faster memory access), this is no longer the case. Current improvements in SRAM do not improve its performance anymore and might even be negative. While logic such as Tensor Cores get smaller, this does not necessarily make GPU faster since the main problem for matrix multiplication is to get memory to the tensor cores which is dictated by SRAM and GPU RAM speed and size. GPU RAM still increases in speed if we stack memory modules into high-bandwidth modules (HBM3+), but these are too expensive to manufacture for consumer applications. The main way to improve raw speed of GPUs is to use more power and more cooling as we have seen in the RTX 30s and 40s series. But this cannot go on for much longer.

Chiplets such as used by AMD CPUs are another straightforward way forward. AMD beat Intel by developing CPU chiplets. Chiplets are small chips that are fused together with a high speed on-chip network. You can think about them as two GPUs that are so physically close together that you can almost consider them a single big GPU. They are cheaper to manufacture, but more difficult to combine into one big chip. So you need know-how and fast connectivity between chiplets. AMD has a lot of experience with chiplet design. AMD’s next generation GPUs are going to be chiplet designs, while NVIDIA currently has no public plans for such designs. This may mean that the next generation of AMD GPUs might be better in terms of cost/performance compared to NVIDIA GPUs.

However, the main performance boost for GPUs is currently specialized logic. For example, the asynchronous copy hardware units on the Ampere generation (RTX 30 / A100 / RTX 40) or the extension, the Tensor Memory Accelerator (TMA), both reduce the overhead of copying memory from the slow global memory to fast shared memory (caches) through specialized hardware and so each thread can do more computation. The TMA also reduces overhead by performing automatic calculations of read/write indices which is particularly important for 8-bit computation where one has double the elements for the same amount of memory compared to 16-bit computation. So specialized hardware logic can accelerate matrix multiplication further.
Low-bit precision is another straightforward way forward for a couple of years. We will see widespread adoption of 8-bit inference and training in the next months. We will see widespread 4-bit inference in the next year. Currently, the technology for 4-bit training does not exists, but research looks promising and I expect the first high performance FP4 Large Language Model (LLM) with competitive predictive performance to be trained in 1-2 years time.

Going to 2-bit precision for training currently looks pretty impossible, but it is a much easier problem than shrinking transistors further. So progress in hardware mostly depends on software and algorithms that make it possible to use specialized features offered by the hardware.

We will probably be able to still improve the combination of algorithms + hardware to the year 2032, but after that will hit the end of GPU improvements (similar to smartphones). The wave of performance improvements after 2032 will come from better networking algorithms and mass hardware. It is uncertain if consumer GPUs will be relevant at this point. It might be that you need an RTX 9090 to run run Super HyperStableDiffusion Ultra Plus 9000 Extra or OpenChatGPT 5.0, but it might also be that some company will offer a high-quality API that is cheaper than the electricity cost for a RTX 9090 and you want to use a laptop + API for image generation and other tasks.

Overall, I think investing into a 8-bit capable GPU will be a very solid investment for the next 9 years. Improvements at 4-bit and 2-bit are likely small and other features like Sort Cores would only become relevant once sparse matrix multiplication can be leveraged well. We will probably see some kind of other advancement in 2-3 years which will make it into the next GPU 4 years from now, but we are running out of steam if we keep relying on matrix multiplication. This makes investments into new GPUs last longer.

Question & Answers & Misconceptions

Do I need PCIe 4.0 or PCIe 5.0?

Generally, no. PCIe 5.0 or 4.0 is great if you have a GPU cluster. It is okay if you have an 8x GPU machine, but otherwise, it does not yield many benefits. It allows better parallelization and a bit faster data transfer. Data transfers are not a bottleneck in any application. In computer vision, in the data transfer pipeline, the data storage can be a bottleneck, but not the PCIe transfer from CPU to GPU. So there is no real reason to get a PCIe 5.0 or 4.0 setup for most people. The benefits will be maybe 1-7% better parallelization in a 4 GPU setup.

Do I need 8x/16x PCIe lanes?

Same as with PCIe 4.0 — generally, no. PCIe lanes are needed for parallelization and fast data transfers, which are seldom a bottleneck. Operating GPUs on 4x lanes is fine, especially if you only have 2 GPUs. For a 4 GPU setup, I would prefer 8x lanes per GPU, but running them at 4x lanes will probably only decrease performance by around 5-10% if you parallelize across all 4 GPUs.

How do I fit 4x RTX 4090 or 3090 if they take up 3 PCIe slots each?

You need to get one of the two-slot variants, or you can try to spread them out with PCIe extenders. Besides space, you should also immediately think about cooling and a suitable PSU.

PCIe extenders might also solve both space and cooling issues, but you need to make sure that you have enough space in your case to spread out the GPUs. Make sure your PCIe extenders are long enough!

How do I cool 4x RTX 3090 or 4x RTX 3080?

See the previous section.

Can I use multiple GPUs of different GPU types?

Yes, you can! But you cannot parallelize efficiently across GPUs of different types since you will often go at the speed of the slowest GPU (data and fully sharded parallelism). So different GPUs work just fine, but parallelization across those GPUs will be inefficient since the fastest GPU will wait for the slowest GPU to catch up to a synchronization point (usually gradient update).

What is NVLink, and is it useful?

Generally, NVLink is not useful. NVLink is a high speed interconnect between GPUs. It is useful if you have a GPU cluster with +128 GPUs. Otherwise, it yields almost no benefits over standard PCIe transfers.

I do not have enough money, even for the cheapest GPUs you recommend. What can I do?

Definitely buy used GPUs. You can buy a small cheap GPU for prototyping and testing and then roll out for full experiments to the cloud like vast.ai or Lambda Cloud. This can be cheap if you train/fine-tune/inference on large models only every now and then and spent more time protoyping on smaller models.

What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?

I built a carbon calculator for calculating your carbon footprint for academics (carbon from flights to conferences + GPU time). The calculator can also be used to calculate a pure GPU carbon footprint. You will find that GPUs produce much, much more carbon than international flights. As such, you should make sure you have a green source of energy if you do not want to have an astronomical carbon footprint. If no electricity provider in our area provides green energy, the best way is to buy carbon offsets. Many people are skeptical about carbon offsets. Do they work? Are they scams?

I believe skepticism just hurts in this case, because not doing anything would be more harmful than risking the probability of getting scammed. If you worry about scams, just invest in a portfolio of offsets to minimize risk.

I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.

What do I need to parallelize across two machines?

If you want to be on the safe side, you should get at least +50Gbits/s network cards to gain speedups if you want to parallelize across machines. I recommend having at least an EDR Infiniband setup, meaning a network card with at least 50 GBit/s bandwidth. Two EDR cards with cable are about $500 on eBay.

In some cases, you might be able to get away with 10 Gbit/s Ethernet, but this is usually only the case for special networks (certain convolutional networks) or if you use certain algorithms (Microsoft DeepSpeed).

Is the sparse matrix multiplication features suitable for sparse matrices in general?

It does not seem so. Since the granularity of the sparse matrix needs to have 2 zero-valued elements, every 4 elements, the sparse matrices need to be quite structured. It might be possible to adjust the algorithm slightly, which involves that you pool 4 values into a compressed representation of 2 values, but this also means that precise arbitrary sparse matrix multiplication is not possible with Ampere GPUs.

Do I need an Intel CPU to power a multi-GPU setup?

I do not recommend Intel CPUs unless you heavily use CPUs in Kaggle competitions (heavy linear algebra on the CPU). Even for Kaggle competitions AMD CPUs are still great, though. AMD CPUs are cheaper and better than Intel CPUs in general for deep learning. For a 4x GPU built, my go-to CPU would be a Threadripper. We built dozens of systems at our university with Threadrippers, and they all work great — no complaints yet. For 8x GPU systems, I would usually go with CPUs that your vendor has experience with. CPU and PCIe/system reliability is more important in 8x systems than straight performance or straight cost-effectiveness.

Does computer case design matter for cooling?

No. GPUs are usually perfectly cooled if there is at least a small gap between GPUs. Case design will give you 1-3 C better temperatures, space between GPUs will provide you with 10-30 C improvements. The bottom line, if you have space between GPUs, cooling does not matter. If you have no space between GPUs, you need the right cooler design (blower fan) or another solution (water cooling, PCIe extenders), but in either case, case design and case fans do not matter.

Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?

Not in the next 1-2 years. It is a three-way problem: Tensor Cores, software, and community.

AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive. Rumors show that some data center card with Tensor Core equivalent is planned for 2020, but no new data emerged since then. Just having data center cards with a Tensor Core equivalent would also mean that few would be able to afford such AMD GPUs, which would give NVIDIA a competitive advantage.

Let’s say AMD introduces a Tensor-Core-like-hardware feature in the future. Then many people would say, “But there is no software that works for AMD GPUs! How am I supposed to use them?” This is mostly a misconception. The AMD software via ROCm has come to a long way, and support via PyTorch is excellent. While I have not seen many experience reports for AMD GPUs + PyTorch, all the software features are integrated. It seems, if you pick any network, you will be just fine running it on AMD GPUs. So here AMD has come a long way, and this issue is more or less solved.

However, if you solve software and the lack of Tensor Cores, AMD still has a problem: the lack of community. If you have a problem with NVIDIA GPUs, you can Google the problem and find a solution. That builds a lot of trust in NVIDIA GPUs. You have the infrastructure that makes using NVIDIA GPUs easy (any deep learning framework works, any scientific problem is well supported). You have the hacks and tricks that make usage of NVIDIA GPUs a breeze (e.g., apex). You can find experts on NVIDIA GPUs and programming around every other corner while I knew much less AMD GPU experts.

In the community aspect, AMD is a bit like Julia vs Python. Julia has a lot of potential, and many would say, and rightly so, that it is the superior programming language for scientific computing. Yet, Julia is barely used compared to Python. This is because the Python community is very strong. Numpy, SciPy, Pandas are powerful software packages that a large number of people congregate around. This is very similar to the NVIDIA vs AMD issue.

Thus, it is likely that AMD will not catch up until Tensor Core equivalent is introduced (1/2 to 1 year?) and a strong community is built around ROCm (2 years?). AMD will always snatch a part of the market share in specific subgroups (e.g., cryptocurrency mining, data centers). Still, in deep learning, NVIDIA will likely keep its monopoly for at least a couple more years.

When is it better to use the cloud vs a dedicated GPU desktop/server?

Rule-of-thumb: If you expect to do deep learning for longer than a year, it is cheaper to get a desktop GPU. Otherwise, cloud instances are preferable unless you have extensive cloud computing skills and want the benefits of scaling the number of GPUs up and down at will.

Numbers in the following paragraphs are going to change, but it serves as a scenario that helps you to understand the rough costs. You can use similar math to determine if cloud GPUs are the best solution for you.

For the exact point in time when a cloud GPU is more expensive than a desktop depends highly on the service that you are using, and it is best to do a little math on this yourself. Below I do an example calculation for an AWS V100 spot instance with 1x V100 and compare it to the price of a desktop with a single RTX 3090 (similar performance). The desktop with RTX 3090 costs $2,200 (2-GPU barebone + RTX 3090). Additionally, assuming you are in the US, there is an additional $0.12 per kWh for electricity. This compares to $2.14 per hour for the AWS on-demand instance.

At 15% utilization per year, the desktop uses:

(350 W (GPU) + 100 W (CPU))*0.15 (utilization) * 24 hours * 365 days = 591 kWh per year

So 591 kWh of electricity per year, that is an additional $71.

The break-even point for a desktop vs a cloud instance at 15% utilization (you use the cloud instance 15% of time during the day), would be about 300 days ($2,311 vs $2,270):

$2.14/h * 0.15 (utilization) * 24 hours * 300 days = $2,311

So if you expect to run deep learning models after 300 days, it is better to buy a desktop instead of using AWS on-demand instances.

You can do similar calculations for any cloud service to make the decision if you go for a cloud service or a desktop.

Common utilization rates are the following:

PhD student personal desktop: < 15%
PhD student slurm GPU cluster: > 35%
Company-wide slurm research cluster: > 60%

In general, utilization rates are lower for professions where thinking about cutting edge ideas is more important than developing practical products. Some areas have low utilization rates (interpretability research), while other areas have much higher rates (machine translation, language modeling). In general, the utilization of personal machines is almost always overestimated. Commonly, most personal systems have a utilization rate between 5-10%. This is why I would highly recommend slurm GPU clusters for research groups and companies instead of individual desktop GPU machines.

Version History

2023-01-30: Improved font and recommendation chart. Added 5 years cost of ownership electricity perf/USD chart. Updated Async copy and TMA functionality. Slight update to FP8 training. General improvements.
2023-01-16: Added Hopper and Ada GPUs. Added GPU recommendation chart. Added information about the TMA unit and L2 cache.
2020-09-20: Added discussion of using power limiting to run 4x RTX 3090 systems. Added older GPUs to the performance and cost/performance charts. Added figures for sparse matrix multiplication.
2020-09-07: Added NVIDIA Ampere series GPUs. Included lots of good-to-know GPU details.
2019-04-03: Added RTX Titan and GTX 1660 Ti. Updated TPU section. Added startup hardware discussion.
2018-11-26: Added discussion of overheating issues of RTX cards.
2018-11-05: Added RTX 2070 and updated recommendations. Updated charts with hard performance data. Updated TPU section.
2018-08-21: Added RTX 2080 and RTX 2080 Ti; reworked performance analysis
2017-04-09: Added cost-efficiency analysis; updated recommendation with NVIDIA Titan Xp
2017-03-19: Cleaned up blog post; added GTX 1080 Ti
2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
2015-02-23: Updated GPU recommendations and memory calculations
2014-09-28: Added emphasis for memory requirement of CNNs

Acknowledgments

I thank Suhail for making me aware of outdated prices on H100 GPUs, Gjorgji Kjosev for pointing out font issues, Anonymous for pointing out that the TMA unit does not exist on Ada GPUs, Scott Gray for pointing out that FP8 tensor cores have no transposed matrix multiplication, and reddit and HackerNews users for pointing out many other improvements.

For past updates of this blog post, I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances. I want to thank Brad Nemire for providing me with an RTX Titan for benchmarking purposes. I want to thank Agrin Hilmkil, Ari Holtzman, Gabriel Ilharco, Nam Pho for their excellent feedback on the previous version of this blog post.

The post Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning appeared first on Tim Dettmers.

LLM.int8() and Emergent Features

Tim Dettmers — Wed, 17 Aug 2022 12:51:19 +0000

When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.

From that, I learned that quantization research is like printers. Nobody cares about printers. Nobody likes printers. But everybody is happy if printers do their job.

How that job is done for you through the bitsandbytes library with Hugging Face integration so that you can easily run OPT-175B and BLOOM-176B on a single machine is described in another blog post by my colleague and collaborator Younes Belkada.

This blog post will spill some mandatory details about quantization, but I want to mostly make it about these emergent features that I found in transformers at scale. I know the claims in the paper are highly robust. This blog post is a more speculative version of the paper that teases out the super curious details about the fascinating properties surrounding the emergent outlier features I found. I cannot spill all the details because my next project will delve deep into understanding these outlier features, but the space is so rich that I am happy to give you many curious details.

Mandatory quantization details

In a previous version of this blog post, I jokingly had a section with the big title “All You Ever Wanted to Know about Quantization” The section read: “If you quantize from 16-bit to 8-bit, you lose precision which might degrade model prediction quality.”

That is it.

Most people do not want to learn more about quantization — and honestly, the small sentence above is already enough information. The details are very gritty and complicated, but it is all in the code. The math and concepts are very simple and straightforward — if you have worked on quantization before. If you have not encountered quantization, it is likely a hot devilish nightmare that will eat your liver.

For those that say, “Pfff! Why do I need a liver anyways?”. Well, here you go. For others, just move ahead and read about the mysteries of emergent features.

What is quantization?

Let us say you have a data type I5 with values [0, 1, 2, 3, 4, 5] and a data type, I3, with values [0, 2, 4], how do you quantize from data type I5 to I3? You follow a two-step procedure:

Normalize the range of I5 into I3.
Round to the nearest value of I3.

Let’s do an example. Let’s say we have the vector [3, 1, 2, 3] in I5, and we want to quantize to I3.

Here the step-by-step recipe for quantization:

We find the absolute maximum value of the vector: [3, 1, 2, 3] -> 3
Then we divide by that value: [3, 1, 2, 3] -> [1, 0.33, 0.66, 1.0]
And now we multiple by the range of the target data type I3, which is 4: [1, 0.33, 0.66, 1.0] -> [4.0, 1.33, 2.66, 4.0]
Now we round to the nearest value: [4.0, 1.33, 2.66, 4.0] -> [4, 0, 2, 4]

We now converted [3, 1, 2, 4] in I5 to [4, 0, 2, 4] in I3. To dequantize, we reverse this process.

Divide by 4: [4, 0, 2, 4] -> [1.0, 0.0, 0.5, 1.0]
Multiply by the absolute maximum: [1.0, 0.0, 0.5, 1.0] -> [3.0, 0.0, 1.5, 3.0]
Now we round again: [3.0, 0.0, 1.5, 3.0] -> [3, 0, 2, 3]

We see that our dequantization and quantization led to one error:
[3, 1, 2, 3] to [3, 0, 2, 3]
The second element changed from 1 to 0. This is a quantization error that leads to the loss of information in terms of how precise the information is encoded. If we have such errors and propagate them through many layers of a neural network, they accumulate, and they may change the result of a prediction and degrade the prediction quality.

How to make quantization methods more precise

Quantization can be enhanced in two ways. Use a better data type, or use more normalization constants (absolute maximum).

Regarding data types, Int8 is a terrible data type for deep learning. That is why I developed new data types in my research. However, currently, GPUs do not support other than Int8 data types on the hardware level, and as such, we are out of luck and need to use Int8.

The only way to improve quantization is through more normalization constants. A normalization constant squishes the input distribution, for example, I5, into the target distribution, for example, I3. We can increase precision, by squishing each vector only as much as is needed. For example, if you have the two vectors:

[3, 1, 2, 3]
[0, 2, 2, 0]

Then you can squish the first by 4 and the second by 2. This will give you twice the precision to quantize the second vector because the inputs are now spread over a broader range of the I3 data type. In fact, the second vector can be quantized without errors if you use an additional absolute maximum value. If you use only a single constant over both vectors (tensor-wise constants), then you will have two errors.

Vector-wise quantization

So now that we know how to make quantization more precise, how do we achieve maximum precision for matrix multiplication?

The key is this: If we use different normalization constants for dependent vectors, we then need to recover this information in the dequantization step. For example, if we subtract a constant to center one distribution over another: (A-minA)(B-minB) then to dequantize the output in A*B=C we need to do:

A*B = C

(A-minA)(B-minB) = A*B – A*minB – B*minA + minA*minB = C – A*minB – B*minA + minA*minB

As such, dependent quantization produces additional computation, in this case, a couple of matrix-vector multiplications and additions which can be expensive (if we assume, A and B are matrices).

As such, we look for the most normalization constants we can get that are still independent. What does this look like?

We can see a matrix multiplication as a sequence of independent inner products between row vectors of A and column vectors of B. We can have a separate constant for each of these vectors. Denormalization happens by multiplying these two constants together for a particular element. No other computation is needed. This is vector-wise quantization. More details in the paper.

Mixed precision decomposition

Before we come to the emergent magnitude features, let me explain the last part of our method that is absolutely critical to achieving zero-degradation quantization at scales of up to 175B parameters.

So it turns out, that transformers have these emergent features that have very large values. They occur in particular hidden dimensions and are active in up to 75% of all sequence dimensions. They occur in all layers (well most layers, but we come to that). So if you have a transformer hidden state X of dimensionality [batch, sequence, hidden], then X[:, :, i] for some i have values that look like this:

[-60.. -45, -51, -35, -20, -67]

Whereas 99.9% of dimensions look like this (normally distributed with one outlier)

[-0.10, -0.23, 0.08, -0.38, -0.28, -0.29, -2.11, 0.34, -0.53, -67.0]

If we quantize and dequantize a row without an outlier, we get this:

[-0.10, -0.23, 0.08, -0.38, -0.28, -0.28, -2.11, 0.33, -0.53]

only a single error, -0.28 instead of -0.29, at the 0.01 precision level. However, if we quantize the same vector with the outlier, we get this:

[ -0.00, -0.00, 0.00, -0.53, -0.53, -0.53, -2.11, 0.53, -0.53, -67.00]

In other words, even if we use vector-wise quantization, we squish a lot of information to zero and have large errors. On average, vectors without outliers have a mean error of 0.015. This vector has an error of 0.12. Do this for a couple of layers, and we remove all information and end up with pure noise.

The problem is that at a scale of 6.7B parameters and above, 75% of hidden state sequences are affected. So this absolutely wrecks quantization.

The good news is that these outliers are highly systematic. While you have 150,000 outliers per sequence in a 6.7B transformer, they only occur in 6 feature dimensions (6 different indices “i” as in X[:, :, i]).

As such, we can separate these emergent features into a separate, high precision matrix multiplication, quantize the other 99.9% of values to Int8, can combine the output of both matrix multiplications. This avoids the information squishing to zero effect, and we can recover full transformer performance.

Results

The results show that this method works well. We can recover full performance by using the LLM.int8() quantization procedure. You can clearly see that there is a big dip in performance for the 8-bit baseline, which is vector-wise quantization. We need both vector-wise quantization and mixed precision decomposition, that is, the full LLM.int8() method to recover full performance. Either of these methods alone is not sufficient.

Emergent Features

There are a lot of exciting findings in the paper:

Emergence is not sudden but gradual and grows according to an exponential function related to perplexity and not model size.
Outlier features grow very quickly once their phase shift occurs.
The number of outliers features is strictly proportional to perplexity.

Many other findings did not make it into the paper because these were too difficult to verify robustly, but I wanted to share them here anyway. Since these results are less robust, take them with a grain of salt.

But I am jumping ahead: What is the emergence, and what makes an emergent feature? If I put it in my own words, I would say:

Emergence is a gradual change in a property that suddenly undergoes a phase shift and then changes the quality of its substrate.

Let’s think step-by-step.

Substrate: Transformer
Property: Very large features in particular hidden dimensions across the transformer
Gradual change: Decreasing perplexity, more and larger outlier features
Phase shift: Outlier features suddenly become available in all transformer layers and coordinate through a few hidden dimensions.
Change of quality: Highly sparse, almost discrete attention; very dense FFN layers; “dual attention”; long-range attention (?); stable training through increased numerical stability.

Some additional terms. What is a feature?

If you have hidden state X that is passed along a transformer with dimensionality [batch, sequence, hidden], then a feature is a particular dimension X[:, :, i], which offers some weak explanation for the label.

Emergent Features in a Nutshell

To get a little sense of what this is all about, here is a short explanation encapsulating everything important about emergent features.

The most intuitive explanation of feature outliers is that transformers have two processing streams. One stream learns features that explain the inputs, and the other stream learns features that remove other features. Removing noisy, context-irrelevant features is the key to making accurate predictions. The more noisy, context-irrelevant features you remove in early layers, the less conflicting high-level features you have in later layers.

For example, if you classify dogs vs. cats, it makes sense to “sharpen” the key features that make these animals different (e.g. cat eyes, cat ears) and remove the similar features (fur color and potentially texture). This is particularly relevant if you have many noisy “weak” features as in natural language processing.

If you take this mechanism to an extreme, you can get discretization, which goes hand-in-hand with context-dependent memory and “reasoning” over elements. Discretization means, you have, say, 100 features, but you decide to remove 99% of them by setting them to zero, and you amplify the rest. The result is a single feature that is now a discrete entity. Once discretized, this entity can be stored and reused later.

To coordinate these streams throughout the transformer, it is useful to dedicate certain hidden dimensions to the functionality of removing other features. That way, if the transformer needs to remove features, it knows beforehand which feature dimension to access to perform that functionality.

How do you remove features? You have a single dimension with very large positive or negative values, and you multiply that dimension with a positive/negative number.

Take the following matrix, which is similar to how emergent features are represented in hidden states.

[0, 1, -60, 4]
[3, 0, -50, -2]
[-1, 0, -55, 1]
[3, 2, -60, 1]

If we want to remove, say, features (columns) 0 and 3 in a matrix multiplication followed by a non-linear function, all we have to do is to multiply everything by a negative number and multiply the outlier feature for columns 0 and 3 by a positive number. If we do this with negative and positive 1s, it looks like this:

[-1, -1, -1, -1]
[-1, -1, -1, -1]
[ 1, -1, -1, 1]
[-1, -1, -1, -1]

We receive the following after a softmax:

[0, 0.5, 0.5, 0]
[0, 0.5, 0.5, 0]
[0, 0.5, 0.5, 0]
[0, 0.5, 0.5, 0]

The neat thing about this system is, that if you always maintain the outlier feature in dimension 3, you know beforehand where to insert a positive number to remove a feature (row 3 of the other matrix).

Transformers seem to coordinate these dimensions throughout all layers except the attention function and the second feedforward network where these outliers are “consumed” to remove features.

This means that transformers always use a certain dimension for these outliers, and each layer “knows” beforehand how to remove a feature because these feature dimensions always have very large values with a particular sign (some are negative, some are positive).

However, this full “coordination” through a single dimension only happens after the phase shift. Before the phase shift, in transformers with less than 6.7B parameters some layers disagree which dimension to use for these large features.

How Emergent Features Emerge

Emergent outlier features are present in even very small transformers (125M parameters), and they do start out in the attention projection layers (key/query/value). Feature outliers are “consumed” in the attention function (softmax) and the second fully connected sublayer (contraction layer). The outlier features are likely consumed in these layers since the second feedforward network (FFN) sub-layer, and the softmax have non-linear functions that can easily squash features to zero.

Once you scale transformers a bit more (350M to 1.3B), outliers also occur in the FFN and attention output layers. At this scale, some successive attention layers and FFN layers use the same dimension to coordinate what features to remove. This has synergy. The attention layer is good at context-dependent selection and pattern matching, while the FFN layers are good at globally, context-independent pattern matching.

At this scale, however, outliers are still probabilistic. This means they occur mostly in some dimensions, but these dimensions can change slightly from mini-batch to mini-batch and between layer and layer. At this scale, layers have not yet learned to coordinate outlier features through the same dimension. This makes it more difficult to remove unwanted features.

At the 2.7B to 6B scale, things become much more coordinated. Now 60% of layers agree on which outlier dimension to use.

The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:

Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary.
Attention layers become very sparse. The attention is very concentrated so that just a few sequence dimensions determine the top probability and the overall probability mass. Almost all sequence dimensions have zero probability. However, this is still context-dependent, and the transformer seems to be “unsure” what to attend to for some sequences.
FFN layers become more “dense”. While in computer vision, you can prune about 95% of weights without severe performance degradation, that number is 30% for transformers trained on NLP data. After emergence, this number shrinks to well below 5%. It seems that canceling out features can remove noise that is generated from the many weak features that are activated. Because these are silenced now, each set of neurons can learn much more features that are almost independent of each other due to the masking of context-dependent features.
Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.

The Most Important Take-aways for Your Research

You may say, “This is all good and well, Tim, but what does this mean for me and my research?” Good question! I think it changes quite a bit.

There are two types of transformers and you should not generalize from one to the other.

From these findings it is clear that transformer after the phase shift at 6.7B parameters behave very different to transformers before the phase shift. As such, one should not try to generalize from <6.7B transformers to beyond 6.7B parameters.

But training and using 6.7B transformers can be pretty painful. At Facebook AI research, I had a 1.3B parameter baseline and I would usually run 2-3 of those models on 128 GPUs each for a total of 384 GPUs. Despite these massive resources it would still feel “slow” in that my research progress was mostly hindered by compute. I imagine to train 6.7B models on 8 GPUs or even 32 GPUs must be super painful. Is there a way that we can avoid this?

I think another key finding from the paper can help. We found that emergence of features occurs smoothly according to an exponential distribution of decreasing perplexity. As such, one could do the following.

We train multiple smaller models, say, 125M, 350M and 1.3B parameters, and then we measure the emergent property in those models and relate it to the property that we are interested in analyzing, for example, a new architecture or a new from of interpreting models. Once we gathered this data, we can measure how the change in the emergent property changes the results of our new method. With that, we might be able to determine if our new method generalizes to models beyond 6.7B parameters.

While, by definition, the phase shift leads to a stark change in behavior, this method of extrapolating emergent behavior might yield more robust predictions for your research. It would be effortful and complicated to do this, but this is better than “wishful thinking” research that does not generalize.

We might be able to find new emergent properties by studying “scaling laws of emergence”.

The finding that emergence can be measured in small models means that new emergent properties that require models larger than 175B parameters might be already measurable in the open-source OPT models.

If we can correlate statistics of a property with increasing capabilities and if this property follows a function that will eventually, “threshold”, we might have discovered a new emergent property that leads to new capabilities.

Conclusion

In this blog post, I introduced LLM.int8() and gave an introduction into the emergent features that we discovered in language model at scale. I discussed the implication of these emergent features in particular how it relates to generalization.

The post LLM.int8() and Emergent Features appeared first on Tim Dettmers.

Sparse Networks from Scratch: Faster Training without Losing Performance

Tim Dettmers — Thu, 11 Jul 2019 13:07:26 +0000

This blog post is about my work, Sparse Networks from Scratch: Faster Training without Losing Performance, with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.

Why Sparse Learning?

A significant driver of progress in deep learning has been advances in computational resources. From 2010 to 2018 we saw an increase of 9700% in computational GPU performance. However, we can expect increases of just little more than 80% GPU performance in the next 5-8 years due to reaching the physical limits of semiconductor technology. What does a research world look like where we cannot make further improvements in computational power?

A glimpse of this comes from the natural language processing (NLP) community where pretrained language models like ELMO, GPT, BERT, GPT-2, Grover, and XL-Net dominate the entire field by outperforming other methods on most NLP tasks. These models are often rather simple: You train them on lots of documents, and the task is mainly to predict a word given a sequence of other words — a bit like doing a fill-in-the-blank puzzle. The catch? These models are so big that they take well in excess of 100 GPUs hours to train. This is particularly frustrating for academic researchers that want to understand these models but are unable to do so because they lack the computational resources that big companies have. To truly understand these massive pretrained language models, a primary goal should be to democratize the training of these models by developing more resourceful training procedures.

One way to achieve this is to look at the human brain for inspiration. The human brain consumes 1/10th of the energy of a GPU but is 10^9 times more powerful. What makes the brain so computational efficient? There are many reasons, but one reason is sparsity.

It has been found that the more neurons a primate brain has the fewer connections does the average neuron make with all other neurons (Herculano-Houzel et al., 2010). This is very much contrary to how we design deep neural networks, which is to connect every new neuron in a layer with all neurons in the previous layer. We already understand how to compress a fully trained dense network to a sparse network (Han et al., 2015), but there has been little work on how to do this successfully if one starts from a sparse network which we keep sparse during training. How do we do this?

Sparse Momentum: An Efficient Way to Train Sparse Networks

This section explains the sparse momentum algorithm from intutiton up to the full algorithm.

" data-image-caption="

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=300%2C145&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?fit=1024%2C493&ssl=1" class="wp-image-779 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1096%2C528&ssl=1" alt="Sparse Momentum determines where to grow new weights in a sparse network by looking at the weighted average of recent gradients (momentum) to find weights and layers which reduce the error consistently. (1) We determine the importance of each layer according to the mean momentum magnitude. (2) For each layer, we remove the 50\% of the smallest weights. (3) We then redistribute the weights across layers according to layer importance. Within a layer we grow weights where the momentum magnitude is large." width="1096" height="528" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?w=1096&ssl=1 1096w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=300%2C145&ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=768%2C370&ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2019/07/sparse_momentum.png?resize=1024%2C493&ssl=1 1024w" sizes="(max-width: 1000px) 100vw, 1000px" data-recalc-dims="1" />

Figure 1: Sparse Momentum determines where to grow new weights in a sparse network by looking at the weighted average of recent gradients (momentum) to find weights and layers which reduce the error consistently. (1) We determine the importance of each layer according to the mean momentum magnitude. (2) For each layer, we remove the 50% of the smallest weights. (3) We then redistribute the weights across layers according to layer importance. Within a layer we grow weights where the momentum magnitude is large.

What is the Main Quality of Good Sparse Learning Algorithms?

In sparse learning, the most important thing is to use every single weight in a neural network as effectively as possible. If you define “effectiveness” as “reducing the error,” then we have an obvious perspective on how we can proceed. We need to find a measure which describes how effective a weight is at reducing the error and remove all weights which do not. Once we removed weights, we want to regrow new weights in locations which we think are promising at reducing the error in the future.

If we look at the gradient of the error with respect to the weight, we actually have precisely such a measure. However, if we look at successive gradients, we find that gradients can wildly oscillate. For example if you have a neural network which classifies handwritten digits 0 to 9 then a weight might be good at detecting a straight line at the top and it might help to reduce the error for the numbers 5, 7 but then it might not help or even be detrimental for numbers 0, 1, 2, 3, 6, 8, 9. Instead, a weight which detects a curvy pattern in the top right might help for 0, 2, 3, 8, 9 and as such we would expect that this weight reduces the error more consistently over time than the “straight line at the top” weight. How can we detect such promising weights in a neural network automatically?

Momentum: Finding Weights that Reduce the Error Consistently

If you take the north pole to be a local minimum and a compass needle the gradient towards the local minimum, then you can simulate stochastic gradient descent updates by shaking the compass wildly to spin the compass needle. With every time the needle passes the north pole it will slow down and line-up more and more with the north pole, however, due to the spin it will still “overshoot” that direction. So it might be unclear where the north pole is from two or three measurements while the needle is still moving back and forth. However, if you take the average directions — one time the needle is a bit to the left of the north pole, another time it is more to the right — then these deviations cancel out, and you will immediately get a direction which is very close to the real north pole.

This is the main idea behind the momentum optimization technique: We average successive gradients to get a better estimate of the direction of the local minimum. Similarly to the compass needle, which gets more and more accurate over time as it slows down, we want to weight more recent gradient directions in stochastic gradient descent more highly. One way to do this is to assign a weighted average where we assign a much larger weight to the current gradient and a small weight to the previous gradients — this is called exponential smoothing. Through exponential smoothing the gradients of the weight we receive a weighted gradient matrix — this matrix is the momentum matrix which gives momentum optimization its name. With this measure, we can identify which are the weights which reduce the error consistently.

Redistributing Weights: The Mean Momentum Magnitude of a Layer

From here we make the first important observation for our sparse momentum algorithm: If the momentum of a weight indicates how much it reduces the error consistently, then the mean momentum magnitude of all the weights in a layer should indicate how much each layer is reducing the error on average. We take the magnitude because two different weights might consistently go into a negative direction and a positive direction. By taking the mean momentum magnitude of layers, we can easily compare how effective the average weight in each layer is. This enables to say, for example, that a weight in a convolutional layer A is on average 1/3 as effective at reducing the error as the average weight in fully connected layer B, or vice versa. This method enables us to redistribute weights effectively: if we find “useless” weights, we now know precisely in which layer to put it — but where to put them exactly within a layer?

Which Weights Should be Removed? Where to Regrow them?

The next two problems are more straightforward: Which are the most useless weights? Where do we grow weights within a layer? The first problem is a common problem in neural network compression research, where one often prunes the weights with the smallest magnitude. Why does this make sense? If we assume all weights receive on average inputs of similar magnitude — a reasonable assumption if one uses batch normalization — then weights with small magnitudes make the smallest difference in the activation of a neuron. As such, removing them should change the predictive performance of our networks by the smallest amount.

Once we removed weights and redistributed them to weight-effective layers as measured by the mean momentum magnitude of a layer, we need to decide where exactly to grow them within a layer. One possible solution becomes apparent if we ask: “Which two unconnected neurons would reduce the error consistently if we connect them?” The answer to this question would again point to the momentum magnitude. This time, however, we want to look at the momentum magnitude of “missing” or zero-valued weights, that is, we want to look at those weights which have been excluded from training before. Thus we grow weights in locations where missing weights have the largest momentum magnitude. This completes the sparse momentum algorithm, which depicted in Figure 1.

Results

The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repetitively while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks by using 20% of weights (80% sparsity). On CIFAR-10, we compare against Single-shot Network Pruning which is designed for simplicity and not performance — so it is not surprising that sparse momentum does better. However, what is interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (16 layers deep and very wide WRN) to dense performance levels with just 5% of weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm, we would be able to train a variety of networks to yield the same performance levels while training between 3.0-5.6x faster!

Sparse Momentum results compared to neural network compression methods on MNIST for LeNet-300-100 and LeNet-5 Caffe.

ImageNet results for Sparse momentum and related methods. For the models that are not fully sparse, the first convolution and all downsample residual connections are dense from the start of training. In the fully sparse setting, all layers are sparse. Sparse momentum works better than other methods and works almost equally well if all the weights are sparse. This indicates that sparse momentum is efficient at finding important layers which require a high density.

On ImageNet, we are not able to reach dense performance levels, which indicates that there is room to improve sparse momentum. However, we can demonstrate that sparse momentum has a clear lead compared to other methods that maintain sparse weights throughout training.

Speedups

The main promise of sparse learning was to accelerate training — were we successful? Yes — and no. Sparse momentum accelerates training efficiently if we measure possible speedups for sparse convolution, but since sparse networks were only very recently used for training, no optimized sparse convolution algorithms exist for the GPU — at least not for the fine-grained sparse patterns of weights as exhibited by sparse momentum.

As such, we divide the speedups into two groups: Possible speedups which could be achieved if sparse convolution algorithms would exist, and speedups which we can achieve today with standard dense convolutional algorithms. How can dense convolutions help for sparse networks?

If we look at the sparsity pattern of our network, we have the case where a convolutional channel is entirely empty — a convolutional filter full of zeros! If this happens, we can remove the channel from the computation without changing the results of the convolution and thus gain speedups.

Sparse momentum can replicate dense performance levels for a range of networks with a fraction of the weights thus leading to speedups.

However, if we look at the speedups, we see there is a marked difference between sparse convolution and dense convolution speedups. This clearly shows the need for optimized sparse convolution algorithms for the GPU.

Why does Sparse Learning Work?

Some of our sparse networks trained with sparse momentum matched the performance levels of dense networks with just 5% of weights. What makes these 5% of weights so efficient that they can match a neural network with 20 times as many weights?

To look into this question, we looked at how the features of sparse networks compare to dense networks. Low-level features might include things like edge detectors. Mid-level features might be things like wheels, noses, eyes, paws. High-level features might be the “face” of a car, a cat face, a fridge door, and so forth.

To reduce features to numbers we look at convolutional channels — the equivalent to a “neuron” in a convolutional network — and how useful the channel is to classes in the dataset. Edge detectors should be useful to almost all classes in the dataset — in other words, they should have a low level of class-specialization. Mid-level features like eyes should be useful to—some classes such as cats, dogs, and humans. High-level features should be useful to a few selected classes — they are highly class-specialized.

Figure 6: Class-specialization histograms for sparse and dense networks for AlexNet, VGG16 and WRN 28-2.

What we find is that on average, sparse networks learn features which are useful to a broader range of classes — they learn more general features. This might be a possible explanation of why sparse networks can match the performance of dense networks with as few as 5% weights.

The Future of Sparse Learning

I believe sparse learning has a very bright future because (1) GPUs will stagnate in performance over the next years, (2) specialized processors for sparse workloads, Graphcore processors, are around the corner. Graphcore processors store an entire network in its 300 MB cache and accelerate it by a factor of roughly 100x. This means, if we can compress a network to 300 MB during training, then we will have 100x faster training overall. Training a ResNet-50 on ImageNet would then take only roughly 15 minutes using one Graphcore processor. With sparse learning, the 300 MB limit will be in reach without a problem.

My prediction is that the first research team that can train a sparse neural network on a Graphcore processor successfully will unlock an entirely new level of artificial intelligence.

Besides this, another challenge is to apply sparse learning algorithms to natural language processing (NLP). Unsurprisingly, my experimentation on transformers for natural language processing tasks show that sparse learning is much more difficult in NLP compared to computer vision — lots of work to do!

Try Sparse Momentum with Your Own Model in 10 Lines of Code!

Figure 7: Example of a generic sparse learning script which you can use for your own model. With my sparselearning library it is easy to use sparse momentum: (1) Import the library, (2) add the parser options, (3) wrap your model with the Masking class, (4) apply mask instead of optimizer, (5) apply sparse momentum at the end of epoch. The library is also easily extendable with your own sparse learning algorithms for growth, pruning, or redistribution — all it takes is a few lines of code!

To make sparse learning accessible to everyone I developed a sparse learning library which allows the easy application of existing algorithms like sparse momentum to your own models — it can be done in less than 10 lines of code. The library is also designed to make it very easy to add your own sparse learning methods. You find my sparse learning library on GitHub.

Questions?

For questions, I prefer if you post them below if they are simple and straightforward. If you have a more formal question regarding our work that requires careful answers, you can post an the question as a GitHub issue — I will try to answer as timely as possible.

Acknowledgements

I thank Luke Zettlemoyer for feedback on an early draft of this blog post.

References

Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019.

Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages
1135—1143.

Herculano-Houzel, S., Mota, B., Wong, P., and Kaas, J.H. (2010). Connectivity-driven white matter scaling and folding in primate cerebral cortex. In Proceedings of the National Academy of Sciences of the United States of America, 107 44:19008—13.

The post Sparse Networks from Scratch: Faster Training without Losing Performance appeared first on Tim Dettmers.

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near

Tim Dettmers — Mon, 27 Jul 2015 10:20:05 +0000

In this blog post I will delve into the brain and explain its basic information processing machinery and compare it to deep learning. I do this by moving step-by-step along with the brains electrochemical and biological information processing pipeline and relating it directly to the architecture of convolutional nets. Thereby we will see that a neuron and a convolutional net are very similar information processing machines. While performing this comparison, I will also discuss the computational complexity of these processes and thus derive an estimate for the brains overall computational power. I will use these estimates, along with knowledge from high performance computing, to show that it is unlikely that there will be a technological singularity in this century.

This blog post is complex as it arcs over multiple topics in order to unify them into a coherent framework of thought. I have tried to make this article as readable as possible, but I might have not succeeded in all places. Thus, if you find yourself in an unclear passage it might become clearer a few paragraphs down the road where I pick up the thought again and integrate it with another discipline.

First I will give a brief overview about the predictions for a technological singularity and topics which are aligned with that. Then I will start the integration of ideas between the brain and deep learning. I finish with discussing high performance computing and how this all relates to predictions about a technological singularity.

The part which compares the brains information processing steps to deep learning is self-contained, and readers which are not interested in predictions for a technological singularity may skip to this part.

Part I: Evaluating current predictions of a technological singularity

There were a lot of headlines recently about predictions that artificial intelligence will reach super-human intelligence as early as 2030 and that this might herald the beginning of human extinction, or at least dramatically altering everyday life. How was this prediction made?

Factors which help to predict a singularity

Ray Kurzweil has made many very accurate predictions and his methods to reach these predictions are quite simple for computing devices: Look at the exponential growth of computing power, efficiency, and size, and then extrapolate. This way, you could easily predict the emergence of small computers which fit into your hands and with a bit of creativity, one could imagine that one day there would be tablets and smartphones. The trends were there, you just needed to imagine what could be done with computers which you can hold in your hand.

Similarly, Ray Kurzweil predicted the emergence of strong AI which is as intelligent or more intelligent than humans. For this prediction he also used data for the exponential growth of computing power and compared this to an estimate for the computational power of the brain.

He also acknowledges that the software will be as important as the hardware, and that the software development of strong AI will take longer because such software can only be developed once fast computer systems are available. This can be felt in the area of deep learning, where solid ideas of the 1990s were unfeasible due to the slow computers. Once graphic processing units (GPUs) were used, these computing limitations were quickly removed and rapid progress could be made.

However, Kurzweil also stresses that once the hardware level is reached, first “simple” strong AI systems will be developed quickly. He sets the date for brain-like computational power to 2020 and the emergence of strong AI (first human like intelligence or better) to 2030. Why these numbers? With persisting growth in computing power in 2019 we will reach the computing power which is equivalent to the human brain — or will we?

This estimate is based on two things: (1) The estimate for the complexity of the brain, (2) the estimate for the growth in computing power. As we will see, both these estimates are not up-to-date with current technology and knowledge about neuroscience and high performance computing.

Our knowledge of neuroscience doubles about every year. Using this doubling period, in the year of 2005 we would only have possessed about 0.098% of the neuroscience knowledge that we have today. This number is a bit off, because the doubling time was about 2 years in 2005 while it is less than a year now, but overall it is way below 1 %.

The thing is that Ray Kurzweil based his predictions on the neuroscience of 2005 and never updated them. An estimate for the brains computational power based on 1% of the neuroscience knowledge does not seem right. Here is small list of a few important discoveries made in the last two years which increase the computing power of the brain by many orders of magnitude:

It was shown that brain connections rather than being passive cables, can themselves process information and alter the behavior of neurons in meaningful ways, e.g. brain connections help you to see the objects in everyday life. This fact alone increases brain computational complexity by several orders of magnitude
Neurons which do not fire still learn: There is much more going on than electrical spikes in neurons and brain connections: Proteins, which are the little biological machines which make everything in your body work, combined with local electric potential do a lot of information processing on their own — no activation of the neuron required
Neurons change their genome dynamically to produce the right proteins to handle everyday information processing tasks. Brain: “Oh you are reading a blog. Wait a second, I just upregulate this reading-gene to help you understand the content of the blog better.” (This is an exaggeration — but it is not too far off)

Before we look at the complexity of the brain, let us first look at brain simulations. Brain simulations are often used to predict human-like intelligence. If we can simulate a human brain, then it will not be long until we are able to develop human-like intelligence, right? So the next paragraph looks at this reasoning. Can brain simulations really provide reliable evidence for predicting the emergence of artificial intelligence?

The problems with brain simulations

Brain simulations simulate the electrical signals which are emitted by neurons and the size of the connections between neurons. A brain simulation starts with random signals and the whole system stabilizes according to rules which are thought to govern information processing steps in the brain. After running these rules for some time, stable signals may form which can be compared to the signals of the brain. If the signals of the simulation are similar to recordings of the brain, this increases our confidence that our chosen rules are somewhat similar to the rules that the brain uses. Thus we can validate large scale information processing rules in the brain. However, the big problem with brain simulations is, that this is pretty much all we can do.

We do not gain any understanding what these signals mean or what function they could possess. We cannot test any meaningful hypotheses with this brain model other than the vague “our rules produce similar activity”. The lack of precise hypotheses which make accurate predictions (“If the activity is like this, then the circuit detected an apple instead of an orange”) is one of the loudest criticism of the European brain simulation project. The brain project is regarded as rather useless by many neuroscientists and even dangerous, because it sucks away money for useful neuroscience projects which actually shed light on neural information processing.

Another problem is that these brain simulations rely on models which are outdated, incomplete and which dismiss many biological parts in neurological information processing. This is mainly so, because the electrical information processing in the brain is much better understood. Another more conveniently reason is, that current models are already able to reproduce the needed output patterns (which is the main goal after all) and so there is no need to update these models to be more brain-like.

So to summarize, the problems with brain simulations are:

Not possible to test specific scientific hypotheses (compare this to the large hadron collider project with its perfectly defined hypotheses)
Does not simulate real brain processing (no firing connections, no biological interactions)
Does not give any insight into the functionality of brain processing (the meaning of the simulated activity is not assessed)

The last point is the most important argument against the usefulness of brain processing for strong-AI estimation. If we could develop a brain simulation of the visual system, which would do well on say, the MNIST and ImageNet data sets, this would be useful to estimate progress in brain-like AI. But without this, or any similar observable function, brain simulations remain rather useless with respect to AI.

With this said, brain simulations are still valuable to test hypothesized general rules of information processing in the brain —we have nothing better for this — but they are quite useless to make sense of what the information processing in the brain means, and thus constitute unreliable evidence for predicting the progress in AI. Anything that relies on brain simulation as evidence for predictions of future strong-AI should be looked at with great skepticism.

Estimating the brains computational complexity

As mentioned in the introduction, the estimates of the brain’s complexity are a decade old and many new discoveries made this old estimate obsolete. I never came across an estimate which is up to date, so here I derive my own estimate. While doing this, I will focus mostly on the electrochemical information processing and neglect the biological interactions within the neuron, because they are too complex (and this blog post is already very long). Therefore the estimate that is derived here can be thought of as a lower bound of complexity — it should always be assumed that the brain is more complex than this.

During the construction of this model of complexity, I will also relate every step in the model with its deep learning equivalents. This will give you a better understanding of how close deep learning is related to and how fast deep learning really is compared to the human brain

Defining reference numbers for the model

We know some facts and estimates which help us to start with our model building:

The brain uses learning algorithms which are very different from deep learning, but the architecture of neurons is similar to convolutional nets
The adult brain has 86 billion neurons, about 10 trillion synapse, and about 300 billion dendrites (tree-like structures with synapses on them)
The brain of a child has far more than 100 billion neurons, and has synapses and dendrites in excess of 15 trillion and 150 billion, respectively
The brain of a fetus has more than a trillion neurons; neurons which are misplaced die quickly (this is also the reason why adults have fewer neurons than children)

Location of the cerebellum which contains roughly 3/4 of all neurons and connections. Image source: 1

Location of the cerebrum; also referred to as “the cortex”. More precisely, the cortex is the outer layer of the brain, which contains most neurons of the cerebrum. Image source: 1

The cerebellum, the super computer of the brain, contains roughly ¾ of all neurons (this ratio is consistent in most mammal species)
The cerebrum, the main driver of “intelligence”, contains roughly ¼ of all neurons
An average neuron in the cerebellum has about 25000 synapses
An average neuron in the cerebrum has about 5000-15000 synapses

The number of neurons is well known; the number of synapses and dendrites is only known within a certain boundary and I chose conservative estimates here.

The average synapses per neuron differ wildly between neurons, and the number here is a rough average. It is known that most synapses in the cerebellum are made between dendrites of Purkinje neurons and two different types of neurons that make connections that “climb up” or “cross parallel” with the Purkinje’s synapses. It is known that Purkinje cells have about 100000 synapses each. Because these cells have by far the largest weight in the cerebellum, one can estimate the complexity of the brain best if one looks at these neurons and at the interactions that they make.

There are many hundreds of different types of neurons; here some of the more common neurons. Thanks to Robert Stufflebeam for this image (source).

It is important to differentiate between the complexity of a brain region and its functional importance. While almost all computation is carried out by the cerebellum, almost all important functions are carried out by the cerebrum (or cortex). The cortex uses the cerebellum to generate predictions, corrections and conclusions, but the cortex accumulates these insights and acts upon them.

For the cerebrum it is known that neurons almost never have more than 50000 synapses, and unlike the cerebellum, most neurons have a number of synapses within the range of 5000-15000.

How do we use these numbers?

A common approach for estimating the computational complexity of the brain is to assume all information processing in the brain can be represented by the combination of impulses when a neuron fires (action potentials) and the size (mostly number of receptors) of the synapses that each neuron has. Thus one can multiply the estimates for the number of neurons and their synapses and add everything together. Then one multiplies this by the rate of fire for the average neurons which is about 200 action potentials per second. This model is what Ray Kurzweil uses to create his estimate. While this model was okay a few decades ago, it is not suitable to model the brain from a modern view point, as it leaves out much of the important neurological information processing which is so much more than mere firing neurons.

A model which approximates the behavior of neurons more accurately is the extended linear-nonlinear-Poisson cascade model (LNP). The extended LNP model is currently viewed as an accurate model of how neurons process information. However, the extended LNP model still leaves out some fine details, which are deemed unimportant to model large scale brain function. Indeed adding these fine details to the model will add almost no additional computational complexity, but makes the model more complex to understand — thus including these details in simulations would violate the scientific method which seeks to find the simplest models for a given theory. However, this extended model is actually very similar to deep learning and thus I will include these details here.

There are other good models that are also suitable for this. The primary reason why I chose the LNP model is that it is very close to deep learning. This makes this model perfect to compare the architecture of a neuron to the architecture of a convolutional net. I will do this in the next section and at the same time I will derive an estimate for the complexity of the brain.

Part II: The brain vs. deep learning — a comparative analysis

Now I will explain step by step how the brain processes information. I will mention the steps of information processing which are well understood and which are supported by reliable evidence. On top of these steps, there are many intermediary steps at the biological level (proteins and genes) which are still poorly understood but known to be very important for information processing. I will not go into depth into these biological processes but provide a short outline, which might help the knowledge hungry readers to delve into these depths themselves. We now begin this journey from the neurotransmitters released from a firing neuron and walk along all its processes until we reach the point where the next neuron releases its neurotransmitters, so that we return to where we started.

The next section introduces a couple of new terms which are necessary to follow the rest of the blog post, so read it carefully if you are not familiar with basic neurobiology.

Image sources: 1,2,3,4

Neurons use the axon — a tube like structure— to transmit their electric signals over long stretches in the brain. When a neuron fires, it fires an action potential — an electrical signal— down its axon which branches into a tree of small endings, called axon terminals. On the ending of each of these axon terminals sit some proteins which convert this electrical message back into a chemical one: Small balls — called synaptic vesicles — filled with a couple of neurotransmitters each are released into an area outside of the neuron, called synaptic cleft. This area separates the axon terminal from the beginning of the next neuron (a synapse) and allows the neurotransmitter to move freely to pursue different tasks.

The synapses are most commonly located at a structure which looks very much like the roots of a tree or plant; this is the dendritic tree composed of dendrites which branch into larger arms (this represents the connections between neurons in a neural network), which finally reach the core of the cell, which is called soma. These dendrites hold almost all synapses which connect one neuron to the next and thus form the principal connections. A synapse may hold hundreds of receptors to which neurotransmitter can bind themselves.

You can imagine this compound of axon terminal and synapses at a dendrite as the (dense) input layer (of an image if you will) into a convolutional net. Each neuron may have less than 5 dendrites or as many as a few hundred thousand. Later we will see that the function of the dendritic tree is similar to the combination of a convolutional layer followed by max-pooling in a convolutional network.

Going back to the biological process, the synaptic vesicles merge with the surface of the axon terminal and turn themselves inside-out spilling their neurotransmitters into the synaptic cleft. There the neurotransmitters drift in a vibrating motion due to the temperature in the environment, until they (1) find a fitting lock (receptor protein) which fits their key (the neurotransmitter), (2) the neurotransmitters encounter a protein which disintegrates them, or (3) the neurotransmitters encounter a protein which pulls them back into the axon (reuptake) where they are reused. Antidepressants mostly work by (3) preventing, or (4) enhancing the reuptake of the neurotransmitter serotonin; (3) preventing reuptake will yield changes in information processing after some days or weeks, while (4) enhancing reuptake leads to changes within seconds or minutes. So neurotransmitter reuptake mechanisms are integral for minute to minute information processing. Reuptake is ignored in the LNP model.

However, the combination of the amount of neurotransmitters released, the number of synapses for a given neurotransmitter, and how many neurotransmitters actually make it into a fitting protein on the synapse can be thought of as the weight parameter in a densely (fully) connected layer of a neural network, or in other words, the total input to a neuron is the sum of all axon-terminal-neurotransmitter-synapse interactions. Mathematically, we can model this as the dot product between two matrices (A dot B; [amount of neurotransmitters of all inputs] dot [amount of fitting proteins on all synapses]).

After a neurotransmitter has locked onto a fitting protein on a synapse, it can do a lot of different things: Most commonly, neurotransmitters will just (1) open up channels, to let charged particles flow (through diffusion) into the dendrites, but it can also cause a rarer effect with huge consequences: The neurotransmitter (2) binds to a G-protein which then produces a protein signaling cascade which, (2a) activates (upregulates) a gene which is then used to produce a new protein which is integrated into either the surface of the neuron, its dendrites, and/or its synapses; which (2b) alerts existing proteins to do a certain function at a specific site (create or remove more synapses, unblock some entrances, attach new proteins to the surface of the synapse). This is ignored in the NLP model.

Once the channels are open, negatively or positively charged particles enter into the dendritic spine. A dendritic spine is a small mushroom-like structure on to which the synapse is attached. These dendritic spines can store electric potential and have their own dynamics of information processing. This is ignored in the NLP model.

Dendritic spines have their own internals information processing dynamics which is largely determined by its shape and size. Image source: 1,2

The charge of the particles that may enter the dendritic spine are either negatively or positively charged — some neurotransmitters only open channels for negative particles, others only for positive ones. There are also channels which let positively charged particles leave the neuron, thus increasing the negativity of the electric potential (a neuron “fires” if it becomes too positive). The size and shape of the mushroom-like dendritic spine corresponds to its behavior. This is ignored in the NLP model.

Once particles entered the spine, there are many things they can affect. Most commonly, they will (1) just travel along the dendrites to the cell body in the neuron and then, if the cell gets too positively charged (depolarization) they induce an action potential (the neuron “fires”). But other actions are also common: The charged particles accumulate in the dendritic spine directly and (2) open up voltage-gated channels which may polarize the cell further (this is an example of the dendritic spine information processing mentioned above). Another very important process are (3) dendritic spikes.

Dendritic spikes

Dendritic spikes are a phenomenon which has been known to exist for some years, but only in 2013 the techniques were advanced enough to collect the data to show that these spikes were important for information processing. To measure dendritic spikes, you have to attach some very tiny clamps onto dendrites with the help of a computer which moves the clamp with great precision. To have some sort of idea where your clamp is, you need a special microscope to observe the clamp as you progress onto a dendrite. Even then you mostly attach the clamp in a rather blind matter because at such tiny scale every movement made is a rather giant leap. Only a few teams in the world have the equipment and skill to attach such clamps onto dendrites.

However, the direct data gathered by those few teams was enough to establish dendritic spikes as important information processing events. Due to the introduction of dendritic spikes into computational models of neurons, the complexity of a single neuron has become very similar to a convolutional net with two convolutional layers. As we see later the LNP model also uses non-linearities very similar to a rectified linear function, and also makes use of a spike generator which is very similar to dropout – so a neuron is very much like an entire convolutional net. But more about that later and back to dendritic spikes and what exactly they are.

Dendritic spikes occur when a critical level of depolarization is reached in a dendrite. The depolarization discharges as an electric potential along the walls of the dendrite and may trigger voltage-gated channels along its way through the dendritic tree and eventually, if strong enough, the electric potential reaches the core of the neuron where it may trigger a true action potential. If the dendritic spike fails to trigger an action potential, the opened voltage-gated channels in neighboring dendrites may do exactly that a split second later. Due to channels opened from the dendritic spike more charged particles enter the neuron, which then may either trigger (common) or stifle (rare) a full action potential at the neurons cell body (soma).

A shows a computer model of a neuron that does not model dendritic spikes; B models simple dynamics of dendritic spikes; C models more complex dynamics of dendritic spikes which takes into account the one dimensional diffusion of particles (which is similar to a convolution operation). Take note that these images are only snapshots in a particular moment of time. A big thanks to Berd Kuhn. Image copyright © 2014 Anwar, Roome, Nedelescu, Chen, Kuhn and De Schutter as published in Frontiers in Cellular Neuroscience (Anwar et al. 2014)

This process is very similar to max-pooling, where a single large activation “overwrites” other neighboring values. However, after a dendritic spike, neighboring values are not overwritten like during max-pooling used in deep learning, but the opening of voltage-gated channels greatly amplifies the signals in all neighboring branches within the dendritic tree. Thus a dendritic spike may heighten the electrochemical levels in neighboring dendrites to a level which is more similar to the maximum input — this effect is close to max-pooling.

Indeed it was shown that dendritic spikes in the visual system serve the same purpose as max pooling in convolutional nets for object recognition: In deep learning, max-pooling is used to achieve (limited) rotation, translation, and scale invariance (meaning that our algorithm can detect an object in an image where the object is rotated, moved, or shrunk/enlarged by a few pixels). One can think of this process as setting all surrounding pixels to the same large activation and make each activation share the weight to the next layer (in software the values are discarded for computational efficiency — this is mathematically equivalent). Similarly, it was shown that dendritic spikes in the visual system are sensitive to the orientation of an object. So dendritic spikes do not only have computational similarity, but also similarities in function.

The analogy does not end here. During neural back-propagation — that is when the action potential travels from the cell body back into the dendritic tree — the signal cannot backpropagate into the dendritic branch where the dendritic spike originated because these are “deactivated” due to the recent electrical activity. Thus a clear learning signal is sent to inactivated branches. At first this may seem like the exact opposite from the backpropagation used for max-pooling, where everything but the max-pooling activation is backpropagated. However, the absence of a backpropagation signal in a dendrite is a rare event and represents a learning signal on its own. Thus, dendrites which produce dendritic spikes have special learning signals just like activated units in max-pooling.

To better understand what dendritic spikes are and what they look like, I very much want to encourage you to watch this video (for which I do not have the copyright). The video shows how two dendritic spikes lead to an action potential.

This combination of dendritic spikes and action potentials and the structure of the dendritic tree has been found to be critical for learning and memory in the hippocampus, the main brain region responsible for forming new memories and writing them to our “hard drive” at night.

Dendritic spikes are one of the main drivers of computational complexity which have been left out from past models of the complexity of the brain. Also, these new findings show that neural back-propagation does not have to be neuron-to-neuron in order to learn complex functions; a single neuron already implements a convolutional net and thus has enough computational complexity to model complex phenomena. As such, there is little need for learning rules that span multiple neurons — a single neuron can produce the same outputs we create with our convolutional nets today.

But these findings about dendritic spikes are not the only advance made in our understanding of the information processing steps during this stage of the neural information processing pathway. Genetic manipulation and targeted protein synthesis are sources that increase computational complexity by orders of magnitude, and only recently we made advances which reveal the true extend of biological information processing.

Protein signaling cascades

As I said in the introduction of this part, I will not cover the parts of biological information processing extensively, but I want to give you enough information so that you can start learning more from here.

One thing one has to understand is that a cell looks much different from how it is displayed in text books. Cells crawl with proteins: There are about 10 billion proteins in any given human cell and these proteins are not idle: They combine with other proteins, work on a task, or jitter around to find new tasks to work on.

All the functions described above are the work of proteins. For example the key-and-lock mechanism and the channels that play the gatekeeper for the charged particles that leave and enter the neuron are all proteins. The proteins I mean in this paragraph are not these common proteins, but proteins with special biological functions.

As an example the abundant neurotransmitter glutamate may bind to a NDMA receptor which then opens up its channels for many different kinds of charged particles and after being opened, the channel only closes when the neuron fires. The strength of synapses is highly dependent on this process, where the synapse is adjusted according to the location of the NDMA receptor and the timing of signals which are backpropagated to the synapses. We know this process is critical to learning in the brain, but it is only a small piece in a large puzzle.

The charged particles which may enter the neuron may additionally induce protein signaling cascades own their own. For example the cascade below shows how an activated NMDA receptor (green) lets charged calcium CA2+ inside which triggers a cascade which eventually leads to AMPAR receptors (violet) being trafficked and installed on the synapse.

Image source: 1

It was shown again and again that these special proteins have a great influence on the information processing in neurons, but it is difficult to pick out a specific type of protein from this seemingly chaotic soup of 10 billion proteins and study its precise function. Findings are often complex with a chain of reactions involving many different proteins until a desired end-product or end-function is reached. Often the start and end functions are known but not the exact path which led from one to the other. Sophisticated technology helped greatly to study proteins in detail, and as technology gets better and better we will further our understanding of biological information processing in neurons.

Genetic manipulation

The complexity of biological information processing does not end with protein signaling cascades, the 10 billion proteins are not a random soup of workers that do their tasks, but these workers are designed in specific quantities to serve specific functions that are relevant at the moment. All this is controlled by a tight feedback loop involving helper proteins, DNA, and messenger RNA (mRNA).

If we use programming metaphors to describe this whole process, then the DNA represents the whole github website with all its public packages, and messenger RNA is a big library which features many other smaller libraries with different functions (something like the C++ boost library).

It all begins with a programming problem you want to solve (a biological problem is detected). You use google and stackoverflow to find recommendations for libraries which you can use to solve the problem and soon you find a post that suggests that you use library X to solve problem Y (problem Y is detected on a local level in a cell with known solution of protein X; the protein that detected this defect then cascades into a chain of protein signals which leads to the upregulation of the gene G which can produce protein X; here upregulation is a “Hey! Produce more of this, please!” signal to the nucleus of the cell where the DNA lies). You download the library and compile it (the gene G is copied (transcribed) as a short string of mRNA from the very long string of DNA). You then do configure the install (the mRNA leaves the core) with the respective configuration (the mRNA is translated into a protein, the protein may be adjusted by other proteins after this), and install the library in a global “/lib” directory (the protein folds itself into its correct form after which it is fully functional). After you have installed the library, you import the needed part of the library to your program (the folded protein travels (randomly) to the site where it is needed) and you use certain functions of this library to solve your problem (the protein does some kind of work to solve the problem).

Additional to this, neurons may also dynamically alter their genome, that is they can dynamically change their github repository to add or remove libraries.

To understand this process further, you may want to watch the following video, which shows how HIV produces its proteins and how the virus can change the host DNA to suit its needs. The process described in this video animation is very similar to what is going on in neurons. To make it more similar to the process in neurons, imagine that HIV is a neurotransmitter and that everything contained in the HIV cell is in the neuron in the first place. What you have then is an accurate representation of how neurons make use of theirs genes and proteins:

You may ask, isn’t it so that every cell in your body has (almost) the same DNA in order to be able to replicate itself? Generally, this is true for most cells, but not true for most neurons. Neurons will typically have a genome that is different from the original genome that you were assigned to at birth. Neurons may have additional or fewer chromosomes and have sequences of information removed or added from certain chromosomes.

It was shown, that this behavior is important for information processing and if gone awry, this may contribute to brain disorders like depression or Alzheimer’s disease. Recently it was also shown, that neurons change their genome on a daily basis to improve information processing demands.

So when you sit at your desk for five days, and then on the weekend decide to go on a hike, it makes good sense that the brain adapts its neurons for this new task, because entirely different information processing is needed after this change of environment.

Equally, in an evolutionary sense, it would be beneficial to have different “modes” for hunting/gathering and social activity within the village — and it seems that this function might be for something like this. In general, the biological information processing apparatus is extremely efficient in responding to slower information processing demands that range from minutes to hours.

With respect to deep learning, an equivalent function would be to alter the function of a trained convolutional net in significant but rule-based ways; for example to apply a transformation to all parameters when changing from one to another task (recognition of street numbers -> transform parameters -> recognition of pedestrians).

Nothing of this biological information processing is modeled by the LNP model.

Looking back at all this, it seems rather strange that so many researchers think they that they can replicate the brain’s behavior by concentrating on the electrochemical properties and inter-neuron interactions only. Imagine that every unit in a convolutional network has its own github, from which it learns to dynamically download, compile and use the best libraries to solve a certain task. From all this you can see that a single neuron is probably more complex than an entire convolutional net, but we continue from here in our focus on electrochemical processes and see where it leads us.

Back to the LNP model

After all this above, there is only one more relevant step in information processing for our model. Once a critical level of depolarization is reached, a neuron will most often fire, but not always. There are mechanisms that prevent a neuron from firing. For example shortly after a neuron fired, its electric potential is too positive to produce a fully-fledged action potential, and thus it cannot fire again. This blockage may be present even when a sufficient electric potential is reached, because this blockade is a biological function and not a physical switch.

In the LNP model, this blockage of an action potential is modeled as an inhomogeneous Poisson process which has a Poisson distribution. A Poisson process with a Poisson distribution as a model means that the neuron has a very high probability to fire the first or second time it reached its threshold potential, but it may also be (with a exponentially decreasing probability) that a neuron may not fire for many more times.

A Poisson(0.5) distribution with a randomly drawn sample. Here 0,1,2,3 represents the waiting time until the neuron fires, thus 0 means it fires without delay, while 2 means it will not fire for two cycles even if it could fire physically.

There are exceptions to this rule, where neurons disable this mechanism and fire continuously at the rates which are governed by the physics alone — but these are special events which I will ignore at this point. Generally, this whole process is very similar to dropout used in deep learning which uses a uniform distribution instead of a Poisson distribution; thus this process can be viewed as some kind of regularization method that the brain uses instead of dropout.

In the next step, if the neuron fires, it releases an action potential. The action potential has very little difference in its amplitude, meaning the electric potential generated by the neuron almost always has the same magnitude, and thus is a reliable signal. As this signal travels down the axon it gets weaker and weaker. When it flows into the branches of the axon terminal, its final strength will be dependent on the shape and length of these branches; so each axon terminal will receive a different amount of electrical potential. This spatial information, together with the temporal information due to the spiking pattern of action potentials, is then translated into electrochemical information (it was shown that they are translated into spikes of neurotransmitters themselves that last about 2ms). To adjust the output signal, the axon terminal can move, grow or shrink (spatial), or it may alter its protein makeup which is responsible for releasing the synaptic vesicles (temporal).

Now we are back at the beginning: Neurotransmitters are released from the axon terminal (which can be modeled as a dense matrix multiplication) and the steps repeat themselves.

Learning and memory in the brain

Now that we went through the whole process back to back, let us put all this into context to see how the brain uses all this in concert.

Most neurons repeat the process of receive-inputs-and-fire about 50 to 1000 times per second; the firing frequency is highly dependent on the type of neuron and if a neuron is actively processesing tasks. Even if a neuron does not process a task it will fire continuously in a random fashion. Once some meaningful information is processed, this random firing activity makes way for a highly synchronized activity between neighboring neurons in a brain region. This synchronized activity is poorly understood, but is thought to be integral to understanding information processing in the brain and how it learns.

Currently, it is not precisely known how the brain learns. We do know that it adjusts synapses with some sort of reinforcement learning algorithm in order to learn new memories, but the precise details are unclear and the weak and contradicting evidence indicates that we are missing some important pieces of the puzzle. We got the big picture right, but we cannot figure out the brain’s learning algorithm without the fine detail which we are still lacking.

Concerning memories, we know that some memories are directly stored in the hippocampus, the main learning region of the brain (if you lose your hippocampus in each brain hemisphere, you cannot form new memories). However, most long-term memories are created and integrated with other memories during your REM sleep phase, when so called sleep spindles unwind the information of your hippocampus to all other brain areas. Long-term memories are generally all local: Your visual memories are stored in the visual system; your memories for your tongue (taste, texture) are stored in the brain region responsible for your tongue, etcetera.

It is also known, that the hippocampus acts as a memory buffer. Once it is full, you need to sleep to empty its contents to the rest of your brain (through sleep spindles during REM sleep); this might be why babies sleep so much and so irregularly —once their learning buffer is full, they sleep to quickly clear their buffer in order to learn more after they wake. You can still learn when this memory buffer is full, but retention is much worse and new memories might wrangle with other memories in the buffer for space and displace them —so really get your needed amount of sleep. Sleeping less and irregularly is unproductive, especially for students who need to learn.

The hippocampus in each hemisphere is shown in red. Image source: 1

Because memories are integrated with other memories during your “write buffer to hard-drive” stage, sleep is also very important for creativity. The next time you recall a certain memory after you slept, it might be altered with some new information that your brain thought to be fitting to attach to that memory.

I think we all had this: We wake up with some crazy new idea, only to see that it was quite nonsensical in the first place — so our brain is not perfect either and makes mistakes. But other times it just works: One time I tortured myself with a math problem for 7 hours non-stop, only to go to bed disappointed with only about a quarter of the whole problem solved. After I woke, I immediately had two new ideas how to solve the problem: The first did not work; but second made things very easy and I could sketch a solution to the math problem within 15 minutes — an ode to sleep!

Now why do I talk about memories when this blog post is about computation? The thing is that memory creation — or in other words — a method to store computed results for a long time, is critical for any intelligence. In brain simulations, one is satisfied if the synapse and activations occur in the same distribution as they do in the real brain, but one does not care if these synapses or activations correspond to anything meaningful — like memories or “distributed representations” needed for functions such as object recognition. This is a great flaw. Brain simulations have no memories.

In brain simulation, the diffusion of electrochemical particles is modeled by differential equations. These differential equations are complex, but can be modeled with simple techniques like Euler’s method to approximate these complex differential equations. The result has poor accuracy (meaning high error) but the algorithm is very computationally efficient and the accuracy is sufficient to reproduce the activities of real neurons along with their size and distribution of synapses. The great disadvantage is that we generally cannot learn parameters from a method like this — we cannot create meaningful memories.

However, as I have shown in my blog post about convolution, we can also model diffusion by applying convolution — a very computationally complex operation. The advantage about convolution is that we can use methods like maximum-likelihood estimation with backpropagation to learn parameters which lead to meaningful representations which are akin to memories (just like we do in convolutional nets). This is exactly akin to the LNP model with its convolution operation.

So besides its great similarity to deep learning models, the LNP model is also justified in that it is actually possible to learn parameters which yield meaningful memories (where with memories I mean here distributed representations like those we find in deep learning algorithms).

This then also justifies the next point where I estimate the brain’s complexity by using convolution instead of Euler’s method on differential equations.

Another point to take away from for our model is, that we currently have no complexity assigned for the creation of memories (we only modeled the forward pass, not the backward pass with backpropagation). As such, we underestimate the complexity of the brain, but because we do not know how the brain learns, we cannot make any accurate estimates for the computational complexity of learning. With that said and kept in the back of our mind, let us move on to bringing the whole model together for a lower bound of computational complexity.

Bringing it all together for a mathematical estimation of complexity

The next part is a bit tricky: We need to estimate the numbers for N, M, n and m and these differ widely among neurons.

We know that 50 of the 86 billion neurons in the brain are cerebellar granule neurons, so these neurons and their connection will be quite important in our estimation.

Cerebellar granule neurons are very tiny neurons with about 4 dendrites. Their main input is from the cortex. They integrate these signals and then send them along a T-shaped axon which feeds into the dendrites of Purkinje neurons.

Purkinje neurons are by far the most complex neurons, but there are only about 100 million of them. They may have more than a 100000 synapses each and about 1000 dendrites. Multiple Purkinje neurons bundle their outputs in about a dozen deep nuclei (a bunch of densely packed neurons) which then send signals back to the cortex.

This process is very crucial for non-verbal intelligence, abstract thinking and abstract creativity (creativity: Name as many words beginning with the letter A; abstract creativity: What if gravity bends space-time (general relativity)? What if these birds belonged to the same species when they came to this island (evolution)?). It was thought a few decades ago that the cerebellum only computes outputs for movement; for example while Einstein’s cerebrum was handled and studied carefully, his cerebellum was basically just cut off and put away, because it was regarded as a “primitive” brain part.

But since then it was shown that the cerebellum forms 1:1 connections with most brain regions of the cortex. Indeed, changes in the front part of the cerebellum during the ages 23 to 25 may change your non-verbal IQ by up to 30 points, and changes of 10-15 IQ points are common. This is very useful in most instances, whereas we lose neurons which perform a function which we do not need in everyday lives (calculus, or the foreign language which you learned but never used).

So it is crucial to get the estimation of the cerebellum right not only because it contains most neurons, but also because it is important for intelligence and information processing in general.

Estimation of cerebellar filter dimensions

Now if we look at a single dendrite, it branches off into a few branches and thus has a tree like structure. Along its total length it is usually packed with synapses. Dendritic spikes can originate in any branch of a dendrite (spatial dimension). When we take 3 branches per dendrite, and 4 dendrites in total we have a convolutional filter of size 3 and 4 for cerebellar granule neurons. Since linear convolution over two dimensions is the same as convolution over one dimension followed by convolution over the other dimension, we can also model this as a single 3×4 convolution operation. Also note that this is mathematically identical to a model that describes the diffusion of particles originating from different sources (feature map) which diffuse according to a rule in their neighborhood (kernel) — this is exactly what happens at a physical level. More on this view in my blog post about convolution.

Here I have chosen to represent the spatial domain with a single dimension. It was shown that the shape of the dendritic tree is also important in the resulting information processing and thus we would need two dimensions for the spatial domain. However, data is lacking to represent this mathematically in a meaningful way and thus I proceed with the simplification to one spatial dimension.

The temporal dimension is also important here: Charged particles may linger for a while until they are pumped out of the neuron. It is difficult to estimate a meaningful time frame, because the brain uses continuous time while our deep learning algorithms only know discrete time steps.

No single estimate makes sense from a biological perspective, but from a psychological perspective we know that the brain can take up unconscious information that is presented in an image in about 20 milliseconds (this involves only some fast, special parts of the brain). For conscious recognition of an object we need more time — at least 65 milliseconds, and on average about 80-200 milliseconds for reliable conscious recognition. This involves all the usual parts that are active for object recognition.

From these estimates, one can think about this process as “building up the information of the seen image over time within a neuron”. However, a neuron can only process information if it can differentiate meaningful information from random information (remember, neurons fire randomly if they do not actively process information). Once a certain level of “meaningful information” is present, the neuron actively reacts to that information. So in a certain sense information processing can be thought of as an epidemic of useful information that spreads across the brain: Information can only spread to one neuron, if the neighboring neuron is already infected with this information. Thinking in this way, such an epidemic of information infects all neurons in the brain within 80-200 milliseconds.

As such we can say that, while the object lacks details in the first 20 milliseconds, there is full detail at about 80-200 milliseconds. If we translate this into discrete images at the rate of 30 frames per second (normal video playback) —or in other words time steps — then 20 milliseconds would be 0.6 time steps, and 80-200 milliseconds 2.4-6 time steps. This means, that all the visual information that a neuron needs for its processing will be present in the neuron within 2.4 to 6 frames.

To make calculations easier, I here now choose a fixed time dimension of 5 time steps for neural processes. This means for the dendrites we have spatio-temporal convolutional filters of size 3x4x5 for cerebellar granule neurons. For Purkinje neurons a similar estimate would be filters of a size of about 10x1000x5. The non-linearity then reduces these inputs to a single number for each dendrite. This number represents an instantaneous firing rate, that is, the number represents how often the neuron fires in the respective interval of time, for example at 5 Hz, 100 Hz, 0 Hz etcetera. If the potential is too negative, no spike will result (0 HZ); if the potential is positive enough, then the magnitude of the spike is often proportional to the magnitude of the electric potential —but not always.

It was shown that dendritic summation of this firing rate can be linear (the sum), sub-linear (less than the sum), supra-linear (more than the sum) or bistable (less than the sum, or more than the sum, depending on the respective input); these behaviors of summation often differ from neuron to neuron. It is known that Purkinje neurons use linear summation, and thus their summation to form a spike rate is very similar to the rectified linear function max(0,x) which is commonly used in deep learning. Non-linear sums can be thought of different activation functions. It is important to add, that the activation function is determined by the type of the neuron.

The filters in the soma (or cell body) can be thought of as an additional temporal convolutional filter with a size of 1 in the spatial domain. So this is a filter that reduces the input to a single dimension with a time dimension of 5, that is, a 1x1x5 convolutional filter (this will be the same for all neurons).

Again, the non-linearity then reduces this to an instantaneous firing rate, which then is dropped out by a Poisson process, which is then fed into a weight-matrix.

At this point I want to again emphasize, that it is not correct to view the output of a neuron as binary; the information conveyed by a firing neuron is more like an if-then-else branch: “if(fire == True and dropout == False){ release_ neurotransmitters(); }else{ sleep(0.02); }”

The neurotransmitters are the true output of a neuron, but this is often confused. The source of this confusion is that it is very difficult to study neurotransmitter release and its dynamics with a synapse, while it is ridiculously easy to study action potentials. Most models of neurons thus model the output as action potentials because we have a lot of reliable data here; we do not have such data for neurotransmitter interactions at a real-time level. This is why action potentials are often confused as the true outputs of neurons when they are not.

When a neuron fires, this impulse can be thought of as being converted to a discrete number at the axon terminal (number of vesicles which are released) and is multiplied by another discrete number which represents the amount of receptors on the synapse (this whole process corresponds to a dense or fully connected weight in convolutional nets). In the next step of information processing, charged particles flow into the neuron and build up a real-valued electric potential. This has also some similarities to batch-normalization, because values are normalized into the range [0,threshold] (neuron: relative to the initial potential of the neuron; convolutional net: relative to the mean of activations in batch-normalization). When we look at this whole process, we can model it as a matrix multiplication between two real-valued matrices (doing a scaled normalization before or after this is mathematically equivalent, because matrix multiplication is a linear operation).

Therefore we can think of axon-terminal-synapse interactions between neurons as a matrix multiplication between two real-valued matrices.

Estimation of cerebellar input/output dimensions

Cerebellar granule neurons typically receive inputs from about four axons (most often connections from the cortex). Each axon forms about 3-4 synapses with the dendritic claw of the granule neuron (a dendrite ending shaped as if you would hold a tennis ball in your hand) so there are a total of about 15 inputs via synapses to the granule neurons. The granule neuron itself ends in a T shaped axon which crosses directly through the dendrites of Purkinje neurons with which it forms about 100 synapses.

Purkinje neurons receive inputs from about 100000 connections made with granule neurons and they themselves make about 1000 connections in the deep nuclei. There are estimates which are much higher and no accurate number for the number of synapses exists as far as I know. The number of 100000 synapses might be a slight overestimate (but 75000 would be too conservative), but I use it anyways to make the math simpler.

All these dimensions are taken times the time dimension as discussed above, so that the input for granule neurons for example has a dimensionality of 15×5.

So with this we can finally calculate the complexity of a cerebellar granule neuron together with the Purkinje neurons.

So my estimate would be 1.075×10^21 FLOPS for the brain, the fastest computer on earth as of July 2013 has 0.58×10^15 FLOPS for practical application (more about this below).

Part III: Limitations and criticism

While I discussed how the brain is similar to deep learning, I did not discuss how the brain is different. One great disparity is that the dropout in the brain works with respect to all inputs, while dropout in a convolutional network works with respect to each single unit. What the brain is doing makes little sense in deep learning right now; however, if you think about combining millions of convolutional nets with each other, it makes good sense to do as the brain does. The dropout of the brain certainly would work well to decouple the activity of neurons from each other, because no neuron can depend on information from a single other neuron (because it might be dropped out), so that it is forced to take into account all the neurons it is connected with, thus eliminating biased computation (which is basically regularization).

Another limitation of the model is that it is a lower bound. This estimate does not take into account:

Backpropagation, i.e. signals that travel from the soma to the dendrites; the action potential is reflected within the axon and travels backwards (these two things may almost double the complexity)
Axon terminal information processing
Multi-neurotransmitter vesicles (can be thought of multiple output channels or filters, just as an image has multiple colors)
Geometrical shape of the dendritic tree
Dendritic spine information processing
Non-axodendritic synapses (axon-axon and axon-soma connections)
Electrical synapses
Neurotransmitter induced protein activation and signaling
Neurotransmitter induced gene regulation
Voltage induced (dendritic spikes and backpropagating signals) gene regulation
Voltage induced protein activation and signaling
Glia cells (besides having an extremely abnormal brain (about one in a billion), Einstein also had abnormally high levels of glia cells)

All these things have been shown to be important for information processing in the brain. I did not include them in my estimate because this would have made everything:

Too complex: What I have discussed so far is extremely simple if you compare that to the vastness and complexity of biological information processing
Too special: Non-axodendritic synapses can have unique information processing algorithms completely different from everything listed here, e.g. direct electrical communication between a neighboring bundle of neurons
And/or evidence is lacking to create a reliable mathematical model: Neural backpropagation, geometry of the dendritic trees, and dendritic spines

Remember that these estimates are for the whole brain. Local brain regions might have higher computational processing speed than this average when they are actively processing stimuli. Also remember that the cerebellum makes up almost all computational processing. Other brain regions integrate the knowledge of the cerebellum, but the cerebellum acts as a transformation and abstraction module for almost all information in the brain (except vision and hearing).

But wait, but we can do all this with much less computational power! We already have super-human performance in computer vision!

I would not say that we have super-human performance in computer vision. What we have is a system that beats human at naming things in images that are taken out of context of the real world (what happens before we see something in the real world shapes our perception dramatically). We almost always can recognize things in our environment, but we most often just do not know (or care about) the name of what we see.

Humans do not have the visual system to label things. Try to make a list of 1000 common physical objects in the real world —not an easy task.

To not recognize an object for us humans would mean that we see an object but cannot make sense of it. If you forgot the name of an old classmate, it does not mean you did not recognize her; it just means you forgot her name. Now imagine you get off a train stop and you know a good friend is waiting for you somewhere at the stop. You see somebody 300 meters away waving their hands who is looking in your direction — is it your friend? You do not know; you cannot recognize if it is her. That’s the difference between mere labels and object recognition.

Now if you cannot recognize something in a 30×30 pixel image, but the computer can, this also does not necessarily mean that the computer has super-human object recognition performance. First and foremost this means that your visual system does not work well for pixeled information. Our eyes are just not used to that.

Now take a look outside a window and try to label all the things you see. It will be very easy for most things, but for other things you do not know the correct labels! For example, I do not know the name for a few plants that I see when I look out of my window. However, we are fully aware what it is what we see and can name many details of the object. For example, alone by assessing their appearance, I know a lot about how much water and sunshine the unknown plants need, how fast they grow, in which way they grow, if they are old or young specimens; I know how they feel like if I touch them — or more generally — I know how these plants grow biologically and how they produce energy, and so on. I can do all this without knowing its name. Current deep learning systems cannot do this and will not do this for quite some time. Human-level performance in computer vision is far away indeed! We just reached the very first step (object recognition) and now the task is to make computer vision smart, rather than making it just good at labeling things.

Evolutionarily speaking, the main functions of our visual system have little to do with naming things that we see: Hunt and avoid being hunted, to orient ourselves in nature during foraging and make sure we pick the right berries and extract roots efficiently— these are all important functions, but probably one of the most important functions of our vision is the social function within a group or relationship.

If you Skype with someone it is quite a different communication when they have their camera enabled compared to if they have not. It is also very different to communicate with someone whose image is on a static 2D surface compared to communicating in person. Vision is critical for communication.

Our deep learning cannot do any of this efficiently.

Making sense of a world without labels

One striking case which also demonstrates the power of vision for true understanding of the environment without any labels is the case of Genie. Genie was strapped into place and left alone in a room at the age of 20 months. She was found with severe malnutrition 12 years later. She had almost no social interaction during this time and thus did not acquire any form of verbal language.

Once she got in contact with other human beings she was taught English as a language (and later also sign language), but she never really mastered it. Instead she quickly mastered non-verbal language and was truly exceptional at that.

To strangers she almost exclusively communicated with non-verbal language. There are instances where these strangers would stop in their place, leave everything behind, walk up to her and hand her a toy or another item — that item was always something that was known to be something liked and desired.

In one instance a woman got out of her car at a stoplight at an intersection, emptied her purse and handed it to Genie. The woman and Genie did not exchange a word; they understood each other completely non-verbally.

So what Genie did, was to pick up cues with her visual system and translated the emotional and cognitive state of that woman into non-verbal cues and actions, which she would then use to change the mental state of the woman. In turn that the woman would then desire to give the purse to Genie (which Genie probably could not even see).

Clearly, Genie was very exceptional at non-verbal communication — but what would happen if you pitched her against a deep learning object recognition system? The deep learning system would be much better than Genie on any data set you would pick. Do you think it would be fair to say that the convolutional net is better at object recognition than Genie is? I do not think so.

This shows how primitive and naïve our approach to computer vision is. Object recognition is a part of human vision, but it is not what makes it exceptional.

Can we do with less computational power?

“We do not need as much computational power as the brain has, because our algorithms are (will be) better than that of the brain.”

I hope you can see after the descriptions in this blog post that this statement is rather arrogant.

We do not know how the brain really learns. We do not understand information processing in the brain in detail. And yet we dare to say we can do better?

Even if we did know how the brain works in all its details, it would still be rather naïve to think we could create general intelligence with much less. The brain developed during many hundreds of millions of years through evolution. Evolutionary, it is the most malleable organ there is: The human cortex shrunk by about 10% during the last 20000 years, and the human brain adapted rapidly to the many ways we use verbal language — a very recent development in evolutionary terms.

It was also shown that the number of neurons in each animal’s brain is almost exactly the amount which it can sustain through feeding (we probably killed off the majority of all mammoths by about 20000 years ago). We humans have such large brains because we invented fire and cooking with which we could predigest food which made it possible to sustain more neurons. Without cooking, the intake of calories would not be high enough to sustain our brains and we would helplessly starve (at least a few thousand years ago; now you could survive on a raw vegan diet easily — just walk into a supermarket and buy a lot of calorie-dense foods). With this fact, it is very likely that brains are optimized exhaustively to create the best information processing which is possible for the typical calorie intake of the respective species — the function which is most expensive in an animal will be most ruthlessly optimized to enhance survival and procreation. This is also very much in line with all the complexity of the brain; every little function is optimized thoroughly and only as technology advances we can understand step by step what this complexity is made for.

There are many hundreds of different types of neurons in the brain, each with their designated function. Indeed, neuroscientists often can differentiate different brain regions and their function by looking at the changing architecture and neuron types in a brain region. Although we do not understand the details of how the circuits perform information processing, we can see that each of these unique circuits is designed carefully to perform a certain kind of function. These circuits are often replicated in evolutionary distinct species which share a common ancestor that branched off into these different species hundreds of millions of years ago, showing that such structures are evolutionarily optimal for the tasks they are processing.

The equivalent in deep learning would be, if we had 10000 different architectures of convolutional nets (with its own set of activation functions and more) which we combine meticulously to improve the overall function of our algorithm ― do you really think we can build something which can produce as complex information processing, but which follows a simple general architecture?

It is rather naïve to think that we can out-wit this fantastically complex organ when we are not even able to understand its learning algorithms.

On top of this, the statement that we will develop better algorithms than the brain uses is unfalsifiable. We can only prove it when we achieve it, we cannot disprove it. Thus it is a rather nonsensical statement that has little practical value. Theories are usually useful even when there is not enough evidence to show that they are correct.

The standard model of physics is an extremely useful theory used by physicists and engineers around the world in their daily life to develop the high tech products we enjoy; and yet this theory is not complete, it was amended just a few days ago when a new particle was proven to exist in the LHC experiment.

Imagine if there were another model, but you would only be able to use it when we have proven the existence of all particles. This model would then be rather useless. When it makes no predictions at all about the behavior in the world, we would be unable to manufacture and develop electronics with this theory. Similarly, the statement that we can develop more efficient algorithms than the brain does not help; it rather makes it more difficult to make further progress. The brain should really be our main point of orientation.

Another argument, which would be typical for Yann LeCun (he made a similar argument during a panel) would be: Arguably, airplanes are much better at flying than birds are; yet, if you describe the flight of birds it is extremely complex and every detail counts, while the flight of airplanes is described simply by the fluid flow around an airfoil. Why is it wrong to expect this simplicity from deep learning when compared to the brain?

I think this argument has some truth in it, but essentially, it asks the wrong question. I think it is clear that we need not to replicate everything in detail in order to achieve artificial intelligence, but the real question is: Where do we draw the line? If you get to know that neurons can be modeled in ways that closely resemble convolutional nets, would you go so far and say, that this model is too complex and we need to make it simpler?

Part IV: Predicting the growth of practical computational power

There is one dominant measure of performance in high-performance computing (HPC) and this measure is floating point operations per second (FLOPS) on the High Performance LINPACK (HPL) benchmark – which measures how many computations a system can do in a second when doing distributed dense matrix operations on hundreds or thousands of computers. There exists the TOP 500 list of supercomputers, which is a historical list based on this benchmark which is the main reference point for the performance of a new supercomputer system.

But a big but comes with the LINPACK benchmark. It does not reflect the performance in real, practical applications which run on modern supercomputers on a daily basis, and thus, the fastest computers on the TOP 500 list are not necessarily the fastest computers for practical applications.

Everybody in the high performance computing community knows this, but it is so entrenched in the business routine in this area, that when you design a new supercomputer system, you basically have to show that your system will be able to get a good spot on the TOP 500 in order to get funding for that supercomputer.

Sometimes such systems are practically unusable, like the Tianhe-2 supercomputer which still holds the top spot on the LINPACK benchmark after more than three years. The potential of this supercomputer goes largely unused because it is too expensive to run (electricity) and the custom hardware (custom network, Intel Xeon Phi) requires new software, which would need years of development to reach the levels of sophistication of standard HPC software. The Tianhe-2 runs only at roughly one third of its capacity, or in other words, it practically stands idle for nearly 2 out of 3 minutes. The predecessor of the Tianhe-2, the Tianhe-1, fastest computer in the world in 2010 (according to LINPACK), has not been used since 2013 due to bureaucracy reasons.

While outside of China, other supercomputers of similar design fare better, they typically do not perform so well in practical applications. This is so, because the used accelerators like graphic processing units (GPUs) or Intel Xeon Phis can deliver high FLOPS in such a setup, but they are severely limited by network bandwidth bottlenecks.

To correct the growing uselessness of the LINPACK benchmark a new measure of performance was developed: The high performance conjugate gradient benchmark (HPCG). This benchmark performs conjugate gradient, which requires more communication than LINPACK and as such comes much closer to performance numbers for real applications. I will use this benchmark to create my estimates for a singularity.

The TOP500 for the last decade and some data for the HPCG (data collection only began recently). The dashed lines indicate a forecast. The main drivers of computational growth are also shown: Multicore CPU, GPU, and in 2016-2017 3D memory, and some new unknown technology in 2020. Will this growth be sustainable?

However, this benchmark still dramatically overestimates the computing power that can be reached for artificial intelligence applications when we assume that these applications are based on deep learning.

Deep learning is currently the most promising technique for reaching artificial intelligence. It is certain that deep learning — as it is now — will not be enough, but one can say for sure that something similar to deep learning will be involved in reaching strong AI.

Deep learning, unlike other applications has an unusually high demand for network bandwidth. It is so high that for some supercomputer designs which are in the TOP 500 a deep learning application would run slower than on your desktop computer. Why is this so? Because parallel deep learning involves massive parameter synchronization which requires extensive network bandwidth: If your network bandwidth is too slow, then at some point deep learning gets slower and slower the more computers you add to your system. As such, very large systems which are usually quite fast may be extremely slow for deep learning.

The problem with all this is that the development of new network interconnects which enable high bandwidth is difficult and advances are made much more slowly than the advances of computing modules, like CPUs, GPUs and other accelerators. Just recently, Mellanox reached a milestone where they could manufacture switches and InfiniBand cards which operate at 100Gbits per second. This development is still rather experimental, and it is difficult to manufacture fiber-optic cables which can operate at this speed. As such, no supercomputer implements this new development as of yet. But with this milestone reached, there will not be another milestone for many quite a while. The doubling time for network interconnect bandwidth is about 3 years.

Similarly, there is a memory problem. While the speed of theoretical processing power of CPUs and GPUs keeps increasing, the memory bandwidth of RAM is almost static. This is a great problem, because now we are at a point where it costs more time to move the data to the compute circuits than to actually make computations with it.

With new developments such as 3D memory one can be sure that further increases in memory bandwidth will be achieved, but we have nothing after that to increase the performance further. We need new ideas and new technology. Memory will not scale itself by getting smaller and smaller.

However, currently the biggest hurdle of them all is power consumption. The Tianhe-2 uses 24 megawatts of power, which totals to $65k-$100k in electricity cost per day, or about $23 million per year. The power consumed by the Tianhe-2 would be sufficient to power about 6000 homes in Germany or 2000 homes in the US (A/C usage).

An overview about how the performance constraints changed from old to new supercomputers. Adapted from Horst Simon‘s presentation

Physical limitations

Furthermore, there are physical problems around the corner. Soon, our circuits will be so small that electrons will start to show quantum effects. One such quantum effect is quantum tunneling. In quantum tunneling an electron sits in two neighboring circuits at once, and decides randomly to which of these two locations it will go next.

If this would happen at a larger scale, it would be like charging your phone right next to your TV, and the electrons decide they want to go to your cell phone cable rather than to your TV; so they jump over to the phone cable cutting off the power to your TV. Quantum tunneling will become relevant in 2016-2017 and has to be taken into account from there on. New materials and “insulated” circuits are required to make everything work from here on.

With new materials, we need new production techniques which will be very costly because all computer chips relied on the same, old but reliable production process. We need research and development to make our known processes working with these new materials and this will not only cost money but also cost time. This will also fuel a continuing trend where the cost for producing computer chips increases exponentially (and growth may slow due to costs). Currently, the tally is at $9bn for such a semiconductor fabrication plant (fab) increasing at a relatively stable rate of about 7-10% higher costs per year for the past decades.

After this, we are at the plain physical limits. A transistor will be composed of not much more than a handful of atoms. We cannot go smaller than this, and this level of manufacturing will require extensive efforts in order to get such devices working properly. This will start to happen around 2025 and the growth may slow from here due to physical limitations.

Recent trends in the growth of computational power

So to summarize the previous section: (1) LINPACK performance does not reflect practical performance because it does not test memory and network bandwidth constraints; (2) memory and network bandwidth are now more important than computational power, however (3) advances in memory and network bandwidth will be sporadic and cannot compete with the growth in computational power; (4) electrical costs are a severe limitation (try to justify a dedicated power plant for a supercomputer if citizen face sporadic power outages), and also (5) computational power will be limited by physical boundaries in the next couple of years.

It may not come to a surprise then that the growth in computational power has been slowing down in recent years; this is mainly due to power efficiencies which will only be improved gradually, but the other factors also take its toll, like network interconnects which cannot keep up with accelerators like GPUs.

If one takes the current estimate of practical FLOPS of the fastest supercomputer, the Tianhe-2 with 0.58 petaflops on HPCG, then it would take 21 doubling periods until the lower bound of the brain’s computational power is reached. If one uses Moore’s Law, we would reach that by 2037; if we take the growth of the last 60 years, which is about 1.8 years per doubling period, we will reach this in the year 2053. If we take a lower estimate of 3 years for the doubling period due to the problems listed above we will reach this in 2078. While for normal supercomputing applications memory bandwidth is the bottleneck for practical applications as of now, this may soon change to networking bandwidth, which doubles about every 3 years. So the 2078 estimate might be quite accurate.

Growth in computing performance with respect to the HPCG benchmark. Both computing performance and factory costs are assumed to keep growing steadily at an exponential rate with doubling period of 18 or 36 months, respectively.

Now remember that, (1) the HPCG benchmark has much higher performance than typical deep learning applications which rely much more on network and memory bandwidth, and (2) that my estimate for the computational complexity of the brain is a lower bound. One can see that an estimate beyond 2100 might be not too far off. To sustain such a long and merciless increase in computation performance will require that we develop and implement many new ideas while operating at the border of physical limitations as soon as by 2020. Will this be possible?

Where there’s a will, there’s a way — the real question is: Are we prepared to pay the costs?

Conclusion

Here I have discussed the information processing steps of the brain and their complexity and compared them to those of deep learning algorithms. I focused on a discussion of basic electrochemical information processing and neglected biological information processing.

I used an extended linear-nonlinear-Poisson cascade model as groundwork and related it to convolutional architectures.

With this model, I could show that a single neuron has an information processing architecture which is very similar to current convolutional nets, featuring convolutional stages with rectified non-linearities which activities are then regularized by a dropout-like method. I also established a connection between max-pooling and voltage-gated channels which are opened by dendritic spikes. Similarities to batch-normalization exist.

This straightforward similarity gives strong reason to believe that deep learning is really on the right path. It also indicates that ideas borrowed from neurobiological processes are useful for deep learning (the problem was that progress in deep learning architectures often preceded knowledge in neurobiological processes).

My model shows that it can be estimated that the brain operates at least 10x^21 operations per second. With current rates of growth in computational power we could achieve supercomputers with brain-like capabilities by the year 2037, but estimates after the year 2080 seem more realistic when all evidence is taken into account. This estimate only holds true if we succed to stomp limitations like physical barriers (for example quantum-tunneling), capital costs for semiconductor fabrication plants, and growing electrical costs. At the same time we constantly need to innovate to solve memory bandwidth and network bandwidth problems which are or will be the bottlenecks in supercomputing. With these considerations taken into account, it is practically rather unlikely that we will achieve human-like processing capabilities anytime soon.

Closing remarks

My philosophy of this blog post was to present all information on a single web-page rather than scatter information around. I think this design helps to create a more sturdy fabric of knowledge, which, with its interwoven strains of different fields, helps to create a more thorough picture of the main ideas involved. However, it has been quite difficult to organize all this information into a coherent picture and some points might be more confusing than enlightening. Please leave a comment below to let me know if the structure and content need improvement, so that I can adjust my next blog post accordingly.

I would also love general feedback for this blog post.

Also make sure to share this blog post with your fellow deep learning colleagues. People with raw computer science backgrounds often harbor misconceptions about the brain, its parts and how it works. I think this blog post could be a suitable remedy for that.

The next blog post

The second post in this series on neuroscience and psychology will focus on the most important brain regions and their function and connectivity. The last and third part in the series will focus on psychological processes, such as memory and learning, and what we can learn from that with respect to deep learning.

Acknowledgments

I would like to thank Alexander Tonn for his useful advice and for proofreading this blog post.

Important references and sources

Neuroscience

Brunel, N., Hakim, V., & Richardson, M. J. (2014). Single neuron dynamics and computation. Current opinion in neurobiology, 25, 149-155.

Chadderton, P., Margrie, T. W., & Häusser, M. (2004). Integration of quanta in cerebellar granule cells during sensory processing. Nature, 428(6985), 856-860.

De Gennaro, L., & Ferrara, M. (2003). Sleep spindles: an overview. Sleep medicine reviews, 7(5), 423-440.

Ji, D., & Wilson, M. A. (2007). Coordinated memory replay in the visual cortex and hippocampus during sleep. Nature neuroscience, 10(1), 100-107.

Liaw, J. S., & Berger, T. W. (1999). Dynamic synapse: Harnessing the computing power of synaptic dynamics. Neurocomputing, 26, 199-206.

Ramsden, S., Richardson, F. M., Josse, G., Thomas, M. S., Ellis, C., Shakeshaft, C., … & Price, C. J. (2011). Verbal and non-verbal intelligence changes in the teenage brain. Nature, 479(7371), 113-116.

Smith, S. L., Smith, I. T., Branco, T., & Häusser, M. (2013). Dendritic spikes enhance stimulus selectivity in cortical neurons in vivo. Nature, 503(7474), 115-120.

Stoodley, C. J., & Schmahmann, J. D. (2009). Functional topography in the human cerebellum: a meta-analysis of neuroimaging studies. Neuroimage,44(2), 489-501.

High performance computing

Dongarra, J., & Heroux, M. A. (2013). Toward a new metric for ranking high performance computing systems. Sandia Report, SAND2013-4744, 312.

PDF: HPCG Specification

Interview: Why there will be no exascale computing before 2020

Slides: Why there will be no exascale computing before 2020

Interview: Challenges of exascale computing

Image references

Anwar, H., Roome, C. J., Nedelescu, H., Chen, W., Kuhn, B., & De Schutter, E. (2014). Dendritic diameters affect the spatial variability of intracellular calcium dynamics in computer models. Frontiers in cellular neuroscience, 8.

The post The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near appeared first on Tim Dettmers.

Understanding Convolution in Deep Learning

Tim Dettmers — Thu, 26 Mar 2015 22:08:00 +0000

Convolution is probably the most important concept in deep learning right now. It was convolution and convolutional nets that catapulted deep learning to the forefront of almost any machine learning task there is. But what makes convolution so powerful? How does it work? In this blog post I will explain convolution and relate it to other concepts that will help you to understand convolution thoroughly.

There are already some blog post regarding convolution in deep learning, but I found all of them highly confusing with unnecessary mathematical details that do not further the understanding in any meaningful way. This blog post will also have many mathematical details, but I will approach them from a conceptual point of view where I represent the underlying mathematics with images everybody should be able to understand. The first part of this blog post is aimed at anybody who wants to understand the general concept of convolution and convolutional nets in deep learning. The second part of this blog post includes advanced concepts and is aimed to further and enhance the understanding of convolution for deep learning researchers and specialists.

What is convolution?

This whole blog post will build up to answer exactly this question, but it may be very helpful to first understand in which direction this is going, so what is convolution in rough terms?

You can imagine convolution as the mixing of information. Imagine two buckets full of information which are poured into one single bucket and then mixed according to a specific rule. Each bucket of information has its own recipe, which describes how the information in one bucket mixes with the other. So convolution is an orderly procedure where two sources of information are intertwined.

Convolution can also be described mathematically, in fact, it is a mathematical operation like addition, multiplication or a derivative, and while this operation is complex in itself, it can be very useful to simplify even more complex equations. Convolutions are heavily used in physics and engineering to simplify such complex equations and in the second part — after a short mathematical development of convolution — we will relate and integrate ideas between these fields of science and deep learning to gain a deeper understanding of convolution. But for now we will look at convolution from a practical perspective.

How do we apply convolution to images?

When we apply convolution to images, we apply it in two dimensions — that is the width and height of the image. We mix two buckets of information: The first bucket is the input image, which has a total of three matrices of pixels — one matrix each for the red, blue and green color channels; a pixel consists of an integer value between 0 and 255 in each color channel. The second bucket is the convolution kernel, a single matrix of floating point numbers where the pattern and the size of the numbers can be thought of as a recipe for how to intertwine the input image with the kernel in the convolution operation. The output of the kernel is the altered image which is often called a feature map in deep learning. There will be one feature map for every color channel.

Convolution of an image with an edge detector convolution kernel. Sources: 1 2

We now perform the actual intertwining of these two pieces of information through convolution. One way to apply convolution is to take an image patch from the input image of the size of the kernel — here we have a 100×100 image, and a 3×3 kernel, so we would take 3×3 patches — and then do an element wise multiplication with the image patch and convolution kernel. The sum of this multiplication then results in one pixel of the feature map. After one pixel of the feature map has been computed, the center of the image patch extractor slides one pixel into another direction, and repeats this computation. The computation ends when all pixels of the feature map have been computed this way. This procedure is illustrated for one image patch in the following gif.

Calculating convolution by operating on images patches.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/aa-convolution-02.gif?fit=300%2C250&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/aa-convolution-02.gif?fit=360%2C300&ssl=1" class="wp-image-193" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/aa-convolution-02.gif?resize=500%2C417&ssl=1" alt="Calculating convolution by operating on images patches." width="500" height="417" data-recalc-dims="1" />

Convolution operation for one pixel of the resulting feature map: One image patch (red) of the original image (RAM) is multiplied by the kernel, and its sum is written to the feature map pixel (Buffer RAM). Gif by Glen Williamson who runs a website that features many technical gifs.

As you can see there is also a normalization procedure where the output value is normalized by the size of the kernel (9); this is to ensure that the total intensity of the picture and the feature map stays the same.

Why is convolution of images useful in machine learning?

There can be a lot of distracting information in images that is not relevant to what we are trying to achieve. A good example of this is a project I did together with Jannek Thomas in the Burda Bootcamp. The Burda Bootcamp is a rapid prototyping lab where students work in a hackathon-style environment to create technologically risky products in very short intervals. Together with my 9 colleagues, we created 11 products in 2 months. In one project I wanted to build a fashion image search with deep autoencoders: You upload an image of a fashion item and the autoencoder should find images that contain clothes with similar style.

Now if you want to differentiate between styles of clothes, the colors of the clothes will not be that useful for doing that; also minute details like emblems of the brand will be rather unimportant. What is most important is probably the shape of the clothes. Generally, the shape of a blouse is very different from the shape of a shirt, jacket, or trouser. So if we could filter the unnecessary information out of images then our algorithm will not be distracted by the unnecessary details like color and branded emblems. We can achieve this easily by convoluting images with kernels.

My colleague Jannek Thomas preprocessed the data and applied a Sobel edge detector (similar to the kernel above) to filter everything out of the image except the outlines of the shape of an object — this is why the application of convolution is often called filtering, and the kernels are often called filters (a more exact definition of this filtering processes will follow below). The resulting feature map from the edge detector kernel will be very helpful if you want to differentiate between different types of clothes, because only relevant shape information remains.

Sobel filtered inputs to and results from the trained autoencoder: The top-left image is the search query and the other images are the results which have an autoencoder code that is most similar to the search query as measured by cosine similarity. You see that the autoencoder really just looks at the shape of the search query and not its color. However, you can also see that this procedure does not work well for images of people wearing clothes (5th column) and that it is sensitive to the shapes of clothes hangers (4th column).

We can take this a step further: There are dozens of different kernels which produce many different feature maps, e.g. which sharpen the image (more details), or which blur the image (less details), and each feature map may help our algorithm to do better on its task (details, like 3 instead of 2 buttons on your jacket might be important).

Using this kind of procedure — taking inputs, transforming inputs and feeding the transformed inputs to an algorithm — is called feature engineering. Feature engineering is very difficult, and there are little resources which help you to learn this skill. In consequence, there are very few people which can apply feature engineering skillfully to a wide range of tasks. Feature engineering is — hands down — the most important skill to score well in Kaggle competitions. Feature engineering is so difficult because for each type of data and each type of problem, different features do well: Knowledge of feature engineering for image tasks will be quite useless for time series data; and even if we have two similar image tasks, it will not be easy to engineer good features because the objects in the images also determine what will work and what will not. It takes a lot of experience to get all of this right.

So feature engineering is very difficult and you have to start from scratch for each new task in order to do well. But when we look at images, might it be possible to automatically find the kernels which are most suitable for a task?

Enter convolutional nets

Convolutional nets do exactly this. Instead of having fixed numbers in our kernel, we assign parameters to these kernels which will be trained on the data. As we train our convolutional net, the kernel will get better and better at filtering a given image (or a given feature map) for relevant information. This process is automatic and is called feature learning. Feature learning automatically generalizes to each new task: We just need to simply train our network to find new filters which are relevant for the new task. This is what makes convolutional nets so powerful — no difficulties with feature engineering!

Usually we do not learn a single kernel in convolutional nets, instead we learn a hierarchy of multiple kernels at the same time. For example a 32x16x16 kernel applied to a 256×256 image would produce 32 feature maps of size 241×241 (this is the standard size, the size may vary from implementation to implementation; ). So automatically we learn 32 new features that have relevant information for our task in them. These feature then provide the inputs for the next kernel which filters the inputs again. Once we learned our hierarchical features, we simply pass them to a fully connected, simple neural network that combines them in order to classify the input image into classes. That is nearly all that there is to know about convolutional nets at a conceptual level (pooling procedures are important too, but that would be another blog post).

Part II: Advanced concepts

We now have a very good intuition of what convolution is, and what is going on in convolutional nets, and why convolutional nets are so powerful. But we can dig deeper to understand what is really going on within a convolution operation. In doing so, we will see that the original interpretation of computing a convolution is rather cumbersome and we can develop more sophisticated interpretations which will help us to think about convolutions much more broadly so that we can apply them on many different data. To achieve this deeper understanding the first step is to understand the convolution theorem.

The convolution theorem

To develop the concept of convolution further, we make use of the convolution theorem, which relates convolution in the time/space domain — where convolution features an unwieldy integral or sum — to a mere element wise multiplication in the frequency/Fourier domain. This theorem is very powerful and is widely applied in many sciences. The convolution theorem is also one of the reasons why the fast Fourier transform (FFT) algorithm is thought by some to be one of the most important algorithms of the 20^th century.

The first equation is the one dimensional continuous convolution theorem of two general continuous functions; the second equation is the 2D discrete convolution theorem for discrete image data. Here denotes a convolution operation, denotes the Fourier transform, the inverse Fourier transform, and is a normalization constant. Note that “discrete” here means that our data consists of a countable number of variables (pixels); and 1D means that our variables can be laid out in one dimension in a meaningful way, e.g. time is one dimensional (one second after the other), images are two dimensional (pixels have rows and columns), videos are three dimensional (pixels have rows and columns, and images come one after another).

To get a better understanding what happens in the convolution theorem we will now look at the interpretation of Fourier transforms with respect to digital image processing.

Fast Fourier transforms

The fast Fourier transform is an algorithm that transforms data from the space/time domain into the frequency or Fourier domain. The Fourier transform describes the original function in a sum of wave-like cosine and sine terms. It is important to note, that the Fourier transform is generally complex valued, which means that a real value is transformed into a complex value with a real and imaginary part. Usually the imaginary part is only important for certain operations and to transform the frequencies back into the space/time domain and will be largely ignored in this blog post. Below you can see a visualization how a signal (a function of information often with a time parameter, often periodic) is transformed by a Fourier transform.

Transformation of the time domain (red) into the frequency domain (blue). Source

You may be unaware of this, but it might well be that you see Fourier transformed values on a daily basis: If the red signal is a song then the blue values might be the equalizer bars displayed by your mp3 player.

The Fourier domain for images

Images by Fisher & Koryllos (1998). Bob Fisher also runs an excellent website about Fourier transforms and image processing in general.

How can we imagine frequencies for images? Imagine a piece of paper with one of the two patterns from above on it. Now imagine a wave traveling from one edge of the paper to the other where the wave pierces through the paper at each stripe of a certain color and hovers over the other. Such waves pierce the black and white parts in specific intervals, for example, every two pixels — this represents the frequency. In the Fourier transform lower frequencies are closer to the center and higher frequencies are at the edges (the maximum frequency for an image is at the very edge). The location of Fourier transform values with high intensity (white in the images) are ordered according to the direction of the greatest change in intensity in the original image. This is very apparent from the next image and its log Fourier transforms (applying the log to the real values decreases the differences in pixel intensity in the image — we see information more easily this way).

Images by Fisher & Koryllos (1998). Source

We immediately see that a Fourier transform contains a lot of information about the orientation of an object in an image. If an object is turned by, say, 37% degrees, it is difficult to tell that from the original pixel information, but very clear from the Fourier transformed values.

This is an important insight: Due to the convolution theorem, we can imagine that convolutional nets operate on images in the Fourier domain and from the images above we now know that images in that domain contain a lot of information about orientation. Thus convolutional nets should be better than traditional algorithms when it comes to rotated images and this is indeed the case (although convolutional nets are still very bad at this when we compare them to human vision).

Frequency filtering and convolution

The reason why the convolution operation is often described as a filtering operation, and why convolution kernels are often named filters will be apparent from the next example, which is very close to convolution.

Images by Fisher & Koryllos (1998). Source

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/filtered-image1.png?fit=300%2C81&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/filtered-image1.png?fit=1024%2C277&ssl=1" class="wp-image-274 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/filtered-image1.png?resize=1061%2C287&ssl=1" alt="" width="1061" height="287" data-recalc-dims="1" />

Images by Fisher & Koryllos (1998). Source

If we transform the original image with a Fourier transform and then multiply it by a circle padded by zeros (zeros=black) in the Fourier domain, we filter out all high frequency values (they will be set to zero, due to the zero padded values). Note that the filtered image still has the same striped pattern, but its quality is much worse now — this is how jpeg compression works (although a different but similar transform is used), we transform the image, keep only certain frequencies and transform back to the spatial image domain; the compression ratio would be the size of the black area to the size of the circle in this example.

If we now imagine that the circle is a convolution kernel, then we have fully fledged convolution — just as in convolutional nets. There are still many tricks to speed up and stabilize the computation of convolutions with Fourier transforms, but this is the basic principle how it is done.

Now that we have established the meaning of the convolution theorem and Fourier transforms, we can now apply this understanding to different fields in science and enhance our interpretation of convolution in deep learning.

Insights from fluid mechanics

Fluid mechanics concerns itself with the creation of differential equation models for flows of fluids like air and water (air flows around an airplane; water flows around suspended parts of a bridge). Fourier transforms not only simplify convolution, but also differentiation, and this is why Fourier transforms are widely used in the field of fluid mechanics, or any field with differential equations for that matter. Sometimes the only way to find an analytic solution to a fluid flow problem is to simplify a partial differential equation with a Fourier transform. In this process we can sometimes rewrite the solution of such a partial differential equation in terms of a convolution of two functions which then allows for very easy interpretation of the solution. This is the case for the diffusion equation in one dimension, and for some two dimensional diffusion processes for functions in cylindrical or spherical polar coordinates.

Diffusion

You can mix two fluids (milk and coffee) by moving the fluid with an outside force (mixing with a spoon) — this is called convection and is usually very fast. But you could also wait and the two fluids would mix themselves on their own (if it is chemically possible) — this is called diffusion and is usually a very slow when compared to convection.

Imagine an aquarium that is split into two by a thin, removable barrier where one side of the aquarium is filled with salt water, and the other side with fresh water. If you now remove the thin barrier carefully, the two fluids will mix together until the whole aquarium has the same concentration of salt everywhere. This process is more “violent” the greater the difference in saltiness between the fresh water and salt water.

Now imagine you have a square aquarium with 256×256 thin barriers that separate 256×256 cubes each with different salt concentration. If you remove the barrier now, there will be little mixing between two cubes with little difference in salt concentration, but rapid mixing between two cubes with very different salt concentrations. Now imagine that the 256×256 grid is an image, the cubes are pixels, and the salt concentration is the intensity of each pixel. Instead of diffusion of salt concentrations we now have diffusion of pixel information.

It turns out, this is exactly one part of the convolution for the diffusion equation solution: One part is simply the initial concentrations of a certain fluid in a certain area — or in image terms — the initial image with its initial pixel intensities. To complete the interpretation of convolution as a diffusion process we need to interpret the second part of the solution to the diffusion equation: The propagator.

Interpreting the propagator

The propagator is a probability density function, which denotes into which direction fluid particles diffuse over time. The problem here is that we do not have a probability function in deep learning, but a convolution kernel — how can we unify these concepts?

We can apply a normalization that turns the convolution kernel into a probability density function. This is just like computing the softmax for output values in a classification tasks. Here the softmax normalization for the edge detector kernel from the first example above.

Softmax of an edge detector: To calculate the softmax normalization, we taking each value of the kernel and apply . After that we divide by the sum of all . Please note that this technique to calculate the softmax will be fine for most convolution kernels, but for more complex data the computation is a bit different to ensure numerical stability (floating point computation is inherently unstable for very large and very small values and you have to carefully navigate around troubles in this case).

Now we have a full interpretation of convolution on images in terms of diffusion. We can imagine the operation of convolution as a two part diffusion process: Firstly, there is strong diffusion where pixel intensities change (from black to white, or from yellow to blue, etc.) and secondly, the diffusion process in an area is regulated by the probability distribution of the convolution kernel. That means that each pixel in the kernel area, diffuses into another position within the kernel according to the kernel probability density.

For the edge detector above almost all information in the surrounding area will concentrate in a single space (this is unnatural for diffusion in fluids, but this interpretation is mathematically correct). For example all pixels that are under the 0.0001 values, will very likely flow into the center pixel and accumulate there. The final concentration will be largest where the largest differences between neighboring pixels are, because here the diffusion process is most marked. In turn, the greatest differences in neighboring pixels is there, where the edges between different objects are, so this explains why the kernel above is an edge detector.

So there we have it: Convolution as diffusion of information. We can apply this interpretation directly on other kernels. Sometimes we have to apply a softmax normalization for interpretation, but generally the numbers in itself say a lot about what will happen. Take the following kernel for example. Can you now interpret what that kernel is doing? Click here to find the solution (there is a link back to this position).

Wait, there is something fishy here

How come that we have deterministic behavior if we have a convolution kernel with probabilities? We have to interpret that single particles diffuse according to the probability distribution of the kernel, according to the propagator, don’t we?

Yes, this is indeed true. However, if you take a tiny piece of fluid, say a tiny drop of water, you still have millions of water molecules in that tiny drop of water, and while a single molecule behaves stochastically according to the probability distribution of the propagator, a whole bunch of molecules have quasi deterministic behavior —this is an important interpretation from statistical mechanics and thus also for diffusion in fluid mechanics. We can interpret the probabilities of the propagator as the average distribution of information or pixel intensities; Thus our interpretation is correct from a viewpoint of fluid mechanics. However, there is also a valid stochastic interpretation for convolution.

Insights from quantum mechanics

The propagator is an important concept in quantum mechanics. In quantum mechanics a particle can be in a superposition where it has two or more properties which usually exclude themselves in our empirical world: For example, in quantum mechanics a particle can be at two places at the same time — that is a single object in two places.

However, when you measure the state of the particle — for example where the particle is right now — it will be either at one place or the other. In other terms, you destroy the superposition state by observation of the particle. The propagator then describes the probability distribution where you can expect the particle to be. So after measurement a particle might be — according to the probability distribution of the propagator — with 30% probability in place A and 70% probability in place B.

If we have entangled particles (spooky action at a distance), a few particles can hold hundreds or even millions of different states at the same time — this is the power promised by quantum computers.

So if we use this interpretation for deep learning, we can think that the pixels in an image are in a superposition state, so that in each image patch, each pixel is in 9 positions at the same time (if our kernel is 3×3). Once we apply the convolution we make a measurement and the superposition of each pixel collapses into a single position as described by the probability distribution of the convolution kernel, or in other words: For each pixel, we choose one pixel of the 9 pixels at random (with the probability of the kernel) and the resulting pixel is the average of all these pixels. For this interpretation to be true, this needs to be a true stochastic process, which means, the same image and the same kernel will generally yield different results. This interpretation does not relate one to one to convolution but it might give you ideas how to the apply convolution in stochastic ways or how to develop quantum algorithms for convolutional nets. A quantum algorithm would be able to calculate all possible combinations described by the kernel with one computation and in linear time/qubits with respect to the size of image and kernel.

Insights from probability theory

Convolution is closely related to cross-correlation. Cross-correlation is an operation which takes a small piece of information (a few seconds of a song) to filter a large piece of information (the whole song) for similarity (similar techniques are used on youtube to automatically tag videos for copyrights infringements).

Relation between cross-correlation and convolution: Here $latex {star}&bg=ffffff$ denotes cross correlation and $latex {f^*}&bg=ffffff$ denotes the complex conjugate of $latex {f}&bg=ffffff$.

" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cross-correlation2.png?fit=300%2C93&ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cross-correlation2.png?fit=1024%2C316&ssl=1" class="wp-image-301 size-full" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2015/03/cross-correlation2.png?resize=700%2C216&ssl=1" alt="" width="700" height="216" data-recalc-dims="1" />

Relation between cross-correlation and convolution: Here denotes cross correlation and denotes the complex conjugate of .

While cross correlation seems unwieldy, there is a trick with which we can easily relate it to convolution in deep learning: For images we can simply turn the search image upside down to perform cross-correlation through convolution. When we perform convolution of an image of a person with an upside image of a face, then the result will be an image with one or multiple bright pixels at the location where the face was matched with the person.

Cross-correlation via convolution: The input and kernel are padded with zeros and the kernel is rotated by 180 degrees. The white spot marks the area with the strongest pixel-wise correlation between image and kernel. Note that the output image is in the spatial domain, the inverse Fourier transform was already applied. Images taken from Steven Smith’s excellent free online book about digital signal processing.

This example also illustrates padding with zeros to stabilize the Fourier transform and this is required in many version of Fourier transforms. There are versions which require different padding schemes: Some implementation warp the kernel around itself and require only padding for the kernel, and yet other implementations perform divide-and-conquer steps and require no padding at all. I will not expand on this; the literature on Fourier transforms is vast and there are many tricks to be learned to make it run better — especially for images.

At lower levels, convolutional nets will not perform cross correlation, because we know that they perform edge detection in the very first convolutional layers. But in later layers, where more abstract features are generated, it is possible that a convolutional net learns to perform cross-correlation by convolution. It is imaginable that the bright pixels from the cross-correlation will be redirected to units which detect faces (the Google brain project has some units in its architecture which are dedicated to faces, cats etc.; maybe cross correlation plays a role here?).

Insights from statistics

What is the difference between statistical models and machine learning models? Statistical models often concentrate on very few variables which can be easily interpreted. Statistical models are built to answer questions: Is drug A better than drug B?

Machine learning models are about predictive performance: Drug A increases successful outcomes by 17.83% with respect to drug B for people with age X, but 22.34% for people with age Y.

Machine learning models are often much more powerful for prediction than statistical models, but they are not reliable. Statistical models are important to reach accurate and reliable conclusions: Even when drug A is 17.83% better than drug B, we do not know if this might be due to chance or not; we need statistical models to determine this.

Two important statistical models for time series data are the weighted moving average and the autoregressive models which can be combined into the ARIMA model (autoregressive integrated moving average model). ARIMA models are rather weak when compared to models like long short-term recurrent neural networks, but ARIMA models are extremely robust when you have low dimensional data (1-5 dimensions). Although their interpretation is often effortful, ARIMA models are not a blackbox like deep learning algorithms and this is a great advantage if you need very reliable models.

It turns out that we can rewrite these models as convolutions and thus we can show that convolutions in deep learning can be interpreted as functions which produce local ARIMA features which are then passed to the next layer. This idea however, does not overlap fully, and so we must be cautious and see when we really can apply this idea.

Here is a constant function which takes the kernel as parameter; white noise is data with mean zero, a standard deviation of one, and each variable is uncorrelated with respect to the other variables.

When we pre-process data we make it often very similar to white noise: We often center it around zero and set the variance/standard deviation to one. Creating uncorrelated variables is less often used because it is computationally intensive, however, conceptually it is straight forward: We reorient the axes along the eigenvectors of the data.

Decorrelation by reorientation along eigenvectors: The eigenvectors of this data are represented by the arrows. If we want to decorrelate the data, we reorient the axes to have the same direction as the eigenvectors. This technique is also used in PCA, where the dimensions with the least variance (shortest eigenvectors) are dropped after reorientation.

Now, if we take to be the bias, then we have an expression that is very similar to a convolution in deep learning. So the outputs from a convolutional layer can be interpreted as outputs from an autoregressive model if we pre-process the data to be white noise.

The interpretation of the weighted moving average is simple: It is just standard convolution on some data (input) with a certain weight (kernel). This interpretation becomes clearer when we look at the Gaussian smoothing kernel at the end of the page. The Gaussian smoothing kernel can be interpreted as a weighted average of the pixels in each pixel’s neighborhood, or in other words, the pixels are averaged in their neighborhood (pixels “blend in”, edges are smoothed).

While a single kernel cannot create both, autoregressive and weighted moving average features, we usually have multiple kernels and in combination all these kernels might contain some features which are like a weighted moving average model and some which are like an autoregressive model.

Conclusion

In this blog post we have seen what convolution is all about and why it is so powerful in deep learning. The interpretation of image patches is easy to understand and easy to compute but it has many conceptual limitations. We developed convolutions by Fourier transforms and saw that Fourier transforms contain a lot of information about orientation of an image. With the powerful convolution theorem we then developed an interpretation of convolution as the diffusion of information across pixels. We then extended the concept of the propagator in the view of quantum mechanics to receive a stochastic interpretation of the usually deterministic process. We showed that cross-correlation is very similar to convolution and that the performance of convolutional nets may depend on the correlation between feature maps which is induced through convolution. Finally, we finished with relating convolution to autoregressive and moving average models.

Personally, I found it very interesting to work on this blog post. I felt for long time that my undergraduate studies in mathematics and statistics were wasted somehow, because they were so unpractical (even though I study applied math). But later — like an emergent property — all these thoughts linked together and practically useful understanding emerged. I think this is a great example why one should be patient and carefully study all university courses — even if they seem useless at first.

Solution to the quiz above: The information diffuses nearly equally among all pixels; and this process will be stronger for neighboring pixels that differ more. This means that sharp edges will be smoothed out and information that is in one pixel, will diffuse and mix slightly with surrounding pixels. This kernel is known as a Gaussian blur or as Gaussian smoothing. Continue reading. Sources: 1 2

Image source reference

R. B. Fisher, K. Koryllos, “Interactive Textbooks; Embedding Image Processing Operator Demonstrations in Text”, Int. J. of Pattern Recognition and Artificial Intelligence, Vol 12, No 8, pp 1095-1123, 1998.

The post Understanding Convolution in Deep Learning appeared first on Tim Dettmers.

How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism

Tim Dettmers — Sun, 09 Nov 2014 19:16:55 +0000

In my last blog post I explained what model and data parallelism is and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.

To recap, model parallelism is, when you split the model among GPUs and use the same data for each model; so each GPU works on a part of the model rather than a part of the data. In deep learning, one approach is to do this by splitting the weights, e.g. a 1000×1000 weight matrix would be split into a 1000×250 matrix if you use four GPUs.

Model parallelism diagram. Synchronizing communication is needed after each dot product with the weight matrix for both forward and backward pass.

One advantage of this approach is immediately apparent: If we split the weights among the GPUs we can have very large neural networks which weights would not fit into the memory of a single GPU. In part I mentioned this in an earlier blog post, where I also said that such large neural networks are largely unnecessary. However, for very big unsupervised learning tasks – which will become quite important in the near future – such large networks will be needed in order to learn fine grained features that could learn “intelligent” behavior.

How does a forward and backward pass work with such split matrices? This is most obvious when we do the matrix algebra step by step:

We start looking at which would be the dot matrix multiply for the usual forward pass case. The dimensions for using model parallelism with two GPUs for a batch size of 128 and a 1000×500 weight matrix would be:

Standard: 128×1000 dot 1000×500 = 128×500

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> add matrices

Split by weight matrix second dimension: 128×1000 dot 1000×250 = 128×250 -> stack matrices

To calculate the errors in the layer below we need to pass the current error through to the next layer, or more mathematically, we calculate the deltas by taking the dot product of the error of the previous layer and the weights that connect to the next layer , i.e. :

Standard: 128×500 dot 500×1000 = 128×1000

Split by weight matrix first dimension: 128×500 dot 500×500 = 128×500 -> stack matrices

Split by weight matrix second dimension: 128×250 dot 250×1000 = 128×1000 -> add matrices

We see here, we need to synchronize (adding or stacking weights) after each dot product and you may think that this is slow when compared to data parallelism, where we synchronize only once. But one can quickly see that this is not so for most cases if we do the math: In data parallelism a 1000×500 gradient needs to be transferred once for the 1000×500 layer – that’s 500000 elements; for model parallelism we just need to transfer a small matrix for each forward and backward pass with a total of 128000 or 160000 elements – that’s nearly 4 times less data! So the network card bandwidth is still the main bottleneck in the whole application, but much less so than in the data parallelism case.

This is of course all relative and depends on the network architecture. Data parallelism will be quite fast for small networks and very slow for large networks, the opposite is true for model parallelism. The more parameters we have, the more beneficial is model parallelism. Its true strength comes to play if you have neural networks where the weights do not fit into a single GPU memory. Here model parallelism might achieve that for which one would need thousands of CPUs.

However, if you run small networks where the GPUs are not saturated and have some free capacity (not all cores are running), then model parallelism will be slow. Unlike data parallelism, there are no tricks you can use to hide the communication needed for synchronization, this is because we have only partial information for the whole batch. With this partial information we cannot compute the activities in the next layer and thus have to wait for the completion of the synchronization to move forward.

How the advantages and disadvantages can be combined is best shown by Alex Krizhevsky who demonstrates the efficiency of using data parallelism in the convolutional layers and model parallelism in the dense layers of a convolutional neural network.

The post How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism appeared first on Tim Dettmers.

How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism

Tim Dettmers — Thu, 09 Oct 2014 14:59:09 +0000

In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers and using MPI in your programming will make things much easier than to use the options available in CUDA itself.

In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and downfalls are for such algorithms. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.

So what are these two? Data parallelism is when you use the same model for every thread, but feed it with different parts of the data; model parallelism is when you use the same data for every thread, but split the model among threads.

For neural networks this means that data parallelism uses the same weights and but different mini-batches in each thread; the gradients need to be synchronized, i.e. averaged, after each pass through a mini-batch.

Model parallelism splits the weights of the net equally among the threads and all threads work on a single mini-batch; here the generated output after each layer needs to be synchronized, i.e. stacked, to provide the input to the next layer.

Each method has its advantages and disadvantages which change from architecture to architecture. Let us look at data parallelism first and its bottlenecks first and in the next post I will look at model parallelism.

Severity of the network bottleneck of data parallelism

The idea of data parallelism is simple. If you have, say, 4 GPUs you split a mini-batch into parts for each of them, say, you split a mini-batch with 128 examples into 32 examples for each GPU. Then you feed the respective batch through the net and obtain gradients for each split of the mini-batch. You then use MPI to collect all the gradients and update the parameters with the overall average.

Data parallelism diagram. There is no communication in the forward pass, and during the backward pass you synchronize gradients.

The biggest problem with this approach is that during the backward pass you have to pass the whole gradient to the all other GPUs. If you have a 1000×1000 weight matrix then you need to pass 4000000 bytes to each network. If we take a 40Gbit/s network card – which is already quite fast – then you will need to pass the data from one node to another node (however, there is some additional overhead that is neglected here). If you have six GPUs in two nodes you need to pass the data to five other GPUs, three of which need to go through the network card (3x 0.75ms), while two can use PCIe 3.0 to pass the data to the other two GPUs (about three times as fast; 2x 0.25ms). However, the PCIe pass is independent of the network card pass, so the time needed is determined by the network card time alone, i.e. 2.25ms. However, only one GPU can transfer data through the network card at any one time in any one node, so that we have to multiply that time by three, i.e. 7.75ms. Now the bottom line is, that we just need about 0.2ms for a matrix multiply through that layer (100×1000 dot 1000×1000) and about twice as much for the backward pass. We can pass the gradient while we work on the next layer, but in the end the network card speed limits our overall computation by quite a bit. This is more marked the larger you scale your system: A four node system working on the same problem needs about 20.25ms to pass the gradients around to the other GPUs. One can easily see that data parallelism does not scale with size of the cluster.

To counter this bottleneck is to reduce the parameters of the gradient through max pooling, maxout units or by simply using convolution. Another way is to increase the computational time/network time ratio by other means, e.g. by using is computationally intensive optimization techniques like RMSProp. You need the same time to pass the gradients to each other, but more time is spend on computation, thus increasing the utility of the fast GPUs.

Another thing you can do when you use computationally intensive optimization techniques is to hide latency of networking under the computation of the gradients. This means while you passing the first gradient to all other nodes, you can already start a big RMSProp computation asynchronously for the next layer. This technique can give a speedup of about 0-20 % depending on network architecture.

But this is not the only problem with data parallelism. There is a very technical bottleneck hidden in the GPU architecture which took me quite a while to understand. To understand why the GPU architecture is a problem we first need to look at the usage and purpose of mini-batches.

A divergence: Why do we use mini-batches?

If we start with randomly initialized parameters or even if we start with pretrained parameters, we do not need a pass through all the data to get an accurate gradient update that will head into the direction of a local minimum. If we take MNIST as an example, if we have a gradient which includes 10 common mistakes that the network does for each class (mini-batch size of about 128), then we will go into a direction that reduces the error greatly already as the gradient captures rough and common mistakes. If we choose a greater batch size (say 512) then we not only capture common errors, but also catch errors that are more subtle. However, it is not very sensible to fine-tune a system if you know it still has major errors. So overall we gain little by increasing the batch size. We need more computation to do roughly the same and this is the main argument why we use a mini-batch size as small as possible. However, if we choose a mini-batch size that is too small, then we do not capture all the common errors which are relevant for the data set and thus our gradient might not head near a local optimum, so there is a limit how small you can make mini-batches.

How does this relate to data parallelism? If we want a mini-batch size of 128 and use data parallelism to divide it among, say, eight GPUs, then each net calculates gradients for 16 samples which is then averages with the data from the other GPUs. And exactly here kicks the hardware bottleneck in.

Memory tiles: Patches of fast GPU memory for efficient dot product calculations

To calculate dot products on the GPU, you need to copy small patches, called memory tiles, into shared memory, i.e. very fast but very small memory (limited to a few kilobytes). The problem is that the standard cuBLAS uses either a 64×128 memory tiles and when you have a batch size less than 64 you waste a lot of precious shared memory. Also if you use a batch size not equal to a multiple of 32 you equally waste shared memory (threads are only started in blocks of 32 threads), so one should use a batch size which is a multiple of 32 or multiple of 64 if possible. For data parallelism this means that you lose significant processing speed once you go below a batch size of 64 for each GPU. If you have many GPUs this can be quite limiting and this is yet another reason why the data parallelism approach does not scale well beyond a certain point.

All in all this sounds quite dire for data parallelism, but data parallelism has its uses. If you know the bottlenecks, you can wield data parallelism as a might tool for certain applications. This is demonstrated by Alex Krishevsky in his paper where he uses data parallelism in the convolutional layers of his net, and thus achieves a speedup of 3.74x by using four GPUs and 6.25x using eight GPUs. His system features two CPUs and 8 GPUs in one node, so he can use the full PCIe speed for the two sets of four GPUs and relatively fast PCIe connection between CPUs to distribute the data among all eight GPUs.

Besides convolutional neural networks, another use of data parallelism might be to use it in recurrent neural networks, which typically have less parameters and highly computationally intensive gradient updates – both are wins for data parallelism.

In my next blog post I will focus on model parallelism, which is efficient for large networks and scales well to larger clusters.

The post How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism appeared first on Tim Dettmers.