On the computational side, there have been confusions about how TPUs and GPUs relate to BERT. BERT base was trained with 4 TPU pods (16 TPU chips) in 4 days and BERT large with 16 TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean that GPUs are dead? There are two fundamental things to understand here: (1) A TPU is a matrix multiplication engine — it does matrix multiplication and matrix operations, but not much else. It is fast at computing matrix multiplication, but one has to understand that (2) the slowest thing in matrix multiplication is to get the elements from the main memory and load it into the processing unit. In other words, the most expensive part in matrix multiplication is memory loads. Note the computational load for BERT should be about 90% for matrix multiplication. From these facts, we can do a small technical analysis on this topic.

## Bandwidth Model for TPUs and GPUs

### Transformers for TPUs

A common operation in BERT is matrix multiplication A*B=C where A is 256×1024 and B is 1024×1024 in dimension. A TPU computes such a matrix multiplication by splitting the matrix into many smaller 128×128 matrix multiplications. This means we need to load 16 128×128 matrix tiles from matrix A — and due to the nature of matrix multiplication — we need to load 64 tiles from B for every tile in A. This is a total of 16*64=1024 128×128 loads. At 16-bit that is a total of 32 MB of data.

Now we make a simplification: We assume that there is no latency if we do two memory loads after each other, which is actually not too unreasonable since often you can hide memory access latency under thread parallelism. In simple words, this means: While we wait for one 128×128 matrix copy to complete, we already do the next one. In doing it this way, we only wait for the first memory copy and we do not wait for other copies. This is a core reason why GPUs are fast and why we use many threads in GPUs thus 0 latency for overlapping memory transfers is not too far off from the real world. Using this simplification, we can now plainly use the memory bandwidth to compute the time needed to load the memory for the matrix multiplication. If we look at the bandwidth of the TPU we find that we have 600 GB/s, so we need 5.2e-05 seconds to transfer the 32 MB of data.

### Transformers on GPUs

For a GPU we have the same process, but we use smaller tiles with more processors. Similarly to the TPU, we use two loads in parallel to hide memory latency. For GPUs, however, we would have a tile size of 96×96 for 16-bit data. If we take a V100 Tesla GPU, then we can run 160 of these in parallel at full bandwidth with low memory latency. What this means compared to a TPU: Instead of 2 matrix units which can hold 128×128 matrices, the GPU has 160 units (80 SMs, 160 thread blocks, each thread block has two 96×96 matrices) which hold two 96×96 matrices. Again this ensures that we can hide the memory latency through parallelism.

If we repeat the calculation from the top we receive the following: For matrix A with 256×1024 we have 33 96×96 tiles; for B with 1024×1024 we have 121 96×96 tiles. In total, we need to do 33*121=3993 loads of size 96×96 for a total of 70 MB. A V100 runs at 900 GB/s and so the memory loads will take 7.6e-05 seconds. Thus our model predicts that a GPU is 32% slower than a TPU for this specific scenario. Note that matrix tiles stay the same for an RTX 2080 Ti GPU, but the memory bandwidth decreases to 616 GB/s. Which means an RTX 2080 Ti is 54% slower than a TPU.

Note that both TPU and GPUs with Tensor Cores compute the respective matrix multiplication tile in one cycle. Thus the computation is about equally fast — the difference is only in how the memory is loaded.

### BERT Training Time Estimate for GPUs

Using this data, a GPU cluster of V100s/RTX 2080 Tis with good networking (Infiniband +56GBits/s) and good parallelization algorithms (for example using Microsoft’s CNTK) we can expect to train BERT large on 64 GPUs (the equivalent to 16 TPUs) or BERT base on 16 GPUs in 5 1/3 days or 8 1/2 days. On an 8 GPU machine for V100/RTX 2080 Tis with any software and any parallelization algorithm (PyTorch, TensorFlow) one can expect to train BERT large in 21 days or 34 days and BERT base in 10 2/3 or 17 days. For a standard 4 GPU desktop with RTX 2080 Ti (much cheaper than other options), one can expect to replicate BERT large in 68 days and BERT base in 34 days.

## Limitations of the Bandwidth Model

Note that all models are wrong, but some are useful. I would expect that this bandwidth model is in about 30% of the correct runtime values for TPU vs GPU.

The biggest limitation is that these calculations are for specific matrices sizes. Computational differences can be amplified for certain sizes. For example, if your batch-size is 128, there is a slight speedup for GPUs compared to TPUs. If you go below a batch size of 128 you can expect GPUs to be significantly faster; increasing the matrix B further makes TPUs better and better compared to GPUs. Decreasing the size of matrix B will make the performance of GPUs better. Note that the BERT paper optimized matrix A and B sizes for the TPU — one would not choose these dimensions if you train with a GPU. So this comparison might favor TPUs slightly.

Further direct limitations include fused operations. The TPU can calculate additional element-wise operations such as a non-linear activation function or a bias on the fly within a matrix multiplication. This means that the TPU does not need to load from slow global memory as often as a GPU. The GPU also supports these operations but NVIDIA has not implemented them and thus GPU users will not be able to benefit from this. Thus one can expect a slowdown of about 1.6% (loading and storing a 256×1024 matrix) for each element-wise operation for a GPU. For example, if you apply a non-linear function and a bias, then the TPU would be about 3.2% faster compared to GPUs in this scenario.

## The Importance of 32-bit vs 16-bit vs 8-bit

If we repeat the same calculations from above for 32-bit values (64x64x tiles) we find that TPUs would be 5.3x faster. So the datatype size has a much larger effect than switching from TPU to GPU and vice versa.

TPUs do not support 8-bit training, but Turing GPUs do. So we can also have a look at how 8-bit matrix multiplication would impact performance. I published research on 8-bit models and it is not too difficult to train them with 8-bit alone. In fact, the literature on low-bit computing is quite rich. With 32-bit accumulation as supported by Turing GPUs 8-bit training should be even easier. If we can make 8-bit computing work for general models this would entail huge speedups for transformers. If we repeat the above calculations for 8-bit for GPUs (128×128 tile) we find that GPUs are 3.0x faster than TPUs. 8-bit computation on an affordable standard machine with 4 RTX 2080 Ti would take about 11 days for BERT base and 22 days for BERT large. All of this makes 16-bit computational ability for a GPU and important criterion if you are looking for a GPU to work with transformers.

## Conclusion

TPUs are about 32% to 54% faster for training BERT-like models. One can expect to replicate BERT base on an 8 GPU machine within about 10 to 17 days. On a standard, affordable GPU machine with 4 GPUs one can expect to train BERT base for about 34 days using 16-bit or about 11 days using 8-bit.

Michael says

Tim, thanks for your post. A couple of questions:

1. You assume there’s no cache on GPU or TPU, because if there’s cache, then we wouldn’t necessarily need to load 64 tiles of B from main memory for each tile from A. Correct? This becomes especially relevant if A is input which changes for each computation, while B is weights which stay the same for many computations (for example, during batch training).

2. GPU or TPU complete single tile multiplication in one cycle. How many cycles are needed to perform the entire A*B multiplication? In other words, how many tiles can be computed in parallel by TPU/GPU? How does time to compute A*B compares to data transfer time (i.e. 5.3e-5s)?

Tim Dettmers says

1. Yes, my calculations assume that the tiles are loaded into shared memory which essentially programmable L1 cache. Normal L1 cache does play a role when you aggregate the values for matrix C, but not so much when you load from matrix A and B.

2. Theoretically, a TPU can compute one 128×128 matrix multiplication per clock per MXU. There are two MXU on a TPU chip and thus two 128×128 matrix multiply per clock. A GPU can compute 8 4×4 per SM per clock or one 16×16 matrix multiplication per two clocks per SM. There are 80 SMs on a Tesla V100 so you can compute 40 16×16 matrix multiplications per clock which is the equivalent of about one 96×96 matrix multiplication per clock. So TPUs have more than 2x higher theoretical throughput. Note that these numbers are theoretical and can never be reached for practical programs.

Here a simple example to demonstrate that theoretical compute does not matter: A single L1 memory access costs about 5-6 clocks. A global memory access costs about 500 clocks. So loading a 96×96 tile from GPU RAM costs 1000 clocks (144 copies of 128 bytes per SM or 2 sequential loads) to copy to shared memory, it costs 5 clocks to access the memory and write it to registers, and it costs 1 clock to do the matrix multiplication. In this simple example, the matrix multiplication computation used up 0.05% of computational resources. So the Tensor Cores utilization was 0.05% in this case. It will be even lower for a TPU. This improves for larger matrices, but usually, it is never higher than a few percents. This example demonstrates that the theoretical compute performance does not matter in practice because you cannot reach it — especially for matrix multiplication. For convolution, this is a bit different, but the differences in theoretical compute performance are not so different between TPUs and GPUs in the first place.

basil thomas says

Interesting assessment but I think the real world performance of gpus may be much worse as you may have access to lots of VRAM but very small access to L1 cache. If the TPU has a huge cache near L1 speeds then the google tpu will still be an order of magnitude faster than a bunch of 2080Ti cards.

I think you should look at the tensor cores primitives code as I think it is geared for warp wide processing from the get go and unless you are doing a huge amount of matrix multiplication in a row, i think the speedup is nowhere near the 12x speedup that Nvidia touts. The DLSS denoising algo is a good one for the tensor cores to apply and attain a more impressive speedup. Most of the benchmarks I have seen so far for for tensor cores generally attain a speedup of 2-4x which is not quite as efficient as it could be.

I hope Nvidia continues to speed up the main SM cores with more parallel processing as much as they add more custom hardware like the tensor/ray tracing cores

Tim Dettmers says

One of the major advantages of GPUs is that they have very large amounts of L1 cache or shared memory. A Tesla V100 has 7.5 MB of L1 cache / shared memory. As I understand it, the TPU does not have a cache (it does not need one). So this is not an issue in my comparison.

I agree that the 12x speedup of NVIDIA is unrealistic. It’s a biased comparison. As I noted in the blog post, Tensor Cores do not make a big difference here. The true winner is 16-bit compute units on the RTX 20 series which was previously only available on too expensive Volta cards.

Michael says

V100 has 6,144 KB of L2 cache. This means that, at least in theory, when multiplying two matrices [256, 1024] and [1024, 1024] at FP16, both can fit in cache. Which means we only need to transfer 2.5MB of data from main memory to L2 cache, and then 32MB of data from L2 to registers (and perhaps L1/shared cache makes this even more efficient, but I don’t know the details).

By the way, correction: it takes 4-8 cycles to perform FMA instructions, depending on precision, 28 cycles to access L1 cache, and 193 cycles to access L2 [1].

Also, it seems like tensor cores operate on 16×16 matrix chunks with higher throughput than regular FP16 FMA instructions. Tensor cores performance is probably more relevant for TPU comparison.

Finally, I’m curious how do you explain results in Figure 4.8 in [1]. Tensor cores seems to be 90% utilized even for huge matrix sizes.

[1] https://arxiv.org/abs/1804.06826

Tim Dettmers says

Indeed, I got the cycles wrong — thanks for the correction! I thought memory access was 1/5/120/600 but the paper shows for Volta it is 4/19/193/600(?). Note however that L2 access is still not much faster than global memory access. So L2 access will not help much for computation.

It is a neat idea to do everything in L2 cache, but this is very difficult in practice. SMs work independently and you have no control over memory layout. If one SM lags behind your memory might be gone. If a register gets spilled out of L1 cache your memory might be gone.

This is really what shared memory is for: It is programmable, so you can put memory there and can be sure that it will remain intact. For L1 and L2 cache, this is not guaranteed.

Tensor Cores are can be utilized well if you use very large matrices that is right. My example uses very small matrices, and there of course Tensor Cores do badly. The example that you give is for a 4096×4096 times 4096×4096 matrix multiply. Utilization of 1024×1024 times 1024×1024 matrix multiplication is around 30%. This is 4 times larger than the matrix multiply that I analyze here. The graph shows Tensor Core utilization of about 10% for matrix multiplication which is comparable to what happens in BERT. Utilizations between 5-15% is what I meant in my last comment, which is sort of comparable what you can do without Tensor Cores. The 16-bit are really more important than Tensor Cores here.

basil thomas says

The huge advantage of the TPU architecture is the ability to execute way more math instructions without using any threads and the hardware to support threads. The TPU code is a very minimalist architecture that is dedicated to executing FMA instructions as fast as possible based directly on the number of ALU’s it can use per cycle. The Nvidia gpu depends on threading with the warp engine to execute as many matrix fma instructions with dedicated tensor core alu’s which are incredibility smaller in number as compared to the number of TPU alu’s. Like i said before, I expect the performance of a TPU to be at least an order of magnitude better than Volta/Turing tensor cores. When, not if, each Nvidia sm can do what the tensor cores can do, only then would performance be comparable. By then I expect TPU to be able to execute millions fma instructions per cycle especially if they use 7nm fabricated chips. I think the tensor cores or more specifically fma capability should be built directly into each sm alu instead having seperate alu’s dedicated to tensor cores. They will never be able to match AI performance of the TPU if google continues their minimalist design of directly using a huge number of alu’s per clock cycle without any hardware thread support.

Conversely, Nvidia gpu have a better advantage in that they are easily programmable in CUDA for a much larger number and range of algorithms .

Tim Dettmers says

See my reply to Michael’s comment on an analysis of TPU vs Tensor Cores on this matter.

basil thomas says

yes, I agree with your comment in general but the TPU overall has a much more simply dedicated architecture that Nvidia will be very hard to match unless the fma tensor core matrix alu’s are directly embedded into all the sm cores. The sm cores currently have 2 alu per core with only simultaneous access between an integer op and float op. The nvidia arch basically turns all the the sm’s into a variable length vector engine via it’s SIMT architecture. This very good for generally compute but not optimized for general purpose fma matrix math. Will be very interesting how Nvidia proceeds with their architecture as it is still basically geared to processing general instruction per thread without going to VLIW type instructiuons.

Bryan says

The BERT paper says 64 TPU chips and your post says 256 chips. What accounts for that difference?

Tim Dettmers says

One TPU has 4 TPU chips. One TPU chip is the equivalent of one GPU.

Juyong says

Hi. In the BERT paper, it says that 16 Cloud TPU Pods have 64 TPU chips total, so you don’t have to multiply additional 4 for compute the number of chips.

Tim Dettmers says

Each TPUv2 has 4 chips that are equivalent to 4 GPU chips. See this blog post for more info: https://blog.riseml.com/benchmarking-googles-new-tpuv2-121c03b71384

Michael says

I just looked at the followup to the article you linked to: https://blog.riseml.com/comparing-google-tpuv2-against-nvidia-v100-on-resnet-50-c2bbb6a51e5e where they tested the performance of Resnet-50 on Imagenet, and the result is that V100 is about the same speed at batch size 1024 (optimal for TPUs), and better at larger batch sizes.

This might seem surprising at first, but starts to make sense taking into account power consumption.

Chester says

Why 64 GPUs is equivalent to 4 TPU pods? I don’t really get it.

Google says, 1 TPU pod = 64 TPU devices = 256 TPU chips = 512 cores.

Tim Dettmers says

That is correct, however, the BERT paper is using the following compute resources:

Abe says

I think “1 Cloud TPU” = “4 TPU Chips” = “8 TPU cores” = “8 TPU processors”. Sources:

https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu

https://github.com/google-research/bert/issues/67#issuecomment-436351513

From the BERT paper:

“Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total)”

I’m also not sure why the post here mentions “256 chips”, rather than 64.

Tim Dettmers says

I think you are right — I made a mistake here. I will look into this when I have time and update the numbers.

Otto Fazzl says

Thank you for a great article!

I have a question: when you say that large matrices are split into small pieces and multiplied, is that similar to the Strassen algorithm? Where can I learn more about how GPUs handle matrix multiplication that does not get too technical?

Tim Dettmers says

Usually, you do not use the Strassen algorithm. It is too difficult to optimize loads of tiles already and doing that well often does much better than a naively implemented Strassen algorithm. A good start for learning about matrix multiplication algorithms is first to read the two example kernels in the CUDA programming documentation. After that, you can have a look at Scott Gray’s fast matrix multiplication algorithms: https://github.com/NervanaSystems/maxas/wiki/SGEMM. Scott Gray usually writes the fastest matrix multiplication algorithms and improved on his old algorithm further. These algorithms are also used by NVIDIA in their library.

Haibin says

typo: 7.6r-05 seconds -> 7.6e-05 seconds

Tim Dettmers says

Fixed — thank you!

Enrique says

Why do you need to load 64 tiles of B for each tile of A? In total, there are 16 tiles of A and 64 tiles of B, but not every tile of B has to be multiplied by each tile of A. It would make more sense to load 8 tiles of B for each tile of A:

for i=1:2

for k=1:8

Load tile A(i,k)

for j=1:8

Load tile B(k,j)

Tim Dettmers says

A TPU can only hold one tile of A and B at any time. The tile is evicted once you move on to the next sub-matrix multiplication. As such, you need to reload the tiles of B more often if you want to aggregate within a specific memory location. Aggregating across the entire length of the matrix would entail less tile loads for B, but more tile store executions and would be slower — it is faster (2x) to aggregate in the same location for 16 tiles rather than to change location with every tile. This is a major reason why lowing the bits from 32-bit to 16-bit is so effective because it reduces the factor by which you need to reload tiles on a GPU.

Jay says

For a GPU we have the same process, but we use smaller tiles with more processors.

Why? Can’t we use more tiles for GPU? Since it will decrease the load and eventually can match up with TPU? Who decides the number of tiles in use?

Tim Dettmers says

This is a great question, thank you! The number of tiles is determined by the amount of shared memory you have available. You have been 64kb to 96kb per streaming multiprocessor (SM), but of that memory you usually need to reserve half of it to do double buffered loads since the memory load latency is pretty high. For very big, modern GPUs, like Titan RTX, we have about 72 SM and so a total of 4608 kb of memory for tiles. About half of that is reserved for double buffering. So you have 2304 kb of total memory. You need to distribute this unto at least 142 thread blocks to get full compute utilization, that means a maximum size of about 16kb for tiles. Usually, you do not want actually maximum compute utilization since memory movement is the bottleneck, so you run with less thread blocks to increase the memory tile size.

Overall, you can thus expect that the memory tile size will span about 16-20 kb for between matrices A and B. If you assume the memory is split between both matrices equally that is about 8192 bytes for the maximum title size which is 4096 elements for 16-bit floats. This means, for a large GPU the maximum tile size is about 32×128. You can go a bit larger, but it will quickly eat into computational performance. Note, that this calculation is the for largest GPU (Titan RTX). For other GPUs, other tile sizes will provide better performance. Optimal tile size is a difficult concept which often cannot be really theoretically determined. Often the only solution is to write a matrix multiply algorithm for a certain tile size and benchmark it against other tile sizes to get the best tile size.