<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: TPUs vs GPUs for Transformers (BERT)	</title>
	<atom:link href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:40:19 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65580</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sun, 24 Nov 2019 23:54:18 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-65580</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526&quot;&gt;Jay&lt;/a&gt;.

This is a great question, thank you! The number of tiles is determined by the amount of shared memory you have available. You have between 64kb and 96kb per streaming multiprocessor (SM), but you usually need to reserve half of that memory for double-buffered loads since the memory load latency is pretty high. For very big, modern GPUs, like the Titan RTX, we have about 72 SMs and so a total of 4608 kb of memory for tiles. About half of that is reserved for double buffering, so you have 2304 kb of total memory. You need to distribute this across at least 142 thread blocks to get full compute utilization, which means a maximum size of about 16kb for tiles. Usually, you do not actually want maximum compute utilization since memory movement is the bottleneck, so you run with fewer thread blocks to increase the memory tile size.

Overall, you can thus expect that the memory tile size will span about 16-20 kb between matrices A and B. If you assume the memory is split equally between both matrices, that is about 8192 bytes for the maximum tile size, which is 4096 elements for 16-bit floats. This means that for a large GPU the maximum tile size is about 32x128. You can go a bit larger, but it will quickly eat into computational performance. Note that this calculation is for the largest GPU (Titan RTX). For other GPUs, other tile sizes will provide better performance. Optimal tile size is a difficult concept which often cannot really be determined theoretically. Often the only solution is to write a matrix multiply algorithm for a certain tile size and benchmark it against other tile sizes to find the best one.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526">Jay</a>.</p>
<p>This is a great question, thank you! The number of tiles is determined by the amount of shared memory you have available. You have between 64kb and 96kb per streaming multiprocessor (SM), but you usually need to reserve half of that memory for double-buffered loads since the memory load latency is pretty high. For very big, modern GPUs, like the Titan RTX, we have about 72 SMs and so a total of 4608 kb of memory for tiles. About half of that is reserved for double buffering, so you have 2304 kb of total memory. You need to distribute this across at least 142 thread blocks to get full compute utilization, which means a maximum size of about 16kb for tiles. Usually, you do not actually want maximum compute utilization since memory movement is the bottleneck, so you run with fewer thread blocks to increase the memory tile size.</p>
<p>Overall, you can thus expect that the memory tile size will span about 16-20 kb between matrices A and B. If you assume the memory is split equally between both matrices, that is about 8192 bytes for the maximum tile size, which is 4096 elements for 16-bit floats. This means that for a large GPU the maximum tile size is about 32&#215;128. You can go a bit larger, but it will quickly eat into computational performance. Note that this calculation is for the largest GPU (Titan RTX). For other GPUs, other tile sizes will provide better performance. Optimal tile size is a difficult concept which often cannot really be determined theoretically. Often the only solution is to write a matrix multiply algorithm for a certain tile size and benchmark it against other tile sizes to find the best one.</p>
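<p>As a rough sketch of this back-of-the-envelope arithmetic (using the illustrative figures from this comment, not values queried from actual hardware):</p>

```python
# Rough tile-size arithmetic using the illustrative Titan RTX figures
# quoted above; these are assumptions for the sketch, not hardware queries.
SM_COUNT = 72                    # streaming multiprocessors
SHARED_MEM_PER_SM = 64 * 1024    # bytes of shared memory per SM
MIN_THREAD_BLOCKS = 142          # blocks needed for full compute utilization

total_kb = SM_COUNT * SHARED_MEM_PER_SM / 1024       # 4608 kb in total
usable_kb = total_kb / 2                             # half goes to double buffering
tile_bytes = usable_kb * 1024 / MIN_THREAD_BLOCKS    # per-thread-block tile budget
per_matrix_bytes = tile_bytes / 2                    # split equally between A and B
fp16_elements = per_matrix_bytes / 2                 # 2 bytes per 16-bit element

print(round(tile_bytes / 1024, 1))   # roughly 16 kb per tile
print(round(fp16_elements))          # roughly 4100 elements, close to 32x128 = 4096
```

<p>The last number lands a little above 4096 because 2304 kb does not divide evenly by 142 blocks; the 8192-byte / 32x128 figure above is the rounded-down power-of-two tile that fits in that budget.</p>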
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Jay		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526</link>

		<dc:creator><![CDATA[Jay]]></dc:creator>
		<pubDate>Sat, 23 Nov 2019 15:32:27 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-65526</guid>

					<description><![CDATA[For a GPU we have the same process, but we use smaller tiles with more processors.
Why? Can&#039;t we use more tiles for a GPU, since that would decrease the load and eventually match up with the TPU? Who decides the number of tiles in use?]]></description>
			<content:encoded><![CDATA[<p>For a GPU we have the same process, but we use smaller tiles with more processors.<br />
Why? Can&#8217;t we use more tiles for a GPU, since that would decrease the load and eventually match up with the TPU? Who decides the number of tiles in use?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58962</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 11 Jul 2019 13:38:25 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58962</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884&quot;&gt;Enrique&lt;/a&gt;.

A TPU can only hold one tile of A and B at any time. The tile is evicted once you move on to the next sub-matrix multiplication. As such, you need to reload the tiles of B more often if you want to aggregate within a specific memory location. Aggregating across the entire length of the matrix would entail fewer tile loads for B, but more tile store executions and would be slower — it is faster (2x) to aggregate in the same location for 16 tiles rather than to change location with every tile. This is a major reason why lowering the precision from 32-bit to 16-bit is so effective: it reduces the factor by which you need to reload tiles on a GPU.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884">Enrique</a>.</p>
<p>A TPU can only hold one tile of A and B at any time. The tile is evicted once you move on to the next sub-matrix multiplication. As such, you need to reload the tiles of B more often if you want to aggregate within a specific memory location. Aggregating across the entire length of the matrix would entail fewer tile loads for B, but more tile store executions and would be slower — it is faster (2x) to aggregate in the same location for 16 tiles rather than to change location with every tile. This is a major reason why lowering the precision from 32-bit to 16-bit is so effective: it reduces the factor by which you need to reload tiles on a GPU.</p>
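<p>As a hypothetical illustration of how loop order changes the number of tile loads (the 2x8 grid of A tiles and 8x8 grid of B tiles are made-up sizes for this sketch, not taken from any particular GPU or TPU):</p>

```python
# Count tile loads for two loop orderings over a hypothetical tiling:
# A is a 2x8 grid of tiles, B is an 8x8 grid, so C = A @ B is a 2x8 grid.
M, K, N = 2, 8, 8

# Ordering 1: aggregate each C(i, j) in place; an A tile and a B tile
# are reloaded for every (i, j, k) step.
loads_in_place = 0
for i in range(M):
    for j in range(N):
        for k in range(K):
            loads_in_place += 2          # one A tile + one B tile

# Ordering 2: keep A(i, k) resident and stream the k-th row of B past it.
loads_resident = 0
for i in range(M):
    for k in range(K):
        loads_resident += 1              # load A(i, k) once
        for j in range(N):
            loads_resident += 1          # load B(k, j)

print(loads_in_place, loads_resident)    # 256 vs. 144 tile loads
```

<p>Either way, each load counted here moves half as many bytes with 16-bit elements as with 32-bit ones, which is the reload-factor reduction described above.</p>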
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Enrique		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884</link>

		<dc:creator><![CDATA[Enrique]]></dc:creator>
		<pubDate>Tue, 09 Jul 2019 13:23:41 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58884</guid>

					<description><![CDATA[Why do you need to load 64 tiles of B for each tile of A? In total, there are 16 tiles of A and 64 tiles of B, but not every tile of B has to be multiplied by each tile of A. It would make more sense to load 8 tiles of B for each tile of A:
for i=1:2
    for k=1:8
        Load tile A(i,k)
        for j=1:8
            Load tile B(k,j)]]></description>
			<content:encoded><![CDATA[<p>Why do you need to load 64 tiles of B for each tile of A? In total, there are 16 tiles of A and 64 tiles of B, but not every tile of B has to be multiplied by each tile of A. It would make more sense to load 8 tiles of B for each tile of A:<br />
for i=1:2<br />
    for k=1:8<br />
        Load tile A(i,k)<br />
        for j=1:8<br />
            Load tile B(k,j)</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58049</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 14 Jun 2019 02:33:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58049</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072&quot;&gt;Abe&lt;/a&gt;.

I think you are right — I made a mistake here. I will look into this when I have time and update the numbers.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072">Abe</a>.</p>
<p>I think you are right — I made a mistake here. I will look into this when I have time and update the numbers.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Abe		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072</link>

		<dc:creator><![CDATA[Abe]]></dc:creator>
		<pubDate>Mon, 13 May 2019 14:13:36 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-57072</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152&quot;&gt;Tim Dettmers&lt;/a&gt;.

I think &quot;1 Cloud TPU&quot; = &quot;4 TPU Chips&quot; = &quot;8 TPU cores&quot; = &quot;8 TPU processors&quot;.   Sources:

https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu

https://github.com/google-research/bert/issues/67#issuecomment-436351513

From the BERT paper:
&quot;Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total)&quot;

I&#039;m also not sure why the post here mentions &quot;256 chips&quot;, rather than 64.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152">Tim Dettmers</a>.</p>
<p>I think &#8220;1 Cloud TPU&#8221; = &#8220;4 TPU Chips&#8221; = &#8220;8 TPU cores&#8221; = &#8220;8 TPU processors&#8221;.   Sources:</p>
<p><a href="https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu" rel="nofollow ugc">https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu</a></p>
<p><a href="https://github.com/google-research/bert/issues/67#issuecomment-436351513" rel="nofollow ugc">https://github.com/google-research/bert/issues/67#issuecomment-436351513</a></p>
<p>From the BERT paper:<br />
&#8220;Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total)&#8221;</p>
<p>I&#8217;m also not sure why the post here mentions &#8220;256 chips&#8221;, rather than 64.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55992</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 16 Apr 2019 20:09:49 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-55992</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910&quot;&gt;Chester&lt;/a&gt;.

That is correct, however, the BERT paper is using the following compute resources:
&lt;blockquote&gt;
Training of BERT BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT LARGE was performed on 16 Cloud TPUs (64 TPU chips total).
&lt;/blockquote&gt;]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910">Chester</a>.</p>
<p>That is correct, however, the BERT paper is using the following compute resources:</p>
<blockquote><p>
Training of BERT BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT LARGE was performed on 16 Cloud TPUs (64 TPU chips total).
</p></blockquote>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Chester		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910</link>

		<dc:creator><![CDATA[Chester]]></dc:creator>
		<pubDate>Mon, 15 Apr 2019 11:51:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-55910</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152&quot;&gt;Tim Dettmers&lt;/a&gt;.

Why are 64 GPUs equivalent to 4 TPU pods? I don&#039;t really get it.
Google says, 1 TPU pod = 64 TPU devices = 256 TPU chips = 512 cores.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152">Tim Dettmers</a>.</p>
<p>Why are 64 GPUs equivalent to 4 TPU pods? I don&#8217;t really get it.<br />
Google says, 1 TPU pod = 64 TPU devices = 256 TPU chips = 512 cores.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-52558</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 20 Feb 2019 22:26:08 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-52558</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703&quot;&gt;Haibin&lt;/a&gt;.

Fixed — thank you!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703">Haibin</a>.</p>
<p>Fixed — thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Haibin		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703</link>

		<dc:creator><![CDATA[Haibin]]></dc:creator>
		<pubDate>Wed, 06 Feb 2019 21:03:36 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-51703</guid>

					<description><![CDATA[typo: 7.6r-05 seconds -&#062; 7.6e-05 seconds]]></description>
			<content:encoded><![CDATA[<p>typo: 7.6r-05 seconds -&gt; 7.6e-05 seconds</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
