<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Accelerators Archives &mdash; Tim Dettmers</title>
	<atom:link href="https://timdettmers.com/tag/accelerators/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/tag/accelerators/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:40:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>

<image>
	<url>https://i0.wp.com/timdettmers.com/wp-content/uploads/2025/12/cropped-profile_2026_400x400.png?fit=32%2C32&#038;ssl=1</url>
	<title>Accelerators Archives &mdash; Tim Dettmers</title>
	<link>https://timdettmers.com/tag/accelerators/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">106749684</site>	<item>
		<title>TPUs vs GPUs for Transformers (BERT)</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/</link>
					<comments>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 17 Oct 2018 18:13:03 +0000</pubDate>
				<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Matrix Multiplication]]></category>
		<guid isPermaLink="false">http://timdettmers.com/?p=686</guid>

					<description><![CDATA[<p>On the computational side, there has been confusion about how TPUs and GPUs relate to BERT. BERT base was trained on 4 Cloud TPUs (16 TPU chips) in 4 days and BERT large on 16 Cloud TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/">TPUs vs GPUs for Transformers (BERT)</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>On the computational side, there has been confusion about how TPUs and GPUs relate to <a href="https://arxiv.org/abs/1810.04805">BERT</a>. BERT base was trained on 4 Cloud TPUs (16 TPU chips) in 4 days and BERT large on 16 Cloud TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean that GPUs are dead? There are two fundamental things to understand here: (1) A TPU is a matrix multiplication engine — it does matrix multiplication and matrix operations, but not much else. It is fast at computing matrix multiplication, but (2) the slowest part of matrix multiplication is getting the elements from main memory and loading them into the processing unit. In other words, the most expensive part of matrix multiplication is memory loads. Note that about 90% of BERT's computational load is matrix multiplication. From these facts, we can do a small technical analysis of this topic.</p>
<p><span id="more-686"></span></p>
<h2>Bandwidth Model for TPUs and GPUs</h2>
<h3>Transformers for TPUs</h3>
<p>A common operation in BERT is matrix multiplication A*B=C where A is 256&#215;1024 and B is 1024&#215;1024 in dimension. A TPU computes such a matrix multiplication by splitting the matrix into many smaller 128&#215;128 matrix multiplications. This means we need to load 16 128&#215;128 matrix tiles from matrix A — and due to the nature of matrix multiplication — we need to load 64 tiles from B for every tile in A. This is a total of 16*64=1024 128&#215;128 loads. At 16-bit that is a total of 32 MB of data.</p>
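<p>The tile arithmetic above can be sketched in a few lines of Python (a toy model of the post's counting scheme, not actual TPU code):</p>

```python
# Toy model of the post's tile-counting scheme for C = A @ B on a TPU,
# with A: 256x1024, B: 1024x1024, and 128x128 tiles at 16-bit.
from math import ceil

TILE = 128
a_rows, inner, b_cols = 256, 1024, 1024

tiles_a = ceil(a_rows / TILE) * ceil(inner / TILE)   # 16 tiles cover A
tiles_b = ceil(inner / TILE) * ceil(b_cols / TILE)   # 64 tiles cover B
loads = tiles_a * tiles_b          # the post pairs every A tile with every B tile
bytes_loaded = loads * TILE * TILE * 2               # 2 bytes per 16-bit element

print(loads, bytes_loaded / 2**20)                   # 1024 tile loads, 32.0 MiB
```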
<p><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg"><img data-attachment-id="698" data-permalink="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/cloud-tpu-feature/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?fit=1123%2C620&amp;ssl=1" data-orig-size="1123,620" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Cloud-TPU-Feature" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?fit=300%2C166&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?fit=1024%2C565&amp;ssl=1" class="aligncenter wp-image-698" title="TPU vs GPU" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature-1024x565.jpg?resize=745%2C411" alt="TPU vs GPU" width="745" height="411" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?resize=1024%2C565&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?resize=300%2C166&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?resize=768%2C424&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/Cloud-TPU-Feature.jpg?w=1123&amp;ssl=1 1123w" sizes="(max-width: 745px) 100vw, 745px" data-recalc-dims="1" /></a></p>
<p>Now we make a simplification: We assume that there is no latency if we do two memory loads after each other, which is not too unreasonable since one can often hide memory access latency under thread parallelism. In simple words: While we wait for one 128&#215;128 matrix copy to complete, we already start the next one. This way we only wait for the first memory copy and not for the others. This is a <a href="https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning">core reason why GPUs are fast</a> and why GPUs use many threads, so assuming zero latency for overlapping memory transfers is not too far from the real world. Using this simplification, we can plainly use the memory bandwidth to compute the time needed to load the memory for the matrix multiplication. The TPU has a bandwidth of 600 GB/s, so we need 5.2e-05 seconds to transfer the 32 MB of data.</p>
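<p>Under the zero-latency assumption, the transfer time is just bytes divided by bandwidth. A quick sketch (treating GB/s as GiB/s, which reproduces the 5.2e-05 figure above):</p>

```python
# Time to stream the 32 MiB of tile loads at the TPU's 600 GB/s,
# with GB read as GiB (2**30 bytes), matching the post's 5.2e-05 s figure.
bytes_loaded = 1024 * 128 * 128 * 2   # tile loads from the counting model above
tpu_bandwidth = 600 * 2**30           # bytes per second
t_tpu = bytes_loaded / tpu_bandwidth
print(f"{t_tpu:.1e} s")               # 5.2e-05 s
```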
<h3>Transformers on GPUs</h3>
<p>For a GPU, the process is the same, but we use smaller tiles and more processors. Similarly to the TPU, we use two loads in parallel to hide memory latency. For GPUs, however, the tile size is 96&#215;96 for 16-bit data. If we take a V100 Tesla GPU, then we can run 160 of these tiles in parallel at full bandwidth with low memory latency. Compared to a TPU this means: Instead of 2 matrix units which hold 128&#215;128 matrices, the GPU has 160 units (80 SMs, 160 thread blocks), each of which holds two 96&#215;96 matrices. Again this ensures that we can hide the memory latency through parallelism.</p>
<p><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg"><img data-attachment-id="697" data-permalink="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/attachment/010/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?fit=1024%2C576&amp;ssl=1" data-orig-size="1024,576" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="010" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?fit=300%2C169&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?fit=1024%2C576&amp;ssl=1" class="aligncenter wp-image-697" title="TPU vs GPU" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010-1024x576.jpg?resize=818%2C460" alt="TPU vs GPU" width="818" height="460" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?w=1024&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2018/10/010.jpg?resize=768%2C432&amp;ssl=1 768w" sizes="(max-width: 818px) 100vw, 818px" data-recalc-dims="1" /></a></p>
<p>If we repeat the calculation from above we get the following: For matrix A with 256&#215;1024 we have 33 96&#215;96 tiles; for B with 1024&#215;1024 we have 121 96&#215;96 tiles. In total, we need to do 33*121=3993 loads of size 96&#215;96 for a total of 70 MB. A V100 runs at 900 GB/s, so the memory loads will take 7.6e-05 seconds. Thus our model predicts that a GPU is 32% slower than a TPU for this specific scenario. Note that the matrix tiles stay the same for an RTX 2080 Ti GPU, but the memory bandwidth decreases to 616 GB/s, which means an RTX 2080 Ti is 54% slower than a TPU.</p>
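<p>The full TPU-vs-GPU comparison of the bandwidth model can be sketched as follows. The percentages are computed as 1 - t_tpu/t_gpu, which is how the 32% and 54% figures above appear to be derived (the 2080 Ti value comes out at about 53% before rounding); bandwidths are again treated as GiB/s:</p>

```python
# Bandwidth model for both accelerators: count tile loads for
# C = A @ B (A: 256x1024, B: 1024x1024), then divide bytes by bandwidth.
from math import ceil

def load_time(tile, elem_bytes, bandwidth_gb):
    tiles_a = ceil(256 / tile) * ceil(1024 / tile)
    tiles_b = ceil(1024 / tile) * ceil(1024 / tile)
    n_bytes = tiles_a * tiles_b * tile * tile * elem_bytes
    return n_bytes / (bandwidth_gb * 2**30)

t_tpu  = load_time(128, 2, 600)   # TPU: 128x128 tiles, 600 GB/s
t_v100 = load_time(96, 2, 900)    # V100: 96x96 tiles, 900 GB/s
t_2080 = load_time(96, 2, 616)    # RTX 2080 Ti: 96x96 tiles, 616 GB/s

print(f"V100: {1 - t_tpu / t_v100:.0%} slower than TPU")     # ~32%
print(f"2080 Ti: {1 - t_tpu / t_2080:.0%} slower than TPU")  # ~53%
```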
<p>Note that both TPUs and GPUs with Tensor Cores compute their respective matrix multiplication tiles in one cycle. Thus the computation is about equally fast — the difference lies only in how the memory is loaded.</p>
<h3>BERT Training Time Estimate for GPUs</h3>
<p>Using this data, on a GPU cluster of V100s or RTX 2080 Tis with good networking (InfiniBand, 56+ GBit/s) and good parallelization algorithms (for example, using Microsoft&#8217;s CNTK), we can expect to train BERT large on 64 GPUs (the equivalent of 16 TPUs) or BERT base on 16 GPUs in 5 1/3 days (V100) or 8 1/2 days (RTX 2080 Ti). On an 8-GPU machine with V100s or RTX 2080 Tis and any software and parallelization algorithm (PyTorch, TensorFlow), one can expect to train BERT large in 21 or 34 days and BERT base in 10 2/3 or 17 days. For a standard 4-GPU desktop with RTX 2080 Tis (much cheaper than the other options), one can expect to replicate BERT large in 68 days and BERT base in 34 days.</p>
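<p>The BERT base numbers follow from idealized linear scaling of the 16-GPU estimates. A sketch under the assumption that halving the GPU count doubles the training time (real multi-GPU scaling is sublinear, so treat these as optimistic):</p>

```python
# Idealized linear scaling of the 16-GPU BERT base estimates above:
# halving the GPU count doubles the training time.
def days_for(base_days_on_16, n_gpus):
    return base_days_on_16 * 16 / n_gpus

v100_16, rtx_16 = 16 / 3, 8.5    # 16-GPU estimates in days (V100, 2080 Ti)
for n in (16, 8, 4):
    print(f"{n} GPUs: V100 {days_for(v100_16, n):.1f} d, "
          f"2080 Ti {days_for(rtx_16, n):.1f} d")
```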
<h2>Limitations of the Bandwidth Model</h2>
<p>Note that all models are wrong, but some are useful. I would expect this bandwidth model to be within about 30% of the correct runtime values for TPU vs GPU.</p>
<p>The biggest limitation is that these calculations are for specific matrix sizes. Computational differences can be amplified for certain sizes. For example, if your batch size is 128, there is a slight speedup for GPUs compared to TPUs. If you go below a batch size of 128 you can expect GPUs to be significantly faster; increasing the size of matrix B makes TPUs better and better compared to GPUs, while decreasing it improves the relative performance of GPUs. Note that the BERT paper optimized the sizes of matrices A and B for the TPU: one would not choose these dimensions when training on a GPU. So this comparison might favor TPUs slightly.</p>
<p>Further direct limitations include fused operations. The TPU can calculate additional element-wise operations, such as a non-linear activation function or a bias, on the fly within a matrix multiplication. This means that the TPU does not need to load from slow global memory as often as a GPU. The GPU hardware also supports such operations, but NVIDIA has not implemented them, so GPU users cannot benefit from this. One can expect a slowdown of about 1.6% (loading and storing a 256&#215;1024 matrix) for each element-wise operation on a GPU. For example, if you apply a non-linear function and a bias, the TPU would be about 3.2% faster than the GPU in this scenario.</p>
<h2>The Importance of 32-bit vs 16-bit vs 8-bit</h2>
<p>If we repeat the same calculations from above for 32-bit values (64&#215;64 tiles) we find that TPUs would be 5.3x faster. So the datatype size has a much larger effect than switching from TPU to GPU or vice versa.</p>
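<p>One way to reproduce the 5.3x figure is to compare a 32-bit GPU run (64&#215;64 tiles, 4 bytes per element, 900 GB/s) against the 16-bit TPU baseline; this exact pairing is my assumption, since the text does not spell out the comparison:</p>

```python
# Hypothetical reconstruction of the 5.3x figure: 32-bit GPU loads
# (64x64 tiles, 4 bytes per element, 900 GB/s) vs the 16-bit TPU baseline.
from math import ceil

t_tpu_16 = (1024 * 128 * 128 * 2) / (600 * 2**30)   # 16-bit TPU, as before

tile = 64
loads_32 = (ceil(256 / tile) * ceil(1024 / tile)) * (ceil(1024 / tile) ** 2)
t_gpu_32 = (loads_32 * tile * tile * 4) / (900 * 2**30)
print(f"{t_gpu_32 / t_tpu_16:.1f}x")                # 5.3x
```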
<p>TPUs do not support 8-bit training, but Turing GPUs do. So we can also have a look at how 8-bit matrix multiplication would impact performance. <a href="https://arxiv.org/abs/1511.04561">I published research on 8-bit models</a> and it is not too difficult to train them with 8-bit alone. In fact, the <a href="https://arxiv.org/search/cs?searchtype=author&amp;query=Courbariaux%2C+M">literature</a> on <a href="https://arxiv.org/abs/1602.02830">low-bit</a> computing is <a href="https://dawn.cs.stanford.edu/2018/03/09/low-precision/">quite rich</a>. With 32-bit accumulation, as supported by Turing GPUs, 8-bit training should be even easier. If we can make 8-bit computing work for general models, this would entail huge speedups for transformers. If we repeat the above calculations for 8-bit on GPUs (128&#215;128 tiles) we find that GPUs are 3.0x faster than TPUs. 8-bit computation on an affordable standard machine with 4 RTX 2080 Tis would take about 11 days for BERT base and 22 days for BERT large. All of this makes 16-bit computational ability an important criterion <a href="https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/">if you are looking for a GPU</a> to work with transformers.</p>
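<p>The 3.0x figure can be reproduced by assuming 128&#215;128 8-bit tiles on the GPU at 900 GB/s. Note that 900 GB/s is the V100's bandwidth figure rather than a Turing card's; that choice is my assumption to match the stated number:</p>

```python
# Hypothetical reconstruction of the 3.0x figure: 8-bit GPU matmul with
# 128x128 tiles (same tile counts as the TPU case) at 900 GB/s, vs the
# 16-bit TPU baseline at 600 GB/s. Only bytes/element and bandwidth differ.
LOADS = 1024                               # 128x128 tile loads, as before
t_tpu_16 = (LOADS * 128 * 128 * 2) / (600 * 2**30)
t_gpu_8  = (LOADS * 128 * 128 * 1) / (900 * 2**30)
print(f"{t_tpu_16 / t_gpu_8:.1f}x")        # 3.0x
```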
<h2>Conclusion</h2>
<p>TPUs are about 32% to 54% faster for training BERT-like models. One can expect to replicate BERT base on an 8 GPU machine within about 10 to 17 days. On a standard, affordable GPU machine with 4 GPUs one can expect to train BERT base for about 34 days using 16-bit or about 11 days using 8-bit.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/">TPUs vs GPUs for Transformers (BERT)</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/feed/</wfw:commentRss>
			<slash:comments>26</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">686</post-id>	</item>
		<item>
		<title>Deep Learning Hardware Limbo</title>
		<link>https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/</link>
					<comments>https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/#comments</comments>
		
		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 21 Dec 2017 11:10:23 +0000</pubDate>
				<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Accelerators]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Intel]]></category>
		<guid isPermaLink="false">http://timdettmers.com/?p=627</guid>

					<description><![CDATA[<p>With the release of the Titan V, we have now entered deep learning hardware limbo. It is unclear whether NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018, and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/">Deep Learning Hardware Limbo</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>With the release of the Titan V, we have now entered deep learning hardware limbo. It is unclear whether NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018, and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?</p>
<p><span id="more-627"></span></p>
<p>NVIDIA has decided that it needs to cash in on its monopoly position before the competition emerges. It needs the cash to defend itself over the next 1-2 years. This is reflected in the choice to price the Titan V at $3000. With Tensor Cores, the Titan V has a shiny new deep learning feature, but at the same time its cost/performance ratio is abysmal. This makes the Titan V very unattractive. But because there is no alternative, people will have to eat what they are served – at least for now.</p>
<p>The competition is strong. We have AMD, whose hardware is already better than NVIDIA’s and which plans to get itself together and produce deep learning software that is actually usable. With this step, AMD’s cost/performance ratio will easily outmatch NVIDIA’s cards and AMD will become the new standard. NVIDIA’s cash advantage will help it fight AMD off, so we might see very cheap NVIDIA cards in the future. Note that this will only happen if AMD is able to push forward with good software — if AMD falters, NVIDIA cards will remain expensive and AMD will have lost its opportunity to grab the throne.</p>
<p>There is also a new contender in town: the Neural Network Processor (NNP) from Intel Nervana. With several unique features, it packs quite a punch. These new features make me drool — they are exactly what I want as a CUDA developer. The NNP solves most problems I face when I want to write CUDA kernels optimized for deep learning. This chip is the first true deep learning chip.</p>
<p>In general, for a 1-chip vs 1-chip ranking, we will see Nervana &gt; AMD &gt; NVIDIA, simply because NVIDIA has to serve gaming, deep learning, and high-performance computing at once, while AMD only needs to serve gaming and deep learning, whereas Nervana can concentrate on deep learning alone – a huge advantage. The more focused an architecture’s design, the less junk sits on the chip that is useless for deep learning.</p>
<p>However, the winner is not determined by pure performance, and not even by pure cost/performance. It is determined by cost/performance + community + deep learning frameworks.</p>
<p>Let&#8217;s have a closer look at the individual positions of Nervana, AMD, and NVIDIA to see where they stand.</p>
<h2>Nervana’s Neural Network Processors</h2>
<blockquote class="twitter-tweet" data-lang="en">
<p dir="ltr" lang="en">Why <a href="https://twitter.com/NaveenGRao?ref_src=twsrc%5Etfw">@NaveenGRao</a> is excited about <a href="https://twitter.com/hashtag/Intel?src=hash&amp;ref_src=twsrc%5Etfw">#Intel</a> Nervana NNP: <a href="https://t.co/DAqgOYFtoR">https://t.co/DAqgOYFtoR</a> <a href="https://twitter.com/hashtag/AI?src=hash&amp;ref_src=twsrc%5Etfw">#AI</a> <a href="https://t.co/dbLMdnp63t">pic.twitter.com/dbLMdnp63t</a></p>
<p>— Intel AI (@IntelAI) <a href="https://twitter.com/IntelAI/status/925476243788697600?ref_src=twsrc%5Etfw">October 31, 2017</a></p></blockquote>
<p><script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<p>Nervana’s design is very special, mainly due to its large programmable caches (similar to CUDA shared memory), which are 10 times bigger per chip and 50 times bigger per compute unit than those of GPUs. With these, one will be able to design in-cache algorithms and models. This will speed up inference by at least an order of magnitude, and one will be able to easily train on terabytes of data with small in-cache deep learning models, say, a multi-layer LSTM with 200 units. This makes the chip very attractive for startups and larger companies. Thanks to a special datatype, Flexpoint, one is able to store more data in caches/RAM and compute faster, yielding even more benefits. All of this could mean a speedup of about 10x compared to current NVIDIA GPUs for everybody. But only if the main obstacles can be overcome: community and software.</p>
<p>For normal users and researchers, it will all depend on the community. Without a community, we will not see in-cache algorithms. Without a community, we will not see good software frameworks, and it will be difficult to work with the chip. Everybody wants to use solid deep learning frameworks, and it is questionable whether Neon, Nervana’s deep learning framework, is up to the task. Software comes before hardware. If Nervana only ships pretty chips and does not push the software and community aspects effectively, it will lose out to AMD and NVIDIA.</p>
<p>The community and software question is tightly bound to the price. If the price is too high, and students are not able to afford the NNP then no community can manifest itself around it. You do not get robust communities by just catering for industry. Although industry yields the main income for hardware companies, students are the main driver for the community. So if the price is right and students can afford it, then the community and the software will follow. Anything above $3000 will not work out. Anything above $2000 is critical and one would require special discounts for students to create a robust community. An NNP priced at $2000 will be manageable and find some adoption. Anything below $1500 will make Nervana the market leader for at least 2-3 years. An NNP at $1000 would make it extremely tough for NVIDIA and AMD to compete — software would not even be a question here, it follows automatically.</p>
<p>I personally will switch to NNPs if they are priced below $2500. They are just so much superior to GPUs for deep learning, and I will be able to do things that are simply impossible with NVIDIA hardware. Anything over $2500, however, exceeds my pain point even for good hardware. I save up a lot of money to buy hardware — good hardware is just important to me — but I have to live on something.</p>
<p>For ordinary consumers, not only the price will be important, but also how the community is handled. If we do not see Intel immediately pumping resources into the community to start up solid software machinery, then the NNP is likely to stagnate and die off. Unfortunately, Intel has a long history of mismanaging communities — it would be a shame if this happens, because I really would like to see Nervana succeed.</p>
<p>In summary, Nervana’s NNP will emerge as a clear winner if it is priced below $2000 and if we see strong community and software development within the first few months after its release. With a higher price and less community support, the NNP will be strong, but it might not be able to surpass other solutions in terms of cost/performance and convenience. If the software and community efforts fail, or if the NNP is priced at $4000, it will likely fail. A price above $2000 will require significant student discounts for the NNP to be viable.</p>
<h2>AMD: Cheap and Powerful – If You Can Use It</h2>
<figure id="attachment_630" aria-describedby="caption-attachment-630" style="width: 500px" class="wp-caption aligncenter"><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg"><img data-attachment-id="630" data-permalink="https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/amd_fiji_gpu_package_with_gpu_hbm_memory_and_interposer/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?fit=3872%2C2592&amp;ssl=1" data-orig-size="3872,2592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;5&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;NIKON 1 V1&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;1434546595&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;22.7&quot;,&quot;iso&quot;:&quot;400&quot;,&quot;shutter_speed&quot;:&quot;0.066666666666667&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="AMD_Fiji_GPU_package_with_GPU,_HBM_memory_and_interposer" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?fit=300%2C201&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?fit=1024%2C685&amp;ssl=1" class="wp-image-630" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer-300x201.jpg?resize=500%2C335" alt="" width="500" height="335" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?resize=300%2C201&amp;ssl=1 300w, 
https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?resize=768%2C514&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?resize=1024%2C685&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?w=2000&amp;ssl=1 2000w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/AMD_Fiji_GPU_package_with_GPU_HBM_memory_and_interposer.jpg?w=3000&amp;ssl=1 3000w" sizes="(max-width: 500px) 100vw, 500px" data-recalc-dims="1" /></a><figcaption id="caption-attachment-630" class="wp-caption-text">Source: <a href="https://commons.wikimedia.org/wiki/File:AMD_Fiji_GPU_package_with_GPU,_HBM_memory_and_interposer.jpg">Wikipedia</a></figcaption></figure>
<p>AMD’s cards are incredible. The Vega Frontier Edition series clearly outmatches its NVIDIA counterparts, and, from <a href="https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-deep-learning-nvidia-p100-vs-v100-gpu/">unbiased benchmarks</a> of Volta vs Pascal, it seems that the Vega Frontier will be on a par with or better than a Titan V if it is liquid cooled. Note that the Vega is based on an old architecture, while the Titan V is brand new. The new AMD architecture, which will be released in 2018Q3, will increase performance further still.</p>
<p>AMD hopes to advance deep learning hardware by just switching from 32-bit floats to 16-bit floats. This is a very simple and powerful strategy. The chips will not be useful for high-performance computing, but they will be solid for gamers and the deep learning community while development costs will be low because 16-bit float computation is straightforward.</p>
<p>They will not be able to compete in terms of performance with Nervana’s NNP, but the cost/performance might outmatch everything on the market. You can get a liquid cooled Vega Frontier for $700 which might be just a little worse than a $3000 Titan V.</p>
<p>The problem is software. Even if you have this powerful AMD GPU, you will hardly be able to use it – no major framework supports AMD GPUs well enough.</p>
<p>AMD is in limbo itself – in software limbo. It seems they want to abandon OpenCL for <a href="https://github.com/ROCm-Developer-Tools/HIP">HIP</a>, but currently they still officially push and support the OpenCL path. If they push through with HIP, and if they put good deep learning software on the market (not only libraries for convolution and matrix multiplication, but full deep learning frameworks, say, HIP support for PyTorch) in the next 9 months, then the release of their new GPU in 2018Q3 has the potential to demolish all competitors.</p>
<p>So in summary, if AMD gets its shit together in terms of software, it might become the dominating deep learning hardware solution.</p>
<h2>NVIDIA: The Titan</h2>
<p><a href="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg"><img data-attachment-id="632" data-permalink="https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/titan_v/" data-orig-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?fit=1920%2C1080&amp;ssl=1" data-orig-size="1920,1080" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Titan_V" data-image-description="" data-image-caption="" data-medium-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?fit=300%2C169&amp;ssl=1" data-large-file="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?fit=1024%2C576&amp;ssl=1" class="aligncenter wp-image-632" src="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V-1024x576.jpg?resize=500%2C281" alt="" width="500" height="281" srcset="https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?resize=1024%2C576&amp;ssl=1 1024w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?resize=768%2C432&amp;ssl=1 768w, https://i0.wp.com/timdettmers.com/wp-content/uploads/2017/12/Titan_V.jpg?w=1920&amp;ssl=1 1920w" sizes="(max-width: 500px) 100vw, 500px" data-recalc-dims="1" /></a></p>
<p>NVIDIA’s position is solid. They have the best software, the best tools, their hardware is good and the community is large, strong and well integrated.</p>
<p>NVIDIA’s main issue is that it has to serve multiple communities: high-performance computing people, deep learning people, and gamers. This is a huge strain on its hardware, because it is expensive to design chips custom made for each of these communities, and NVIDIA’s strategy so far has been to design a one-size-fits-all architecture. This worked until it didn’t. The Titan V is just mediocre all-around.</p>
<p>With competitors emerging, NVIDIA has two choices: (1) push the price of its cards down until it starves the competition to death, or (2) develop specialized deep learning hardware of its own. NVIDIA has the resources for the first strategy and the expertise for the second. A new design, however, will take some time, and NVIDIA might lose the throne to another company in the meantime. So we might see both strategies played out at once: starving competitors to stay competitive until its own deep learning chip hits the market.</p>
<p>In summary, NVIDIA’s throne is threatened, but it has the resources and the expertise to fight off emerging players. We will probably see cheaper NVIDIA cards in the future and chips which are more specialized for deep learning. If NVIDIA does not lower its prices, it might (temporarily) pass the throne to another player.</p>
<h2>Conclusion</h2>
<p>Deep learning hardware limbo means that it makes no sense to invest in deep learning hardware right now, but it also means we will have cheaper NVIDIA cards, usable AMD cards, and ultra-fast Nervana cards quite soon. It is an exciting time and we consumers will profit from this immensely. But for now, we have to be patient. We have to wait. I will keep you updated as the situation changes.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/">Deep Learning Hardware Limbo</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://timdettmers.com/2017/12/21/deep-learning-hardware-limbo/feed/</wfw:commentRss>
			<slash:comments>91</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">627</post-id>	</item>
	</channel>
</rss>
