Hardware Related Blog Posts

In-depth blog posts with complete analysis of hardware for deep learning. I compare GPUs and other accelerators and explain how they work.

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

2023-01-30 by Tim Dettmers 1,665 Comments

Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.

[Read more…]

A Full Hardware Guide to Deep Learning

2018-12-16 by Tim Dettmers 945 Comments

Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.

[Read more…]

TPUs vs GPUs for Transformers (BERT)

2018-10-17 by Tim Dettmers 26 Comments

On the computational side, there have been confusions about how TPUs and GPUs relate to BERT. BERT base was trained with 4 TPU pods (16 TPU chips) in 4 days and BERT large with 16 TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean that GPUs are dead? There are two fundamental things to understand here: (1) A TPU is a matrix multiplication engine — it does matrix multiplication and matrix operations, but not much else. It is fast at computing matrix multiplication, but one has to understand that (2) the slowest thing in matrix multiplication is to get the elements from the main memory and load it into the processing unit. In other words, the most expensive part in matrix multiplication is memory loads. Note the computational load for BERT should be about 90% for matrix multiplication. From these facts, we can do a small technical analysis on this topic.

Deep Learning Hardware Limbo

2017-12-21 by Tim Dettmers 91 Comments

With the release of the Titan V, we now entered deep learning hardware limbo. It is unclear if NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018 and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near

2015-07-27 by Tim Dettmers 183 Comments

In this blog post I will delve into the brain and explain its basic information processing machinery and compare it to deep learning. I do this by moving step-by-step along with the brains electrochemical and biological information processing pipeline and relating it directly to the architecture of convolutional nets. Thereby we will see that a neuron and a convolutional net are very similar information processing machines. While performing this comparison, I will also discuss the computational complexity of these processes and thus derive an estimate for the brains overall computational power. I will use these estimates, along with knowledge from high performance computing, to show that it is unlikely that there will be a technological singularity in this century.

How to Parallelize Deep Learning on GPUs Part 2/2: Model Parallelism

2014-11-09 by Tim Dettmers 21 Comments

In my last blog post I explained what model and data parallelism is and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.

How to Parallelize Deep Learning on GPUs Part 1/2: Data Parallelism

2014-10-09 by Tim Dettmers 20 Comments

In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers and using MPI in your programming will make things much easier than to use the options available in CUDA itself.

In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and downfalls are for such algorithms. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.

How To Build and Use a Multi GPU System for Deep Learning

2014-09-21 by Tim Dettmers 139 Comments

When I started using GPUs for deep learning my deep learning skills improved quickly. When you can run experiments of algorithms and algorithms with different parameters and gain rapid feedback you can just learn much more quickly. At the beginning, deep learning is a lot of trial and error: You have to get a feel what parameters need to be adjusted, or what puzzle piece is missing in order to get a good result. A GPU helps you to fail quickly and learn important lessons so that you can keep improving. Soon my deep learning skills were sufficient to take the 2^nd place in the Crowdflower competition where the task was to predict weather labels from given tweets (sunny, raining etc.).

After this success I was tempted to use multiple GPUs in order to train deep learning algorithms even faster. I also took interest in learning very large models which do not fit into a single GPU. I thus wanted to build a little GPU cluster and explore the possibilities to speed up deep learning with multiple nodes with multiple GPUs. At the same time I was offered to do contract work as a data base developer through my old employer. This gave me opportunity to get the money to build the GPU cluster I thought of.

Skip links

Main navigation

Hardware Related Blog Posts