I recently had a discussion about creativity with a colleague. We were discussing music and how creative many bands and groups are. At the end of our conversation, my friend told me, half-sarcastic-half-serious, how much more creative the people in the music industry are than him and that he just cannot find good ideas in his area of research even though he tried so hard for such a long time. I was a bit surprised because I thought of him as someone very creative. However, it is not uncommon to hear scientists lament about their lack of creativity compared to academic superstars. I think about creativity in academia is a bit distorted and a straight view can help to feel less bad about one’s own creativity.
This blog post is about my work, Sparse Networks from Scratch: Faster Training without Losing Performance, with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.
Deep learning is a field with intense computational requirements and the choice of your GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores? How to make a cost-efficient choice? This blog post will delve into these questions and will lend you advice which will help you to make a choice that is right for you.
[Read more…] about Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.
I studied in depth how to be successful in my PhD applications and it paid off: I got admitted to Stanford, University of Washington, UCL, CMU, and NYU. This blog post is a mish-mash of how to proceed in your PhD applications from A to Z. It discusses what is important and what is not. It discusses application materials like the statement of purpose (SoP) and how to make sense of these application materials.
On the computational side, there have been confusions about how TPUs and GPUs relate to BERT. BERT base was trained with 4 TPU pods (16 TPU chips) in 4 days and BERT large with 16 TPUs (64 TPU chips) in 4 days. Does this mean only Google can train a BERT model? Does this mean that GPUs are dead? There are two fundamental things to understand here: (1) A TPU is a matrix multiplication engine — it does matrix multiplication and matrix operations, but not much else. It is fast at computing matrix multiplication, but one has to understand that (2) the slowest thing in matrix multiplication is to get the elements from the main memory and load it into the processing unit. In other words, the most expensive part in matrix multiplication is memory loads. Note the computational load for BERT should be about 90% for matrix multiplication. From these facts, we can do a small technical analysis on this topic.
With the release of the Titan V, we now entered deep learning hardware limbo. It is unclear if NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018 and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?
This morning I got an email about my blog post discussing the history of deep learning which rattled me back into a time of my academic career which I rather not think about. It was a low point which nearly ended my Master studies at the University of Lugano, and it made me feel so bad about blogging that I took two long years to recover. So what has happened?
This blog post looks at the growth of computation, data, deep learning researcher demographics to show that the field of deep learning could stagnate over slowing growth. We will look at recent deep learning research papers which strike up similar problems but also demonstrate how one could to solve these problems. After discussion of these papers, I conclude with promising research directions which face these challenges head on.
In this blog post I will delve into the brain and explain its basic information processing machinery and compare it to deep learning. I do this by moving step-by-step along with the brains electrochemical and biological information processing pipeline and relating it directly to the architecture of convolutional nets. Thereby we will see that a neuron and a convolutional net are very similar information processing machines. While performing this comparison, I will also discuss the computational complexity of these processes and thus derive an estimate for the brains overall computational power. I will use these estimates, along with knowledge from high performance computing, to show that it is unlikely that there will be a technological singularity in this century.