Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. But what features are important if you want to buy a new GPU? GPU RAM, cores, tensor cores, caches? How to make a cost-efficient choice? This blog post will delve into these questions, tackle common misconceptions, give you an intuitive understanding of how to think about GPUs, and will lend you advice, which will help you to make a choice that is right for you.[Read more…] about Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.
From that, I learned that quantization research is like printers. Nobody cares about printers. Nobody likes printers. But everybody is happy if printers do their job.[Read more…] about LLM.int8() and Emergent Features
This blog post is about my work, Sparse Networks from Scratch: Faster Training without Losing Performance, with Luke Zettlemoyer on fast training of neural networks which we keep sparse throughout training. We show that by developing an algorithm, sparse momentum, we can initialize a neural network with sparse random weights and train it to dense performance levels — all while doing just a single training run. Furthermore, If we use optimized sparse convolution algorithms, we can speed up training between 3.5x for VGG to 12x for Wide Residual Networks. This stands in stark contrast to computationally expensive methods which require repetitive prune-and-retrain cycles as used by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) and other work. Thus we show that training sparse networks to dense performance levels does not require “winning the initialization lottery” but can be done reliably from random weights if combined with a method that moves weights around the network in a smart way. We call the paradigm that maintains sparsity throughout training while maintaining dense performance levels sparse learning. While this work shows that sparse learning is possible, future work holds the promise to train larger and deep networks on more data while requiring the same or less computational resources as current dense networks.
In this blog post I will delve into the brain and explain its basic information processing machinery and compare it to deep learning. I do this by moving step-by-step along with the brains electrochemical and biological information processing pipeline and relating it directly to the architecture of convolutional nets. Thereby we will see that a neuron and a convolutional net are very similar information processing machines. While performing this comparison, I will also discuss the computational complexity of these processes and thus derive an estimate for the brains overall computational power. I will use these estimates, along with knowledge from high performance computing, to show that it is unlikely that there will be a technological singularity in this century.
Convolution is probably the most important concept in deep learning right now. It was convolution and convolutional nets that catapulted deep learning to the forefront of almost any machine learning task there is. But what makes convolution so powerful? How does it work? In this blog post I will explain convolution and relate it to other concepts that will help you to understand convolution thoroughly.
In my last blog post I explained what model and data parallelism is and analysed how to use data parallelism effectively in deep learning. In this blog post I will focus on model parallelism.
In my last blog post I showed what to look out for when you build a GPU cluster. Most importantly, you want a fast network connection between your servers and using MPI in your programming will make things much easier than to use the options available in CUDA itself.
In this blog post I explain how to utilize such a cluster to parallelize neural networks in different ways and what the advantages and downfalls are for such algorithms. The two different algorithms are data and model parallelism. In this blog entry I will focus on data parallelism.